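This post lists the latest papers fetched from arXiv.org on 2025-04-29. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive it by email on a schedule, leave your email address in the comments.

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.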

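Overview (2025-04-29)

775 papers were updated today, including:

  • Natural language processing: 82 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 200 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 165 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 206 papers (Machine Learning (cs.LG))

Natural Language Processing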

[NLP-0] AutoJudge: Judge Decoding Without Manual Annotation

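[Quick read]: This paper addresses the inefficiency of large language model (LLM) inference, in particular the computational overhead of requiring every generated token to match the target model exactly. The key to the solution is a task-specific lossy speculative decoding method that identifies the tokens that affect downstream response quality and relaxes the exactness requirement for "unimportant" tokens, which speeds up generation. The method relies on a semi-greedy search algorithm to decide which mismatches between the target and draft models must be corrected to preserve quality and which can be skipped. It then uses a lightweight classifier built on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without affecting the quality of the final answer.

Link: https://arxiv.org/abs/2504.20039
Authors: Roman Garipov, Fedor Velikonivtsev, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
Affiliations: HSE University; Yandex; ITMO University; IST Austria; Together AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint, Work in progress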

Abstract:We introduce AutoJudge, a framework that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the generated response, relaxing the guarantee so that the “unimportant” tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft model should be corrected to preserve quality, and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We test our approach with Llama 3.2 1B (draft) and Llama 3.1 8B (target) models on zero-shot GSM8K reasoning, where it achieves up to 1.5x more accepted tokens per verification cycle with under 1% degradation in answer accuracy compared to standard speculative decoding and over 2x with small loss in accuracy. When applied to the LiveCodeBench benchmark, our approach automatically detects other, programming-specific important tokens and shows similar speedups, demonstrating its ability to generalize across tasks.
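The verification step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: decoding is greedy, `target_tokens[i]` is the target model's choice given the draft prefix up to position `i`, and `is_unimportant` is a hypothetical stand-in for the learned classifier.

```python
from typing import Callable, List


def verify_draft(
    draft_tokens: List[str],
    target_tokens: List[str],
    is_unimportant: Callable[[int, str, str], bool],
) -> List[str]:
    """Return the tokens kept from one verification cycle.

    is_unimportant(pos, draft_tok, target_tok) is a hypothetical classifier
    that says whether a mismatch can be accepted without hurting the answer.
    """
    kept: List[str] = []
    for pos, (draft_tok, target_tok) in enumerate(zip(draft_tokens, target_tokens)):
        if draft_tok == target_tok:
            kept.append(draft_tok)   # exact match: always accepted
        elif is_unimportant(pos, draft_tok, target_tok):
            kept.append(draft_tok)   # lossy acceptance of a mismatch
        else:
            kept.append(target_tok)  # important mismatch: correct it and stop
            break
    return kept


draft = ["The", "answer", "is", "roughly", "42"]
target = ["The", "answer", "is", "about", "41"]

# Standard (lossless) behaviour: no mismatch is ever accepted.
strict = verify_draft(draft, target, lambda pos, d, t: False)

# Toy rule (an assumption): only mismatches involving digits are important.
relaxed = verify_draft(draft, target, lambda pos, d, t: not (d.isdigit() or t.isdigit()))
```

In this toy run the strict rule stops at the first mismatch, while the relaxed rule keeps the draft's "roughly" and only corrects the numeric token, so more draft tokens survive each cycle.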

[NLP-1] Better To Ask in English? Evaluating Factual Accuracy of Multilingual LLMs in English and Low-Resource Languages

[Quick read]: This paper examines the factual accuracy of multilingual large language models (LLMs) in low-resource languages, especially Indic languages. The key to the approach is comparing model performance in English and in Indic languages using question-answer pairs from the IndicQuest dataset. This assesses how reliable the models are in each language and reveals the limits of current LLMs' multilingual understanding.

Link: https://arxiv.org/abs/2504.20022
Authors: Pritika Rohera, Chaitrali Ginimav, Gayatri Sawant, Raviraj Joshi
Affiliations: Pune Institute of Computer Technology; Indian Institute of Technology Madras; L3Cube Labs, Pune
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Multilingual Large Language Models (LLMs) have demonstrated significant effectiveness across various languages, particularly in high-resource languages such as English. However, their performance in terms of factual accuracy across other low-resource languages, especially Indic languages, remains an area of investigation. In this study, we assess the factual accuracy of LLMs - GPT-4o, Gemma-2-9B, Gemma-2-2B, and Llama-3.1-8B - by comparing their performance in English and Indic languages using the IndicQuest dataset, which contains question-answer pairs in English and 19 Indic languages. By asking the same questions in English and their respective Indic translations, we analyze whether the models are more reliable for regional context questions in Indic languages or when operating in English. Our findings reveal that LLMs often perform better in English, even for questions rooted in Indic contexts. Notably, we observe a higher tendency for hallucination in responses generated in low-resource Indic languages, highlighting challenges in the multilingual understanding capabilities of current LLMs.

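[NLP-2] LLM-Generated Fake News Induces Truth Decay in News Ecosystem: A Case Study on Neural News Recommendation SIGIR2025

[Quick read]: This paper studies how fake news produced at scale by generative AI affects the ranking of real news in news recommendation systems. The key to the approach is a simulation pipeline and a dataset of about 56,000 generated news items of diverse types, used to study the effects of generated news within news recommendation systems. The study reveals a "truth decay" phenomenon: as generated fake news is introduced, real news gradually loses its advantage in news ranking.

Link: https://arxiv.org/abs/2504.20013
Authors: Beizhe Hu, Qiang Sheng, Juan Cao, Yang Li, Danding Wang
Affiliations: Media Synthesis and Forensics Lab, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: ACM SIGIR 2025 Full Paper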

Abstract:Online fake news moderation now faces a new challenge brought by the malicious use of large language models (LLMs) in fake news production. Though existing works have shown LLM-generated fake news is hard to detect from an individual aspect, it remains underexplored how its large-scale release will impact the news ecosystem. In this study, we develop a simulation pipeline and a dataset with ~56k generated news of diverse types to investigate the effects of LLM-generated fake news within neural news recommendation systems. Our findings expose a truth decay phenomenon, where real news is gradually losing its advantageous position in news ranking against fake news as LLM-generated news is involved in news recommendation. We further provide an explanation about why truth decay occurs from a familiarity perspective and show the positive correlation between perplexity and news ranking. Finally, we discuss the threats of LLM-generated fake news and provide possible countermeasures. We urge stakeholders to address this emerging challenge to preserve the integrity of news ecosystems.

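[NLP-3] Knowledge Distillation of Domain-adapted LLMs for Question-Answering in Telecom

[Quick read]: This paper asks whether the teacher model, the student model, or both need domain adaptation when knowledge distillation (KD) is applied to a domain-specific task. The key to the approach is a systematic set of experiments on how supervised fine-tuning (SFT) of the teacher only, the student only, or both affects the distilled model, together with an analysis of vocabulary match and of different KD algorithms (vanilla KD and Dual Space KD, DSKD). The results show that SFT of the teacher improves the distilled model when both models share a vocabulary, and that SFT of both teacher and student performs better across all evaluation metrics.

Link: https://arxiv.org/abs/2504.20000
Authors: Rishika Sen, Sujoy Roychowdhury, Sumit Soman, H. G. Ranjani, Srikhetra Mohanty
Affiliations: Ericsson R&D
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, 3 tables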

Abstract:Knowledge Distillation (KD) is one of the approaches to reduce the size of Large Language Models (LLMs). A LLM with smaller number of model parameters (student) is trained to mimic the performance of a LLM of a larger size (teacher model) on a specific task. For domain-specific tasks, it is not clear if teacher or student model, or both, must be considered for domain adaptation. In this work, we study this problem from perspective of telecom domain Question-Answering (QA) task. We systematically experiment with Supervised Fine-tuning (SFT) of teacher only, SFT of student only and SFT of both prior to KD. We design experiments to study the impact of vocabulary (same and different) and KD algorithms (vanilla KD and Dual Space KD, DSKD) on the distilled model. Multi-faceted evaluation of the distillation using 14 different metrics (N-gram, embedding and LLM-based metrics) is considered. Experimental results show that SFT of teacher improves performance of distilled model when both models have same vocabulary, irrespective of algorithm and metrics. Overall, SFT of both teacher and student results in better performance across all metrics, although the statistical significance of the same depends on the vocabulary of the teacher models.
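To make the "vanilla KD" term concrete, here is a minimal sketch of the commonly used temperature-softened distillation loss. It assumes that "vanilla KD" refers to this standard formulation, which the abstract does not spell out, and the function names are illustrative. The loss compares teacher and student logits position by position, which is one reason a shared vocabulary matters: both logit vectors must index the same tokens.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits: List[float], student_logits: List[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes both logit vectors are over the same vocabulary, in the same order.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary")
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]
same_student = [2.0, 0.5, -1.0]
other_student = [-1.0, 0.5, 2.0]
```

When the vocabularies differ, this position-wise comparison is not directly applicable, which is the setting the abstract contrasts with the same-vocabulary case.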

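[NLP-4] TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

[Quick read]: This paper addresses the shortcomings of evaluation methods for task-oriented dialogue (TOD) systems, in particular the inability of traditional automatic metrics to detect critical intermediate errors that arise during user-agent interactions. The core of the solution is TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that combines fine-grained turn-level analysis with holistic dialogue-level comparisons. At the turn level, each response is evaluated along three dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. At the dialogue level, quality is measured through pairwise comparisons in the TOD Agent Arena. Experiments show that TD-EVAL identifies conversational errors that conventional metrics miss and aligns better with human judgments.

Link: https://arxiv.org/abs/2504.19982
Authors: Emre Can Acikgoz, Carl Guo, Suvodip Dey, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: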

Abstract:Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and \tau-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.

[NLP-5] Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

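[Quick read]: This paper addresses the need for trustworthy fact-checking created by the spread of online misinformation, in particular the variable quality and bias found in traditional human crowdsourced checking. The key to the solution is introducing autonomous agents driven by generative AI that emulate human behavior and decision-making. In fact-checking tasks these agents show higher accuracy in truthfulness classification, stronger internal consistency, and lower susceptibility to social and cognitive biases. By relying systematically on structured criteria such as Accuracy, Precision, and Informativeness, the agents offer scalable, consistent, and less biased contributions to crowd-based fact-checking systems.

Link: https://arxiv.org/abs/2504.19940
Authors: Luigia Costabile, Gian Marco Orlando, Valerio La Gatta, Vincenzo Moscato
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: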

Abstract:The growing spread of online misinformation has created an urgent need for scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where non-experts evaluate claim veracity - offers a cost-effective alternative to expert verification, despite concerns about variability in quality and bias. Encouraged by promising results in certain contexts, major platforms such as X (formerly Twitter), Facebook, and Instagram have begun shifting from centralized moderation to decentralized, crowd-based approaches. In parallel, advances in Large Language Models (LLMs) have shown strong performance across core fact-checking tasks, including claim detection and evidence evaluation. However, their potential role in crowdsourced workflows remains unexplored. This paper investigates whether LLM-powered generative agents - autonomous entities that emulate human behavior and decision-making - can meaningfully contribute to fact-checking tasks traditionally reserved for human crowds. Using the protocol of La Barbera et al. (2024), we simulate crowds of generative agents with diverse demographic and ideological profiles. Agents retrieve evidence, assess claims along multiple quality dimensions, and issue final veracity judgments. Our results show that agent crowds outperform human crowds in truthfulness classification, exhibit higher internal consistency, and show reduced susceptibility to social and cognitive biases. Compared to humans, agents rely more systematically on informative criteria such as Accuracy, Precision, and Informativeness, suggesting a more structured decision-making process. Overall, our findings highlight the potential of generative agents as scalable, consistent, and less biased contributors to crowd-based fact-checking systems. 

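[NLP-6] GenCLS++: Pushing the Boundaries of Generative Classification in LLMs Through Comprehensive SFT and RL Studies Across Diverse Datasets

[Quick read]: This paper addresses the failure of traditional discriminative methods to exploit the intrinsic generative strengths of large language models (LLMs) in text classification. Existing studies rely mainly on simple supervised fine-tuning (SFT), seldom probe the interplay between training and inference prompts, and have not systematically combined reinforcement learning (RL) with inference-time prompting strategies. The proposed solution is the GenCLS++ framework, which jointly optimizes SFT and RL and systematically explores five high-level strategy dimensions during both training and inference: in-context learning variants, category definitions, explicit uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding. After an SFT "policy warm-up", RL training with a simple rule-based reward yields sizable additional gains in classification performance.

Link: https://arxiv.org/abs/2504.19898
Authors: Mingqian He, Fei Zhao, Chonggang Lu, Ziyan Liu, Yue Wang, Haofu Qian
Affiliations: Zhejiang University; Xiaohongshu Inc.; Beijing University of Posts and Telecommunications; Nanjing University
Categories: Computation and Language (cs.CL)
Comments: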

Abstract:As a fundamental task in machine learning, text classification plays a crucial role in many areas. With the rapid scaling of Large Language Models (LLMs), particularly through reinforcement learning (RL), there is a growing need for more capable discriminators. Consequently, advances in classification are becoming increasingly vital for enhancing the overall capabilities of LLMs. Traditional discriminative methods map text to labels but overlook LLMs’ intrinsic generative strengths. Generative classification addresses this by prompting the model to directly output labels. However, existing studies still rely on simple SFT alone, seldom probing the interplay between training and inference prompts, and no work has systematically leveraged RL for generative text classifiers and unified SFT, RL, and inference-time prompting in one framework. We bridge this gap with GenCLS++, a framework that jointly optimizes SFT and RL while systematically exploring five high-level strategy dimensions-in-context learning variants, category definitions, explicit uncertainty labels, semantically irrelevant numeric labels, and perplexity-based decoding-during both training and inference. After an SFT “policy warm-up,” we apply RL with a simple rule-based reward, yielding sizable extra gains. Across seven datasets, GenCLS++ achieves an average accuracy improvement of 3.46% relative to the naive SFT baseline; on public datasets, this improvement rises to 4.00%. Notably, unlike reasoning-intensive tasks that benefit from explicit thinking processes, we find that classification tasks perform better without such reasoning steps. These insights into the role of explicit reasoning provide valuable guidance for future LLM applications.
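One of the five strategy dimensions, perplexity-based decoding, can be illustrated with a small sketch. The abstract only names the strategy, so the procedure below is an assumption: each candidate label is scored by the perplexity of its tokens under the model, and the label with the lowest perplexity is chosen. The log-probabilities are made-up inputs, and in practice they would come from the classifier LLM.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of a token sequence from its natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_label(label_logprobs: Dict[str, List[float]]) -> str:
    """Choose the candidate label whose tokens have the lowest perplexity."""
    return min(label_logprobs, key=lambda label: perplexity(label_logprobs[label]))


# Hypothetical per-token log-probabilities for two candidate labels.
candidates = {
    "sports": [-0.1, -0.2],
    "politics": [-1.0, -2.0],
}
chosen = pick_label(candidates)
```

Averaging over tokens before exponentiating keeps multi-token labels comparable with single-token ones, which is the usual reason to prefer perplexity over a raw sequence probability.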

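[NLP-7] semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

[Quick read]: This paper addresses the serving performance bottleneck caused by storage inefficiency in existing large language model (LLM) serving systems. Existing disaggregated systems enable asynchronous computation by separating computational resources, but they introduce storage challenges, including replicated weights, key-value (KV) cache transfer overhead, storage imbalance, and difficulty migrating the KV cache, all of which hurt serving performance under high request rates. The key to the solution is a new LLM serving system, semi-PD, built on disaggregated computation and unified storage. It introduces a computation resource controller and a unified memory manager, provides a low-overhead resource adjustment mechanism, and uses a service-level objective (SLO) aware dynamic partitioning algorithm to optimize SLO attainment, improving throughput while keeping latency low.

Link: https://arxiv.org/abs/2504.19867
Authors: Ke Hong, Lufang Chen, Zhong Wang, Xiuhong Li, Qiuli Mao, Jianping Ma, Chao Xiong, Guanyu Wu, Buhe Han, Guohao Dai, Yun Liang, Yu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 18 pages, 16 figures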

Abstract:Existing large language model (LLM) serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models. 

[NLP-8] Efficient Domain-adaptive Continual Pretraining for the Process Industry in the German Language

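[Quick read]: This paper addresses the difficulty of obtaining data for domain-adaptive continual pretraining (DAPT) in specific languages and domains, especially industrial domains in non-English languages such as German. The key to the solution is an efficient method called ICL-augmented pretraining (ICL-APT), which uses in-context learning (ICL) and k-nearest neighbors (kNN) to augment the target data with domain-related texts, significantly reducing GPU time while maintaining model performance.

Link: https://arxiv.org/abs/2504.19856
Authors: Anastasia Zhukova, Christian E. Matt, Terry Ruas, Bela Gipp
Affiliations: University of Göttingen
Categories: Computation and Language (cs.CL)
Comments: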

Abstract:Domain-adaptive continual pretraining (DAPT) is a state-of-the-art technique that further trains a language model (LM) on its pretraining task, e.g., language masking. Although popular, it requires a significant corpus of domain-related data, which is difficult to obtain for specific domains in languages other than English, such as the process industry in the German language. This paper introduces an efficient approach called ICL-augmented pretraining or ICL-APT that leverages in-context learning (ICL) and k-nearest neighbors (kNN) to augment target data with domain-related and in-domain texts, significantly reducing GPU time while maintaining strong model performance. Our results show that this approach performs better than traditional DAPT by 3.5 of the average IR metrics (e.g., mAP, MRR, and nDCG) and requires almost 4 times less computing time, providing a cost-effective solution for industries with limited computational capacity. The findings highlight the broader applicability of this framework to other low-resource industries, making NLP-based solutions more accessible and feasible in production environments.

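[NLP-9] To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels

[Quick read]: This paper studies how creativity and errors in different translation modalities (machine translation, post-editing, human translation, and the original source text) affect readers' cognitive load, with a focus on the role of units of creative potential (UCP). The key to the approach is triangulating data from several methods: eye-tracking, questionnaires, and retrospective think-aloud (RTA) interviews. The study finds that UCPs increase cognitive load, that this effect is highest for human translation and lowest for machine translation, and that errors have no observed effect on cognitive load.

Link: https://arxiv.org/abs/2504.19850
Authors: Kyo Gerrits, Ana Guerberof-Arenas
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL)
Comments: This paper has been accepted to the MT Summit 2025 to be held in Geneva on June 23-27 2025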

Abstract:This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at this https URL

[NLP-10] Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

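[Quick read]: This paper addresses how to accurately predict the performance of large language models (LLMs) before extensive fine-tuning or merging. Traditional approaches such as scaling laws account for global factors like parameter count or training data volume, but they often overlook explicit lineage relationships between models. The proposed solution is the Lineage-Regularized Matrix Factorization (LRMF) framework. Its key idea is to encode ancestral relationships among LLMs through a graph Laplacian regularizer and to use multi-hop parent-child connections to improve both instance-level and benchmark-level performance prediction.

Link: https://arxiv.org/abs/2504.19811
Authors: Takuya Tamura, Taro Yano, Masafumi Enomoto, Masafumi Oyamada
Affiliations: NEC Corporation
Categories: Computation and Language (cs.CL)
Comments: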

Abstract:Accurately forecasting the performance of Large Language Models (LLMs) before extensive fine-tuning or merging can substantially reduce both computational expense and development time. Although prior approaches like scaling laws account for global factors such as parameter size or training tokens, they often overlook explicit lineage relationships - i.e., which models are derived or merged from which parents. In this work, we propose a novel Lineage-Regularized Matrix Factorization (LRMF) framework that encodes ancestral ties among LLMs via a graph Laplacian regularizer. By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods in both instance-level and benchmark-level performance prediction. Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks, showing that lineage constraints yield up to 7-10 percentage points higher correlation with actual performance compared to baselines. Moreover, LRMF effectively addresses the cold-start problem, providing accurate estimates for newly derived or merged models even with minimal data. This lineage-guided strategy thus offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination in modern LLM development.
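The idea of a graph Laplacian regularizer on top of matrix factorization can be sketched as below. This is an illustration under stated assumptions, not the paper's exact objective: the lineage graph is unweighted and uses direct parent-child edges only, the loss is squared error over observed scores, and all names and numbers are hypothetical. For an unweighted graph with Laplacian L = D - A, the penalty tr(U^T L U) equals the sum over edges of the squared distance between the two models' latent factors, which is what `lineage_penalty` computes.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def lineage_penalty(factors: Dict[str, Vector], edges: List[Tuple[str, str]]) -> float:
    """Sum of squared distances between latent factors of linked models."""
    total = 0.0
    for parent, child in edges:
        total += sum((a - b) ** 2 for a, b in zip(factors[parent], factors[child]))
    return total


def lrmf_objective(
    scores: Dict[Tuple[str, str], float],
    model_factors: Dict[str, Vector],
    item_factors: Dict[str, Vector],
    edges: List[Tuple[str, str]],
    lam: float,
) -> float:
    """Squared reconstruction error on observed scores plus the lineage penalty."""
    recon = 0.0
    for (model, item), observed in scores.items():
        predicted = sum(a * b for a, b in zip(model_factors[model], item_factors[item]))
        recon += (observed - predicted) ** 2
    return recon + lam * lineage_penalty(model_factors, edges)


# Hypothetical toy data: a base model, a fine-tune of it, and an unrelated model.
model_factors = {"base": [1.0, 0.0], "finetune": [1.0, 0.0], "other": [0.0, 1.0]}
item_factors = {"q1": [1.0, 0.0]}
scores = {("base", "q1"): 1.0}
```

Minimizing this objective pulls a derived model's factors toward its parent's, which is how a newly merged or fine-tuned model with few observed scores can still receive a sensible estimate.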

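[NLP-11] Moral Reasoning Across Languages: The Critical Role of Low-Resource Languages in LLMs

[Quick read]: This paper addresses the evaluation of moral reasoning in large language models (LLMs) across multiple languages, with attention to typologically different languages and levels of contextual complexity. The key to the approach is the Multilingual Moral Reasoning Benchmark (MMRB), on which model performance is evaluated, together with fine-tuning the open-source LLaMA-3-8B model on curated monolingual data for alignment and poisoning. The study finds that low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, which highlights their critical role in multilingual natural language processing.

Link: https://arxiv.org/abs/2504.19759
Authors: Huichi Zhou, Zehao Xu, Munan Zhao, Kaihong Li, Yiqiang Li, Hongtao Wang
Affiliations: Imperial College London; North China Electric Power University; Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures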

Abstract:In this paper, we introduce the Multilingual Moral Reasoning Benchmark (MMRB) to evaluate the moral reasoning abilities of large language models (LLMs) across five typologically diverse languages and three levels of contextual complexity: sentence, paragraph, and document. Our results show moral reasoning performance degrades with increasing context complexity, particularly for low-resource languages such as Vietnamese. We further fine-tune the open-source LLaMA-3-8B model using curated monolingual data for alignment and poisoning. Surprisingly, low-resource languages have a stronger impact on multilingual reasoning than high-resource ones, highlighting their critical role in multilingual NLP.

[NLP-12] Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation ECIR2025

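[Quick read]: This paper addresses how retrieval-augmented generation (RAG) systems can manage large volumes of external knowledge within the input constraints of large language models (LLMs). Traditional methods split external documents into fixed-size segments to ease the input limit, but this often fragments context, which hurts the completeness of retrieval and the coherence of generation. The paper's contribution is a comparative analysis of two advanced techniques, late chunking and contextual retrieval, evaluating their effectiveness and efficiency in RAG systems. Contextual retrieval preserves semantic coherence more effectively, while late chunking is more computationally efficient.

Link: https://arxiv.org/abs/2504.19754
Authors: Carlo Merola, Jaspinder Singh
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 2 figures, Second Workshop on Knowledge-Enhanced Information Retrieval, ECIR 2025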

Abstract:Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of external knowledge be managed effectively within the input constraints of LLMs? Traditional methods address this by chunking external documents into smaller, fixed-size segments. While this approach alleviates input limitations, it often fragments context, resulting in incomplete retrieval and diminished coherence in generation. To overcome these shortcomings, two advanced techniques, late chunking and contextual retrieval, have been introduced, both aiming to preserve global context. Despite their potential, their comparative strengths and limitations remain unclear. This study presents a rigorous analysis of late chunking and contextual retrieval, evaluating their effectiveness and efficiency in optimizing RAG systems. Our results indicate that contextual retrieval preserves semantic coherence more effectively but requires greater computational resources. In contrast, late chunking offers higher efficiency but tends to sacrifice relevance and completeness.
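The difference between fixed-size chunking and late chunking can be sketched in a few lines. This follows the general idea of late chunking as it is commonly described: encode the whole document once, then pool the token vectors per chunk. It is not the evaluation code from the paper. The token vectors below are toy values, and in practice they would come from a long-context embedding model, which is assumed here.

```python
from typing import List, Tuple

Vector = List[float]


def fixed_size_spans(num_tokens: int, size: int) -> List[Tuple[int, int]]:
    """Split token positions into consecutive [start, end) spans of at most `size`."""
    return [(start, min(start + size, num_tokens)) for start in range(0, num_tokens, size)]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def late_chunk_embeddings(token_vectors: List[Vector],
                          spans: List[Tuple[int, int]]) -> List[Vector]:
    """Pool per-token vectors that were computed over the WHOLE document.

    Because the encoder saw the full document, each chunk vector can carry
    context from outside its own span.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy per-token vectors for a 5-token document (hypothetical values).
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
spans = fixed_size_spans(len(token_vectors), size=2)
chunk_vectors = late_chunk_embeddings(token_vectors, spans)
```

Plain fixed-size chunking would instead encode each span's text separately, and contextual retrieval would prepend generated context to each chunk before encoding, which requires extra model calls and is consistent with the higher cost the abstract reports.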

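[NLP-13] LLM-Assisted Automated Deductive Coding of Dialogue Data: Leveraging Dialogue-Specific Characteristics to Enhance Contextual Understanding

[Quick read]: This paper addresses the contextual complexity of automated coding of dialogue data, in particular the difficulty of understanding and interpreting complex contextual information. The key to the solution is a new automated coding framework based on large language models (LLMs). The framework predicts the code for an utterance from dialogue-specific characteristics, namely communicative acts and communicative events, using separate prompts that follow the role prompt and chain-of-thought methods. It also applies consistency checking through collaboration among multiple models and the interrelation between events and acts, which substantially improves coding accuracy.

Link: https://arxiv.org/abs/2504.19734
Authors: Ying Na, Shihui Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: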

Abstract:Dialogue data has been a key source for understanding learning processes, offering critical insights into how students engage in collaborative discussions and how these interactions shape their knowledge construction. The advent of Large Language Models (LLMs) has introduced promising opportunities for advancing qualitative research, particularly in the automated coding of dialogue data. However, the inherent contextual complexity of dialogue presents unique challenges for these models, especially in understanding and interpreting complex contextual information. This study addresses these challenges by developing a novel LLM-assisted automated coding approach for dialogue data. The novelty of our proposed framework is threefold: 1) We predict the code for an utterance based on dialogue-specific characteristics – communicative acts and communicative events – using separate prompts following the role prompts and chain-of-thoughts methods; 2) We engaged multiple LLMs including GPT-4-turbo, GPT-4o, DeepSeek in collaborative code prediction; 3) We leveraged the interrelation between events and acts to implement consistency checking using GPT-4o. In particular, our contextual consistency checking provided a substantial accuracy improvement. We also found the accuracy of act predictions was consistently higher than that of event predictions. This study contributes a new methodological framework for enhancing the precision of automated coding of dialogue data as well as offers a scalable solution for addressing the contextual challenges inherent in dialogue analysis.

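[NLP-14] Evaluate-and-Purify: Fortifying Code Language Models Against Adversarial Attacks Using LLM-as-a-Judge

[Quick read]: This paper addresses the security vulnerability that identifier substitution attacks create for code language models used in software engineering tasks. Existing attack methods achieve high success rates, but the adversarial examples they generate often contain unnatural code patterns. The paper proposes EP-Shield, whose key idea is a naturalness-aware reasoning framework that evaluates adversarial examples and purifies them so that the target model can restore the correct prediction.

Link: https://arxiv.org/abs/2504.19730
Authors: Wenhan Mu, Ling Xu, Shuren Pei, Le Mi, Huichi Zhou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 25 pages, 6 figures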

Abstract: The widespread adoption of code language models in software engineering tasks has exposed vulnerabilities to adversarial attacks, especially the identifier substitution attacks. Although existing identifier substitution attackers demonstrate high success rates, they often produce adversarial examples with unnatural code patterns. In this paper, we systematically assess the quality of adversarial examples using LLM-as-a-Judge. Our analysis reveals that over 80% of adversarial examples generated by state-of-the-art identifier substitution attackers (e.g., ALERT) are actually detectable. Based on this insight, we propose EP-Shield, a unified framework for evaluating and purifying identifier substitution attacks via naturalness-aware reasoning. Specifically, we first evaluate the naturalness of code and identify the perturbed adversarial code, then purify it so that the victim model can restore correct prediction. Extensive experiments demonstrate the superiority of EP-Shield over adversarial fine-tuning (up to 83.36% improvement) and its lightweight design (7B parameters) with GPT-4-level performance.

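[NLP-15] Taming the Titans: A Survey of Efficient LLM Inference Serving

[Quick read]: This paper addresses the challenge of achieving low latency and high throughput in inference serving for large language models (LLMs) used in generative AI. The challenge stems mainly from the memory overhead of the models' vast number of parameters and the high computational demands of the attention mechanism. The survey organizes solutions into multi-level optimization strategies: instance-level model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm; cluster-level GPU cluster deployment, multi-instance load balancing, and cloud service solutions; and, for emerging scenarios, specific tasks, modules, and auxiliary methods, all aimed at improving the efficiency and performance of LLM inference serving.

Link: https://arxiv.org/abs/2504.19720
Authors: Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Affiliations: Soochow University; Huawei Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: work in progress; 11 pages of main paper with 7 main figures, overall 20 pages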

Abstract:Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.

[NLP-16] Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs SEMEVAL-2025

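【Quick Read】: This paper aims to improve the accuracy and efficiency of subject indexing in multilingual settings, specifically by using large language models (LLMs) to predict subjects for bibliographic records in the bilingual TIBKAT database. The key to its solution is combining traditional machine-learning-based extreme multi-label text classification (XMTC) algorithms with LLM-based methods for translation and synthetic data generation, and merging the predictions of monolingual models, which improves subject indexing performance in multilingual scenarios.

Link: https://arxiv.org/abs/2504.19675
Authors: Osma Suominen,Juho Inkinen,Mona Lehtinen
Affiliations: National Library of Finland; University of Helsinki
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures, submitted to SemEval-2025 workshop Task 5: LLMs4Subjects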

Abstract:This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
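
As a rough illustration of what merging predictions from monolingual models can look like, the sketch below takes a weighted mean of per-subject scores from two hypothetical models and keeps the top-ranked subjects. The function name, the subject labels, and the weighted-mean rule are assumptions for illustration; the abstract does not say how the scores are actually combined.

```python
from collections import defaultdict

def merge_predictions(per_model_scores, weights=None, top_k=3):
    """Weighted mean of subject scores across models (hypothetical helper)."""
    if weights is None:
        weights = [1.0] * len(per_model_scores)
    total = sum(weights)
    merged = defaultdict(float)
    for scores, weight in zip(per_model_scores, weights):
        for subject, score in scores.items():
            # A subject missing from one model contributes 0 for that model.
            merged[subject] += weight * score / total
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Toy scores from two monolingual models (made-up values).
german = {"Machine learning": 0.8, "Libraries": 0.4}
english = {"Machine learning": 0.6, "Indexing": 0.5}
print(merge_predictions([german, english]))
```

A subject suggested by only one model is kept but down-weighted, which is one simple way to reward agreement between languages.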

[NLP-17] Multimodal Conditioned Diffusive Time Series Forecasting

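【Quick Read】: This paper addresses a limitation of existing diffusion-based methods for time series forecasting (TSF): they focus mainly on single-modality numerical sequences and overlook the rich multimodal information in time series data. The key to its solution is a multimodal conditioned diffusion model for TSF (MCD-TSF) that jointly uses timestamps and texts as extra guidance to strengthen time series modeling and forecasting. Specifically, timestamps establish temporal and semantic correlations among data points along the temporal dimension, while texts serve as supplementary descriptions of the series' history and are adaptively aligned and dynamically controlled in a classifier-free manner.

Link: https://arxiv.org/abs/2504.19669
Authors: Chen Su,Yuanhe Tian,Yan Song
Affiliations: University of Science and Technology of China; University of Washington
Subjects: Computation and Language (cs.CL)
Comments: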

Abstract:Diffusion models achieve remarkable success in processing images and text, and have been extended to special domains such as time series forecasting (TSF). Existing diffusion-based approaches for TSF primarily focus on modeling single-modality numerical sequences, overlooking the rich multimodal information in time series data. To effectively leverage such information for prediction, we propose a multimodal conditioned diffusion model for TSF, namely, MCD-TSF, to jointly utilize timestamps and texts as extra guidance for time series modeling, especially for forecasting. Specifically, Timestamps are combined with time series to establish temporal and semantic correlations among different data points when aggregating information along the temporal dimension. Texts serve as supplementary descriptions of time series’ history, and adaptively aligned with data points as well as dynamically controlled in a classifier-free manner. Extensive experiments on real-world benchmark datasets across eight domains demonstrate that the proposed MCD-TSF model achieves state-of-the-art performance.
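
The phrase "classifier-free" can be made concrete with the usual classifier-free guidance rule, in which a noise prediction made with the text condition is blended with one made without it. The sketch below shows that blend on toy arrays; the function name, the guidance scale, and the values are assumptions, and the abstract does not state that MCD-TSF uses exactly this form.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    guidance_scale = 0 ignores the text condition, 1 uses it as-is,
    and values above 1 extrapolate further toward the condition.
    """
    eps_cond = np.asarray(eps_cond, dtype=float)
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions for a three-step forecast window (made-up values).
eps_c = np.array([0.2, -0.1, 0.4])  # with text guidance
eps_u = np.array([0.0, 0.1, 0.3])   # without text guidance
print(guided_noise(eps_c, eps_u, 2.0))
```

Because the scale is a plain scalar, the influence of the text can be changed at sampling time without retraining the denoiser.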

[NLP-18] A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks

【Quick Read】: This paper addresses the lack of standardized and comprehensive part-of-speech (POS) tagging for low-resource languages in natural language processing (NLP), in particular the Central-Kurdish language (CKL). The key to its solution is building an accurate and comprehensive CKL POS tagset by consolidating the POS tags used in different studies with input from Kurdish linguistic experts, so that large CKL corpora can be annotated and the performance of Kurdish NLP tasks improved.

Link: https://arxiv.org/abs/2504.19645
Authors: Shadan Shukr Sabr,Nazira Sabr Mustafa,Talar Sabah Omar,Salah Hwayyiz Rasool,Nawzad Anwer Omer,Darya Sabir Hamad,Hemin Abdulhameed Shams,Omer Mahmood Kareem,Rozhan Noori Abdullah,Khabat Atar Abdullah,Mahabad Azad Mohammad,Haneen Al-Raghefy,Safar M. Asaad,Sara Jamal Mohammed,Twana Saeed Ali,Fazil Shawrow,Halgurd S. Maghdid
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures, 2 tables

Abstract:The field of natural language processing (NLP) has dramatically expanded within the last decade. Many human-being applications are conducted daily via NLP tasks, starting from machine translation, speech recognition, text generation and recommendations, Part-of-Speech tagging (POS), and Named-Entity Recognition (NER). However, low-resourced languages, such as the Central-Kurdish language (CKL), mainly remain unexamined due to shortage of necessary resources to support their development. The POS tagging task is the base of other NLP tasks; for example, the POS tag set has been used to standardized languages to provide the relationship between words among the sentences, followed by machine translation and text recommendation. Specifically, for the CKL, most of the utilized or provided POS tagsets are neither standardized nor comprehensive. To this end, this study presented an accurate and comprehensive POS tagset for the CKL to provide better performance of the Kurdish NLP tasks. The article also collected most of the POS tags from different studies as well as from Kurdish linguistic experts to standardized part-of-speech tags. The proposed POS tagset is designed to annotate a large CKL corpus and support Kurdish NLP tasks. The initial investigations of this study via comparison with the Universal Dependencies framework for standard languages, show that the proposed POS tagset can streamline or correct sentences more accurately for Kurdish NLP tasks.

[NLP-19] VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning

【Quick Read】: This paper addresses the inefficiency of large vision-language models (LVLMs) in processing images: current LVLMs process entire images at the token level, whereas humans analyze information and generate content efficiently at the conceptual level. The key to its solution is VCM, an end-to-end self-supervised visual concept modeling framework that builds a visual concept model through implicit contrastive learning across multiple sampled instances together with vision-language fine-tuning, without relying on costly concept-level annotations. The method significantly reduces computational cost while maintaining strong performance across a variety of image understanding tasks.

Link: https://arxiv.org/abs/2504.19627
Authors: Run Luo,Renke Shan,Longze Chen,Ziqiang Liu,Lu Wang,Min Yang,Xiaobo Xia
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National University of Singapore; MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: VCM

Abstract:Large Vision-Language Models (LVLMs) are pivotal for real-world AI tasks like embodied intelligence due to their strong vision-language reasoning abilities. However, current LVLMs process entire images at the token level, which is inefficient compared to humans who analyze information and generate content at the conceptual level, extracting relevant visual concepts with minimal effort. This inefficiency, stemming from the lack of a visual concept model, limits LVLMs’ usability in real-world applications. To address this, we propose VCM, an end-to-end self-supervised visual concept modeling framework. VCM leverages implicit contrastive learning across multiple sampled instances and vision-language fine-tuning to construct a visual concept model without requiring costly concept-level annotations. Our results show that VCM significantly reduces computational costs (e.g., 85% fewer FLOPs for LLaVA-1.5-7B) while maintaining strong performance across diverse image understanding tasks. Moreover, VCM enhances visual encoders’ capabilities in classic visual concept perception tasks. Extensive quantitative and qualitative experiments validate the effectiveness and efficiency of VCM.

[NLP-20] Coreference Resolution for Vietnamese Narrative Texts ACL

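【Quick Read】: This paper addresses coreference resolution for Vietnamese, a task that is particularly challenging because Vietnamese is a low-resource language with few annotated datasets. Its key contribution is a comprehensive annotated dataset built from narrative texts on VnExpress, a widely read Vietnamese online news platform, together with detailed annotation guidelines that ensure consistent and accurate entity annotation. The paper also evaluates large language models (LLMs), namely GPT-3.5-Turbo and GPT-4, on this dataset; GPT-4 significantly outperforms GPT-3.5-Turbo in both accuracy and response consistency.

Link: https://arxiv.org/abs/2504.19606
Authors: Hieu-Dai Tran,Duc-Vu Nguyen,Ngan Luu-Thuy Nguyen
Affiliations: University of Information Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
Subjects: Computation and Language (cs.CL)
Comments: Accepted at PACLIC 2024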

Abstract:Coreference resolution is a vital task in natural language processing (NLP) that involves identifying and linking different expressions in a text that refer to the same entity. This task is particularly challenging for Vietnamese, a low-resource language with limited annotated datasets. To address these challenges, we developed a comprehensive annotated dataset using narrative texts from VnExpress, a widely-read Vietnamese online news platform. We established detailed guidelines for annotating entities, focusing on ensuring consistency and accuracy. Additionally, we evaluated the performance of large language models (LLMs), specifically GPT-3.5-Turbo and GPT-4, on this dataset. Our results demonstrate that GPT-4 significantly outperforms GPT-3.5-Turbo in terms of both accuracy and response consistency, making it a more reliable tool for coreference resolution in Vietnamese.

[NLP-21] Arabic Metaphor Sentiment Classification Using Semantic Information

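【Quick Read】: This paper addresses the classification of how Arabic metaphors affect sentiment in online contexts, specifically sentiment classification of Arabic metaphors using semantic tags. The key to its solution is a newly designed automatic tool that incorporates semantic emotional tags for sentiment classification and is evaluated with standard metrics (F-score, recall, and precision), thereby showing the impact of metaphors on sentiment.

Link: https://arxiv.org/abs/2504.19590
Authors: Israa Alsiyat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: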

Abstract:In this paper, I discuss the testing of the Arabic Metaphor Corpus (AMC) [1] using newly designed automatic tools for sentiment classification for AMC based on semantic tags. The tool incorporates semantic emotional tags for sentiment classification. I evaluate the tool using standard methods, which are F-score, recall, and precision. The method is to show the impact of Arabic online metaphors on sentiment through the newly designed tools. To the best of our knowledge, this is the first approach to conduct sentiment classification for Arabic metaphors using semantic tags to find the impact of the metaphor.
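
For readers unfamiliar with the three evaluation metrics named above, the sketch below computes precision, recall, and F-score (here the balanced F1) for one class from gold and predicted labels. The label names and the toy data are made up; the abstract does not say which F-score variant or averaging scheme was used.

```python
def precision_recall_f1(gold, predicted, positive_label):
    """Per-class precision, recall, and F1 from two aligned label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy sentiment labels for five metaphorical sentences (made-up data).
gold = ["pos", "neg", "pos", "pos", "neg"]
pred = ["pos", "pos", "pos", "neg", "neg"]
print(precision_recall_f1(gold, pred, "pos"))
```

In this toy case there are two true positives, one false positive, and one false negative, so all three metrics equal 2/3.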

[NLP-22] Graph-Based Spectral Decomposition for Parameter Coordination in Language Model Fine-Tuning

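【Quick Read】: This paper addresses inefficiency and insufficient structural awareness when fine-tuning large language models. Its key solution is to introduce graph spectral analysis to optimize coordinated parameter updates: model parameters are treated as nodes of a weighted graph, and Laplacian spectral decomposition provides frequency-domain modeling and a structural representation of the parameter space. On this basis, the paper designs a joint loss that combines the task loss with a spectral regularization term, and introduces a spectral filtering mechanism that adjusts gradients in a structure-aware manner during optimization, improving training stability and convergence.

Link: https://arxiv.org/abs/2504.19583
Authors: Hanlu Zhang,Yumeng Ma,Shuo Wang,Guiran Liu,Binrong Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: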

Abstract:This paper proposes a parameter collaborative optimization algorithm for large language models, enhanced with graph spectral analysis. The goal is to improve both fine-tuning efficiency and structural awareness during training. In the proposed method, the parameters of a pre-trained language model are treated as nodes in a graph. A weighted graph is constructed, and Laplacian spectral decomposition is applied to enable frequency-domain modeling and structural representation of the parameter space. Based on this structure, a joint loss function is designed. It combines the task loss with a spectral regularization term to facilitate collaborative updates among parameters. In addition, a spectral filtering mechanism is introduced during the optimization phase. This mechanism adjusts gradients in a structure-aware manner, enhancing the model’s training stability and convergence behavior. The method is evaluated on multiple tasks, including traditional fine-tuning comparisons, few-shot generalization tests, and convergence speed analysis. In all settings, the proposed approach demonstrates superior performance. The experimental results confirm that the spectral collaborative optimization framework effectively reduces parameter perturbations and improves fine-tuning quality while preserving overall model performance. This work contributes significantly to the field of artificial intelligence by advancing parameter-efficient training methodologies for large-scale models, reinforcing the importance of structural signal processing in deep learning optimization, and offering a robust, generalizable framework for enhancing language model adaptability and performance.
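
The ingredients named in the abstract (a weighted graph over parameters, a Laplacian spectral decomposition, a joint loss with a spectral term, and gradient filtering) can be sketched on a three-node toy graph. Everything specific here is an assumption: the unnormalized Laplacian, the quadratic penalty, the low-pass filter, and all names and values are illustrative choices, not the paper's formulation.

```python
import numpy as np

def laplacian(weights):
    """Unnormalized graph Laplacian L = D - W for a symmetric weight matrix."""
    weights = np.asarray(weights, dtype=float)
    return np.diag(weights.sum(axis=1)) - weights

def spectral_penalty(theta, lap):
    """theta^T L theta: large when connected nodes hold dissimilar values."""
    theta = np.asarray(theta, dtype=float)
    return float(theta @ lap @ theta)

def low_pass_filter(grad, lap, keep):
    """Project a gradient onto the `keep` lowest-frequency eigenvectors of L."""
    _, eigvecs = np.linalg.eigh(lap)  # eigenvalues come back in ascending order
    basis = eigvecs[:, :keep]
    return basis @ (basis.T @ np.asarray(grad, dtype=float))

# Path graph over three parameter groups: 0 - 1 - 2 (toy example).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = laplacian(W)
theta = np.array([1.0, 2.0, 4.0])

task_loss = 0.5  # placeholder for the task loss value
lam = 0.1        # hypothetical regularization weight
joint_loss = task_loss + lam * spectral_penalty(theta, L)
print(joint_loss, low_pass_filter([3.0, 0.0, 0.0], L, keep=1))
```

On a connected graph the lowest-frequency eigenvector is constant, so keeping only that component replaces the gradient by its mean across nodes, which is the most aggressive form of smoothing.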

[NLP-23] m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

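【Quick Read】: This paper addresses the insufficient quantity and quality of existing open-source annotated biomedical scientific corpora, as well as the challenge that the complex hierarchy of biomedical knowledge poses for training large language models (LLMs). The key to its solution is a knowledge-driven multi-agent framework: in a collaborative architecture, specialized agents guided by the Medical Subject Headings (MeSH) hierarchy autonomously extract, synthesize, and self-evaluate high-quality text, generating question-answer pairs that are consistent with biomedical ontologies and offer comprehensive coverage, which improves language model performance in the biomedical domain.

Link: https://arxiv.org/abs/2504.19565
Authors: Meng Xiao,Xunxin Cai,Chengrui Wang,Yuanchun Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 22 pages, Large Language Model, Agentic AI, Dataset Distillation, Multi-agent Collaboration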

Abstract:The rapid progress of large language models (LLMs) in biomedical research has underscored the limitations of existing open-source annotated scientific corpora, which are often insufficient in quantity and quality. Addressing the challenge posed by the complex hierarchy of biomedical knowledge, we propose a knowledge-driven, multi-agent framework for scientific corpus distillation tailored for LLM training in the biomedical domain. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. These agents collectively generate and refine domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.

[NLP-24] Detecting Effects of AI-Mediated Communication on Language Complexity and Sentiment

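【Quick Read】: This paper examines how AI-mediated communication (AI-MC) affects language patterns and emotional expression on social media. The key to its approach is a comparative analysis of tweets mentioning Donald Trump from 2020 and 2024, using Flesch-Kincaid readability scores and sentiment polarity scores to assess changes in language complexity and sentiment, and thereby to reveal the influence of AI on social media communication and on patterns of language and emotional expression.

Link: https://arxiv.org/abs/2504.19556
Authors: Kristen Sussman,Daniel Carter
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 3 figures, Companion Proceedings of the ACM Web Conference 2025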

Abstract:Given the subtle human-like effects of large language models on linguistic patterns, this study examines shifts in language over time to detect the impact of AI-mediated communication (AI-MC) on social media. We compare a replicated dataset of 970,919 tweets from 2020 (pre-ChatGPT) with 20,000 tweets from the same period in 2024, all of which mention Donald Trump during election periods. Using a combination of Flesch-Kincaid readability and polarity scores, we analyze changes in text complexity and sentiment. Our findings reveal a significant increase in mean sentiment polarity (0.12 vs. 0.04) and a shift from predominantly neutral content (54.8% in 2020 to 39.8% in 2024) to more positive expressions (28.6% to 45.9%). These findings suggest not only an increasing presence of AI in social media communication but also its impact on language and emotional expression patterns.
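
To make the readability measure concrete, the sketch below computes the Flesch-Kincaid grade level, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. It assumes the grade-level variant is the one meant (the abstract says only "Flesch-Kincaid readability"), and the syllable counter is a crude vowel-run heuristic rather than whatever tool the authors used.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level; returns 0.0 for text with no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat on the mat."
dense = "Institutional communication necessitates considerable deliberation."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
```

Short sentences made of one-syllable words can score below zero, while long, polysyllabic text scores far higher, which is the contrast a before-and-after comparison of tweets would rely on.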

[NLP-25] FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation

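【Quick Read】: This paper addresses the performance bottleneck caused by inter-GPU communication in multi-GPU computing systems, particularly on consumer-grade GPUs. Existing designs fail to optimize simultaneously for three key properties: tile-wise overlapping, interference-free computation, and communication agnosticism. The key to its solution is FlashOverlap, a lightweight design that achieves all three and thus effectively reduces communication overhead. Its novelty lies in a new signaling mechanism that identifies tile-wise data dependencies without interrupting computation, and in reordering data to contiguous addresses for communication, which improves overall performance.

Link: https://arxiv.org/abs/2504.19519
Authors: Ke Hong,Xiuhong Li,Minxu Liu,Qiuli Mao,Tianqi Wu,Zixiao Huang,Lufang Chen,Zhong Wang,Yichong Zhang,Zhenhua Zhu,Guohao Dai,Yu Wang
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures, 4 tables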

Abstract:Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. Inter-GPU communication becomes a bottleneck in multi-GPU computing systems, particularly on consumer-grade GPUs. By exploiting concurrent hardware execution, overlapping computation and communication latency is an effective technique for mitigating the communication overhead. We identify that an efficient and adaptable overlapping design should satisfy (1) tile-wise overlapping to maximize the overlapping opportunity, (2) interference-free computation to maintain the original computational performance, and (3) communication agnosticism to reduce the development burden against varying communication primitives. Nevertheless, current designs fail to simultaneously optimize for all of those features. To address the issue, we propose FlashOverlap, a lightweight design characterized by tile-wise overlapping, interference-free computation, and communication agnosticism. FlashOverlap utilizes a novel signaling mechanism to identify tile-wise data dependency without interrupting the computation process, and reorders data to contiguous addresses, enabling communication by simply calling NCCL APIs. Experiments show that such a lightweight design achieves up to 1.65x speedup, outperforming existing works in most cases.

[NLP-26] Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding CVPR2025

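【Quick Read】: This paper addresses semantic segmentation in open-vocabulary 3D scene understanding, with the goal of enhancing physical intelligence so that embodied agents can interpret and interact dynamically within real-world environments. The key to its solution is MPEC (Masked Point-Entity Contrastive learning), a method that leverages 3D entity-language alignment and point-entity consistency across different point cloud views to build entity-specific feature representations, improving semantic discrimination and the differentiation of individual instances.

Link: https://arxiv.org/abs/2504.19500
Authors: Yan Wang,Baoxiong Jia,Ziyu Zhu,Siyuan Huang
Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: CVPR 2025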

Abstract:Open-vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real-world environments. This paper introduces MPEC, a novel Masked Point-Entity Contrastive learning method for open-vocabulary 3D semantic segmentation that leverages both 3D entity-language alignment and point-entity consistency across different point cloud views to foster entity-specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state-of-the-art results on ScanNet for open-vocabulary 3D semantic segmentation and demonstrating superior zero-shot scene understanding capabilities. Extensive fine-tuning experiments on 8 datasets, spanning from low-level perception to high-level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: this https URL

[NLP-27] Improving Reasoning Performance in Large Language Models via Representation Engineering ICLR2025

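【Quick Read】: This paper asks whether the reasoning performance of large language models (LLMs) can be effectively modulated, and whether reasoning in LLMs should be understood as inherently different. The key to its solution is a representation engineering approach: model activations are read from the residual stream while the LLM processes a reasoning task and are used to derive a control vector, which is applied as an inference-time intervention that modulates the model's representational space to improve performance on the specified task. Without any additional training, the method improves performance on inductive, deductive, and mathematical reasoning tasks, and the effect of control vectors on the final output is assessed with metrics such as KL divergence and entropy.

Link: https://arxiv.org/abs/2504.19483
Authors: Bertram Højer,Oliver Jarvis,Stefan Heinrich
Affiliations: IT University of Copenhagen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Has been accepted at “The Thirteenth International Conference on Learning Representations (ICLR 2025)” Link to publication: this https URL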

Abstract:Recent advancements in large language models (LLMs) have resulted in increasingly anthropomorphic language concerning the ability of LLMs to reason. Whether reasoning in LLMs should be understood to be inherently different is, however, widely debated. We propose utilizing a representation engineering approach wherein model activations are read from the residual stream of an LLM when processing a reasoning task. The activations are used to derive a control vector that is applied to the model as an inference-time intervention, modulating the representational space of the model, to improve performance on the specified task. We publish the code for deriving control vectors and analyzing model representations. The method allows us to improve performance on reasoning benchmarks and assess how control vectors influence the final logit distribution of a model via metrics such as KL divergence and entropy. We apply control vectors to Mistral-7B-Instruct and a range of Pythia models on an inductive, a deductive and mathematical reasoning task. We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations. The intervention is dependent upon the ability to reliably extract the model’s typical state when correctly solving a task. Our results suggest that reasoning performance can be modulated in the same manner as other information-processing tasks performed by LLMs and demonstrate that we are capable of improving performance on specific tasks via a simple intervention on the residual stream with no additional training.
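
The intervention described above can be sketched with plain arrays standing in for residual-stream activations. The difference-of-means recipe for the control vector, the scaling factor `alpha`, and the use of a toy hidden state as logits are all assumptions for illustration; the authors' published code is the reference for how the vectors are actually derived.

```python
import numpy as np

def derive_control_vector(acts_target, acts_baseline):
    """Difference of mean activations between two sets of runs (assumed recipe)."""
    return np.mean(acts_target, axis=0) - np.mean(acts_baseline, axis=0)

def apply_control(hidden, control_vector, alpha=1.0):
    """Inference-time intervention: shift a hidden state along the control vector."""
    return np.asarray(hidden, dtype=float) + alpha * control_vector

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

# Toy activations: rows are examples, columns are hidden dimensions.
acts_reasoning = np.array([[1.0, 2.0], [3.0, 4.0]])
acts_baseline = np.array([[0.0, 0.0], [2.0, 2.0]])
control = derive_control_vector(acts_reasoning, acts_baseline)
steered = apply_control([0.5, 0.5], control, alpha=0.5)

# Treat the toy hidden state as logits to compare the two output distributions.
p_before = softmax([0.5, 0.5])
p_after = softmax(steered)
print(control, steered, kl_divergence(p_after, p_before), entropy(p_after))
```

KL divergence measures how far the steered distribution has moved from the original, while entropy shows whether the intervention made the output more or less peaked.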

[NLP-28] Conflicts in Texts: Data Implications and Challenges

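【Quick Read】: This paper addresses the reliability and trustworthiness problems that arise in real-world applications when natural language processing (NLP) models rely on and generate conflicting information. Its key contribution is to unify conflicting information under three core areas (natural texts on the web, human-annotated data, and model interactions), analyze their implications, and discuss mitigation strategies, with the aim of advancing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.

Link: https://arxiv.org/abs/2504.19472
Authors: Siyi Liu,Dan Roth
Affiliations: University of Pennsylvania
Subjects: Computation and Language (cs.CL)
Comments: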

Abstract:As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models’ reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.

[NLP-29] BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

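【Quick Read】: This paper addresses the limited evaluation of large language models (LLMs) in clinical settings, in particular the failure of existing benchmarks to reflect the complexity of real-world electronic health record (EHR) data. The key to its solution is BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks drawn from real-world clinical data sources across nine languages, on which 52 state-of-the-art LLMs are systematically evaluated under various inference strategies, revealing how model size, language, NLP task, and clinical specialty affect performance.

Link: https://arxiv.org/abs/2504.19467
Authors: Jiageng Wu,Bowen Gu,Ren Zhou,Kevin Xie,Doug Snyder,Yixing Jiang,Valentina Carducci,Richard Wyss,Rishi J Desai,Emily Alsentzer,Leo Anthony Celi,Adam Rodman,Sebastian Schneeweiss,Jonathan H. Chen,Santiago Romero-Brufau,Kueiyu Joshua Lin,Jie Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: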

Abstract:Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.

[NLP-30] Mitigating Modality Bias in Multi-modal Entity Alignment from a Causal Perspective SIGIR2025

【Quick Read】: This paper addresses the bias that the visual modality can introduce in multi-modal entity alignment (MMEA): existing methods rely too heavily on visual features, so performance degrades when image similarity is low. The key to its solution is CDMEA, a counterfactual debiasing framework that analyzes visual modality bias from a causal perspective. By estimating the Total Effect (TE) of both modalities and excluding the Natural Direct Effect (NDE) of the visual modality, it ensures that the model predicts based on the Total Indirect Effect (TIE), effectively combining the visual and graph modalities while reducing visual modality bias.

Link: https://arxiv.org/abs/2504.19458
Authors: Taoyu Su,Jiawei Sheng,Duohe Ma,Xiaodong Li,Juwei Yue,Mengxiao Song,Yingkai Tang,Tingwen Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025, 11 pages, 10 figures, 4 tables

Abstract:Multi-Modal Entity Alignment (MMEA) aims to retrieve equivalent entities from different Multi-Modal Knowledge Graphs (MMKGs), a critical information retrieval task. Existing studies have explored various fusion paradigms and consistency constraints to improve the alignment of equivalent entities, while overlooking that the visual modality may not always contribute positively. Empirically, entities with low-similarity images usually generate unsatisfactory performance, highlighting the limitation of overly relying on visual features. We believe the model can be biased toward the visual modality, leading to a shortcut image-matching task. To address this, we propose a counterfactual debiasing framework for MMEA, termed CDMEA, which investigates visual modality bias from a causal perspective. Our approach aims to leverage both visual and graph modalities to enhance MMEA while suppressing the direct causal effect of the visual modality on model predictions. By estimating the Total Effect (TE) of both modalities and excluding the Natural Direct Effect (NDE) of the visual modality, we ensure that the model predicts based on the Total Indirect Effect (TIE), effectively utilizing both modalities and reducing visual modality bias. Extensive experiments on 9 benchmark datasets show that CDMEA outperforms 14 state-of-the-art methods, especially in low-similarity, high-noise, and low-resource data scenarios.
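
The relationship between the three effects can be written as TIE = TE − NDE, following the standard definitions from causal mediation analysis. The sketch below applies it to toy alignment scores for three candidate entities; the score arrays, their names, and the way the counterfactual "visual only" score is obtained are assumptions, since the abstract does not describe how CDMEA computes them.

```python
import numpy as np

def total_indirect_effect(score_fused, score_visual_only, score_reference):
    """TIE = TE - NDE, with each effect measured against a reference score."""
    score_fused = np.asarray(score_fused, dtype=float)
    score_visual_only = np.asarray(score_visual_only, dtype=float)
    score_reference = np.asarray(score_reference, dtype=float)
    total_effect = score_fused - score_reference          # TE
    natural_direct = score_visual_only - score_reference  # NDE of the visual modality
    return total_effect - natural_direct                  # TIE

# Toy alignment scores for three candidate entities (made-up values).
fused = np.array([0.9, 0.6, 0.2])        # both modalities observed
visual_only = np.array([0.8, 0.1, 0.1])  # counterfactual: visual shortcut alone
reference = np.array([0.3, 0.3, 0.3])    # counterfactual: no modality information

tie = total_indirect_effect(fused, visual_only, reference)
print(int(np.argmax(fused)), int(np.argmax(tie)))  # prints "0 1"
```

In this toy case candidate 0 wins on the fused score only because of the visual shortcut; once the direct visual effect is subtracted, candidate 1 is ranked first.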

[NLP-31] Towards Long Context Hallucination Detection

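【Quick Read】: This paper addresses contextual hallucination in large language models (LLMs) given long-context inputs, that is, generated information that contradicts or is unsupported by the given context. The key to its solution is a dataset built specifically for long-context hallucination detection, together with a novel architecture that lets pre-trained encoder models such as BERT process long contexts through a decomposition and aggregation mechanism and thereby detect contextual hallucinations effectively.

Link: https://arxiv.org/abs/2504.19457
Authors: Siyi Liu,Kishaloy Halder,Zheng Qi,Wei Xiao,Nikolaos Pappas,Phu Mon Htut,Neha Anna John,Yassine Benajiba,Dan Roth
Affiliations: AWS AI Labs; University of Pennsylvania
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: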

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take an initial step toward solving this problem by constructing a dataset specifically designed for long-context hallucination detection. Furthermore, we propose a novel architecture that enables pre-trained encoder models, such as BERT, to process long contexts and effectively detect contextual hallucinations through a decomposition and aggregation mechanism. Our experimental results show that the proposed architecture significantly outperforms previous models of similar size as well as LLM-based models across various metrics, while providing substantially faster inference.
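
One way to picture a decomposition and aggregation mechanism is to split the long context into chunks that fit a short-input encoder, score each chunk against the response, and combine the per-chunk scores. The sketch below does this with a toy token-overlap scorer standing in for an encoder such as BERT; the chunk size, the overlap scorer, and the max aggregation are all assumptions, and the paper's architecture may aggregate learned representations rather than scores.

```python
def split_into_chunks(tokens, chunk_size):
    """Decomposition: cut a token list into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def support_score(chunk, response_tokens):
    """Toy stand-in for an encoder: fraction of response tokens found in the chunk."""
    if not response_tokens:
        return 0.0
    chunk_set = set(chunk)
    return sum(token in chunk_set for token in response_tokens) / len(response_tokens)

def hallucination_score(context, response, chunk_size=8):
    """Aggregation: 1 minus the best per-chunk support (higher = less supported)."""
    context_tokens = context.lower().split()
    response_tokens = response.lower().split()
    chunks = split_into_chunks(context_tokens, chunk_size)
    best = max((support_score(c, response_tokens) for c in chunks), default=0.0)
    return 1.0 - best

context = ("the meeting was moved to friday afternoon because "
           "the room was booked on thursday morning by another team")
print(hallucination_score(context, "The meeting was moved to Friday"))
print(hallucination_score(context, "budget doubled"))
```

Taking the maximum treats a response as supported if any single chunk supports it, which keeps each encoder call short at the cost of missing evidence that is spread across chunks.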

[NLP-32] Systematic Bias in Large Language Models: Discrepant Response Patterns in Binary vs. Continuous Judgment Tasks

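【Quick Read】: This paper addresses the systematic bias that response format (binary versus continuous) can introduce when large language models (LLMs) are used for judgment tasks. The study finds that LLMs are more likely to give “negative” judgments in binary formats, and that this bias appears across tasks and models. The key takeaway is to recognize the effect of response format on model judgments and to choose the format carefully when designing tasks, in order to reduce systematic bias.

Link: https://arxiv.org/abs/2504.19445
Authors: Yi-Long Lu,Chunhui Zhang,Wei Wang
Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI
Subjects: Computation and Language (cs.CL)
Comments: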

Abstract:Large Language Models (LLMs) are increasingly used in tasks such as psychological text analysis and decision-making in automated workflows. However, their reliability remains a concern due to potential biases inherited from their training process. In this study, we examine how different response format: binary versus continuous, may systematically influence LLMs’ judgments. In a value statement judgments task and a text sentiment analysis task, we prompted LLMs to simulate human responses and tested both formats across several models, including both open-source and commercial models. Our findings revealed a consistent negative bias: LLMs were more likely to deliver “negative” judgments in binary formats compared to continuous ones. Control experiments further revealed that this pattern holds across both tasks. Our results highlight the importance of considering response format when applying LLMs to decision tasks, as small changes in task design can introduce systematic biases.

[NLP-33] Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

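【Quick Read】: This paper addresses the problem that pre-trained code models depend on high-quality human-written reference comments, which tend to become outdated as software evolves and thus degrade model performance. The key to its solution is to replace human-written comments with high-quality comments generated by large language models (LLMs), to verify through reference-free evaluation tasks that the generated comments are more semantically consistent with the code, and then to rebuild the pre-training dataset to improve performance on code intelligence tasks.

Link: https://arxiv.org/abs/2504.19444
Authors: Kang Yang,Xinjun Mao,Shangwen Wang,Yanlin Wang,Tanghaoran Zhang,Bo Lin,Yihao Qin,Zhang Zhang,Yao Lu,Kamal Al-Sabahi
Affiliations: National University of Defense Technology; Sun Yat-sen University; College of Banking and Financial Studies
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: Awarded the ACM SIGSOFT Distinguished Paper Award in ICPC 2025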

Abstract:Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.

[NLP-34] Context-Guided Dynamic Retrieval for Improving Generation Quality in RAG Models

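Quick read: This paper aims to address the limitations of static Retrieval-Augmented Generation (RAG) architectures in context adaptation and knowledge access. The key to the solution is a state-aware dynamic knowledge retrieval mechanism that combines a multi-level perceptive retrieval vector construction strategy with a differentiable document matching path, enabling end-to-end joint training and collaborative optimization of the retrieval and generation modules, which in turn improves semantic understanding and knowledge scheduling efficiency in large language models for open-domain question answering and complex generation tasks.

Link: https://arxiv.org/abs/2504.19436
Authors: Jacky He,Guiran Liu,Binrong Zhu,Hanlu Zhang,Hongye Zheng,Xiaokai Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: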

Abstract:This paper focuses on the dynamic optimization of the Retrieval-Augmented Generation (RAG) architecture. It proposes a state-aware dynamic knowledge retrieval mechanism to enhance semantic understanding and knowledge scheduling efficiency in large language models for open-domain question answering and complex generation tasks. The method introduces a multi-level perceptive retrieval vector construction strategy and a differentiable document matching path. These components enable end-to-end joint training and collaborative optimization of the retrieval and generation modules. This effectively addresses the limitations of static RAG structures in context adaptation and knowledge access. Experiments are conducted on the Natural Questions dataset. The proposed structure is thoroughly evaluated across different large models, including GPT-4, GPT-4o, and DeepSeek. Comparative and ablation experiments from multiple perspectives confirm the significant improvements in BLEU and ROUGE-L scores. The approach also demonstrates stronger robustness and generation consistency in tasks involving semantic ambiguity and multi-document fusion. These results highlight its broad application potential and practical value in building high-quality language generation systems.

[NLP-35] Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

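Quick read: This paper aims to address the difficulty Large Language Models (LLMs) have in staying consistent over prolonged multi-session dialogues; the core challenge is that a fixed context window limits how much of a long conversation the model can keep in view. The key to the solution is the Mem0 architecture, a scalable memory-centric design that dynamically extracts, consolidates, and retrieves salient information from conversations. An enhanced variant adds graph-based memory representations to capture complex relational structures among conversational elements, improving coherence and accuracy over long dialogues.

Link: https://arxiv.org/abs/2504.19413
Authors: Prateek Chhikara,Dev Khant,Saket Aryan,Taranjeet Singh,Deshraj Yadav
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: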

Abstract:Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves 26% relative improvements in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves around 2% higher overall score than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.

[NLP-36] Context Selection and Rewriting for Video-based Educational Question Generation

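Quick read: This paper addresses the weak performance of Educational Question Generation (EQG) systems on real classroom content, noting that existing datasets rely on predefined, edited texts and do not reflect real classrooms, where content consists of lecture speech with accompanying slides. The key to the solution is an LLM-based framework that dynamically selects and rewrites context relevant to a target timestamp and answer, improving the quality and relevance of generated questions. The framework first selects context from lecture transcripts and video keyframes based on answer relevance and temporal proximity, then integrates the multimodal context and rewrites it into knowledge statements that contain the target answer, strengthening the logical connection between context and answer.

Link: https://arxiv.org/abs/2504.19406
Authors: Mengxia Yu,Bang Nguyen,Olivia Zino,Meng Jiang
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL)
Comments: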

Abstract:Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address the challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released in this https URL.
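
The first stage described in this abstract, selecting contexts by answer relevance and temporal proximity, can be illustrated with a small scoring sketch. Everything here is an assumption made for illustration: the paper uses large language models for selection, whereas this sketch substitutes simple token overlap for relevance, and the function name `select_contexts`, the weight `alpha`, and the decay constant `scale` are hypothetical.

```python
def select_contexts(segments, answer, target_time, k=2, alpha=0.5, scale=60.0):
    """Rank transcript segments by a mix of answer relevance and temporal proximity.

    segments: list of (start_seconds, text) pairs.
    Relevance is lexical overlap with the answer (a stand-in for an LLM judgment).
    Proximity decays with distance from the target timestamp.
    """
    answer_tokens = set(answer.lower().split())

    def score(segment):
        start, text = segment
        overlap = len(answer_tokens & set(text.lower().split())) / max(len(answer_tokens), 1)
        proximity = 1.0 / (1.0 + abs(start - target_time) / scale)
        return alpha * overlap + (1 - alpha) * proximity

    # Highest-scoring segments first; keep the top k.
    return sorted(segments, key=score, reverse=True)[:k]


segments = [
    (0, "welcome to the course"),
    (300, "gradient descent updates the weights"),
    (310, "any questions so far"),
    (900, "gradient descent recap"),
]
top = select_contexts(segments, answer="gradient descent", target_time=305, k=2)
```

With equal weights, a segment that is both near the timestamp and mentions the answer ranks first, and a distant segment that mentions the answer still outranks a nearby one that does not.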

[NLP-37] ICL CIPHERS: Quantifying "Learning" in In-Context Learning via Substitution Ciphers

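Quick read: This paper asks how to separate and quantify task retrieval and task learning in In-Context Learning (ICL), and in particular whether large language models (LLMs) can "learn" on ICL CIPHERS tasks with a bijective mapping. The key to the solution is a task reformulation based on substitution ciphers: a subset of tokens in the in-context inputs is replaced with irrelevant tokens, which makes the English sentences hard to read while preserving a reversible latent pattern, so the task remains well defined in an abstract sense. Experiments show that LLMs solve ICL CIPHERS with bijective mappings better than the non-bijective baseline, indicating some ability to decode the latent cipher.

Link: https://arxiv.org/abs/2504.19395
Authors: Zhouxiang Fang,Aayush Mishra,Muhan Gao,Anqi Liu,Daniel Khashabi
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: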

Abstract:Recent works have suggested that In-Context Learning (ICL) operates in dual modes, i.e. task retrieval (remember learned patterns from pre-training) and task learning (inference-time “learning” from demonstrations). However, disentangling these two modes remains a challenging goal. We introduce ICL CIPHERS, a class of task reformulations based on substitution ciphers borrowed from classic cryptography. In this approach, a subset of tokens in the in-context inputs are substituted with other (irrelevant) tokens, rendering English sentences less comprehensible to the human eye. However, by design, there is a latent, fixed pattern to this substitution, making it reversible. This bijective (reversible) cipher ensures that the task remains a well-defined task in some abstract sense, despite the transformations. It is a curious question if LLMs can solve ICL CIPHERS with a BIJECTIVE mapping, which requires deciphering the latent cipher. We show that LLMs are better at solving ICL CIPHERS with BIJECTIVE mappings than the NON-BIJECTIVE (irreversible) baseline, providing a novel approach to quantify “learning” in ICL. While this gap is small, it is consistent across the board on four datasets and six models. Finally, we examine LLMs’ internal representations and identify evidence in their ability to decode the ciphered inputs.
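
The difference between a bijective cipher and a non-bijective baseline can be made concrete with a toy word-level sketch. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the cipher is built by permuting a randomly chosen subset of a tiny vocabulary, and the baseline is assumed to replace each occurrence of a ciphered token with a fresh random token so that no fixed pattern exists.

```python
import random


def make_bijective_cipher(vocab, fraction, seed=0):
    """Pick a subset of the vocabulary and map it onto a permutation of itself."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(vocab), int(len(vocab) * fraction))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    return dict(zip(chosen, shuffled))


def apply_cipher(tokens, cipher):
    """Fixed substitution: the same input token always maps to the same output token."""
    return [cipher.get(t, t) for t in tokens]


def apply_non_bijective(tokens, cipher, vocab, seed=0):
    """Baseline (assumed): every occurrence of a ciphered token gets a fresh random token."""
    rng = random.Random(seed)
    pool = sorted(vocab)
    return [rng.choice(pool) if t in cipher else t for t in tokens]


def invert(cipher):
    """A bijective cipher can be undone by swapping keys and values."""
    return {v: k for k, v in cipher.items()}


vocab = {"the", "movie", "was", "great", "bad", "film"}
cipher = make_bijective_cipher(vocab, fraction=0.5, seed=1)
tokens = ["the", "movie", "was", "great"]
encoded = apply_cipher(tokens, cipher)
decoded = apply_cipher(encoded, invert(cipher))
```

Because the substitution permutes a subset of the vocabulary, applying the inverted mapping recovers the original tokens exactly; the baseline has no such inverse.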

[NLP-38] Explanatory Summarization with Discourse-Driven Planning ACL

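Quick read: This paper addresses the fact that current automatic summarization methods do not explicitly model explanatory content when producing lay summaries of scientific documents, which makes it hard to align the proportion of explanatory content with human-written summaries. The key to the solution is a plan-based framework that uses discourse structure to organize summary generation and guides explanatory sentences by prompting responses to the plan. It includes two discourse-driven planning strategies, which condition on the plan either as part of the input or as part of the output prefix.

Link: https://arxiv.org/abs/2504.19339
Authors: Dongqi Liu,Xi Yu,Vera Demberg,Mirella Lapata
Affiliations: Saarland University; University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by the Transactions of the Association for Computational Linguistics (TACL)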

Abstract:Lay summaries for scientific documents typically include explanations to help readers grasp sophisticated concepts or arguments. However, current automatic summarization methods do not explicitly model explanations, which makes it difficult to align the proportion of explanatory content with human-written summaries. In this paper, we present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences by prompting responses to the plan. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix, respectively. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality, and that it improves model robustness and controllability while mitigating hallucination.

[NLP-39] Unified Multi-Task Learning Model Fusion for Efficient Language Model Guardrailing

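Quick read: This paper addresses the high latency, memory consumption, hosting cost, and unstructured outputs that come with using Large Language Models (LLMs) as guardrails against undesired behavior, and aims to build guardrail classifiers that are both more efficient and more accurate. The solution has three parts: fine-tuning classifiers on task-specific generated data, which lets them significantly outperform the current state of the art (SoTA) while remaining much smaller; using a single multi-task model, MultiTaskGuard, pretrained on a large synthetic dataset to improve generalization; and building the best-performing models, UniGuard, with a search-based model merging method that finds a set of parameters for combining single-policy and multi-policy guardrail models. Together these yield substantial F1 gains on several public datasets and the authors' own guardrail benchmarks.

Link: https://arxiv.org/abs/2504.19333
Authors: James O'Neill,Santhosh Subramanian,Eric Lin,Vaikkunth Mugunthan
Affiliations: DynamoAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: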

Abstract:The trend towards large language models (LLMs) for guardrailing against undesired behaviors is increasing and has shown promise for censoring user inputs. However, increased latency, memory consumption, hosting expenses and non-structured outputs can make their use prohibitive. In this work, we show that task-specific data generation can lead to fine-tuned classifiers that significantly outperform current state of the art (SoTA) while being orders of magnitude smaller. Secondly, we show that using a single model, MultiTaskGuard, that is pretrained on a large synthetically generated dataset with unique task instructions further improves generalization. Thirdly, our most performant models, UniGuard, are found using our proposed search-based model merging approach that finds an optimal set of parameters to combine single-policy models and multi-policy guardrail models. On 7 public datasets and 4 guardrail benchmarks we created, our efficient guardrail classifiers improve over the best performing SoTA publicly available LLMs and 3rd party guardrail APIs in detecting unsafe and safe behaviors by an average F1 score improvement of 29.92 points over Aegis-LlamaGuard and 21.62 over gpt-4o, respectively. Lastly, our guardrail synthetic data generation process that uses custom task-specific guardrail poli...

[NLP-40] BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

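Quick read: This paper addresses the fact that existing benchmarks such as BrowseComp focus on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems, Chinese in particular. The key to the solution is BrowseComp-ZH, a high-difficulty benchmark built specifically to evaluate LLM agents on the Chinese web. It contains 289 multi-hop questions spanning 11 domains and uses a two-stage quality control protocol to keep question difficulty high and answers unique, giving a thorough test of retrieval and reasoning ability.

Link: https://arxiv.org/abs/2504.19314
Authors: Peilin Zhou,Bruce Leon,Xiang Ying,Can Zhang,Yifan Shao,Qichen Ye,Dading Chong,Zhiling Jin,Chenxuan Xie,Meng Cao,Yuxin Gu,Sixin Hong,Jing Ren,Jian Chen,Chao Liu,Yining Hua
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Peking University; Mindverse AI; Alibaba Group; Zhejiang University; Zhejiang University of Technology; MBZUAI; NIO; HSBC; Harvard T.H. Chan School of Public Health
Categories: Computation and Language (cs.CL)
Comments: Under Review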

Abstract:As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems – most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI’s DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation – capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at this https URL.

[NLP-41] AndroidGen: Building an Android Language Agent under Data Scarcity

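Quick read: This paper addresses the obstacles to widely deploying Large Language Model (LLM)-based agents on mobile devices, data scarcity in particular. Existing LLMs show low task completion rates and depend on high-quality data sources, while human annotation is time-consuming and labor-intensive. The key to the solution is a framework called AndroidGen, which automatically collects trajectories for human-given tasks and uses them to train open-source LLMs, producing an open-source mobile agent without manually labeled trajectories.

Link: https://arxiv.org/abs/2504.19298
Authors: Hanyu Lai,Junjie Gao,Xiao Liu,Yifan Xu,Shudan Zhang,Yuxiao Dong,Jie Tang
Affiliations: Tsinghua University; Zhipu AI
Categories: Computation and Language (cs.CL)
Comments: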

Abstract:Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at this https URL.

[NLP-42] Anyprefer: An Agentic Framework for Preference Data Synthesis

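Quick read: This paper aims to address the scarcity of high-quality preference data needed to align foundation models with human values. Manual annotation is costly and slow, and self-rewarding methods can amplify bias because the reward model shares weights with the target model. The key to the solution is the Anyprefer framework, which models data synthesis as a cooperative two-player Markov game, introduces external tools to help the judge model reward the target model's responses accurately, and adds a feedback mechanism that optimizes the prompts of both models, improving data quality and alignment performance.

Link: https://arxiv.org/abs/2504.19276
Authors: Yiyang Zhou,Zhaoyang Wang,Tianle Wang,Shangyu Xing,Peng Xia,Bo Li,Kaiyuan Zheng,Zijian Zhang,Zhaorun Chen,Wenhao Zheng,Xuchao Zhang,Chetan Bansal,Weitong Zhang,Ying Wei,Mohit Bansal,Huaxiu Yao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: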

Abstract:High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods often adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies since the reward model shares weights with the target model, thereby amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and the judge model collaborate together. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model’s responses, mitigating biases in the rewarding process. In addition, a feedback mechanism is introduced to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment performance across four main applications, covering 21 datasets, achieving average improvements of 18.55% in five natural language generation datasets, 3.66% in nine vision-language understanding datasets, 30.05% in three medical image analysis datasets, and 16.00% in four visuo-motor control tasks.

[NLP-43] VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

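Quick read: This paper addresses visual storytelling, the task of generating a coherent narrative that fits the visual context from a sequence of images. The key to the solution is to use multimodal models, specifically transformer-based architectures and large multimodal models, to improve narrative quality. The paper also adopts the reference-free metrics RoViST and GROOVIST, which measure visual grounding, coherence, and non-redundancy, to overcome the limitations of traditional metrics such as BLEU, METEOR, ROUGE, and CIDEr on this task.

Link: https://arxiv.org/abs/2504.19267
Authors: Mohamed Gado,Towhid Taliee,Muhammad Memon,Dmitry Ignatov,Radu Timofte
Affiliations: University of Würzburg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: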

Abstract:Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.

[NLP-44] Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers ALT

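Quick read: This paper addresses the persistent problem of hallucination in Large Language Models (LLMs). As LLMs see wider use in high-stakes domains such as healthcare and finance, effective hallucination detection becomes essential. The paper proposes a framework for zero-resource hallucination detection. The key is to adapt several Uncertainty Quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, into standardized response-level confidence scores, and to add a tunable ensemble that combines any of these scores so that detection can be optimized for a specific use case.

Link: https://arxiv.org/abs/2504.19254
Authors: Dylan Bouchard,Mohit Singh Chauhan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: UQLM repository: this https URL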

Abstract:Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper’s companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
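
The idea of a tunable ensemble over response-level confidence scores can be sketched in a few lines. This is a generic illustration, not the UQLM API: the function names, the weighted-average combination rule, the coarse weight grid, and the 0.5 decision threshold are all assumptions chosen to keep the example small.

```python
from itertools import product


def ensemble_score(scores, weights):
    """Weighted average of component confidence scores, each assumed to lie in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def tune_weights(score_rows, labels, grid=(0.0, 0.5, 1.0), threshold=0.5):
    """Grid-search the weights that maximize accuracy on graded examples.

    score_rows: one tuple of component scores per response.
    labels: True if the response was graded correct, else False.
    """
    best_weights, best_accuracy = None, -1.0
    n_scorers = len(score_rows[0])
    for weights in product(grid, repeat=n_scorers):
        if sum(weights) == 0:
            continue  # an all-zero weighting is undefined
        predictions = [ensemble_score(row, weights) >= threshold for row in score_rows]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights, best_accuracy


# Toy data: the first scorer tracks correctness, the second is noise.
rows = [(0.9, 0.1), (0.8, 0.9), (0.2, 0.9), (0.1, 0.2)]
labels = [True, True, False, False]
weights, accuracy = tune_weights(rows, labels)
```

On this toy data the search settles on a weighting that ignores the noisy scorer, which is the behavior a tuned ensemble is meant to provide for a specific use case.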

[NLP-45] Dynamic Embedded Topic Models: properties and recommendations based on diverse corpora

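Quick read: This paper examines how implementation choices affect the Dynamic Embedded Topic Model across different diachronic corpora, with the goal of isolating the decisions that matter for its use and further development. The key contribution is a set of priorities that maximize utility in applied scholarship: practical scalability of vocabulary size, to make full use of embedded representations, and more flexible modeling of time intervals, to accommodate the uneven temporal distribution of historical writing.

Link: https://arxiv.org/abs/2504.19209
Authors: Elisabeth Fittschen,Bella Xia,Leib Celnik,Paul Dilley,Tom Lippincott
Affiliations: Johns Hopkins University; University of Hamburg; University of Iowa
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under review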

Abstract:We measure the effects of several implementation choices for the Dynamic Embedded Topic Model, as applied to five distinct diachronic corpora, with the goal of isolating important decisions for its use and further development. We identify priorities that will maximize utility in applied scholarship, including the practical scalability of vocabulary size to best exploit the strengths of embedded representations, and more flexible modeling of intervals to accommodate the uneven temporal distributions of historical writing. Of similar importance, we find performance is not significantly or consistently affected by several aspects that otherwise limit the model’s application or might consume the resources of a grid search.

[NLP-46] WuNeng: Hybrid State with Attention

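Quick read: This paper aims to address the difficulty of balancing expressivity and computational efficiency in large language models, in particular how to improve contextual coherence and sequence modeling without a significant increase in parameters. The key to the solution is to combine the recurrent neural network (RNN)-based RWKV-7 with attention mechanisms: additional state-driven heads and a cross-head interaction technique enrich the model's representational capacity, and a multi-token state processing mechanism captures sequence-wide dependencies, raising expressivity while keeping the model efficient.

Link: https://arxiv.org/abs/2504.19191
Authors: Liu Xiao,Li Zhiyuan,Lin Yueyu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: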

Abstract:The WuNeng architecture introduces a novel approach to enhancing the expressivity and power of large language models by integrating recurrent neural network (RNN)-based RWKV-7 with advanced attention mechanisms, prioritizing heightened contextual coherence over reducing KV cache size. Building upon the hybrid-head concept from Hymba, WuNeng augments standard multi-head attention with additional RWKV-7 state-driven heads, rather than replacing existing heads, to enrich the model’s representational capacity. A cross-head interaction technique fosters dynamic synergy among standard, state-driven, and newly introduced middle heads, leveraging concatenation, additive modulation, and gated fusion for robust information integration. Furthermore, a multi-token state processing mechanism harnesses the continuous RWKV-7 state to capture intricate, sequence-wide dependencies, significantly boosting expressivity. Remarkably, these enhancements are achieved with minimal additional parameters, ensuring efficiency while empowering the model to excel in complex reasoning and sequence generation tasks. WuNeng sets a new standard for balancing expressivity and computational efficiency in modern neural architectures.

[NLP-47] Hierarchical Attention Generates Better Proofs

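Quick read: This paper addresses the failure of Large Language Models (LLMs) to capture the inherent hierarchical structure of mathematical proofs in formal theorem proving, a consequence of token-level processing. The key to the solution is Hierarchical Attention, a regularization method that aligns the attention mechanisms of LLMs with the structure of mathematical reasoning by establishing a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow during proof generation.

Link: https://arxiv.org/abs/2504.19188
Authors: Jianlong Chen,Chao Li,Yang Yuan,Andrew C Yao
Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Qi Zhi Institute; IIIS, Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: 15 pages with 3 figures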

Abstract:Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce Hierarchical Attention, a regularization method that aligns LLMs’ attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while reducing proof complexity by 23.81% and 16.50% respectively. The code is available at this https URL.

[NLP-48] SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

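Quick read: This paper addresses the evaluation of step-by-step reliability in Large Language Model (LLM) reasoning, which is hard because high-quality step-level supervision is difficult and costly to obtain. The key to the solution is Self-Play Critic (SPC), in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games, removing the need for manual step-level annotation. SPC fine-tunes two copies of a base model to play a "sneaky generator" and a "critic"; the two improve each other through the adversarial game, so error detection ability keeps increasing.

Link: https://arxiv.org/abs/2504.19162
Authors: Jiaqi Chen,Bang Zhang,Ruotian Ma,Peisong Wang,Xiaodan Liang,Zhaopeng Tu,Xiaolong Li,Kwan-Yee K. Wong
Affiliations: The University of Hong Kong; Tencent; Tsinghua University; MBZUAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project: this https URL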

Abstract:Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a “sneaky generator” that deliberately produces erroneous steps designed to be difficult to detect, and a “critic” that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator’s errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including a distilled R1 model. Furthermore, applying SPC to guide the test-time search of diverse LLMs significantly improves their mathematical reasoning performance on MATH500 and AIME2024, outperforming state-of-the-art process reward models.
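
The win/lose reward rule stated in this abstract can be written down as a tiny sketch. It is a simplification under stated assumptions: the generator's step is taken to be genuinely erroneous (already validated), the reward magnitude of 1.0 is arbitrary, and the function name `game_rewards` is hypothetical; the reinforcement learning update itself is not shown.

```python
def game_rewards(critic_flags_error: bool):
    """Return (generator_reward, critic_reward) for one confrontation.

    Assumes the sneaky generator's step really contains an error, so the
    critic wins exactly when it flags the step as erroneous.
    """
    if critic_flags_error:
        return -1.0, 1.0  # critic caught the error: critic wins
    return 1.0, -1.0      # critic was fooled: generator wins


# Rewards collected over a few simulated confrontations.
outcomes = [True, False, True]
rewards = [game_rewards(flagged) for flagged in outcomes]
critic_total = sum(critic for _, critic in rewards)
```

Each confrontation is zero-sum in this sketch, which mirrors the abstract's description of a positive reward for the winner and a negative reward for the loser.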

[NLP-49] APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

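Quick read: This paper addresses the automation of proof engineering tasks in formal mathematics libraries, and in particular the inability of existing benchmarks to reflect the iterative, engineering-intensive workflows of real-world projects. The key to the solution is the paradigm of Automated Proof Engineering (APE), which uses large language models (LLMs) to carry out tasks such as feature addition, proof refactoring, and bug fixing, together with the APE-Bench I benchmark, which is built from the real commit history of Mathlib4, contains diverse file-level tasks, and is verified with a hybrid of the Lean compiler and LLM-as-a-Judge. The authors also develop Eleanstic, a scalable parallel verification infrastructure for checking proofs across multiple versions of Mathlib.

Link: https://arxiv.org/abs/2504.19110
Authors: Huajian Xin,Luming Li,Xiaoran Jin,Jacques Fleuriot,Wenda Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: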

Abstract:Recent progress in large language models (LLMs) has shown promise in formal theorem proving, yet existing benchmarks remain limited to isolated, static proof tasks, failing to capture the iterative, engineering-intensive workflows of real-world formal mathematics libraries. Motivated by analogous advances in software engineering, we introduce the paradigm of Automated Proof Engineering (APE), which aims to automate proof engineering tasks such as feature addition, proof refactoring, and bug fixing using LLMs. To facilitate research in this direction, we present APE-Bench I, the first realistic benchmark built from real-world commit histories of Mathlib4, featuring diverse file-level tasks described in natural language and verified via a hybrid approach combining the Lean compiler and LLM-as-a-Judge. We further develop Eleanstic, a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib. Empirical results on state-of-the-art LLMs demonstrate strong performance on localized edits but substantial degradation on handling complex proof engineering. This work lays the foundation for developing agentic workflows in proof engineering, with future benchmarks targeting multi-file coordination, project-scale verification, and autonomous agents capable of planning, editing, and repairing formal libraries.

[NLP-50] Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

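[Quick Read]: This paper addresses the tension between data scarcity and data privacy in private Retrieval-Augmented Generation (RAG) systems. Such systems are hard to deploy because domain-specific data is scarce and privacy requirements are strict, so data security has to be balanced against data availability. The key to the solution is to apply Federated Learning (FL) in a framework called FedE4RAG: client-side RAG retrieval models are trained collaboratively, with parameters aggregated and redistributed by a central server, so that raw data is never shared directly. The framework also uses knowledge distillation to improve the generalization of local RAG retrievers and homomorphic encryption to protect model parameters against data leakage.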

Link: https://arxiv.org/abs/2504.19101
Authors: Qianren Mao, Qili Zhang, Hanwen Hao, Zhentao Han, Runhua Xu, Weifeng Jiang, Qi Hu, Zhijun Chen, Tyler Zhou, Bo Li, Yangqiu Song, Jin Dong, Jianxin Li, Philip S. Yu
Affiliations: Zhongguancun Laboratory; Beihang University; Nanyang Technological University; Hong Kong University of Science and Technology; Beijing Academy of Blockchain and Edge Computing; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing; University of Illinois at Chicago
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution for enhancing the accuracy and credibility of Large Language Models (LLMs), particularly in question answering tasks. This is achieved by incorporating proprietary and private data from integrated databases. However, private RAG systems face significant challenges due to the scarcity of private domain data and critical data privacy issues. These obstacles impede the deployment of private RAG systems, as developing privacy-preserving RAG systems requires a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services. We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). This framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and distributed on a central server, ensuring data privacy without direct sharing of raw data. In FedE4RAG, knowledge distillation is employed for communication between the server and client models. This technique improves the generalization of local RAG retrievers during the federated learning process. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns related to data leakage. Extensive experiments conducted on a real-world dataset have validated the effectiveness of FedE4RAG. The results demonstrate that our proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.

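[NLP-51] Efficient Reasoning for LLMs through Speculative Chain-of-Thought

[Quick Read]: This paper addresses the high reasoning cost and response latency of large reasoning language models. The key to the solution is Speculative Chain-of-Thought (SCoT), in which a large model and a small model collaborate to raise the average reasoning speed and thereby cut latency. SCoT drafts at the level of whole thoughts with a lightweight draft model, then uses the target model to select the best chain-of-thought draft and to correct erroneous cases, improving efficiency while preserving prediction accuracy on complex problems.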

Link: https://arxiv.org/abs/2504.19095
Authors: Jikai Wang, Juntao Li, Lijun Wu, Min Zhang
Affiliations: Soochow University; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating the average reasoning speed through large and small model collaboration. SCoT conducts thought-level drafting using a lightweight draft model. Then it selects the best CoT draft and corrects the error cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting and the draft selection strategy maintains the prediction accuracy for complex problems. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% for Deepseek-R1-Distill-Qwen-32B while achieving near-target-model-level performance. Our code is available at this https URL.
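
The draft, select, and correct flow described in the abstract can be sketched roughly as follows. This is only an illustrative control-flow outline, not the authors' implementation: `draft_fn`, `select_fn`, and `target_fn` are hypothetical callables standing in for the draft model, the target model's draft selection, and the target model's own reasoning, and falling back to the target model when every draft is rejected is an assumption about how "correcting the error cases" might work.

```python
def speculative_cot(question, draft_fn, select_fn, target_fn, num_drafts=3):
    """Return (reasoning, source) for one question.

    draft_fn(question, i) -> str        # hypothetical: i-th cheap draft chain of thought
    select_fn(question, drafts) -> int  # hypothetical: index of best draft, or None
    target_fn(question) -> str          # hypothetical: full reasoning by the large model
    """
    # Step 1: the lightweight model proposes several whole chains of thought.
    drafts = [draft_fn(question, i) for i in range(num_drafts)]

    # Step 2: the target model picks the best draft or rejects all of them.
    choice = select_fn(question, drafts)

    # Step 3: fall back to the expensive model only when no draft is accepted.
    if choice is None:
        return target_fn(question), "target"
    return drafts[choice], "draft"
```

Under this sketch, latency drops whenever a draft is accepted, because the large model only performs the cheaper selection step.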

[NLP-52] Sample-Efficient Language Model for Hinglish Conversational AI

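[Quick Read]: This paper addresses how to build an efficient, well-performing conversational model for Hinglish (code-mixed Hindi and English) when standardization is lacking and high-quality conversational data is scarce. The key to the solution is to ease data scarcity by combining synthetically generated dialogues with insights from existing Hinglish datasets, and to fine-tune pre-trained cross-lingual models (such as Gemma3-4B and Qwen2.5-7B), achieving competitive conversation generation while remaining computationally efficient.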

Link: https://arxiv.org/abs/2504.19070
Authors: Sakshi Singh, Abhinav Prakash, Aakriti Shah, Chaitanya Sachdeva, Sanjana Dumpala
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 2 tables, 2 figures

Abstract:This paper presents our process for developing a sample-efficient language model for a conversational Hinglish chatbot. Hinglish, a code-mixed language that combines Hindi and English, presents a unique computational challenge due to inconsistent spelling, lack of standardization, and limited quality of conversational data. This work evaluates multiple pre-trained cross-lingual language models, including Gemma3-4B and Qwen2.5-7B, and employs fine-tuning techniques to improve performance on Hinglish conversational tasks. The proposed approach integrates synthetically generated dialogues with insights from existing Hinglish datasets to address data scarcity. Experimental results demonstrate that models with fewer parameters, when appropriately fine-tuned on high-quality code-mixed data, can achieve competitive performance for Hinglish conversation generation while maintaining computational efficiency.

[NLP-53] ClimaEmpact: Domain-Aligned Small Language Models and Datasets for Extreme Weather Analytics

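[Quick Read]: This paper addresses the scarcity of localized, fine-grained data for assessing extreme weather events, a gap that limits analysis of their potential impacts and hinders effective decision-making. The key to the solution is Extreme Weather Reasoning-Aware Alignment (EWRA), which incorporates structured reasoning paths derived from Large Language Models (LLMs) into Small Language Models (SLMs) to improve their performance on extreme weather analytics. The paper also builds ExtremeWeatherNews, a large dataset of news articles about extreme weather events. Together, EWRA and ExtremeWeatherNews form the ClimaEmpact framework, which targets three tasks: categorization of tangible vulnerabilities and impacts, topic labeling, and emotion analysis.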

Link: https://arxiv.org/abs/2504.19066
Authors: Deeksha Varshney, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Accurate assessments of extreme weather events are vital for research and policy, yet localized and granular data remain scarce in many parts of the world. This data gap limits our ability to analyze potential outcomes and implications of extreme weather events, hindering effective decision-making. Large Language Models (LLMs) can process vast amounts of unstructured text data, extract meaningful insights, and generate detailed assessments by synthesizing information from multiple sources. Furthermore, LLMs can seamlessly transfer their general language understanding to smaller models, enabling these models to retain key knowledge while being fine-tuned for specific tasks. In this paper, we propose Extreme Weather Reasoning-Aware Alignment (EWRA), a method that enhances small language models (SLMs) by incorporating structured reasoning paths derived from LLMs, and ExtremeWeatherNews, a large dataset of extreme weather event-related news articles. EWRA and ExtremeWeatherNews together form the overall framework, ClimaEmpact, that focuses on addressing three critical extreme-weather tasks: categorization of tangible vulnerabilities/impacts, topic labeling, and emotion analysis. By aligning SLMs with advanced reasoning strategies on ExtremeWeatherNews (and its derived dataset ExtremeAlign used specifically for SLM alignment), EWRA improves the SLMs’ ability to generate well-grounded and domain-specific responses for extreme weather analytics. Our results show that the approach proposed guides SLMs to output domain-aligned responses, surpassing the performance of task-specific models and offering enhanced real-world applicability for extreme weather analytics.

[NLP-54] Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models

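[Quick Read]: This paper addresses the accuracy of key event extraction in clinical summarization and the detection of hallucinations in generated summaries. The approach is to use open-source large language models (LLMs) to summarize clinical texts automatically and to evaluate them through comprehensive numerical simulations, assessing the accuracy and fidelity of the extracted content.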

Link: https://arxiv.org/abs/2504.19061
Authors: Anindya Bijoy Das, Shibbir Ahmed, Shahnewaz Karim Sakib
Affiliations: The University of Akron; Texas State University; University of Tennessee at Chattanooga
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, such as reasons for hospital admission, significant in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive numerical simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization.

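[NLP-55] Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

[Quick Read]: This paper addresses the lack of a systematic survey of generative AI for character animation, a field that is advancing rapidly, and aims to provide a comprehensive view that integrates its research progress. The survey analyzes facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. For each area it highlights leading research, practical deployments, commonly used datasets, and emerging trends, giving researchers and developers the background needed to enter the field and a guide to future research directions.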

Link: https://arxiv.org/abs/2504.19056
Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Affiliations: Qatar Computing Research Institute; Sharif University of Technology; Iran University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 50 main pages, 30 pages appendix, 21 figures, 8 tables, GitHub Repository: this https URL

Abstract:Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: this https URL.

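[NLP-56] Calibrating Translation Decoding with Quality Estimation on LLMs

[Quick Read]: This paper addresses a shortcoming of maximum a posteriori (MAP) decoding in neural machine translation (NMT) systems: it often yields low-quality or even pathological hypotheses because the decoding objective is not aligned with actual translation quality. The key to the solution is to calibrate hypothesis likelihoods from a distribution view by directly optimizing the Pearson correlation between hypothesis likelihood and translation quality, making translation decoding more effective.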

Link: https://arxiv.org/abs/2504.19044
Authors: Di Wu, Yibin Lei, Christof Monz
Affiliations: University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses – the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation – thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations – even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: this https URL.
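
For readers unfamiliar with the quantity being optimized, the following minimal sketch computes the Pearson correlation between model likelihoods and quality scores for a set of candidate translations. It shows only the statistic itself, not the paper's training objective; the numbers are hypothetical illustration values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for four sampled translations of one source sentence.
log_likelihoods = [-4.1, -3.2, -5.0, -2.7]
quality_scores = [0.62, 0.71, 0.40, 0.80]
r = pearson(log_likelihoods, quality_scores)
```

A correlation close to 1 means that ranking hypotheses by likelihood, as MAP decoding does, agrees with ranking them by quality, which is the alignment the abstract says calibration aims for.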

[NLP-57] KETCHUP: K-Step Return Estimation for Sequential Knowledge Distillation

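[Quick Read]: This paper addresses the optimization of Reinforcement Learning (RL)-based knowledge distillation (KD) for text generation, in particular the high variance of gradient estimates that limits optimization when the student model is large. The key to the solution is a new K-step return estimation method, KETCHUP, which induces a K-step return by applying the Bellman optimality equation over multiple steps, reducing the variance of gradient estimates and improving RL optimization.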

Link: https://arxiv.org/abs/2504.19024
Authors: Jiabin Fan, Guoqing Luo, Michael Bowling, Lili Mou
Affiliations: University of Alberta; Alberta Machine Intelligence Institute; Amii
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We propose a novel K-step return estimation method (called KETCHUP) for Reinforcement Learning (RL)-based knowledge distillation (KD) in text generation tasks. Our idea is to induce a K-step return by using the Bellman Optimality Equation for multiple steps. Theoretical analysis shows that this K-step formulation reduces the variance of the gradient estimates, thus leading to improved RL optimization especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation. These results suggest that our K-step return induction offers a promising direction for enhancing RL-based KD in LLM research.
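
As background, the sketch below computes the generic textbook K-step return: K observed rewards followed by a bootstrapped value estimate. It is not the KETCHUP estimator itself, which the abstract says is induced from the Bellman optimality equation; the function name and argument layout are assumptions made for illustration.

```python
def k_step_return(rewards, values, t, k, gamma=1.0):
    """Textbook K-step return starting at time step t.

    rewards[i] is the reward received after step i; values[i] is a
    bootstrap estimate of the value of the state at step i.
    """
    horizon = len(rewards)
    end = min(t + k, horizon)
    total = 0.0
    # Sum up to k discounted rewards actually observed.
    for i in range(t, end):
        total += (gamma ** (i - t)) * rewards[i]
    # Bootstrap from the value estimate only if the episode has not ended.
    if end < horizon:
        total += (gamma ** (end - t)) * values[end]
    return total
```

With `k=1` this reduces to the one-step temporal-difference target, and with `k` at least as large as the remaining horizon it becomes the full Monte Carlo return.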

[NLP-58] Advancing Scientific Text Classification: Fine-Tuned Models with Dataset Expansion and Hard-Voting

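[Quick Read]: This paper addresses efficient text classification amid the rapidly growing volume of academic publications. The key to the solution is to fine-tune pre-trained language models (PLMs) such as BERT, SciBERT, BioBERT, and BlueBERT on the Web of Science (WoS-46985) dataset, and to expand the dataset by running seven targeted queries that retrieve 1,000 articles per category aligned with the main classes of WoS-46985. Model predictions on this data are combined with a hard-voting strategy to improve accuracy and confidence, and fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly improves classification accuracy, especially in specialized domains.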

Link: https://arxiv.org/abs/2504.19021
Authors: Zhyar Rzgar K Rostam, Gábor Kertész
Affiliations: Obuda University; John von Neumann Faculty of Informatics; Doctoral School of Applied Informatics and Applied Mathematics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, 8 tables

Abstract:Efficient text classification is essential for handling the increasing volume of academic publications. This study explores the use of pre-trained language models (PLMs), including BERT, SciBERT, BioBERT, and BlueBERT, fine-tuned on the Web of Science (WoS-46985) dataset for scientific text classification. To enhance performance, we augment the dataset by executing seven targeted queries in the WoS database, retrieving 1,000 articles per category aligned with WoS-46985’s main classes. PLMs predict labels for this unlabeled data, and a hard-voting strategy combines predictions for improved accuracy and confidence. Fine-tuning on the expanded dataset with dynamic learning rates and early stopping significantly boosts classification accuracy, especially in specialized domains. Domain-specific models like SciBERT and BioBERT consistently outperform general-purpose models such as BERT. These findings underscore the efficacy of dataset augmentation, inference-driven label prediction, hard-voting, and fine-tuning techniques in creating robust and scalable solutions for automated academic text classification.
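
Hard voting, as mentioned in the abstract, can be illustrated with a minimal sketch like the one below. The tie-breaking rule and the vote share used as a rough confidence signal are assumptions for illustration; the abstract does not say how the authors handle ties or define confidence.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label and its vote share.

    `predictions` is a list of labels, one per model. Ties are broken in
    favour of the label that appears first in the list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    # most_common keeps first-encountered order among equal counts.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)
```

For example, if three of four models label an article "CS" and one labels it "Medical", the function returns `("CS", 0.75)`.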

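[NLP-59] Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

[Quick Read]: This paper addresses safety weaknesses in the alignment of Large Language Models (LLMs) with societal standards, in particular the fact that adversarial jailbreaks can bypass their safety mechanisms. The proposed solution is Graph of ATtacks (GoAT), whose key idea is to generate effective adversarial prompts with the Graph of Thoughts framework, reasoning over a richer graph structure instead of being restricted to tree-based reasoning. GoAT achieves a higher jailbreak success rate with fewer queries, needs no access to the target model's parameters (making it a black-box attack), and produces high-quality, human-readable adversarial prompts.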

Link: https://arxiv.org/abs/2504.19019
Authors: Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 19 pages, 1 figure, 6 tables

Abstract:The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model’s parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT’s reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other’s progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: this https URL.

[NLP-60] Dynamic Fisher-weighted Model Merging via Bayesian Optimization

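[Quick Read]: This paper addresses how to efficiently merge multiple fine-tuned task-specific models into a single multi-task model without training data or joint training. Existing parameter-level merging methods typically either scale parameters model-wise or integrate parameter importance parameter-wise, and both leave a performance gap relative to multi-task fine-tuning. The proposed solution is Dynamic Fisher-weighted Merging (DF-Merge): candidate models are associated with a set of linear scaling coefficients, Bayesian optimization adjusts these coefficients dynamically to maximize overall performance on validation sets, and each iteration integrates parameter importance based on the Fisher information conditioned on the coefficients.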

Link: https://arxiv.org/abs/2504.18992
Authors: Sanwoo Lee, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Yunfang Wu
Affiliations: Peking University; Meituan; Meta AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The fine-tuning of pre-trained language models has resulted in the widespread availability of task-specific models. Model merging offers an efficient way to create multi-task models by combining these fine-tuned models at the parameter level, without the need for training data or joint training on multiple datasets. Existing merging approaches typically involve scaling the parameters model-wise or integrating parameter importance parameter-wise. Both approaches exhibit their own weaknesses, leading to a notable performance gap compared to multi-task fine-tuning. In this paper, we unify these seemingly distinct strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge). Specifically, candidate models are associated with a set of coefficients that linearly scale their fine-tuned parameters. Bayesian optimization is applied to dynamically adjust these coefficients, aiming to maximize overall performance on validation sets. Each iteration of this process integrates parameter importance based on the Fisher information conditioned by the coefficients. Experimental results show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks. Our analysis shows that the effectiveness of DF-Merge arises from the unified view of merging and that near-optimal performance is achievable in a few iterations, even with minimal validation data.
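
To make the two ingredients named in the abstract concrete (model-wise coefficients and parameter-wise Fisher importance), here is a simplified sketch of a coefficient-scaled, Fisher-weighted average over flat parameter lists. It is not DF-Merge itself: the coefficients are fixed inputs here, whereas the abstract describes tuning them with Bayesian optimization and conditioning the Fisher information on them. The function name, the `eps` threshold, and the plain-average fallback are all assumptions for illustration.

```python
def fisher_weighted_merge(params, fishers, coeffs, eps=1e-12):
    """Merge flat parameter vectors with coefficient-scaled Fisher weights.

    params[i][j]  : j-th parameter of model i
    fishers[i][j] : diagonal Fisher estimate for that parameter (>= 0)
    coeffs[i]     : scalar coefficient for model i (>= 0)
    """
    n_params = len(params[0])
    merged = []
    for j in range(n_params):
        # Weighted sum of parameter values, weight = coefficient * importance.
        num = sum(c * f[j] * p[j] for c, f, p in zip(coeffs, fishers, params))
        den = sum(c * f[j] for c, f in zip(coeffs, fishers))
        if den < eps:
            # No model considers this parameter important: plain average.
            merged.append(sum(p[j] for p in params) / len(params))
        else:
            merged.append(num / den)
    return merged
```

Setting all Fisher values to 1 reduces this to a coefficient-weighted average of the models, while setting all coefficients to 1 gives a purely importance-weighted average.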

[NLP-61] LINC: Supporting Language Independent Communication and Comprehension to Enhance Contribution in Multilingual Collaborative Meetings

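[Quick Read]: This paper addresses the difficulties that English as a Second Language (ESL) researchers in multilingual teams face when communicating in meetings and following discussions, which limit their contribution. The key to the solution is LINC, a multimodal Language INdependent Collaboration system with two core components: a real-time module for multilingual communication during meetings and a post-meeting dashboard for discussion analysis. By letting participants communicate in their preferred language, recall and review actionable insights, and prepare effectively for upcoming meetings, the system makes multilingual collaboration more efficient.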

Link: https://arxiv.org/abs/2504.18988
Authors: Saramsh Gautam, Mahmood Jasim
Affiliations: Louisiana State University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 19 pages, 4 figures. Multimodal system design and evaluation study

Abstract:Collaborative research often includes contributors with varied perspectives from diverse linguistic backgrounds. However, English as a Second Language (ESL) researchers often struggle to communicate during meetings in English and comprehend discussions, leading to limited contribution. To investigate these challenges, we surveyed 64 ESL researchers who frequently collaborate in multilingual teams and identified four key design goals around participation, comprehension, documentation, and feedback. Guided by these design goals, we developed LINC, a multimodal Language INdependent Collaboration system with two components: a real-time module for multilingual communication during meetings and a post-meeting dashboard for discussion analysis. We evaluated the system through a two-phased study with six triads of multilingual teams. We found that using LINC, participants benefited from communicating in their preferred language, recalled and reviewed actionable insights, and prepared for upcoming meetings effectively. We discuss external factors that impact multilingual meeting participation beyond language preferences and the implications of multimodal systems in facilitating meetings in hybrid multilingual collaborative settings beyond research.

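[NLP-62] LawFlow: Collecting and Simulating Lawyers' Thought Processes

[Quick Read]: This paper addresses the inability of current legal AI models to support complex, end-to-end legal workflows: they focus narrowly on isolated subtasks and fail to capture the dynamic, modular, and iterative reasoning that real legal practice requires. The key to the solution is LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students and grounded in real-world business entity formation scenarios, which reflects the ambiguity, revision, and client-adaptive strategies of legal practice. By comparing human and Large Language Model (LLM)-generated workflows, the paper reveals systematic differences in structure, reasoning flexibility, and plan execution, and it proposes empirically grounded design suggestions for collaborative, reasoning-aware legal AI systems that better match the goals of human legal practitioners.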

Link: https://arxiv.org/abs/2504.18942
Authors: Debarati Das, Khanh Chi Le, Ritik Sachin Parkar, Karin De Langis, Brendan Madson, Chad M. Berryman, Robin M. Willis, Daniel H. Moses, Brett McDonnell, Daniel Schwarcz, Dongyeop Kang
Affiliations: University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to COLM 2025

Abstract:Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Building on these findings, we propose a set of design suggestions, rooted in empirical observations, that align AI assistance with human goals of clarity, completeness, creativity, and efficiency, through hybrid planning, adaptive execution, and decision-point support. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (this https URL).

[NLP-63] MTCSC: Retrieval-Augmented Iterative Refinement for Chinese Spelling Correction

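[Quick Read]: This paper addresses two weaknesses in Chinese Spelling Correction (CSC): difficulty keeping output length consistent and poor domain adaptation. Traditional CSC tasks also require input and output to have the same length, which limits their practical use. The authors propose the MTCSC (Multi-Turn CSC) framework, which combines retrieval-augmented generation (RAG) with a length reflection mechanism: a domain-specific retrieval database and an iterative length reflection strategy improve correction quality, especially on domain-specific and variable-length error correction tasks.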

Link: https://arxiv.org/abs/2504.18938
Authors: Junhong Liang, Yu Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 2 figures

Abstract:Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with maintaining consistent output lengths and adapting to domain-specific corrections. Furthermore, existing CSC tasks impose rigid constraints requiring input and output lengths to be identical, limiting their applicability. In this work, we extend traditional CSC to variable-length correction scenarios, including Chinese Splitting Error Correction (CSEC) and ASR N-best Error Correction. To address domain adaptation and length consistency, we propose the MTCSC (Multi-Turn CSC) framework, based on RAG and enhanced with a length reflection mechanism. Our approach constructs a retrieval database from domain-specific training data and dictionaries, fine-tuning retrievers to optimize performance for error-containing inputs. Additionally, we introduce a multi-source combination strategy with iterative length reflection to ensure output length fidelity. Experiments across diverse domain datasets demonstrate that our method significantly outperforms current approaches in correction quality, particularly in handling domain-specific and variable-length error correction tasks.

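[NLP-64] Clinical knowledge in LLMs does not translate to human interactions

[Quick Read]: This paper addresses the gap between the benchmark performance of large language models (LLMs) and their real-world effectiveness when giving medical advice. The study finds that although LLMs score highly on medical licensing exams, they do not reach the same accuracy in interactions with real users. The authors identify the interaction between users and LLMs as the main challenge and recommend systematic testing with human users before deployment in healthcare.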

Link: https://arxiv.org/abs/2504.18919
Authors: Andrew M. Bean, Rebecca Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera, Sara Hincapié Monsalve, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, Adam Mahdi
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 52 pages, 4 figures

Abstract:Global healthcare providers are exploring use of large language models (LLMs) to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings. We tested if LLMs can assist members of the public in identifying underlying conditions and choosing a course of action (disposition) in ten medical scenarios in a controlled study with 1,298 participants. Participants were randomly assigned to receive assistance from an LLM (GPT-4o, Llama 3, Command R+) or a source of their choice (control). Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in less than 34.5% of cases and disposition in less than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice. Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures we find with human participants. Moving forward, we recommend systematic human user testing to evaluate interactive capabilities prior to public deployments in healthcare.

[NLP-65] A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification

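[Quick Read]: This paper addresses the variability and reproducibility of results across trials of Large Language Models (LLMs), an issue largely overlooked in the literature even though human annotation routinely uses majority voting to resolve disagreements among annotators. The key to the solution is a simple ensemble strategy: combining multiple inferences from medium-sized LLMs gives more robust and accurate results than a single attempt with one large model, reducing RMSE by 18.6% in the experiments.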

Link: https://arxiv.org/abs/2504.18884
Authors: Junichiro Niimi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This manuscript has been accepted for the 30th International Conference on Natural Language Information Systems (NLDB 2025). The final version will appear in the Springer LNCS proceedings. arXiv admin note: text overlap with arXiv:2407.13069

Abstract:With the advance of large language models (LLMs), they have been used for a variety of tasks. However, the variability and reproducibility of results across LLM trials have been largely overlooked in the existing literature, whereas actual human annotation uses majority voting to resolve disagreements among annotators. This study therefore introduces a straightforward ensemble strategy to sentiment analysis using LLMs. The results demonstrate that an ensemble of multiple inferences from medium-sized LLMs produces more robust and accurate results than a single attempt with a large model, reducing RMSE by 18.6%.
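
The idea of combining several inference runs and scoring the result with RMSE can be sketched as follows. The abstract does not state the aggregation rule, so taking the per-item median is an assumption chosen for illustration, and the ratings are made-up toy values, not the paper's data.

```python
import math
from statistics import median


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def ensemble_by_median(runs):
    """Combine runs item by item; runs[m][i] is the rating from run m for item i."""
    return [median(scores) for scores in zip(*runs)]


# Toy ratings on a 1-5 scale from three hypothetical inference runs.
runs = [[5, 1, 3], [4, 2, 3], [1, 2, 3]]
truth = [4, 2, 3]
combined = ensemble_by_median(runs)
```

In this toy case each single run makes an occasional outlying prediction, while the combined ratings match the reference exactly, which is the kind of stabilizing effect the abstract attributes to ensembling.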

[NLP-66] Latent Adversarial Training Improves the Representation of Refusal

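[Quick Read]: This paper addresses the fragility of how language models encode refusal behavior, and in particular the open question of how Latent Adversarial Training (LAT), which introduces noise during training, affects the underlying representation of refusal. The approach is to analyze the latent space of Llama 2 7B and compare LAT with conventional supervised safety fine-tuning (SSFT) and embedding-space adversarial training (AT), showing how noise-based training reorganizes the refusal representation. The study finds that LAT significantly alters this representation, concentrating it in the first two Singular Value Decomposition (SVD) components; this yields more effective and transferable refusal vectors but also exposes new vulnerabilities.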

Link: https://arxiv.org/abs/2504.18872
Authors: Alexandra Abbas,Nora Petrova,Helios Ael Lyons,Natalia Perez-Campanero
Affiliations: Apart Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Recent work has shown that language models’ refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: How does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT’s effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through the analysis of Llama 2 7B, we examine how LAT reorganizes the refusal behavior in the model’s latent space compared to SSFT and embedding space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components which explain approximately 75 percent of the activation differences variance - significantly higher than in reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but become more vulnerable to self-generated vectors compared to SSFT and AT. Our findings suggest that LAT’s training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and vulnerabilities for improving model safety.
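
The "single direction" finding and the ablation attacks mentioned in the abstract can be illustrated with a small sketch. It assumes the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts. That is a common choice in earlier work on refusal directions and is not the SVD-based analysis this paper performs. All activation values below are made up.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def refusal_direction(harmful, harmless):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component of an activation along a unit direction."""
    projection = sum(a * d for a, d in zip(activation, direction))
    return [a - projection * d for a, d in zip(activation, direction)]

# Hypothetical 3-dimensional activations.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([3.0, 1.0, 0.5], direction)
print(direction, ablated)
```

After ablation the activation has no component left along the estimated direction, which is what makes a representation concentrated in few directions an attractive attack target.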

[NLP-67] Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation

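Quick read: This paper aims to address the difficulty large language models (LLMs) have in generating coherent context when the input exceeds the pre-training length. The key solution is a training-free framework, Dimension-Wise Positional Embeddings Manipulation (DPE), which analyzes the different hidden dimensions of rotary position embeddings (RoPE), detects the effective length of each dimension, and manipulates the position indices of the key dimensions to their most effective lengths. This significantly extends the context window and improves performance without changing the architecture of the pre-trained model.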

Link: https://arxiv.org/abs/2504.18857
Authors: Yi Lu,Wanxu Zhao,Xin Zhou,Chenxin An,Chenglong Wang,Shuo Li,Yuming Yang,Jun Zhao,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Fudan University; The University of Hong Kong; Northeastern University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length. Recent advancements in long-context extension have significantly expanded the context window of LLMs but require expensive overhead to train the large-scale models with longer context. In this work, we propose Dimension-Wise Positional Embeddings Manipulation (DPE), a training-free framework to extrapolate the context window of LLMs by diving into RoPE’s different hidden dimensions. Instead of manipulating all dimensions equally, DPE detects the effective length for every dimension and finds the key dimensions for context extension. We reuse the original position indices with their embeddings from the pre-trained model and manipulate the key dimensions’ position indices to their most effective lengths. In this way, DPE adjusts the pre-trained models with minimal modifications while ensuring that each dimension reaches its optimal state for extrapolation. DPE significantly surpasses well-known baselines such as YaRN and Self-Extend. DPE enables Llama3-8k 8B to support context windows of 128k tokens without continual training and integrates seamlessly with Flash Attention 2. In addition to its impressive extrapolation capability, DPE also dramatically improves the models’ performance within training length, such as Llama3.1 70B, by over 18 points on popular long-context benchmarks RULER. When compared with commercial models, Llama 3.1 70B with DPE even achieves better performance than GPT-4-128K.
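
As background for why a method might treat RoPE dimensions separately, the sketch below computes the standard RoPE rotation frequency for each dimension pair and the number of positions each pair needs to complete one full rotation. It does not implement DPE's effective-length detection or its position-index manipulation. The base of 10000 comes from the original RoPE formulation, and the head dimension of 4 is chosen only to keep the output short; both are assumptions for illustration.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Rotation frequency theta_i = base^(-2i/d) for each dimension pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_wavelengths(head_dim, base=10000.0):
    """Positions needed for each dimension pair to complete one full rotation."""
    return [2.0 * math.pi / theta for theta in rope_frequencies(head_dim, base)]

def rotate_pair(x, y, position, theta):
    """Apply the RoPE rotation for one dimension pair at a given position."""
    angle = position * theta
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (x * cos_a - y * sin_a, x * sin_a + y * cos_a)

freqs = rope_frequencies(4)
wavelengths = rope_wavelengths(4)
for i, (theta, wavelength) in enumerate(zip(freqs, wavelengths)):
    print(f"pair {i}: theta={theta:.4f}, wavelength={wavelength:.1f} positions")
```

Low-index pairs rotate quickly and cycle many times within the training length, while high-index pairs may never complete a full rotation, which is one reason per-dimension behavior can differ at long context.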

[NLP-68] When2Call: When (not) to Call Tools NAACL2025

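Quick read: This paper addresses the weak decision-making of current language models when calling external tools. Beyond whether a tool is called correctly, it asks when a model should call a tool, when it should ask a follow-up question, and when it should admit that the question cannot be answered with the tools provided. The key to the solution is a new benchmark, When2Call, that evaluates tool-calling decisions, together with a preference optimization training regime developed for the benchmark, which yields considerably more improvement than traditional fine-tuning.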

Link: https://arxiv.org/abs/2504.18851
Authors: Hayley Ross,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara
Affiliations: Harvard University; NVIDIA
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025


Abstract:Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling – whether the correct tool is called with the correct parameters – and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can’t be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at this https URL.

[NLP-69] Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning

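Quick read: This paper aims to address the limited ability of large language models (LLMs) to handle dialogue breakdowns in conversational systems. Although advanced models from OpenAI and Anthropic excel at many dialogue tasks, they can still produce incoherent or contradictory responses that undermine user trust. The key to the proposed solution is combining specialized fine-tuning with advanced prompting strategies, including few-shot learning, chain-of-thought reasoning, and analogical prompting, to improve classification and calibration on English and Japanese dialogue. A real-time deployment architecture then calls more resource-intensive frontier models only when a breakdown is detected, reducing operational cost and energy consumption.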

Link: https://arxiv.org/abs/2504.18839
Authors: Abdellah Ghassel,Xianzhi Li,Xiaodan Zhu
Affiliations: Ingenuity Labs Research Institute; Queen’s University
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Large language models (LLMs) are rapidly changing various domains. However, their capabilities in handling conversational breakdowns still require an in-depth exploration. This paper addresses the challenge of detecting and mitigating dialogue breakdowns within LLM-driven conversational systems. While powerful models from OpenAI and Anthropic excel in many dialogue tasks, they can still produce incoherent or contradictory responses, commonly referred to as breakdowns, which undermine user trust. To tackle this, we propose an approach that combines specialized fine-tuning with advanced prompting strategies, including few-shot learning, chain-of-thought reasoning, and analogical prompting. In particular, we fine-tune a small 8B model and demonstrate its robust classification and calibration capabilities in English and Japanese dialogue. We also validate its generalization on the BETOLD dataset, achieving a 7% accuracy improvement over its base model. Furthermore, we introduce a real-time deployment architecture that selectively escalates suspicious responses to more resource-intensive frontier models only when breakdowns are detected, significantly cutting operational expenses and energy consumption. Experimental results show our method surpasses prior state-of-the-art specialized classifiers while also narrowing performance gaps between smaller open-source models and large proprietary ones. Our approach offers a scalable solution for robust conversational AI in high-impact domains by combining efficiency, interpretability, and reliability.
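
The selective escalation described above can be sketched as a simple router. This is a minimal illustration under stated assumptions: `detector` stands in for the fine-tuned small classifier, `frontier_model` stands in for a larger model, escalation is assumed to mean regenerating the reply, and the 0.5 threshold is arbitrary. None of these names or choices come from the paper.

```python
def route_response(dialogue, response, detector, frontier_model, threshold=0.5):
    """Return (final_response, escalated).

    The cheap detector scores the candidate response; the expensive
    frontier model is only called when a breakdown looks likely.
    """
    breakdown_probability = detector(dialogue, response)
    if breakdown_probability >= threshold:
        return frontier_model(dialogue), True
    return response, False

# Toy stand-ins (hypothetical): an empty reply counts as a breakdown.
def toy_detector(dialogue, response):
    return 0.9 if response.strip() == "" else 0.1

def toy_frontier_model(dialogue):
    return "Could you clarify what you meant?"

print(route_response(["Hi there"], "Hello!", toy_detector, toy_frontier_model))
print(route_response(["Hi there"], "", toy_detector, toy_frontier_model))
```

The cost saving comes from the fact that most turns take the first branch's complement: the frontier model is never called unless the detector fires.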

[NLP-70] Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

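Quick read: This paper addresses the evaluation challenges raised by the rapid development of large language models (LLMs), in particular the failure of evaluation to keep pace with growing model capabilities. The key is identifying and analyzing two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, including dynamic dataset curation and "LLM-as-a-judge" scoring. The paper also notes that even with these transitions, evaluation generalization remains a crucial obstacle, because bounded test sets cannot scale alongside models whose abilities keep growing.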

Link: https://arxiv.org/abs/2504.18838
Authors: Yixin Cao,Shibo Hong,Xinze Li,Jiahao Ying,Yubo Ma,Haiyuan Liang,Yantao Liu,Zijun Yao,Xiaozhi Wang,Dan Huang,Wenxuan Zhang,Lifu Huang,Muhao Chen,Lei Hou,Qianru Sun,Xingjun Ma,Zuxuan Wu,Min-Yen Kan,David Lo,Qi Zhang,Heng Ji,Jing Jiang,Juanzi Li,Aixin Sun,Xuanjing Huang,Tat-Seng Chua,Yu-Gang Jiang
Affiliations: Fudan University; Nanyang Technological University; Singapore Management University; Tsinghua University; Singapore University of Technology and Design; University of California Davis; National University of Singapore; University of Illinois Urbana-Champaign; Australian National University
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Large Language Models (LLMs) are advancing at an amazing speed and have become indispensable across academia, industry, and daily applications. To keep pace with the status quo, this survey probes the core challenges that the rise of LLMs poses for evaluation. We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety; and (ii) from manual to automated evaluation, encompassing dynamic dataset curation and “LLM-as-a-judge” scoring. Yet, even with these transitions, a crucial obstacle persists: the evaluation generalization issue. Bounded test sets cannot scale alongside models whose abilities grow seemingly without limit. We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics. Due to the fast evolving of this field, we will maintain a living GitHub repository (links are in each section) to crowd-source updates and corrections, and warmly invite contributors and collaborators.

[NLP-71] Stealing Creators Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

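Quick read: This paper addresses the challenge of generating engaging, accurate short-form videos from scientific papers, which stems from content complexity and the knowledge gap between authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their use in scientific dissemination. The key to the solution is SciTalk, a multi-LLM agentic framework that grounds videos in multiple sources such as text, figures, visual styles, and avatars. It uses specialized agents for content summarization, visual scene planning, and text and layout editing, and adds an iterative feedback mechanism in which feedback from simulated user roles is used to refine the generation prompts.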

Link: https://arxiv.org/abs/2504.18805
Authors: Jong Inn Park,Maanas Taneja,Qianwen Wang,Dongyeop Kang
Affiliations: University of Minnesota
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL


Abstract:Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework, grounding videos in various sources, such as text, figures, visual styles, and avatars. Inspired by content creators’ workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing, and incorporates an iterative feedback mechanism where video agents simulate user roles to give feedback on generated videos from previous iterations and refine generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content over the refined loop of video generation. Although preliminary results are still not yet matching human creators’ quality, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.

[NLP-72] SynLexLM: Scaling Legal LLMs with Synthetic Data and Curriculum Learning

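Quick read: This paper addresses the extensive fine-tuning and specialized datasets that large language models (LLMs) need in the legal domain, and the difficulty general-purpose pre-trained models have in capturing legal nuances. The key solution is SynLexLM, which combines curriculum learning with synthetic data augmentation using models such as Gemini Pro. Pre-training progresses from simple to complex legal texts and queries to address data scarcity, with the aim of improving performance on legal benchmarks (BigLaw-Bench, EUR-Lex-Sum).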

Link: https://arxiv.org/abs/2504.18762
Authors: Ojasw Upadhyay,Abishek Saravankumar,Ayman Ismail
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, 4 tables


Abstract:Large Language Models (LLMs) are powerful but often require extensive fine-tuning and large datasets for specialized domains like law. General-purpose pre-training may not capture legal nuances, and acquiring sufficient legal data is challenging. We introduce SynLexLM, a novel approach to efficiently pre-train a legal LLM. Our method employs curriculum learning, progressing from simple to complex legal texts and queries, combined with synthetic data augmentation using models like Gemini Pro to address data scarcity. We aim to achieve improved performance on legal benchmarks (BigLaw-Bench, EUR-Lex-Sum) compared to traditional models and fine-tuned versions. Preliminary work involves generating synthetic QA pairs reflecting legal reasoning. This work aims to enhance legal document analysis and research tools, potentially democratizing access to advanced legal AI.

[NLP-73] Generative Product Recommendations for Implicit Superlative Queries

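Quick read: This paper addresses a problem in recommender systems: when users look for the best product through implicit superlative queries, standard retrieval and ranking systems struggle because the queries lack explicit attribute descriptions and require identifying and reasoning over complex factors. The key to the solution is using large language models (LLMs) to generate implicit attributes and reason over them to improve product recommendations for such queries. To this end, the authors propose SUPERB, a new four-point annotation schema, and pair it with LLM-based product annotations to build a new dataset for evaluating and analyzing existing approaches.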

Link: https://arxiv.org/abs/2504.18748
Authors: Kaustubh D. Dhole,Nikhita Vedula,Saar Kuzi,Giuseppe Castellucci,Eugene Agichtein,Shervin Malmasi
Affiliations: Emory University; Amazon.com Inc.
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:


Abstract:In Recommender Systems, users often seek the best products through indirect, vague, or under-specified queries, such as “best shoes for trail running”. Such queries, also referred to as implicit superlative queries, pose a significant challenge for standard retrieval and ranking systems as they lack an explicit mention of attributes and require identifying and reasoning over complex factors. We investigate how Large Language Models (LLMs) can generate implicit attributes for ranking as well as reason over them to improve product recommendations for such queries. As a first step, we propose a novel four-point schema for annotating the best product candidates for superlative queries called SUPERB, paired with LLM-based product annotations. We then empirically evaluate several existing retrieval and ranking approaches on our new dataset, providing insights and discussing their integration into real-world e-commerce production systems.

[NLP-74] EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

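Quick read: This paper addresses the task of automatically finding evidence relevant to hypotheses in biomedical papers (evidence retrieval), an important step when researchers investigate scientific hypotheses. The key to the solution is EvidenceBench, a dataset built by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, guided throughout by existing human expert judgment. The validity and accuracy of the pipeline are confirmed with multiple sets of human-expert annotations.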

Link: https://arxiv.org/abs/2504.18736
Authors: Jianyou Wang,Weili Cao,Kaicheng Wang,Xiaoyue Wang,Ashish Dalvi,Gino Prasad,Qishan Liang,Hsuan-lin Her,Ming Wang,Qin Yang,Gene W. Yeo,David E. Neal,Maxim Khan,Christopher D. Rosin,Ramamohan Paturi,Leon Bergen
Affiliations: University of California, San Diego; Elsevier; Department of Cellular and Molecular Medicine, University of California, San Diego; Sichuan Cancer Hospital & Institute; The Third People’s Hospital of Chengdu
Categories: Computation and Language (cs.CL)
Comments:


Abstract:We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure models performance on this task, which is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts judgment. We demonstrate the pipeline’s validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performances still fall significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at this https URL

[NLP-75] Building UD Cairo for Old English in the Classroom

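Quick read: This paper addresses the construction and annotation of an Old English corpus for historical linguistics, and in particular how to combine generative AI with human annotation to build a high-quality treebank. The key to the solution is combining sentences generated by large language models (LLMs) with authentic Old English data, assigning them to multiple students with limited prior exposure to Universal Dependencies (UD), and then comparing and adjudicating their annotations to improve quality, while post-editing makes the LLM output syntactically more authentic.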

Link: https://arxiv.org/abs/2504.18718
Authors: Lauren Levine,Junghyun Min,Amir Zeldes
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 2 figures


Abstract:In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world’s languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.

[NLP-76] Spatial Speech Translation: Translating Across Space With Binaural Hearables

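Quick read: This paper addresses spatial speech translation in noisy environments: translating the speech around a wearer into the wearer's native language while preserving each speaker's direction and unique voice characteristics. The key to the solution is integrating blind source separation, localization, real-time expressive translation, and binaural rendering so that speaker directions are preserved in the translated audio, while achieving real-time inference on Apple M2 silicon.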

Link: https://arxiv.org/abs/2504.18715
Authors: Tuochao Chen,Qirui Wang,Runlin He,Shyam Gollakota
Affiliations: Paul G. Allen School, University of Washington
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by CHI2025


Abstract:Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer’s environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system’s effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.

[NLP-77] Can Third-parties Read Our Emotions?

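Quick read: This paper addresses the reliability of third-party annotation in emotion recognition, namely that third-party annotators may not accurately reflect an author's private states. The key to the approach is directly comparing third-party annotations with first-party (author-provided) emotion labels to expose the limitations of third-party annotation, and exploring ways to improve annotation quality, including demographic similarity between authors and annotators and adding first-party demographic information to prompts to improve the performance of large language models (LLMs).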

Link: https://arxiv.org/abs/2504.18673
Authors: Jiayi Li,Yingfan Zhou,Pranav Narayanan Venkit,Halima Binte Islam,Sneha Arya,Shomir Wilson,Sarah Rajtmajer
Affiliations: Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Natural Language Processing tasks that aim to infer an author’s private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors’ private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations-whether provided by human annotators or large language models (LLMs)-in faithfully representing authors’ private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance. While incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs’ performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors’ private states.

[NLP-78] Span-Level Hallucination Detection for LLM-Generated Answers

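Quick read: This paper addresses the detection of hallucinated spans in answers generated by large language models (LLMs), with the goal of improving factual consistency. The key to the solution is integrating Semantic Role Labeling (SRL) to decompose an answer into atomic semantic roles and comparing them with a reference context obtained through question-based LLM prompting. A DeBERTa-based textual entailment model then evaluates how well each role aligns semantically with the reference context, and the scores are further refined with token-level confidence measures to detect hallucinated spans.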

Link: https://arxiv.org/abs/2504.18639
Authors: Passant Elchafei,Mervet Abu-Elkheir
Affiliations: Ulm University; German University in Cairo
Categories: Computation and Language (cs.CL)
Comments:


Abstract:Detecting spans of hallucination in LLM-generated answers is crucial for improving factual consistency. This paper presents a span-level hallucination detection framework for the SemEval-2025 Shared Task, focusing on English and Arabic texts. Our approach integrates Semantic Role Labeling (SRL) to decompose the answer into atomic roles, which are then compared with a retrieved reference context obtained via question-based LLM prompting. Using a DeBERTa-based textual entailment model, we evaluate each role semantic alignment with the retrieved context. The entailment scores are further refined through token-level confidence measures derived from output logits, and the combined scores are used to detect hallucinated spans. Experiments on the Mu-SHROOM dataset demonstrate competitive performance. Additionally, hallucinated spans have been verified through fact-checking by prompting GPT-4 and LLaMA. Our findings contribute to improving hallucination detection in LLM-generated responses.

[NLP-79] Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines

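Quick read: This paper aims to address how to preserve privacy and security when handling large datasets while maintaining analytical utility and operational efficiency, especially in data-sensitive sectors such as Banking, Financial Services, and Insurance (BFSI). The key to the solution is replacing traditional anonymization with modern synthetic data generation and advanced data perturbation techniques, such as Generative Adversarial Networks (GANs), context-aware transformation of personally identifiable information (PII), configurable statistical perturbation, and differential privacy, to create realistic, privacy-preserving datasets that remain useful for complex machine learning tasks and analytics.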

Link: https://arxiv.org/abs/2504.18596
Authors: Anantha Sharma,Swetha Devabhaktuni,Eklove Mohan
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Probability (math.PR)
Comments: 18 pages, 8 figures, 5 tables


Abstract:This paper explores the strategic use of modern synthetic data generation and advanced data perturbation techniques to enhance security, maintain analytical utility, and improve operational efficiency when managing large datasets, with a particular focus on the Banking, Financial Services, and Insurance (BFSI) sector. We contrast these advanced methods encompassing generative models like GANs, sophisticated context-aware PII transformation, configurable statistical perturbation, and differential privacy with traditional anonymization approaches. The goal is to create realistic, privacy-preserving datasets that retain high utility for complex machine learning tasks and analytics, a critical need in the data-sensitive industries like BFSI, Healthcare, Retail, and Telecommunications. We discuss how these modern techniques potentially offer significant improvements in balancing privacy preservation while maintaining data utility compared to older methods. Furthermore, we examine the potential for operational gains, such as reduced overhead and accelerated analytics, by using these privacy-enhanced datasets. We also explore key use cases where these methods can mitigate regulatory risks and enable scalable, data-driven innovation without compromising sensitive customer information.
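
Of the techniques listed above, differential privacy is the easiest to show in a few lines. The sketch below applies the classic Laplace mechanism to a counting query. It is a generic textbook illustration, not the configurable pipeline the paper describes; the account balances, the query, and the privacy budget `epsilon` are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Noisy count of matching records.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical account balances; the query counts balances above 10,000.
balances = [1200, 15000, 30500, 800, 22000, 9100]
rng = random.Random(0)
noisy = private_count(balances, lambda b: b > 10000, epsilon=1.0, rng=rng)
print(f"noisy count: {noisy:.2f} (true count is 3)")
```

A smaller `epsilon` means a larger noise scale: stronger privacy, but a less accurate answer, which is the privacy-utility trade-off in the paper's title.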

[NLP-80] Mind the Language Gap: Automated and Augmented Evaluation of Bias in LLMs for High- and Low-Resource Languages

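Quick read: This paper addresses the tendency of large language models (LLMs) to inherit and perpetuate the social biases present in their training data. The key to the solution is a framework called MultiLingual Augmented Bias Testing (MLA-BiTe), which uses automated translation and paraphrasing to enable systematic multilingual bias testing and thereby supports comprehensive bias assessment across diverse linguistic settings.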

Link: https://arxiv.org/abs/2504.18560
Authors: Alessio Buscemi,Cédric Lothritz,Sergio Morales,Marcos Gomez-Vazquez,Robert Clarisó,Jordi Cabot,German Castignani
Affiliations: Luxembourg Institute of Science and Technology; Universitat Oberta de Catalunya; University of Luxembourg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large Language Models (LLMs) have exhibited impressive natural language processing capabilities but often perpetuate social biases inherent in their training data. To address this, we introduce MultiLingual Augmented Bias Testing (MLA-BiTe), a framework that improves prior bias evaluation methods by enabling systematic multilingual bias testing. MLA-BiTe leverages automated translation and paraphrasing techniques to support comprehensive assessments across diverse linguistic settings. In this study, we evaluate the effectiveness of MLA-BiTe by testing four state-of-the-art LLMs in six languages – including two low-resource languages – focusing on seven sensitive categories of discrimination.

[NLP-81] Versatile Framework for Song Generation with Prompt-based Control

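Quick read: This paper aims to address the shortcomings of existing song generation methods, which struggle to generate vocals and accompaniments with prompt-based control and proper alignment and which support only a limited range of tasks. The key to the solution is the VersBand framework, which comprises several core models: VocalBand uses flow matching for fast, high-quality vocal generation with style control; AccompBand is a flow-based transformer that uses the Band-MOE mechanism to improve the quality, alignment, and control of accompaniments; and LyricBand and MelodyBand generate lyrics and melodies respectively, together forming a comprehensive multi-task song generation system.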

Link: https://arxiv.org/abs/2504.19062
Authors: Yu Zhang,Wenxiang Guo,Changhao Pan,Zhiyuan Zhu,Ruiqi Li,Jingyu Lu,Rongjie Huang,Ruiyuan Zhang,Zhiqing Hong,Ziyue Jiang,Zhou Zhao
Affiliations: Zhejiang University
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:


Abstract:Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results demonstrate that VersBand performs better over baseline models across multiple song generation tasks using objective and subjective metrics. Audio samples are available at this https URL.

Computer Vision

[CV-0] CompleteMe: Reference-based Human Image Completion

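Quick read: This paper addresses the difficulty of preserving unique details, such as specific clothing patterns or distinctive accessories, in human image completion, especially when no explicit reference image is available. Even existing reference-based inpainting methods remain limited in accurately capturing and integrating fine-grained details from reference images. The key to the solution is the CompleteMe framework, which combines a dual U-Net architecture with a Region-focused Attention (RFA) block that explicitly guides the model's attention toward relevant regions of the reference image, capturing fine details and ensuring semantic consistency.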

Link: https://arxiv.org/abs/2504.20042
Authors: Yu-Ju Tsai,Brian Price,Qing Liu,Luis Figueroa,Daniil Pakhomov,Zhihong Ding,Scott Cohen,Ming-Hsuan Yang
Affiliations: UC Merced; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL


Abstract:Recent methods for human image completion can reconstruct plausible body shapes but often fail to preserve unique details, such as specific clothing patterns or distinctive accessories, without explicit reference images. Even state-of-the-art reference-based inpainting approaches struggle to accurately capture and integrate fine-grained details from reference images. To address this limitation, we propose CompleteMe, a novel reference-based human image completion framework. CompleteMe employs a dual U-Net architecture combined with a Region-focused Attention (RFA) Block, which explicitly guides the model’s attention toward relevant regions in reference images. This approach effectively captures fine details and ensures accurate semantic correspondence, significantly improving the fidelity and consistency of completed images. Additionally, we introduce a challenging benchmark specifically designed for evaluating reference-based human image completion tasks. Extensive experiments demonstrate that our proposed method achieves superior visual quality and semantic consistency compared to existing techniques. Project page: this https URL

[CV-1] Learning Streaming Video Representation via Multitask Training

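Quick read: This paper aims to address the understanding of continuous video streams in real-time applications, where video must be processed frame by frame, historical information must be preserved, and latency must stay low. The key to the solution is StreamFormer, a new streaming video backbone that incorporates causal temporal attention into a pre-trained vision transformer, enabling efficient streaming video processing while maintaining image representation capability.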

Link: https://arxiv.org/abs/2504.20041
Authors: Yibin Yan,Jilan Xu,Shangzhe Di,Yikun Liu,Yudi Shi,Qirui Chen,Zeqian Li,Yifei Huang,Weidi Xie
Affiliations: Shanghai Jiao Tong University; Fudan University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. Project Page: this https URL

点击查看摘要

Abstract:Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed as StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
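
The causal temporal attention mentioned above can be illustrated with a minimal, dependency-free sketch. It is not StreamFormer's implementation: the single-head layout, the one-feature-vector-per-frame input, and the function name `causal_temporal_attention` are assumptions made only for illustration. The point it shows is that frame t attends to frames 0..t and never to later frames, so outputs for past frames do not change when new frames arrive.

```python
import math


def causal_temporal_attention(queries, keys, values):
    """Single-head attention over a frame sequence with a causal mask.

    queries, keys, values: lists of per-frame feature vectors (lists of floats).
    Frame t only attends to frames 0..t (hypothetical, simplified layout).
    """
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scaled dot-product scores against the current and past frames only.
        scores = [
            sum(a * b for a, b in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs
```

Because each output depends only on earlier frames, a streaming system can cache past keys and values and compute just the newest row when a frame arrives.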

[CV-2] MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion CVPR2025

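Quick read: The paper addresses the degradation of Structure-from-Motion (SfM) under extreme viewpoint changes in low-overlap, low-parallax, or high-symmetry scenarios, where state-of-the-art systems are prone to failure, which limits wider use of SfM by non-expert users. The key to the solution is to augment the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks; tightly integrating monocular and multi-view constraints significantly improves performance under extreme viewpoint changes while maintaining strong performance in standard conditions.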

Link: https://arxiv.org/abs/2504.20040
Authors: Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, Marc Pollefeys
Affiliations: ETH Zurich; Google; Microsoft Spatial AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CVPR 2025

Abstract:While Structure-from-Motion (SfM) has seen much progress over the years, state-of-the-art systems are prone to failure when facing extreme viewpoint changes in low-overlap, low-parallax or high-symmetry scenarios. Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. Thanks to a tight integration of monocular and multi-view constraints, our approach significantly outperforms existing ones under extreme viewpoint changes, while maintaining strong performance in standard conditions. We also show that monocular priors can help reject faulty associations due to symmetries, which is a long-standing problem for SfM. This makes our approach the first capable of reliably reconstructing challenging indoor environments from few images. Through principled uncertainty propagation, it is robust to errors in the priors, can handle priors inferred by different models with little tuning, and will thus easily benefit from future progress in monocular depth and normal estimation. Our code is publicly available at this https URL.

[CV-3] Mitigating Catastrophic Forgetting in the Incremental Learning of Medical Images

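Quick read: The paper aims to improve the accuracy and efficiency of deep learning models for medical image analysis, specifically prostate cancer detection from T2-weighted (T2w) MRI collected at different health centers. The key to the solution is Incremental Learning (IL) combined with Knowledge Distillation (KD): images generated from past tasks guide training on subsequent tasks, so the model retains and applies previously acquired knowledge without direct access to the original data, improving performance and speeding up convergence.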

Link: https://arxiv.org/abs/2504.20033
Authors: Sara Yavari, Jacob Furst
Affiliations: DePaul University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 Pages, 3 Figures, 3 Tables, 1 Algorithm, This paper will be updated

Abstract:This paper proposes an Incremental Learning (IL) approach to enhance the accuracy and efficiency of deep learning models that analyze T2-weighted (T2w) MRI medical images for prostate cancer detection using the PI-CAI dataset. We used multiple health centers’ artificial intelligence and radiology data, focused on different tasks that looked at prostate cancer detection using MRI (PI-CAI). We utilized Knowledge Distillation (KD), as it employs generated images from past tasks to guide the training of models for subsequent tasks. The approach yielded improved performance and faster convergence of the models. To demonstrate the versatility and robustness of our approach, we evaluated it on the PI-CAI dataset, a diverse set of medical imaging modalities including OCT and PathMNIST, and the benchmark continual learning dataset CIFAR-10. Our results indicate that KD can be a promising technique for IL in medical image analysis in which data is sourced from individual health centers and the storage of large datasets is not feasible. By using generated images from prior tasks, our method enables the model to retain and apply previously acquired knowledge without direct access to the original data.
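
The abstract does not state which distillation objective the authors use. As a rough illustration only, the sketch below shows the commonly used temperature-softened formulation, in which a student model is penalized for diverging from a teacher model's softened predictions. The function names, the temperature value, and the T² scaling are assumptions, not details taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed formulation for illustration; the paper may use a different loss.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher, student)
        if p > 0.0
    )
    return kl * temperature ** 2
```

In an incremental setting of the kind described, the teacher would be the model trained on earlier tasks, and the inputs would be the generated images standing in for data that can no longer be accessed.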

[CV-4] More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV

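Quick read: The paper addresses the limited generalization of current oriented object detection (OOD) datasets for unmanned aerial vehicles (UAVs) in real flight scenarios; these datasets are often designed for specific downstream tasks and do not fully demonstrate how algorithms perform in practical environments. The key to the solution is the CODrone dataset, which contains a broad range of annotated images collected from multiple cities under various lighting conditions and addresses four limitations of existing datasets (low image resolution, limited object categories, single-view imaging, and restricted flight altitudes), providing a benchmark with better generalization for UAV OOD research.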

Link: https://arxiv.org/abs/2504.20032
Authors: Kai Ye, Haidi Tang, Bowen Liu, Pingyang Dai, Liujuan Cao, Rongrong Ji
Affiliations: Xiamen University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Applications of unmanned aerial vehicle (UAV) in logistics, agricultural automation, urban management, and emergency response are highly dependent on oriented object detection (OOD) to enhance visual perception. Although existing datasets for OOD in UAV provide valuable resources, they are often designed for specific downstream tasks. Consequently, they exhibit limited generalization performance in real flight scenarios and fail to thoroughly demonstrate algorithm effectiveness in practical environments. To bridge this critical gap, we introduce CODrone, a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements, ensuring greater applicability and robustness in UAV-based OOD. Based on application requirements, we identify four key limitations in current UAV OOD datasets (low image resolution, limited object categories, single-view imaging, and restricted flight altitudes) and propose corresponding improvements to enhance their applicability and robustness. Furthermore, CODrone contains a broad spectrum of annotated images collected from multiple cities under various lighting conditions, enhancing the realism of the benchmark. To rigorously evaluate CODrone as a new benchmark and gain deeper insights into the novel challenges it presents, we conduct a series of experiments based on 22 classical or SOTA methods. This evaluation not only assesses the effectiveness of CODrone in real-world scenarios but also highlights key bottlenecks and opportunities to advance OOD in UAV applications. Overall, CODrone fills the data gap in OOD from UAV perspective and provides a benchmark with enhanced generalization capability, better aligning with practical applications and future algorithm development.

[CV-5] LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields CVPR2025

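Quick read: The paper addresses the difficulty, in multi-view 3D reconstruction, of accurately reconstructing unseen regions, recovering glossy appearance, and producing relightable 3D content that standard graphics engines can consume. The key to the solution is three technical contributions: an update model that progressively adds input views to improve reconstruction; a hexa-plane neural SDF (Signed Distance Field) representation that better recovers detailed textures, geometry, and material parameters; and a novel neural directional-embedding mechanism that handles view-dependent effects. Together these let the model remain accurate while requiring only a fraction of the inference time of optimization-based methods.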

Link: https://arxiv.org/abs/2504.20026
Authors: Zhengqin Li, Dilin Wang, Ka Chen, Zhaoyang Lv, Thu Nguyen-Phuoc, Milim Lee, Jia-Bin Huang, Lei Xiao, Cheng Zhang, Yufeng Zhu, Carl S. Marshall, Yufeng Ren, Richard Newcombe, Zhao Dong
Affiliations: Meta Reality Labs; University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025

Abstract:We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon the recent Large Reconstruction Models (LRMs) that achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately and cannot recover glossy appearance or generate relightable 3D contents that can be consumed by standard Graphics engines. To address these limitations, we make three key technical contributions to build a more practical multi-view 3D reconstruction framework. First, we introduce an update model that allows us to progressively add more input views to improve our reconstruction. Second, we propose a hexa-plane neural SDF representation to better recover detailed textures, geometry and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.

[CV-6] SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

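Quick read: The paper addresses two gaps in current 3D spatial reasoning methods: it is unclear whether the 3D knowledge they acquire generalizes to question types unseen during training, and they typically reason implicitly, without shared explicit 3D representations. The key to the solution is SpatialReasoner, a novel large vision-language model (LVLM) that shares explicit 3D representations across stages (3D perception, computation, and reasoning), providing a coherent interface for advanced 3D spatial reasoning and making it possible to study the factual errors made by LVLMs.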

Link: https://arxiv.org/abs/2504.20024
Authors: Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jieneng Chen, Jianwen Xie, Alan Yuille
Affiliations: DEVCOM Army Research Laboratory; Lambda Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent studies in 3D spatial reasoning explore data-driven approaches and achieve enhanced spatial reasoning performance with reinforcement learning (RL). However, these methods typically perform spatial reasoning in an implicit manner, and it remains underexplored whether the acquired 3D knowledge generalizes to unseen question types at any stage of the training. In this work we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks and generalizes better when evaluated on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

[CV-7] Towards AI-Driven Policing: Interdisciplinary Knowledge Discovery from Police Body-Worn Camera Footage

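Quick read: The paper asks how advanced artificial intelligence (AI) and statistical machine learning (ML) techniques can detect, classify, and analyze patterns of interaction between police officers and civilians in body-worn camera (BWC) footage, in order to identify key behavioral dynamics such as respect, disrespect, escalation, and de-escalation. The key to the solution is multimodal data analysis that integrates video, audio, and natural language processing (NLP) techniques to extract meaningful insights from BWC data.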

Link: https://arxiv.org/abs/2504.20007
Authors: Anita Srbinovska, Angela Srbinovska, Vivek Senthil, Adrian Martin, John McCluskey, Ernest Fokoué
Affiliations: Rochester Institute of Technology; Rochester Police Department; University at Albany
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 2 figures, and 1 table

Abstract:This paper proposes a novel interdisciplinary framework for analyzing police body-worn camera (BWC) footage from the Rochester Police Department (RPD) using advanced artificial intelligence (AI) and statistical machine learning (ML) techniques. Our goal is to detect, classify, and analyze patterns of interaction between police officers and civilians to identify key behavioral dynamics, such as respect, disrespect, escalation, and de-escalation. We apply multimodal data analysis by integrating video, audio, and natural language processing (NLP) techniques to extract meaningful insights from BWC footage. We present our methodology, computational techniques, and findings, outlining a practical approach for law enforcement while advancing the frontiers of knowledge discovery from police BWC data.

[CV-8] Monitoring digestate application on agricultural crops using Sentinel-2 Satellite imagery

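Quick read: The paper addresses how to monitor the application of Exogenous Organic Matter (EOM) in agriculture so that its effects on soil and crop health can be assessed. The key to the solution is combining remote sensing with machine learning (ML): time-series analysis of Sentinel-2 satellite imagery plus several ML models (Random Forest, k-NN, Gradient Boosting, and a feed-forward neural network) to detect digestate application, enabling scalable and cost-effective monitoring of EOM use.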

Link: https://arxiv.org/abs/2504.19996
Authors: Andreas Kalogeras, Dimitrios Bormpoudakis, Iason Tsardanidis, Dimitra A. Loka, Charalampos Kontoes
Affiliations: BEYOND EO Centre, IAASARS, National Observatory of Athens, Athens, Greece; Institute of Industrial and Forage Crops, Hellenic Agricultural Organization (ELGO) “DIMITRA”, Larissa, Greece
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The widespread use of Exogenous Organic Matter in agriculture necessitates monitoring to assess its effects on soil and crop health. This study evaluates optical Sentinel-2 satellite imagery for detecting digestate application, a practice that enhances soil fertility but poses environmental risks like microplastic contamination and nitrogen losses. In the first instance, Sentinel-2 satellite image time series (SITS) analysis of specific indices (EOMI, NDVI, EVI) was used to characterize EOM’s spectral behavior after application on the soils of four different crop types in Thessaly, Greece. Furthermore, Machine Learning (ML) models (namely Random Forest, k-NN, Gradient Boosting and a Feed-Forward Neural Network), were used to investigate digestate presence detection, achieving F1-scores up to 0.85. The findings highlight the potential of combining remote sensing and ML for scalable and cost-effective monitoring of EOM applications, supporting precision agriculture and sustainability.
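
Two of the spectral indices named in the abstract, NDVI and EVI, have widely used standard definitions, sketched below for surface reflectance values in the range 0 to 1. The mapping of Sentinel-2 bands (B8 for near-infrared, B4 for red, B2 for blue) is the usual convention and is stated here as an assumption, as are the function names. EOMI is left out because the abstract does not give its formula.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denominator = nir + red
    if denominator == 0:
        return 0.0  # Avoid division by zero on empty or masked pixels.
    return (nir - red) / denominator


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients.

    EVI = 2.5 * (NIR - Red) / (NIR + 6 * Red - 7.5 * Blue + 1)
    """
    denominator = nir + 6.0 * red - 7.5 * blue + 1.0
    if denominator == 0:
        return 0.0
    return 2.5 * (nir - red) / denominator


def index_time_series(observations):
    """Compute NDVI per acquisition for (date, nir, red) tuples (hypothetical layout)."""
    return [(date, ndvi(nir, red)) for date, nir, red in observations]
```

Applied to each acquisition date of a parcel, such indices form the time series whose behavior after digestate application the study characterizes.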

[CV-9] Mapping of Weed Management Methods in Orchards using Sentinel-2 and PlanetScope Data

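Quick read: The paper addresses the cost, time, and delays of traditional on-ground field surveys for monitoring four weed management methods in orchards (Mowing, Tillage, Chemical-spraying, and No practice). The key to the solution is using Earth Observation (EO) data and Machine Learning (ML) on satellite image time series (SITS) from Sentinel-2 and PlanetScope to map weed management methods efficiently and accurately.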

Link: https://arxiv.org/abs/2504.19991
Authors: Ioannis Kontogiorgakis, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas, Dimitra A. Loka, Christos Noulas, Alexandros Tsitouras, Charalampos Kontoes
Affiliations: BEYOND EO Centre, IAASARS, National Observatory of Athens, Athens, Greece; Artificial Intelligence, Wageningen University & Research, The Netherlands; Institute of Industrial and Forage Crops, Hellenic Agricultural Organization (ELGO) “DIMITRA”, Larissa, Greece
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Effective weed management is crucial for improving agricultural productivity, as weeds compete with crops for vital resources like nutrients and water. Accurate maps of weed management methods are essential for policymakers to assess farmer practices, evaluate impacts on vegetation health, biodiversity, and climate, as well as ensure compliance with policies and subsidies. However, monitoring weed management methods is challenging, as it commonly relies on on-ground field surveys, which are often costly, time-consuming and subject to delays. In order to tackle this problem, we leverage Earth Observation (EO) data and Machine Learning (ML). Specifically, we developed an ML approach for mapping four distinct weed management methods (Mowing, Tillage, Chemical-spraying, and No practice) in orchards using satellite image time series (SITS) data from two different sources: Sentinel-2 (S2) and PlanetScope (PS). The findings demonstrate the potential of ML-driven remote sensing to enhance the efficiency and accuracy of weed management mapping in orchards.

[CV-10] Shopformer: Transformer-Based Framework for Detecting Shoplifting via Human Pose

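Quick read: The paper addresses ineffective shoplifting detection in retail: traditional surveillance relies on human monitoring and only about 2% of shoplifters are arrested, while existing AI-based approaches rely on pixel-level video analysis, which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. The key to the solution is Shopformer, a transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video, together with a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing.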

Link: https://arxiv.org/abs/2504.19970
Authors: Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Babak Rahimi Ardabili, Hamed Tabkhi
Affiliations: University of North Carolina at Charlotte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Shoplifting remains a costly issue for the retail sector, but traditional surveillance systems, which are mostly based on human monitoring, are still largely ineffective, with only about 2% of shoplifters being arrested. Existing AI-based approaches rely on pixel-level video analysis which raises privacy concerns, is sensitive to environmental variations, and demands significant computational resources. To address these limitations, we introduce Shopformer, a novel transformer-based model that detects shoplifting by analyzing pose sequences rather than raw video. We propose a custom tokenization strategy that converts pose sequences into compact embeddings for efficient transformer processing. To the best of our knowledge, this is the first pose-sequence-based transformer model for shoplifting detection. Evaluated on real-world pose data, our method outperforms state-of-the-art anomaly detection models, offering a privacy-preserving, and scalable solution for real-time retail surveillance. The code base for this work is available at this https URL.

[CV-11] Mesh-Learner: Texturing Mesh with Spherical Harmonics

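Quick read: The paper addresses the compatibility of 3D reconstruction and rendering with traditional rasterization pipelines, and the GPU memory limits encountered when training on vast, unbounded scenes. The key to the solution is Mesh-Learner, a framework that integrates meshes and spherical harmonic (SH) textures into the learning process to learn each mesh's view-dependent radiance end to end: images are rendered by interpolating the surrounding SH texels at each pixel's sampling point with a novel interpolation method, and pixel gradients are back-propagated to the related SH texels. Because the framework uses rasterization features such as texture sampling and deferred rendering, it is natively compatible with rasterization-based tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics); transferring only the SH textures inside the view frustum to the GPU keeps GPU memory usage moderate and supports training on large scenes.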

Link: https://arxiv.org/abs/2504.19938
Authors: Yunfei Wan, Jianheng Liu, Jiarong Lin, Fu Zhang
Affiliations: The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh's view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH Texels at each pixel's sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH Texels in SH textures. Mesh-Learner exploits graphic features of rasterization pipeline (texture sampling, deferred rendering) to render, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train vast, unlimited scenes because we transfer only the SH textures within the frustum to the GPU for training. At other times, the SH textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing state-of-the-art methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit the society, the code will be available at this https URL.
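
To make "texture filled with SH coefficients" concrete, the sketch below evaluates view-dependent radiance for one color channel from the first four real spherical harmonic coefficients (degrees 0 and 1). This is a generic SH evaluation, not Mesh-Learner's code: the coefficient ordering, the sign convention for the degree-1 terms, and the function name are assumptions, and the paper's interpolation between texels is not reproduced.

```python
import math

SH_C0 = 0.28209479177387814  # Degree-0 basis constant, sqrt(1 / (4 * pi)).
SH_C1 = 0.4886025119029199   # Degree-1 basis constant, sqrt(3 / (4 * pi)).


def sh_radiance(coefficients, direction):
    """Evaluate degree-0/1 real spherical harmonics for one color channel.

    coefficients: 4 floats stored in one texel (assumed order: constant, y, z, x).
    direction: viewing direction as an (x, y, z) tuple; it is normalized here.
    Sign conventions for the degree-1 terms differ between implementations.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    basis = [SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x]
    return sum(c * b for c, b in zip(coefficients, basis))
```

A texel whose only non-zero coefficient is the first one yields the same radiance from every direction; the degree-1 coefficients add the direction-dependent part.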

[CV-12] Enhancing Quality for VVC Compressed Videos with Omniscient Quality Enhancement Model

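Quick read: The paper addresses the demand for higher perceptual quality at the decoder side, alongside compression performance at the encoder side, for videos compressed with H.266/VVC (Versatile Video Coding). The key to the solution is OVQE-VVC, an enhancement network that modifies the Omniscient video quality enhancement network originally designed for HEVC-compressed videos and integrates it into the latest STD-VVC decoder architecture. Using spatial-temporal features and cross-frequency information, it improves PSNR by around 0.74 dB and up to 1.2 dB over the original STD-VVC codec, which corresponds to roughly 19.6% bitrate savings at similar quality.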

Link: https://arxiv.org/abs/2504.19935
Authors: Xiem HoangVan, Hieu Bui Minh, Sang NguyenQuang, Wen-Hsiao Peng
Affiliations: University of Engineering and Technology; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The latest video coding standard H.266/VVC has shown its great improvement in terms of compression performance when compared to its predecessor HEVC standard. Though VVC was implemented with many advanced techniques, it still met the same challenges as its predecessor due to the need for even higher perceptual quality demand at the decoder side as well as the compression performance at the encoder side. The advancement of Artificial Intelligence (AI) technology, notably the deep learning-based video quality enhancement methods, was shown to be a promising approach to improving the perceptual quality experience. In this paper, we propose a novel Omniscient video quality enhancement Network for VVC compressed Videos. The Omniscient Network for compressed video quality enhancement was originally designed for HEVC compressed videos in which not only the spatial-temporal features but also cross-frequencies information were employed to augment the visual quality. Inspired by this work, we propose a modification of the OVQE model and integrate it into the latest STD-VVC (Standard Versatile Video Coding) decoder architecture. As assessed in a rich set of test conditions, the proposed OVQE-VVC solution is able to achieve significant PSNR improvement, notably around 0.74 dB and up to 1.2 dB with respect to the original STD-VVC codec. This also corresponds to around 19.6% of bitrate saving while keeping a similar quality observation.
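
The gains above are reported in PSNR, whose standard definition is PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two equally sized lists of pixel values; the flat-list input and the 8-bit peak value of 255 are assumptions chosen for brevity, and the function name is illustrative.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # Identical signals: PSNR is unbounded.
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A quality-enhancement gain is then the difference `psnr(original, enhanced) - psnr(original, decoded)`, measured against the same uncompressed original.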

[CV-13] Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI

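Quick read: The paper addresses automatic summarization of surgical videos, with the aim of improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. The key to the solution is a multimodal framework that combines recent advances in computer vision and large language models (LLMs) in three stages: visual transformers extract frame-level visual features from surgical video clips; LLMs turn those features into frame-level captions, which are combined with temporal features from a ViViT-based encoder to produce clip-level summaries; and a dedicated LLM tailored for summarization aggregates the clip-level descriptions into a full surgical report.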

Link: https://arxiv.org/abs/2504.19918
Authors: Hugo Georgenthum, Cristian Cosentino, Fabrizio Marozzo, Pietro Liò
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The automatic summarization of surgical videos is essential for enhancing procedural documentation, supporting surgical training, and facilitating post-operative analysis. This paper presents a novel method at the intersection of artificial intelligence and medicine, aiming to develop machine learning models with direct real-world applications in surgical contexts. We propose a multi-modal framework that leverages recent advancements in computer vision and large language models to generate comprehensive video summaries. The approach is structured in three key stages. First, surgical videos are divided into clips, and visual features are extracted at the frame level using visual transformers. This step focuses on detecting tools, tissues, organs, and surgical actions. Second, the extracted features are transformed into frame-level captions via large language models. These are then combined with temporal features, captured using a ViViT-based encoder, to produce clip-level summaries that reflect the broader context of each video segment. Finally, the clip-level descriptions are aggregated into a full surgical report using a dedicated LLM tailored for the summarization task. We evaluate our method on the CholecT50 dataset, using instrument and action annotations from 50 laparoscopic videos. The results show strong performance, achieving 96% precision in tool detection and a BERT score of 0.74 for temporal context summarization. This work contributes to the advancement of AI-assisted tools for surgical reporting, offering a step toward more intelligent and reliable clinical documentation.

[CV-14] Breast Cancer Detection from Multi-View Screening Mammograms with Visual Prompt Tuning

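Quick read: The paper addresses accurate breast cancer detection from high-resolution mammograms, in particular the challenge of integrating multi-view data at large scale and high resolution. It proposes the Multi-view Visual Prompt Tuning Network (MVPT-NET): a robust single-view classification model is first pretrained on high-resolution mammograms, and multi-view feature learning is then adapted into a task-specific prompt tuning process that tunes only a small set of trainable parameters (7%) while retaining the robustness of the pretrained model, enabling efficient integration of multi-view data.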

Link: https://arxiv.org/abs/2504.19900
Authors: Han Chen, Anne L. Martel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate detection of breast cancer from high-resolution mammograms is crucial for early diagnosis and effective treatment planning. Previous studies have shown the potential of using single-view mammograms for breast cancer detection. However, incorporating multi-view data can provide more comprehensive insights. Multi-view classification, especially in medical imaging, presents unique challenges, particularly when dealing with large-scale, high-resolution data. In this work, we propose a novel Multi-view Visual Prompt Tuning Network (MVPT-NET) for analyzing multiple screening mammograms. We first pretrain a robust single-view classification model on high-resolution mammograms and then innovatively adapt multi-view feature learning into a task-specific prompt tuning process. This technique selectively tunes a minimal set of trainable parameters (7%) while retaining the robustness of the pre-trained single-view model, enabling efficient integration of multi-view data without the need for aggressive downsampling. Our approach offers an efficient alternative to traditional feature fusion methods, providing a more robust, scalable, and efficient solution for high-resolution mammogram analysis. Experimental results on a large multi-institution dataset demonstrate that our method outperforms conventional approaches while maintaining detection efficiency, achieving an AUROC of 0.852 for distinguishing between Benign, DCIS, and Invasive classes. This work highlights the potential of MVPT-NET for medical imaging tasks and provides a scalable solution for integrating multi-view data in breast cancer detection.

[CV-15] CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition

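Quick read: The paper addresses generation for cinematic scene composition, in particular keeping frames consistent and continuous across multiple shots while handling challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. The key to the solution is a two-stage approach: a large language model (LLM) first turns a high-level scene description into a detailed plan for the overall setting, the characters, and the individual shots; a fine-tuned text-to-image generation model then synthesizes high-quality visual keyframes, yielding visually coherent and contextually rich movie scenes.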

Link: https://arxiv.org/abs/2504.19894
Authors: Quynh Phung, Long Mai, Fabian David Caba Heilbron, Feng Liu, Jia-Bin Huang, Cusuh Ham
Affiliations: University of Maryland, College Park; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: link website: this https URL

Abstract:We present CineVerse, a novel framework for the task of cinematic scene composition. Similar to traditional multi-shot generation, our task emphasizes the need for consistency and continuity across frames. However, our task also focuses on addressing challenges inherent to filmmaking, such as multiple characters, complex interactions, and visual cinematic effects. In order to learn to generate such content, we first create the CineVerse dataset. We use this dataset to train our proposed two-stage approach. First, we prompt a large language model (LLM) with task-specific instructions to take in a high-level scene description and generate a detailed plan for the overall setting and characters, as well as the individual shots. Then, we fine-tune a text-to-image generation model to synthesize high-quality visual keyframes. Experimental results demonstrate that CineVerse yields promising improvements in generating visually coherent and contextually rich movie scenes, paving the way for further exploration in cinematic video synthesis.

[CV-16] Enhancing breast cancer detection on screening mammogram using self-supervised learning and a hybrid deep model of Swin Transformer and Convolutional Neural Network

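Quick read: The paper addresses the scarcity of high-quality labeled medical training data, one of the major limitations in applying AI systems to breast cancer diagnosis. It proposes a method based on self-supervised learning (SSL) and a deep hybrid model, HybMNet, which combines local self-attention with fine-grained feature extraction to improve breast cancer detection on screening mammograms. The method uses two stages: SSL pretraining of a Swin Transformer on a limited set of mammograms, followed by downstream training in which the pretrained Swin Transformer serves as the backbone and is combined with a CNN-based network and a novel fusion strategy that integrates global and local information to improve classification.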

Link: https://arxiv.org/abs/2504.19888
Authors: Han Chen, Anne L. Martel
Affiliations: Sunnybrook Research Institute; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Purpose: The scarcity of high-quality curated labeled medical training data remains one of the major limitations in applying artificial intelligence (AI) systems to breast cancer diagnosis. Deep models for mammogram analysis and mass (or micro-calcification) detection require training with a large volume of labeled images, which are often expensive and time-consuming to collect. To reduce this challenge, we proposed a novel method that leverages self-supervised learning (SSL) and a deep hybrid model, named HybMNet, which combines local self-attention and fine-grained feature extraction to enhance breast cancer detection on screening mammograms. Approach: Our method employs a two-stage learning process: (1) SSL Pretraining: We utilize EsViT, an SSL technique, to pretrain a Swin Transformer (Swin-T) using a limited set of mammograms. The pretrained Swin-T then serves as the backbone for the downstream task. (2) Downstream Training: The proposed HybMNet combines the Swin-T backbone with a CNN-based network and a novel fusion strategy. The Swin-T employs local self-attention to identify informative patch regions from the high-resolution mammogram, while the CNN-based network extracts fine-grained local features from the selected patches. A fusion module then integrates global and local information from both networks to generate robust predictions. The HybMNet is trained end-to-end, with the loss function combining the outputs of the Swin-T and CNN modules to optimize feature extraction and classification performance. Results: The proposed method was evaluated for its ability to detect breast cancer by distinguishing between benign (normal) and malignant mammograms. Leveraging SSL pretraining and the HybMNet model, it achieved AUC of 0.864 (95% CI: 0.852, 0.875) on the CMMD dataset and 0.889 (95% CI: 0.875, 0.903) on the INbreast dataset, highlighting its effectiveness.
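
The results above are reported as AUC for a binary benign-versus-malignant task. As a reading aid, the sketch below computes AUC from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the 0/1 label encoding are assumptions; this is not the authors' evaluation code, and the confidence intervals they report are not reproduced here.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 0 (negative) or 1 (positive); scores: predicted scores.
    Quadratic in the number of samples, which is acceptable for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # Ties contribute half a win.
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.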

[CV-17] Federated Out-of-Distribution Generalization: A Causal Augmentation View IJCNN2025

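Quick read: The paper addresses poor model generalization in federated learning caused by data bias across clients, in particular the limited performance of existing methods on out-of-distribution samples and the quality gap between augmented and original data. The key to the solution is FedCAug, a causality-inspired data augmentation method: a causal region localization module and a causality-inspired data augmentation module break the spurious correlation between attributes and categories and increase data diversity without sharing any information between clients, which protects data privacy and improves model performance.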

Link: https://arxiv.org/abs/2504.19882
Authors: Runhui Zhang, Sijin Zhou, Zhuang Qi
Affiliations: School of Software, Shandong University, Jinan, China; AIM Lab, Faculty of IT, Monash University, Clayton, VIC, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCNN 2025 Accepted

Abstract:Federated learning aims to collaboratively model by integrating multi-source information to obtain a model that can generalize across all client data. Existing methods often leverage knowledge distillation or data augmentation to mitigate the negative impact of data bias across clients. However, the limited performance of teacher models on out-of-distribution samples and the inherent quality gap between augmented and original data hinder their effectiveness and they typically fail to leverage the advantages of incorporating rich contextual information. To address these limitations, this paper proposes a Federated Causal Augmentation method, termed FedCAug, which employs causality-inspired data augmentation to break the spurious correlation between attributes and categories. Specifically, it designs a causal region localization module to accurately identify and decouple the background and objects in the image, providing rich contextual information for causal data augmentation. Additionally, it designs a causality-inspired data augmentation module that integrates causal features and within-client context to generate counterfactual samples. This significantly enhances data diversity, and the entire process does not require any information sharing between clients, thereby contributing to the protection of data privacy. Extensive experiments conducted on three datasets reveal that FedCAug markedly reduces the model’s reliance on background to predict sample labels, achieving superior performance compared to state-of-the-art methods.

[CV-18] Using Fixed and Mobile Eye Tracking to Understand How Visitors View Art in a Museum: A Study at the Bowes Museum County Durham UK

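【Quick read】: This paper studies how eye tracking can be used to understand the viewing behaviour of visitors looking at art in a physical gallery, with the aim of recommending better ways to display a museum's collections and increase visitor engagement. The key to its approach is the combination of fixed and mobile eye tracking with an interdisciplinary team (researchers from digital humanities, psychology, art history, and computer science) working with museum professionals to collect detailed data on visitors' visual behaviour.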

Link: https://arxiv.org/abs/2504.19881
Authors: Claire Warwick,Andrew Beresford,Soazig Casteau,Hubert P. H. Shum,Dan Smith,Francis Xiatian Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The following paper describes a collaborative project involving researchers at Durham University, and professionals at the Bowes Museum, Barnard Castle, County Durham, UK, during which we used fixed and mobile eye tracking to understand how visitors view art. Our study took place during summer 2024 and builds on work presented at DH2017 (Bailey-Ross et al., 2017). Our interdisciplinary team included researchers from digital humanities, psychology, art history and computer science, working in collaboration with professionals from the museum. We used fixed and mobile eye tracking to understand how museum visitors view art in a physical gallery setting. This research will enable us to make recommendations about how the Museum’s collections could be more effectively displayed, encouraging visitors to engage with them more fully.

[CV-19] DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images

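【Quick read】: This paper addresses two weaknesses of existing AI-generated image detectors: poor generalization across generative models and high sensitivity to minor perturbations. The key to its solution is DeeFuser, a fusion module that combines high-level and low-level features to improve robustness against degradations such as compression and blurring, together with a triplet loss that refines the embedding space and sharpens the separation between real and synthetic content. In addition, parameter-efficient fine-tuning with low-rank adaptation (LoRA) enables lightweight adaptation while preserving pre-trained knowledge, supporting effective zero-shot learning.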

Link: https://arxiv.org/abs/2504.19876
Authors: Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Abdenour Hadid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model’s ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at this https URL for research purposes.
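
The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. The Euclidean distance, the margin of 0.5, and the toy 2-D embeddings below are assumptions chosen for illustration; the abstract does not state which distance or margin DeeCLIP uses.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is a hypothetical choice, not taken from the paper.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings (hypothetical): two "real" images and one "synthetic" image.
real_a = [0.0, 0.0]
real_b = [0.0, 1.0]
fake = [3.0, 4.0]

# The positive is already much closer than the negative, so the loss is zero.
well_separated = triplet_loss(real_a, real_b, fake)

# Swapping the roles violates the margin and yields a positive loss.
violating = triplet_loss(real_a, fake, real_b)
```

The loss is zero once same-class embeddings are closer to the anchor than other-class embeddings by at least the margin, which is what pushes real and synthetic content apart in the embedding space.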

[CV-20] Towards Ball Spin and Trajectory Analysis in Table Tennis Broadcast Videos via Physically Grounded Synthetic-to-Real Transfer CVPR

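【Quick read】: This paper addresses the inference of a table tennis ball's 3D trajectory and initial spin from monocular broadcast video. Spin cannot be observed directly in standard broadcast footage, but it can be inferred from the ball's trajectory. The key to its solution is a neural network trained solely on synthetic data: the input data representation, physically correct synthetic training data, and targeted augmentations allow the network to generalize to real data without any real training data.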

Link: https://arxiv.org/abs/2504.19863
Authors: Daniel Kienzle,Robin Schön,Rainer Lienhart,Shin’Ichi Satoh
Affiliations: University of Augsburg, Germany; National Institute of Informatics, Japan; University of Tokyo, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be published in 2025 IEEE/CVF International Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Abstract:Analyzing a player’s technique in table tennis requires knowledge of the ball’s 3D trajectory and spin. While the spin is not directly observable in standard broadcasting videos, we show that it can be inferred from the ball’s trajectory in the video. We present a novel method to infer the initial spin and 3D trajectory from the corresponding 2D trajectory in a video. Without ground truth labels for broadcast videos, we train a neural network solely on synthetic data. Due to the choice of our input data representation, physically correct synthetic training data, and using targeted augmentations, the network naturally generalizes to real data. Notably, these simple techniques are sufficient to achieve generalization. No real data at all is required for training. To the best of our knowledge, we are the first to present a method for spin and trajectory prediction in simple monocular broadcast videos, achieving an accuracy of 92.0% in spin classification and a 2D reprojection error of 0.19% of the image diagonal.
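
To make the reported metric concrete, the sketch below computes a 2D reprojection error as a percentage of the image diagonal. The point coordinates and frame sizes are hypothetical; the abstract does not state the video resolution, so the 1920x1080 conversion is only an illustration of scale.

```python
import math


def reprojection_error_percent(predicted, observed, width, height):
    """Mean 2D distance between reprojected and observed points,
    expressed as a percentage of the image diagonal."""
    diagonal = math.hypot(width, height)
    distances = [
        math.hypot(px - ox, py - oy)
        for (px, py), (ox, oy) in zip(predicted, observed)
    ]
    return 100.0 * (sum(distances) / len(distances)) / diagonal


# Hypothetical ball positions in a 300x400 frame (diagonal = 500 px).
observed = [(100.0, 100.0), (150.0, 120.0)]
predicted = [(103.0, 104.0), (153.0, 124.0)]  # each point is off by 5 px
error = reprojection_error_percent(predicted, observed, width=300, height=400)

# What 0.19% of the diagonal would mean in pixels for an assumed 1920x1080 frame.
pixels_at_full_hd = 0.0019 * math.hypot(1920, 1080)
```

Normalizing by the diagonal makes the error comparable across videos of different resolutions.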

[CV-21] CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

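【Quick read】: This paper addresses the difficulty that Score Distillation Sampling (SDS)-based text-to-3D generation methods have in preserving semantic consistency for complex scenes with multiple interacting objects. Existing methods improve 3D consistency by fine-tuning multiview diffusion models on 3D datasets, but this strategy worsens text-3D alignment. The key to its solution is a new SDS objective, Textual Coherent Score Distillation (TCSD), which incorporates alignment feedback from multimodal large language models (MLLMs) and uses their cross-modal understanding to assess and guide text-3D correspondence during optimization. The paper also proposes an LLM-layout initialization that accelerates optimization convergence, leading to better text-aligned 3D generation.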

Link: https://arxiv.org/abs/2504.19860
Authors: Chenhan Jiang,Yihan Zeng,Hang Xu,Dit-Yan Yeung
Affiliations: Hong Kong University of Science and Technology; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Score Distillation Sampling (SDS) has achieved remarkable success in text-to-3D content generation. However, SDS-based methods struggle to maintain semantic fidelity for user prompts, particularly when involving multiple objects with intricate interactions. While existing approaches often address 3D consistency through multiview diffusion model fine-tuning on 3D datasets, this strategy inadvertently exacerbates text-3D alignment degradation. The limitation stems from SDS’s inherent accumulation of view-independent biases during optimization, which progressively diverges from the ideal text alignment direction. To alleviate this limitation, we propose a novel SDS objective, dubbed Textual Coherent Score Distillation (TCSD), which integrates alignment feedback from multimodal large language models (MLLMs). Our TCSD leverages cross-modal understanding capabilities of MLLMs to assess and guide the text-3D correspondence during the optimization. We further develop 3DLLaVA-CRITIC - a fine-tuned MLLM specialized for evaluating multiview text alignment in 3D generations. Additionally, we introduce an LLM-layout initialization that significantly accelerates optimization convergence through semantic-aware spatial configuration. Comprehensive evaluations demonstrate that our framework, CoherenDream, establishes state-of-the-art performance in text-aligned 3D generation across multiple benchmarks, including T$^3$Bench and TIFA subset. Qualitative results showcase the superior performance of CoherenDream in preserving textual consistency and semantic interactions. As the first study to incorporate MLLMs into SDS optimization, we also conduct extensive ablation studies to explore optimal MLLM adaptations for 3D generation tasks.

[CV-22] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

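【Quick read】: This paper addresses the limitations of existing Visual-Language-Action (VLA) models in zero-shot scenarios: weak visual encoding that causes failures in tasks such as object grasping, and large model sizes (often over 7B parameters) whose computational overhead makes them impractical for real-time robotic environments where speed and efficiency matter. The key to its solution is NORA, a 3B-parameter model built on the Qwen-2.5-VL-3B multimodal backbone for stronger visual-semantic understanding. It is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation, maintaining strong task performance at much lower computational cost.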

Link: https://arxiv.org/abs/2504.19854
Authors: Chia-Yu Hung,Qi Sun,Pengfei Hong,Amir Zadeh,Chuan Li,U-Xuan Tan,Navonil Majumder,Soujanya Poria
Affiliations: Singapore University of Technology and Design; Lambda Labs
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.

[CV-23] Foundation Model-Driven Framework for Human-Object Interaction Prediction with Segmentation Mask Integration

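【Quick read】: This paper addresses the limited expressiveness and flexibility of traditional detection-based Human-Object Interaction (HOI) methods; the core question is how to model the interaction between humans and objects more precisely. The key to its solution is Seg2HOI, a framework that combines segmentation-based vision foundation models with the HOI task. Besides the standard triplets, it predicts quadruplets that add segmentation masks for each human-object pair, which makes HOI detection more expressive. Seg2HOI also inherits properties of the vision foundation model (such as promptable and interactive mechanisms) and applies them to HOI through a decoder, and these properties work efficiently without additional training, showing good generalization and flexibility.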

Link: https://arxiv.org/abs/2504.19847
Authors: Juhan Park,Kyungjae Lee,Hyung Jin Chang,Jungchan Cho
Affiliations: cau.ac.kr (Chung-Ang University); korea.ac.kr (Korea University); bham.ac.uk (University of Birmingham); gachon.ac.kr (Gachon University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we introduce the Segmentation to Human-Object Interaction (Seg2HOI) approach, a novel framework that integrates segmentation-based vision foundation models with the human-object interaction task, distinguished from traditional detection-based Human-Object Interaction (HOI) methods. Our approach enhances HOI detection by not only predicting the standard triplets but also introducing quadruplets, which extend HOI triplets by including segmentation masks for human-object pairs. More specifically, Seg2HOI inherits the properties of the vision foundation model (e.g., promptable and interactive mechanisms) and incorporates a decoder that applies these attributes to the HOI task. Despite training only for HOI, without additional training mechanisms for these properties, the framework demonstrates that such features still operate efficiently. Extensive experiments on two public benchmark datasets demonstrate that Seg2HOI achieves performance comparable to state-of-the-art methods, even in zero-shot scenarios. Lastly, we propose that Seg2HOI can generate HOI quadruplets and interactive HOI segmentation from novel text and visual prompts that were not used during training, making it versatile for a wide range of applications by leveraging this flexibility.

[CV-24] SRMF: A Data Augmentation and Multimodal Fusion Approach for Long-Tail UHR Satellite Image Segmentation

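【Quick read】: This paper addresses the long-tail problem in semantic segmentation of ultra-high-resolution (UHR) satellite imagery, that is, the effect of a highly imbalanced class distribution on model performance. The key to its solution is a multi-scale cropping technique combined with a data augmentation strategy based on semantic reordering and resampling, which eases data scarcity for long-tail classes. It also proposes a multimodal fusion-based method for injecting general representation knowledge, which for the first time fuses text and visual features without requiring text descriptions of individual regions, yielding more robust features.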

Link: https://arxiv.org/abs/2504.19839
Authors: Yulong Guo,Zilun Zhang,Yongheng Shang,Tiancheng Zhao,Shuiguang Deng,Yingchun Yang,Jianwei Yin
Affiliations: Zhejiang University; Haina Institute of Zhejiang University; Advanced Technology Institute, Zhejiang University; Binjiang Research Institute of Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:The long-tail problem presents a significant challenge to the advancement of semantic segmentation in ultra-high-resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi-branch network architectures that emphasize multi-scale feature extraction and fusion, they have often overlooked the importance of addressing the long-tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long-tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long-tail class distribution by incorporating a multi-scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion-based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33%, 0.66%, and 0.98%, respectively, achieving state-of-the-art performance. Code is available at: this https URL.

[CV-25] AnimateAnywhere: Rouse the Background in Human Image Animation

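【Quick read】: This paper addresses the neglect of background generation in human image animation: existing methods focus on human actions, which typically leads to static backgrounds or inharmonious movements. The key to its solution is AnimateAnywhere, a framework that needs no camera trajectory. It introduces a background motion learner (BML) that learns background motion from human pose sequences, and it applies an epipolar constraint to help the model learn more accurate cross-frame correspondences.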

Link: https://arxiv.org/abs/2504.19834
Authors: Xiaoyu Liu,Mingshuai Yao,Yabo Zhang,Xianhui Lin,Peiran Ren,Xiaoming Li,Ming Liu,Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human image animation aims to generate human videos of given characters and backgrounds that adhere to the desired pose sequence. However, existing methods focus more on human actions while neglecting the generation of background, which typically leads to static results or inharmonious movements. The community has explored camera pose-guided animation tasks, yet preparing the camera trajectory is impractical for most entertainment applications and ordinary users. As a remedy, we present an AnimateAnywhere framework, rousing the background in human image animation without requirements on camera trajectories. In particular, based on our key insight that the movement of the human body often reflects the motion of the background, we introduce a background motion learner (BML) to learn background motions from human pose sequences. To encourage the model to learn more accurate cross-frame correspondences, we further deploy an epipolar constraint on the 3D attention map. Specifically, the mask used to suppress geometrically unreasonable attention is carefully constructed by combining an epipolar mask and the current 3D attention map. Extensive experiments demonstrate that our AnimateAnywhere effectively learns the background motion from human pose sequences, achieving state-of-the-art performance in generating human animation results with vivid and realistic backgrounds. The source code and model will be available at this https URL.
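
The epipolar constraint named in the abstract can be sketched in isolation. The example below only shows the standard geometric test (distance of a candidate pixel to the epipolar line `l = F x`); the fundamental matrix, the pixel threshold, and the function names are assumptions for illustration, and the paper's actual mask additionally combines this with the current 3D attention map in a way the abstract does not detail.

```python
import math


def epipolar_line(fundamental, point):
    """Epipolar line l = F @ x (homogeneous coefficients a, b, c) in the
    second frame for a pixel `point` = (u, v) in the first frame."""
    x = (point[0], point[1], 1.0)
    return tuple(sum(fundamental[r][c] * x[c] for c in range(3)) for r in range(3))


def distance_to_line(line, point):
    """Perpendicular pixel distance from `point` to the line a*u + b*v + c = 0."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c) / math.hypot(a, b)


def epipolar_mask(fundamental, query, keys, threshold=2.0):
    """True for key pixels that lie within `threshold` pixels of the query's
    epipolar line; the threshold is a hypothetical value."""
    line = epipolar_line(fundamental, query)
    return [distance_to_line(line, key) <= threshold for key in keys]


# Assumed fundamental matrix for a purely horizontal camera translation with
# identity intrinsics: epipolar lines are the image rows.
F_HORIZONTAL = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]

mask = epipolar_mask(
    F_HORIZONTAL,
    query=(10.0, 20.0),
    keys=[(50.0, 20.0), (50.0, 21.5), (50.0, 30.0)],
)
```

Attention from the query to keys where the mask is false would be considered geometrically implausible and could be suppressed.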

[CV-26] HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination SIGGRAPH2025

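【Quick read】: This paper addresses gaze estimation during hand-object interactions (HOI) in extended reality (XR), a setting in which conventional methods struggle to capture gaze direction accurately. The key to its solution is to exploit the close coordination between eye, hand, and head movements to identify the training samples with coordinated movement, which effectively denoises the training data. This contrasts with earlier approaches that treated all training samples as equal. A hierarchical framework, a cross-modal Transformer fusion mechanism, and a new eye-head coordination loss together improve gaze estimation accuracy significantly.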

Link: https://arxiv.org/abs/2504.19828
Authors: Zhiming Hu,Daniel Haeufle,Syn Schmitt,Andreas Bulling
Affiliations: University of Stuttgart; University of Tuebingen; The Center for Bionic Intelligence Tuebingen Stuttgart
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at SIGGRAPH 2025, link: this https URL

Abstract:We present HOIGaze - a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: The eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training - as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.

[CV-27] Taming the Randomness: Towards Label-Preserving Cropping in Contrastive Learning

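【Quick read】: This paper addresses a weakness of contrastive learning (CL): random image augmentation, and random cropping in particular, can produce views that are semantically too far from the original image, causing false self-labeling and reducing the effectiveness of the method. The key to its solution is two novel parameterized cropping methods that make self-labeling more robust. Experiments show that, compared with non-parameterized random cropping, they improve model accuracy on CIFAR-10 classification by between 2.7% and 12.4%, depending on the crop size.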

Link: https://arxiv.org/abs/2504.19824
Authors: Mohamed Hassan,Mohammad Wasil,Sebastian Houben
Affiliations: Hochschule Bonn-Rhein-Sieg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contrastive learning (CL) approaches have gained great recognition as a very successful subset of self-supervised learning (SSL) methods. SSL enables learning from unlabeled data, a crucial step in the advancement of deep learning, particularly in computer vision (CV), given the plethora of unlabeled image data. CL works by comparing different random augmentations (e.g., different crops) of the same image, thus achieving self-labeling. Nevertheless, randomly augmenting images and especially random cropping can result in an image that is semantically very distant from the original and therefore leads to false labeling, hence undermining the efficacy of the methods. In this research, two novel parameterized cropping methods are introduced that increase the robustness of self-labeling and consequently increase the efficacy. The results show that the use of these methods significantly improves the accuracy of the model by between 2.7% and 12.4% on the downstream task of classifying CIFAR-10, depending on the crop size compared to that of the non-parameterized random cropping method.

[CV-28] Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density

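【Quick read】: This paper addresses the parameterization of global lightning flash density, that is, how to model and predict the nonlinear relationship between large-scale environmental conditions and lightning activity. The key to its solution is Mjölnir, a deep learning framework that uses an InceptionNeXt backbone with SENet and a multi-task learning strategy to predict lightning occurrence and magnitude simultaneously. At daily temporal resolution and 1 degree spatial resolution, it accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity.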

Link: https://arxiv.org/abs/2504.19822
Authors: Minjong Cheon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Recent advances in AI-based weather forecasting models, such as FourCastNet, Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep learning to emulate complex atmospheric dynamics. Building on this momentum, we propose Mjölnir, a novel deep learning-based framework for global lightning flash density parameterization. Trained on ERA5 atmospheric predictors and World Wide Lightning Location Network (WWLLN) observations at a daily temporal resolution and 1 degree spatial resolution, Mjölnir captures the nonlinear mapping between large-scale environmental conditions and lightning activity. The model architecture is based on the InceptionNeXt backbone with SENet, and a multi-task learning strategy to simultaneously predict lightning occurrence and magnitude. Extensive evaluations show that Mjölnir accurately reproduces the global distribution, seasonal variability, and regional characteristics of lightning activity, achieving a global Pearson correlation coefficient of 0.96 for annual mean fields. These results suggest that Mjölnir serves not only as an effective data-driven global lightning parameterization but also as a promising AI-based scheme for next-generation Earth system models (AI-ESMs).

[CV-29] Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video

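【Quick read】: This paper addresses the reliance of Neural Radiance Fields (NeRF) on accurately precomputed camera poses during training. Existing joint optimization methods depend on good pose initialization or depth priors and perform poorly in challenging scenarios such as large rotations. The key to its solution is to model continuous camera motion as time-dependent angular and linear velocity: relative motions between cameras are first learned through velocity integration and then aggregated into a world coordinate system defined at a single time step, which removes the dependence on priors. Accurate continuous camera motion is learned with a time-dependent NeRF, and the learned motion is then used to fine-tune the NeRF to represent the full scene geometry.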

Link: https://arxiv.org/abs/2504.19819
Authors: Hoang Chuong Nguyen,Wei Mao,Jose M. Alvarez,Miaomiao Liu
Affiliations: Australian National University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Radiance Fields (NeRF) have demonstrated a superior capability to represent 3D geometry but require accurately precomputed camera poses during training. To mitigate this requirement, existing methods jointly optimize camera poses and NeRF, often relying on good pose initialisation or depth priors. However, these approaches struggle in challenging scenarios, such as large rotations, as they map each camera to a world coordinate system. We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and linear velocity. Relative motions between cameras are learned first via velocity integration, while camera poses can be obtained by aggregating such relative motions up to a world coordinate system defined at a single time step within the video. Specifically, accurate continuous camera movements are learned through a time-dependent NeRF, which captures local scene geometry and motion by training from neighboring frames for each time step. The learned motions enable fine-tuning the NeRF to represent the full scene geometry. Experiments on Co3D and Scannet show our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods. Our code is available at this https URL.
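
The idea of obtaining camera poses by integrating velocities and chaining the resulting relative motions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: velocities are taken as piecewise constant over each time step, expressed in the camera's own frame, and integrated with Rodrigues' rotation formula; all function names are hypothetical.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def rotation_step(omega, dt):
    """Relative rotation for constant angular velocity `omega` (rad/s) over `dt`
    seconds, via Rodrigues' formula."""
    wx, wy, wz = (w * dt for w in omega)
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    if theta < 1e-12:
        return IDENTITY
    kx, ky, kz = wx / theta, wy / theta, wz / theta
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def integrate_motion(velocities, dt):
    """Chain relative motions into poses in the frame of the first time step."""
    rotation, position = IDENTITY, [0.0, 0.0, 0.0]
    poses = [(rotation, position)]
    for omega, linear in velocities:
        # Move along the current heading, then apply the relative rotation.
        displacement = matvec(rotation, [v * dt for v in linear])
        position = [p + d for p, d in zip(position, displacement)]
        rotation = matmul(rotation, rotation_step(omega, dt))
        poses.append((rotation, position))
    return poses


# Two steps: move 1 unit forward along x, then turn 90 degrees about z each time.
quarter_turn = (0.0, 0.0, math.pi / 2)
poses = integrate_motion([(quarter_turn, (1.0, 0.0, 0.0))] * 2, dt=1.0)
final_rotation, final_position = poses[-1]
```

Because each pose is built only from relative motions, no per-camera initialisation in a global frame is needed; the world frame is simply the first time step in this sketch.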

[CV-30] Learning Brenier Potentials with Convex Generative Adversarial Neural Networks

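【Quick read】: This paper develops the statistical learning theory of generative adversarial networks (GANs) that learn the Brenier potential, and in particular studies how to guarantee strict convexity of the learned potential so that its gradient is a valid optimal transport map. The key to its solution is a universal approximation theory for ReCU networks (neural networks with a cubic activation), combined with an adversarial training procedure that adds a penalty term enforcing (strict) convexity to the classical discriminator cross-entropy loss. Theoretical analysis and experiments show that, for a suitable penalty parameter, all networks chosen in the adversarial min-max optimization problem are strictly convex, which in turn yields consistency of the learning procedure.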

Link: https://arxiv.org/abs/2504.19779
Authors: Claudia Drygala,Hanno Gottschalk,Thomas Kruse,Ségolène Martin,Annika Mütze
Affiliations: University of Wuppertal, School of Mathematics and Natural Sciences, IMACM & IZMD; Technical University Berlin
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Brenier proved that under certain conditions on a source and a target probability measure there exists a strictly convex function such that its gradient is a transport map from the source to the target distribution. This function is called the Brenier potential. Furthermore, detailed information on the Hölder regularity of the Brenier potential is available. In this work we develop the statistical learning theory of generative adversarial neural networks that learn the Brenier potential. As by the transformation of densities formula, the density of the generated measure depends on the second derivative of the Brenier potential, we develop the universal approximation theory of ReCU networks with cubic activation $\mathtt{ReCU}(x)=\max\{0,x\}^3$ that combines the favorable approximation properties of Hölder functions with a Lipschitz continuous density. In order to assure the convexity of such general networks, we introduce an adversarial training procedure for a potential function represented by the ReCU networks that combines the classical discriminator cross entropy loss with a penalty term that enforces (strict) convexity. We give a detailed decomposition of learning errors and show that for a suitable high penalty parameter all networks chosen in the adversarial min-max optimization problem are strictly convex. This is further exploited to prove the consistency of the learning procedure for (slowly) expanding network capacity. We also implement the described learning algorithm and apply it to a number of standard test cases from Gaussian mixture to image data as target distributions. As predicted in theory, we observe that the convexity loss becomes inactive during the training process and the potentials represented by the neural networks have learned convexity.
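
The cubic activation and the role of a convexity penalty can be illustrated in one dimension. The sketch below is an assumption-laden toy: `recu` follows the rectified cubic form max(0, x) cubed, while the midpoint-based penalty is a hypothetical stand-in chosen for illustration, since the abstract does not give the exact form of the paper's penalty term.

```python
def recu(x):
    """Rectified cubic unit: max(0, x) ** 3."""
    return max(0.0, x) ** 3


def recu_second_derivative(x):
    """Second derivative 6 * max(0, x); it is continuous (indeed Lipschitz),
    which is why a cubic rather than a linear rectifier gives smooth curvature."""
    return 6.0 * max(0.0, x)


def midpoint_convexity_penalty(f, pairs):
    """Average violation of midpoint convexity over sampled pairs (a, b):
    max(0, f((a + b) / 2) - (f(a) + f(b)) / 2). Zero for convex f.
    Hypothetical penalty form, not taken from the paper."""
    violations = [
        max(0.0, f(0.5 * (a + b)) - 0.5 * (f(a) + f(b))) for a, b in pairs
    ]
    return sum(violations) / len(violations)


pairs = [(-1.0, 1.0), (0.0, 2.0), (-2.0, 0.5)]

# A convex toy potential built from ReCU incurs no penalty.
convex_penalty = midpoint_convexity_penalty(lambda x: recu(x) + x * x, pairs)

# A concave function is penalised.
concave_penalty = midpoint_convexity_penalty(lambda x: -x * x, pairs)
```

A penalty of this kind vanishes once the potential is convex on the sampled pairs, which matches the abstract's observation that the convexity loss becomes inactive during training.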

[CV-31] Hybrid Approach Combining Ultrasound and Blood Test Analysis with a Voting Classifier for Accurate Liver Fibrosis and Cirrhosis Assessment

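【Quick read】: This paper addresses the invasiveness and inconvenience of conventional liver cirrhosis diagnosis (such as liver biopsy) and aims to improve the detection accuracy of liver fibrosis and cirrhosis. The key to its solution is a hybrid model that combines machine learning with clinical data and ultrasound images: fixed blood test probabilities are integrated with the predictions of a DenseNet-201 deep learning model on ultrasound images. The combined model reaches 92.5% accuracy, supporting its viability for improving diagnostic accuracy and enabling early intervention in liver disease.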

Link: https://arxiv.org/abs/2504.19755
Authors: Kapil Kashyap,Sean Fargose,Chrisil Dabre,Fatema Dolaria,Nilesh Patil,Aniket Kore
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Liver cirrhosis is an insidious condition involving the substitution of normal liver tissue with fibrous scar tissue and causing major health complications. The conventional method of diagnosis using liver biopsy is invasive and, therefore, inconvenient for use in regular screening. In this paper, we present a hybrid model that combines machine learning techniques with clinical data and ultrasound scans to improve liver fibrosis and cirrhosis detection accuracy. The model integrates fixed blood test probabilities with deep learning model predictions (DenseNet-201) for ultrasonic images. The combined hybrid model achieved an accuracy of 92.5%. The findings establish the viability of the combined model in enhancing diagnosis accuracy and supporting early intervention in liver disease care.
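
One plausible way to combine the two sources of evidence is soft voting, sketched below. The abstract does not say whether the voting is soft or hard, nor how the two models are weighted, so the averaging rule, the equal default weights, the class names, and the probability values are all assumptions for illustration.

```python
def soft_vote(probability_sets, weights=None):
    """Weighted average of per-class probabilities from several models."""
    if weights is None:
        weights = [1.0] * len(probability_sets)
    total = sum(weights)
    n_classes = len(probability_sets[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, probability_sets)) / total
        for i in range(n_classes)
    ]


# Hypothetical class labels and per-model probabilities for one patient.
CLASSES = ["normal", "fibrosis", "cirrhosis"]
blood_test_probs = [0.2, 0.5, 0.3]   # e.g. from a blood-test-based model
ultrasound_probs = [0.1, 0.3, 0.6]   # e.g. from an image model such as DenseNet-201

combined = soft_vote([blood_test_probs, ultrasound_probs])
predicted = CLASSES[max(range(len(CLASSES)), key=combined.__getitem__)]
```

Here the blood test alone would favour "fibrosis", but the stronger ultrasound evidence shifts the combined decision to "cirrhosis".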

[CV-32] STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction

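【Quick read】: This paper addresses the failure of existing methods for 3D occupancy and scene flow prediction to capture local details, and the resulting loss of spatial discriminative ability, both caused by the sparsity and complexity of 3D space. The key to its solution is an explicit state-based modeling method that uses the occupied state to renovate the 3D features. It introduces a sparse occlusion-aware attention mechanism with a cascade refinement strategy, which accurately renovates 3D features under the guidance of occupied state information, and a new method for modeling long-term dynamic interactions that reduces computational cost while preserving spatial information.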

Link: https://arxiv.org/abs/2504.19749
Authors: Zhimin Liao,Ping Wei,Shuaijia Chen,Haoxuan Wang,Ziyang Ren
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D occupancy and scene flow offer a detailed and dynamic representation of a 3D scene. Recognizing the sparsity and complexity of 3D space, previous vision-centric methods have employed implicit learning-based approaches to model spatial and temporal information. However, these approaches struggle to capture local details and diminish the model’s spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. Specifically, we propose a sparse occlusion-aware attention mechanism, integrated with a cascade refinement strategy, which accurately renovates 3D features with the guidance of occupied state information. Additionally, we introduce a novel method for modeling long-term dynamic interactions, which reduces computational costs and preserves spatial information. Compared to the previous state-of-the-art methods, our efficient explicit renovation strategy not only delivers superior performance in terms of RayIoU and mAVE for occupancy and scene flow prediction but also markedly reduces GPU memory usage during training, bringing it down to 8.7GB. Our code is available at this https URL.

[CV-33] EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia CVPR

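【Quick read】: This paper addresses how to predict ecological properties directly from remote sensing (RS) images, so that RS images can be understood in a more ecologically meaningful way. The key to its solution is the EcoWikiRS dataset, which contains high-resolution aerial images, the corresponding geolocated species observations, and textual descriptions of each species' habitat taken from Wikipedia, together with WINCEL, a weighted version of the InfoNCE loss designed to cope with weak and noisy supervision.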

Link: https://arxiv.org/abs/2504.19742
Authors: Valerie Zermatten,Javiera Castillo-Navarro,Pallavi Jain,Devis Tuia,Diego Marcos
Affiliations: EPFL; CNAM; INRIA; CIHEAM-IAMM; Univ. of Montpellier
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at EarthVision 2025 (CVPRW 2025)

Abstract:The presence of species provides key insights into the ecological properties of a location such as land cover, climatic conditions or even soil properties. We propose a method to predict such ecological properties directly from remote sensing (RS) images by aligning them with species habitat descriptions. We introduce the EcoWikiRS dataset, consisting of high-resolution aerial images, the corresponding geolocated species observations, and, for each species, the textual descriptions of their habitat from Wikipedia. EcoWikiRS offers a scalable way of supervision for RS vision language models (RS-VLMs) for ecology. This is a setting with weak and noisy supervision, where, for instance, some text may describe properties that are specific only to part of the species’ niche or is irrelevant to a specific image. We tackle this by proposing WINCEL, a weighted version of the InfoNCE loss. We evaluate our model on the task of ecosystem zero-shot classification by following the habitat definitions from the European Nature Information System (EUNIS). Our results show that our approach helps in understanding RS images in a more ecologically meaningful manner. The code and the dataset are available at this https URL.
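
A weighted InfoNCE loss can be sketched as below. This is only one plausible reading of "a weighted version of the InfoNCE loss": each image-text pair gets a weight that scales its contribution, so unreliable pairs can be down-weighted. How WINCEL actually computes its weights is not described in the abstract, and the similarity values, weights, and temperature here are hypothetical.

```python
import math


def weighted_info_nce(similarity, weights, temperature=0.07):
    """Weighted InfoNCE over a square similarity matrix.

    similarity[i][j] is the similarity between image i and text j; the matching
    pair sits on the diagonal. weights[i] scales the loss of pair i.
    """
    total_weight = sum(weights)
    loss = 0.0
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += weights[i] * (log_denominator - logits[i])
    return loss / total_weight


# With no signal (all similarities equal) the loss is log(batch size).
uninformative = weighted_info_nce([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], temperature=1.0)

# Second pair is mismatched (its text is closer to the wrong image).
noisy = [[1.0, 0.0], [1.0, 0.0]]
equal_weighting = weighted_info_nce(noisy, [1.0, 1.0])
down_weighted = weighted_info_nce(noisy, [1.0, 0.1])
```

Down-weighting the mismatched pair reduces its influence on the loss, which is the intended effect when some habitat text is irrelevant to a particular image.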

[CV-34] Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model

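【Quick read】: This paper addresses semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. The key to its solution is a joint representation learning framework paired with a novel gradient-friendly loss function, which captures visual features effectively and accelerates convergence towards an optimal feature representation, together with augmented textual prompts and mixed view augmentation that improve the model's multimodal understanding and generalization.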

Link: https://arxiv.org/abs/2504.19739
Authors: Muzammil Behzad,Guoying Zhao
Affiliations: King Fahd University of Petroleum & Minerals; University of Oulu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce AffectVLM, a vision-language model designed to integrate multiviews for a semantically rich and visually comprehensive understanding of facial emotions from 3D/4D data. To effectively capture visual features, we propose a joint representation learning framework paired with a novel gradient-friendly loss function that accelerates model convergence towards optimal feature representation. Additionally, we introduce augmented textual prompts to enhance the model’s linguistic capabilities and employ mixed view augmentation to expand the visual dataset. We also develop a Streamlit app for a real-time interactive inference and enable the model for distributed learning. Extensive experiments validate the superior performance of AffectVLM across multiple benchmarks.

[CV-35] CoDEx: Combining Domain Expertise for Spatial Generalization in Satellite Image Analysis CVPR2025

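[Quick read]: This paper addresses the poor generalization of satellite image analysis models caused by global variations in terrain appearance, a problem that is most pronounced when training and test data come from different geographic locations. The key to the solution is a new domain-generalization framework that, rather than trying to learn a single general-purpose model, trains one expert model per training domain while learning the similarity between experts and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions.

Link: https://arxiv.org/abs/2504.19737
Authors: Abhishek Kuriyal, Elliot Vincent, Mathieu Aubry, Loic Landrieu
Affiliations: LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS, France; LASTIG, Univ Gustave Eiffel, IGN-ENSG, 94160, Saint-Mande, France; Inria, ENS, CNRS, PSL Research University, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 EarthVision Workshop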

Abstract:Global variations in terrain appearance raise a major challenge for satellite image analysis, leading to poor model performance when training on locations that differ from those encountered at test time. This remains true even with recent large global datasets. To address this challenge, we propose a novel domain-generalization framework for satellite images. Instead of trying to learn a single generalizable model, we train one expert model per training domain, while learning experts’ similarity and encouraging similar experts to be consistent. A model selection module then identifies the most suitable experts for a given test sample and aggregates their predictions. Experiments on four datasets (DynamicEarthNet, MUDS, OSCD, and FMoW) demonstrate consistent gains over existing domain generalization and adaptation methods. Our code is publicly available at this https URL.
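
The selection-and-aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that the selection module outputs one relevance score per expert and that predictions are combined by a softmax-weighted average over the top-k experts. The function `aggregate_experts`, the `top_k` parameter, and the weighting rule are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def aggregate_experts(
    scores: Sequence[float],
    predictions: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[float]:
    """Combine per-expert class probabilities for one test sample.

    scores[i] is a (hypothetical) suitability score for expert i and
    predictions[i] is that expert's class-probability vector.
    """
    if len(scores) != len(predictions) or not scores:
        raise ValueError("need one score per expert prediction")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Keep the top_k highest-scoring experts.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]

    # Softmax over the chosen scores (shifted by the max for stability).
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)

    # Weighted average of the chosen experts' predictions, class by class.
    n_classes = len(predictions[0])
    return [
        sum(w / total * predictions[i][c] for w, i in zip(weights, chosen))
        for c in range(n_classes)
    ]
```

With `top_k=1` this reduces to picking the single best-scoring expert; larger values blend several similar experts.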

[CV-36] Measuring Train Driver Performance as Key to Approval of Driverless Trains

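[Quick read]: This paper addresses a problem in the safety approval of computer vision systems for driverless trains: obstacle detection performance is hard to quantify, especially because few measurement results have been published. The key to the solution is a new public, anonymized dataset of 711 train driver performance measurements from controlled experiments. It covers reaction time and distance to the obstacle across different speeds, obstacle sizes, train protection systems, and obstacle color contrasts, and is intended to give research, standardization, and regulation objective and comprehensive data.

Link: https://arxiv.org/abs/2504.19735
Authors: Rustam Tagiew (1), Prasannavenkatesh Balaji (1) ((1) German Centre for Rail Traffic Research at the Federal Railway Authority)
Affiliations: German Centre for Rail Traffic Research (DZSF)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, abstract accepted by IAVVC 2025, full paper to be submitted to IAVVC 2025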

Abstract:Points 2.1.4(b), 2.4.2(b) and 2.4.3(b) in Annex I of Implementing Regulation (EU) No. 402/2013 allow a simplified approach for the safety approval of computer vision systems for driverless trains, if they have ‘similar’ functions and interfaces as the replaced human driver. The human driver is not replaced one-to-one by a technical system - only a limited set of cognitive functions are replaced. However, performance in the most challenging function, obstacle detection, is difficult to quantify due to the deficiency of published measurement results. This article summarizes the data published so far. This article also goes a long way to remedy this situation by providing a new public and anonymized dataset of 711 train driver performance measurements from controlled experiments. The measurements are made for different speeds, obstacle sizes, train protection systems and obstacle color contrasts respectively. The measured values are reaction time and distance to the obstacle. The goal of this paper is an unbiased and exhaustive description of the presented dataset for research, standardization and regulation. Further project related information including the dataset and source code is available at this https URL

[CV-37] RepText: Rendering Visual Text via Replicating

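[Quick read]: This paper addresses the limitations of current text-to-image generation models in producing precise and flexible typographic elements, especially in non-Latin alphabets. The key to the solution is RepText, which assumes that text understanding is a sufficient but not a necessary condition for text rendering. By injecting language-agnostic glyph and position information, it lets pre-trained monolingual text-to-image models accurately replicate multilingual visual text in user-specified fonts without actually understanding the text.

Link: https://arxiv.org/abs/2504.19724
Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. this https URL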

Abstract:Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, but not a necessary condition. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to really understand them. Specifically, we adopt the setting from ControlNet and additionally integrate language-agnostic glyph and position of rendered text to enable generating harmonized visual text, allowing users to customize text content, font and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase, we directly initialize with noisy glyph latent instead of random initialization, and adopt region masks to restrict the feature injection to only the text region to avoid distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing works: our approach outperforms existing open-source methods and achieves results comparable to native multi-language closed-source models. To be fair, we also discuss its limitations exhaustively at the end.

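[CV-38] The ATLAS of Traffic Lights: A Reliable Perception Framework for Autonomous Driving

[Quick read]: This paper aims to solve the accurate detection and interpretation of traffic lights by camera-based perception systems in autonomous vehicles, so that they can navigate complex urban environments safely. The key to the solution is a modular perception framework that integrates state-of-the-art detection models with a novel real-time association and decision framework, enabling seamless deployment in an autonomous driving stack. To overcome the limitations of existing public datasets, the authors also introduce the ATLAS dataset, which provides comprehensive annotations of traffic light states and pictograms across diverse environmental conditions and camera setups, giving high-quality data for model training and evaluation.

Link: https://arxiv.org/abs/2504.19722
Authors: Rupert Polley, Nikolai Polley, Dominik Heid, Marc Heinrich, Sven Ochs, J. Marius Zöllner
Affiliations: FZI Research Center for Information Technology, Technical Cognitive Systems; Karlsruhe Institute of Technology (KIT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at IEEE Intelligent Vehicles Symposium (IV 2025). Dataset link: this https URL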

Abstract:Traffic light perception is an essential component of the camera-based perception system for autonomous vehicles, enabling accurate detection and interpretation of traffic lights to ensure safe navigation through complex urban environments. In this work, we propose a modularized perception framework that integrates state-of-the-art detection models with a novel real-time association and decision framework, enabling seamless deployment into an autonomous driving stack. To address the limitations of existing public datasets, we introduce the ATLAS dataset, which provides comprehensive annotations of traffic light states and pictograms across diverse environmental conditions and camera setups. This dataset is publicly available at this https URL. We train and evaluate several state-of-the-art traffic light detection architectures on ATLAS, demonstrating significant performance improvements in both accuracy and robustness. Finally, we evaluate the framework in real-world scenarios by deploying it in an autonomous vehicle to make decisions at traffic light-controlled intersections, highlighting its reliability and effectiveness for real-time operation.

[CV-39] A computer vision method to estimate ventilation rate of Atlantic salmon in sea fish farms

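[Quick read]: This paper addresses the limited applicability of conventional fish health and welfare monitoring methods in real sea farming environments, in particular the lack of direct assessment of physiological traits such as ventilation rate in non-invasive video monitoring. The key to the solution is a computer vision method that detects fish heads and classifies the mouth state, then uses multiple object tracking to estimate the ventilation rate of Atlantic salmon (Salmo salar) from underwater video recorded in the production environment, which allows fish showing respiratory distress to be identified accurately.

Link: https://arxiv.org/abs/2504.19719
Authors: Lukas Folkman, Quynh LK Vo, Colin Johnston, Bela Stantic, Kylie A Pitt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: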

Abstract:The increasing demand for aquaculture production necessitates the development of innovative, intelligent tools to effectively monitor and manage fish health and welfare. While non-invasive video monitoring has become a common practice in finfish aquaculture, existing intelligent monitoring methods predominantly focus on assessing body condition or fish swimming patterns and are often developed and evaluated in controlled tank environments, without demonstrating their applicability to real-world aquaculture settings in open sea farms. This underscores the necessity for methods that can monitor physiological traits directly within the production environment of sea fish farms. To this end, we have developed a computer vision method for monitoring ventilation rates of Atlantic salmon (Salmo salar), which was specifically designed for videos recorded in the production environment of commercial sea fish farms using the existing infrastructure. Our approach uses a fish head detection model, which classifies the mouth state as either open or closed using a convolutional neural network. This is followed with multiple object tracking to create temporal sequences of fish swimming across the field of view of the underwater video camera to estimate ventilation rates. The method demonstrated high efficiency, achieving a Pearson correlation coefficient of 0.82 between ground truth and predicted ventilation rates in a test set of 100 fish collected independently of the training data. By accurately identifying pens where fish exhibit signs of respiratory distress, our method offers broad applicability and the potential to transform fish health and welfare monitoring in finfish aquaculture.
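
As a rough illustration of the last stage of this pipeline, the sketch below turns the per-frame mouth states of one tracked fish into a ventilation rate and compares predictions against ground truth with a Pearson correlation. The abstract does not say how the rate is derived from the temporal sequence, so counting closed-to-open transitions is an assumption, and the names `ventilation_rate_bpm` and `pearson` are hypothetical.

```python
import math
from typing import Sequence


def ventilation_rate_bpm(mouth_open: Sequence[bool], fps: float) -> float:
    """Estimate ventilations per minute for one tracked fish.

    mouth_open[t] is the classifier's open/closed decision for frame t.
    Assumption: one ventilation is counted per closed -> open transition.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if len(mouth_open) < 2:
        return 0.0
    openings = sum(
        1 for prev, cur in zip(mouth_open, mouth_open[1:]) if not prev and cur
    )
    duration_minutes = len(mouth_open) / fps / 60.0
    return openings / duration_minutes


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)
```

In practice a track would need to be long enough to contain several mouth cycles, and classifier flicker would likely need smoothing before counting transitions.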

[CV-40] Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation

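[Quick read]: This paper addresses the degraded fit between template mesh and scan in 3D face registration, particularly in non-skin regions (such as hair, beard, and accessories), where the optimized template-to-scan distance pulls the template mesh toward the noisy scan surface. The key to the solution is a novel method that extracts features from multi-view images, aggregates them in 3D, and fuses them with 3D geometric features of the scan mesh to predict a segmentation mask directly on the scan mesh, accurately separating skin from non-skin regions.

Link: https://arxiv.org/abs/2504.19718
Authors: Victoria Yue Chen, Daoye Wang, Stephan Garbin, Sebastian Winberg, Timo Bolkart, Thabo Beeler
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 4 figures, to be published in Eurographics 2025 as a short paper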

Abstract:Face registration deforms a template mesh to closely fit a 3D face scan, the quality of which commonly degrades in non-skin regions (e.g., hair, beard, accessories), because the optimized template-to-scan distance pulls the template mesh towards the noisy scan surface. Improving registration quality requires a clean separation of skin and non-skin regions on the scan mesh. Existing image-based (2D) or scan-based (3D) segmentation methods however perform poorly. Image-based segmentation outputs multi-view inconsistent masks, and they cannot account for scan inaccuracies or scan-image misalignment, while scan-based methods suffer from lower spatial resolution compared to images. In this work, we introduce a novel method that accurately separates skin from non-skin geometry on 3D human head scans. For this, our method extracts features from multi-view images using a frozen image foundation model and aggregates these features in 3D. These lifted 2D features are then fused with 3D geometric features extracted from the scan mesh, to then predict a segmentation mask directly on the scan mesh. We show that our segmentations improve the registration accuracy over pure 2D or 3D segmentation methods by 8.89% and 14.3%, respectively. Although trained only on synthetic data, our model generalizes well to real data.

[CV-41] Open-set Anomaly Segmentation in Complex Scenarios

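[Quick read]: This paper addresses shortcomings in open-set anomaly segmentation (the precise segmentation of out-of-distribution objects), where existing methods show significant deficiencies in performance and safety under complex weather conditions and dynamic driving environments. The key to the solution is an energy-entropy learning (EEL) strategy that combines information from energy and entropy to make the model more robust in complex open-world environments, together with a diffusion-based synthesizer of anomalous training data that generates diverse, high-quality anomalous images to improve on existing data synthesis methods.

Link: https://arxiv.org/abs/2504.19706
Authors: Song Xia, Yi Yu, Henghui Ding, Wenhan Yang, Shifei Liu, Alex C. Kot, Xudong Jiang
Affiliations: Nanyang Technological University; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: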

Abstract:Precise segmentation of out-of-distribution (OoD) objects, herein referred to as anomalies, is crucial for the reliable deployment of semantic segmentation models in open-set, safety-critical applications, such as autonomous driving. Current anomalous segmentation benchmarks predominantly focus on favorable weather conditions, resulting in untrustworthy evaluations that overlook the risks posed by diverse meteorological conditions in open-set environments, such as low illumination, dense fog, and heavy rain. To bridge this gap, this paper introduces ComsAmy, a challenging benchmark specifically designed for open-set anomaly segmentation in complex scenarios. ComsAmy encompasses a wide spectrum of adverse weather conditions, dynamic driving environments, and diverse anomaly types to comprehensively evaluate the model performance in realistic open-world scenarios. Our extensive evaluation of several state-of-the-art anomalous segmentation models reveals that existing methods demonstrate significant deficiencies in such challenging scenarios, highlighting their serious safety risks for real-world deployment. To address this, we propose a novel energy-entropy learning (EEL) strategy that integrates the complementary information from energy and entropy to bolster the robustness of anomaly segmentation under complex open-world environments. Additionally, a diffusion-based anomalous training data synthesizer is proposed to generate diverse and high-quality anomalous images to enhance the existing copy-paste training data synthesizer. Extensive experimental results on both public and ComsAmy benchmarks demonstrate that our proposed diffusion-based synthesizer with energy and entropy learning (DiffEEL) serves as an effective and generalizable plug-and-play method to enhance existing models, yielding an average improvement of around 4.96% in AUPRC and 9.87% in FPR_95.
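
For readers unfamiliar with the two quantities that EEL combines, the sketch below computes them for the class logits of a single pixel using their standard definitions: the free-energy score (negative log-sum-exp of the logits) and the Shannon entropy of the softmax distribution. How the paper combines them into a training objective is not stated in the abstract, so this shows only the two signals; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def energy_score(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Free energy -T * logsumexp(logits / T); higher suggests an anomaly."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return -temperature * log_sum_exp


def entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)
```

A pixel with one dominant logit gets low energy and low entropy, while a pixel with flat logits scores high on both, which is the behavior an anomaly detector looks for.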

[CV-42] SubGrapher: Visual Fingerprinting of Chemical Structures

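[Quick read]: This paper addresses the automatic extraction of chemical structures from scientific literature, in particular the challenge that molecular information stored as images in patent documents is hard to reach through traditional text-based search. The key to the solution is SubGrapher, which uses learning-based instance segmentation to extract molecular fingerprints directly from chemical structure images instead of reconstructing full molecular graphs, enabling more efficient and accurate chemical structure retrieval.

Link: https://arxiv.org/abs/2504.19695
Authors: Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Luc Van Gool, Peter W. J. Staar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: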

Abstract:Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code will be made publicly available.

[CV-43] Prompt Guiding Multi-Scale Adaptive Sparse Representation-driven Network for Low-Dose CT MAR

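[Quick read]: This paper addresses two main problems in simultaneous low-dose CT (LDCT) reconstruction and metal artifact reduction (LDMAR): existing deep learning methods neglect the fusion of multi-scale and within-scale information in their network design, and training a separate model for each dose level requires substantial storage. The key to the solution is a prompt-guided multi-scale adaptive sparse representation-driven network (PMSRNet), which uses a carefully designed prompt-guided scale-adaptive threshold generator (PSATG) and a multi-scale coefficient fusion module (MSFuM) to exploit within-scale characteristics and cross-scale complementarity at the same time, improving image quality. The authors also build an interpretable dual-domain LDMAR framework, PDuMSRNet, and use a prompt-guided strategy to train a single model that handles multiple dose levels, improving generalization and practicality.

Link: https://arxiv.org/abs/2504.19687
Authors: Baoshun Shi, Bing Chen, Shaolei Zhang, Huazhu Fu, Zhanli Hu
Affiliations: Yanshan University; Institute of High Performance Computing; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: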

Abstract:Low-dose CT (LDCT) reduces X-ray radiation exposure, but it can degrade image quality and even produce metal artifacts when metallic implants are present. For simultaneous LDCT reconstruction and metal artifact reduction (LDMAR), existing deep learning-based efforts face two main limitations: i) the network design neglects multi-scale and within-scale information; ii) training a distinct model for each dose level requires significant storage space. To fill these gaps, we propose a prompt guiding multi-scale adaptive sparse representation-driven network, abbreviated as PMSRNet, for the LDMAR task. Specifically, we construct PMSRNet inspired by multi-scale sparsifying frames, and it can simultaneously exploit within-scale characteristics and cross-scale complementarity owing to an elaborate prompt guiding scale-adaptive threshold generator (PSATG) and a multi-scale coefficient fusion module (MSFuM). The PSATG adaptively captures contextual information at multiple levels to generate more faithful thresholds by fusing features from the local, regional, and global levels. Furthermore, we develop an interpretable dual-domain LDMAR framework called PDuMSRNet, and train a single model with a prompt guiding strategy for multiple dose levels. We build a prompt guiding module, whose input contains the dose level, metal mask, and input instance, to provide various guiding information, allowing a single model to accommodate various CT dose settings. Extensive experiments at various dose levels demonstrate that the proposed methods outperform state-of-the-art LDMAR methods.

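[CV-44] ClearVision: Leveraging CycleGAN and SigLIP-2 for Robust All-Weather Classification in Traffic Camera Imagery

[Quick read]: This paper addresses accurate weather classification from low-quality traffic camera imagery, particularly under adverse nighttime conditions. The key to the solution is combining generative domain adaptation with efficient contrastive learning: CycleGAN-based domain translation improves nighttime image quality so that downstream models extract better features, and the lightweight SigLIP-2 with its sigmoid contrastive loss is used for contrastive learning, which substantially improves nighttime classification performance.

Link: https://arxiv.org/abs/2504.19684
Authors: Anush Lakshman Sivaraman, Kojo Adu-Gyamfi, Ibne Farabi Shihab, Anuj Sharma
Affiliations: Iowa State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: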

Abstract:Accurate weather classification from low-quality traffic camera imagery remains a challenging task, particularly under adverse nighttime conditions. In this study, we propose a scalable framework that combines generative domain adaptation with efficient contrastive learning to enhance classification performance. Using CycleGAN-based domain translation, we improve the quality of nighttime images, enabling better feature extraction by downstream models. While the baseline EVA-02 model employing CLIP-based contrastive loss achieves an overall accuracy of 96.55%, it exhibits a significant performance gap between daytime (97.21%) and nighttime conditions (63.40%). Replacing CLIP with the lightweight SigLIP-2 (Sigmoid contrastive loss) achieves a competitive overall accuracy of 94.00%, with substantial improvements in nighttime performance (85.90% accuracy). The combination of Vision-SigLIP-2, Text-SigLIP-2, CycleGAN, and contrastive training achieves the best nighttime accuracy (85.90%) among all models tested, while EVA-02 with CycleGAN maintains the highest overall accuracy (97.01%) and per-class accuracies. These findings demonstrate the potential of combining domain adaptation and efficient contrastive learning to build practical, resource-efficient weather classification systems for intelligent transportation infrastructure.
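
To make the term "sigmoid contrastive loss" concrete, here is a minimal sketch of the pairwise sigmoid loss in the form used by the SigLIP family: every image-text pair in a batch is treated as an independent binary classification, with matching pairs labeled +1 and all others -1. The input `sim` is assumed to be a square matrix of cosine similarities between image and text embeddings; the default temperature and bias follow the commonly cited initial values, and nothing here reflects the training setup of this particular paper.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(
    sim: Sequence[Sequence[float]],
    temperature: float = 10.0,
    bias: float = -10.0,
) -> float:
    """Pairwise sigmoid loss over an n x n image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; the pair (i, i)
    is the matching one. The loss is averaged over the n images.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (temperature * sim[i][j] + bias))
    return -total / n
```

Unlike the softmax-based CLIP loss, no normalization across the batch is needed, which is one reason this form is considered lightweight.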

[CV-45] Explaining Vision GNNs: A Semantic and Visual Analysis of Graph-based Image Classification

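[Quick read]: This paper addresses the explainability of the decision process in vision models based on Graph Neural Networks (GNNs), in particular the semantic consistency of the graphs formed at different layers and how well they preserve object structure and meaningful relationships. The key to the solution is to quantify how far inter-layer graph connections reflect semantic similarity and spatial coherence, and to visualize information flow with heatmaps, in order to assess the models' explainability and robustness.

Link: https://arxiv.org/abs/2504.19682
Authors: Nikolaos Chaidos, Angeliki Dimitriou, Nikolaos Spanos, Athanasios Voulodimos, Giorgos Stamou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures, accepted for presentation at xAI-World-Conference 2025, code is available at this https URL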

Abstract:Graph Neural Networks (GNNs) have emerged as an efficient alternative to convolutional approaches for vision tasks such as image classification, leveraging patch-based representations instead of raw pixels. These methods construct graphs where image patches serve as nodes, and edges are established based on patch similarity or classification relevance. Despite their efficiency, the explainability of GNN-based vision models remains underexplored, even though graphs are naturally interpretable. In this work, we analyze the semantic consistency of the graphs formed at different layers of GNN-based image classifiers, focusing on how well they preserve object structures and meaningful relationships. A comprehensive analysis is presented by quantifying the extent to which inter-layer graph connections reflect semantic similarity and spatial coherence. Explanations from standard and adversarial settings are also compared to assess whether they reflect the classifiers’ robustness. Additionally, we visualize the flow of information across layers through heatmap-based visualization techniques, thereby highlighting the models’ explainability. Our findings demonstrate that the decision-making processes of these models can be effectively explained, while also revealing that their reasoning does not necessarily align with human perception, especially in deeper layers.

[CV-46] xEdgeFace: Efficient Cross-Spectral Face Recognition for Edge Devices

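[Quick read]: This paper addresses the high computational cost of Heterogeneous Face Recognition (HFR), which makes it hard to deploy on resource-constrained edge devices. The key to the solution is a lightweight yet effective HFR framework that adapts a hybrid CNN-Transformer architecture originally designed for face recognition. It enables efficient end-to-end training with only a small amount of paired heterogeneous data while preserving strong performance on standard RGB face recognition, balancing computational efficiency and recognition accuracy.

Link: https://arxiv.org/abs/2504.19646
Authors: Anjith George, Sebastien Marcel
Affiliations: Idiap Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages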

Abstract:Heterogeneous Face Recognition (HFR) addresses the challenge of matching face images across different sensing modalities, such as thermal to visible or near-infrared to visible, expanding the applicability of face recognition systems in real-world, unconstrained environments. While recent HFR methods have shown promising results, many rely on computation-intensive architectures, limiting their practicality for deployment on resource-constrained edge devices. In this work, we present a lightweight yet effective HFR framework by adapting a hybrid CNN-Transformer architecture originally designed for face recognition. Our approach enables efficient end-to-end training with minimal paired heterogeneous data while preserving strong performance on standard RGB face recognition tasks. This makes it a compelling solution for both homogeneous and heterogeneous scenarios. Extensive experiments across multiple challenging HFR and face recognition benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches while maintaining a low computational overhead.

[CV-47] BARIS: Boundary-Aware Refinement with Environmental Degradation Priors for Robust Underwater Instance Segmentation

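[Quick read]: This paper addresses the drop in model performance in underwater instance segmentation caused by adverse visual conditions such as light attenuation, scattering, and color distortion. The key to the solution is BARIS-Decoder (Boundary-Aware Refinement Decoder), which improves segmentation accuracy through feature refinement, combined with the Environmental Robust Adapter (ERA), which models underwater degradation patterns effectively while reducing trainable parameters by more than 90%, yielding efficient and robust underwater instance segmentation.

Link: https://arxiv.org/abs/2504.19643
Authors: Pin-Chi Pan, Soo-Chang Pei
Affiliations: National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures, and 11 tables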

Abstract:Underwater instance segmentation is challenging due to adverse visual conditions such as light attenuation, scattering, and color distortion, which degrade model performance. In this work, we propose BARIS-Decoder (Boundary-Aware Refinement Decoder for Instance Segmentation), a framework that enhances segmentation accuracy through feature refinement. To address underwater degradations, we introduce the Environmental Robust Adapter (ERA), which efficiently models underwater degradation patterns while reducing trainable parameters by over 90% compared to full fine-tuning. The integration of BARIS-Decoder with ERA-tuning, referred to as BARIS-ERA, achieves state-of-the-art performance, surpassing Mask R-CNN by 3.4 mAP with a Swin-B backbone and 3.8 mAP with ConvNeXt V2. Our findings demonstrate the effectiveness of BARIS-ERA in advancing underwater instance segmentation, providing a robust and efficient solution.

[CV-48] Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval

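[Quick read]: This paper addresses the loss of retrieval performance in Partially Relevant Video Retrieval (PRVR) caused by the semantic asymmetry between the textual and visual modalities, especially when videos contain a large amount of content irrelevant to the query. The key to the solution is to systematically exploit two core characteristics of the task: inter-sample correlation and intra-sample redundancy. To this end, the authors propose a framework with three core modules: the Inter Correlation Enhancement (ICE) module strengthens the semantic space by constructing pseudo-positive pairs; the Intra Redundancy Mining (IRM) module improves representations by mining redundant features as hard negative samples; and the Temporal Coherence Prediction (TCP) module improves feature discrimination by predicting the original temporal order of video frames and moments.

Link: https://arxiv.org/abs/2504.19637
Authors: Junlong Ren, Gangjian Zhang, Yu Hu, Jian Shu, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: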

Abstract:Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to the text query. The primary challenge in PRVR arises from the semantic asymmetry between textual and visual modalities, as videos often contain substantial content irrelevant to the query. Existing methods coarsely align paired videos and text queries to construct the semantic space, neglecting the critical cross-modal dual nature inherent in this task: inter-sample correlation and intra-sample redundancy. To this end, we propose a novel PRVR framework to systematically exploit these two characteristics. Our framework consists of three core modules. First, the Inter Correlation Enhancement (ICE) module captures inter-sample correlation by identifying semantically similar yet unpaired text queries and video moments, combining them to form pseudo-positive pairs for more robust semantic space construction. Second, the Intra Redundancy Mining (IRM) module mitigates intra-sample redundancy by mining redundant video moment features and treating them as hard negative samples, thereby encouraging the model to learn more discriminative representations. Finally, to reinforce these modules, we introduce the Temporal Coherence Prediction (TCP) module, which enhances feature discrimination by training the model to predict the original temporal order of randomly shuffled video frames and moments. Extensive experiments on three datasets demonstrate the superiority of our approach compared to previous methods, achieving state-of-the-art results.
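
The self-supervised setup behind the TCP module can be sketched as follows: shuffle a sequence of frame or moment features and keep the permutation as the prediction target. This shows only how such a training sample might be built, assuming the target is the original index of each shuffled item. The model, the loss, and the helper names `make_order_prediction_sample` and `restore_order` are hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def make_order_prediction_sample(
    features: Sequence[T], rng: random.Random
) -> Tuple[List[T], List[int]]:
    """Shuffle features and return (shuffled, target).

    target[k] is the original temporal index of shuffled[k], which is
    what an order-prediction head would be trained to output.
    """
    target = list(range(len(features)))
    rng.shuffle(target)
    shuffled = [features[i] for i in target]
    return shuffled, target


def restore_order(shuffled: Sequence[T], predicted: Sequence[int]) -> List[T]:
    """Undo the shuffle given a (predicted) original index per item."""
    restored: List[T] = list(shuffled)
    for item, original_index in zip(shuffled, predicted):
        restored[original_index] = item
    return restored
```

A perfect prediction lets `restore_order` recover the original sequence exactly, which makes the task easy to verify during debugging.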

[CV-49] NSegment : Noisy Segment Improves Remote Sensing Image Segmentation

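[Quick read]: This paper addresses labeling errors in remote sensing (RS) image segmentation datasets, which are usually hard to spot because of ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. To address this, the authors propose a data augmentation method called NSegment, whose key idea is to apply elastic transformations only to the segmentation labels and to vary the deformation intensity per sample in each training epoch, mitigating the effect of annotation inconsistencies.

Link: https://arxiv.org/abs/2504.19634
Authors: Yechan Kim, DongHo Yoon, SooYeon Kim, Moongu Jeon
Affiliations: Gwangju Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint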

Abstract:Labeling errors in remote sensing (RS) image segmentation datasets often remain implicit and subtle due to ambiguous class boundaries, mixed pixels, shadows, complex terrain features, and subjective annotator bias. Furthermore, the scarcity of annotated RS data due to high image acquisition and labeling costs complicates training noise-robust models. While sophisticated mechanisms such as label selection or noise correction might address this issue, they tend to increase training time and add implementation complexity. In this letter, we propose NSegment-a simple yet effective data augmentation solution to mitigate this issue. Unlike traditional methods, it applies elastic transformations only to segmentation labels, varying deformation intensity per sample in each training epoch to address annotation inconsistencies. Experimental results demonstrate that our approach improves the performance of RS image segmentation on various state-of-the-art models.

[CV-50] DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer

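[Quick read]: This paper addresses the poor quality and weak spatiotemporal consistency of generated multi-view driving scene videos, with the goal of improving 3D visual perception tasks. The key to the solution is DiVE, a diffusion-transformer-based framework that uses a unified cross-attention mechanism and a SketchFormer for precise control over multimodal data, and adds a view-inflated attention mechanism that ensures cross-view consistency without extra parameters. To handle the dual challenges of high-resolution video generation, DiVE also introduces two techniques, Multi-Control Auxiliary Branch Distillation and Resolution Progressive Sampling, which significantly speed up generation while maintaining high output quality.

Link: https://arxiv.org/abs/2504.19614
Authors: Junpeng Jiang, Gangyi Hong, Miao Zhang, Hengtong Hu, Kun Zhan, Rui Shao, Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen; Li Auto Inc.; Tsinghua University Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: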

Abstract:Collecting multi-view driving scenario videos to enhance the performance of 3D visual perception tasks presents significant challenges and incurs substantial costs, making generative models for realistic data an appealing alternative. Yet, the videos generated by recent works suffer from poor quality and spatiotemporal consistency, undermining their utility in advancing perception tasks under driving scenarios. To address this gap, we propose DiVE, a diffusion transformer-based generative framework meticulously engineered to produce high-fidelity, temporally coherent, and cross-view consistent multi-view videos, aligning seamlessly with bird’s-eye view layouts and textual descriptions. DiVE leverages a unified cross-attention and a SketchFormer to exert precise control over multimodal data, while incorporating a view-inflated attention mechanism that adds no extra parameters, thereby guaranteeing consistency across views. Despite these advancements, synthesizing high-resolution videos under multimodal constraints introduces dual challenges: investigating the optimal classifier-free guidance configuration under intricate multi-condition inputs and mitigating excessive computational latency in high-resolution rendering, both of which remain underexplored in prior research. To resolve these limitations, we introduce two innovations: Multi-Control Auxiliary Branch Distillation, which streamlines multi-condition CFG selection while circumventing high computational overhead, and Resolution Progressive Sampling, a training-free acceleration strategy that staggers resolution scaling to reduce high latency due to high resolution. These innovations collectively achieve a 2.62x speedup with minimal quality degradation. Evaluated on the nuScenes dataset, DiVE achieves SOTA performance in multi-view video generation, yielding photorealistic outputs with exceptional temporal and cross-view coherence.

[CV-51] Image Generation Method Based on Heat Diffusion Models

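[Quick read]: This paper addresses the difficulty that Denoising Diffusion Probabilistic Models (DDPMs) have in preserving detail during image generation, because they process the image as a whole and do not fully account for local relationships between adjacent pixels. The key to the solution is the Heat Diffusion Model (HDM), which integrates the discrete form of the two-dimensional heat equation into the diffusion and generation formulas to enable pixel-level operations. It keeps the same training process as DDPM while letting the model compute relationships between neighboring pixels, producing higher-quality, more realistic images.

Link: https://arxiv.org/abs/2504.19600
Authors: Pengfei Zhang, Shouqing Jia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: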

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) achieve high-quality image generation without adversarial training, but they process images as a whole. Since adjacent pixels are highly likely to belong to the same object, we propose the Heat Diffusion Model (HDM) to further preserve image details and generate more realistic images. HDM is a model that incorporates pixel-level operations while maintaining the same training process as DDPM. In HDM, the discrete form of the two-dimensional heat equation is integrated into the diffusion and generation formulas of DDPM, enabling the model to compute relationships between neighboring pixels during image processing. Our experiments demonstrate that HDM can generate higher-quality samples compared to models such as DDPM, Consistency Diffusion Models (CDM), Latent Diffusion Models (LDM), and Vector Quantized Generative Adversarial Networks (VQGAN).
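
As background for the phrase "discrete form of the two-dimensional heat equation", the sketch below performs one explicit finite-difference update on a small grayscale image, where each pixel moves toward the average of its four neighbors. The abstract does not give the exact way HDM embeds this update in the DDPM formulas, so this is only the textbook scheme; `heat_step`, the `alpha` parameter, and the edge-replicating boundary are assumptions made for illustration.

```python
from typing import List, Sequence


def heat_step(u: Sequence[Sequence[float]], alpha: float = 0.2) -> List[List[float]]:
    """One explicit step of the 2D heat equation on a pixel grid.

    u_new[i][j] = u[i][j] + alpha * (up + down + left + right - 4 * u[i][j])

    alpha plays the role of diffusivity * dt / dx^2 and must not exceed
    0.25 for this explicit scheme to be stable. Borders replicate the
    edge pixel (zero-flux boundary), so the total intensity is conserved.
    """
    if not 0.0 < alpha <= 0.25:
        raise ValueError("alpha must be in (0, 0.25] for stability")
    height, width = len(u), len(u[0])

    def at(i: int, j: int) -> float:
        # Clamp indices so out-of-range neighbors reuse the edge pixel.
        return u[min(max(i, 0), height - 1)][min(max(j, 0), width - 1)]

    return [
        [
            u[i][j]
            + alpha
            * (at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1) - 4.0 * u[i][j])
            for j in range(width)
        ]
        for i in range(height)
    ]
```

Applied to an image with a single bright pixel, one step spreads part of its intensity to the four adjacent pixels, which is the neighbor coupling the abstract refers to.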

[CV-52] Lightweight Adapter Learning for More Generalized Remote Sensing Change Detection

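Quick read: This paper addresses the poor generalization of deep learning methods for remote sensing change detection (CD), which stems from differences in data distribution and labeling across datasets. Existing methods typically train a dataset-specific deep network for each dataset, and such a network performs poorly on other datasets. The key to the solution is the Change Adapter Network (CANet), which contains dataset-shared and dataset-specific learning modules. Its lightweight adapter adapts to the data distribution and labeling characteristics of each dataset, and an Interesting Change Region Mask (ICM) adaptively focuses on the change objects of interest to reduce the impact of labeling differences. CANet also uses a separate batch normalization layer for each dataset to handle differences in data distribution, giving more general CD performance.

Link: https://arxiv.org/abs/2504.19598
Authors: Dou Quan,Rufan Zhou,Shuang Wang,Ning Huyan,Dong Zhao,Yunan Li,Licheng Jiao
Affiliations: School of Artificial Intelligence, Xidian University; Department of Automation, Tsinghua University; School of Computer Science and Technology, Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: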

Abstract:Deep learning methods have shown promising performances in remote sensing image change detection (CD). However, existing methods usually train a dataset-specific deep network for each dataset. Due to the significant differences in the data distribution and labeling between various datasets, the trained dataset-specific deep network has poor generalization performances on other datasets. To solve this problem, this paper proposes a change adapter network (CANet) for a more universal and generalized CD. CANet contains dataset-shared and dataset-specific learning modules. The former explores the discriminative features of images, and the latter designs a lightweight adapter model, to deal with the characteristics of different datasets in data distribution and labeling. The lightweight adapter can quickly generalize the deep network for new CD tasks with a small computation cost. Specifically, this paper proposes an interesting change region mask (ICM) in the adapter, which can adaptively focus on interested change objects and decrease the influence of labeling differences in various datasets. Moreover, CANet adopts a unique batch normalization layer for each dataset to deal with data distribution differences. Compared with existing deep learning methods, CANet can achieve satisfactory CD performances on various datasets simultaneously. Experimental results on several public datasets have verified the effectiveness and advantages of the proposed CANet on CD. CANet has a stronger generalization ability, smaller training costs (merely updating 4.1%-7.7% parameters), and better performances under limited training datasets than other deep learning methods, which also can be flexibly inserted with existing deep models.
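
The idea of a separate batch normalization layer per dataset can be illustrated with a toy example. This is a sketch under simplifying assumptions: the class `DatasetSpecificNorm` and the dataset names are hypothetical, and a real batch normalization layer would also keep running statistics and learnable scale and shift parameters, which are omitted here.

```python
import math


class DatasetSpecificNorm:
    """Keeps separate normalization statistics for each dataset name."""

    def __init__(self, eps=1e-5):
        self.eps = eps
        self.stats = {}  # dataset name -> (mean, variance)

    def fit(self, dataset, values):
        """Stores the mean and (population) variance of one dataset's features."""
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        self.stats[dataset] = (mean, variance)

    def __call__(self, dataset, values):
        """Normalizes `values` with the statistics of the named dataset only."""
        mean, variance = self.stats[dataset]
        scale = math.sqrt(variance + self.eps)
        return [(v - mean) / scale for v in values]


# Two hypothetical datasets whose features live on very different scales.
norm = DatasetSpecificNorm()
norm.fit("dataset_a", [0.0, 10.0, 20.0])
norm.fit("dataset_b", [100.0, 101.0, 102.0])
a = norm("dataset_a", [0.0, 10.0, 20.0])
b = norm("dataset_b", [100.0, 101.0, 102.0])
```

After normalization both feature sets are centered and on a similar scale, so shared layers downstream see comparable inputs regardless of which dataset a sample came from.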

[CV-53] WILD: a new in-the-Wild Image Linkage Dataset for synthetic image attribution

Quick read: This paper addresses the challenge of synthetic image source attribution, where the growing number of image generators and the lack of high-quality, diverse datasets make training and evaluating attribution models especially difficult. The key to the solution is WILD (in-the-Wild Image Linkage Dataset), which comprises a closed set of 10 popular commercial generators and an open set of 10 additional generators to simulate real-world conditions. Each generator contributes 1,000 images, for 20,000 images in total, half of which undergo a range of post-processing operations. The dataset supports benchmarking of closed-set and open-set identification and verification, as well as attribution that is robust to post-processing and adversarial attacks.

Link: https://arxiv.org/abs/2504.19595
Authors: Pietro Bongini,Sara Mandelli,Andrea Montibeller,Mirko Casu,Orazio Pontorno,Claudio Ragaglia,Luca Zanchetta,Mattia Aquilina,Taiba Majid Wani,Luca Guarnera,Benedetta Tondi,Paolo Bestagini,Irene Amerini,Francesco Denatale,Sebastiano Battiato,Mauro Barni
Affiliations: University of Siena, Department of Information Engineering and Mathematics, Italy; Politecnico di Milano, Department of Electronics, Informatics, and Bioengineering, Italy; University of Trento, Department of Information Engineering and Computer Science, Italy; University of Catania, Department of Mathematics and Computer Science, Italy; Sapienza University of Rome - Department of Computer, Control and Management Engineering, Italy
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthetic image source attribution is an open challenge, with an increasing number of image generators being released yearly. The complexity and the sheer number of available generative techniques, as well as the scarcity of high-quality open source datasets of diverse nature for this task, make training and benchmarking synthetic image source attribution models very challenging. WILD is a new in-the-Wild Image Linkage Dataset designed to provide a powerful training and benchmarking tool for synthetic image attribution models. The dataset is built out of a closed set of 10 popular commercial generators, which constitutes the training base of attribution models, and an open set of 10 additional generators, simulating a real-world in-the-wild scenario. Each generator is represented by 1,000 images, for a total of 10,000 images in the closed set and 10,000 images in the open set. Half of the images are post-processed with a wide range of operators. WILD allows benchmarking attribution models in a wide range of tasks, including closed and open set identification and verification, and robust attribution with respect to post-processing and adversarial attacks. Models trained on WILD are expected to benefit from the challenging scenario represented by the dataset itself. Moreover, an assessment of seven baseline methodologies on closed and open set attribution is presented, including robustness tests with respect to post-processing.

[CV-54] Neural network task specialization via domain constraining

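Quick read: This paper asks how to improve a neural network's performance on a specific data subspace without additional data or changes to the training regime. The key to the solution is specializing the network through task-specific domain constraining, that is, restricting the class label space in which the network operates. The study finds that effective specialization requires modifying traditional fine-tuning methods and constraining the data space to semantically coherent subsets, and it proposes a specialist extraction phase before fine-tuning to maximize the performance gain.

Link: https://arxiv.org/abs/2504.19592
Authors: Roman Malashin,Daniil Ilyukhin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: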

Abstract:This paper introduces a concept of neural network specialization via task-specific domain constraining, aimed at enhancing network performance on data subspace in which the network operates. The study presents experiments on training specialists for image classification and object detection tasks. The results demonstrate that specialization can enhance a generalist’s accuracy even without additional data or changing training regimes: solely by constraining class label space in which the network performs. Theoretical and experimental analyses indicate that effective specialization requires modifying traditional fine-tuning methods and constraining data space to semantically coherent subsets. The specialist extraction phase before tuning the network is proposed for maximal performance gains. We also provide analysis of the evolution of the feature space during specialization. This study paves way to future research for developing more advanced dynamically configurable image analysis systems, where computations depend on the specific input. Additionally, the proposed methods can help improve system performance in scenarios where certain data domains should be excluded from consideration of the generalist network.

[CV-55] Magnifier: A Multi-grained Neural Network-based Architecture for Burned Area Delineation

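Quick read: This paper addresses the limits that data scarcity and the lack of large benchmark datasets place on training neural network models for crisis management and remote sensing. The key to the solution is Magnifier, a method applicable to any existing encoder-decoder architecture. It uses a dual-encoder strategy (a local encoder and a global encoder) to merge information at different contextual levels, improving segmentation performance when data are limited.

Link: https://arxiv.org/abs/2504.19589
Authors: Daniele Rege Cambrin,Luca Colomba,Paolo Garza
Affiliations: Politecnico di Torino
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing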

Abstract:In crisis management and remote sensing, image segmentation plays a crucial role, enabling tasks like disaster response and emergency planning by analyzing visual data. Neural networks are able to analyze satellite acquisitions and determine which areas were affected by a catastrophic event. The problem in their development in this context is the data scarcity and the lack of extensive benchmark datasets, limiting the capabilities of training large neural network models. In this paper, we propose a novel methodology, namely Magnifier, to improve segmentation performance with limited data availability. The Magnifier methodology is applicable to any existing encoder-decoder architecture, as it extends a model by merging information at different contextual levels through a dual-encoder approach: a local and global encoder. Magnifier analyzes the input data twice using the dual-encoder approach. In particular, the local and global encoders extract information from the same input at different granularities. This allows Magnifier to extract more information than the other approaches given the same set of input images. Magnifier improves the quality of the results of +2.65% on average IoU while leading to a restrained increase in terms of the number of trainable parameters compared to the original model. We evaluated our proposed approach with state-of-the-art burned area segmentation models, demonstrating, on average, comparable or better performances in less than half of the GFLOPs.

[CV-56] ShowMak3r: Compositional TV Show Reconstruction

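Quick read: This paper addresses the difficulty of reconstructing dynamic radiance fields from video clips, particularly entertainment videos such as TV shows, where actors occlude one another, facial expressions vary widely, stages are cluttered, and views have small baselines or shots change suddenly. The key to the solution is ShowMak3r, a comprehensive reconstruction pipeline with several core modules: a 3DLocator module that places recovered actors on the stage using a depth prior and estimates unseen human poses; a ShotMatcher module that tracks actors across shot changes; and a face-fitting network that dynamically recovers actors' expressions. Together these enable efficient reconstruction and editing of TV show scenes.

Link: https://arxiv.org/abs/2504.19584
Authors: Sangmin Kim,Seunguk Do,Jaesik Park
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL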

Abstract:Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Many challenges make the reconstruction difficult due to (1) actors occluding with each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors’ expressions. Experiments on Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page : this https URL

[CV-57] SAMBLE: Shape-Specific Point Cloud Sampling for an Optimal Trade-Off Between Local Detail and Global Uniformity

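Quick read: This paper addresses the difficulty of balancing local detail preservation against global shape uniformity in point cloud sampling, and the tendency of existing learning-based sampling methods either to produce unrecognizable sampling patterns or to bias results by over-focusing on sharp edges. The key to the solution is the Sparse Attention Map and Bin-based Learning method (SAMBLE), which learns shape-specific sampling strategies to balance local detail and global structure, leading to better performance on a range of downstream point cloud tasks.

Link: https://arxiv.org/abs/2504.19581
Authors: Chengzhi Wu,Yuxin Wan,Hao Fu,Julius Pfrommer,Zeyun Zhong,Junwei Zheng,Jiaming Zhang,Jürgen Beyerer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: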

Abstract:Driven by the increasing demand for accurate and efficient representation of 3D data in various domains, point cloud sampling has emerged as a pivotal research topic in 3D computer vision. Recently, learning-to-sample methods have garnered growing interest from the community, particularly for their ability to be jointly trained with downstream tasks. However, previous learning-based sampling methods either lead to unrecognizable sampling patterns by generating a new point cloud or biased sampled results by focusing excessively on sharp edge details. Moreover, they all overlook the natural variations in point distribution across different shapes, applying a similar sampling strategy to all point clouds. In this paper, we propose a Sparse Attention Map and Bin-based Learning method (termed SAMBLE) to learn shape-specific sampling strategies for point cloud shapes. SAMBLE effectively achieves an improved balance between sampling edge points for local details and preserving uniformity in the global shape, resulting in superior performance across multiple common point cloud downstream tasks, even in scenarios with few-point sampling.

[CV-58] DG-DETR: Toward Domain Generalized Detection Transformer

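Quick read: This paper addresses the limited cross-domain generalization of Transformer-based detectors (DETRs), in particular their weak robustness in out-of-distribution (OOD) scenarios. The key to the solution is DG-DETR, a simple, effective, plug-and-play method. Its core is a domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space, combined with a wavelet decomposition that disentangles features into domain-invariant and domain-specific components, so that diverse latent styles can be synthesized while the semantic features of objects are preserved.

Link: https://arxiv.org/abs/2504.19574
Authors: Seongmin Hwang,Daeyoung Han,Moongu Jeon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review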

Abstract:End-to-end Transformer-based detectors (DETRs) have demonstrated strong detection performance. However, domain generalization (DG) research has primarily focused on convolutional neural network (CNN)-based detectors, while paying little attention to enhancing the robustness of DETRs. In this letter, we introduce a Domain Generalized DEtection TRansformer (DG-DETR), a simple, effective, and plug-and-play method that improves out-of-distribution (OOD) robustness for DETRs. Specifically, we propose a novel domain-agnostic query selection strategy that removes domain-induced biases from object queries via orthogonal projection onto the instance-specific style space. Additionally, we leverage a wavelet decomposition to disentangle features into domain-invariant and domain-specific components, enabling synthesis of diverse latent styles while preserving the semantic features of objects. Experimental results validate the effectiveness of DG-DETR. Our code is available at this https URL.
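
To make the orthogonal-projection step concrete, here is a small linear-algebra sketch. It assumes, as one possible reading of the abstract, that the "style space" is the span of a few style vectors and that the bias is removed by subtracting the query's projection onto that span. The helper names and the toy vectors are hypothetical; the paper's actual construction of the style space is not described here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: returns an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def remove_style_component(query, style_vectors):
    """Subtracts from `query` its projection onto span(style_vectors)."""
    residual = list(query)
    for b in orthonormalize(style_vectors):
        coeff = dot(residual, b)
        residual = [ri - coeff * bi for ri, bi in zip(residual, b)]
    return residual


# Toy example: the style space is the x-y plane, so only the z part survives.
query = [3.0, 4.0, 5.0]
style_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = remove_style_component(query, style_vectors)
```

The cleaned query is orthogonal to every style vector, which is the property that makes the result insensitive to changes along those directions.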

[CV-59] Category-Level and Open-Set Object Pose Estimation for Robotics

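Quick read: This paper addresses the challenges of category-level and open-set 6D object pose estimation, in particular how to improve accuracy and generalization when texture, shape, and size are partially or entirely unknown. It does so by comparing datasets, accuracy metrics, and algorithms, analyzing their strengths and weaknesses, and offering actionable recommendations for bridging category-level and open-set pose estimation to reach broader applicability.

Link: https://arxiv.org/abs/2504.19572
Authors: Peter Hönig,Matthias Hirschmanner,Markus Vincze
Affiliations: TU Wien
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at Austrian Robotics Workshop 2025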

Abstract:Object pose estimation enables a variety of tasks in computer vision and robotics, including scene understanding and robotic grasping. The complexity of a pose estimation task depends on the unknown variables related to the target object. While instance-level methods already excel for opaque and Lambertian objects, category-level and open-set methods, where texture, shape, and size are partially or entirely unknown, still struggle with these basic material properties. Since texture is unknown in these scenarios, it cannot be used for disambiguating object symmetries, another core challenge of 6D object pose estimation. The complexity of estimating 6D poses with such a manifold of unknowns led to various datasets, accuracy metrics, and algorithmic solutions. This paper compares datasets, accuracy metrics, and algorithms for solving 6D pose estimation on the category-level. Based on this comparison, we analyze how to bridge category-level and open-set object pose estimation to reach generalization and provide actionable recommendations.

[CV-60] CE-NPBG: Connectivity Enhanced Neural Point-Based Graphics for Novel View Synthesis in Autonomous Driving Scenes CVPR

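Quick read: This paper addresses the limited scalability and rendering quality of point-cloud-based novel view synthesis (NVS) in large-scale autonomous driving scenes, which it traces to a visibility mismatch between the geometry and appearance modalities. The key to the solution is CE-NPBG, which builds a connectivity relationship graph between appearance and geometry, retrieves from the large 3D point cloud map the points observed from the current camera perspective, and uses them for rendering. This markedly improves rendering quality, and using only a small subset of the points improves run time and scalability. The method also associates neural descriptors with the points and trains with a joint adversarial and point rasterization strategy to improve descriptor encoding and rendering quality.

Link: https://arxiv.org/abs/2504.19557
Authors: Mohammad Altillawi,Fengyi Shen,Liudi Yang,Sai Manoj Prakhya,Ziyuan Liu
Affiliations: Huawei Munich Research Center; Technical University of Munich; University of Freiburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)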

Abstract:Current point-based approaches encounter limitations in scalability and rendering quality when using large 3D point cloud maps because using them directly for novel view synthesis (NVS) leads to degraded visualizations. We identify the primary issue behind these low-quality renderings as a visibility mismatch between geometry and appearance, stemming from using these two modalities together. To address this problem, we present CE-NPBG, a new approach for novel view synthesis (NVS) in large-scale autonomous driving scenes. Our method is a neural point-based technique that leverages two modalities: posed images (cameras) and synchronized raw 3D point clouds (LiDAR). We first employ a connectivity relationship graph between appearance and geometry, which retrieves points from a large 3D point cloud map observed from the current camera perspective and uses them for rendering. By leveraging this connectivity, our method significantly improves rendering quality and enhances run-time and scalability by using only a small subset of points from the large 3D point cloud map. Our approach associates neural descriptors with the points and uses them to synthesize views. To enhance the encoding of these descriptors and elevate rendering quality, we propose a joint adversarial and point rasterization training. During training, we pair an image-synthesizer network with a multi-resolution discriminator. At inference, we decouple them and use the image-synthesizer to generate novel views. We also integrate our proposal into the recent 3D Gaussian Splatting work to highlight its benefits for improved rendering and scalability.

[CV-61] DEEMO: De-identity Multimodal Emotion Recognition and Reasoning

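Quick read: This paper addresses the privacy risk created by conventional emotion understanding methods that rely heavily on identity-sensitive information such as facial expressions and speech. The key to the solution is DEEMO, a new task that performs emotion recognition and reasoning from de-identified video and audio, so that emotion can be understood without compromising identity privacy. The authors build the DEEMO-NFBL subset, annotated with non-facial body language, and DEEMO-MER, an instruction dataset for multimodal emotion recognition and reasoning from identity-free cues. They also propose DEEMO-LLaMA, a model that integrates de-identified audio, video, and text and substantially improves emotion recognition and reasoning performance.

Link: https://arxiv.org/abs/2504.19549
Authors: Deng Li,Bohao Xing,Xin Liu,Baiqiang Xia,Bihan Wen,Heikki Kälviäinen
Affiliations: Lappeenranta-Lahti University of Technology LUT; Silo AI; Nanyang Technological University; Brno University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: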

Abstract:Emotion understanding is a critical yet challenging task. Most existing approaches rely heavily on identity-sensitive information, such as facial expressions and speech, which raises concerns about personal privacy. To address this, we introduce the De-identity Multimodal Emotion Recognition and Reasoning (DEEMO), a novel task designed to enable emotion understanding using de-identified video and audio inputs. The DEEMO dataset consists of two subsets: DEEMO-NFBL, which includes rich annotations of Non-Facial Body Language (NFBL), and DEEMO-MER, an instruction dataset for Multimodal Emotion Recognition and Reasoning using identity-free cues. This design supports emotion understanding without compromising identity privacy. In addition, we propose DEEMO-LLaMA, a Multimodal Large Language Model (MLLM) that integrates de-identified audio, video, and textual information to enhance both emotion recognition and reasoning. Extensive experiments show that DEEMO-LLaMA achieves state-of-the-art performance on both tasks, outperforming existing MLLMs by a significant margin, achieving 74.49% accuracy and 74.45% F1-score in de-identity emotion recognition, and 6.20 clue overlap and 7.66 label overlap in de-identity emotion reasoning. Our work contributes to ethical AI by advancing privacy-preserving emotion understanding and promoting responsible affective computing.

[CV-62] Crowd Detection Using Very-Fine-Resolution Satellite Imagery

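Quick read: This paper addresses the limited spatio-temporal coverage of conventional crowd detection (CD) methods, which rely on ground and aerial imagery and cannot easily deliver accurate detection over large areas. The key to the solution is CrowdSat-Net, a point-based convolutional neural network with two new components: the Dual-Context Progressive Attention Network (DCPAN), which improves the feature representation of individuals by aggregating scene context and local individual characteristics, and the High-Frequency Guided Deformable Upsampler (HFGDU), which recovers high-frequency information during upsampling through frequency-domain guided deformable convolutions, improving detection accuracy.

Link: https://arxiv.org/abs/2504.19546
Authors: Tong Xiao,Qunming Wang,Ping Lu,Tenghai Huang,Xiaohua Tong,Peter M. Atkinson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 12 figures, 5 tables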

Abstract:Accurate crowd detection (CD) is critical for public safety and historical pattern analysis, yet existing methods relying on ground and aerial imagery suffer from limited spatio-temporal coverage. The development of very-fine-resolution (VFR) satellite sensor imagery (e.g., ~0.3 m spatial resolution) provides unprecedented opportunities for large-scale crowd activity analysis, but it has never been considered for this task. To address this gap, we proposed CrowdSat-Net, a novel point-based convolutional neural network, which features two innovative components: Dual-Context Progressive Attention Network (DCPAN) to improve feature representation of individuals by aggregating scene context and local individual characteristics, and High-Frequency Guided Deformable Upsampler (HFGDU) that recovers high-frequency information during upsampling through frequency-domain guided deformable convolutions. To validate the effectiveness of CrowdSat-Net, we developed CrowdSat, the first VFR satellite imagery dataset designed specifically for CD tasks, comprising over 120k manually labeled individuals from multi-source satellite platforms (Beijing-3N, Jilin-1 Gaofen-04A and Google Earth) across China. In the experiments, CrowdSat-Net was compared with five state-of-the-art point-based CD methods (originally designed for ground or aerial imagery) using CrowdSat and achieved the largest F1-score of 66.12% and Precision of 73.23%, surpassing the second-best method by 1.71% and 2.42%, respectively. Moreover, extensive ablation experiments validated the importance of the DCPAN and HFGDU modules. Furthermore, cross-regional evaluation further demonstrated the spatial generalizability of CrowdSat-Net. This research advances CD capability by providing both a newly developed network architecture for CD and a pioneering benchmark dataset to facilitate future CD development.

[CV-63] Point2Quad: Generating Quad Meshes from Point Clouds via Face Prediction

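Quick read: This paper addresses the generation of quad-only meshes from point clouds, which is challenging because coplanarity, convexity, and quad-only structure must all be guaranteed. The key to the solution is Point2Quad, which learns to identify quad meshes from fused pointwise and facewise features. It starts with k-NN-based candidate generation that accounts for coplanarity and squareness, then uses two encoders to extract geometric and topological features, incorporating in-depth quadrilateral-specific characteristics. The fused features train a classifier with a purpose-designed compound loss, and a quad-specific post-processing step refines the result.

Link: https://arxiv.org/abs/2504.19545
Authors: Zezeng Li,Zhihui Qi,Weimin Wang,Ziliang Wang,Junyi Duan,Na Lei
Affiliations: Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: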

Abstract:Quad meshes are essential in geometric modeling and computational mechanics. Although learning-based methods for triangle mesh demonstrate considerable advancements, quad mesh generation remains less explored due to the challenge of ensuring coplanarity, convexity, and quad-only meshes. In this paper, we present Point2Quad, the first learning-based method for quad-only mesh generation from point clouds. The key idea is learning to identify quad mesh with fused pointwise and facewise features. Specifically, Point2Quad begins with a k-NN-based candidate generation considering the coplanarity and squareness. Then, two encoders are followed to extract geometric and topological features that address the challenge of quad-related constraints, especially by combining in-depth quadrilaterals-specific characteristics. Subsequently, the extracted features are fused to train the classifier with a designed compound loss. The final results are derived after the refinement by a quad-specific post-processing. Extensive experiments on both clear and noise data demonstrate the effectiveness and superiority of Point2Quad, compared to baseline methods under comprehensive metrics.
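
The coplanarity and convexity constraints mentioned in the abstract can be checked with elementary vector geometry. The sketch below is not Point2Quad's candidate generator: the function names and the tolerance are assumptions, and it only shows what it means for four ordered 3D points to form a coplanar, convex quad.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def is_coplanar(p0, p1, p2, p3, tol=1e-9):
    """True when the four points span a tetrahedron of (near) zero volume."""
    volume6 = dot(cross(sub(p1, p0), sub(p2, p0)), sub(p3, p0))
    return abs(volume6) <= tol


def is_convex_quad(p0, p1, p2, p3):
    """True when the ordered, coplanar quad turns the same way at every corner."""
    pts = [p0, p1, p2, p3]
    normals = []
    for k in range(4):
        a, b, c = pts[k], pts[(k + 1) % 4], pts[(k + 2) % 4]
        normals.append(cross(sub(b, a), sub(c, b)))
    reference = normals[0]
    return all(dot(reference, n) > 0 for n in normals)


square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
dart = [(0, 0, 0), (2, 0, 0), (0.5, 0.5, 0), (0, 2, 0)]  # concave at the third point
lifted = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0.5)]  # fourth point leaves the plane
```

A candidate filter along these lines would keep `square` and discard `dart` (not convex) and `lifted` (not coplanar); how the paper weighs such checks against its learned classifier is not stated in the abstract.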

[CV-64] Adversarial Shallow Watermarking

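Quick read: This paper addresses the weak robustness of existing deep-neural-network-based digital watermarking methods against unknown distortions. These methods typically use an "encoder-noise layer-decoder" architecture and jointly train the encoder and decoder to fit the specific noise layer used during training, which leaves them sensitive to unseen distortions. The key to the solution is Adversarial Shallow Watermarking (ASW), a framework that uses only a shallow decoder that is randomly parameterized and insensitive to distortions. ASW freezes the decoder and adversarially optimizes the host image until the resulting watermarked image stably triggers the decoder to output the watermark message, giving training-free, encoder-free, and noise-layer-free watermark embedding and extraction.

Link: https://arxiv.org/abs/2504.19529
Authors: Guobiao Li,Lei Tan,Yuliang Xue,Gaozhi Liu,Zhenxing Qian,Sheng Li,Xinpeng Zhang
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 12 figures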

Abstract:Recent advances in digital watermarking make use of deep neural networks for message embedding and extraction. They typically follow the "encoder-noise layer-decoder"-based architecture. By deliberately establishing a differentiable noise layer to simulate the distortion of the watermarked signal, they jointly train the deep encoder and decoder to fit the noise layer to guarantee robustness. As a result, they are usually weak against unknown distortions that are not used in their training pipeline. In this paper, we propose a novel watermarking framework to resist unknown distortions, namely Adversarial Shallow Watermarking (ASW). ASW utilizes only a shallow decoder that is randomly parameterized and designed to be insensitive to distortions for watermarking extraction. During the watermark embedding, ASW freezes the shallow decoder and adversarially optimizes a host image until its updated version (i.e., the watermarked image) stably triggers the shallow decoder to output the watermark message. During the watermark extraction, it accurately recovers the message from the watermarked image by leveraging the insensitive nature of the shallow decoder against arbitrary distortions. Our ASW is training-free, encoder-free, and noise layer-free. Experiments indicate that the watermarked images created by ASW have strong robustness against various unknown distortions. Compared to the existing "encoder-noise layer-decoder" approaches, ASW achieves comparable results on known distortions and better robustness on unknown distortions.
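
The embedding loop described above (freeze a random decoder, then optimize the image until the decoder outputs the message) can be sketched in a few lines. Everything here is a deliberately simplified stand-in: the "decoder" is a fixed random linear map rather than a shallow network, the update is a margin-perceptron step rather than the authors' optimizer, and pixel-range and imperceptibility constraints are ignored. All names and parameters are hypothetical.

```python
import random


def make_decoder(num_bits, num_pixels, seed=0):
    """Frozen, randomly parameterized linear 'decoder' (one weight row per bit)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(num_pixels)] for _ in range(num_bits)]


def decode(decoder, image):
    """Reads each bit as the sign of one linear response."""
    return [1 if sum(w * x for w, x in zip(row, image)) > 0 else 0 for row in decoder]


def embed(decoder, host, message, margin=1.0, step=0.01, max_epochs=10_000):
    """Perturbs `host` until the frozen decoder outputs `message` with a margin."""
    image = list(host)
    for _ in range(max_epochs):
        changed = False
        for row, bit in zip(decoder, message):
            target = 1.0 if bit else -1.0
            score = sum(w * x for w, x in zip(row, image))
            if target * score < margin:
                # Only the image moves; the decoder weights are never updated.
                image = [x + step * target * w for x, w in zip(image, row)]
                changed = True
        if not changed:
            break
    return image


decoder = make_decoder(num_bits=8, num_pixels=64)
host_rng = random.Random(1)
host = [host_rng.random() for _ in range(64)]
message = [1, 0, 1, 1, 0, 0, 1, 0]
watermarked = embed(decoder, host, message)
recovered = decode(decoder, watermarked)
```

The margin is what gives this toy version some slack against small perturbations of the watermarked image; the paper attributes robustness to the decoder's designed insensitivity to distortions, which this sketch does not model.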

[CV-65] LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning

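Quick read: This paper addresses several challenges in Industrial Anomaly Detection (IAD): the dependence of traditional methods on large datasets and their poor scalability; the reliance of existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) on mask annotations, which raises implementation cost and false-positive rates; and the severe class imbalance in industrial datasets such as MVTec-AD and VisA. The key to the solution is a dynamic-priority reward function that handles class imbalance, together with a mask-free reasoning framework that combines Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) to detect anomalies directly from raw images and generate interpretable, step-by-step defect localization.

Link: https://arxiv.org/abs/2504.19524
Authors: Peijian Zeng,Feiyan Pang,Zhanbo Wang,Aimin Yang
Affiliations: Guangdong University of Technology; Lingnan Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages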

Abstract:Industrial Anomaly Detection (IAD) is critical for ensuring product quality by identifying defects. Traditional methods such as feature embedding and reconstruction-based approaches require large datasets and struggle with scalability. Existing vision-language models (VLMs) and Multimodal Large Language Models (MLLMs) address some limitations but rely on mask annotations, leading to high implementation costs and false positives. Additionally, industrial datasets like MVTec-AD and VisA suffer from severe class imbalance, with defect samples constituting only 23.8% and 11.1% of total data respectively. To address these challenges, we propose a reward function that dynamically prioritizes rare defect patterns during training to handle class imbalance. We also introduce a mask-free reasoning framework using Chain of Thought (CoT) and Group Relative Policy Optimization (GRPO) mechanisms, enabling anomaly detection directly from raw images without annotated masks. This approach generates interpretable step-by-step explanations for defect localization. Our method achieves state-of-the-art performance, outperforming prior approaches by 36% in accuracy on MVTec-AD and 16% on VisA. By eliminating mask dependency and reducing costs while providing explainable outputs, this work advances industrial anomaly detection and supports scalable quality control in manufacturing. Code to reproduce the experiment is available at this https URL.
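
For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same input. The sketch below shows only that group-relative normalization; the paper's reward function that prioritizes rare defect patterns is not specified in the abstract and is not reproduced here. The function name, the `eps` term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalizes each reward against the mean and std of its sampling group."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]


# Four hypothetical responses to one image: two judged correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# When every response earns the same reward, no response is preferred.
no_signal = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.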

[CV-66] FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding

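Quick read: This paper addresses the lack of comprehensive annotations for both technical and artistic evaluation in existing figure skating datasets, and the fact that current sports research concentrates on ball games and pays little attention to artistic sports. The key to the solution is the FSAnno dataset, which includes an open-access dataset for model training and testing and a benchmark dataset, FSBench, for fair evaluation. FSBench covers text and multimodal motion data and supports tasks from technical analysis to performance commentary.

Link: https://arxiv.org/abs/2504.19514
Authors: Rong Gao,Xin Liu,Zhuozhao Hu,Bohao Xing,Baiqiang Xia,Zitong Yu,Heikki Kälviäinen
Affiliations: Lappeenranta-Lahti University of Technology LUT; Tianjin University; AMD Silo AI; Great Bay University; Rensselaer Polytechnic Institute; Brno University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: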

Abstract:Figure skating, known as the “Art on Ice,” is among the most artistic sports, challenging to understand due to its blend of technical elements (like jumps and spins) and overall artistic expression. Existing figure skating datasets mainly focus on single tasks, such as action recognition or scoring, lacking comprehensive annotations for both technical and artistic evaluation. Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. FSAnno includes an open-access training and test dataset, alongside a benchmark dataset, FSBench, for fair model evaluation. FSBench consists of FSBench-Text, with multiple-choice questions and explanations, and FSBench-Motion, containing multimodal data and Question and Answer (QA) pairs, supporting tasks from technical analysis to performance commentary. Initial tests on FSBench reveal significant limitations in existing models’ understanding of artistic sports. We hope FSBench will become a key tool for evaluating and enhancing model comprehension of figure skating.

[CV-67] SynergyAmodal: Deocclude Anything with Text Control

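Quick read: This paper addresses the scarcity of high-quality data for image deocclusion, in particular data that balance diversity, plausibility, and fidelity. The key to the solution is SynergyAmodal, a framework built on a tripartite data-human-model collaboration that combines the diversity of in-the-wild image data, the plausibility judgments of human experts, and the fidelity of generative priors to co-synthesize the complete shape and appearance of occluded instances.

Link: https://arxiv.org/abs/2504.19506
Authors: Xinyang Li,Chengjie Yi,Jiawei Lai,Mingbao Lin,Yansong Qu,Shengchuan Zhang,Liujuan Cao
Affiliations: Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages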

Abstract:Image deocclusion (or amodal completion) aims to recover the invisible regions (i.e., shape and appearance) of occluded instances in images. Despite recent advances, the scarcity of high-quality data that balances diversity, plausibility, and fidelity remains a major obstacle. To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. First, we design an occlusion-grounded self-supervised learning algorithm to harness the diversity of in-the-wild image data, fine-tuning an inpainting diffusion model into a partial completion diffusion model. Second, we establish a co-synthesis pipeline to iteratively filter, refine, select, and annotate the initial deocclusion results of the partial completion diffusion model, ensuring plausibility and fidelity through human expert guidance and prior model constraints. This pipeline generates a high-quality paired amodal dataset with extensive category and scale diversity, comprising approximately 16K pairs. Finally, we train a full completion diffusion model on the synthesized dataset, incorporating text prompts as conditioning signals. Extensive experiments demonstrate the effectiveness of our framework in achieving zero-shot generalization and textual controllability. Our code, dataset, and models will be made publicly available at this https URL.

[CV-68] CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design

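【Quick read】: This paper addresses unnatural 3D object layouts and object intersections in indoor scene synthesis, as well as the limited realism of scenes generated by existing methods. The key to its solution is to represent the 3D objects in a scene with decomposed cuboid primitives and to arrange these cuboids sequentially with an autoregressive model, producing physically plausible and compact scenes. In addition, applying rejection sampling during fine-tuning to filter out scenes with object collisions further reduces intersections and improves scene quality.

Link: https://arxiv.org/abs/2504.19478
Authors: Weitao Feng, Hang Zhou, Jing Liao, Li Cheng, Wenbo Zhou
Affiliations: University of Science and Technology of China; University of Alberta; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: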

Abstract:We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise present in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.
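
The collision-based rejection step described in the abstract can be illustrated with a minimal sketch. It assumes axis-aligned cuboids and a small volume tolerance; the `Cuboid` type, the function names, and the tolerance are hypothetical and are not taken from the paper, whose cuboids may be oriented and whose collision test may differ.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cuboid:
    """Axis-aligned cuboid (hypothetical type, assumed for illustration)."""
    lo: tuple       # (x, y, z) minimum corner
    hi: tuple       # (x, y, z) maximum corner
    object_id: int  # index of the object this primitive belongs to


def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned cuboids (0 if disjoint)."""
    volume = 1.0
    for axis in range(3):
        extent = min(a.hi[axis], b.hi[axis]) - max(a.lo[axis], b.lo[axis])
        if extent <= 0:
            return 0.0
        volume *= extent
    return volume


def has_collision(scene, tolerance=1e-6):
    """True if cuboids of two different objects overlap by more than tolerance."""
    return any(
        a.object_id != b.object_id and overlap_volume(a, b) > tolerance
        for a, b in combinations(scene, 2)
    )


def reject_colliding(scenes):
    """Keep only the sampled scenes that contain no inter-object collision."""
    return [scene for scene in scenes if not has_collision(scene)]
```

Cuboids that belong to the same object are allowed to touch or overlap here, since several primitives jointly describe one piece of furniture; only overlaps between different objects count as collisions.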

[CV-69] Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video CVPR

【Quick read】: This paper addresses the lack of accessible frameworks and pre-trained weights for vision mechanistic interpretability research, which has hindered progress in the field. The key to its solution is Prisma, an open-source framework that provides a unified toolkit covering access to 75+ vision and video transformers, sparse autoencoder (SAE) training, activation caching, circuit analysis tools, and 80+ pre-trained SAE weights, thereby accelerating vision mechanistic interpretability research and lowering the barrier to entry.

Link: https://arxiv.org/abs/2504.19475
Authors: Sonia Joseph, Praneet Suresh, Lorenz Hufe, Edward Stevinson, Robert Graham, Yash Vadi, Danilo Bzdok, Sebastian Lapuschkin, Lee Sharkey, Blake Aaron Richards
Affiliations: Mila Quebec; McGill University; Meta; Fraunhofer HHI; Imperial College London; Université de Montréal; Apollo Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures, 9 tables. Oral and Tutorial at the CVPR Mechanistic Interpretability for Vision (MIV) Workshop

Abstract:Robust tooling and publicly available pre-trained models have helped drive recent advances in mechanistic interpretability for language models. However, similar progress in vision mechanistic interpretability has been hindered by the lack of accessible frameworks and pre-trained weights. We present Prisma (Access the codebase here: this https URL), an open-source framework designed to accelerate vision mechanistic interpretability research, providing a unified toolkit for accessing 75+ vision and video transformers; support for sparse autoencoder (SAE), transcoder, and crosscoder training; a suite of 80+ pre-trained SAE weights; activation caching, circuit analysis tools, and visualization tools; and educational resources. Our analysis reveals surprising findings, including that effective vision SAEs can exhibit substantially lower sparsity patterns than language SAEs, and that in some instances, SAE reconstructions can decrease model loss. Prisma enables new research directions for understanding vision model internals while lowering barriers to entry in this emerging field.

[CV-70] Masked Language Prompting for Generative Data Augmentation in Few-shot Fashion Style Recognition

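【Quick read】: This paper addresses the challenge of constructing datasets for fashion style recognition, in particular the difficulty of balancing data diversity and style consistency given the subjectivity and ambiguity of style concepts. The key to its solution is a new prompting strategy called Masked Language Prompting (MLP), which masks selected words in a reference caption and uses a large language model to generate semantically coherent completions. This preserves the structural semantics of the original caption while introducing attribute-level variations, enabling style-consistent and diverse image generation without fine-tuning.

Link: https://arxiv.org/abs/2504.19455
Authors: Yuki Hirakawa, Ryotaro Shimizu
Affiliations: ZOZO Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: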

Abstract:Constructing datasets for fashion style recognition is challenging due to the inherent subjectivity and ambiguity of style concepts. Recent advances in text-to-image models have facilitated generative data augmentation by synthesizing images from labeled data, yet existing methods based solely on class names or reference captions often fail to balance visual diversity and style consistency. In this work, we propose Masked Language Prompting (MLP), a novel prompting strategy that masks selected words in a reference caption and leverages large language models to generate diverse yet semantically coherent completions. This approach preserves the structural semantics of the original caption while introducing attribute-level variations aligned with the intended style, enabling style-consistent and diverse image generation without fine-tuning. Experimental results on the FashionStyle14 dataset demonstrate that our MLP-based augmentation consistently outperforms class-name and caption-based baselines, validating its effectiveness for fashion style recognition under limited supervision.
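
The masking step can be sketched as follows. The abstract does not say how words are selected, so this sketch picks them uniformly at random; the `[MASK]` token, the mask ratio, and the prompt wording are likewise assumptions made for illustration, and no language model is actually called.

```python
import random


def mask_caption(caption, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Replace a random subset of the caption's words with a mask token."""
    words = caption.split()
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n_mask = max(1, round(len(words) * mask_ratio))
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    )


def build_completion_prompt(masked_caption, style, mask_token="[MASK]"):
    """Compose an instruction asking a language model to fill in the masks."""
    return (
        f"Fill in each {mask_token} so that the caption describes an outfit "
        f"in the '{style}' style. Keep all other words unchanged.\n"
        f"Caption: {masked_caption}"
    )
```

The unmasked words carry the structure of the reference caption, while each completion of the masked slots yields a new caption that could then be passed to a text-to-image model.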

[CV-71] CLIP-KOA: Enhancing Knee Osteoarthritis Diagnosis with Multi-Modal Learning and Symmetry-Aware Loss Functions

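【Quick read】: This paper addresses insufficient diagnostic consistency in knee osteoarthritis (KOA) diagnosis caused by the high inter-observer variability and subjectivity of the Kellgren and Lawrence (KL) grading system. The key to its solution is a CLIP-based framework (CLIP-KOA) that integrates image and text information and introduces a Symmetry Loss and a Consistency Loss to improve the consistency and reliability of KOA grade prediction.

Link: https://arxiv.org/abs/2504.19443
Authors: Yejin Jeong, Donghun Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures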

Abstract:Knee osteoarthritis (KOA) is a common chronic musculoskeletal disorder worldwide, making early diagnosis crucial. Currently, the Kellgren and Lawrence (KL) grading system is widely used to assess KOA severity. However, its high inter-observer variability and subjectivity hinder diagnostic consistency. To address these limitations, automated diagnostic techniques using deep learning have been actively explored in recent years. In this study, we propose a CLIP-based framework (CLIP-KOA) to enhance the consistency and reliability of KOA grade prediction. To achieve this, we introduce a learning approach that integrates image and text information and incorporate Symmetry Loss and Consistency Loss to ensure prediction consistency between the original and flipped images. CLIP-KOA achieves state-of-the-art accuracy of 71.86% on the KOA severity prediction task, and ablation studies show that CLIP-KOA has a 2.36% improvement in accuracy over the standard CLIP model due to our contribution. This study shows a novel direction for data-driven medical prediction, not only to improve the reliability of fine-grained diagnosis but also to explore multimodal methods for medical image analysis. Our code is available at this https URL.
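
The abstract does not define the Symmetry Loss or the Consistency Loss, so the following is only a generic sketch of the underlying idea: penalizing disagreement between the class distributions predicted for an image and for its mirrored copy. The symmetric KL divergence used here is an assumption chosen for illustration and should not be read as the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def flip_consistency_penalty(logits_original, logits_flipped):
    """Symmetric KL between predictions for an image and its flipped copy."""
    p = softmax(logits_original)
    q = softmax(logits_flipped)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

The penalty is zero when both views receive the same grade distribution and grows as the predictions diverge, so adding it to a classification loss pushes the model toward flip-invariant grades.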

[CV-72] EarthMapper: Visual Autoregressive Models for Controllable Bidirectional Satellite-Map Translation

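【Quick read】: This paper addresses bidirectional satellite-map translation (BSMT), a task with two major challenges: the lack of precise pixel-wise alignment between the two modalities, and the need to achieve both high-level abstraction of geographic features and high-quality visual synthesis. The key to its solution is the EarthMapper framework, whose core components are geographic coordinate embeddings that ensure region-specific adaptability and multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process, which unifies bidirectional translation in a single training cycle. In addition, a semantic infusion (SI) mechanism improves feature-level consistency, and a key point adaptive guidance (KPAG) mechanism balances diversity and precision during inference.

Link: https://arxiv.org/abs/2504.19432
Authors: Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, Yanfeng Gu
Affiliations: Harbin Institute of Technology; Peng Cheng Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: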

Abstract:Satellite imagery and maps, as two fundamental data modalities in remote sensing, offer direct observations of the Earth’s surface and human-interpretable geographic abstractions, respectively. The task of bidirectional translation between satellite images and maps (BSMT) holds significant potential for applications in urban planning and disaster response. However, this task presents two major challenges: first, the absence of precise pixel-wise alignment between the two modalities substantially complicates the translation process; second, it requires achieving both high-level abstraction of geographic features and high-quality visual synthesis, which further elevates the technical complexity. To address these limitations, we introduce EarthMapper, a novel autoregressive framework for controllable bidirectional satellite-map translation. EarthMapper employs geographic coordinate embeddings to anchor generation, ensuring region-specific adaptability, and leverages multi-scale feature alignment within a geo-conditioned joint scale autoregression (GJSA) process to unify bidirectional translation in a single training cycle. A semantic infusion (SI) mechanism is introduced to enhance feature-level consistency, while a key point adaptive guidance (KPAG) mechanism is proposed to dynamically balance diversity and precision during inference. We further contribute CNSatMap, a large-scale dataset comprising 302,132 precisely aligned satellite-map pairs across 38 Chinese cities, enabling robust benchmarking. Extensive experiments on CNSatMap and the New York dataset demonstrate EarthMapper’s superior performance, achieving significant improvements in visual realism, semantic consistency, and structural fidelity over state-of-the-art methods. Additionally, EarthMapper excels in zero-shot tasks like in-painting, out-painting and coordinate-conditional generation, underscoring its versatility.

[CV-73] A Real-Time Event-Based Normal Flow Estimator

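【Quick read】: This paper addresses real-time, asynchronous normal flow estimation for event cameras. The key to its solution is to exploit the fact that event coordinates are integers and reformulate the adjacency-matrix multiplication of the original method as a pooling operation, which substantially lowers computational cost and makes normal flow prediction more efficient. This optimization lets the method run in real time on a GPU, processing millions of normal flows per second.

Link: https://arxiv.org/abs/2504.19417
Authors: Dehao Yuan, Cornelia Fermüller
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: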

Abstract:This paper presents a real-time, asynchronous, event-based normal flow estimator. It follows the same algorithm as Learning Normal Flow Directly From Event Neighborhoods, but with a more optimized implementation. The original method treats event slices as 3D point clouds, encodes each event’s local geometry into a fixed-length vector, and uses a multi-layer perceptron to predict normal flow. It constructs representations by multiplying an adjacency matrix with a feature matrix, resulting in quadratic time complexity with respect to the number of events. In contrast, we leverage the fact that event coordinates are integers and reformulate the representation step as a pooling operation. This achieves the same effect as the adjacency matrix but with much lower computational cost. As a result, our method supports real-time normal flow prediction on event cameras. Our estimator uses 1 GB of CUDA memory and runs at 4 million normal flows per second on an RTX 3070, or 6 million per second on an RTX A5000. We release the CUDA implementation along with a Python interface at this https URL.
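
The equivalence between the adjacency-matrix product and pooling can be seen in a simplified setting. The sketch below assumes 2D integer event coordinates, scalar features, a square neighborhood, and sum aggregation; the actual estimator works on event slices with learned encodings, and all function names here are hypothetical.

```python
def neighborhood_sum_adjacency(events, features, radius):
    """Quadratic baseline, equivalent to A @ F where A[i][j] = 1 when
    event j lies inside the square window centered on event i."""
    out = []
    for xi, yi in events:
        total = 0.0
        for (xj, yj), feature in zip(events, features):
            if abs(xi - xj) <= radius and abs(yi - yj) <= radius:
                total += feature
        out.append(total)
    return out


def neighborhood_sum_pooling(events, features, radius, width, height):
    """Scatter features onto the integer grid, then sum-pool around each event."""
    grid = [[0.0] * width for _ in range(height)]
    for (x, y), feature in zip(events, features):
        grid[y][x] += feature  # integer coordinates index the grid directly

    # Summed-area table: any box sum becomes four lookups.
    table = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        running = 0.0
        for x in range(width):
            running += grid[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + running

    out = []
    for x, y in events:
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        out.append(table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0])
    return out
```

For N events on a W x H sensor, the first function costs O(N^2) while the second costs O(N + W·H), which is the kind of saving the abstract attributes to replacing the adjacency matrix with pooling.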

[CV-74] GMAR: Gradient-Driven Multi-Head Attention Rollout for Vision Transformer Interpretability

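【Quick read】: This paper addresses the interpretability of multi-head attention in the Vision Transformer (ViT), in particular the limitation of existing interpretability methods that overlook the observation that "not all attention heads are equally meaningful." The key to its solution is Gradient-Driven Multi-Head Attention Rollout (GMAR), which quantifies the importance of each attention head with gradient-based scores and normalizes them to obtain a weighted aggregate attention score, revealing each head's contribution to the prediction more precisely.

Link: https://arxiv.org/abs/2504.19414
Authors: Sehyeong Jo, Gangjae Jang, Haesol Park
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: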

Abstract:The Vision Transformer (ViT) has made significant advancements in computer vision, utilizing self-attention mechanisms to achieve state-of-the-art performance across various tasks, including image classification, object detection, and segmentation. Its architectural flexibility and capabilities have made it a preferred choice among researchers and practitioners. However, the intricate multi-head attention mechanism of ViT presents significant challenges to interpretability, as the underlying prediction process remains opaque. A critical limitation arises from an observation commonly noted in transformer architectures: “Not all attention heads are equally meaningful.” Overlooking the relative importance of specific heads highlights the limitations of existing interpretability methods. To address these challenges, we introduce Gradient-Driven Multi-Head Attention Rollout (GMAR), a novel method that quantifies the importance of each attention head using gradient-based scores. These scores are normalized to derive a weighted aggregate attention score, effectively capturing the relative contributions of individual heads. GMAR clarifies the role of each head in the prediction process, enabling more precise interpretability at the head level. Experimental results demonstrate that GMAR consistently outperforms traditional attention rollout techniques. This work provides a practical contribution to transformer-based architectures, establishing a robust framework for enhancing the interpretability of Vision Transformer models.
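
A minimal sketch of head-weighted attention rollout is given below. It takes the per-head gradient scores as given inputs, normalizes them with an L1 norm, and then applies the standard rollout recursion with a 0.5 residual mix; the choice of normalization and all function names are assumptions for illustration, and the way the paper derives its scores is not reproduced here.

```python
def identity(n):
    """n x n identity matrix as a list of lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def normalize_head_scores(scores):
    """L1-normalize per-head score magnitudes so the weights sum to 1."""
    magnitudes = [abs(s) for s in scores]
    total = sum(magnitudes)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)  # fall back to a plain mean
    return [m / total for m in magnitudes]


def gradient_weighted_rollout(attentions, head_scores):
    """attentions[l][h] is an N x N row-stochastic attention matrix for
    layer l and head h; head_scores[l][h] is that head's importance score."""
    n = len(attentions[0][0])
    rollout = identity(n)
    for layer_attention, layer_scores in zip(attentions, head_scores):
        weights = normalize_head_scores(layer_scores)
        # Weighted average over heads instead of the usual uniform mean.
        fused = [[sum(w * head[i][j] for w, head in zip(weights, layer_attention))
                  for j in range(n)] for i in range(n)]
        # Mix in the identity to account for the residual connection.
        mixed = [[0.5 * fused[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout
```

With equal scores for every head this reduces to ordinary attention rollout, which makes the effect of the gradient weighting easy to isolate in a comparison.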

[CV-75] UNet with Axial Transformer: A Neural Weather Model for Precipitation Nowcasting

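【Quick read】: This paper addresses accurate localized weather prediction, especially for extreme weather events that evolve on hourly timescales (such as thunderstorms), with the goal of replacing existing numerical weather models and data assimilation systems with deep learning approaches. The key to its solution is a Transformer-based machine learning method that uses axial attention mechanisms to learn complex patterns and dynamics from time series frames, enabling high-resolution, near-instantaneous precipitation forecasts. The framework is generic and can be applied to univariate and multivariate time series data as well as time series embeddings.

Link: https://arxiv.org/abs/2504.19408
Authors: Maitreya Sonawane, Sumit Mamtani
Affiliations: New York University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: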

Abstract:Making accurate weather predictions can be particularly challenging for localized storms or events that evolve on hourly timescales, such as thunderstorms. Hence, our goal for the project was to model weather nowcasting, making highly localized and accurate predictions for the immediate future and replacing the current numerical weather models and data assimilation systems with deep learning approaches. A significant advantage of machine learning is that inference is computationally cheap given an already-trained model, allowing forecasts that are nearly instantaneous and in the native high resolution of the input data. In this work we developed a novel method that employs Transformer-based machine learning models to forecast precipitation. This approach works by leveraging axial attention mechanisms to learn complex patterns and dynamics from time series frames. Moreover, it is a generic framework and can be applied to univariate and multivariate time series data, as well as time series embeddings data. This paper represents initial research on the dataset used in the domain of next-frame prediction, and hence we demonstrate state-of-the-art results in terms of the metrics used for the given dataset (PSNR = 47.67, SSIM = 0.9943) using UNet with Axial Transformer.
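
Axial attention factorizes 2D self-attention into one pass along each row followed by one pass along each column. The sketch below shows that factorization on an H x W grid of feature vectors; it uses a single head with identity query, key, and value projections, which is a simplification assumed for illustration and not the architecture used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(sequence):
    """Single-head self-attention over a list of vectors (identity Q, K, V)."""
    dim = len(sequence[0])
    out = []
    for query in sequence:
        weights = softmax([
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ])
        out.append([
            sum(w * value[c] for w, value in zip(weights, sequence))
            for c in range(dim)
        ])
    return out


def axial_attention(grid):
    """Attend along every row, then along every column, of an H x W x C grid."""
    height, width = len(grid), len(grid[0])
    rows_done = [attend(row) for row in grid]
    columns = [attend([rows_done[y][x] for y in range(height)])
               for x in range(width)]
    return [[columns[x][y] for x in range(width)] for y in range(height)]
```

Each position attends to W + H others instead of all W·H, so the cost drops from O((HW)^2) to O(HW(H + W)), which is what makes attention affordable on high-resolution frames.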

[CV-76] Boosting 3D Liver Shape Datasets with Diffusion Models and Implicit Neural Representations

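【Quick read】: This paper addresses the disorganization and artifacts found in currently available 3D liver shape datasets, which limit the development and training of robust models, particularly for accurate 3D reconstruction. It proposes a solution that combines diffusion models with implicit neural representations (INRs); the key is to use the generative capabilities of diffusion models to create realistic and diverse 3D liver shapes, expanding existing datasets and alleviating data scarcity. Experimental results indicate that the method increases dataset diversity and offers a scalable solution for 3D liver reconstruction and generation in medical applications.

Link: https://arxiv.org/abs/2504.19402
Authors: Khoa Tuan Nguyen, Francesca Tozzi, Wouter Willaert, Joris Vankerschaver, Nikdokht Rashidian, Wesley De Neve
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: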

Abstract:While the availability of open 3D medical shape datasets is increasing, offering substantial benefits to the research community, we have found that many of these datasets are, unfortunately, disorganized and contain artifacts. These issues limit the development and training of robust models, particularly for accurate 3D reconstruction tasks. In this paper, we examine the current state of available 3D liver shape datasets and propose a solution using diffusion models combined with implicit neural representations (INRs) to augment and expand existing datasets. Our approach utilizes the generative capabilities of diffusion models to create realistic, diverse 3D liver shapes, capturing a wide range of anatomical variations and addressing the problem of data scarcity. Experimental results indicate that our method enhances dataset diversity, providing a scalable solution to improve the accuracy and reliability of 3D liver reconstruction and generation in medical applications. Finally, we suggest that diffusion models can also be applied to other downstream tasks in 3D medical imaging.

[CV-77] Dynamic Arthroscopic Navigation System for Anterior Cruciate Ligament Reconstruction Based on Multi-level Memory Architecture

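【Quick read】: This paper addresses insufficient dynamic tracking accuracy in anterior cruciate ligament (ACL) reconstruction surgery, in particular the stability of navigation under viewpoint changes, instrument occlusion, and tissue deformation. The key to its solution is a multi-level memory architecture based on the Atkinson-Shiffrin memory model (sensory memory, working memory, and long-term memory), which maintains continuous tracking of the femoral condyle and runs in real time on standard arthroscopic equipment without additional tracking hardware, significantly improving the accuracy and robustness of surgical navigation.

Link: https://arxiv.org/abs/2504.19398
Authors: Shuo Wang, Weili Shi, Shuai Yang, Jiahao Cui, Qinwei Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 13 figures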

Abstract:This paper presents a dynamic arthroscopic navigation system based on multi-level memory architecture for anterior cruciate ligament (ACL) reconstruction surgery. The system extends our previously proposed markerless navigation method from static image matching to dynamic video sequence tracking. By integrating the Atkinson-Shiffrin memory model’s three-level architecture (sensory memory, working memory, and long-term memory), our system maintains continuous tracking of the femoral condyle throughout the surgical procedure, providing stable navigation support even in complex situations involving viewpoint changes, instrument occlusion, and tissue deformation. Unlike existing methods, our system operates in real-time on standard arthroscopic equipment without requiring additional tracking hardware, achieving 25.3 FPS with a latency of only 39.5 ms, representing a 3.5-fold improvement over our previous static system. For extended sequences (1000 frames), the dynamic system maintained an error of 5.3 ± 1.5 pixels, compared to the static system’s 12.6 ± 3.7 pixels - an improvement of approximately 45 percent. For medium-length sequences (500 frames) and short sequences (100 frames), the system achieved approximately 35 percent and 19 percent accuracy improvements, respectively. Experimental results demonstrate the system overcomes limitations of traditional static matching methods, providing new technical support for improving surgical precision in ACL reconstruction.

[CV-78] HumMorph: Generalized Dynamic Human Neural Fields from Few Views

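【Quick read】: This paper addresses free-viewpoint rendering of dynamic human bodies with explicit pose control when only a few observed views (possibly just one) are available. The key to its solution is to construct a coarse representation of the actor in the canonical T-pose, which fills in missing information using learned priors, and to complement it with fine-grained pixel-aligned features extracted from the observed views, which provide high-resolution appearance detail. The method achieves fast inference by relying only on feed-forward passes, delivers significantly better visual quality with just two monocular observations, and is more robust in the practical setting where body parameters are estimated with noise.

Link: https://arxiv.org/abs/2504.19390
Authors: Jakub Zadrożny, Hakan Bilen
Affiliations: University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL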

Abstract:We introduce HumMorph, a novel generalized approach to free-viewpoint rendering of dynamic human bodies with explicit pose control. HumMorph renders a human actor in any specified pose given a few observed views (starting from just one) in arbitrary poses. Our method enables fast inference as it relies only on feed-forward passes through the model. We first construct a coarse representation of the actor in the canonical T-pose, which combines visual features from individual partial observations and fills missing information using learned prior knowledge. The coarse representation is complemented by fine-grained pixel-aligned features extracted directly from the observed views, which provide high-resolution appearance information. We show that HumMorph is competitive with the state-of-the-art when only a single input view is available; however, we achieve results with significantly better visual quality given just 2 monocular observations. Moreover, previous generalized methods assume access to accurate body shape and pose parameters obtained using synchronized multi-camera setups. In contrast, we consider a more practical scenario where these body parameters are noisily estimated directly from the observed views. Our experimental results demonstrate that our architecture is more robust to errors in the noisy parameters and clearly outperforms the state of the art in this setting.

[CV-79] Mitigating Bias in Facial Recognition Systems: Centroid Fairness Loss Optimization NEURIPS2024 ICPR2024

【Quick read】: This paper addresses unfairness in Facial Recognition (FR) systems with respect to sensitive attributes (such as gender, ethnicity, and age), namely disparities in error rates across population groups, which regulatory authorities have judged unacceptable. The key to its solution is a new post-processing approach that improves the fairness of pre-trained FR models by optimizing a regression loss acting on centroid-based scores, while preserving global accuracy.

Link: https://arxiv.org/abs/2504.19370
Authors: Jean-Rémy Conti, Stéphan Clémençon
Affiliations: LTCI, Télécom Paris; Institut Polytechnique de Paris
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at both the AFME and RegML Workshops at NeurIPS 2024. A preliminary version has been accepted for publication by Springer Nature, in the context of the ICPR 2024 conference

Abstract:The urgent societal demand for fair AI systems has put pressure on the research community to develop predictive models that are not only globally accurate but also meet new fairness criteria, reflecting the lack of disparate mistreatment with respect to sensitive attributes (e.g., gender, ethnicity, age). In particular, the variability of the errors made by certain Facial Recognition (FR) systems across specific segments of the population compromises the deployment of the latter, and was judged unacceptable by regulatory authorities. Designing fair FR systems is a very challenging problem, mainly due to the complex and functional nature of the performance measure used in this domain (i.e., ROC curves) and because of the huge heterogeneity of the face image datasets usually available for training. In this paper, we propose a novel post-processing approach to improve the fairness of pre-trained FR models by optimizing a regression loss which acts on centroid-based scores. Beyond the computational advantages of the method, we present numerical experiments providing strong empirical evidence of the gain in fairness and of the ability to preserve global accuracy.

[CV-80] MERA: Multimodal and Multiscale Self-Explanatory Model with Considerably Reduced Annotation for Lung Nodule Diagnosis

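【Quick read】: This paper addresses the weak performance of Explainable Artificial Intelligence (XAI) systems for lung nodule diagnosis when annotated data is limited, in particular their difficulty in providing clear, comprehensive explanations of decisions. The key to its solution is the MERA model, which combines self-supervised learning with a Vision Transformer architecture for unsupervised feature extraction and makes hierarchical predictions from sparse annotations via semi-supervised active learning in the learned latent space, substantially reducing the dependence on annotated data while providing multilevel self-explanation.

Link: https://arxiv.org/abs/2504.19357
Authors: Jiahao Lu, Chong Yin, Silvia Ingala, Kenny Erleben, Michael Bachmann Nielsen, Sune Darkner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: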

Abstract:Lung cancer, a leading cause of cancer-related deaths globally, emphasises the importance of early detection for better patient outcomes. Pulmonary nodules, often early indicators of lung cancer, necessitate accurate, timely diagnosis. Despite Explainable Artificial Intelligence (XAI) advances, many existing systems struggle to provide clear, comprehensive explanations, especially with limited labelled data. This study introduces MERA, a Multimodal and Multiscale self-Explanatory model designed for lung nodule diagnosis with considerably Reduced Annotation requirements. MERA integrates unsupervised and weakly supervised learning strategies (self-supervised learning techniques and Vision Transformer architecture for unsupervised feature extraction) and a hierarchical prediction mechanism leveraging sparse annotations via semi-supervised active learning in the learned latent space. MERA explains its decisions on multiple levels: model-level global explanations via semantic latent space clustering, instance-level case-based explanations showing similar instances, local visual explanations via attention maps, and concept explanations using critical nodule attributes. Evaluations on the public LIDC dataset show MERA’s superior diagnostic accuracy and self-explainability. With only 1% annotated samples, MERA achieves diagnostic accuracy comparable to or exceeding state-of-the-art methods requiring full annotation. The model’s inherent design delivers comprehensive, robust, multilevel explanations aligned closely with clinical practice, enhancing trustworthiness and transparency. The demonstrated viability of unsupervised and weakly supervised learning lowers the barrier to deploying diagnostic AI in broader medical domains. Our complete code is open source and available at this https URL.

[CV-81] Improving Small Drone Detection Through Multi-Scale Processing and Data Augmentation IJCNN

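【Quick read】: This paper addresses the detection of small drones, which are often hard to distinguish from birds, in complex environments. The key to its solution is a multi-scale approach built on the medium-sized YOLOv11 object detection model: the input image is processed both as a whole and in tiles, and the predictions are aggregated. Copy-paste data augmentation enriches the training set, and a post-processing step based on frame-to-frame consistency reduces missed detections. Together these improve small-object detection in complex scenes.

Link: https://arxiv.org/abs/2504.19347
Authors: Rayson Laroca, Marcelo dos Santos, David Menotti
Affiliations: Pontifical Catholic University of Paraná; Federal University of Paraná
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for presentation at the International Joint Conference on Neural Networks (IJCNN) 2025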

Abstract:Detecting small drones, often indistinguishable from birds, is crucial for modern surveillance. This work introduces a drone detection methodology built upon the medium-sized YOLOv11 object detection model. To enhance its performance on small targets, we implemented a multi-scale approach in which the input image is processed both as a whole and in segmented parts, with subsequent prediction aggregation. We also utilized a copy-paste data augmentation technique to enrich the training dataset with diverse drone and bird examples. Finally, we implemented a post-processing technique that leverages frame-to-frame consistency to mitigate missed detections. The proposed approach attained a top-3 ranking in the 8th WOSDETC Drone-vs-Bird Detection Grand Challenge, held at the 2025 International Joint Conference on Neural Networks (IJCNN), showcasing its capability to detect drones in complex environments effectively.
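
The whole-image-plus-tiles scheme can be sketched without any detector: tile the frame, shift each tile's boxes back into image coordinates, and merge everything. The abstract does not say how predictions are aggregated, so the non-maximum suppression used here, the tile size, and all function names are assumptions made for illustration. Boxes are `(x1, y1, x2, y2, score)` tuples.

```python
def tile_origins(length, tile, step):
    """Start offsets along one axis so that tiles cover the full length."""
    origins = list(range(0, max(length - tile, 0) + 1, step))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # extra tile flush with the far edge
    return origins


def make_tiles(width, height, tile=640, overlap=0):
    """Top-left corners of square tiles covering the image (overlap < tile)."""
    step = tile - overlap
    return [(x, y)
            for y in tile_origins(height, tile, step)
            for x in tile_origins(width, tile, step)]


def to_image_coords(box, offset):
    """Shift a box predicted inside a tile back into full-image coordinates."""
    x1, y1, x2, y2, score = box
    ox, oy = offset
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy, score)


def iou(a, b):
    """Intersection over union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def aggregate(full_image_boxes, tile_predictions, iou_threshold=0.5):
    """Merge whole-image and per-tile boxes, keeping the best of each overlap.
    tile_predictions is a list of (offset, boxes) pairs."""
    merged = list(full_image_boxes)
    for offset, boxes in tile_predictions:
        merged.extend(to_image_coords(box, offset) for box in boxes)
    kept = []
    for box in sorted(merged, key=lambda b: b[4], reverse=True):
        if all(iou(box, other) < iou_threshold for other in kept):
            kept.append(box)
    return kept
```

Running the detector on tiles keeps small targets at a larger effective resolution, while the whole-image pass retains context for objects that straddle tile borders.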

[CV-82] Enhancing seeding efficiency using a computer vision system to monitor furrow quality in real-time

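【Quick read】: This paper addresses limits on seeding efficiency in precision agriculture, where residue accumulation, low soil temperatures, and hair pinning (crop residue pushed into the trench by the furrow opener) prevent optimal trench formation. To address this, the study proposes a computer vision-based method: a video acquisition system captures trench conditions, and a segmentation model analyzes key elements such as soil, straw, and machinery to objectively quantify row cleaner performance. The key is a quantitative evaluation method that can guide row cleaner selection and improve seeding efficiency.

Link: https://arxiv.org/abs/2504.19334
Authors: Sidharth Rai, Aryan Dalal, Riley Slichter, Ajay Sharda
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: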

Abstract:Effective seed sowing in precision agriculture is hindered by challenges such as residue accumulation, low soil temperatures, and hair pinning (crop residue pushed in the trench by furrow opener), which obstruct optimal trench formation. Row cleaners are employed to mitigate these issues, but there is a lack of quantitative methods to assess trench cleanliness. In this study, a novel computer vision-based method was developed to evaluate row cleaner performance. Multiple air seeders were equipped with a video acquisition system to capture trench conditions after row cleaner operation, enabling an effective comparison of the performance of each row cleaner. The captured data were used to develop a segmentation model that analyzed key elements such as soil, straw, and machinery. Using the results from the segmentation model, an objective method was developed to quantify row cleaner performance. The results demonstrated the potential of this method to improve row cleaner selection and enhance seeding efficiency in precision agriculture.

[CV-83] Platonic Grounding for Efficient Multimodal Language Models

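【Quick read】: This paper addresses the diminishing performance gains of Transformer-based models as data and parameter counts are scaled up, especially when weighed against training costs. The key to its solution is a simple modification to multimodal frameworks that rely on aligning pretrained models, which maintains and in some cases improves on baseline performance while significantly reducing compute in both training and inference. The approach exploits the implicit cross-modal alignment found in the deeper layers of pretrained models and suggests a way to combine pretrained models efficiently.

Link: https://arxiv.org/abs/2504.19327
Authors: Moulik Choraria, Xinbo Wu, Akhil Bhimaraju, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: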

Abstract:The hyperscaling of data and parameter count in Transformer-based models is yielding diminishing performance improvement, especially when weighed against training costs. Such plateauing indicates the importance of methods for more efficient finetuning and inference, while retaining similar performance. This is especially relevant for multimodal learning paradigms, where inference costs of processing multimodal tokens can determine the model’s practical viability. At the same time, research on representations and mechanistic interpretability has improved our understanding of the inner workings of Transformer-based models; one such line of work reveals an implicit alignment in the deeper layers of pretrained models, across modalities. Taking inspiration from this, we motivate and propose a simple modification to existing multimodal frameworks that rely on aligning pretrained models. We demonstrate that our approach maintains and, in some cases, even improves performance of baseline methods while achieving significant gains in both training and inference-time compute. Our work also has implications for combining pretrained models into larger systems efficiently.

[CV-84] Myocardial Region-guided Feature Aggregation Net for Automatic Coronary artery Segmentation and Stenosis Assessment using Coronary Computed Tomography Angiography

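【Quick read】: This paper addresses coronary artery segmentation and stenosis detection for coronary artery disease (CAD) in Coronary Computed Tomography Angiography (CCTA), in particular the challenges of low contrast, morphological variability, and small vessel segmentation. The key to its solution is the Myocardial Region-guided Feature Aggregation Net (MGFA-Net), a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to make coronary artery segmentation more robust. It has three core components: a myocardial region-guided module, a residual feature extraction encoding module, and a multi-scale feature fusion module. Monte Carlo dropout is used to quantify prediction uncertainty, supporting more accurate segmentation and stenosis detection.

Link: https://arxiv.org/abs/2504.19300
Authors: Ni Yao, Xiangyu Liu, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chengyang Li, Fubao Zhu, Weihua Zhou, Chen Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 12 figures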

Abstract:Coronary artery disease (CAD) remains a leading cause of mortality worldwide, requiring accurate segmentation and stenosis detection using Coronary Computed Tomography Angiography (CCTA). Existing methods struggle with challenges such as low contrast, morphological variability and small vessel segmentation. To address these limitations, we propose the Myocardial Region-guided Feature Aggregation Net (MGFA-Net), a novel U-shaped dual-encoder architecture that integrates anatomical prior knowledge to enhance robustness in coronary artery segmentation. Our framework incorporates three key innovations: (1) a Myocardial Region-guided Module that directs attention to coronary regions via myocardial contour expansion and multi-scale feature fusion, (2) a Residual Feature Extraction Encoding Module that combines parallel spatial channel attention with residual blocks to enhance local-global feature discrimination, and (3) a Multi-scale Feature Fusion Module for adaptive aggregation of hierarchical vascular features. Additionally, Monte Carlo dropout quantifies prediction uncertainty, supporting clinical interpretability. For stenosis detection, a morphology-based centerline extraction algorithm separates the vascular tree into anatomical branches, enabling cross-sectional area quantification and stenosis grading. The superiority of MGFA-Net was demonstrated by achieving a Dice score of 85.04%, an accuracy of 84.24%, an HD95 of 6.1294 mm, and an improvement of 5.46% in true positive rate for stenosis detection compared to 3D U-Net. The integrated segmentation-to-stenosis pipeline provides automated, clinically interpretable CAD assessment, bridging deep learning with anatomical prior knowledge for precision medicine. Our code is publicly available at this http URL
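
Monte Carlo dropout, which the abstract uses for uncertainty quantification, keeps dropout active at inference and treats the spread of repeated stochastic predictions as an uncertainty estimate. The sketch below applies the idea to a single logistic unit standing in for one voxel's foreground probability; the toy model, the drop probability, and the number of passes are assumptions for illustration and say nothing about MGFA-Net's actual configuration.

```python
import math
import random


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a logistic unit with dropout left switched on."""
    keep = 1.0 - drop_prob
    activation = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            activation += w * x / keep  # inverted dropout keeps the scale
    return 1.0 / (1.0 + math.exp(-activation))


def mc_dropout(features, weights, drop_prob=0.5, passes=100, seed=0):
    """Mean prediction and variance over repeated stochastic passes."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    samples = [stochastic_forward(features, weights, drop_prob, rng)
               for _ in range(passes)]
    mean = sum(samples) / passes
    variance = sum((s - mean) ** 2 for s in samples) / passes
    return mean, variance
```

In a segmentation setting the same procedure would be run per voxel, and a high variance would flag regions, such as thin distal vessels, where the prediction deserves a closer look.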
zh
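
The Dice score reported above is the standard overlap metric for segmentation masks. As a minimal sketch (not the authors' evaluation code; the function name `dice_score` and the convention for two empty masks are assumptions made for illustration), it can be computed for binary masks as follows:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient between two binary masks: 2|A and B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return float(2.0 * intersection / denom)

# Toy 2x3 masks: 3 predicted voxels, 3 true voxels, 2 in common.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(dice_score(pred, target))
```

For 3D CCTA volumes the same function applies unchanged, since it only counts overlapping voxels.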

[CV-85] FusionNet: Multi-model Linear Fusion Framework for Low-light Image Enhancement

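[Quick read]: This paper aims to address the parameter explosion, optimization instability, and feature misalignment that hinder existing fusion strategies in low-light image enhancement (LLIE), and thereby to improve performance across different degradation scenarios. The key to its solution is FusionNet, a multi-model linear fusion framework backed by Hilbert space theoretical guarantees, which operates in parallel to capture global and local features across multiple color spaces, mitigating network collapse and reducing training cost.

Link: https://arxiv.org/abs/2504.19295
Authors: Kangbiao Shi, Yixu Feng, Tao Hu, Yu Cao, Peng Wu, Yijin Liang, Yanning Zhang, Qingsen Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: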

Abstract:The advent of Deep Neural Networks (DNNs) has driven remarkable progress in low-light image enhancement (LLIE), with diverse architectures (e.g., CNNs and Transformers) and color spaces (e.g., sRGB, HSV, HVI) yielding impressive results. Recent efforts have sought to leverage the complementary strengths of these paradigms, offering promising solutions to enhance performance across varying degradation scenarios. However, existing fusion strategies are hindered by challenges such as parameter explosion, optimization instability, and feature misalignment, limiting further improvements. To overcome these issues, we introduce FusionNet, a novel multi-model linear fusion framework that operates in parallel to effectively capture global and local features across diverse color spaces. By incorporating a linear fusion strategy underpinned by Hilbert space theoretical guarantees, FusionNet mitigates network collapse and reduces excessive training costs. Our method achieved 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both quantitative and qualitative results, delivering robust enhancement under diverse low-light conditions.

[CV-86] Marine Snow Removal Using Internally Generated Pseudo Ground Truth

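[Quick read]: This paper addresses the degraded quality of underwater video caused by light absorption, scattering, and various noise sources, in particular marine snow (suspended organic particles that appear as bright spots or noise), which severely affects feature matching in machine vision tasks. The key to the solution is a novel enhancement framework that generates paired datasets from raw underwater videos, overcoming the lack of paired training data that limits existing methods and enabling supervised video enhancement.

Link: https://arxiv.org/abs/2504.19289
Authors: Alexandra Malyugina, Guoxi Huang, Eduardo Ruiz, Benjamin Leslie, Nantheera Anantrasirichai
Affiliations: University of Bristol; Beam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: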

Abstract:Underwater videos often suffer from degraded quality due to light absorption, scattering, and various noise sources. Among these, marine snow (suspended organic particles that appear as bright spots or noise) significantly impacts machine vision tasks, particularly those involving feature matching. Existing methods for removing marine snow are ineffective due to the lack of paired training data. To address this challenge, this paper proposes a novel enhancement framework that introduces a new approach for generating paired datasets from raw underwater videos. The resulting dataset consists of paired images of generated snowy and snow-free underwater videos, enabling supervised training for video enhancement. We describe the dataset creation process, highlight its key characteristics, and demonstrate its effectiveness in enhancing underwater image restoration in the absence of ground truth.

[CV-87] Optimal Hyperspectral Undersampling Strategy for Satellite Imaging

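[Quick read]: This paper addresses the high dimensionality, spectral redundancy, and limited labeled data faced in hyperspectral image (HSI) classification. The key to its solution is a new band selection strategy called Iterative Wavelet-based Gradient Sampling (IWGS), which incrementally selects the most informative spectral bands by analyzing gradients in the wavelet-transformed domain, achieving efficient and targeted dimensionality reduction. Unlike traditional methods, IWGS exploits the multi-resolution properties of wavelets to better capture subtle spectral variations relevant to classification, and its iterative mechanism systematically excludes redundant or noisy bands while maximizing the retention of discriminative features.

Link: https://arxiv.org/abs/2504.19279
Authors: Vita V. Vlasova, Vladimir G. Kuzmin, Maria S. Varetsa, Natalia A. Ibragimova, Oleg Y. Rogov, Elena V. Lyapuntsova
Affiliations: Bauman Moscow State Technical University; VeinCV LLC; Moscow Technical University of Communications and Informatics; 2GIS; Artificial Intelligence Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages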

Abstract:Hyperspectral image (HSI) classification presents significant challenges due to the high dimensionality, spectral redundancy, and limited labeled data typically available in real-world applications. To address these issues and optimize classification performance, we propose a novel band selection strategy known as Iterative Wavelet-based Gradient Sampling (IWGS). This method incrementally selects the most informative spectral bands by analyzing gradients within the wavelet-transformed domain, enabling efficient and targeted dimensionality reduction. Unlike traditional selection methods, IWGS leverages the multi-resolution properties of wavelets to better capture subtle spectral variations relevant for classification. The iterative nature of the approach ensures that redundant or noisy bands are systematically excluded while maximizing the retention of discriminative features. We conduct comprehensive experiments on two widely-used benchmark HSI datasets: Houston 2013 and Indian Pines. Results demonstrate that IWGS consistently outperforms state-of-the-art band selection and classification techniques in terms of both accuracy and computational efficiency. These improvements make our method especially suitable for deployment in edge devices or other resource-constrained environments, where memory and processing power are limited. In particular, IWGS achieved an overall accuracy up to 97.8% on Indian Pines for selected classes, confirming its effectiveness and generalizability across different HSI scenarios.

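[CV-88] Leveraging Multi-Modal Saliency and Fusion for Gaze Target Detection NEURIPS2023

[Quick read]: This paper addresses gaze target detection (GTD), the task of predicting where a person in an image is looking. The task is challenging because it requires understanding the relationship between the person's head, body, and eyes and the surrounding environment. The key to the proposed method is to project the 2D image into a 3D representation using monocular depth estimation, extract a depth-infused saliency module map, and fuse it with face and depth modalities in a multi-modal fusion step to identify the gaze target more accurately.

Link: https://arxiv.org/abs/2504.19271
Authors: Athul M. Mathew, Arshad Ali Khan, Thariq Khalid, Faroq AL-Tam, Riad Souissi
Affiliations: Elm Company
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at NeurIPS 2023 Gaze Meets ML Workshop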

Abstract:Gaze target detection (GTD) is the task of predicting where a person in an image is looking. This is a challenging task, as it requires the ability to understand the relationship between the person’s head, body, and eyes, as well as the surrounding environment. In this paper, we propose a novel method for GTD that fuses multiple pieces of information extracted from an image. First, we project the 2D image into a 3D representation using monocular depth estimation. We then extract a depth-infused saliency module map, which highlights the most salient (attention-grabbing) regions in the image for the subject in consideration. We also extract face and depth modalities from the image, and finally fuse all the extracted modalities to identify the gaze target. We quantitatively evaluated our method, including an ablation analysis, on three publicly available datasets, namely VideoAttentionTarget, GazeFollow and GOO-Real, and showed that it outperforms other state-of-the-art methods. This suggests that our method is a promising new approach for GTD.
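
The step of projecting a 2D image into 3D from a depth map can be illustrated with a standard pinhole back-projection. This is only a sketch under stated assumptions: the intrinsics `fx`, `fy`, `cx`, `cy` and the function name `backproject_depth` are hypothetical, the abstract does not say which camera model the authors use, and monocular depth estimates are often only defined up to scale.

```python
import numpy as np

def backproject_depth(depth: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Lift an (H, W) depth map to an (H, W, 3) array of camera-frame points.

    Assumes a pinhole camera: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Toy example: a flat surface 2 units from the camera.
points = backproject_depth(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(points.shape)
```

The resulting point map keeps the image layout, so each 3D point can still be associated with per-pixel saliency or face features.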

[CV-89] VI3NR: Variance Informed Initialization for Implicit Neural Representations CVPR2025

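[Quick read]: This paper addresses the problem that common network initializations for Implicit Neural Representations (INRs) do not apply to many activation functions, which harms convergence and accuracy. The key to the solution is an initialization with stable variance across layers that applies to any activation function and shows better stability and performance across multiple signal modalities.

Link: https://arxiv.org/abs/2504.19270
Authors: Chamin Hewa Koneputugodage, Yizhak Ben-Shabat, Sameera Ramasinghe, Stephen Gould
Affiliations: The Australian National University; Roblox; Pluralis AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025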

Abstract:Implicit Neural Representations (INRs) are a versatile and powerful tool for encoding various forms of data, including images, videos, sound, and 3D shapes. A critical factor in the success of INRs is the initialization of the network, which can significantly impact the convergence and accuracy of the learned model. Unfortunately, commonly used neural network initializations are not widely applicable for many activation functions, especially those used by INRs. In this paper, we improve upon previous initialization methods by deriving an initialization that has stable variance across layers, and applies to any activation function. We show that this generalizes many previous initialization methods, and has even better stability for well studied activations. We also show that our initialization leads to improved results with INR activation functions in multiple signal modalities. Our approach is particularly effective for Gaussian INRs, where we demonstrate that the theory of our initialization matches with task performance in multiple experiments, allowing us to achieve improvements in image, audio, and 3D surface reconstruction.
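
To make the idea of "stable variance across layers for any activation" concrete, here is a rough sketch, not the paper's derivation. It assumes zero-mean i.i.d. Gaussian weights and Gaussian pre-activations with variance `target_var`; under those assumptions the next layer's pre-activation variance is `fan_in * std**2 * E[f(z)**2]`, so the weight standard deviation can be chosen to keep it at `target_var`. The function name and the Monte Carlo estimate are illustrative choices.

```python
import numpy as np

def variance_preserving_std(activation, fan_in: int, target_var: float = 1.0,
                            n_samples: int = 200_000, seed: int = 0) -> float:
    """Weight std that keeps pre-activation variance at target_var.

    Estimates E[f(z)^2] for z ~ N(0, target_var) by Monte Carlo, then solves
    fan_in * std^2 * E[f(z)^2] = target_var for std.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, np.sqrt(target_var), size=n_samples)
    second_moment = np.mean(activation(z) ** 2)
    return float(np.sqrt(target_var / (fan_in * second_moment)))

relu = lambda z: np.maximum(z, 0.0)
gaussian = lambda z: np.exp(-z ** 2)  # a Gaussian-style INR activation

print(variance_preserving_std(relu, fan_in=256))      # close to sqrt(2 / 256)
print(variance_preserving_std(gaussian, fan_in=256))
```

For ReLU with unit target variance, `E[f(z)^2]` is 1/2, so this recovers the familiar He-style scale `sqrt(2 / fan_in)`, which is consistent with the abstract's claim that a variance-based rule generalizes earlier initializations.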

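[CV-90] OpenFusion++: An Open-vocabulary Real-time Scene Understanding System

[Quick read]: This paper addresses challenges in real-time open-vocabulary scene understanding, including imprecise instance segmentation, static semantic updates, and limited handling of complex queries. The key to the solution is OpenFusion++, a real-time 3D semantic-geometric reconstruction system based on TSDF (Truncated Signed Distance Field). It refines 3D point clouds by fusing confidence maps from foundation models, dynamically updates global semantic labels via an adaptive cache based on instance area, and uses a dual-path encoding framework that combines object attributes with environmental context for precise query responses.

Link: https://arxiv.org/abs/2504.19266
Authors: Xiaofeng Jin, Matteo Frosi, Matteo Matteucci
Affiliations: Politecnico di Milano
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures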

Abstract:Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
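
For readers unfamiliar with TSDF-based reconstruction, the core operation is a per-voxel weighted running average of truncated signed distances. The sketch below shows that generic update, not OpenFusion++'s implementation; the function name, the weight cap `max_weight`, and the rule for skipping voxels far behind the surface are assumptions for illustration.

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc, obs_weight=1.0, max_weight=64.0):
    """Fuse one observation into a TSDF volume (arrays of equal shape).

    tsdf:    current truncated signed distances, normalized to [-1, 1]
    weight:  current accumulated weights
    sdf_obs: new signed distances in metric units (positive in front of surface)
    trunc:   truncation distance in the same units as sdf_obs
    """
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)
    # Voxels far behind the observed surface are occluded and left untouched.
    valid = sdf_obs > -trunc
    fused = (tsdf * weight + d * obs_weight) / (weight + obs_weight)
    new_tsdf = np.where(valid, fused, tsdf)
    new_weight = np.where(valid, np.minimum(weight + obs_weight, max_weight), weight)
    return new_tsdf, new_weight

tsdf = np.array([0.0, 0.5, 0.3])
weight = np.array([0.0, 1.0, 2.0])
sdf_obs = np.array([0.02, 0.0, -0.1])
print(tsdf_update(tsdf, weight, sdf_obs, trunc=0.04))
```

Capping the weight keeps the volume responsive to later observations instead of letting early frames dominate.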

[CV-91] Rendering Anywhere You See: Renderability Field-guided Gaussian Splatting

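[Quick read]: This paper addresses unstable rendering quality in scene view synthesis caused by non-uniform observations. The key to its solution is renderability field-guided Gaussian splatting (RF-GS), which quantifies input inhomogeneity through a renderability field and uses it to guide pseudo-view sampling for better visual consistency. It also trains an image restoration model to improve the quality of wide-baseline pseudo-views, and applies a validated hybrid data optimization strategy that fuses pseudo-view angle information with source-view texture information.

Link: https://arxiv.org/abs/2504.19261
Authors: Xiaofeng Jin, Yan Fang, Matteo Frosi, Jianfei Ge, Jiangjian Xiao, Matteo Matteucci
Affiliations: Politecnico di Milano, Milan 20133, Italy; Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures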

Abstract:Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360° views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided Gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhance visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses information from pseudo-view angles and source view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.

[CV-92] OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion

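[Quick read]: This paper addresses LiDAR place recognition in large-scale outdoor environments, where existing methods mainly rely on pre-built dense 3D maps or aerial imagery, leading to high storage overhead and poor real-time adaptability. The proposed OPAL network instead uses OpenStreetMap (OSM) as a lightweight and up-to-date prior. Its key innovation is to bridge the domain gap between sparse LiDAR scans and structured OSM data with two carefully designed components: a cross-modal visibility mask that identifies the maximal observable regions in both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multi-scale radial features into discriminative global descriptors.

Link: https://arxiv.org/abs/2504.19258
Authors: Shuhao Kang, Martin Y. Liao, Yan Xia, Olaf Wysocki, Boris Jutzi, Daniel Cremers
Affiliations: Technical University of Munich; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Technical report. 15 pages, 9 figures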

Abstract:LiDAR place recognition is a critical capability for autonomous navigation and cross-modal localization in large-scale outdoor environments. Existing approaches predominantly depend on pre-built 3D dense maps or aerial imagery, which impose significant storage overhead and lack real-time adaptability. In this paper, we propose OPAL, a novel network for LiDAR place recognition that leverages OpenStreetMap as a lightweight and up-to-date prior. Our key innovation lies in bridging the domain disparity between sparse LiDAR scans and structured OSM data through two carefully designed components: a cross-modal visibility mask that identifies maximal observable regions from both modalities to guide feature learning, and an adaptive radial fusion module that dynamically consolidates multiscale radial features into discriminative global descriptors. Extensive experiments on the augmented KITTI and KITTI-360 datasets demonstrate OPAL’s superiority, achieving 15.98% higher recall at @1m threshold for top-1 retrieved matches while operating at 12x faster inference speeds compared to state-of-the-art approaches. Code and datasets are publicly available at: this https URL .

[CV-93] LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition

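[Quick read]: This paper addresses the difficulty robots face in accurately recognizing 3D objects in human-centered environments such as restaurants, homes, and warehouses, which stems from the complexity and variability of these environments, including diverse object shapes. The proposed solution is a Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT). Its key element is the Globally Entropy-based Embeddings Fusion (GEEF) method, which integrates multi-view information efficiently, combined with pre- and mid-level convolutional encoders and local and global transformers that strengthen feature extraction and recognition accuracy.

Link: https://arxiv.org/abs/2504.19256
Authors: Songsong Xiong, Hamidreza Kasaei
Affiliations: University of Groningen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: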

Abstract:In human-centered environments such as restaurants, homes, and warehouses, robots often face challenges in accurately recognizing 3D objects. These challenges stem from the complexity and variability of these environments, including diverse object shapes. In this paper, we propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications. Our approach leverages the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate multi-views efficiently. The LM-MCVT architecture incorporates pre- and mid-level convolutional encoders and local and global transformers to enhance feature extraction and recognition accuracy. We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using a four-view setup, surpassing existing state-of-the-art methods. To further validate its effectiveness, we conduct 5-fold cross-validation on the real-world OmniObject3D dataset using the same configuration. Results consistently show superior performance, demonstrating the method’s robustness in 3D object recognition across synthetic and real-world 3D data.

[CV-94] Quantitative evaluation of brain-inspired vision sensors in high-speed robotic perception

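[Quick read]: This paper addresses the loss of spatial feature integrity and task performance that motion blur causes in robotic perception systems relying on traditional cameras under high-speed, dynamic conditions. The key to the solution is the use of brain-inspired vision sensors (BVS), specifically event-based vision sensors (EVS) and the primitive-based sensor Tianmouc, whose high temporal resolution, low bandwidth, and low power consumption improve perception in dynamic scenes. The study establishes a unified testing protocol, evaluates how sensor non-idealities affect the capture of structural information, and benchmarks corner detection and motion estimation at different speeds, revealing the suitability of BVS technologies for different application scenarios.

Link: https://arxiv.org/abs/2504.19253
Authors: Taoyi Wang, Lijian Wang, Yihan Lin, Mingtao Ou, Yuguo Chen, Xinglong Ji, Rong Zhao
Affiliations: Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, 1 table, conference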

Abstract:Perception systems in robotics encounter significant challenges in high-speed and dynamic conditions when relying on traditional cameras, where motion blur can compromise spatial feature integrity and task performance. Brain-inspired vision sensors (BVS) have recently gained attention as an alternative, offering high temporal resolution with reduced bandwidth and power requirements. Here, we present the first quantitative evaluation framework for two representative classes of BVSs in variable-speed robotic sensing, including event-based vision sensors (EVS) that detect asynchronous temporal contrasts, and the primitive-based sensor Tianmouc that employs a complementary mechanism to encode both spatiotemporal changes and intensity. A unified testing protocol is established, including cross-sensor calibrations, standardized testing platforms, and quality metrics to address differences in data modality. From an imaging standpoint, we evaluate the effects of sensor non-idealities, such as motion-induced distortion, on the capture of structural information. For functional benchmarking, we examine task performance in corner detection and motion estimation under different rotational speeds. Results indicate that EVS performs well in high-speed, sparse scenarios and in modestly fast, complex scenes, but exhibits performance limitations in high-speed, cluttered settings due to pixel-level bandwidth variations and event rate saturation. In comparison, Tianmouc demonstrates consistent performance across sparse and complex scenarios at various speeds, supported by its global, precise, high-speed spatiotemporal gradient samplings. These findings offer valuable insights into the application-dependent suitability of BVS technologies and support further advancement in this area.

[CV-95] ODExAI: A Comprehensive Object Detection Explainable AI Evaluation

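[Quick read]: This paper addresses the lack of a unified standard for evaluating Explainable Artificial Intelligence (XAI) techniques on object detection models, which hinders comparison between methods and the choice of a suitable one. The key to its solution is the Object Detection Explainable AI Evaluation (ODExAI) framework, which systematically assesses XAI methods along three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity.

Link: https://arxiv.org/abs/2504.19249
Authors: Loc Phuc Truong Nguyen, Hung Truong Thanh Nguyen, Hung Cao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: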

Abstract:Explainable Artificial Intelligence (XAI) techniques for interpreting object detection models remain in an early stage, with no established standards for systematic evaluation. This absence of consensus hinders both the comparative analysis of methods and the informed selection of suitable approaches. To address this gap, we introduce the Object Detection Explainable AI Evaluation (ODExAI), a comprehensive framework designed to assess XAI methods in object detection based on three core dimensions: localization accuracy, faithfulness to model behavior, and computational complexity. We benchmark a set of XAI methods across two widely used object detectors (YOLOX and Faster R-CNN) and standard datasets (MS-COCO and PASCAL VOC). Empirical results demonstrate that region-based methods (e.g., D-CLOSE) achieve strong localization (PG = 88.49%) and high model faithfulness (OA = 0.863), though with substantial computational overhead (Time = 71.42s). On the other hand, CAM-based methods (e.g., G-CAME) achieve superior localization (PG = 96.13%) and significantly lower runtime (Time = 0.54s), but at the expense of reduced faithfulness (OA = 0.549). These findings demonstrate critical trade-offs among existing XAI approaches and reinforce the need for task-specific evaluation when deploying them in object detection pipelines. Our implementation and evaluation benchmarks are publicly available at: this https URL.
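
The abstract does not expand the abbreviation "PG"; a common localization metric for saliency explanations is the pointing game, and the sketch below assumes that reading. It counts a hit when the most salient pixel falls inside the detected object's bounding box. Function names and the `(x_min, y_min, x_max, y_max)` box format are illustrative, not taken from the ODExAI code.

```python
import numpy as np

def pointing_game_hit(saliency: np.ndarray, box) -> bool:
    """True if the saliency maximum lies inside box = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    # argmax works on the flattened map; unravel back to (row, column).
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(x_min <= x <= x_max and y_min <= y <= y_max)

def pointing_game_accuracy(saliency_maps, boxes) -> float:
    """Fraction of explanations whose peak falls inside the matching box."""
    hits = [pointing_game_hit(s, b) for s, b in zip(saliency_maps, boxes)]
    return sum(hits) / len(hits)

saliency = np.zeros((4, 4))
saliency[1, 2] = 1.0  # peak at row 1, column 2
print(pointing_game_accuracy([saliency, saliency], [(2, 1, 3, 3), (0, 0, 1, 1)]))
```

A metric of this kind only checks the single peak location, which is one reason a framework like ODExAI pairs localization with separate faithfulness and runtime measurements.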

[CV-96] Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID

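[Quick read]: This paper addresses the cross-modality differences in feature representation and pseudo-label distribution caused by fine-grained patterns in unsupervised visible-infrared person re-identification (USL-VI-ReID), which limit modality-shared learning when only global features are optimized. The key to the solution is the Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework. It introduces a Dual Association with Global Learning (DAGI) module that unifies pseudo-labels bi-directionally and a Fine-Grained Semantic-Aligned Learning (FGSAL) module that explores part-level patterns, achieving complementary alignment between the label distributions of different modalities. A Global-Part Collaborative Refinement (GPCR) module then dynamically mines reliable positive sample sets to optimize inter-instance relationships, improving model performance.

Link: https://arxiv.org/abs/2504.19244
Authors: De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCV 2025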

Abstract:Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design a contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This oversight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds optimization objectives for the specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. An optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performance over state-of-the-art methods. Our code is available at this https URL.

[CV-97] Unsupervised 2D-3D lifting of non-rigid objects using local constraints

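[Quick read]: This paper addresses the prediction of the 3D shape of non-rigid objects from 2D keypoint observations, a problem that is ill-posed because of occlusions and the need to disentangle viewpoint changes from shape changes. Traditional methods embed low-rank constraints into specialized models, but these models are hard to train because they depend on finding a canonical alignment before they can learn detailed geometry, and the constraints limit reconstruction quality. The key to the proposed solution is to use generic, high-capacity models trained with an unsupervised loss, which yields more accurate predicted shapes. In particular, applying low-rank constraints to localized subsets of the full shape constrains the high-capacity model appropriately, reducing the reconstruction error on the S-Up3D dataset by over 70%.

Link: https://arxiv.org/abs/2504.19227
Authors: Shalini Maiti, Lourdes Agapito, Benjamin Graham
Affiliations: Meta AI; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: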

Abstract:For non-rigid objects, predicting the 3D shape from 2D keypoint observations is ill-posed due to occlusions, and the need to disentangle changes in viewpoint and changes in shape. This challenge has often been addressed by embedding low-rank constraints into specialized models. These models can be hard to train, as they depend on finding a canonical way of aligning observations, before they can learn detailed geometry. These constraints have limited the reconstruction quality. We show that generic, high capacity models, trained with an unsupervised loss, allow for more accurate predicted shapes. In particular, applying low-rank constraints to localized subsets of the full shape allows the high capacity to be suitably constrained. We reduce the state-of-the-art reconstruction error on the S-Up3D dataset by over 70%.

[CV-98] CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis

[Quick read]: This paper addresses the way differences in channel dimensionality and captured wavelengths among spectral cameras hold back AI-driven methods, producing camera-specific models with limited generalizability and poor cross-camera applicability. The key to the solution is a model named CARL, which uses wavelength positional encoding and a self-attention-cross-attention mechanism to convert spectral images of any channel dimensionality into camera-agnostic embeddings, and a novel JEPA-inspired spectral self-supervised strategy for spectral-spatial pre-training, improving robustness to spectral heterogeneity.

Link: https://arxiv.org/abs/2504.19223
Authors: Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin Özdemir, Lena Maier-Hein, Slobodan Ilic
Affiliations: Siemens AG; German Cancer Research Center (DKFZ) Heidelberg; Heidelberg University; National Center for Tumor Diseases (NCT), NCT Heidelberg; HIDSS4Health, Heidelberg; University Medical Center Mannheim; Department of Urology and Urosurgery, University Medical Center Mannheim; Department of General, Visceral, and Transplantation Surgery, Heidelberg University Hospital; Division of Intelligent Systems and Robotics in Urology, DKFZ Heidelberg; DKFZ Hector Cancer Institute, University Medical Center Mannheim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic embedding, we introduce wavelength positional encoding and a self-attention-cross-attention mechanism to compress spectral information into learned query representations. Spectral-spatial pre-training is achieved with a novel spectral self-supervised JEPA-inspired strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model’s unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models.
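
One plausible way to read "wavelength positional encoding" is a sinusoidal encoding indexed by each channel's physical wavelength rather than by its channel index, so that cameras with different band layouts share one coordinate system. The sketch below follows that reading; it is an assumption, not CARL's published formulation, and the function name, `dim`, and `max_period` are hypothetical.

```python
import numpy as np

def wavelength_positional_encoding(wavelengths_nm, dim: int = 8,
                                   max_period: float = 10_000.0) -> np.ndarray:
    """Sinusoidal encoding of channel wavelengths; returns (num_channels, dim)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    wavelengths = np.asarray(wavelengths_nm, dtype=np.float64)[:, None]
    # Geometric frequency ladder, as in Transformer positional encodings.
    freqs = max_period ** (-np.arange(0, dim, 2) / dim)
    angles = wavelengths * freqs[None, :]
    enc = np.empty((wavelengths.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# A 3-band and a 5-band camera both map 550 nm to the same vector.
rgb_like = wavelength_positional_encoding([450.0, 550.0, 650.0])
five_band = wavelength_positional_encoding([450.0, 500.0, 550.0, 600.0, 650.0])
print(np.allclose(rgb_like[1], five_band[2]))
```

Because the encoding depends only on wavelength, a downstream attention layer can treat any number of channels as a variable-length token sequence.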

[CV-99] CapsFake: A Multimodal Capsule Network for Detecting Instruction-Guided Deepfakes

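[Quick read]: This paper addresses the threat that deepfake technology poses to digital image integrity through instruction-guided image editing, in particular subtle, context-aware manipulations that are hard for both humans and existing detection systems to notice. The key to the solution is a novel multimodal capsule network, CapsFake, which integrates low-level capsules from visual, textual, and frequency-domain modalities and predicts high-level capsules through a competitive routing mechanism, dynamically aggregating local features to identify manipulated regions precisely.

Link: https://arxiv.org/abs/2504.19212
Authors: Tuan Nguyen, Naseem Khan, Issa Khalil
Affiliations: Qatar Computing Research Institute, Hamad Bin Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages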

Abstract:The rapid evolution of deepfake technology, particularly in instruction-guided image editing, threatens the integrity of digital images by enabling subtle, context-aware manipulations. Generated conditionally from real images and textual prompts, these edits are often imperceptible to both humans and existing detection systems, revealing significant limitations in current defenses. We propose a novel multimodal capsule network, CapsFake, designed to detect such deepfake image edits by integrating low-level capsules from visual, textual, and frequency-domain modalities. High-level capsules, predicted through a competitive routing mechanism, dynamically aggregate local features to identify manipulated regions with precision. Evaluated on diverse datasets, including MagicBrush, Unsplash Edits, Open Images Edits, and Multi-turn Edits, CapsFake outperforms state-of-the-art methods by up to 20% in detection accuracy. Ablation studies validate its robustness, achieving detection rates above 94% under natural perturbations and 96% against adversarial attacks, with excellent generalization to unseen editing scenarios. This approach establishes a powerful framework for countering sophisticated image manipulations.

[CV-100] FlexPara: Flexible Neural Surface Parameterization

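[Quick read]: This paper addresses the limitations of conventional surface parameterization methods on complex topologies, including their reliance on high-quality mesh triangulation and on manually specified cutting seams. The key to its solution is FlexPara, an unsupervised neural optimization framework that achieves both global and multi-chart surface parameterization by establishing point-wise mappings between 3D surface points and adaptively deformed 2D UV coordinates. The framework combines a series of geometrically interpretable sub-networks into a bi-directional cycle mapping that needs no manually specified cutting seams, and builds a multi-chart parameterization framework with adaptively learned chart assignment.

Link: https://arxiv.org/abs/2504.19210
Authors: Yuming Zhao, Qijian Zhang, Junhui Hou, Jiazhi Xia, Wenping Wang, Ying He
Affiliations: City University of Hong Kong; Tencent Games; Central South University; Texas A&M University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: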

Abstract:Surface parameterization is a fundamental geometry processing task, laying the foundations for the visual presentation of 3D assets and numerous downstream shape analysis scenarios. Conventional parameterization approaches demand high-quality mesh triangulation and are restricted to certain simple topologies unless additional surface cutting and decomposition are provided. In practice, the optimal configurations (e.g., type of parameterization domains, distribution of cutting seams, number of mapping charts) may vary drastically with different surface structures and task characteristics, thus requiring more flexible and controllable processing pipelines. To this end, this paper introduces FlexPara, an unsupervised neural optimization framework to achieve both global and multi-chart surface parameterizations by establishing point-wise mappings between 3D surface points and adaptively-deformed 2D UV coordinates. We ingeniously design and combine a series of geometrically-interpretable sub-networks, with specific functionalities of cutting, deforming, unwrapping, and wrapping, to construct a bi-directional cycle mapping framework for global parameterization without the need for manually specified cutting seams. Furthermore, we construct a multi-chart parameterization framework with adaptively-learned chart assignment. Extensive experiments demonstrate the universality, superiority, and inspiring potential of our neural surface parameterization paradigm. The code will be publicly available at this https URL

[CV-101] Adaptive Dual-domain Learning for Underwater Image Enhancement AAAI2025

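[Quick read]: This paper addresses two key problems in learning-based Underwater Image Enhancement (UIE): existing methods rarely consider the inconsistent degradation levels across spatial regions and spectral bands at the same time, and they treat all regions equally, ignoring that regions with high-frequency details are harder to reconstruct. The key to the solution is a new UIE method based on spatial-spectral dual-domain adaptive learning (SS-UIE). Its core is a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, combined in parallel into a Spatial-Spectral block (SS-block) that models the degradation levels of different spatial regions and spectral bands and enables degradation-level-based dual-domain adaptive UIE. In addition, a Frequency-Wise Loss (FWL) narrows the frequency-wise discrepancy and strengthens the model's attention to regions with high-frequency details.

Link: https://arxiv.org/abs/2504.19198
Authors: Lingtao Peng, Liheng Bian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by AAAI 2025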

Abstract:Recently, learning-based Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that the regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model’s attention on the regions with high-frequency details. Extensive experiments validate that the SS-UIE technique outperforms state-of-the-art UIE methods while requiring cheaper computational and memory costs.

[CV-102] Sketch2Anim: Towards Transferring Sketch Storyboards into 3D Animation

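[Quick read]: This paper tackles the direct conversion of 2D storyboard sketches into high-quality 3D animation, a task that remains under-explored. Traditionally, animators craft 3D animations by hand through trial and error, which is time-consuming and demands a high level of expertise. The proposed system, Sketch2Anim, is built on two key modules: sketch constraint understanding and motion generation. A 3D conditional motion generator combines 3D keyposes, joint trajectories, and action words for precise motion control, and a neural mapper aligns user-provided 2D sketches with their corresponding 3D keyposes and trajectories in a shared embedding space, enabling direct 2D control of motion generation for the first time.

Link: https://arxiv.org/abs/2504.19189
Authors: Lei Zhong, Chuan Guo, Yiming Xie, Jiawei Wang, Changjian Li
Affiliations: University of Edinburgh; Snap Inc; Northeastern University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL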

Abstract:Storyboarding is widely used for creating 3D animations. Animators use the 2D sketches in storyboards as references to craft the desired 3D animations through a trial-and-error process. The traditional approach requires exceptional expertise and is both labor-intensive and time-consuming. Consequently, there is a high demand for automated methods that can directly translate 2D storyboard sketches into 3D animations. This task is under-explored to date and inspired by the significant advancements of motion diffusion models, we propose to address it from the perspective of conditional motion synthesis. We thus present Sketch2Anim, composed of two key modules for sketch constraint understanding and motion generation. Specifically, due to the large domain gap between the 2D sketch and 3D motion, instead of directly conditioning on 2D inputs, we design a 3D conditional motion generator that simultaneously leverages 3D keyposes, joint trajectories, and action words, to achieve precise and fine-grained motion control. Then, we invent a neural mapper dedicated to aligning user-provided 2D sketches with their corresponding 3D keyposes and trajectories in a shared embedding space, enabling, for the first time, direct 2D control of motion generation. Our approach successfully transfers storyboards into high-quality 3D motions and inherently supports direct 3D animation editing, thanks to the flexibility of our multi-conditional motion generator. Comprehensive experiments and evaluations, and a user perceptual study demonstrate the effectiveness of our approach.

[CV-103] LRFusionPR: A Polar BEV-Based LiDAR-Radar Fusion Network for Place Recognition

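[Quick read]: This paper addresses how LiDAR-radar fusion can improve the accuracy and robustness of place recognition in GPS-denied environments. Existing methods are limited by noisy, sparse radar data, and heterogeneous radar configurations make it hard to develop a unified cross-modality fusion framework. The key to the solution is LRFusionPR, a dual-branch network built on a unified polar bird's eye view (BEV) representation. The fusion branch uses cross-attention for cross-modality feature interaction, knowledge distillation transfers the fused information to a branch that takes radar as its only input, and a multimodal global descriptor is then used to improve place recognition.

Link: https://arxiv.org/abs/2504.19186
Authors: Zhangshuo Qi, Luqi Cheng, Zijie Zhou, Guangming Xiong
Affiliations: Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures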

Abstract:In autonomous driving, place recognition is critical for global localization in GPS-denied environments. LiDAR and radar-based place recognition methods have garnered increasing attention, as LiDAR provides precise ranging, whereas radar excels in adverse weather resilience. However, effectively leveraging LiDAR-radar fusion for place recognition remains challenging. The noisy and sparse nature of radar data limits its potential to further improve recognition accuracy. In addition, heterogeneous radar configurations complicate the development of unified cross-modality fusion frameworks. In this paper, we propose LRFusionPR, which improves recognition accuracy and robustness by fusing LiDAR with either single-chip or scanning radar. Technically, a dual-branch network is proposed to fuse different modalities within the unified polar coordinate bird’s eye view (BEV) representation. In the fusion branch, cross-attention is utilized to perform cross-modality feature interactions. The knowledge from the fusion branch is simultaneously transferred to the distillation branch, which takes radar as its only input to further improve the robustness. Ultimately, the descriptors from both branches are concatenated, producing the multimodal global descriptor for place retrieval. Extensive evaluations on multiple datasets demonstrate that our LRFusionPR achieves accurate place recognition, while maintaining robustness under varying weather conditions. Our open-source code will be released at this https URL.
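
The polar BEV representation mentioned in the abstract can be illustrated with a small sketch. This is not the LRFusionPR code: the function name `polar_bev`, the grid sizes, and the occupancy-count encoding are all assumptions chosen for illustration. It only shows how points in the sensor frame might be binned by range and azimuth.

```python
import numpy as np

def polar_bev(points, max_range=50.0, n_rings=4, n_sectors=8):
    """Count points per (range ring, azimuth sector) cell.

    points: array of shape (N, 2) or (N, 3); only x and y are used.
    Grid sizes and the count encoding are illustrative assumptions.
    """
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)                                # range from the sensor
    theta = np.mod(np.arctan2(y, x), 2 * np.pi)       # azimuth in [0, 2*pi)
    keep = r < max_range                              # drop far points
    ring = (r[keep] / max_range * n_rings).astype(int)
    sector = (theta[keep] / (2 * np.pi) * n_sectors).astype(int)
    sector = np.minimum(sector, n_sectors - 1)        # guard the upper edge
    grid = np.zeros((n_rings, n_sectors), dtype=int)
    np.add.at(grid, (ring, sector), 1)                # accumulate duplicates
    return grid

# Three toy points: two in range, one beyond max_range.
pts = np.array([[10.0, 1.0, 0.0], [-1.0, 20.0, 0.0], [100.0, 0.0, 0.0]])
bev = polar_bev(pts)
```

Because LiDAR and radar returns can both be expressed as points in the sensor frame, a shared polar grid of this kind gives the two modalities a common layout before any learned fusion is applied.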

[CV-104] Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving

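[Quick read]: This paper addresses the weak detection of out-of-distribution (OOD) objects by road scene segmentation methods. Existing methods rely mainly on image inpainting and OOD distribution detection and have two key flaws: they give too little consideration to the objectiveness attributes of anomalous regions, so anomalous objects that resemble known classes are segmented incompletely; and they pay too little attention to environmental constraints, so they detect anomalies irrelevant to autonomous driving. The proposed framework, SOTA (Segmenting Objectiveness and Task-Awareness), enhances objectiveness segmentation with a Semantic Fusion Block (SFB) and filters out anomalies irrelevant to road navigation with a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA), improving OOD detection performance.

Link: https://arxiv.org/abs/2504.19183
Authors: Mi Zheng, Guanglei Yang, Zitong Huang, Zhenhua Guo, Kevin Han, Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Tianyijiaotong Technology Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: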

Abstract:With the emergence of transformer-based architectures and large language models (LLMs), the accuracy of road scene perception has substantially advanced. Nonetheless, current road scene segmentation approaches are predominantly trained on closed-set data, resulting in insufficient detection capabilities for out-of-distribution (OOD) objects. To overcome this limitation, road anomaly detection methods have been proposed. However, existing methods primarily depend on image inpainting and OOD distribution detection techniques, facing two critical issues: (1) inadequate consideration of the objectiveness attributes of anomalous regions, causing incomplete segmentation when anomalous objects share similarities with known classes, and (2) insufficient attention to environmental constraints, leading to the detection of anomalies irrelevant to autonomous driving tasks. In this paper, we propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes. Specifically, SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks using a Scene-understanding Guided Prompt-Context Adaptor (SG-PCA). Extensive empirical evaluations on multiple benchmark datasets, including Fishyscapes Lost and Found, Segment-Me-If-You-Can, and RoadAnomaly, demonstrate that the proposed SOTA consistently improves OOD detection performance across diverse detectors, achieving robust and accurate segmentation outcomes.

[CV-105] CLR-Wire: Towards Continuous Latent Representations for 3D Curve Wireframe Generation SIGGRAPH2025

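[Quick read]: This paper addresses the difficulty of modeling geometry and topology jointly in 3D curve-based wireframe generation. Conventional methods decouple vertices, edges, and faces, which prevents effective joint learning of geometric shape and topological structure. The key to the solution is CLR-Wire, which uses an attention-driven variational autoencoder (VAE) to encode curves as Neural Parametric Curves, together with their topological connectivity, into a continuous, fixed-length latent space that represents geometry and topology in a unified way. A flow matching model then progressively maps Gaussian noise to these latents, which are decoded into complete 3D wireframes.

Link: https://arxiv.org/abs/2504.19174
Authors: Xueqi Ma, Yilin Liu, Tianlong Gao, Qirui Huang, Hui Huang
Affiliations: unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025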

Abstract:We introduce CLR-Wire, a novel framework for 3D curve-based wireframe generation that integrates geometry and topology into a unified Continuous Latent Representation. Unlike conventional methods that decouple vertices, edges, and faces, CLR-Wire encodes curves as Neural Parametric Curves along with their topological connectivity into a continuous and fixed-length latent space using an attention-driven variational autoencoder (VAE). This unified approach facilitates joint learning and generation of both geometry and topology. To generate wireframes, we employ a flow matching model to progressively map Gaussian noise to these latents, which are subsequently decoded into complete 3D wireframes. Our method provides fine-grained modeling of complex shapes and irregular topologies, and supports both unconditional generation and generation conditioned on point cloud or image inputs. Experimental results demonstrate that, compared with state-of-the-art generative approaches, our method achieves substantial improvements in accuracy, novelty, and diversity, offering an efficient and comprehensive solution for CAD design, geometric reconstruction, and 3D content creation.
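
The step of "progressively mapping Gaussian noise to latents" can be sketched as Euler integration of a velocity field. The sketch below is a toy under stated assumptions: `oracle_velocity` is a hypothetical stand-in for a trained velocity network (it uses the known target, which a real model would not have), and the 4-dimensional latent is made up. It is not the CLR-Wire sampler.

```python
import numpy as np

def oracle_velocity(x, t, target):
    """Velocity of the straight-line path that reaches `target` at t = 1.

    Hypothetical stand-in for a learned network v(x, t).
    """
    return (target - x) / (1.0 - t)

def euler_sample(x0, target, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                      # t stays below 1, so no division by zero
        x = x + dt * oracle_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)              # start from Gaussian noise
latent = np.array([0.5, -1.0, 2.0, 0.0])    # made-up target latent
sample = euler_sample(noise, latent)
```

With this oracle field the path is a straight line, so the Euler steps land on the target latent up to floating-point error. A trained model would replace the oracle with a network conditioned only on `x` and `t` (and optional inputs such as a point cloud or image).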

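[CV-106] IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos CVPR2025

[Quick read]: This paper addresses how to generate photorealistic talking head videos directly from a single identity image and explicit control signals (such as expressions) while ensuring the geometric consistency needed for immersive viewing. The key to the solution is a 3D-aware diffusion-based method that directly generates Multiplane Images (MPIs) in place of the separate-stage or jointly optimized 3D reconstruction used by existing methods. The final output is produced in a single denoising process, with no post-processing step needed to render novel views.

Link: https://arxiv.org/abs/2504.19165
Authors: Yuan Li, Ziqian Bai, Feitong Tan, Zhaopeng Cui, Sean Fanello, Yinda Zhang
Affiliations: Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025; project page: this https URL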

Abstract:We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions). Our method generates Multiplane Images (MPIs) that ensure geometric consistency, making them ideal for immersive viewing experiences like binocular videos for VR headsets. Unlike existing methods that often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach directly generates the final output through a single denoising process, eliminating the need for post-processing steps to render novel views efficiently. To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
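
For readers unfamiliar with Multiplane Images, the sketch below shows the standard back-to-front "over" compositing that turns a stack of RGBA planes into one image. It is a generic illustration, not the paper's renderer: the function name `composite_mpi` and the plane ordering (index 0 is farthest) are assumptions, and the homography warp needed for novel views is omitted.

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front 'over' compositing of an MPI.

    colors: (D, H, W, 3) RGB per plane; alphas: (D, H, W, 1) opacity per plane.
    Plane 0 is assumed to be the farthest from the camera.
    """
    out = np.zeros(colors.shape[1:])
    for rgb, a in zip(colors, alphas):
        out = rgb * a + out * (1.0 - a)   # nearer plane covers what is behind
    return out

# One pixel, two planes: opaque red far plane, half-transparent blue near plane.
colors = np.array([[[[1.0, 0.0, 0.0]]], [[[0.0, 0.0, 1.0]]]])
alphas = np.array([[[[1.0]]], [[[0.5]]]])
pixel = composite_mpi(colors, alphas)     # equal mix of red and blue
```

Because each plane sits at a fixed depth, rendering a second viewpoint (for example the other eye of a VR headset) only requires warping the planes before running the same compositing loop.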

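[CV-107] RadioFormer: A Multiple-Granularity Radio Map Estimation Transformer with 1‱ Spatial Sampling

[Quick read]: This paper addresses the limited effectiveness of conventional radio map estimation based on deep vision models when spatial sampling is extremely sparse. The key to the solution is RadioFormer, a multiple-granularity transformer. A dual-stream self-attention (DSA) module separately captures the correlation of pixel-wise signal power and patch-wise building geometry, and a cross-stream cross-attention (CCA) module integrates both into multi-scale radio map representations, yielding more accurate and efficient radio map estimation with few observation nodes.

Link: https://arxiv.org/abs/2504.19161
Authors: Zheng Fang, Kangjun Liu, Ke Chen, Qingyu Liu, Jianguo Zhang, Lingyang Song, Yaowei Wang
Affiliations: Pengcheng Laboratory; Southern University of Science and Technology; Peking University Shenzhen Graduate School; Harbin Institute of Technology, Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: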

Abstract:The task of radio map estimation aims to generate a dense representation of electromagnetic spectrum quantities, such as the received signal strength at each grid point within a geographic region, based on measurements from a subset of spatially distributed nodes (represented as pixels). Recently, deep vision models such as the U-Net have been adapted to radio map estimation, whose effectiveness can be guaranteed with sufficient spatial observations (typically 0.01% to 1% of pixels) in each map, to model local dependency of observed signal power. However, such a setting of sufficient measurements can be less practical in real-world scenarios, where extreme sparsity in spatial sampling can be widely encountered. To address this challenge, we propose RadioFormer, a novel multiple-granularity transformer designed to handle the constraints posed by spatial sparse observations. Our RadioFormer, through a dual-stream self-attention (DSA) module, can respectively discover the correlation of pixel-wise observed signal power and also learn patch-wise buildings’ geometries in a style of multiple granularities, which are integrated into multi-scale representations of radio maps by a cross stream cross-attention (CCA) module. Extensive experiments on the public RadioMapSeer dataset demonstrate that RadioFormer outperforms state-of-the-art methods in radio map estimation while maintaining the lowest computational cost. Furthermore, the proposed approach exhibits exceptional generalization capabilities and robust zero-shot performance, underscoring its potential to advance radio map estimation in a more practical setting with very limited observation nodes.

[CV-108] PAD: Phase-Amplitude Decoupling Fusion for Multi-Modal Land Cover Classification

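[Quick read]: This paper addresses modality heterogeneity and the underuse of spectral complementarity when fusing Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification. Existing methods struggle to separate shared structural features from modality-specific radiometric attributes, which leads to feature conflicts and information loss. The key to the solution is the Phase-Amplitude Decoupling (PAD) framework, which separates the phase (modality-shared) and amplitude (modality-specific) components in the Fourier domain. Its two core components, Phase Spectrum Correction and Amplitude Spectrum Fusion, align and fuse cross-modal features.

Link: https://arxiv.org/abs/2504.19136
Authors: Huiling Zheng, Xian Zhong, Bin Liu, Yi Xiao, Bihan Wen, Xiaofeng Li
Affiliations: Wuhan University of Technology; Chinese Academy of Sciences; Shanghai Ocean University; Wuhan University; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 13 pages, 8 figures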

Abstract:The fusion of Synthetic Aperture Radar (SAR) and RGB imagery for land cover classification remains challenging due to modality heterogeneity and the underutilization of spectral complementarity. Existing methods often fail to decouple shared structural features from modality-specific radiometric attributes, leading to feature conflicts and information loss. To address this issue, we propose Phase-Amplitude Decoupling (PAD), a frequency-aware framework that separates phase (modality-shared) and amplitude (modality-specific) components in the Fourier domain. Specifically, PAD consists of two key components: 1) Phase Spectrum Correction (PSC), which aligns cross-modal phase features through convolution-guided scaling to enhance geometric consistency, and 2) Amplitude Spectrum Fusion (ASF), which dynamically integrates high-frequency details and low-frequency structures using frequency-adaptive multilayer perceptrons. This approach leverages SAR’s sensitivity to morphological features and RGB’s spectral richness. Extensive experiments on WHU-OPT-SAR and DDHR-SK datasets demonstrate state-of-the-art performance. Our work establishes a new paradigm for physics-aware multi-modal fusion in remote sensing. The code will be available at this https URL.
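
The Fourier-domain split that PAD builds on can be shown with plain NumPy. The sketch below only demonstrates the decomposition itself and a naive amplitude/phase swap. It does not implement PSC or ASF, and the random arrays are stand-ins for real SAR and RGB patches.

```python
import numpy as np

def decouple(image):
    """Split a 2D image into its Fourier amplitude and phase spectra."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)

def recombine(amplitude, phase):
    """Rebuild an image from an amplitude spectrum and a phase spectrum."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(spectrum))

rng = np.random.default_rng(0)
sar = rng.random((8, 8))   # stand-in for a SAR patch
rgb = rng.random((8, 8))   # stand-in for one RGB channel

amp_sar, pha_sar = decouple(sar)
amp_rgb, pha_rgb = decouple(rgb)

# Round trip: amplitude and phase together carry the full signal.
restored = recombine(amp_sar, pha_sar)

# Naive cross-modal mix (not PSC/ASF): RGB amplitude with SAR phase.
mixed = recombine(amp_rgb, pha_sar)
```

Keeping the two spectra separate is what allows a method to treat them differently, for example aligning phase across modalities while fusing amplitude per frequency band.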

[CV-109] DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning ICMR2025

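[Quick read]: This paper addresses a weakness of existing low-light image enhancement (LLIE) methods: they learn only a direct mapping between the low-light and normal-light domains and ignore the semantic information of different regions, especially extremely dark regions with severe information loss. The key to the solution is DeepSPG, a deep semantic prior-guided framework based on Retinex image decomposition. It explores informative semantic knowledge through a pre-trained semantic segmentation model and multimodal learning, and combines image-level and text-level semantic priors into a multimodal learning framework with combinatorial deep semantic prior guidance.

Link: https://arxiv.org/abs/2504.19127
Authors: Jialang Lu, Huayu Zhao, Huiyu Zhai, Xingxing Yang, Shini Han
Affiliations: Hubei University; Beijing Shougang International Engineering Technology; University of Electronic Science and Technology of China; Hong Kong Baptist University; Harbin University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ICMR 2025 Main track. Code is available at this https URL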

Abstract:There has long been a belief that high-level semantics learning can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brutal mapping between low-light and normal-light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both image-level semantic prior and text-level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image-level semantic prior guidance by leveraging hierarchical semantic features from a pre-trained semantic segmentation model; a text-level semantic prior guidance by integrating natural language semantic constraints via a pre-trained vision-language model; a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Eventually, our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at this https URL.

[CV-110] Blind Source Separation Based on Sparsity

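[Quick read]: This paper addresses blind source separation (BSS): recovering the original source signals from observed mixtures when the mixing matrix is unknown. Classical independent component analysis (ICA) relies on the assumption that the sources are mutually independent, which often fails in practice. To overcome this limitation, the paper introduces sparsity-based methods, whose core idea is to decompose source signals sparsely in a predefined dictionary. The key techniques are Morphological Component Analysis (MCA), which builds on sparse representation theory, and block-sparse dictionary learning that improves on the K-SVD algorithm. The resulting enhanced algorithm, SAC+BK-SVD, clusters similar atoms into blocks and updates them together to improve blind image separation quality.

Link: https://arxiv.org/abs/2504.19124
Authors: Zhongxuan Li
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: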

Abstract:Blind source separation (BSS) is a key technique in array processing and data analysis, aiming to recover unknown sources from observed mixtures without knowledge of the mixing matrix. Classical independent component analysis (ICA) methods rely on the assumption that sources are mutually independent. To address limitations of ICA, sparsity-based methods have been introduced, which decompose source signals sparsely in a predefined dictionary. Morphological Component Analysis (MCA), based on sparse representation theory, assumes that a signal is a linear combination of components with distinct geometries, each sparsely representable in one dictionary and not in others. This approach has recently been applied to BSS with promising results. This report reviews key approaches derived from classical ICA and explores sparsity-based methods for BSS. It introduces the theory of sparse representation and decomposition, followed by a block coordinate relaxation MCA algorithm, whose variants are used in Multichannel MCA (MMCA) and Generalized MCA (GMCA). A local dictionary learning method using K-SVD is then presented. Finally, we propose an improved algorithm, SAC+BK-SVD, which enhances K-SVD by learning a block-sparsifying dictionary that clusters and updates similar atoms in blocks. The implementation includes experiments on image segmentation and blind image source separation using the discussed techniques. We also compare the proposed block-sparse dictionary learning algorithm with K-SVD. Simulation results demonstrate that our method yields improved blind image separation quality.

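[CV-111] Towards Latency-Aware 3D Streaming Perception for Autonomous Driving

[Quick read]: This paper addresses the substantial runtime latency that existing 3D perception algorithms face when deployed on edge devices. It proposes a new benchmark for online evaluation and, on top of it, the Latency-Aware 3D Streaming Perception (LASP) framework, which mitigates latency through two core components: latency-aware history integration, which extends query propagation into a continuous process so that historical features are integrated regardless of the delay; and latency-aware predictive detection, a module that compensates the detection results using the predicted trajectory and the posterior accessed latency. Without any acceleration techniques, the method reaches online performance close to 80% of its offline evaluation on the Jetson AGX Orin.

Link: https://arxiv.org/abs/2504.19115
Authors: Jiaqi Peng, Tai Wang, Jiangmiao Pang, Yuan Shen
Affiliations: Tsinghua University; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: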

Abstract:Although existing 3D perception algorithms have demonstrated significant improvements in performance, their deployment on edge devices continues to encounter critical challenges due to substantial runtime latency. We propose a new benchmark tailored for online evaluation by considering runtime latency. Based on the benchmark, we build a Latency-Aware 3D Streaming Perception (LASP) framework that addresses the latency issue through two primary components: 1) latency-aware history integration, which extends query propagation into a continuous process, ensuring the integration of historical feature regardless of varying latency; 2) latency-aware predictive detection, a module that compensates the detection results with the predicted trajectory and the posterior accessed latency. By incorporating the latency-aware mechanism, our method shows generalization across various latency levels, achieving an online performance that closely aligns with 80% of its offline evaluation on the Jetson AGX Orin without any acceleration techniques.
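
The idea of compensating detections for latency can be illustrated with the simplest possible motion model. The sketch below assumes constant velocity and a known latency. Both are illustrative assumptions, and the function name `compensate` is hypothetical. The paper's predictive detection module learns trajectories rather than extrapolating linearly.

```python
import numpy as np

def compensate(centers, velocities, latency_s):
    """Shift detected box centers forward by the time the pipeline took.

    centers: (N, 2) positions in metres at capture time.
    velocities: (N, 2) estimated velocities in metres per second.
    latency_s: elapsed time between capture and output, in seconds.
    Constant-velocity extrapolation is an assumption made for illustration.
    """
    return centers + velocities * latency_s

centers = np.array([[10.0, 2.0]])      # one object, 10 m ahead
velocities = np.array([[5.0, 0.0]])    # moving away at 5 m/s
shifted = compensate(centers, velocities, latency_s=0.2)
```

In this toy case the object is reported one metre farther along its direction of travel, which is where it would be by the time a 200 ms pipeline emits its output.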

[CV-112] Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction

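[Quick read]: This paper addresses the drop in detection performance that Single-Domain Generalized Object Detection (S-DGOD) models suffer on diverse unseen target domains, particularly in complex multimedia applications such as intelligent video surveillance and VR/AR. Existing methods use pre-trained vision-language knowledge for cross-domain feature learning, but that knowledge is coarse-grained and acts only as an implicit regularizer, so accurate region- and object-level features are hard to learn. The key to the solution is a Cross-modal and Region-aware Feature Interaction mechanism, which learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features, improving detection across domains.

Link: https://arxiv.org/abs/2504.19086
Authors: Xiaoran Xu, Jiangang Yang, Wenyue Chong, Wenhui Shi, Shichu Sun, Jing Xing, Jian Liu
Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Institute of Microelectronics of the Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: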

Abstract:Single-Domain Generalized Object Detection (S-DGOD) aims to train an object detector on a single source domain while generalizing well to diverse unseen target domains, making it suitable for multimedia applications that involve various domain shifts, such as intelligent video surveillance and VR/AR technologies. With the success of large-scale Vision-Language Models, recent S-DGOD approaches exploit pre-trained vision-language knowledge to guide invariant feature learning across visual domains. However, the utilized knowledge remains at a coarse-grained level (e.g., the textual description of adverse weather paired with the image) and serves as an implicit regularization for guidance, struggling to learn accurate region- and object-level features in varying domains. In this work, we propose a new cross-modal feature learning method, which can capture generalized and discriminative regional features for S-DGOD tasks. The core of our method is the mechanism of Cross-modal and Region-aware Feature Interaction, which simultaneously learns both inter-modal and intra-modal regional invariance through dynamic interactions between fine-grained textual and visual features. Moreover, we design a simple but effective strategy called Cross-domain Proposal Refining and Mixing, which aligns the position of region proposals across multiple domains and diversifies them, enhancing the localization ability of detectors in unseen scenarios. Our method achieves new state-of-the-art results on S-DGOD benchmark datasets, with improvements of +8.8% mPC on Cityscapes-C and +7.9% mPC on DWD over baselines, demonstrating its efficacy.

[CV-113] MIA-Mind: A Multidimensional Interactive Attention Mechanism Based on MindSpore

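[Quick read]: This paper addresses a limitation of existing attention mechanisms: they model channel importance and spatial saliency independently, overlooking the interdependence between the two. The key to the solution is MIA-Mind, a lightweight, modular multidimensional interactive attention mechanism built on the MindSpore framework. It jointly models spatial and channel features through a unified cross-attentive fusion strategy, enabling fine-grained feature recalibration at low computational overhead.

Link: https://arxiv.org/abs/2504.19080
Authors: Zhenkai Qin, Jiaquan Liang, Qiao Fang
Affiliations: Guangxi Police College
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: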

Abstract:Attention mechanisms have significantly advanced deep learning by enhancing feature representation through selective focus. However, existing approaches often independently model channel importance and spatial saliency, overlooking their inherent interdependence and limiting their effectiveness. To address this limitation, we propose MIA-Mind, a lightweight and modular Multidimensional Interactive Attention Mechanism, built upon the MindSpore framework. MIA-Mind jointly models spatial and channel features through a unified cross-attentive fusion strategy, enabling fine-grained feature recalibration with minimal computational overhead. Extensive experiments are conducted on three representative datasets: on CIFAR-10, MIA-Mind achieves an accuracy of 82.9%; on ISBI2012, it achieves an accuracy of 78.7%; and on CIC-IDS2017, it achieves an accuracy of 91.9%. These results validate the versatility, lightweight design, and generalization ability of MIA-Mind across heterogeneous tasks. Future work will explore the extension of MIA-Mind to large-scale datasets, the development of adaptive attention fusion strategies, and distributed deployment to further enhance scalability and robustness.

[CV-114] Learning to Drive from a World Model

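[Quick read]: This paper addresses the reliance of conventional self-driving systems on hand-coded perception outputs and engineered driving rules, and aims to learn directly from human driving data with an end-to-end method that simplifies the training architecture and scales better. The key to the solution is an end-to-end training architecture that uses real driving data to train a driving policy in an on-policy simulator. Two simulation methods are used, reprojective simulation and a learned world model, and both allow a policy to learn driving behavior without any hand-coded driving rules.

Link: https://arxiv.org/abs/2504.19077
Authors: Mitchell Goff, Greg Hogan, George Hotz, Armand du Parc Locmaria, Kacper Raczy, Harald Schäfer, Adeeb Shihadeh, Weixing Zhang, Yassine Yousfi
Affiliations: comma.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: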

Abstract:Most self-driving systems rely on hand-coded perception outputs and engineered driving rules. Learning directly from human driving data with an end-to-end method can allow for a training architecture that is simpler and scales well with compute and data. In this work, we propose an end-to-end training architecture that uses real driving data to train a driving policy in an on-policy simulator. We show two different methods of simulation, one with reprojective simulation and one with a learned world model. We show that both methods can be used to train a policy that learns driving behavior without any hand-coded driving rules. We evaluate the performance of these policies in a closed-loop simulation and when deployed in a real-world advanced driver-assistance system.

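[CV-115] HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

[Quick read]: This paper addresses two shortcomings in the diagnosis of Alzheimer's disease (AD): insufficient integration of multimodal data and the lack of structured mechanisms for incorporating dynamic domain knowledge. The key to the solution is HoloDx, a framework with a knowledge injection module and knowledge-aware gated cross-attention that dynamically integrates domain-specific insights from large language models and clinical expertise. A memory injection module with prototypical memory attention retains and retrieves subject-specific information, improving interpretability, robustness, and the alignment of prior knowledge with data.

Link: https://arxiv.org/abs/2504.19075
Authors: Qiuhui Chen, Jintao Wang, Gang Wang, Yi Hong
Affiliations: Shanghai Jiao Tong University; Shanghai Jiao Tong University School of Medicine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: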

Abstract:Accurate diagnosis of Alzheimer’s disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.

[CV-116] Dual-Branch Residual Network for Cross-Domain Few-Shot Hyperspectral Image Classification with Refined Prototype

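[Quick read]: This paper addresses three problems in hyperspectral image (HSI) classification: the high computational cost of convolutional neural networks (CNNs) with 3D convolutional structures, their limited generalization in few-shot scenarios, and the domain shifts caused by sensor differences and environmental variations, which hinder cross-dataset adaptability. The key to the solution is a dual-branch residual network that fuses spatial and spectral features through parallel branches, a regularization term that yields more robust refined prototypes, and a kernel probability matching strategy that aligns source and target domain features to alleviate domain shift.

Link: https://arxiv.org/abs/2504.19074
Authors: Anyong Qin, Chaoqi Yuan, Qiang Li, Feng Yang, Tiecheng Song, Chenqiang Gao
Affiliations: Chongqing University of Posts and Telecommunications; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures. IEEE Geoscience and Remote Sensing Letters (2025)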

Abstract:Convolutional neural networks (CNNs) are effective for hyperspectral image (HSI) classification, but their 3D convolutional structures introduce high computational costs and limited generalization in few-shot scenarios. Domain shifts caused by sensor differences and environmental variations further hinder cross-dataset adaptability. Metric-based few-shot learning (FSL) prototype networks mitigate this problem, yet their performance is sensitive to prototype quality, especially with limited samples. To overcome these challenges, a dual-branch residual network that integrates spatial and spectral features via parallel branches is proposed in this letter. Additionally, more robust refined prototypes are obtained through a regulation term. Furthermore, a kernel probability matching strategy aligns source and target domain features, alleviating domain shift. Experiments on four publicly available HSI datasets illustrate that the proposal achieves superior performance compared to other methods.
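
The metric-based prototype idea the abstract refers to can be sketched in a few lines: each class prototype is the mean of its support features, and a query takes the label of the nearest prototype. This is the plain baseline only. The paper's prototype refinement term and kernel probability matching are not shown, and the 2D features below are made up.

```python
import numpy as np

def class_prototypes(features, labels):
    """Return class ids and the mean support feature (prototype) per class."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def nearest_prototype(queries, classes, protos):
    """Label each query with the class of its closest prototype (Euclidean)."""
    dists = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]

# Two classes with two support samples each (toy embedded features).
support = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
support_labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(support, support_labels)

queries = np.array([[1.0, 1.0], [9.0, 9.0]])
predicted = nearest_prototype(queries, classes, protos)
```

With only two samples per class, one outlier moves the mean noticeably, which is the sensitivity to prototype quality that motivates refining prototypes in the few-shot setting.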

[CV-117] VISUALCENT: Visual Human Analysis using Dynamic Centroid Representation

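[Quick read]: This paper addresses the generalizability and scalability limitations of multi-person visual human analysis. The key to the solution is VISUALCENT, a framework that adopts a centroid-based bottom-up keypoint detection paradigm and uses a Keypoint Heatmap incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid, MaskCentroid, which quickly clusters pixels to a specific human instance during rapid body movement or under heavy occlusion.

Link: https://arxiv.org/abs/2504.19032
Authors: Niaz Ahmad, Youngmoon Lee, Guanghui Wang
Affiliations: Toronto Metropolitan University; Hanyang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: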

Abstract:We introduce VISUALCENT, a unified human pose and instance segmentation framework that addresses generalizability and scalability limitations in multi-person visual human analysis. VISUALCENT leverages a centroid-based bottom-up keypoint detection paradigm and uses a Keypoint Heatmap incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid to swiftly cluster pixels to a specific human instance during rapid changes in human body movement or in significantly occluded environments. Experimental results on the COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real-time performance advantages, outperforming existing methods in mAP scores and execution frame rate per second. The implementation is available on the project page.

[CV-118] Deep Learning-Based Multi-Modal Fusion for Robust Robot Perception and Navigation

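[Quick read]: This paper addresses the limited perception capabilities of autonomous navigation robots in complex environments, in particular how to fuse RGB images with LiDAR data effectively to improve navigation and positioning accuracy. The key to the solution is a lightweight feature extraction network that enhances feature representation, an adaptive weighted cross-modal fusion strategy that improves system robustness, and time-series information modeling that improves perception accuracy in dynamic scenes.

Link: https://arxiv.org/abs/2504.19002
Authors: Delun Lai, Yeyubei Zhang, Yunchong Liu, Chaojie Li, Huadong Mo
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 6 pages, 4 figures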

Abstract:This paper introduces a novel deep learning-based multimodal fusion architecture aimed at enhancing the perception capabilities of autonomous navigation robots in complex environments. By utilizing innovative feature extraction modules, adaptive fusion strategies, and time-series modeling mechanisms, the system effectively integrates RGB images and LiDAR data. The key contributions of this work are as follows: a. the design of a lightweight feature extraction network to enhance feature representation; b. the development of an adaptive weighted cross-modal fusion strategy to improve system robustness; and c. the incorporation of time-series information modeling to boost dynamic scene perception accuracy. Experimental results on the KITTI dataset demonstrate that the proposed approach increases navigation and positioning accuracy by 3.5% and 2.2%, respectively, while maintaining real-time performance. This work provides a novel solution for autonomous robot navigation in complex environments.
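
As a rough illustration of adaptive weighted cross-modal fusion, the sketch below converts one confidence score per modality into softmax weights and blends two feature vectors. This is an assumption about the general idea, not the authors' architecture: the function names, the scalar scores, and the softmax weighting are all hypothetical, and in a real network the scores would be predicted by a learned module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(rgb_feat, lidar_feat, rgb_score, lidar_score):
    """Fuse two equal-length feature vectors with softmax-normalized weights."""
    w_rgb, w_lidar = softmax([rgb_score, lidar_score])
    return [w_rgb * r + w_lidar * l for r, l in zip(rgb_feat, lidar_feat)]


# Equal scores give an even blend; a high LiDAR score lets LiDAR dominate.
even = adaptive_fusion([1.0, 2.0], [3.0, 4.0], rgb_score=0.0, lidar_score=0.0)
lidar_heavy = adaptive_fusion([1.0, 2.0], [3.0, 4.0], rgb_score=0.0, lidar_score=50.0)
print(even, lidar_heavy)
```

Weighting per sample rather than with fixed constants is what lets such a scheme lean on LiDAR when the image is degraded, and on RGB when the point cloud is sparse.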

[CV-119] REED-VAE: RE-Encode Decode Training for Iterative Image Editing with Diffusion Models

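Quick read: This paper addresses the accumulation of artifacts and noise in existing latent diffusion models during iterative image editing, caused by repeated transitions between pixel space and latent space. The key to its solution is a RE-encode decode (REED) training scheme for variational autoencoders (VAEs) that preserves image quality over many iterations, enabling consecutive iterative operations with a variety of editing methods.

Link: https://arxiv.org/abs/2504.18989
Authors: Gal Almog, Ariel Shamir, Ohad Fried
Affiliations: Reichman University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to Eurographics 2025. Project page: this https URL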

Abstract:While latent diffusion models achieve impressive image editing results, their application to iterative editing of the same image is severely restricted. When trying to apply consecutive edit operations using current models, they accumulate artifacts and noise due to repeated transitions between pixel and latent spaces. Some methods have attempted to address this limitation by performing the entire edit chain within the latent space, sacrificing flexibility by supporting only a limited, predetermined set of diffusion editing operations. We present a RE-encode decode (REED) training scheme for variational autoencoders (VAEs), which promotes image quality preservation even after many iterations. Our work enables multi-method iterative image editing: users can perform a variety of iterative edit operations, with each operation building on the output of the previous one using both diffusion-based operations and conventional editing techniques. We demonstrate the advantage of REED-VAE across a range of image editing scenarios, including text-based and mask-based editing frameworks. In addition, we show how REED-VAE enhances the overall editability of images, increasing the likelihood of successful and precise edit operations. We hope that this work will serve as a benchmark for the newly introduced task of multi-method image editing. Our code and models will be available at this https URL

[CV-120] MediAug: Exploring Visual Augmentation in Medical Imaging

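Quick read: This paper addresses two key problems in data augmentation for medical imaging. First, the pronounced domain gap between natural photographs and medical images can distort disease features. Second, existing augmentation studies in medical imaging are mostly limited to a single task or architecture and lack a systematic evaluation of the benefits of advanced mix-based strategies. The key to its solution is a unified evaluation framework that combines six mix-based augmentation methods (MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix) with convolutional and Transformer architectures and runs systematic experiments on brain tumor MRI and eye disease fundus datasets, giving a comprehensive assessment of how each method performs across models and tasks.

Link: https://arxiv.org/abs/2504.18983
Authors: Xuyin Qi, Zeyu Zhang, Canxuan Gang, Hao Zhang, Lei Zhang, Zhiwei Zhang, Yang Zhao
Affiliations: La Trobe; AIML; ANU; UNSW; UCAS; PSU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: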

Abstract:Data augmentation is essential in medical imaging for improving classification accuracy, lesion detection, and organ segmentation under limited data conditions. However, two significant challenges remain. First, a pronounced domain gap between natural photographs and medical images can distort critical disease features. Second, augmentation studies in medical imaging are fragmented and limited to single tasks or architectures, leaving the benefits of advanced mix-based strategies unclear. To address these challenges, we propose a unified evaluation framework with six mix-based augmentation methods integrated with both convolutional and transformer backbones on brain tumour MRI and eye disease fundus datasets. Our contributions are threefold. (1) We introduce MediAug, a comprehensive and reproducible benchmark for advanced data augmentation in medical imaging. (2) We systematically evaluate MixUp, YOCO, CropMix, CutMix, AugMix, and SnapMix with ResNet-50 and ViT-B backbones. (3) We demonstrate through extensive experiments that MixUp yields the greatest improvement on the brain tumor classification task for ResNet-50 with 79.19% accuracy and SnapMix yields the greatest improvement for ViT-B with 99.44% accuracy, and that YOCO yields the greatest improvement on the eye disease classification task for ResNet-50 with 91.60% accuracy and CutMix yields the greatest improvement for ViT-B with 97.94% accuracy. Code will be available at this https URL.
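
MixUp, one of the six methods evaluated, is simple enough to sketch in a few lines. The example below follows the standard MixUp formulation (a convex combination of two samples and their one-hot labels with a Beta-distributed coefficient). It is not taken from the MediAug code; the `alpha` default and the flattened-list image format are assumptions for illustration.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy "images" (flattened pixel lists) with one-hot labels for two classes.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, mixed_x, mixed_y)
```

Because the label is mixed with the same coefficient as the pixels, the mixed label still sums to one and can be used directly with a soft-label cross-entropy loss.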

[CV-121] 3DPyranet Features Fusion for Spatio-temporal Feature Learning

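Quick read: This paper addresses the challenging problem of recognizing human actions and dynamic scenes in video, particularly under complex conditions such as camera motion. The key to its solution is a 3D pyramidal neural network, 3DPyraNet, and a discriminative spatio-temporal feature learning approach built on it, 3DPyraNet-F. 3DPyraNet introduces a new weighting scheme that analyzes multiple adjacent frames along the spatial and temporal dimensions while keeping a biologically plausible structure; it preserves the spatial topology of the input image and reduces the number of parameters as well as computational and memory costs. 3DPyraNet-F fuses the feature maps of the network's highest layer into a single vector and feeds it to a linear SVM classifier to improve recognition of human actions and dynamic scenes in video.

Link: https://arxiv.org/abs/2504.18977
Authors: Ihsan Ullah, Alfredo Petrosino
Affiliations: University of Naples ’Parthenope’; University of Milan; University of Galway
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: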

Abstract:A convolutional neural network (CNN) slides a kernel over the whole image to produce an output map. This kernel scheme reduces the number of parameters with respect to a fully connected neural network (NN). While CNNs have proven effective in recognizing handwritten characters, traffic sign boards, and similar targets, their deep variants have more recently proven effective in similar as well as more challenging applications such as object, scene, and action recognition. Deep CNNs add more layers and kernels to the classical CNN, increasing the number of parameters and partly reducing the main advantage of CNNs, which is fewer parameters. In this paper, a 3D pyramidal neural network called 3DPyraNet and a discriminative approach for spatio-temporal feature learning based on it, called 3DPyraNet-F, are proposed. 3DPyraNet introduces a new weighting scheme that learns features from both spatial and temporal dimensions by analyzing multiple adjacent frames while keeping a biologically plausible structure. It keeps the spatial topology of the input image and has fewer parameters and lower computational and memory costs than both fully connected NNs and recent deep CNNs. 3DPyraNet-F extracts the feature maps of the highest layer of the learned network, fuses them into a single vector, and provides it as input to a linear-SVM classifier, which enhances the recognition of human actions and dynamic scenes from videos. Encouraging results are reported with 3DPyraNet in real-world environments, especially in the presence of camera-induced motion. Further, 3DPyraNet-F clearly outperforms the state of the art on three benchmark datasets and shows comparable results on the fourth.

[CV-122] R-Sparse R-CNN: SAR Ship Detection Based on Background-Aware Sparse Learnable Proposals

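Quick read: This paper addresses the challenge of oriented ship detection in Synthetic Aperture Radar (SAR) images, in particular accurate discrimination against complex backgrounds. The key to its solution is a new framework, R-Sparse R-CNN, which introduces Background-Aware Proposals (BAPs) to enrich object representation, uses a Dual-Context Pooling (DCP) strategy to extract ship and background features efficiently, and designs a Transformer-based Interaction Module to model their contextual relationships, thereby improving detection accuracy.

Link: https://arxiv.org/abs/2504.18959
Authors: Kamirul Kamirul, Odysseas Pappas, Alin Achim
Affiliations: University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing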

Abstract:We introduce R-Sparse R-CNN, a novel pipeline for oriented ship detection in Synthetic Aperture Radar (SAR) images that leverages sparse learnable proposals enriched with background contextual information, termed background-aware proposals (BAPs). The adoption of sparse proposals streamlines the pipeline by eliminating the need for proposal generators and post-processing for overlapping predictions. The proposed BAPs enrich object representation by integrating ship and background features, allowing the model to learn their contextual relationships for more accurate distinction of ships in complex environments. To complement BAPs, we propose Dual-Context Pooling (DCP), a novel strategy that jointly extracts ship and background features in a single unified operation. This unified design improves efficiency by eliminating redundant computation inherent in separate pooling. Moreover, by ensuring that ship and background features are pooled from the same feature map level, DCP provides aligned features that improve contextual relationship learning. Finally, as a core component of contextual relationship learning in R-Sparse R-CNN, we design a dedicated transformer-based Interaction Module. This module interacts pooled ship and background features with corresponding proposal features and models their relationships. Experimental results show that R-Sparse R-CNN delivers outstanding accuracy, surpassing state-of-the-art models by margins of up to 12.8% and 11.9% on SSDD and RSDD-SAR inshore datasets, respectively. These results demonstrate the effectiveness and competitiveness of R-Sparse R-CNN as a robust framework for oriented ship detection in SAR imagery. The code is available at: this http URL.

[CV-123] Kinship Verification through a Forest Neural Network

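Quick read: This paper addresses the limited accuracy of face representations in kinship verification. Earlier methods rely on individual face representations; this paper proposes an approach based on graph neural network concepts that uses face representations and achieves results comparable to joint-representation algorithms. The key to its solution is the design of the classification module and a new combination of losses that introduces the center loss gradually during training.

Link: https://arxiv.org/abs/2504.18910
Authors: Ali Nazari, Mohsen Ebrahimi Moghaddam, Omidreza Borzoei
Affiliations: Shahid Beheshti University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: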

Abstract:Early methods used face representations in kinship verification, which are less accurate than joint representations of parents’ and children’s facial images learned from scratch. We propose an approach featuring graph neural network concepts to utilize face representations and have comparable results to joint representation algorithms. Moreover, we designed the structure of the classification module and introduced a new combination of losses to engage the center loss gradually in training our network. Additionally, we conducted experiments on KinFaceW-I and II, demonstrating the effectiveness of our approach. We achieved the best result on KinFaceW-II, an average improvement of nearly 1.6 for all kinship types, and we were near the best on KinFaceW-I. The code is available at this https URL
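
The idea of engaging the center loss gradually can be sketched as a classification loss plus a center-loss term whose weight ramps up over training. The abstract does not give the schedule, so the linear ramp, the averaging over the batch, and the names `ramp_weight`, `ramp_epochs`, and `max_weight` below are assumptions for illustration only.

```python
def center_loss(features, labels, centers):
    """Half the mean squared distance between each feature and its class center."""
    total = 0.0
    for feat, label in zip(features, labels):
        total += sum((f - c) ** 2 for f, c in zip(feat, centers[label]))
    return 0.5 * total / len(features)


def ramp_weight(epoch, ramp_epochs, max_weight):
    """Assumed schedule: grow the center-loss weight linearly from 0 to max_weight."""
    return max_weight * min(1.0, epoch / ramp_epochs)


def total_loss(cls_loss, features, labels, centers, epoch, ramp_epochs=10, max_weight=0.5):
    """Classification loss plus a gradually weighted center-loss term."""
    weight = ramp_weight(epoch, ramp_epochs, max_weight)
    return cls_loss + weight * center_loss(features, labels, centers)


# Toy batch: two 2-D features, one per class, with both class centers at the origin.
feats = [[1.0, 0.0], [0.0, 2.0]]
centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}
print(total_loss(1.0, feats, [0, 1], centers, epoch=5))
```

Starting with a small weight lets the classifier separate the classes first, before the center term begins pulling same-class features together.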

[CV-124] Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

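Quick read: This paper addresses the security threat posed by unauthorized screen capture and dissemination, in particular copyright tracking for Screen-Camera (SC) images. Existing methods use robust watermarking to strengthen copyright certification for SC images, but their noise layers, built on heuristic mathematical modeling or supervised neural network fitting, are limited in simulating SC noise and cannot closely approximate the noise distribution of real scenes. To solve this, the authors propose Simulation-to-Real (S2R). Its key idea is an unsupervised learning strategy that uses unpaired data to learn the discrepancy between the simulated noise distribution and the real SC noise distribution, rather than directly learning the mapping from sharp images to real-world images, which improves watermark robustness and generalization more efficiently.

Link: https://arxiv.org/abs/2504.18906
Authors: Yufeng Wu, Xin Liao, Baowei Wang, Han Fang, Xiaoshuai Wu, Guiling Wang
Affiliations: Hunan University, China; Nanjing University of Information Science and Technology, China; National University of Singapore, Singapore; New Jersey Institute of Technology, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: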

Abstract:Unauthorized screen capturing and dissemination pose severe security threats such as data leakage and information theft. Several studies propose robust watermarking methods to track the copyright of Screen-Camera (SC) images, facilitating post-hoc certification against infringement. These techniques typically employ heuristic mathematical modeling or supervised neural network fitting as the noise layer to enhance watermarking robustness against SC. However, neither strategy can fundamentally achieve an effective approximation of SC noise. Mathematical simulation suffers from biased approximations due to the incomplete decomposition of the noise and the absence of interdependence among the noise components. Supervised networks require paired data to train the noise-fitting model, and it is difficult for the model to learn all the features of the noise. To address the above issues, we propose Simulation-to-Real (S2R). Specifically, an unsupervised noise layer employs unpaired data to learn the discrepancy between the modeled (simulated) noise distribution and the real-world SC noise distribution, rather than directly learning the mapping from sharp images to real-world images. Learning this transformation from simulation to reality is inherently simpler, as it primarily involves bridging the gap in noise distributions, instead of the complex task of reconstructing fine-grained image details. Extensive experimental results validate the efficacy of the proposed method, demonstrating superior watermark robustness and generalization compared to those of state-of-the-art methods.

[CV-125] Exploiting Multiple Representations: 3D Face Biometrics Fusion with Application to Surveillance

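Quick read: This paper addresses how to improve face recognition performance in uncontrolled scenarios, in particular the limited generalization of 3D face reconstruction (3DFR) algorithms across application scenarios. The key to its solution is to use multiple state-of-the-art 3DFR algorithms and to exploit their individual strengths through parametric and non-parametric score-level fusion, making biometric recognition more robust. The study also shows that the distinct information provided by different 3DFR algorithms helps alleviate the multi-scenario generalization problem, and it confirms the potential of fusion strategies to make 3DFR-based face recognition systems more reliable.

Link: https://arxiv.org/abs/2504.18886
Authors: Simone Maurizio La Cava, Roberto Casula, Sara Concas, Giulia Orrù, Ruben Tolosana, Martin Drahansky, Julian Fierrez, Gian Luca Marcialis
Affiliations: University of Cagliari; Autonomous University of Madrid; Police Academy of the Czech Republic in Prague
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: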

Abstract:3D face reconstruction (3DFR) algorithms are based on specific assumptions tailored to the limits and characteristics of the different application scenarios. In this study, we investigate how multiple state-of-the-art 3DFR algorithms can be used to generate a better representation of subjects, with the final goal of improving the performance of face recognition systems in challenging uncontrolled scenarios. We also explore how different parametric and non-parametric score-level fusion methods can exploit the unique strengths of multiple 3DFR algorithms to enhance biometric recognition robustness. With this goal, we propose a comprehensive analysis of several face recognition systems across diverse conditions, such as varying distances and camera setups, intra-dataset and cross-dataset, to assess the robustness of the proposed ensemble method. The results demonstrate that the distinct information provided by different 3DFR algorithms can alleviate the problem of generalizing over multiple application scenarios. In addition, the present study highlights the potential of advanced fusion strategies to enhance the reliability of 3DFR-based face recognition systems, providing the research community with key insights to exploit them in real-world applications effectively. Although the experiments are carried out in a specific face verification setup, our proposed fusion-based 3DFR methods may be applied to other tasks around face biometrics that are not strictly related to identity recognition.
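
Score-level fusion, as discussed in the abstract, combines the match scores that several recognition systems produce for the same comparisons. The sketch below shows two common rules under stated assumptions: min-max normalization per system, a fixed mean rule (no trained parameters), and a weighted sum (weights would normally be fit on validation data). The function names and the example scores are hypothetical, and the paper's specific fusion methods may differ.

```python
def min_max(scores):
    """Rescale one system's scores to [0, 1]; constant scores map to 0.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_mean(score_lists):
    """Fixed-rule fusion: average the normalized scores of all systems per comparison."""
    normed = [min_max(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normed)]


def fuse_weighted(score_lists, weights):
    """Weighted-sum fusion with one weight per system."""
    normed = [min_max(s) for s in score_lists]
    return [sum(w * v for w, v in zip(weights, col)) for col in zip(*normed)]


# Two hypothetical 3DFR-based matchers scoring the same three comparisons on different scales.
system_a = [0.0, 5.0, 10.0]
system_b = [1.0, 1.5, 2.0]
print(fuse_mean([system_a, system_b]))
print(fuse_weighted([system_a, system_b], weights=[0.75, 0.25]))
```

Normalizing first matters because raw scores from different matchers are rarely on a common scale.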

[CV-126] WLTCL: Wide Field-of-View 3-D LiDAR Truck Compartment Automatic Localization System

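Quick read: This paper addresses precise automatic localization of truck compartments in automated loading systems, in particular limited adaptability to compartments of different sizes, the lack of a unified coordinate system, and poor reliability in cluttered environments. The key to its solution is a wide field-of-view 3-D LiDAR (Light Detection and Ranging) vehicle compartment automatic localization system that combines high-density point cloud generation, point cloud segmentation constrained by the parking area, and a compartment key point localization algorithm based on geometric features to localize fence-style truck compartments of different sizes accurately.

Link: https://arxiv.org/abs/2504.18870
Authors: Guodong Sun, Mingjing Li, Dingjie Liu, Mingxuan Liu, Bo Wu, Yang Zhang
Affiliations: Hubei University of Technology; Hubei Key Laboratory of Modern Manufacturing Quality Engineering; Shanghai Advanced Research Institute, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: To appear in IEEE TIM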

Abstract:As an essential component of logistics automation, the automated loading system is becoming a critical technology for enhancing operational efficiency and safety. Precise automatic positioning of the truck compartment, which serves as the loading area, is the primary step in automated loading. However, existing methods have difficulty adapting to truck compartments of various sizes, do not establish a unified coordinate system for LiDAR and mobile manipulators, and often exhibit reliability issues in cluttered environments. To address these limitations, our study focuses on achieving precise automatic positioning of key points in large, medium, and small fence-style truck compartments in cluttered scenarios. We propose an innovative wide field-of-view 3-D LiDAR vehicle compartment automatic localization system. For vehicles of various sizes, this system leverages the LiDAR to generate high-density point clouds within an extensive field-of-view range. By incorporating parking area constraints, our vehicle point cloud segmentation method more effectively segments vehicle point clouds within the scene. Our compartment key point positioning algorithm utilizes the geometric features of the compartments to accurately locate the corner points, providing stackable spatial regions. Extensive experiments on our collected data and public datasets demonstrate that this system offers reliable positioning accuracy and reduced computational resource consumption, leading to its application and promotion in relevant fields.
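
One simple way to read "parking area constraints" is as a region-of-interest filter applied before segmentation. The sketch below keeps only the points that fall inside an axis-aligned parking rectangle and above an assumed ground height. This is an illustrative guess at the first step, not the authors' segmentation method; the function name, the rectangle, and the height threshold are all hypothetical.

```python
def crop_to_parking_area(points, x_range, y_range, min_height):
    """Keep (x, y, z) points inside the parking rectangle and above min_height."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [
        (x, y, z)
        for x, y, z in points
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi and z > min_height
    ]


# Illustrative points in metres: one on the truck, one outside the bay, one on the ground.
cloud = [(1.0, 1.0, 0.5), (9.0, 1.0, 0.5), (1.0, 1.0, 0.0)]
truck_points = crop_to_parking_area(cloud, x_range=(0.0, 5.0), y_range=(0.0, 3.0), min_height=0.1)
print(truck_points)
```

Discarding everything outside the bay early reduces both clutter and the number of points later stages have to process.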

[CV-127] PiercingEye: Dual-Space Video Violence Detection with Hyperbolic Vision-Language Guidance

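Quick read: This paper addresses the performance bottleneck in weakly supervised video violence detection (VVD), where Euclidean representation learning struggles to distinguish visually similar but semantically different events, mainly because of limited hierarchical modeling and a scarcity of ambiguous samples. The key to its solution is the PiercingEye framework, which combines Euclidean and hyperbolic geometry to make feature representations more discriminative. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to model event hierarchies progressively, and a cross-space attention mechanism to promote complementary feature interaction between the Euclidean and hyperbolic spaces. To mitigate the shortage of ambiguous samples, it uses large language models to generate logic-guided descriptions of ambiguous events and applies a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples through dynamic similarity-aware weighting.

Link: https://arxiv.org/abs/2504.18866
Authors: Jiaxu Leng, Zhanjie Wu, Mingpi Tan, Mengjingcheng Mo, Jiankang Zheng, Qingqing Li, Ji Gan, Xinbo Gao
Affiliations: Chongqing University of Posts and Telecommunications; Guangyang Bay Laboratory; China Automotive Engineering Research Institute Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence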

Abstract:Existing weakly supervised video violence detection (VVD) methods primarily rely on Euclidean representation learning, which often struggles to distinguish visually similar yet semantically distinct events due to limited hierarchical modeling and insufficient ambiguous training samples. To address this challenge, we propose PiercingEye, a novel dual-space learning framework that synergizes Euclidean and hyperbolic geometries to enhance discriminative feature representation. Specifically, PiercingEye introduces a layer-sensitive hyperbolic aggregation strategy with hyperbolic Dirichlet energy constraints to progressively model event hierarchies, and a cross-space attention mechanism to facilitate complementary feature interactions between Euclidean and hyperbolic spaces. Furthermore, to mitigate the scarcity of ambiguous samples, we leverage large language models to generate logic-guided ambiguous event descriptions, enabling explicit supervision through a hyperbolic vision-language contrastive loss that prioritizes high-confusion samples via dynamic similarity-aware weighting. Extensive experiments on XD-Violence and UCF-Crime benchmarks demonstrate that PiercingEye achieves state-of-the-art performance, with particularly strong results on a newly curated ambiguous event subset, validating its superior capability in fine-grained violence detection.
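
For readers unfamiliar with hyperbolic representations, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. The abstract does not say which hyperbolic model PiercingEye uses, so choosing the Poincaré ball here is an assumption. The example shows why such spaces suit hierarchies: the same Euclidean gap costs far more distance near the boundary than near the origin.

```python
import math


def squared_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in v)


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    diff = squared_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - squared_norm(u)) * (1.0 - squared_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Both pairs are 0.1 apart in Euclidean terms, but not in hyperbolic terms.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
print(near_origin, near_boundary)
```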

[CV-128] Spike Imaging Velocimetry: Dense Motion Estimation of Fluids Using Spike Cameras

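Quick read: This paper addresses accurate, non-intrusive flow velocity measurement in highly turbulent and intricate flow fields by introducing Spike Imaging Velocimetry (SIV), a deep learning framework based on spike cameras, to improve Particle Image Velocimetry (PIV). The key to its solution is a Detail-Preserving Hierarchical Transform (DPHT) module that aggregates motion features from the spike stream while reducing information loss, together with a Graph Encoder (GE) that extracts contextual features from complex flow fields. The study also builds a labeled dataset, Particle Scenes with Spike and Displacement (PSSD), covering three challenging fluid dynamics scenarios, to support validation and evaluation.

Link: https://arxiv.org/abs/2504.18864
Authors: Yunzhong Zhang, Bo Xiong, You Zhou, Changqing Su, Zhen Cheng, Zhaofei Yu, Xun Cao, Tiejun Huang
Affiliations: Nanjing University; Peking University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: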

Abstract:The need for accurate and non-intrusive flow measurement methods has led to the widespread adoption of Particle Image Velocimetry (PIV), a powerful diagnostic tool in fluid motion estimation. This study investigates the tremendous potential of spike cameras (a type of ultra-high-speed, high-dynamic-range camera) in PIV. We propose a deep learning framework, Spike Imaging Velocimetry (SIV), designed specifically for highly turbulent and intricate flow fields. To aggregate motion features from the spike stream while minimizing information loss, we incorporate a Detail-Preserving Hierarchical Transform (DPHT) module. Additionally, we introduce a Graph Encoder (GE) to extract contextual features from highly complex fluid flows. Furthermore, we present a spike-based PIV dataset, Particle Scenes with Spike and Displacement (PSSD), which provides labeled data for three challenging fluid dynamics scenarios. Our proposed method achieves superior performance compared to existing baseline methods on PSSD. The datasets and our implementation of SIV are open-sourced in the supplementary materials.

[CV-129] Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation

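Quick read: This paper addresses the limitations of current vision-language models (VLMs) in computational pathology that rely on single-resolution images for tasks such as cancer subtype classification, tissue phenotyping, and survival analysis, since a single-resolution image provides limited detail. The key to its solution is a multi-resolution paradigm that extracts histology patches from whole slide images (WSIs) at multiple resolutions and generates corresponding textual descriptions, and that introduces both multi-resolution visual-textual alignment and cross-resolution alignment. A multimodal encoder strengthens the model's ability to capture context across resolutions, improving feature representation, discriminative ability, and generalization across resolutions.

Link: https://arxiv.org/abs/2504.18856
Authors: Shahad Albastaki, Anabia Sohail, Iyyakutti Iyappan Ganapathi, Basit Alawode, Asim Khan, Sajid Javed, Naoufel Werghi, Mohammed Bennamoun, Arif Mahmood
Affiliations: Khalifa University of Science and Technology; Information Technology University of the Punjab; The University of Western Australia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: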

Abstract:In Computational Pathology (CPath), the introduction of Vision-Language Models (VLMs) has opened new avenues for research, focusing primarily on aligning image-text pairs at a single magnification level. However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. Addressing this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. We introduce visual-textual alignment at multiple resolutions as well as cross-resolution alignment to establish more effective text-guided visual representations. Cross-resolution alignment using a multimodal encoder enhances the model’s ability to capture context from multiple resolutions in histology images. Supported by novel loss functions, our model aims to capture a broader range of information, enrich feature representation, improve discriminative ability, and enhance generalization across different resolutions. Pre-trained on a comprehensive TCGA dataset with 34 million image-language pairs at various resolutions, our fine-tuned model outperforms state-of-the-art (SOTA) counterparts across multiple datasets and tasks, demonstrating its effectiveness in CPath. The code is available on GitHub at: this https URL

[CV-130] Dexonomy: Synthesizing All Dexterous Grasp Types in a Grasp Taxonomy

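Quick read: This paper addresses generalizable dexterous grasping for intelligent robots, in particular how to generate contact-rich, penetration-free, and physically plausible grasps for many grasp types, objects, and articulated hands. Existing automatic grasp synthesis methods are usually limited to specific grasp types or object categories, which limits scalability. The paper proposes an efficient pipeline that synthesizes grasps in two stages: it first optimizes the object to fit the hand template, then locally refines the hand to fit the object in simulation. The key is to start from a single human-annotated template for each hand and grasp type, optimize in stages, and validate the synthesized grasps with a contact-aware control strategy, which improves the reliability and generality of the grasps.

Link: https://arxiv.org/abs/2504.18829
Authors: Jiayi Chen, Yubin Ke, Lin Peng, He Wang
Affiliations: Peking University; Galbot; Beijing Academy of Artificial Intelligence
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Robotics: Science and Systems (RSS 2025)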

Abstract:Generalizable dexterous grasping with suitable grasp types is a fundamental skill for intelligent robots. Developing such skills requires a large-scale and high-quality dataset that covers numerous grasp types (i.e., at least those categorized by the GRASP taxonomy), but collecting such data is extremely challenging. Existing automatic grasp synthesis methods are often limited to specific grasp types or object categories, hindering scalability. This work proposes an efficient pipeline capable of synthesizing contact-rich, penetration-free, and physically plausible grasps for any grasp type, object, and articulated hand. Starting from a single human-annotated template for each hand and grasp type, our pipeline tackles the complicated synthesis problem with two stages: optimize the object to fit the hand template first, and then locally refine the hand to fit the object in simulation. To validate the synthesized grasps, we introduce a contact-aware control strategy that allows the hand to apply the appropriate force at each contact point to the object. Those validated grasps can also be used as new grasp templates to facilitate future synthesis. Experiments show that our method significantly outperforms previous type-unaware grasp synthesis baselines in simulation. Using our algorithm, we construct a dataset containing 10.7k objects and 9.5M grasps, covering 31 grasp types in the GRASP taxonomy. Finally, we train a type-conditional generative model that successfully performs the desired grasp type from single-view object point clouds, achieving an 82.3% success rate in real-world experiments. Project page: this https URL.

[CV-131] Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning

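Quick read: This paper addresses the inconsistent visual quality and unreliable performance across input conditions that arise in existing talking face video generation systems because visual uncertainty is insufficiently learned. The key to its solution is a Joint Uncertainty Learning Network (JULNet) that introduces an error map and an uncertainty map to model the relationship between visual error and uncertainty directly, and matches their distributions using a KL divergence term and a histogram technique, so that error and uncertainty are optimized jointly and the model becomes more accurate and robust.

Link: https://arxiv.org/abs/2504.18810
Authors: Yifan Xie, Fei Ma, Yi Bin, Ying He, Fei Yu
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tongji University; Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures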

Abstract:Talking face video generation with arbitrary speech audio is a significant challenge within the realm of digital human technology. The previous studies have emphasized the significance of audio-lip synchronization and visual quality. Currently, limited attention has been given to the learning of visual uncertainty, which creates several issues in existing systems, including inconsistent visual quality and unreliable performance across different input conditions. To address the problem, we propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation, which incorporates a representation of uncertainty that is directly related to visual error. Specifically, we first design an uncertainty module to individually predict the error map and uncertainty map after obtaining the generated image. The error map represents the difference between the generated image and the ground truth image, while the uncertainty map is used to predict the probability of incorrect estimates. Furthermore, to match the uncertainty distribution with the error distribution through a KL divergence term, we introduce a histogram technique to approximate the distributions. By jointly optimizing error and uncertainty, the performance and robustness of our model can be enhanced. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking face video generation compared to previous methods.
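
The distribution-matching step can be illustrated with a plain histogram and a discrete KL divergence. The sketch below is only meant to show the objective: it uses a hard (non-differentiable) histogram, whereas a trainable model would need a differentiable approximation, and the direction of the KL term, the bin count, and the value range are assumptions rather than details from the paper.

```python
import math


def histogram(values, bins=10, lo=0.0, hi=1.0, eps=1e-8):
    """Normalized histogram over [lo, hi], smoothed by eps to avoid log(0)."""
    counts = [0] * bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[idx] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical per-pixel values flattened from an error map and an uncertainty map.
error_hist = histogram([0.1, 0.2, 0.8])
uncertainty_hist = histogram([0.7, 0.8, 0.9])
print(kl_divergence(uncertainty_hist, error_hist))
```

The divergence is zero only when the two histograms coincide, so minimizing it pushes the predicted uncertainty to be distributed like the observed error.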

[CV-132] Video CLIP Model for Multi-View Echocardiography Interpretation

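Quick read: This paper addresses the low diagnostic accuracy of existing single-frame vision-language models (VLMs) when interpreting echocardiographic videos, especially for conditions that are identified through cardiac motion. The key to its solution is a video-language model that takes five different views and full video sequences as input and is trained on pairs of echocardiographic videos and clinical reports from 60,747 cases, improving interpretation accuracy.

Link: https://arxiv.org/abs/2504.18800
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
Affiliations: The University of Tokyo Hospital; The University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: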

Abstract:Echocardiography involves recording videos of the heart using ultrasound, enabling clinicians to evaluate its condition. Recent advances in large-scale vision-language models (VLMs) have garnered attention for automating the interpretation of echocardiographic videos. However, most existing VLMs proposed for medical interpretation thus far rely on single-frame (i.e., image) inputs. Consequently, these image-based models often exhibit lower diagnostic accuracy for conditions identifiable through cardiac motion. Moreover, echocardiographic videos are recorded from various views that depend on the direction of ultrasound emission, and certain views are more suitable than others for interpreting specific conditions. Incorporating multiple views could potentially yield further improvements in accuracy. In this study, we developed a video-language model that takes five different views and full video sequences as input, training it on pairs of echocardiographic videos and clinical reports from 60,747 cases. Our experiments demonstrate that this expanded approach achieves higher interpretation accuracy than models trained with only single-view videos or with still images.
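
Training on video-report pairs in a CLIP-style setup typically uses a symmetric contrastive loss over the similarity matrix between video and text embeddings. The sketch below shows that standard loss in plain Python. It assumes the paper follows the usual CLIP objective, which the abstract does not state explicitly; the temperature value and function names are hypothetical, and the encoders and multi-view handling are omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def clip_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the video-report cosine-similarity matrix."""
    v = [normalize(e) for e in video_embs]
    t = [normalize(e) for e in text_embs]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]

    def cross_entropy(rows):
        # The matching pair for row i sits on the diagonal (column i).
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[i]
        return loss / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


# Toy embeddings: aligned pairs should give a much lower loss than swapped pairs.
embs = [[1.0, 0.0], [0.0, 1.0]]
print(clip_loss(embs, embs), clip_loss(embs, embs[::-1]))
```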

[CV-133] CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval

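Quick read: This paper addresses the domain bias that arises in text-based person retrieval when pretraining relies on synthesized data because of high annotation costs and privacy protection; this bias severely limits the scalability of the pre-trained model. The key to its solution is a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL), which designs tasks that reflect the diversity and complexity of real-world scenarios, introduces a dynamic error sample memory unit to record errors encountered across multiple tasks, and adopts an adaptive dual-speed update strategy that balances fast adaptation to new tasks with slow weight updates for historical tasks, improving generalization during pretraining.

Link: https://arxiv.org/abs/2504.18782
Authors: Hang Yu, Jiahao Wen, Zhedong Zheng
Affiliations: Shanghai University; University of Macau
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: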

Abstract:Text-based person retrieval aims to identify specific individuals within an image database using textual descriptions. Due to the high cost of annotation and privacy protection, researchers resort to synthesized data for the paradigm of pretraining and fine-tuning. However, these generated data often exhibit domain biases in both images and textual annotations, which largely compromise the scalability of the pre-trained model. Therefore, we introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability during pretraining to facilitate the subsequent downstream tasks. In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios, and introduce a dynamic error sample memory unit to memorize the history for errors encountered within multiple tasks. To further ensure multi-task adaptation, we also adopt an adaptive dual-speed update strategy, balancing fast adaptation to new tasks and slow weight updates for historical tasks. Albeit simple, our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, but also showcases robustness and scalability in handling biased synthetic images and noisy text annotations. Our code is available at this https URL.

[CV-134] IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic

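[Quick read]: This paper addresses the fact that existing tools for extracting features from Internet of Things (IoT) network traffic capture neither sequential nor spatial patterns, which limits the use of transformer models. The key to the solution is a novel preprocessing method that converts network flow data into a 1-channel 2D image format suited to the Vision Transformer (ViT), together with an enhanced ViT that supports classifiers other than the multilayer perceptron (MLP), improving multiclass IoT botnet attack detection.

Link: https://arxiv.org/abs/2504.18781
Authors: Hassan Wasswa, Timothy Lynar, Aziida Nanyonga, Hussein Abbass
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: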

Abstract:Despite the demonstrated effectiveness of transformer models in NLP and in image and video classification, the available tools for extracting features from captured IoT network flow packets fail to capture sequential patterns and lack spatial patterns, which limits the application of transformer models. This work introduces a novel preprocessing method to adapt transformer models, the vision transformer (ViT) in particular, for IoT botnet attack detection using network flow packets. The approach involves feature extraction from .pcap files and transforming each instance into a 1-channel 2D image shape, enabling ViT-based classification. Also, the ViT model was enhanced to allow the use of any classifier besides the Multilayer Perceptron (MLP) that was deployed in the initial ViT paper. Models including the conventional feed forward Deep Neural Network (DNN), LSTM and Bidirectional-LSTM (BLSTM) demonstrated competitive performance in terms of precision, recall, and F1-score for multiclass-based attack detection when evaluated on two IoT attack datasets.
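
The reshaping step described in this abstract can be illustrated with a minimal sketch. The abstract does not specify the feature count, the padding scheme, or the scaling, so the zero padding to the next square size, the per-sample min-max scaling, and the name `flow_to_image` are all assumptions made for illustration, not the authors' preprocessing.

```python
import numpy as np

def flow_to_image(features, side=None):
    """Zero-pad a 1D flow-feature vector and reshape it to (1, side, side)."""
    x = np.asarray(features, dtype=np.float32).ravel()
    if side is None:
        # Assumption: use the smallest square that holds every feature.
        side = int(np.ceil(np.sqrt(x.size)))
    if x.size > side * side:
        raise ValueError("feature vector does not fit in the requested image")
    # Assumption: per-sample min-max scaling to [0, 1]; a constant vector maps to zeros.
    span = x.max() - x.min()
    x = (x - x.min()) / span if span > 0 else np.zeros_like(x)
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: x.size] = x
    # Leading axis is the single channel expected by an image model.
    return padded.reshape(1, side, side)

image = flow_to_image(np.arange(115))
print(image.shape)  # (1, 11, 11)
```

In practice the scaling would more likely be fitted per feature over the training set; the per-sample version above only keeps the sketch self-contained.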

[CV-135] Depth as Points: Center Point-based Depth Estimation

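[Quick read]: This paper addresses the perception of vehicles and pedestrians in urban scenes, which is crucial for autonomous driving but, with conventional approaches, involves complicated data collection and high computational and hardware demands. The key to the solution is an efficient method for generating virtual datasets that allows task- and scenario-specific datasets to be created quickly; on this basis the authors build VirDepth, a large-scale multi-task autonomous driving dataset. They further propose CenterDepth, a lightweight monocular depth estimation architecture whose core innovation is to integrate global semantic information through the Center FC-CRFs algorithm and to aggregate multi-scale features based on object key points, enabling detection-based depth estimation with strong computational speed and prediction accuracy.

Link: https://arxiv.org/abs/2504.18773
Authors: Zhiheng Tu, Xinjian Huang, Yong He, Ruiyang Zhou, Bo Du, Weitao Wu
Affiliations: Nanjing University of Science and Technology; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Depth Estimation, Key-points, Virtual Datasets, Autonomous Driving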

Abstract:The perception of vehicles and pedestrians in urban scenarios is crucial for autonomous driving. This process typically involves complicated data collection and imposes high computational and hardware demands. To address these limitations, we first develop a highly efficient method for generating virtual datasets, which enables the creation of task- and scenario-specific datasets in a short time. Leveraging this method, we construct the virtual depth estimation dataset VirDepth, a large-scale, multi-task autonomous driving dataset. Subsequently, we propose CenterDepth, a lightweight architecture for monocular depth estimation that ensures high operational efficiency and exhibits superior performance in depth estimation tasks with highly imbalanced height-scale distributions. CenterDepth integrates global semantic information through the innovative Center FC-CRFs algorithm, aggregates multi-scale features based on object key points, and enables detection-based depth estimation of targets. Experiments demonstrate that our proposed method achieves superior performance in terms of both computational speed and prediction accuracy.

[CV-136] PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data ICLR2025

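[Quick read]: This paper addresses multi-modal image fusion for earth observation data, in particular how to fuse an arbitrary number of mixed-resolution input bands into a single representation. The key to the solution is an attention-based fusion mechanism that produces patch tokens, which are then processed by a stack of vision transformers with a novel pyramidal structure for effective feature learning and representation.

Link: https://arxiv.org/abs/2504.18770
Authors: Manuel Weber, Carly Beneke
Affiliations: EarthDaily Analytics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 13 figures, Published at ICLR 2025 - Machine Learning for Remote Sensing (ML4RS) Workshop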

Abstract:We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualization of the attention scores and the model's applicability to downstream tasks.

[CV-137] TransparentGS: Fast Inverse Rendering of Transparent Objects with Gaussians SIGGRAPH2025

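[Quick read]: This paper addresses the radiance-field instability and overfitting that specular reflection and refraction cause in 3D reconstruction and novel view synthesis of transparent objects, and in particular the poor performance of 3D Gaussian Splatting (3D-GS) on transparent objects with nearby contents due to secondary ray effects. The key to the solution is TransparentGS, whose core components are: transparent Gaussian primitives, an efficient representation of transparent objects that enables specular refraction through a deferred refraction strategy; Gaussian light field probes (GaussProbe), which encode ambient light and nearby contents in a unified framework; and a depth-based iterative probe query algorithm (IterQuery), which reduces parallax errors in the probe-based framework.

Link: https://arxiv.org/abs/2504.18768
Authors: Letian Huang, Dongwei Ye, Jialin Dan, Chengzhi Tao, Huiwen Liu, Kun Zhou, Bo Ren, Yuanqi Li, Yanwen Guo, Jie Guo
Affiliations: State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China; TMCC, College of Computer Science, Nankai University, Tianjin, China; State Key Lab of CAD & CG, Zhejiang University, Hangzhou, China; Institute of Hangzhou Holographic Intelligent Technology, Hangzhou, China
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by SIGGRAPH 2025; this https URL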

Abstract:The emergence of neural and Gaussian-based radiance field methods has led to considerable advancements in novel view synthesis and 3D object reconstruction. Nonetheless, specular reflection and refraction continue to pose significant challenges due to the instability and incorrect overfitting of radiance fields to high-frequency light variations. Currently, even 3D Gaussian Splatting (3D-GS), as a powerful and efficient tool, falls short in recovering transparent objects with nearby contents due to the existence of apparent secondary ray effects. To address this issue, we propose TransparentGS, a fast inverse rendering pipeline for transparent objects based on 3D-GS. The main contributions are three-fold. Firstly, an efficient representation of transparent objects, transparent Gaussian primitives, is designed to enable specular refraction through a deferred refraction strategy. Secondly, we leverage Gaussian light field probes (GaussProbe) to encode both ambient light and nearby contents in a unified framework. Thirdly, a depth-based iterative probes query (IterQuery) algorithm is proposed to reduce the parallax errors in our probe-based framework. Experiments demonstrate the speed and accuracy of our approach in recovering transparent objects from complex environments, as well as several applications in computer graphics and vision.

[CV-138] Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos

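[Quick read]: This paper addresses action segmentation in surgical workflows, especially the difficulty of capturing long action sequences in which differences between individual surgeons lead to inconsistent action durations and subtle transitions. Traditional models such as MS-TCN rely on large receptive fields and are prone to over-segmentation or under-segmentation, which degrades segmentation quality. The proposed Multi-Stage Boundary-Aware Transformer Network (MSBATN) introduces hierarchical sliding window attention together with a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks; its boundary voting mechanism uses contextual information to identify action start and end points more accurately, improving segmentation performance.

Link: https://arxiv.org/abs/2504.18756
Authors: Rezowan Shuvo, M S Mekala, Eyad Elyan
Affiliations: Robert Gordon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: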

Abstract:Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions performed in surgical settings poses challenges, as individual surgeons have their unique approaches shaped by their expertise, leading to significant variability. To tackle this complex problem, we focused on segmentation with precise boundaries, a demanding task due to the inherent variability in action durations and the subtle transitions often observed in untrimmed videos. These transitions, marked by ambiguous starting and ending points, complicate the segmentation process. Traditional models, such as MS-TCN, which depend on large receptive fields, frequently face challenges of over-segmentation (resulting in fragmented segments) or under-segmentation (merging distinct actions). Both of these issues negatively impact the quality of segmentation. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments using three challenging surgical datasets demonstrate the superior performance of the proposed method, achieving state-of-the-art results in F1 scores at thresholds of 25% and 50%, while also delivering comparable performance in other metrics.

[CV-139] Dream-Box: Object-wise Outlier Generation for Out-of-Distribution Detection CVPR2025

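[Quick read]: This paper addresses out-of-distribution (OOD) detection, that is, identifying instances that do not belong to the training distribution while maintaining good performance on the in-distribution task (such as classification or object detection). The key to the solution is Dream-Box, which uses diffusion models to generate object-wise outliers in pixel space for training an object detector to perform OOD detection, and which is the first to provide visualizations of the generated OOD objects.

Link: https://arxiv.org/abs/2504.18746
Authors: Brian K. S. Isaac-Medina, Toby P. Breckon
Affiliations: Durham University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures, 2 tables, LatinX in AI CVPR 2025 Workshop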

Abstract:Deep neural networks have demonstrated great generalization capabilities for tasks whose training and test sets are drawn from the same distribution. Nevertheless, out-of-distribution (OOD) detection remains a challenging task that has received significant attention in recent years. Specifically, OOD detection refers to the detection of instances that do not belong to the training distribution, while still having good performance on the in-distribution task (e.g., classification or object detection). Recent work has focused on generating synthetic outliers and using them to train an outlier detector, generally achieving better OOD detection than traditional OOD methods. In this regard, outliers can be generated either in feature or pixel space. Feature space driven methods have shown strong performance on both the classification and object detection tasks, at the cost that the training outliers cannot be visualized, making further analysis of OOD failure modes challenging. On the other hand, pixel space outlier generation techniques enabled by diffusion models have been used for image classification, providing improved OOD detection performance and outlier visualization, although their adaptation to the object detection task is as yet unexplored. We therefore introduce Dream-Box, a method that provides a link to object-wise outlier generation in the pixel space for OOD detection. Specifically, we use diffusion models to generate object-wise outliers that are used to train an object detector for an in-distribution task and OOD detection. Our method achieves comparable performance to previous traditional methods while being the first technique to provide concrete visualization of generated OOD objects.

[CV-140] A Review of 3D Object Detection with Vision-Language Models

[Quick read]: This paper examines how vision-language models (VLMs) can be used for 3D object detection, with a focus on improving open-vocabulary detection and zero-shot generalization. The key lies in pretraining strategies, architecture design, and prompt engineering methods that align textual features with 3D spatial features, thereby strengthening detection performance in complex scenes.

Link: https://arxiv.org/abs/2504.18738
Authors: Ranjan Sapkota, Konstantinos I Roumeliotis, Rahul Harsha Cheppally, Marco Flores Calero, Manoj Karkee
Affiliations: Cornell University; University of Peloponnese; Kansas State University; Universidad de las Fuerzas Armadas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This review provides a systematic analysis and comprehensive survey of 3D object detection with vision-language models (VLMs), a rapidly advancing area at the intersection of 3D vision and multimodal AI. By examining over 100 research papers, we provide the first systematic analysis dedicated to 3D object detection with vision-language models. We begin by outlining the unique challenges of 3D object detection with vision-language models, emphasizing differences from 2D detection in spatial reasoning and data complexity. Traditional approaches using point clouds and voxel grids are compared to modern vision-language frameworks like CLIP and 3D LLMs, which enable open-vocabulary detection and zero-shot generalization. We review key architectures, pretraining strategies, and prompt engineering methods that align textual and 3D features for effective 3D object detection with vision-language models. Visualization examples and evaluation benchmarks are discussed to illustrate performance and behavior. Finally, we highlight current challenges, such as limited 3D-language datasets and computational demands, and propose future research directions to advance 3D object detection with vision-language models. Keywords: Object Detection, Vision-Language Models, Agents, VLMs, LLMs, AI

[CV-141] HierSum: A Global and Local Attention Mechanism for Video Summarization

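[Quick read]: This paper addresses instructional video summarization, that is, extracting the key steps from an instructional video and producing a concise summary. The key to the solution is HierSum, a hierarchical method that integrates fine-grained local cues from subtitles with global context provided by video-level instructions to identify key segments more accurately. The method also uses the "most replayed" statistic as a supervisory signal, further improving the effectiveness of the summary.

Link: https://arxiv.org/abs/2504.18689
Authors: Apoorva Beedu, Irfan Essa
Affiliations: Georgia Institute of Technology; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: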

Abstract:Video summarization creates an abridged version (i.e., a summary) that provides a quick overview of the video while retaining pertinent information. In this work, we focus on summarizing instructional videos and propose a method for breaking down a video into meaningful segments, each corresponding to essential steps in the video. We propose HierSum, a hierarchical approach that integrates fine-grained local cues from subtitles with global contextual information provided by video-level instructions. Our approach utilizes the "most replayed" statistic as a supervisory signal to identify critical segments, thereby improving the effectiveness of the summary. We evaluate on benchmark datasets such as TVSum, BLiSS, this http URL, and the WikiHow test set, and show that HierSum consistently outperforms existing methods in key metrics such as F1-score and rank correlation. We also curate a new multi-modal dataset using WikiHow and EHow videos and associated articles containing step-by-step instructions. Through extensive ablation studies, we demonstrate that training on this dataset significantly enhances summarization on the target datasets.

[CV-142] SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models IROS2025

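[Quick read]: This paper addresses interpreting object-referential language and grounding objects in 3D environments based on spatial relations and attributes, a task that is essential for robots operating alongside humans. The task is difficult because of the diversity of scenes, the large number of fine-grained objects, and the complex free-form nature of language references, and large amounts of natural language training data are especially hard to obtain in the 3D domain. Methods therefore need to learn from little data and generalize zero-shot to new environments. The key to the proposed solution is to use rich object attributes from 2D data and to combine a heuristics-based spatial reasoning toolbox with the sequential reasoning ability of large language models (LLMs), so that no text-to-3D training data is required and the method applies zero-shot to unseen environments.

Link: https://arxiv.org/abs/2504.18684
Authors: Nader Zantout, Haochen Zhang, Pujith Kachana, Jinkai Qiu, Ji Zhang, Wenshan Wang
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages, 6 figures, submitted to IROS 2025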

Abstract:Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on an autonomous vehicle and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released at this https URL .

[CV-143] Co-Training with Active Contrastive Learning and Meta-Pseudo-Labeling on 2D Projections for Deep Semi-Supervised Learning

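[Quick read]: This paper addresses the challenge of training deep learning (DL) models when labeled data is scarce and unlabeled data is abundant, particularly in biological image classification, where conventional methods depend on pretrained features and large validation sets and sample the labeled data at random, overlooking more informative samples. The key to the solution is Active-DeepFA, which combines contrastive learning (CL), teacher-student-based meta-pseudo-labeling, and active learning (AL). It co-trains two cooperative networks to mitigate the confirmation bias introduced by pseudo-labels, performs label propagation on 2D projections of deep features, and selects the most reliable pseudo-labels and the most meaningful samples for annotation, improving performance with little labeled data.

Link: https://arxiv.org/abs/2504.18666
Authors: David Aparco-Cardenas, Jancarlo F. Gomes, Alexandre X. Falcão, Pedro J. de Rezende
Affiliations: Institute of Computing; University of Campinas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Journal of the Brazilian Computer Society (JBCS) [ this https URL ]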

Abstract:A major challenge that prevents the training of DL models is the limited availability of accurately labeled data. This shortcoming is highlighted in areas where data annotation becomes a time-consuming and error-prone task. In this regard, SSL tackles this challenge by capitalizing on scarce labeled and abundant unlabeled data; however, SoTA methods typically depend on pre-trained features and large validation sets to learn effective representations for classification tasks. In addition, the reduced set of labeled data is often randomly sampled, neglecting the selection of more informative samples. Here, we present active-DeepFA, a method that effectively combines CL, teacher-student-based meta-pseudo-labeling and AL to train non-pretrained CNN architectures for image classification in scenarios of scarcity of labeled and abundance of unlabeled data. It integrates DeepFA into a co-training setup that implements two cooperative networks to mitigate confirmation bias from pseudo-labels. The method starts with a reduced set of labeled samples by warming up the networks with supervised CL. Afterward and at regular epoch intervals, label propagation is performed on the 2D projections of the networks’ deep features. Next, the most reliable pseudo-labels are exchanged between networks in a cross-training fashion, while the most meaningful samples are annotated and added into the labeled set. The networks independently minimize an objective loss function comprising supervised contrastive, supervised and semi-supervised loss components, enhancing the representations towards image classification. Our approach is evaluated on three challenging biological image datasets using only 5% of labeled samples, improving baselines and outperforming six other SoTA methods. In addition, it reduces annotation effort by achieving comparable results to those of its counterparts with only 3% of labeled data.

[CV-144] Geometry aware inference of steady state PDEs using Equivariant Neural Fields representations

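[Quick read]: This paper addresses efficient and accurate prediction of steady-state partial differential equations (PDEs) on complex geometries, especially when the geometry varies in a non-parameterized way. The key to the solution is enf2enf, an encoder-decoder method based on the Equivariant Neural Field architecture: input geometries are encoded into latent point cloud embeddings that preserve geometric information, combined with global parameters, and decoded directly into continuous output fields, which models the coupling between geometry and physics. By exploiting the inductive biases of locality and translation invariance, the method captures fine-scale physical features and complex shape variations, improving generalization and physical consistency.

Link: https://arxiv.org/abs/2504.18591
Authors: Giovanni Catalani, Michael Bauerheim, Frédéric Tost, Xavier Bertrand, Joseph Morlier
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: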

Abstract:Recent advances in Neural Fields have enabled powerful, discretization-invariant methods for learning neural operators that approximate solutions of Partial Differential Equations (PDEs) on general geometries. Building on these developments, we introduce enf2enf, an encoder–decoder methodology for predicting steady-state Partial Differential Equations with non-parameterized geometric variability, based on recently proposed Equivariant Neural Field architectures. In enf2enf, input geometries are encoded into latent point cloud embeddings that inherently preserve geometric grounding and capture local phenomena. The resulting representations are then combined with global parameters and directly decoded into continuous output fields, thus efficiently modeling the coupling between geometry and physics. By leveraging the inductive biases of locality and translation invariance, our approach is able to capture fine-scale physical features as well as complex shape variations, thereby enhancing generalization and physical compliance. Extensive experiments on a high-fidelity aerodynamic dataset, a hyper-elastic material benchmark, and multi-element airfoil geometries, demonstrate that the proposed model achieves superior or competitive performance compared to state-of-the-art graph based, operator learning, and neural field methods. Notably, our method supports real time inference and zero-shot super-resolution, enabling efficient training on low-resolution meshes while maintaining high accuracy on full-scale discretizations.

[CV-145] Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

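[Quick read]: This paper addresses the insufficient evaluation of current Large Vision-Language Models (LVLMs) on elementary math problems with explicit visual dependencies, which require models to discern, integrate, and reason across multiple images while drawing on commonsense knowledge. The key to the solution is VCBENCH, a multimodal mathematical reasoning benchmark that spans six cognitive domains and contains 1,720 problems and 6,697 images, intended to support comprehensive evaluation of the ability to integrate visual and mathematical information.

Link: https://arxiv.org/abs/2504.18589
Authors: Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Home page: this https URL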

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems, which rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge, all of which are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.

[CV-146] A Decade of You Only Look Once (YOLO) for Object Detection

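[Quick read]: This paper reviews and analyzes the development of the YOLO (You Only Look Once) framework over the past decade, covering its technical evolution, architectural trends, and applications in real-time object detection. The key contribution is a systematic survey of the main YOLO versions that summarizes their core characteristics of efficient design, modular scalability, and cross-domain adaptability, and that draws on evaluation practices and ethical considerations to inform future directions.

Link: https://arxiv.org/abs/2504.18586
Authors: Leo Thomas Ramos, Angel D. Sappa
Affiliations: Computer Vision Center, Universitat Autònoma de Barcelona; Kauel Inc.; ESPOL Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: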

Abstract:This review marks the tenth anniversary of You Only Look Once (YOLO), one of the most influential frameworks in real-time object detection. Over the past decade, YOLO has evolved from a streamlined detector into a diverse family of architectures characterized by efficient design, modular scalability, and cross-domain adaptability. The paper presents a technical overview of the main versions, highlights key architectural trends, and surveys the principal application areas in which YOLO has been adopted. It also addresses evaluation practices, ethical considerations, and potential future directions for the framework’s continued development. The analysis aims to provide a comprehensive and critical perspective on YOLO’s trajectory and ongoing transformation.

[CV-147] Backdoor Defense in Diffusion Models via Spatial Attention Unlearning

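[Quick read]: This paper addresses the security of text-to-image diffusion models against backdoor attacks, in which malicious modifications to the training data cause the model to generate unintended outputs when specific triggers are present. The key to the solution is a new technique called Spatial Attention Unlearning (SAU), which uses latent space manipulation and spatial attention mechanisms to isolate and remove the latent representation of backdoor triggers, eliminating the malicious effects precisely and efficiently.

Link: https://arxiv.org/abs/2504.18563
Authors: Abha Jha, Ashwath Vaithinathan Aravindan, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Affiliations: University of Southern California
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: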

Abstract:Text-to-image diffusion models are increasingly vulnerable to backdoor attacks, where malicious modifications to the training data cause the model to generate unintended outputs when specific triggers are present. While classification models have seen extensive development of defense mechanisms, generative models remain largely unprotected due to their high-dimensional output space, which complicates the detection and mitigation of subtle perturbations. Defense strategies for diffusion models, in particular, remain under-explored. In this work, we propose Spatial Attention Unlearning (SAU), a novel technique for mitigating backdoor attacks in diffusion models. SAU leverages latent space manipulation and spatial attention mechanisms to isolate and remove the latent representation of backdoor triggers, ensuring precise and efficient removal of malicious effects. We evaluate SAU across various types of backdoor attacks, including pixel-based and style-based triggers, and demonstrate its effectiveness in achieving 100% trigger removal accuracy. Furthermore, SAU achieves a CLIP score of 0.7023, outperforming existing methods while preserving the model’s ability to generate high-quality, semantically aligned images. Our results show that SAU is a robust, scalable, and practical solution for securing text-to-image diffusion models against backdoor attacks.

[CV-148] Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware

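[Quick read]: This paper addresses the high computational and memory costs of pre-trained Vision Transformers (ViTs). Model quantization reduces memory usage by lowering precision, but dequantization before matrix operations still incurs significant computational overhead. The key to the solution is to analyze the computation graph and propose an integerization process based on operation reordering: dequantization is delayed until after the matrix operations, which enables integer matrix multiplication and linear modules that process quantized inputs directly and reduces the energy spent on computation.

Link: https://arxiv.org/abs/2504.18547
Authors: Ching-Yi Lin, Sahil Shah
Affiliations: University of Maryland
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 4 pages + references, 5 figures, 2 tables in IEEE double column conference template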

Abstract:Pre-trained vision transformers have achieved remarkable performance across various visual tasks but suffer from expensive computational and memory costs. While model quantization reduces memory usage by lowering precision, these models still incur significant computational overhead due to the dequantization before matrix operations. In this work, we analyze the computation graph and propose an integerization process based on operation reordering. Specifically, the process delays dequantization until after matrix operations. This enables integerized matrix multiplication and linear module by directly processing the quantized input. To validate our approach, we synthesize the self-attention module of ViT on a systolic array-based hardware. Experimental results show that our low-bit inference reduces per-PE power consumption for linear layer and matrix multiplication, bridging the gap between quantized models and efficient inference.
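
The reordering idea can be checked numerically with a small sketch. It assumes symmetric per-tensor 8-bit quantization, which the abstract does not specify; the paper targets low-bit settings and hardware synthesis, neither of which is modeled here, and `quantize` is a hypothetical helper.

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric per-tensor quantization: x is approximated by scale * q."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.max(np.abs(x))) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8))   # activations
w = rng.standard_normal((8, 3))   # weights
qa, sa = quantize(a)
qw, sw = quantize(w)

# Baseline order: dequantize both operands, then multiply in floating point.
early = (sa * qa) @ (sw * qw)
# Reordered: multiply the integers first, then apply the combined scale once.
late = (sa * sw) * (qa @ qw)

print(np.allclose(early, late))  # True
```

Because the scales are scalars, they factor out of the product, so the matrix multiplication itself can run entirely on integers and only the output needs a floating-point rescale.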

[CV-149] SST-DUNet: Automated preclinical functional MRI skull stripping using Smart Swin Transformer and Dense UNet

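[Quick read]: This paper addresses automated skull stripping (brain extraction) for preclinical functional magnetic resonance imaging (fMRI) data; the conventional manual approach is time-consuming and operator dependent, and existing methods perform poorly on data with low resolution and varying slice sizes. The key to the solution is SST-DUNet, which combines a dense UNet-based architecture with a feature extractor based on the Smart Swin Transformer (SST). Its Smart Shifted Window Multi-Head Self-Attention (SSW-MSA) module replaces the mask-based module of the Swin Transformer in order to learn channel-wise features and focus on relevant dependencies within brain structures, improving robustness to low resolution and variable slice sizes. To mitigate class imbalance, a combined loss function using Focal loss and Dice loss is adopted.

Link: https://arxiv.org/abs/2504.19937
Authors: Sima Soltanpour, Rachel Utama, Arnold Chang, Md Taufiq Nasseef, Dan Madularu, Praveen Kulkarni, Craig Ferris, Chris Joslin
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: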

Abstract:Skull stripping is a common preprocessing step that is often performed manually in Magnetic Resonance Imaging (MRI) pipelines, including functional MRI (fMRI). This manual process is time-consuming and operator dependent. Automating this process is challenging for preclinical data due to variations in brain geometry, resolution, and tissue contrast. While methods for MRI skull stripping exist, they often struggle with the low resolution and varying slice sizes in preclinical fMRI data. This study proposes a novel method called SST-DUNet that integrates a dense UNet-based architecture with a feature extractor based on Smart Swin Transformer (SST) for fMRI skull stripping. The Smart Shifted Window Multi-Head Self-Attention (SSW-MSA) module in SST is adapted to replace the mask-based module in the Swin Transformer (ST), enabling the learning of distinct channel-wise features while focusing on relevant dependencies within brain structures. This modification allows the model to better handle the complexities of fMRI skull stripping, such as low resolution and variable slice sizes. To address the issue of class imbalance in preclinical data, a combined loss function using Focal and Dice loss is utilized. The model was trained on rat fMRI images and evaluated across three in-house datasets with Dice similarity scores of 98.65%, 97.86%, and 98.04%. The fMRI results obtained through automatic skull stripping using the SST-DUNet model closely align with those from manual skull stripping for both seed-based and independent component analyses. These results indicate that the SST-DUNet can effectively substitute manual brain extraction in rat fMRI analysis.
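
As a rough illustration of a combined Focal and Dice loss for a binary brain mask, the sketch below uses the standard binary focal loss and a soft Dice loss. The abstract does not give the weighting or hyperparameters, so `gamma`, `alpha`, and the 0.5 mixing weight are assumptions, and the function names are hypothetical.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss: 1 minus the overlap ratio between prediction and mask."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))

def focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: cross-entropy down-weighted on well-classified pixels."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def combined_loss(prob, target, focal_weight=0.5):
    """Weighted sum of both terms; the 0.5 weight is an illustrative choice."""
    return (focal_weight * focal_loss(prob, target)
            + (1.0 - focal_weight) * dice_loss(prob, target))

mask = np.array([[0, 0, 1], [0, 1, 1]], dtype=float)
good = np.array([[0.1, 0.2, 0.9], [0.1, 0.8, 0.7]])
bad = 1.0 - good
print(combined_loss(good, mask) < combined_loss(bad, mask))  # True
```

The focal term handles pixel-level imbalance by shrinking the contribution of easy pixels, while the Dice term scores region overlap directly, which is why the two are often paired for segmentation with small foreground regions.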

[CV-150] Accelerated 3D-3D rigid registration of echocardiographic images obtained from apical window using particle filter

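[Quick read]: This paper addresses the challenges of 3D-3D rigid registration of transthoracic echocardiographic images acquired from different views, particularly with both significant and limited overlap, and the need for robustness to the noise and intensity variation of ultrasound images. The key to the solution is an accelerated sequential Monte Carlo (SMC) algorithm that estimates the translational and rotational components of the rigid transform through an iterative process and uses image-based and mask-based approaches to register all frames of the cardiac cycle with a single transform, improving registration efficiency and accuracy.

Link: https://arxiv.org/abs/2504.19930
Authors: Thanuja Uruththirakodeeswaran, Harald Becher, Michelle Noga, Lawrence H. Le, Pierre Boulanger, Jonathan Windram, Kumaradevan Punithakumar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: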

Abstract:The perfect alignment of 3D echocardiographic images captured from various angles has improved image quality and broadened the field of view. This study proposes an accelerated sequential Monte Carlo (SMC) algorithm for 3D-3D rigid registration of transthoracic echocardiographic images with significant and limited overlap taken from apical window that is robust to the noise and intensity variation in ultrasound images. The algorithm estimates the translational and rotational components of the rigid transform through an iterative process and requires an initial approximation of the rotation and translation limits. We perform registration in two ways: the image-based registration computes the transform to align the end-diastolic frame of the apical nonstandard image to the apical standard image and applies the same transform to all frames of the cardiac cycle, whereas the mask-based registration approach uses the binary masks of the left ventricle in the same way. The SMC and exhaustive search (EX) algorithms were evaluated for 4D temporal sequences recorded from 7 volunteers who participated in a study conducted at the Mazankowski Alberta Heart Institute. The evaluations demonstrate that the mask-based approach of the accelerated SMC yielded a Dice score value of 0.819 +/- 0.045 for the left ventricle and gained 16.7x speedup compared to the CPU version of the SMC algorithm.
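
The step of estimating one rigid transform and reusing it for every frame of the cardiac cycle can be sketched as follows. This is a simplification that moves 3D point sets instead of resampling image volumes, it does not implement the SMC estimation itself, and the Euler-angle convention and function names are assumptions made for illustration.

```python
import numpy as np

def rigid_transform(angles_rad, translation):
    """Build a 4x4 homogeneous matrix from X, Y, Z rotation angles and a translation."""
    rx, ry, rz = angles_rad
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    mat = np.eye(4)
    # Assumed convention: rotate about X first, then Y, then Z.
    mat[:3, :3] = rot_z @ rot_y @ rot_x
    mat[:3, 3] = translation
    return mat

def apply_to_frames(mat, frames):
    """Apply one rigid transform to every frame; frames has shape (T, N, 3)."""
    return frames @ mat[:3, :3].T + mat[:3, 3]

# Transform estimated once (here simply chosen by hand), then reused for all frames.
mat = rigid_transform((0.0, 0.0, np.pi / 2), (1.0, 2.0, 3.0))
frames = np.array([[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]])
print(apply_to_frames(mat, frames).round(6))
```

Reusing a single transform keeps the relative motion between frames untouched, so the cardiac dynamics are preserved while the two acquisitions are brought into a common coordinate system.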

[CV-151] Dual Attention Driven Lumbar Magnetic Resonance Image Feature Enhancement and Automatic Diagnosis of Herniation

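[Quick read]: This paper addresses the delayed diagnosis and high training costs that arise because the clinical diagnosis of lumbar disc herniation (LDH) depends on radiologists' expertise. The key to the solution is an automated LDH classification framework that uses T1-weighted and T2-weighted magnetic resonance imaging (MRI) images and combines data augmentation with channel and spatial attention mechanisms to extract clinically meaningful LDH features and generate standardized diagnostic outputs, improving diagnostic efficiency and accuracy.

Link: https://arxiv.org/abs/2504.19438
Authors: Lingrui Zhang, Liang Guo, Xiao An, Feng Lin, Binlong Zheng, Jiankun Wang, Zhirui Li
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures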

Abstract:Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To address these key issues, the framework utilizes T1-weighted and T2-weighted MRI images from 205 people. The framework extracts clinically actionable LDH features and generates standardized diagnostic outputs by leveraging data augmentation and channel and spatial attention mechanisms. These outputs can help physicians make confident and time-effective care decisions when needed. The proposed framework achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.969 and an accuracy of 0.9486 for LDH detection. The experimental results demonstrate the performance of the proposed framework. Our framework only requires a small number of datasets for training to demonstrate high diagnostic accuracy. This is expected to be a solution to enhance the LDH detection capabilities of primary hospitals.

[CV-152] Innovative Integration of 4D Cardiovascular Reconstruction and Hologram: A New Visualization Tool for Coronary Artery Bypass Grafting Planning

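[Quick read]: This paper addresses the difficulty of visualizing and assessing complex cardiac anatomy during preoperative planning for coronary artery bypass grafting (CABG), in particular the accurate identification of coronary artery depth, degree of calcification, and pericardial adhesions. The key to the solution is a semi-automated workflow based on 4D cardiac computed tomography angiography (CTA) data that performs time-resolved segmentation of cardiac structures, epicardial adipose tissue (EAT), and coronary arteries, combined with calcium scoring, visualization of coronary depth within EAT, and pericardial adhesion assessment through motion analysis, ultimately producing dynamic cardiovascular holograms to support preoperative planning.

Link: https://arxiv.org/abs/2504.19401
Authors: Shuo Wang, Tong Ren, Nan Cheng, Li Zhang, Rong Wang
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments: 35 pages, 9 figures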

Abstract: Background: Coronary artery bypass grafting (CABG) planning requires advanced spatial visualization and consideration of coronary artery depth, calcification, and pericardial adhesions. Objective: To develop and evaluate a dynamic cardiovascular holographic visualization tool for preoperative CABG planning. Methods: Using 4D cardiac computed tomography angiography data from 14 CABG candidates, we developed a semi-automated workflow for time-resolved segmentation of cardiac structures, epicardial adipose tissue (EAT), and coronary arteries with calcium scoring. The workflow incorporated methods for cardiac segmentation, coronary calcification quantification, visualization of coronary depth within EAT, and pericardial adhesion assessment through motion analysis. Dynamic cardiovascular holograms were displayed using the Looking Glass platform. Thirteen cardiac surgeons evaluated the tool using a Likert scale. Additionally, pericardial adhesion scores from holograms of 21 patients (including seven undergoing secondary cardiac surgeries) were compared with intraoperative findings. Results: Surgeons rated the visualization tool highly for preoperative planning utility (mean Likert score: 4.57/5.0). Hologram-based pericardial adhesion scoring strongly correlated with intraoperative findings (r=0.786, P<0.001). Conclusion: This study establishes a visualization framework for CABG planning that produces clinically relevant dynamic holograms from patient-specific data, with clinical feedback confirming its effectiveness for preoperative planning.

[CV-153] Low-Rank Adaptive Structural Priors for Generalizable Diabetic Retinopathy Grading IJCNN2025

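[Quick read]: This paper addresses the significant performance drop that deep learning methods suffer under domain shifts in diabetic retinopathy (DR) grading. Existing domain generalization (DG) methods usually overlook lesion-specific features, which hurts diagnostic accuracy. The key to the proposed solution is Low-rank Adaptive Structural Priors (LoASP), which augments existing DG methods with structural prior information to learn adaptive structural representations, improving generalization across domain scenarios.

Link: https://arxiv.org/abs/2504.19362
Authors: Yunxuan Wang, Ray Yin, Yumei Tan, Hao Chen, Haiying Xia
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCNN 2025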

Abstract:Diabetic retinopathy (DR), a serious ocular complication of diabetes, is one of the primary causes of vision loss among retinal vascular diseases. Deep learning methods have been extensively applied in the grading of diabetic retinopathy (DR). However, their performance declines significantly when applied to data outside the training distribution due to domain shifts. Domain generalization (DG) has emerged as a solution to this challenge. However, most existing DG methods overlook lesion-specific features, resulting in insufficient accuracy. In this paper, we propose a novel approach that enhances existing DG methods by incorporating structural priors, inspired by the observation that DR grading is heavily dependent on vessel and lesion structures. We introduce Low-rank Adaptive Structural Priors (LoASP), a plug-and-play framework designed for seamless integration with existing DG models. LoASP improves generalization by learning adaptive structural representations that are finely tuned to the complexities of DR diagnosis. Extensive experiments on eight diverse datasets validate its effectiveness in both single-source and multi-source domain scenarios. Furthermore, visualizations reveal that the learned structural priors intuitively align with the intricate architecture of the vessels and lesions, providing compelling insights into their interpretability and diagnostic relevance.

[CV-154] Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction

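[Quick read]: This paper addresses the limited generalization of MRI-based deep learning models for knee osteoarthritis (KOA) prediction, especially when transferring between different imaging data sources. The key to the solution is replacing batch normalization with instance normalization, using data augmentation, and introducing a contrastive loss, which together improve generalization across imaging domains.

Link: https://arxiv.org/abs/2504.19203
Authors: Ehsan Karami, Hamid Soltanian-Zadeh
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: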

Abstract:Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we have shown that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves model generalization in a baseline deep learning model for knee osteoarthritis (KOA) prediction. We trained and evaluated our model using MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrate a statistically significant improvement in classification accuracy across both domains, with our approach outperforming the baseline model.
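
One intuition for why instance normalization may help across imaging sources is that it standardizes each sample with its own statistics, so a global intensity rescaling between scanners or sequences largely cancels out. The sketch below illustrates this on a single channel with hypothetical intensity values; it is not the authors' model, and the 10x intensity difference is an assumption made for the example.

```python
from math import sqrt
from typing import List

EPS = 1e-5  # small constant for numerical stability, as in common implementations


def instance_norm(sample: List[float]) -> List[float]:
    """Normalize one sample (one channel) with its own mean and variance."""
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [(v - mean) / sqrt(var + EPS) for v in sample]


def batch_norm(batch: List[List[float]]) -> List[List[float]]:
    """Normalize every sample with statistics pooled over the whole batch."""
    values = [v for sample in batch for v in sample]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / sqrt(var + EPS) for v in sample] for sample in batch]


# Hypothetical intensities: the same anatomy from two sources,
# the second with a 10x brighter intensity scale.
source_like = [1.0, 2.0, 3.0, 4.0]
target_like = [10.0, 20.0, 30.0, 40.0]

inst_a, inst_b = instance_norm(source_like), instance_norm(target_like)
bn_a, bn_b = batch_norm([source_like, target_like])

# Largest per-pixel disagreement between the two normalized samples.
inst_gap = max(abs(a - b) for a, b in zip(inst_a, inst_b))
bn_gap = max(abs(a - b) for a, b in zip(bn_a, bn_b))
print(f"instance norm gap: {inst_gap:.6f}, batch norm gap: {bn_gap:.3f}")
```

After instance normalization the two samples are nearly identical, while the pooled batch statistics leave the intensity gap in place.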

[CV-155] Leveraging Modified Ex Situ Tomography Data for Segmentation of In Situ Synchrotron X-Ray Computed Tomography

[Quick read]: This paper addresses the difficulty of automated segmentation in in situ synchrotron X-ray computed tomography, which is caused by complex imaging artefacts and limited training data. The key to the solution is training a deep learning model on high-quality ex situ laboratory data to perform binary segmentation of in situ data. The modified SegFormer architecture achieves high segmentation performance on unseen data and reduces processing time from hours to seconds per 3D dataset.

Link: https://arxiv.org/abs/2504.19200
Authors: Tristan Manchester, Adam Anders, Julio Spadotto, Hannah Eccleston, William Beavan, Hugues Arcis, Brian J. Connolly
Affiliations: University of Manchester; Henry Royce Institute; United Kingdom National Nuclear Laboratory
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In situ synchrotron X-ray computed tomography enables dynamic material studies, but automated segmentation remains challenging due to complex imaging artefacts and limited training data. We present a methodology for deep learning-based segmentation by transforming high-quality ex situ laboratory data to train models for binary segmentation of in situ synchrotron data, demonstrated through copper oxide dissolution studies. Using a modified SegFormer architecture, our approach achieves high segmentation performance on unseen data while reducing processing time from hours to seconds per 3D dataset. The method maintains consistent performance over significant morphological changes during experiments, despite training only on static specimens. This methodology can be readily applied to diverse materials systems, accelerating the analysis of time-resolved tomographic data across scientific disciplines.

[CV-156] Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities

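[Quick read]: This paper addresses the application of automated Surgical Phase Recognition (SPR) to complex, non-linear surgical procedures, and in particular whether temporal context affects experts' ability to classify surgical phases. The key is to introduce temporal context to improve SPR accuracy and to validate its effect on Robot-Assisted Partial Nephrectomy (RAPN). The study collects judgments from urologists of varying experience through a custom web platform and compares AI models without and with temporal context. The results show that adding temporal information markedly improves classification and that human experts and AI perform comparably when given the same context.

Link: https://arxiv.org/abs/2504.18954
Authors: Marco Mezzina, Pieter De Backer, Tom Vercauteren, Matthew Blaschko, Alexandre Mottrie, Tinne Tuytelaars
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: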

Abstract:Purpose: Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events, functioning as a building block for efficient video review, surgical education as well as skill assessment. Previous research has focused on short and linear surgical procedures and has not explored if temporal context influences experts’ ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure. Methods: Urologists of varying expertise were grouped and tasked to indicate the surgical phase for RAPN on both single frames and video snippets using a custom-made web platform. Participants reported their confidence levels and the visual landmarks used in their decision-making. AI architectures without and with temporal context as trained and benchmarked on the Cholec80 dataset were subsequently trained on this RAPN dataset. Results: Video snippets and presence of specific visual landmarks improved phase classification accuracy across all groups. Surgeons displayed high confidence in their classifications and outperformed novices, who struggled discriminating phases. The performance of the AI models is comparable to the surgeons in the survey, with improvements when temporal context was incorporated in both cases. Conclusion: SPR is an inherently complex task for expert surgeons and computer vision, where both perform equally well when given the same context. Performance increases when temporal information is provided. Surgical tools and organs form the key landmarks for human interpretation and are expected to shape the future of automated SPR.

[CV-157] Reservoir-enhanced Segment Anything Model for Subsurface Diagnosis

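[Quick read]: This paper addresses the difficulty of detecting subsurface anomalies such as cracks and cavities in urban underground structures, in particular the limited labeled data, varying subsurface conditions, and indistinct target boundaries that hinder accurate anomaly detection with Ground Penetrating Radar (GPR). The key to the solution is a framework called the Reservoir-enhanced Segment Anything Model (Res-SAM), which exploits both the visual discernibility and the electromagnetic wave-changing properties of GPR data: it analyzes anomaly-induced changes within and between electromagnetic waves in local GPR data to extract and classify anomaly regions precisely.

Link: https://arxiv.org/abs/2504.18802
Authors: Xiren Zhou, Shikang Liu, Xinyu Yan, Yizhan Fan, Xiangyu Wang, Yu Kang, Jian Cheng, Huanhuan Chen
Affiliations: University of Science and Technology of China; Tsinghua University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: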

Abstract:Urban roads and infrastructure, vital to city operations, face growing threats from subsurface anomalies like cracks and cavities. Ground Penetrating Radar (GPR) effectively visualizes underground conditions employing electromagnetic (EM) waves; however, accurate anomaly detection via GPR remains challenging due to limited labeled data, varying subsurface conditions, and indistinct target boundaries. Although visually image-like, GPR data fundamentally represent EM waves, with variations within and between waves critical for identifying anomalies. Addressing these, we propose the Reservoir-enhanced Segment Anything Model (Res-SAM), an innovative framework exploiting both visual discernibility and wave-changing properties of GPR data. Res-SAM initially identifies apparent candidate anomaly regions given minimal prompts, and further refines them by analyzing anomaly-induced changing information within and between EM waves in local GPR data, enabling precise and complete anomaly region extraction and category determination. Real-world experiments demonstrate that Res-SAM achieves high detection accuracy (85%) and outperforms state-of-the-art. Notably, Res-SAM requires only minimal accessible non-target data, avoids intensive training, and incorporates simple human interaction to enhance reliability. Our research provides a scalable, resource-efficient solution for rapid subsurface anomaly detection across diverse environments, improving urban safety monitoring while reducing manual effort and computational cost.

[CV-158] Validation and Calibration of Semi-Analytical Models for the Event Horizon Telescope Observations of Sagittarius A*

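[Quick read]: This paper addresses the high computational cost of studying black hole accretion flows by fitting ray-traced physical models to Event Horizon Telescope (EHT) observations. The key to the solution is using ALINet, a generative machine learning model, to efficiently produce radiatively inefficient accretion flow (RIAF) images and interpolate them as a function of specified physical parameters. This makes it possible to estimate the uncertainty introduced by unmodeled physical effects, such as interstellar scattering and intrinsic source variability, and then to calibrate the physical parameters and associated uncertainties obtained from RIAF model fits to mock EHT data.

Link: https://arxiv.org/abs/2504.18624
Authors: Ali SaraerToosi, Avery Broderick
Affiliations: Perimeter Institute for Theoretical Physics; University of Waterloo; Department of Physics and Astronomy; Department of Computer Science
Categories: High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 26 pages, 9 figures, 2 tables, submitted to ApJ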

Abstract: The Event Horizon Telescope (EHT) enables the exploration of black hole accretion flows at event-horizon scales. Fitting ray-traced physical models to EHT observations requires the generation of synthetic images, a task that is computationally demanding. This study leverages ALINet, a generative machine learning model, to efficiently produce radiatively inefficient accretion flow (RIAF) images as a function of the specified physical parameters. ALINet has previously been shown to be able to interpolate black hole images and their associated physical parameters after training on a computationally tractable set of library images. We utilize this model to estimate the uncertainty introduced by a number of anticipated unmodeled physical effects, including interstellar scattering and intrinsic source variability. We then use this to calibrate physical parameter estimates and their associated uncertainties from RIAF model fits to mock EHT data via a library of general relativistic magnetohydrodynamics models.

[CV-159] Dual-Modality Computational Ophthalmic Imaging with Deep Learning and Coaxial Optical Design

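[Quick read]: This paper addresses the accessibility and efficiency of eye screening in the face of the growing burden of myopia and retinal diseases. The key to the solution is a compact dual-function optical device that integrates fundus photography and refractive error detection into a unified platform. It uses a coaxial optical design with dichroic mirrors to separate wavelength-dependent imaging paths, enabling simultaneous alignment of the fundus and refraction modules, and a Dense-U-Net-based algorithm for accurate pupil segmentation, enabling automated alignment and focusing.

Link: https://arxiv.org/abs/2504.18549
Authors: Boyuan Peng, Jiaju Chen, Yiwei Zhang, Cuiyi Peng, Junyang Li, Jiaming Deng, Peiwu Qin
Affiliations: Shenzhen International Graduate School, Tsinghua University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: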

Abstract:The growing burden of myopia and retinal diseases necessitates more accessible and efficient eye screening solutions. This study presents a compact, dual-function optical device that integrates fundus photography and refractive error detection into a unified platform. The system features a coaxial optical design using dichroic mirrors to separate wavelength-dependent imaging paths, enabling simultaneous alignment of fundus and refraction modules. A Dense-U-Net-based algorithm with customized loss functions is employed for accurate pupil segmentation, facilitating automated alignment and focusing. Experimental evaluations demonstrate the system’s capability to achieve high-precision pupil localization (EDE = 2.8 px, mIoU = 0.931) and reliable refractive estimation with a mean absolute error below 5%. Despite limitations due to commercial lens components, the proposed framework offers a promising solution for rapid, intelligent, and scalable ophthalmic screening, particularly suitable for community health settings.

[CV-160] Exploring Visual Complaints through a test battery in Acquired Brain Injury Patients: A Detailed Analysis of the DiaNAH Dataset

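[Quick read]: This paper seeks to understand the complex relationship between subjective reports of visual impairment and objective visual perceptual assessments in patients with Acquired Brain Injury (ABI). The key to the approach is using automated machine learning (AutoML) to handle missing data while preserving the distributional characteristics of the original dataset, and using linear correlation analysis to examine the relationship between patient-reported visual symptoms and standard visual perceptual function tests.

Link: https://arxiv.org/abs/2504.18540
Authors: Gonçalo Hora de Carvalho
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: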

Abstract:This study investigated visual impairment complaints in a sample of 948 Acquired Brain Injury (ABI) patients using the DiaNAH dataset, emphasizing advanced machine learning techniques for managing missing data. Patients completed a CVS questionnaire capturing eight types of visual symptoms, including blurred vision and altered contrast perception. Due to incomplete data, 181 patients were excluded, resulting in an analytical subset of 767 individuals. To address the challenge of missing data, an automated machine learning (AutoML) approach was employed for data imputation, preserving the distributional characteristics of the original dataset. Patients were grouped according to singular and combined complaint clusters derived from the 40,320 potential combinations identified through the CVS questionnaire. A linear correlation analysis revealed minimal to no direct relationship between patient-reported visual complaints and standard visual perceptual function tests. This study represents an initial systematic attempt to understand the complex relationship between subjective visual complaints and objective visual perceptual assessments in ABI patients. Given the limitations of sample size and variability, further studies with larger populations are recommended to robustly explore these complaint clusters and their implications for visual perception following brain injury.
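
The figure of 40,320 quoted in the abstract coincides with 8!, the number of ways to order eight distinct items. The abstract does not say how the authors derived it, so reading it as 8! is an assumption. For comparison, the number of present/absent patterns over eight symptom types is 2^8 = 256. A quick check:

```python
from itertools import permutations
from math import factorial

SYMPTOM_TYPES = 8  # the abstract mentions eight types of visual symptoms

orderings = factorial(SYMPTOM_TYPES)  # ordered arrangements of all eight
subsets = 2 ** SYMPTOM_TYPES          # present/absent patterns, including "none"

# Cross-check the factorial by brute-force enumeration.
enumerated = sum(1 for _ in permutations(range(SYMPTOM_TYPES)))

print(orderings, subsets, enumerated)
```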

Artificial Intelligence

[AI-0] Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models

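[Quick read]: This paper addresses key limitations of current large language models (LLMs) in reasoning, factual consistency, and interpretability. The core of the solution is a new learning paradigm, Modular Machine Learning (MML), which decomposes the complex structure of LLMs into three interdependent components (modular representation, modular model, and modular reasoning) to strengthen counterfactual reasoning, reduce hallucinations, and promote fairness, safety, and transparency. The key to MML is disentangling semantic components to clarify the internal working mechanism of LLMs, enabling flexible, task-adaptive model design and supporting interpretable, logic-driven decision-making.

Link: https://arxiv.org/abs/2504.20020
Authors: Xin Wang, Haoyang Li, Zeyang Zhang, Haibo Chen, Wenwu Zhu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures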

Abstract: Large language models (LLMs) have dramatically advanced machine learning research including natural language processing, computer vision, data mining, etc., yet they still exhibit critical limitations in reasoning, factual consistency, and interpretability. In this paper, we introduce a novel learning paradigm – Modular Machine Learning (MML) – as an essential approach toward new-generation LLMs. MML decomposes the complex structure of LLMs into three interdependent components: modular representation, modular model, and modular reasoning, aiming to enhance LLMs’ capability of counterfactual reasoning, mitigating hallucinations, as well as promoting fairness, safety, and transparency. Specifically, the proposed MML paradigm can: i) clarify the internal working mechanism of LLMs through the disentanglement of semantic components; ii) allow for flexible and task-adaptive model design; iii) enable interpretable and logic-driven decision-making process. We present a feasible implementation of MML-based LLMs via leveraging advanced techniques such as disentangled representation learning, neural architecture search and neuro-symbolic learning. We critically identify key challenges, such as the integration of continuous neural and discrete symbolic processes, joint optimization, and computational scalability, and present promising future research directions that deserve further exploration. Ultimately, the integration of the MML paradigm with LLMs has the potential to bridge the gap between statistical (deep) learning and formal (logical) reasoning, thereby paving the way for robust, adaptable, and trustworthy AI systems across a wide range of real-world applications.

[AI-1] Modelling of Underwater Vehicles using Physics-Informed Neural Networks with Control IJCNN

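[Quick read]: This paper addresses the weak long-horizon prediction and physical consistency of conventional data-driven models, particularly when modelling underwater vehicle dynamics. The key to the solution is the Physics-Informed Neural Network with Control (PINC) framework, which integrates physical laws with data-driven models to improve generalization and sample efficiency, allowing the model to make physically consistent state transitions beyond the training domain.

Link: https://arxiv.org/abs/2504.20019
Authors: Abdelhakim Amer, David Felsager, Yury Brodskiy, Andriy Sarabakha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: This paper has been accepted for presentation at the International Joint Conference on Neural Networks (IJCNN) 2025. The final version consists of 8 pages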

Abstract: Physics-informed neural networks (PINNs) integrate physical laws with data-driven models to improve generalization and sample efficiency. This work introduces an open-source implementation of the Physics-Informed Neural Network with Control (PINC) framework, designed to model the dynamics of an underwater vehicle. Using initial states, control actions, and time inputs, PINC extends PINNs to enable physically consistent transitions beyond the training domain. Various PINC configurations are tested, including differing loss functions, gradient-weighting schemes, and hyperparameters. Validation on a simulated underwater vehicle demonstrates more accurate long-horizon predictions compared to a non-physics-informed baseline.
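
To make the idea of a physics-informed loss with control inputs concrete, here is a toy sketch. It is not the paper's implementation: the one-dimensional dynamics dv/dt = -DRAG * v + thrust, the coefficient `DRAG`, and every function name are assumptions chosen for illustration, and no neural network is trained. The sketch only shows how a data misfit and a weighted physics residual can be combined for a model that takes time, initial state, and control as inputs.

```python
from math import exp

DRAG = 0.5  # hypothetical linear drag coefficient


def exact_speed(t, v0, thrust):
    """Closed-form solution of dv/dt = -DRAG * v + thrust with v(0) = v0."""
    steady = thrust / DRAG
    return steady + (v0 - steady) * exp(-DRAG * t)


def naive_model(t, v0, thrust):
    """A candidate that matches the data at t = 0 but ignores the dynamics."""
    return v0


def physics_residual(model, t, v0, thrust, h=1e-4):
    """Residual dv/dt + DRAG * v - thrust, with dv/dt by central differences."""
    dv_dt = (model(t + h, v0, thrust) - model(t - h, v0, thrust)) / (2 * h)
    return dv_dt + DRAG * model(t, v0, thrust) - thrust


def pinc_style_loss(model, data, collocation, weight=1.0):
    """Mean squared data misfit plus weighted mean squared physics residual."""
    data_loss = sum((model(t, v0, u) - v) ** 2 for t, v0, u, v in data) / len(data)
    phys_loss = sum(physics_residual(model, t, v0, u) ** 2
                    for t, v0, u in collocation) / len(collocation)
    return data_loss + weight * phys_loss


data = [(0.0, 1.0, 2.0, 1.0)]  # (t, v0, thrust, observed speed)
collocation = [(t / 10, 1.0, 2.0) for t in range(1, 21)]  # (t, v0, thrust)

loss_exact = pinc_style_loss(exact_speed, data, collocation)
loss_naive = pinc_style_loss(naive_model, data, collocation)
print(f"exact: {loss_exact:.2e}, naive: {loss_naive:.2f}")
```

Both candidates fit the single data point, but only the one that respects the dynamics keeps the physics term small, which is the kind of signal a physics-informed loss adds beyond the data.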

[AI-2] MINT: Multi-Vector Search Index Tuning

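[Quick read]: This paper addresses index tuning in multi-vector databases, a problem that is especially important in multi-modal and multi-feature scenarios. Index selection for multi-vector search has a significant impact on performance, yet unlike index tuning for relational databases it still lacks a clear and effective solution. The paper proposes a framework whose algorithms, given a multi-vector search workload, find indexes that minimize latency while meeting storage and recall constraints. The key is an efficient index optimization algorithm designed for the multi-vector setting, which yields a 2.1x to 8.3x latency speedup over the baseline.

Link: https://arxiv.org/abs/2504.20018
Authors: Jiongli Zhu, Yue Wang, Bailu Ding, Philip A. Bernstein, Vivek Narasayya, Surajit Chaudhuri
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: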

Abstract:Vector search plays a crucial role in many real-world applications. In addition to single-vector search, multi-vector search becomes important for multi-modal and multi-feature scenarios today. In a multi-vector database, each row is an item, each column represents a feature of items, and each cell is a high-dimensional vector. In multi-vector databases, the choice of indexes can have a significant impact on performance. Although index tuning for relational databases has been extensively studied, index tuning for multi-vector search remains unclear and challenging. In this paper, we define multi-vector search index tuning and propose a framework to solve it. Specifically, given a multi-vector search workload, we develop algorithms to find indexes that minimize latency and meet storage and recall constraints. Compared to the baseline, our latency achieves 2.1X to 8.3X speedup.
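
The data model in the abstract (rows are items, columns are features, cells are vectors) can be illustrated with a brute-force search that scores every row. This is only a sketch of the exhaustive baseline that indexes aim to accelerate, not the paper's tuning algorithm. The weighted sum of per-column dot products, the column names, and the weights are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Row = Dict[str, Vector]  # column name -> vector for that feature


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def multi_vector_search(rows: List[Row], query: Row,
                        weights: Dict[str, float], k: int) -> List[Tuple[int, float]]:
    """Exhaustively score every row and return the top-k (row id, score) pairs."""
    scored = []
    for row_id, row in enumerate(rows):
        # Assumed aggregation: weighted sum of per-column similarities.
        score = sum(weights[col] * dot(row[col], query[col]) for col in query)
        scored.append((row_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical two-column table: an image embedding and a text embedding.
rows = [
    {"image": [1.0, 0.0], "text": [0.0, 1.0]},
    {"image": [0.0, 1.0], "text": [1.0, 0.0]},
    {"image": [1.0, 0.0], "text": [1.0, 0.0]},
]
query = {"image": [1.0, 0.0], "text": [1.0, 0.0]}
top = multi_vector_search(rows, query, {"image": 0.5, "text": 0.5}, k=2)
print(top)
```

The exhaustive scan has perfect recall but touches every cell. An index trades some recall and extra storage for lower latency, which is the trade-off the tuning problem in the abstract optimizes.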

[AI-3] Towards Automated Scoping of AI for Social Good Projects

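[Quick read]: This paper addresses the problem scoping bottleneck common to AI for Social Good (AI4SG) projects, a process that is complex and resource-intensive because professionals with both technical and domain expertise are scarce. The key to the solution is a Problem Scoping Agent (PSA) based on a large language model (LLM), which uses scientific literature and real-world knowledge to generate comprehensive project proposals, improving the efficiency and quality of problem scoping.

Link: https://arxiv.org/abs/2504.20010
Authors: Jacob Emmerson, Rayid Ghani, Zheyuan Ryan Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: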

Abstract:Artificial Intelligence for Social Good (AI4SG) is an emerging effort that aims to address complex societal challenges with the powerful capabilities of AI systems. These challenges range from local issues with transit networks to global wildlife preservation. However, regardless of scale, a critical bottleneck for many AI4SG initiatives is the laborious process of problem scoping – a complex and resource-intensive task – due to a scarcity of professionals with both technical and domain expertise. Given the remarkable applications of large language models (LLM), we propose a Problem Scoping Agent (PSA) that uses an LLM to generate comprehensive project proposals grounded in scientific literature and real-world knowledge. We demonstrate that our PSA framework generates proposals comparable to those written by experts through a blind review and AI evaluations. Finally, we document the challenges of real-world problem scoping and note several areas for future work.

[AI-4] Simplified and Secure MCP Gateways for Enterprise AI Integration

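[Quick read]: This paper addresses the integration security problems that enterprise AI agents face when adopting the Model Context Protocol (MCP). Existing public MCP server solutions do not meet enterprise needs for self-hosting and security, so the paper proposes the MCP Gateway. The key is combining security principles, authentication, intrusion detection, and secure tunneling to enable secure self-hosted deployment without exposing enterprise infrastructure.

Link: https://arxiv.org/abs/2504.19997
Authors: Ivo Brett
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: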

Abstract:The increased adoption of the Model Context Protocol (MCP) for AI Agents necessitates robust security for Enterprise integrations. This paper introduces the MCP Gateway to simplify self-hosted MCP server integration. The proposed architecture integrates security principles, authentication, intrusion detection, and secure tunneling, enabling secure self-hosting without exposing infrastructure. Key contributions include a reference architecture, threat model mapping, simplified integration strategies, and open-source implementation recommendations. This work focuses on the unique challenges of enterprise-centric, self-hosted AI integrations, unlike existing public MCP server solutions.

[AI-5] Mitigating Societal Cognitive Overload in the Age of AI: Challenges and Directions

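[Quick read]: This paper addresses the challenge of societal cognitive overload in the age of AI, which is driven by the information explosion and the complexity of AI and threatens human well-being and societal resilience. The paper argues that mitigating cognitive overload is not only key to improving present-day quality of life but also a necessary prerequisite for handling the potential risks of advanced AI, including existential threats. The key to the approach is recentering the AI safety discussion on cognitive overload as a bridge between near-term harms and long-term risks, and exploring institutional adaptations, research directions, and policy considerations toward a more overload-resilient framework for human-AI alignment.

Link: https://arxiv.org/abs/2504.19990
Authors: Salem Lahlou
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: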

Abstract:Societal cognitive overload, driven by the deluge of information and complexity in the AI age, poses a critical challenge to human well-being and societal resilience. This paper argues that mitigating cognitive overload is not only essential for improving present-day life but also a crucial prerequisite for navigating the potential risks of advanced AI, including existential threats. We examine how AI exacerbates cognitive overload through various mechanisms, including information proliferation, algorithmic manipulation, automation anxieties, deregulation, and the erosion of meaning. The paper reframes the AI safety debate to center on cognitive overload, highlighting its role as a bridge between near-term harms and long-term risks. It concludes by discussing potential institutional adaptations, research directions, and policy considerations that arise from adopting an overload-resilient perspective on human-AI alignment, suggesting pathways for future exploration rather than prescribing definitive solutions.

[AI-6] Real-Time Imitation of Human Head Motions Blinks and Emotions by Nao Robot: A Closed-Loop Approach

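[Quick read]: This paper addresses how to achieve real-time imitation of human head motion in human-robot interaction in order to improve the interaction experience. The key to the solution is combining MediaPipe (a computer vision library) and DeepFace (an emotion recognition library) to capture the subtleties of human head motion, including blinks and emotional expressions, and to incorporate these indicators into the robot's responses, yielding an accurate head imitation framework based on closed-loop feedback.

Link: https://arxiv.org/abs/2504.19985
Authors: Keyhan Rayati, Amirhossein Feizi, Alireza Beigy, Pourya Shahverdi, Mehdi Tale Masouleh, Ahmad Kalhor
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: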

Abstract: This paper introduces a novel approach for enabling real-time imitation of human head motion by a Nao robot, with a primary focus on elevating human-robot interactions. By using MediaPipe as a computer vision library and DeepFace as an emotion recognition library, this research endeavors to capture the subtleties of human head motion, including blink actions and emotional expressions, and seamlessly incorporate these indicators into the robot’s responses. The result is a comprehensive framework which facilitates precise head imitation within human-robot interactions, utilizing a closed-loop approach that involves gathering real-time feedback from the robot’s imitation performance. This feedback loop ensures a high degree of accuracy in modeling head motion, as evidenced by an impressive R² score of 96.3 for pitch and 98.9 for yaw. Notably, the proposed approach holds promise in improving communication for children with autism, offering them a valuable tool for more effective interaction. In essence, the proposed work explores the integration of real-time head imitation and real-time emotion recognition to enhance human-robot interactions, with potential benefits for individuals with unique communication needs.
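
The R² values in the abstract are presumably percentages, since the coefficient of determination cannot exceed 1; that reading is an assumption. As a minimal sketch with hypothetical yaw angles (not data from the paper), R² = 1 - SS_res / SS_tot compares the robot's head angle with the human reference:

```python
from typing import Sequence


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the reference signal is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical yaw angles in degrees: human head (reference) vs. robot head.
human_yaw = [-20.0, -10.0, 0.0, 10.0, 20.0]
robot_yaw = [-18.0, -11.0, 1.0, 9.0, 21.0]
yaw_r2 = r2_score(human_yaw, robot_yaw)
print(f"R^2 = {yaw_r2:.3f} ({100 * yaw_r2:.1f}%)")
```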

[AI-7] How Group Lives Go Well

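[Quick read]: This paper addresses the shortcomings of traditional well-being theories in modeling group flourishing, in particular their difficulty explaining and representing group welfare when individual sacrifices advance society as a whole. The key to the solution is refining and extending the Counterfactual Account (CT) by introducing a model based on Basic Formal Ontology (BFO) that evaluates group flourishing in terms of group functions, where members bear roles and exhibit persistence conditions akin to biological systems or designed artifacts, enabling structured reasoning about group welfare, social institutions, and long-term social contributions.

Link: https://arxiv.org/abs/2504.19968
Authors: John Beverley, Regina Hurley
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: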

Abstract:This paper explores the ontological space of group well-being, proposing a framework for representing collective welfare, group functions, and long-term contributions within an ontology engineering context. Traditional well-being theories focus on individual states, often relying on hedonistic, desire-satisfaction, or objective-list models. Such approaches struggle to account for cases where individual sacrifices contribute to broader social progress, a critical challenge in modeling group flourishing. To address this, the paper refines and extends the Counterfactual Account (CT) of well-being, which evaluates the goodness of an event by comparing an individual’s actual well-being with that of a hypothetical counterpart in a nearby possible world. While useful, this framework is insufficient for group-level ontologies, where well-being depends on functional persistence, institutional roles, and historical impact rather than immediate individual outcomes. Drawing on Basic Formal Ontology (BFO), the paper introduces a model in which group flourishing is evaluated in terms of group function, where members bear roles and exhibit persistence conditions akin to biological systems or designed artifacts. This approach enables semantic interoperability for modeling longitudinal social contributions, allowing for structured reasoning about group welfare, social institutions, and group flourishing over time.

[AI-8] Enhancing short-term traffic prediction by integrating trends and fluctuations with attention mechanism

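[Quick Read]: This paper addresses the challenge that the interaction between long-term trends and short-term fluctuations poses for traffic flow prediction. Standard deep learning models struggle to capture fine-grained short-term fluctuations because of low-pass filtering effects inherent in their architectures, gate biases that favor stability, and memory update mechanisms that prioritize long-term information retention. The key to the solution is a hybrid deep learning framework that integrates long-term trend and short-term fluctuation information by processing two input features in parallel, and that uses Bahdanau attention to focus selectively on critical time steps in the traffic data, improving the model's ability to predict congestion and other transient phenomena.

Link: https://arxiv.org/abs/2504.19967
Authors: Adway Das, Agnimitra Sengupta, S. Ilgin Guler
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: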

Abstract:Traffic flow prediction is a critical component of intelligent transportation systems, yet accurately forecasting traffic remains challenging due to the interaction between long-term trends and short-term fluctuations. Standard deep learning models often struggle with these challenges because their architectures inherently smooth over fine-grained fluctuations while focusing on general trends. This limitation arises from low-pass filtering effects, gate biases favoring stability, and memory update mechanisms that prioritize long-term information retention. To address these shortcomings, this study introduces a hybrid deep learning framework that integrates both long-term trend and short-term fluctuation information using two input features processed in parallel, designed to capture complementary aspects of traffic flow dynamics. Further, our approach leverages attention mechanisms, specifically Bahdanau attention, to selectively focus on critical time steps within traffic data, enhancing the model’s ability to predict congestion and other transient phenomena. Experimental results demonstrate that features learned from both branches are complementary, significantly improving the goodness-of-fit statistics across multiple prediction horizons compared to a baseline model. Notably, the attention mechanism enhances short-term forecast accuracy by directly targeting immediate fluctuations, though challenges remain in fully integrating long-term trends. This framework can contribute to more effective congestion mitigation and urban mobility planning by advancing the robustness and precision of traffic prediction models.
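
The Bahdanau (additive) attention step mentioned in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' model: the weight matrices `W_h`, `W_q`, the vector `v`, and the toy hidden states are all made-up values, and in practice they would be learned parameters and recurrent-network outputs.

```python
import math

def additive_attention(hidden_states, query, W_h, W_q, v):
    """Bahdanau attention: score_t = v . tanh(W_h h_t + W_q q)."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    q_proj = matvec(W_q, query)
    scores = []
    for h in hidden_states:
        h_proj = matvec(W_h, h)
        scores.append(sum(vi * math.tanh(a + b)
                          for vi, a, b in zip(v, h_proj, q_proj)))

    # Numerically stable softmax over time steps.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states))
               for i in range(dim)]
    return weights, context

# Hypothetical toy input: four time steps, the third one is a sharp spike.
hidden_states = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.1, 0.2]]
query = [1.0, 1.0]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]

weights, context = additive_attention(hidden_states, query, W_h, W_q, v)
```

With these toy values the third time step receives the largest weight, which is the behaviour the abstract relies on when it says attention targets transient fluctuations.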

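[AI-9] Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents

[Quick Read]: This paper addresses the new security challenges that generative AI (GenAI) agents introduce in enterprise settings, which differ significantly from those of traditional systems. The core problem is that existing threat models and defenses cannot adequately handle the distinctive risks created by GenAI agents' autonomy, persistent memory access, complex reasoning, and tool integration. The key to the solution is two complementary frameworks: ATFAA (Advanced Threat Framework for Autonomous AI Agents), which organizes agent-specific risks, and SHIELD, which proposes practical mitigation strategies to reduce enterprise exposure. The paper argues that security must be rethought around the unique architecture and behavior of GenAI agents; otherwise a powerful tool may become a serious enterprise liability.

Link: https://arxiv.org/abs/2504.19956
Authors: Vineeth Sai Narajala, Om Narayan
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures, 1 table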

Abstract:As generative AI (GenAI) agents become more common in enterprise settings, they introduce security challenges that differ significantly from those posed by traditional systems. These agents are not just LLMs; they reason, remember, and act, often with minimal human oversight. This paper introduces a comprehensive threat model tailored specifically for GenAI agents, focusing on how their autonomy, persistent memory access, complex reasoning, and tool integration create novel risks. This research work identifies 9 primary threats and organizes them across five key domains: cognitive architecture vulnerabilities, temporal persistence threats, operational execution vulnerabilities, trust boundary violations, and governance circumvention. These threats are not just theoretical; they bring practical challenges such as delayed exploitability, cross-system propagation, cross-system lateral movement, and subtle goal misalignments that are hard to detect with existing frameworks and standard approaches. To help address this, the research work presents two complementary frameworks: ATFAA (Advanced Threat Framework for Autonomous AI Agents), which organizes agent-specific risks, and SHIELD, a framework proposing practical mitigation strategies designed to reduce enterprise exposure. While this work builds on existing work in LLM and AI security, the focus is squarely on what makes agents different and why those differences matter. Ultimately, this research argues that GenAI agents require a new lens for security. If we fail to adapt our threat models and defenses to account for their unique architecture and behavior, we risk turning a powerful new tool into a serious enterprise liability.

[AI-10] Securing GenAI Multi-Agent Systems Against Tool Squatting: A Zero Trust Registry-Based Approach

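[Quick Read]: This paper addresses security problems arising from tool registration and interaction protocols in generative AI (GenAI) multi-agent systems (MAS), in particular the threat of tool squatting. The key to the solution is a security-hardened tool registry whose core elements are admin-controlled registration, centralized tool discovery, fine-grained access policies enforced through dedicated Agent and Tool Registry services, a dynamic trust scoring mechanism based on tool versioning and known vulnerabilities, and just-in-time credential provisioning. The design aims to block common tool squatting vectors while preserving the flexibility and power of multi-agent systems.

Link: https://arxiv.org/abs/2504.19951
Authors: Vineeth Sai Narajala, Ken Huang, Idan Habler
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 1 table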

Abstract:The rise of generative AI (GenAI) multi-agent systems (MAS) necessitates standardized protocols enabling agents to discover and interact with external tools. However, these protocols introduce new security challenges, particularly tool squatting: the deceptive registration or representation of tools. This paper analyzes tool squatting threats within the context of emerging interoperability standards, such as the Model Context Protocol (MCP) and protocols for seamless communication between agents. It introduces a comprehensive Tool Registry system designed to mitigate these risks. We propose a security-focused architecture featuring admin-controlled registration, centralized tool discovery, fine-grained access policies enforced via dedicated Agent and Tool Registry services, a dynamic trust scoring mechanism based on tool versioning and known vulnerabilities, and just-in-time credential provisioning. Based on its design principles, the proposed registry framework aims to effectively prevent common tool squatting vectors while preserving the flexibility and power of multi-agent systems. This work addresses a critical security gap in the rapidly evolving GenAI ecosystem and provides a foundation for secure tool integration in production environments.
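
To make the registry idea concrete, here is a minimal sketch of admin-controlled registration plus a trust score. Everything in it is hypothetical: the class names, the penalty weights (0.1 per major version behind, 0.25 per known vulnerability), and the threshold are illustrative assumptions and are not taken from the paper or from any MCP implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolRecord:
    name: str
    owner: str
    version: tuple                 # (major, minor, patch)
    known_vulnerabilities: int = 0

class ToolRegistry:
    """Hypothetical registry: only admins register, names cannot be re-claimed."""

    def __init__(self, admins):
        self._admins = set(admins)
        self._tools = {}

    def register(self, requester, record):
        if requester not in self._admins:
            raise PermissionError(f"{requester} may not register tools")
        existing = self._tools.get(record.name)
        if existing is not None and existing.owner != record.owner:
            # A different owner claiming an existing name is treated as squatting.
            raise ValueError(f"tool name {record.name!r} is already registered")
        self._tools[record.name] = record

    def trust_score(self, name, latest_version):
        """Illustrative score in [0, 1]: penalize stale majors and known issues."""
        record = self._tools[name]
        majors_behind = max(0, latest_version[0] - record.version[0])
        score = 1.0 - 0.1 * majors_behind - 0.25 * record.known_vulnerabilities
        return max(0.0, min(1.0, score))

    def discover(self, latest_versions, min_trust=0.5):
        """Return the names of tools whose trust score meets the threshold."""
        return sorted(
            name for name in self._tools
            if self.trust_score(name, latest_versions[name]) >= min_trust
        )

registry = ToolRegistry(admins={"alice"})
registry.register("alice", ToolRecord("search", "team-a", (2, 0, 0)))
registry.register(
    "alice", ToolRecord("calendar", "team-b", (1, 4, 2), known_vulnerabilities=2)
)
```

In this sketch an agent calling `discover` would only see `search`, because `calendar` is one major version behind and carries two known vulnerabilities. Access policies and just-in-time credentials are left out to keep the example short.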

[AI-11] Capturing Aerodynamic Characteristics of ATTAS Aircraft with Evolving Intelligent System

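[Quick Read]: This paper aims to improve the accuracy of aerodynamic coefficient modeling for modern aircraft systems so that their performance can be better understood and optimized. The key to the solution is deploying an Evolving Type-2 Quantum Fuzzy Neural Network (eT2QFNN), which represents the nonlinear aircraft model by generating multiple linear submodels through its rule-based structure, using an incremental learning strategy instead of traditional batch learning. In addition, eT2QFNN gains robustness to uncertainty and data noise from its quantum membership functions and its automatic rule-learning and parameter-tuning capabilities.

Link: https://arxiv.org/abs/2504.19949
Authors: Aydoğan Soylu, Tufan Kumbasar
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: in International Congress on Human-Computer Interaction, Optimization and Robotic Applications, 2025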

Abstract:Accurate modeling of aerodynamic coefficients is crucial for understanding and optimizing the performance of modern aircraft systems. This paper presents the novel deployment of an Evolving Type-2 Quantum Fuzzy Neural Network (eT2QFNN) for modeling the aerodynamic coefficients of the ATTAS aircraft to express the aerodynamic characteristics. eT2QFNN can represent the nonlinear aircraft model by creating multiple linear submodels with its rule-based structure through an incremental learning strategy rather than a traditional batch learning approach. Moreover, it enhances robustness to uncertainties and data noise through its quantum membership functions, as well as its automatic rule-learning and parameter-tuning capabilities. During the estimation of the aerodynamic coefficients via the flight data of the ATTAS, two different studies are conducted in the training phase: one with a large amount of data and the other with a limited amount of data. The results show that the modeling performance of the eT2QFNN is superior in comparison to baseline counterparts. Furthermore, eT2QFNN estimated the aerodynamic model with fewer rules compared to Type-1 fuzzy counterparts. In addition, by applying the Delta method to the proposed approach, the stability and control derivatives of the aircraft are analyzed. The results prove the superiority of the proposed eT2QFNN in representing aerodynamic coefficients.

[AI-12] Probabilistic and Causal Satisfiability: Constraining the Model

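[Quick Read]: This paper studies the complexity of satisfiability problems in probabilistic and causal reasoning. The core question is whether there exists a joint probability distribution that satisfies a Boolean combination of (in)equalities over basic, linear, or polynomial probability terms. Fagin et al. (1990) showed that for basic and linear terms the problem is NP-complete, and Mossé et al. (2022) showed that for polynomial terms it is complete for the existential theory of the reals. Although Pearl's Causal Hierarchy (PCH) extends the probabilistic setting with interventional and counterfactual reasoning, Mossé et al. (2022) found that the satisfiability complexity does not change. This paper extends the line of work along two new dimensions: fixing the graph structure of the structural causal model, and imposing small-model constraints. The key is analyzing satisfiability complexity under these restricted models across different arithmetics and PCH levels.

Link: https://arxiv.org/abs/2504.19944
Authors: Markus Bläser, Julian Dörfler, Maciej Liśkiewicz, Benito van der Zander
Affiliation: Unknown
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Accepted at ICALP 25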

Abstract:We study the complexity of satisfiability problems in probabilistic and causal reasoning. Given random variables $X_1, X_2, \ldots$ over finite domains, the basic terms are probabilities of propositional formulas over atomic events $X_i = x_i$, such as $P(X_1 = x_1)$ or $P(X_1 = x_1 \vee X_2 = x_2)$. The basic terms can be combined using addition (yielding linear terms) or multiplication (polynomial terms). The probabilistic satisfiability problem asks whether a joint probability distribution satisfies a Boolean combination of (in)equalities over such terms. Fagin et al. (1990) showed that for basic and linear terms, this problem is NP-complete, making it no harder than Boolean satisfiability, while Mossé et al. (2022) proved that for polynomial terms, it is complete for the existential theory of the reals. Pearl’s Causal Hierarchy (PCH) extends the probabilistic setting with interventional and counterfactual reasoning, enriching the expressiveness of languages. However, Mossé et al. (2022) found that satisfiability complexity remains unchanged. Van der Zander et al. (2023) showed that introducing a marginalization operator to languages induces a significant increase in complexity. We extend this line of work by adding two new dimensions to the problem by constraining the models. First, we fix the graph structure of the underlying structural causal model, motivated by settings like Pearl’s do-calculus, and give a nearly complete landscape across different arithmetics and PCH levels. Second, we study small models. While earlier work showed that satisfiable instances admit polynomial-size models, this is no longer guaranteed with compact marginalization. We characterize the complexities of satisfiability under small-model constraints across different settings.
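
As a small worked instance of the probabilistic satisfiability problem described above (the numbers are illustrative and not taken from the paper), take two binary variables and write $p_{ab} = P(X_1 = a \wedge X_2 = b)$. The formula $\varphi$ below uses only basic terms, and expanding each term into the $p_{ab}$ gives linear constraints:

```latex
\begin{align*}
\varphi &:\; P(X_1 = 1) \ge 0.7 \;\wedge\; P(X_2 = 1) \ge 0.7 \;\wedge\; P(X_1 = 1 \wedge X_2 = 1) \le 0.3 \\
&p_{10} + p_{11} \ge 0.7, \qquad p_{01} + p_{11} \ge 0.7, \qquad
  p_{00} + p_{01} + p_{10} + p_{11} = 1, \qquad p_{ab} \ge 0 \\
&\Rightarrow\; p_{11} \;\ge\; 1.4 - (p_{01} + p_{10} + p_{11}) \;\ge\; 1.4 - 1 \;=\; 0.4 \;>\; 0.3
\end{align*}
```

The last line adds the first two constraints and uses $p_{01} + p_{10} + p_{11} \le 1$, so no joint distribution satisfies $\varphi$. Relaxing the last conjunct to $P(X_1 = 1 \wedge X_2 = 1) \le 0.4$ makes the formula satisfiable, witnessed by $p_{11} = 0.4$, $p_{10} = p_{01} = 0.3$, $p_{00} = 0$.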

[AI-13] Automated decision-making for dynamic task assignment at scale

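[Quick Read]: This paper addresses real-time resource allocation in the Dynamic Task Assignment Problem (DTAP), specifically how to process requests quickly and minimize resource costs or task cycle time when each task consists of a stochastic sequence of activities. The key to the solution is a Decision Support System (DSS) based on Deep Reinforcement Learning (DRL) with two novel elements: a graph structure for observations and actions that can represent any DTAP, and a reward function that is provably equivalent to the objective of minimizing the average cycle time of tasks. Together these let the agent learn effective and generalizable assignment policies for real-world-scale DTAPs.

Link: https://arxiv.org/abs/2504.19933
Authors: Riccardo Lo Bianco, Willem van Jaarsveld, Jeroen Middelhuis, Luca Begnardi, Remco Dijkman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: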

Abstract:The Dynamic Task Assignment Problem (DTAP) concerns matching resources to tasks in real time while minimizing objectives such as resource costs or task cycle time. In this work, we consider a DTAP variant where every task is a case composed of a stochastic sequence of activities. The DTAP, in this case, involves the decision of which employee to assign to which activity to process requests as quickly as possible. In recent years, Deep Reinforcement Learning (DRL) has emerged as a promising tool for tackling this DTAP variant, but most research is limited to solving small-scale, synthetic problems, neglecting the challenges posed by real-world use cases. To bridge this gap, this work proposes a DRL-based Decision Support System (DSS) for real-world scale DTAPs. To this end, we introduce a DRL agent with two novel elements: a graph structure for observations and actions that can effectively represent any DTAP and a reward function that is provably equivalent to the objective of minimizing the average cycle time of tasks. The combination of these two novelties allows the agent to learn effective and generalizable assignment policies for real-world scale DTAPs. The proposed DSS is evaluated on five DTAP instances whose parameters are extracted from real-world logs through process mining. The experimental evaluation shows how the proposed DRL agent matches or outperforms the best baseline in all DTAP instances and generalizes on different time horizons and across instances.
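
One standard way to obtain a reward that tracks cycle time is to penalize, over each interval between events, the number of cases currently open multiplied by the interval length. The sketch below checks that the summed reward equals minus the total cycle time. This construction is an assumption made for illustration; the abstract does not spell out the paper's reward, and the `(arrival, completion)` pairs are made-up data.

```python
def total_cycle_time(cases):
    """Sum of (completion - arrival) over all cases."""
    return sum(end - start for start, end in cases)

def cumulative_reward(cases):
    """Sum over inter-event intervals of -(open cases) * (interval length)."""
    events = sorted({t for case in cases for t in case})
    reward = 0.0
    for t0, t1 in zip(events, events[1:]):
        # A case is open on [t0, t1] if it arrived by t0 and completes at or after t1.
        open_cases = sum(1 for start, end in cases if start <= t0 and end >= t1)
        reward -= open_cases * (t1 - t0)
    return reward

# Hypothetical log: three cases as (arrival time, completion time).
cases = [(0.0, 4.0), (1.0, 3.0), (2.0, 7.0)]

reward = cumulative_reward(cases)
average_cycle_time = total_cycle_time(cases) / len(cases)
```

Because the number of cases is fixed for a given log, maximizing this cumulative reward is the same as minimizing the average cycle time, which is the kind of equivalence the abstract claims for its own reward function.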

[AI-14] Can AI Agents Design and Implement Drug Discovery Pipelines?

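[Quick Read]: This paper asks how generative AI can make drug discovery more efficient, in particular by strengthening decision-making in virtual screening scenarios to reduce reliance on costly experimental testing. The key to the solution is DO Challenge, a benchmark for evaluating the decision-making abilities of AI agents on a complex problem: independently developing, implementing, and executing efficient strategies to identify promising molecular structures in large datasets while handling multi-objective challenges such as navigating chemical space, selecting models, and managing resources. The paper also presents the Deep Thought multi-agent system, which performs strongly on the benchmark. This shows the potential of current language models in primary and auxiliary agent roles, but also exposes the shortfall of AI-driven approaches in stability and relative to expert-designed solutions.

Link: https://arxiv.org/abs/2504.19912
Authors: Khachik Smbatyan, Tsolak Ghukasyan, Tigran Aghajanyan, Hovhannes Dabaghyan, Sergey Adamyan, Aram Bughdaryan, Vahagn Altunyan, Gagik Navasardyan, Aram Davtyan, Anush Hakobyan, Aram Gharibyan, Arman Fahradyan, Artur Hakobyan, Hasmik Mnatsakanyan, Narek Ginoyan, Garik Petrosyan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: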

Abstract:The rapid advancement of artificial intelligence, particularly autonomous agentic systems based on Large Language Models (LLMs), presents new opportunities to accelerate drug discovery by improving in-silico modeling and reducing dependence on costly experimental trials. Current AI agent-based systems demonstrate proficiency in solving programming challenges and conducting research, indicating an emerging potential to develop software capable of addressing complex problems such as pharmaceutical design and drug discovery. This paper introduces DO Challenge, a benchmark designed to evaluate the decision-making abilities of AI agents in a single, complex problem resembling virtual screening scenarios. The benchmark challenges systems to independently develop, implement, and execute efficient strategies for identifying promising molecular structures from extensive datasets, while navigating chemical space, selecting models, and managing limited resources in a multi-objective context. We also discuss insights from the DO Challenge 2025, a competition based on the proposed benchmark, which showcased diverse strategies explored by human participants. Furthermore, we present the Deep Thought multi-agent system, which demonstrated strong performance on the benchmark, outperforming most human teams. Among the language models tested, Claude 3.7 Sonnet, Gemini 2.5 Pro and o3 performed best in primary agent roles, and GPT-4o, Gemini 2.0 Flash were effective in auxiliary roles. While promising, the system’s performance still fell short of expert-designed solutions and showed high instability, highlighting both the potential and current limitations of AI-driven methodologies in transforming drug discovery and broader scientific research.

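[AI-15] Attention Mechanism, Max-Affine Partition, and Universal Approximation

[Quick Read]: This paper asks whether single-layer, single-head self-attention and cross-attention mechanisms are universal approximators. The key to the solution is to interpret single-head attention as an input domain-partition mechanism and to engineer the attention weights so that the mechanism imitates the target function, thereby approximating continuous or Lebesgue integrable functions. The paper proves that a single self-attention layer preceded only by simple linear transformations can approximate any continuous function on a compact domain under the L∞-norm, and extends the result to Lebesgue integrable functions under the Lp-norm. It also proves, for the first time, that single-head cross-attention has the same universal approximation capability.

Link: https://arxiv.org/abs/2504.19901
Authors: Hude Liu, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: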

Abstract:We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input domain-partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights such that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by sum-of-linear transformations, is capable of approximating any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue integrable function under the $L_p$-norm for $1 \leq p < \infty$. Lastly, we also extend our techniques and show that, for the first time, single-head cross-attention achieves the same universal approximation guarantees.
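
The partition intuition can be illustrated in one dimension. A softmax over the affine scores `c * x - c**2 / 2` is largest for the center `c` nearest to `x`, so with a large inverse temperature the attention weights become nearly one-hot and the output is close to the value stored for that region. The sketch below is only a toy illustration of this idea, not the paper's construction: the centers, the stored values, `beta`, and the target function are all assumptions chosen for the example.

```python
import math

def attention_approx(x, centers, values, beta):
    """Softmax over affine scores; a large beta gives a near-hard partition."""
    # c*x - c^2/2 equals -(x - c)^2/2 + x^2/2, so the argmax is the nearest center.
    scores = [beta * (c * x - 0.5 * c * c) for c in centers]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))

def target(x):
    return math.sin(2.0 * math.pi * x)

n = 50                                           # number of subregions of [0, 1]
centers = [(i + 0.5) / n for i in range(n)]      # one "key" per subregion
values = [target(c) for c in centers]            # value assigned to each subregion

grid = [i / 200 for i in range(201)]
max_err = max(
    abs(attention_approx(x, centers, values, beta=1e5) - target(x)) for x in grid
)
```

The sup-norm error on the grid is governed by how much the target varies within one subregion, so refining the partition (larger `n`) tightens the approximation.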

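[AI-16] TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

[Quick Read]: This paper addresses vector quantization: quantizing high-dimensional Euclidean vectors while minimizing distortion of their geometric structure. Existing methods fall short of optimal distortion rates and cannot handle both mean-squared error (MSE) and inner product distortion effectively. The key to the proposed solution, TurboQuant, is a data-oblivious algorithm, suitable for online use, that achieves near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. It randomly rotates the input vectors, which induces a concentrated Beta distribution on the coordinates, and exploits the near-independence of distinct coordinates in high dimensions to apply an optimal scalar quantizer to each coordinate separately. Because MSE-optimal quantizers introduce bias in inner product estimation, the paper proposes a two-stage approach: apply an MSE quantizer, then apply a 1-bit Quantized JL (QJL) transform to the residual, which yields an unbiased inner product quantizer.

Link: https://arxiv.org/abs/2504.19874
Authors: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Comments: 25 pages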

Abstract:Vector quantization, a problem rooted in Shannon’s source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.
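
The two-stage pipeline (rotate, quantize each coordinate, then sketch the residual with sign bits) can be sketched in plain Python. This is a rough illustration under several assumptions, not the authors' algorithm: the per-coordinate quantizer is a simple uniform one rather than the optimal quantizer for the Beta distribution, the rotation comes from Gram-Schmidt on Gaussian vectors, and the dimension, bit-width, clip range, and number of projections `m` are arbitrary. The residual estimator relies on the identity E[sign(s·r)(s·q)] = sqrt(2/pi) * <r, q> / ||r|| for a standard Gaussian vector s.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def random_rotation(d, rng):
    """Random orthogonal d x d matrix via Gram-Schmidt on Gaussian rows."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            proj = dot(v, u)
            v = [a - proj * b for a, b in zip(v, u)]
        rows.append(unit(v))
    return rows

def uniform_quantize(y, bits, clip):
    """Per-coordinate uniform quantizer on [-clip, clip] (illustrative stand-in)."""
    levels = 2 ** bits
    step = 2.0 * clip / levels
    out = []
    for a in y:
        idx = min(levels - 1, max(0, int((a + clip) // step)))
        out.append(-clip + (idx + 0.5) * step)   # midpoint of the chosen cell
    return out

def qjl_inner_product(signs, r_norm, S, q):
    """Estimate <r, q> from one sign bit per projection plus the norm of r."""
    acc = sum(sg * dot(s, q) for sg, s in zip(signs, S))
    return r_norm * math.sqrt(math.pi / 2.0) * acc / len(S)

rng = random.Random(0)
d, bits, m = 32, 3, 2000
x = unit([rng.gauss(0.0, 1.0) for _ in range(d)])   # vector to store
q = unit([rng.gauss(0.0, 1.0) for _ in range(d)])   # query vector

# Stage 1: rotate, quantize each coordinate, rotate back.
R = random_rotation(d, rng)
R_t = [list(col) for col in zip(*R)]
y_hat = uniform_quantize(matvec(R, x), bits, clip=3.0 / math.sqrt(d))
x_hat = matvec(R_t, y_hat)
residual = [a - b for a, b in zip(x, x_hat)]
mse = dot(residual, residual)

# Stage 2: keep only the signs of Gaussian projections of the residual.
S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
signs = [1.0 if dot(s, residual) >= 0 else -1.0 for s in S]
estimate = dot(x_hat, q) + qjl_inner_product(signs, math.sqrt(mse), S, q)
true_ip = dot(x, q)
```

Stage 1 alone gives a small reconstruction error but a systematically off inner product; adding the stage 2 term corrects the inner product in expectation at a cost of `m` extra bits and one stored norm.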

[AI-17] Human-Centered AI and Autonomy in Robotics: Insights from a Bibliometric Study

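[Quick Read]: This paper asks how to realize a Human-Centered AI (HCAI) architecture in intelligent autonomous robotic systems that balances automation and human control, improving performance on complex tasks while maintaining creativity, mastery, and responsibility. The key to the solution is a bibliometric analysis that uses SciMAT and VOSViewer on data from the Scopus database to identify academic trends and emerging topics, and then maps them onto the IBM MAPE-K architecture to guide the development of real-world autonomous robotic systems.

Link: https://arxiv.org/abs/2504.19848
Authors: Simona Casini, Pietro Ducange, Francesco Marcelloni, Lorenzo Pollini
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: International Joint Conference on Neural Network 2025 - Accepted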

Abstract:The development of autonomous robotic systems offers significant potential for performing complex tasks with precision and consistency. Recent advances in Artificial Intelligence (AI) have enabled more capable intelligent automation systems, addressing increasingly complex challenges. However, this progress raises questions about human roles in such systems. Human-Centered AI (HCAI) aims to balance human control and automation, ensuring performance enhancement while maintaining creativity, mastery, and responsibility. For real-world applications, autonomous robots must balance task performance with reliability, safety, and trustworthiness. Integrating HCAI principles enhances human-robot collaboration and ensures responsible operation. This paper presents a bibliometric analysis of intelligent autonomous robotic systems, utilizing SciMAT and VOSViewer to examine data from the Scopus database. The findings highlight academic trends, emerging topics, and AI’s role in self-adaptive robotic behaviour, with an emphasis on HCAI architecture. These insights are then projected onto the IBM MAPE-K architecture, with the goal of identifying how these research results map into actual robotic autonomous systems development efforts for real-world scenarios.

[AI-18] PhenoAssistant: A Conversational Multi-Agent AI System for Automated Plant Phenotyping

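[Quick Read]: This paper addresses the problem that existing plant phenotyping solutions are overly complex, difficult to reimplement and maintain, and pose high technical barriers for users without a computational background. The key to the solution is PhenoAssistant, an AI-driven system that streamlines plant phenotyping through natural language interaction, using a large language model to orchestrate a curated toolkit that supports tasks such as automated phenotype extraction, data visualisation, and model training.

Link: https://arxiv.org/abs/2504.19818
Authors: Feng Chen, Ilias Stogiannidis, Andrew Wood, Danilo Bueno, Dominic Williams, Fraser Macfarlane, Bruce Grieve, Darren Wells, Jonathan A. Atkinson, Malcolm J. Hawkesford, Stephen A. Rolfe, Tracy Lawson, Tony Pridmore, Mario Valerio Giuffrida, Sotirios A. Tsaftaris
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: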

Abstract:Plant phenotyping increasingly relies on (semi-)automated image-based analysis workflows to improve its accuracy and scalability. However, many existing solutions remain overly complex, difficult to reimplement and maintain, and pose high barriers for users without substantial computational expertise. To address these challenges, we introduce PhenoAssistant: a pioneering AI-driven system that streamlines plant phenotyping via intuitive natural language interaction. PhenoAssistant leverages a large language model to orchestrate a curated toolkit supporting tasks including automated phenotype extraction, data visualisation and automated model training. We validate PhenoAssistant through several representative case studies and a set of evaluation tasks. By significantly lowering technical hurdles, PhenoAssistant underscores the promise of AI-driven methodologies for democratising AI adoption in plant biology.

[AI-19] Contextures: The Mechanism of Representation Learning

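[Quick Read]: This paper addresses the fact that it is unclear what representations foundation models learn during pretraining and why these representations are useful across downstream tasks. The key to the solution is a unified theoretical framework, the contexture theory, which holds that representations are learned from the association between the input X and a context variable A. It proves that an encoder that captures the maximum information of this association (that is, one that learns the contexture) is optimal on the class of tasks compatible with the context.

Link: https://arxiv.org/abs/2504.19792
Authors: Runtian Zhai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: PhD Dissertation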

Abstract:This dissertation establishes the contexture theory to mathematically characterize the mechanism of representation learning, or pretraining. Despite the remarkable empirical success of foundation models, it is not very clear what representations they learn, and why these representations are useful for various downstream tasks. A scientific understanding of representation learning is critical, especially at this point when scaling up the model size is producing diminishing returns, and designing new pretraining methods is imperative for further progress. Prior work treated different representation learning methods quite differently, whereas the contexture theory provides a unified framework for analyzing these methods. The central argument is that a representation is learned from the association between the input X and a context variable A. We prove that if an encoder captures the maximum information of this association, in which case we say that the encoder learns the contexture, then it will be optimal on the class of tasks that are compatible with the context. We also show that a context is the most useful when the association between X and A is neither too strong nor too weak. The important implication of the contexture theory is that increasing the model size alone will achieve diminishing returns, and further advancements require better contexts. We demonstrate that many pretraining objectives can learn the contexture, including supervised learning, self-supervised learning, generative models, etc. Then, we introduce two general objectives – SVME and KISE, for learning the contexture. We also show how to mix multiple contexts together, an effortless way to create better contexts from existing ones. Then, we prove statistical learning bounds for representation learning. Finally, we discuss the effect of the data distribution shift from pretraining to the downstream task. 
Report number: CMU-CS-25-104

[AI-20] Learning Efficiency Meets Symmetry Breaking

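[Quick Read]: This paper addresses the problem that learning-based planners, although able to guide search in large search spaces, do little to handle symmetries. The key is a graph representation, used with Graph Neural Networks, that combines learning efficiency with the ability to detect symmetries, together with two pruning methods, action pruning and state pruning, that manage symmetries during search. Integrating these techniques into Fast Downward outperforms LAMA on the latest IPC learning track dataset for the first time.

Link: https://arxiv.org/abs/2504.19738
Authors: Yingbin Bai, Sylvie Thiebaux, Felipe Trevizan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: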

Abstract:Learning-based planners leveraging Graph Neural Networks can learn search guidance applicable to large search spaces, yet their potential to address symmetries remains largely unexplored. In this paper, we introduce a graph representation of planning problems allying learning efficiency with the ability to detect symmetries, along with two pruning methods, action pruning and state pruning, designed to manage symmetries during search. The integration of these techniques into Fast Downward achieves a first-time success over LAMA on the latest IPC learning track dataset. Code is released at: this https URL.
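
State pruning under symmetry can be illustrated with a brute-force canonical form: rename interchangeable objects in every possible way, keep the smallest renaming, and discard any state whose canonical form has already been seen. This is a hypothetical stand-in written for illustration; the paper detects symmetries through its graph representation, and the set of interchangeable objects is simply given as input here.

```python
from itertools import permutations

def canonical(state, symmetric_objects):
    """Smallest renaming of a state over all permutations of symmetric objects."""
    best = None
    for perm in permutations(symmetric_objects):
        mapping = dict(zip(symmetric_objects, perm))
        renamed = tuple(sorted(
            (pred, tuple(mapping.get(arg, arg) for arg in args))
            for pred, args in state
        ))
        if best is None or renamed < best:
            best = renamed
    return best

def prune_symmetric(states, symmetric_objects):
    """Keep one representative per symmetry class, preserving input order."""
    seen = set()
    kept = []
    for state in states:
        key = canonical(state, symmetric_objects)
        if key not in seen:
            seen.add(key)
            kept.append(state)
    return kept

# Hypothetical logistics states: trucks t1 and t2 are interchangeable.
s1 = frozenset({("at", ("t1", "A")), ("at", ("t2", "B"))})
s2 = frozenset({("at", ("t1", "B")), ("at", ("t2", "A"))})   # symmetric to s1
s3 = frozenset({("at", ("t1", "A")), ("at", ("t2", "A"))})

kept = prune_symmetric([s1, s2, s3], symmetric_objects=["t1", "t2"])
```

Enumerating permutations is exponential in the number of symmetric objects, so this only shows the pruning criterion, not an efficient way to compute it.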

[AI-21] Model-based controller assisted domain randomization in deep reinforcement learning: application to nonlinear powertrain control

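[Quick Read]: This paper addresses the control difficulties that nonlinearities and uncertainties cause when control systems for complex mechanical systems, such as vehicle powertrains, are transferred from simulation to real systems. Traditional robust control methods are limited in handling certain types of nonlinearities and uncertainties, so a more practical approach that compensates for these constraints comprehensively is needed. The paper proposes a new robust control approach based on deep reinforcement learning (DRL), whose key is the synergy among domain randomization-based DRL, long short-term memory (LSTM)-based actor and critic networks, and model-based control (MBC). The problem is modeled as a latent Markov decision process (LMDP); the dynamics of the environment simulator are randomized to improve the robustness of the control system, and a model-based controller built on a nominal system model assists training. Compared with traditional DRL-based control, the approach achieves high generalization ability with a more compact neural network architecture and less training data.

Link: https://arxiv.org/abs/2504.19715
Authors: Heisei Yonezawa, Ansei Yonezawa, Itsuro Kajiwara
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: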

Abstract:Complex mechanical systems such as vehicle powertrains are inherently subject to multiple nonlinearities and uncertainties arising from parametric variations. Modeling and calibration errors are therefore unavoidable, making the transfer of control systems from simulation to real-world systems a critical challenge. Traditional robust controls have limitations in handling certain types of nonlinearities and uncertainties, requiring a more practical approach capable of comprehensively compensating for these various constraints. This study proposes a new robust control approach using the framework of deep reinforcement learning (DRL). The key strategy lies in the synergy among domain randomization-based DRL, long short-term memory (LSTM)-based actor and critic networks, and model-based control (MBC). The problem setup is modeled via the latent Markov decision process (LMDP), a set of vanilla MDPs, for a controlled system subject to uncertainties and nonlinearities. In LMDP, the dynamics of an environment simulator is randomized during training to improve the robustness of the control system to real testing environments. The randomization increases training difficulties as well as conservativeness of the resultant control system; therefore, progress is assisted by concurrent use of a model-based controller based on a nominal system model. Compared to traditional DRL-based controls, the proposed controller design is smarter in that we can achieve a high level of generalization ability with a more compact neural network architecture and a smaller amount of training data. The proposed approach is verified via practical application to active damping for a complex powertrain system with nonlinearities and parametric variations. Comparative tests demonstrate the high robustness of the proposed approach.
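
The interplay between domain randomization and a nominal model-based controller can be sketched on a toy mass-spring-damper plant. All of the following is assumed for illustration: the plant, the nominal parameters, the PD gains, the 30% randomization spread, and the additive combination of the model-based action with the policy output. The abstract does not state how the two controllers are combined, and the "policy" here is an untrained placeholder rather than an LSTM actor.

```python
import random

NOMINAL = {"mass": 1.0, "damping": 0.5, "stiffness": 2.0}

def sample_plant(rng, spread=0.3):
    """Domain randomization: scale each nominal parameter by up to +/- spread."""
    return {k: v * rng.uniform(1.0 - spread, 1.0 + spread)
            for k, v in NOMINAL.items()}

def model_based_control(x, v, kp=8.0, kd=4.0):
    """PD law tuned on the nominal model (stand-in for the paper's MBC)."""
    return -kp * x - kd * v

def learned_policy(x, v):
    """Placeholder for the DRL actor; an untrained policy outputs zero here."""
    return 0.0

def run_episode(plant, steps=1000, dt=0.01):
    """Simulate one episode with explicit Euler and return the final position."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        u = model_based_control(x, v) + learned_policy(x, v)  # assumed additive mix
        a = (u - plant["damping"] * v - plant["stiffness"] * x) / plant["mass"]
        x, v = x + dt * v, v + dt * a
    return x

rng = random.Random(0)
plants = [sample_plant(rng) for _ in range(5)]     # one randomized plant per episode
final_positions = [run_episode(p) for p in plants]
```

In a training loop, each episode would draw a new plant in this way, so the learned component only has to correct what the nominal controller misses under parameter variation.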

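[AI-22] From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

[Quick Read]: This paper addresses the problem that evaluation of large language models and autonomous AI agents is fragmented and lacks a unified taxonomy or comprehensive survey. The key to the solution is a taxonomy of approximately 60 evaluation benchmarks covering general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments, together with a systematic review of AI-agent frameworks introduced between 2023 and 2025 and their applications in several real-world domains. The paper also surveys key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A), and proposes directions for future research.

Link: https://arxiv.org/abs/2504.19678
Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: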

Abstract:Large language models and autonomous AI agents have evolved rapidly, resulting in a diverse array of evaluation benchmarks, frameworks, and collaboration protocols. However, the landscape remains fragmented and lacks a unified taxonomy or comprehensive survey. Therefore, we present a side-by-side comparison of benchmarks developed between 2019 and 2025 that evaluate these models and agents across multiple domains. In addition, we propose a taxonomy of approximately 60 benchmarks that cover general and academic knowledge reasoning, mathematical problem-solving, code generation and software engineering, factual grounding and retrieval, domain-specific evaluations, multimodal and embodied tasks, task orchestration, and interactive assessments. Furthermore, we review AI-agent frameworks introduced between 2023 and 2025 that integrate large language models with modular toolkits to enable autonomous decision-making and multi-step reasoning. Moreover, we present real-world applications of autonomous AI agents in materials science, biomedical research, academic ideation, software engineering, synthetic data generation, chemical reasoning, mathematical problem-solving, geographic information systems, multimedia, healthcare, and finance. We then survey key agent-to-agent collaboration protocols, namely the Agent Communication Protocol (ACP), the Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally, we discuss recommendations for future research, focusing on advanced reasoning strategies, failure modes in multi-agent LLM systems, automated scientific discovery, dynamic tool integration via reinforcement learning, integrated search capabilities, and security vulnerabilities in agent protocols.

[AI-23] textttSAGE: A Generic Framework for LLM Safety Evaluation

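【Quick read】: This paper aims to address gaps in the safety evaluation of Large Language Models (LLMs), in particular the lack of evaluations tailored to the specific harms of different applications and the insufficient attention paid to the dynamic, conversational nature of LLM systems. The key to its solution is SAGE (Safety AI Generic Evaluation), an automated modular framework that uses adversarial user models that are system-aware and have unique personalities to carry out holistic red-teaming evaluation, making it possible to identify and measure potentially harmful model behaviour in multi-turn conversations and how it relates to user personality and scenario.

Link: https://arxiv.org/abs/2504.19674
Authors: Madhur Jindal, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 24 pages, 9 main pages excluding references and appendix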

Abstract:Safety evaluation of Large Language Models (LLMs) has made progress and attracted academic interest, but it remains challenging to keep pace with the rapid integration of LLMs across diverse applications. Different applications expose users to various harms, necessitating application-specific safety evaluations with tailored harms and policies. Another major gap is the lack of focus on the dynamic and conversational nature of LLM systems. Such potential oversights can lead to harms that go unnoticed in standard safety benchmarks. This paper identifies the above as key requirements for robust LLM safety evaluation and recognizing that current evaluation methodologies do not satisfy these, we introduce the SAGE (Safety AI Generic Evaluation) framework. SAGE is an automated modular framework designed for customized and dynamic harm evaluations. It utilizes adversarial user models that are system-aware and have unique personalities, enabling a holistic red-teaming evaluation. We demonstrate SAGE's effectiveness by evaluating seven state-of-the-art LLMs across three applications and harm policies. Our experiments with multi-turn conversational evaluations revealed a concerning finding that harm steadily increases with conversation length. Furthermore, we observe significant disparities in model behavior when exposed to different user personalities and scenarios. Our findings also reveal that some models minimize harmful outputs by employing severe refusal tactics that can hinder their usefulness. These insights highlight the necessity of adaptive and context-specific testing to ensure better safety alignment and safer deployment of LLMs in real-world scenarios.

[AI-24] Generative AI in Education: Student Skills and Lecturer Roles

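【Quick read】: This paper asks how to cultivate the key competencies students need to use generative AI (GenAI) effectively in education, and offers strategies for lecturers to integrate GenAI into teaching practice. Using a mixed-methods approach that combines a literature review with a quantitative survey, the study identifies the core skills for student engagement with GenAI, such as AI literacy, critical thinking, and ethical AI practices, and finds gaps in prompt engineering, bias awareness, and AI output management. It also identifies key lecturer strategies in areas such as GenAI integration and curriculum design. The key to the solution is to strengthen students' AI literacy and practical skills while encouraging lecturers to integrate GenAI sensibly into teaching, so that education is both innovative and responsible.

Link: https://arxiv.org/abs/2504.19673
Authors: Stefanie Krause, Ashish Dalvi, Syed Khubaib Zaidi
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: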

Abstract:Generative Artificial Intelligence (GenAI) tools such as ChatGPT are emerging as a revolutionary tool in education that brings both positive aspects and challenges for educators and students, reshaping how learning and teaching are approached. This study aims to identify and evaluate the key competencies students need to effectively engage with GenAI in education and to provide strategies for lecturers to integrate GenAI into teaching practices. The study applied a mixed method approach with a combination of a literature review and a quantitative survey involving 130 students from South Asia and Europe to obtain its findings. The literature review identified 14 essential student skills for GenAI engagement, with AI literacy, critical thinking, and ethical AI practices emerging as the most critical. The student survey revealed gaps in prompt engineering, bias awareness, and AI output management. In our study of lecturer strategies, we identified six key areas, with GenAI Integration and Curriculum Design being the most emphasised. Our findings highlight the importance of incorporating GenAI into education. While literature prioritized ethics and policy development, students favour hands-on, project-based learning and practical AI applications. To foster inclusive and responsible GenAI adoption, institutions should ensure equitable access to GenAI tools, establish clear academic integrity policies, and advocate for global GenAI research initiatives.

[AI-25] A Tripartite Perspective on GraphRAG

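【Quick read】: This paper aims to address the limited factual accuracy of Large Language Models (LLMs) on knowledge-intensive tasks, for example in industrial automation and healthcare. The key challenges are the models' tendency to hallucinate, the lack of source traceability (provenance), and the difficulty of timely knowledge updates. The proposed solution combines LLMs with a tripartite knowledge graph representation: a curated ontology of domain-specific concepts is built, and a concept-anchored pre-analysis links relevant sections within text chunks to those concepts, which structures and compresses the information. The core of the method is to use the tripartite knowledge graph to improve the information density, coverage, and arrangement of LLM prompts while shortening them, which may lower computational cost and make outputs more consistent and reliable.

Link: https://arxiv.org/abs/2504.19667
Authors: Michael Banf, Johannes Kuhn
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: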

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across various domains, yet they struggle with knowledge-intensive tasks in areas that demand factual accuracy, e.g. industrial automation and healthcare. Key limitations include their tendency to hallucinate, lack of source traceability (provenance), and challenges in timely knowledge updates. Combining language models with knowledge graphs (GraphRAG) offers promising avenues for overcoming these deficits. However, a major challenge lies in creating such a knowledge graph in the first place. Here, we propose a novel approach that combines LLMs with a tripartite knowledge graph representation, which is constructed by connecting complex, domain-specific objects via a curated ontology of corresponding, domain-specific concepts to relevant sections within chunks of text through a concept-anchored pre-analysis of source documents starting from an initial lexical graph. As a consequence, our Tripartite-GraphRAG approach: i) implements a concept-specific, information-preserving pre-compression of textual chunks; ii) allows for the formation of a concept-specific relevance estimation of embedding similarities grounded in statistics; and iii) avoids common challenges w.r.t. continuous extendability, such as the need for entity resolution and deduplication. By applying a transformation to the knowledge graph, we formulate LLM prompt creation as an unsupervised node classification problem, drawing on ideas from Markov Random Fields. We evaluate our approach on a healthcare use case, involving multi-faceted analyses of patient anamneses given a set of medical concepts as well as clinical literature. Experiments indicate that it can optimize information density, coverage, and arrangement of LLM prompts while reducing their lengths, which may lead to reduced costs and more consistent and reliable LLM outputs.

[AI-26] Hardware/Software Co-Design of RISC-V Extensions for Accelerating Sparse DNNs on FPGAs

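【Quick read】: This paper addresses how to efficiently accelerate deep neural networks (DNNs) with semi-structured and unstructured sparsity on the RISC-V architecture. The key to its solution is custom instruction set extensions and functional units that exploit the fine-grained configurability of FPGAs to handle sparsity and thereby skip unnecessary computation. For semi-structured sparsity, it reserves a few bits in each block of weights to encode sparsity information about the succeeding blocks and uses that information to skip computations. For unstructured sparsity, it designs a variable-cycle sequential multiply-and-accumulate unit that performs multiplications only for non-zero weights. It also proposes a combined design that accelerates both types of sparsity, delivering substantial speedups while keeping FPGA resource consumption low.

Link: https://arxiv.org/abs/2504.19659
Authors: Muhammad Sabih, Abrarul Karim, Jakob Wittmann, Frank Hannig, Jürgen Teich
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: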

Abstract:The customizability of RISC-V makes it an attractive choice for accelerating deep neural networks (DNNs). It can be achieved through instruction set extensions and corresponding custom functional units. Yet, efficiently exploiting these opportunities requires a hardware/software co-design approach in which the DNN model, software, and hardware are designed together. In this paper, we propose novel RISC-V extensions for accelerating DNN models containing semi-structured and unstructured sparsity. While the idea of accelerating structured and unstructured pruning is not new, our novel design offers various advantages over other designs. To exploit semi-structured sparsity, we take advantage of the fine-grained (bit-level) configurability of FPGAs and suggest reserving a few bits in a block of DNN weights to encode the information about sparsity in the succeeding blocks. The proposed custom functional unit utilizes this information to skip computations. To exploit unstructured sparsity, we propose a variable cycle sequential multiply-and-accumulate unit that performs only as many multiplications as the non-zero weights. Our implementation of unstructured and semi-structured pruning accelerators can provide speedups of up to a factor of 3 and 4, respectively. We then propose a combined design that can accelerate both types of sparsities, providing speedups of up to a factor of 5. Our designs consume a small amount of additional FPGA resources such that the resulting co-designs enable the acceleration of DNNs even on small FPGAs. We benchmark our designs on standard TinyML applications such as keyword spotting, image classification, and person detection.
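
As a rough software illustration of the unstructured-sparsity idea in this abstract, the sketch below models a sequential multiply-and-accumulate unit whose cycle count equals the number of non-zero weights. It is a behavioural toy in Python, not the authors' hardware design or instruction encoding; the function name `sparse_mac` and the one-cycle-per-multiplication cost model are assumptions made purely for illustration.

```python
def sparse_mac(weights, activations):
    """Accumulate w * a only for non-zero weights; return (result, cycles).

    Assumed cost model (hypothetical): one cycle per multiplication performed.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have equal length")
    acc = 0
    cycles = 0
    for w, a in zip(weights, activations):
        if w == 0:
            continue  # zero weight: the multiplication is skipped entirely
        acc += w * a
        cycles += 1
    return acc, cycles


# Made-up example data: 3 of the 8 weights are non-zero.
weights = [0, 3, 0, 0, -2, 0, 0, 1]
activations = [5, 1, 7, 2, 4, 9, 6, 8]
result, cycles = sparse_mac(weights, activations)
```

Under this toy model a dense unit would spend 8 cycles on the same vector, so the saving tracks the fraction of zero weights.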

[AI-27] Transformation and Translation Occupancy Grid Mapping: 2-Dimensional Deep Learning Refined SLAM

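【Quick read】: This paper aims to address the degradation of map quality in 2D SLAM (Simultaneous Localisation and Mapping) caused by odometry drift and inaccurate pose estimation in large, complex environments, as well as the noise and ambiguity that conventional Occupancy Grid Mapping (OGM) produces because it relies on uncertain observations. The key to its solution is a new Transformation and Translation Occupancy Grid Mapping (TT-OGM) method that brings accurate and robust pose estimation techniques from 3D SLAM to the 2D domain, uses Generative Adversarial Networks (GANs) to mitigate errors, and introduces a data generation method based on deep reinforcement learning (DRL) to build datasets large enough to train a GAN for SLAM error correction.

Link: https://arxiv.org/abs/2504.19654
Authors: Leon Davies, Baihua Li, Mohamad Saada, Simon Sølvsten, Qinggang Meng
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, preprint, submitted to Robotics And Autonomous Systems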

Abstract:SLAM (Simultaneous Localisation and Mapping) is a crucial component for robotic systems, providing a map of an environment, the current location and previous trajectory of a robot. While 3D LiDAR SLAM has received notable improvements in recent years, 2D SLAM lags behind. Gradual drifts in odometry and pose estimation inaccuracies hinder modern 2D LiDAR-odometry algorithms in large complex environments. Dynamic robotic motion coupled with inherent estimation based SLAM processes introduce noise and errors, degrading map quality. Occupancy Grid Mapping (OGM) produces results that are often noisy and unclear. This is due to the fact that evidence based mapping represents maps according to uncertain observations. This is why OGMs are so popular in exploration or navigation tasks. However, this also limits OGMs’ effectiveness for specific mapping based tasks such as floor plan creation in complex scenes. To address this, we propose our novel Transformation and Translation Occupancy Grid Mapping (TT-OGM). We adapt and enable accurate and robust pose estimation techniques from 3D SLAM to the world of 2D and mitigate errors to improve map quality using Generative Adversarial Networks (GANs). We introduce a novel data generation method via deep reinforcement learning (DRL) to build datasets large enough for training a GAN for SLAM error correction. We demonstrate our SLAM in real-time on data collected at Loughborough University. We also prove its generalisability on a variety of large complex environments on a collection of large scale well-known 2D occupancy maps. Our novel approach enables the creation of high quality OGMs in complex scenes, far surpassing the capabilities of current SLAM algorithms in terms of quality, accuracy and reliability.

[AI-28] GAN-SLAM: Real-Time GAN Aided Floor Plan Creation Through SLAM

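【Quick read】: This paper addresses inaccuracies in the 2D Occupancy Grid Maps (OGMs) produced by SLAM systems, which arise from motion uncertainty in dynamic settings, significantly degrade map quality, and hinder downstream tasks such as floor plan creation. The key to its solution is a new method called "GAN-SLAM", which uses Generative Adversarial Networks (GANs) to clean and complete occupancy grids during the SLAM process, reducing the impact of noise and errors, and which adapts accurate pose estimation techniques typically used for 3D SLAM into a 2D form to improve the map quality of 2D representations.

Link: https://arxiv.org/abs/2504.19653
Authors: Leon Davies, Baihua Li, Mohamad Saada, Simon Sølvsten, Qinggang Meng
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, preprint conference submission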

Abstract:SLAM is a fundamental component of modern autonomous systems, providing robots and their operators with a deeper understanding of their environment. SLAM systems often encounter challenges due to the dynamic nature of robotic motion, leading to inaccuracies in mapping quality, particularly in 2D representations such as Occupancy Grid Maps. These errors can significantly degrade map quality, hindering the effectiveness of specific downstream tasks such as floor plan creation. To address this challenge, we introduce our novel ‘GAN-SLAM’, a new SLAM approach that leverages Generative Adversarial Networks to clean and complete occupancy grids during the SLAM process, reducing the impact of noise and inaccuracies introduced on the output map. We adapt and integrate accurate pose estimation techniques typically used for 3D SLAM into a 2D form. This enables the quality improvement 3D LiDAR-odometry has seen in recent years to be effective for 2D representations. Our results demonstrate substantial improvements in map fidelity and quality, with minimal noise and errors, affirming the effectiveness of GAN-SLAM for real-world mapping applications within large-scale complex environments. We validate our approach on real-world data operating in real-time, and on famous examples of 2D maps. The improved quality of the output map enables new downstream tasks, such as floor plan drafting, further enhancing the capabilities of autonomous systems. Our novel approach to SLAM offers a significant step forward in the field, improving the usability for SLAM in mapping-based tasks, and offers insight into the usage of GANs for OGM error correction.

[AI-29] Fitness Landscape of Large Language Model-Assisted Automated Algorithm Search

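【Quick read】: This paper addresses the fact that the fitness landscape of LLM-assisted Algorithm Search (LAS) within iterative algorithm search frameworks remains underexplored. The key to its solution is a graph-based approach to modelling and analysing the LAS fitness landscape, in which nodes represent algorithms and edges represent transitions between them. Extensive evaluation across multiple algorithm design tasks and LLMs shows that LAS landscapes are highly multimodal and rugged, and the analysis of structural differences across tasks and models offers theoretical support and practical guidance for improving LAS methods.

Link: https://arxiv.org/abs/2504.19636
Authors: Fei Liu, Qingfu Zhang, Xialiang Tong, Mingxuan Yuan, Kun Mao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: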

Abstract:Large Language Models (LLMs) have demonstrated significant potential in algorithm design. However, when integrated into search frameworks for iterative algorithm search, the underlying fitness landscape, which is critical for understanding search behaviour, remains underexplored. In this paper, we illustrate and analyze the fitness landscape of LLM-assisted Algorithm Search (LAS) using a graph-based approach, where nodes represent algorithms and edges denote transitions between them. We conduct extensive evaluations across six algorithm design tasks and six commonly used LLMs. Our findings reveal that LAS landscapes are highly multimodal and rugged, particularly in combinatorial optimization tasks, with distinct structural variations across tasks and LLMs. For instance, heuristic design tasks exhibit dense clusters of high-performing algorithms, while symbolic regression tasks show sparse, scattered distributions. Additionally, we demonstrate how population size influences exploration-exploitation trade-offs and the evolving trajectory of elite algorithms. These insights not only advance our understanding of LAS landscapes but also provide practical guidance for designing more effective LAS methods.
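
To make the graph view of a fitness landscape concrete, here is a minimal Python sketch that counts local optima, one common indicator of how multimodal a landscape is. It is not the paper's analysis pipeline: the node names and fitness values are made up, edges are treated as undirected for simplicity (an assumption), and `local_optima` is a hypothetical helper.

```python
def local_optima(fitness, edges):
    """Return nodes whose fitness is >= that of every neighbour (maximisation)."""
    neighbours = {node: set() for node in fitness}
    for a, b in edges:
        # Assumption: transitions are symmetric, so each edge links both ways.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return sorted(
        node
        for node, score in fitness.items()
        if all(score >= fitness[n] for n in neighbours[node])
    )


# Hypothetical landscape: five algorithms connected in a chain.
fitness = {"A": 0.2, "B": 0.9, "C": 0.4, "D": 0.7, "E": 0.1}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
optima = local_optima(fitness, edges)
```

In this toy chain there are two peaks separated by a valley, so a greedy search starting near the lower peak would not reach the higher one.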

[AI-30] From Evidence to Belief: A Bayesian Epistemology Approach to Language Models

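【Quick read】: This paper asks whether language models' confidence and responses are consistent with Bayesian epistemology when they are presented with evidence of varying informativeness and reliability. The key to the study is a dataset containing various types of evidence, on which the models' responses and confidence are analysed using verbalized confidence, token probability, and sampling, revealing how the models behave under different evidence types and why they deviate from Bayesian assumptions.

Link: https://arxiv.org/abs/2504.19622
Authors: Minsu Kim, Sangryul Kim, James Thorne
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: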

Abstract:This paper investigates the knowledge of language models from the perspective of Bayesian epistemology. We explore how language models adjust their confidence and responses when presented with evidence with varying levels of informativeness and reliability. To study these properties, we create a dataset with various types of evidence and analyze language models’ responses and confidence using verbalized confidence, token probability, and sampling. We observed that language models do not consistently follow Bayesian epistemology: language models follow the Bayesian confirmation assumption well with true evidence but fail to adhere to other Bayesian assumptions when encountering different evidence types. Also, we demonstrated that language models can exhibit high confidence when given strong evidence, but this does not always guarantee high accuracy. Our analysis also reveals that language models are biased toward golden evidence and show varying performance depending on the degree of irrelevance, helping explain why they deviate from Bayesian assumptions.
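
For readers unfamiliar with the Bayesian baseline the abstract compares against, the sketch below applies Bayes' rule to a binary hypothesis. It only illustrates the general expectation that supporting evidence should raise confidence while uninformative evidence should leave it unchanged; it does not reproduce the paper's dataset or its exact assumptions, and all probabilities are made-up values.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule for a binary hypothesis H and one piece of evidence E."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / evidence


prior = 0.5
# Supporting evidence: E is more likely when H is true than when it is false.
supporting = posterior(prior, p_e_given_h=0.8, p_e_given_not_h=0.2)
# Uninformative evidence: E is equally likely either way.
irrelevant = posterior(prior, p_e_given_h=0.5, p_e_given_not_h=0.5)
```

A model that followed this baseline would report higher confidence after the first kind of evidence and unchanged confidence after the second.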

[AI-31] GVPO: Group Variance Policy Optimization for Large Language Model Post-Training

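【Quick read】: This paper aims to address the training instability of post-training methods for Large Language Models (LLMs), which limits their practical adoption. The key to its solution is a new method, Group Variance Policy Optimization (GVPO), which incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. Its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards, unifying theoretical guarantees with practical adaptability.

Link: https://arxiv.org/abs/2504.19599
Authors: Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, Hui Xiong
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: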

Abstract:Post-training plays a crucial role in refining and aligning large language models to meet specific tasks and human preferences. While recent advancements in post-training techniques, such as Group Relative Policy Optimization (GRPO), leverage increased sampling with relative reward scoring to achieve superior performance, these methods often suffer from training instability that limits their practical adoption. To address this challenge, we present Group Variance Policy Optimization (GVPO). GVPO incorporates the analytical solution to KL-constrained reward maximization directly into its gradient weights, ensuring alignment with the optimal policy. The method provides intuitive physical interpretations: its gradient mirrors the mean squared error between the central distance of implicit rewards and that of actual rewards. GVPO offers two key advantages: (1) it guarantees a unique optimal solution, exactly the KL-constrained reward maximization objective, and (2) it supports flexible sampling distributions that avoid on-policy and importance sampling limitations. By unifying theoretical guarantees with practical adaptability, GVPO establishes a new paradigm for reliable and versatile LLM post-training.
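
The "mean squared error between centered implicit rewards and centered actual rewards" reading can be sketched numerically. The code below is only an interpretation of that one sentence, not the paper's objective or gradient: it assumes a DPO-style implicit reward `beta * (log pi_theta - log pi_ref)`, and the function names, `beta`, and all numbers are hypothetical.

```python
def centered(values):
    """Subtract the group mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def centered_reward_mse(logp_policy, logp_ref, rewards, beta=1.0):
    """MSE between group-centered implicit rewards and group-centered rewards.

    Assumption: implicit reward = beta * (log pi_theta - log pi_ref).
    """
    implicit = [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    diffs = [ci - cr for ci, cr in zip(centered(implicit), centered(rewards))]
    return sum(d * d for d in diffs) / len(diffs)


# Three sampled responses for one prompt (made-up log-probabilities and rewards).
logp_policy = [-1.0, -2.0, -3.0]
logp_ref = [-2.0, -2.0, -2.0]
matched = centered_reward_mse(logp_policy, logp_ref, rewards=[1.0, 0.0, -1.0])
mismatched = centered_reward_mse(logp_policy, logp_ref, rewards=[0.0, 0.0, 0.0])
```

Because both quantities are centered within the group, adding a constant to every reward leaves this value unchanged, which is one intuition for why only relative rewards matter in this sketch.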

[AI-32] Mapping the Italian Telegram Ecosystem

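【Quick read】: This paper addresses the lack of a comprehensive understanding of ideological interactions, toxic discourse, and the targets of hate speech on Telegram, focusing on the Italian-language ecosystem. The key to its solution is a large-scale, multi-faceted analysis of 186 million messages that combines network analysis, large language models, and toxicity detection tools to reveal how thematic communities form, how they align ideologically, and how harmful discourse spreads.

Link: https://arxiv.org/abs/2504.19594
Authors: Lorenzo Alvisi, Serena Tardelli, Maurizio Tesconi
Affiliations: unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: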

Abstract:Telegram has become a major space for political discourse and alternative media. However, its lack of moderation allows misinformation, extremism, and toxicity to spread. While prior research focused on these particular phenomena or topics, these have mostly been examined separately, and a broader understanding of the Telegram ecosystem is still missing. In this work, we fill this gap by conducting a large-scale analysis of the Italian Telegram sphere, leveraging a dataset of 186 million messages from 13,151 chats collected in 2023. Using network analysis, Large Language Models, and toxicity detection tools, we examine how different thematic communities form, align ideologically, and engage in harmful discourse within the Italian cultural context. Results show strong thematic and ideological homophily. We also identify mixed ideological communities where far-left and far-right rhetoric coexist on particular geopolitical issues. Beyond political analysis, we find that toxicity, rather than being isolated in a few extreme chats, appears widely normalized within highly toxic communities. Moreover, we find that Italian discourse primarily targets Black people, Jews, and gay individuals independently of the topic. Finally, we uncover a common trend of intra-national hostility, where Italians often attack other Italians, reflecting regional and intra-regional cultural conflicts that can be traced back to old historical divisions. This study provides the first large-scale mapping of the Italian Telegram ecosystem, offering insights into ideological interactions, toxicity, and identity-targets of hate and contributing to research on online toxicity across different cultural and linguistic contexts on Telegram.

[AI-33] Graph Reinforcement Learning for QoS-Aware Load Balancing in Open Radio Access Networks

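【Quick read】: This paper aims to address Quality-of-Service (QoS) guarantees in next-generation wireless cellular networks under cell congestion, in particular how to optimize the performance of Guaranteed Bit Rate (GBR) and Best Effort (BE) traffic in a multi-band Open Radio Access Network (O-RAN). The key to its solution is a QoS-aware Load Balancing (LB) approach based on Graph Reinforcement Learning (GRL): LB is modeled as a Markov Decision Process, and the agent is trained with an architecture that combines a Graph Neural Network (GNN) with a Deep Q Network (DQN), yielding a policy that is invariant to node ordering, adapts to different network sizes, and accounts for spatial node dependencies.

Link: https://arxiv.org/abs/2504.19499
Authors: Omid Semiari, Hosein Nikopour, Shilpa Talwar
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: To be published in the proceedings of the 2025 IEEE International Conference on Communications (ICC), Seventh Workshop on Data Driven Intelligence for Networks and Systems (DDINS)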

Abstract:Next-generation wireless cellular networks are expected to provide unparalleled Quality-of-Service (QoS) for emerging wireless applications, necessitating strict performance guarantees, e.g., in terms of link-level data rates. A critical challenge in meeting these QoS requirements is the prevention of cell congestion, which involves balancing the load to ensure sufficient radio resources are available for each cell to serve its designated User Equipments (UEs). In this work, a novel QoS-aware Load Balancing (LB) approach is developed to optimize the performance of Guaranteed Bit Rate (GBR) and Best Effort (BE) traffic in a multi-band Open Radio Access Network (O-RAN) under QoS and resource constraints. The proposed solution builds on Graph Reinforcement Learning (GRL), a powerful framework at the intersection of Graph Neural Network (GNN) and RL. The QoS-aware LB is modeled as a Markov Decision Process, with states represented as graphs. QoS considerations are integrated into both state representations and reward signal design. The LB agent is then trained using an off-policy dueling Deep Q Network (DQN) that leverages a GNN-based architecture. This design ensures the LB policy is invariant to the ordering of nodes (UE or cell), flexible in handling various network sizes, and capable of accounting for spatial node dependencies in LB decisions. Performance of the GRL-based solution is compared with two baseline methods. Results show substantial performance gains, including a 53% reduction in QoS violations and a fourfold increase in the 5th percentile rate for BE traffic.
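
Two ingredients named in this abstract can be shown in a few lines of Python: an order-invariant readout over graph nodes and the standard dueling combination Q(a) = V + A(a) - mean(A). This is a generic sketch, not the paper's network: mean pooling stands in for the GNN (an assumption), and the feature vectors, value, and advantages are made-up numbers.

```python
def mean_pool(node_features):
    """Order-invariant readout: average each feature over all nodes."""
    n = len(node_features)
    dim = len(node_features[0])
    return [sum(node[i] for node in node_features) / n for i in range(dim)]


def dueling_q(value, advantages):
    """Standard dueling combination: Q(a) = V + A(a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Hypothetical per-node features (e.g. one row per UE or cell).
nodes = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
shuffled = [nodes[2], nodes[0], nodes[1]]  # same nodes, different order
pooled = mean_pool(nodes)
pooled_shuffled = mean_pool(shuffled)
q_values = dueling_q(value=2.0, advantages=[1.0, 0.0, -1.0])
```

The pooled vector is identical for both orderings, which is the property that lets a policy built on such a readout ignore how nodes happen to be indexed.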

[AI-34] DISCO: learning to DISCover an evolution Operator for multi-physics-agnostic prediction

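【Quick read】: This paper addresses the problem of predicting the next state of a dynamical system governed by unknown temporal Partial Differential Equations (PDEs) when only a short trajectory is available. The key to its solution is DISCO, a model that uses a large hypernetwork to process the short trajectory and generate the parameters of a much smaller operator network, which then predicts the next state through time integration. The framework decouples dynamics estimation (discovering an evolution operator from a short trajectory) from state prediction (evolving that operator), enabling more efficient and accurate prediction.

Link: https://arxiv.org/abs/2504.19496
Authors: Rudy Morel, Jiequn Han, Edouard Oyallon
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: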

Abstract:We address the problem of predicting the next state of a dynamical system governed by unknown temporal partial differential equations (PDEs) using only a short trajectory. While standard transformers provide a natural black-box solution to this task, the presence of a well-structured evolution operator in the data suggests a more tailored and efficient approach. Specifically, when the PDE is fully known, classical numerical solvers can evolve the state accurately with only a few parameters. Building on this observation, we introduce DISCO, a model that uses a large hypernetwork to process a short trajectory and generate the parameters of a much smaller operator network, which then predicts the next state through time integration. Our framework decouples dynamics estimation (i.e., DISCovering an evolution operator from a short trajectory) from state prediction (i.e., evolving this operator). Experiments show that pretraining our model on diverse physics datasets achieves state-of-the-art performance while requiring significantly fewer epochs. Moreover, it generalizes well and remains competitive when fine-tuned on downstream tasks.
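
The decoupling described in this abstract (first estimate an evolution operator from a short trajectory, then integrate it in time) can be illustrated with a deliberately tiny Python example. Nothing here is the DISCO model: the learned hypernetwork is replaced by a closed-form least-squares estimator, the operator is a single scalar `theta` in du/dt = theta * u, and all names and numbers are hypothetical.

```python
def estimate_operator(trajectory, dt):
    """Stand-in for the hypernetwork: fit theta in du/dt = theta * u
    by least squares on finite differences of a scalar trajectory."""
    num = 0.0
    den = 0.0
    for u_now, u_next in zip(trajectory, trajectory[1:]):
        du_dt = (u_next - u_now) / dt
        num += du_dt * u_now
        den += u_now * u_now
    return num / den


def predict_next(state, theta, dt, substeps=1):
    """Evolve the state with explicit Euler steps of the estimated operator."""
    h = dt / substeps
    for _ in range(substeps):
        state = state + h * theta * state
    return state


dt = 0.1
trajectory = [1.0, 0.9, 0.81, 0.729]  # toy data: each state is 0.9x the last
theta = estimate_operator(trajectory, dt)                # step 1: estimate dynamics
next_state = predict_next(trajectory[-1], theta, dt)     # step 2: integrate
```

The point of the split is that the second step needs only the few estimated parameters, not the trajectory itself.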

[AI-35] An Automated Reinforcement Learning Reward Design Framework with Large Language Model for Cooperative Platoon Coordination

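【Quick read】: This paper aims to address the difficulty of efficiently designing well-performing reward functions to guide Reinforcement Learning (RL) training in vehicle platoon coordination. Conventional approaches rely on manually designed reward functions and suffer from varying coordination goals, complex decision problems, and time-consuming trial and error. The paper proposes a Large Language Model (LLM)-based Platoon Coordination Reward Design (PCRD) framework, whose key idea is to automate the generation and refinement of reward functions through LLM-driven initialization and iterative optimization, improving the performance of RL agents on complex platoon coordination tasks.

Link: https://arxiv.org/abs/2504.19480
Authors: Dixiao Wei, Peng Yi, Jinlong Lei, Yiguang Hong, Yuchuan Du
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: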

Abstract:Reinforcement Learning (RL) has demonstrated excellent decision-making potential in platoon coordination problems. However, due to the variability of coordination goals, the complexity of the decision problem, and the time-consumption of trial-and-error in manual design, finding a well-performing reward function to guide RL training to solve complex platoon coordination problems remains challenging. In this paper, we formally define the Platoon Coordination Reward Design Problem (PCRDP), extending the RL-based cooperative platoon coordination problem to incorporate automated reward function generation. To address PCRDP, we propose a Large Language Model (LLM)-based Platoon coordination Reward Design (PCRD) framework, which systematically automates reward function discovery through LLM-driven initialization and iterative optimization. In this method, the LLM first initializes reward functions based on environment code and task requirements with an Analysis and Initial Reward (AIR) module, and then iteratively optimizes them based on training feedback with an evolutionary module. The AIR module guides the LLM to deepen its understanding of code and tasks through a chain of thought, effectively mitigating hallucination risks in code generation. The evolutionary module fine-tunes and reconstructs the reward function, achieving a balance between exploration diversity and convergence stability for training. To validate our approach, we establish six challenging coordination scenarios with varying complexity levels within the Yangtze River Delta transportation network simulation. Comparative experimental results demonstrate that RL agents utilizing PCRD-generated reward functions consistently outperform human-engineered reward functions, achieving an average of 10% higher performance metrics in all scenarios.

[AI-36] A Real-Time Gesture-Based Control Framework

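【Quick read】: This paper addresses how to dynamically adapt audio and music to human movement in real time, so as to strengthen the interaction between performers and music. The key to its solution is a framework that combines computer vision and machine learning to analyse live video input, recognize and interpret user gestures, and map them to sound control commands, enabling real-time manipulation of audio elements such as tempo, pitch, effects, and playback sequence. User-independent functionality can be achieved with only a small number of training samples, which makes the system more adaptable and practical.

Link: https://arxiv.org/abs/2504.19460
Authors: Mahya Khazaei, Ali Bahrani, George Tzanetakis
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, 2025 International Computer Music Conference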

Abstract:We introduce a real-time, human-in-the-loop gesture control framework that can dynamically adapt audio and music based on human movement by analyzing live video input. By creating a responsive connection between visual and auditory stimuli, this system enables dancers and performers to not only respond to music but also influence it through their movements. Designed for live performances, interactive installations, and personal use, it offers an immersive experience where users can shape the music in real time. The framework integrates computer vision and machine learning techniques to track and interpret motion, allowing users to manipulate audio elements such as tempo, pitch, effects, and playback sequence. With ongoing training, it achieves user-independent functionality, requiring as few as 50 to 80 samples to label simple gestures. This framework combines gesture training, cue mapping, and audio manipulation to create a dynamic, interactive experience. Gestures are interpreted as input signals, mapped to sound control commands, and used to naturally adjust music elements, showcasing the seamless interplay between human interaction and machine response.

[AI-37] GSFF-SLAM: 3D Semantic Gaussian Splatting SLAM via Feature Field

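【Quick read】: This paper aims to address the limited performance of existing semantic SLAM systems in real-world environments, where their reliance on 2D ground-truth priors exposes them to sparse and noisy supervision signals. The key to its solution is GSFF-SLAM, a dense semantic SLAM system based on 3D Gaussian Splatting that uses feature fields to jointly render appearance, geometry, and N-dimensional semantic features. By optimizing feature gradients independently, it supports various forms of 2D priors, in particular sparse and noisy signals, and improves tracking accuracy and photorealistic rendering quality.

Link: https://arxiv.org/abs/2504.19409
Authors: Zuxing Lu, Xin Yuan, Shaowen Yang, Jingyu Liu, Jiawei Wang, Changyin Sun
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: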

Abstract:Semantic-aware 3D scene reconstruction is essential for autonomous robots to perform complex interactions. Semantic SLAM, an online approach that integrates pose tracking, geometric reconstruction, and semantic mapping into a unified framework, shows significant potential. However, existing systems, which rely on 2D ground truth priors for supervision, are often limited by the sparsity and noise of these signals in real-world environments. To address this challenge, we propose GSFF-SLAM, a novel dense semantic SLAM system based on 3D Gaussian Splatting that leverages feature fields to achieve joint rendering of appearance, geometry, and N-dimensional semantic features. By independently optimizing feature gradients, our method supports semantic reconstruction using various forms of 2D priors, particularly sparse and noisy signals. Experimental results demonstrate that our approach outperforms previous methods in both tracking accuracy and photorealistic rendering quality. When utilizing 2D ground truth priors, GSFF-SLAM achieves state-of-the-art semantic segmentation performance with 95.03% mIoU, while achieving up to 2.9× speedup with only marginal performance degradation.

[AI-38] LLMs for Engineering: Teaching Models to Design High Powered Rockets

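【Quick read】: This paper explores the potential of Large Language Models (LLMs) in physical engineering, specifically high-powered rocket design. Its key solution is to combine LLMs with Reinforcement Learning (RL) training to improve their performance on complex engineering optimization tasks, so that they outperform current state-of-the-art foundation models and human experts.

Link: https://arxiv.org/abs/2504.19394
Authors: Toby Simonds
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: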

Abstract:Large Language Models (LLMs) have transformed software engineering, but their application to physical engineering domains remains underexplored. This paper evaluates LLMs’ capabilities in high-powered rocketry design through RocketBench, a benchmark connecting LLMs to high-fidelity rocket simulations. We test models on two increasingly complex design tasks: target altitude optimization and precision landing challenges. Our findings reveal that while state-of-the-art LLMs demonstrate strong baseline engineering knowledge, they struggle to iterate on their designs when given simulation results and ultimately plateau below human performance levels. However, when enhanced with reinforcement learning (RL), we show that a 7B parameter model outperforms both SoTA foundation models and human experts. This research demonstrates that RL-trained LLMs can serve as effective tools for complex engineering optimization, potentially transforming engineering domains beyond software development.

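[AI-39] From Inductive to Deductive: LLMs-Based Qualitative Data Analysis in Requirements Engineering

Quick read: This paper addresses the difficulty of turning stakeholder input into consistent software designs in Requirements Engineering (RE), in particular the fact that traditional Qualitative Data Analysis (QDA) is time-consuming and depends on manual effort. The key to its approach is using Large Language Models (LLMs) such as GPT-4, Mistral, and LLaMA-2 to make QDA tasks more efficient and accurate. In deductive annotation tasks, GPT-4 shows high agreement with human analysts, with Cohen's Kappa scores above 0.7, and the resulting structured labels provide requirements traceability and support systematic software design.

Link: https://arxiv.org/abs/2504.19384
Authors: Syed Tauhid Ullah Shah, Mohamad Hussein, Ann Barcomb, Mohammad Moshirpour
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: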

Abstract:Requirements Engineering (RE) is essential for developing complex and regulated software projects. Given the challenges in transforming stakeholder inputs into consistent software designs, Qualitative Data Analysis (QDA) provides a systematic approach to handling free-form data. However, traditional QDA methods are time-consuming and heavily reliant on manual effort. In this paper, we explore the use of Large Language Models (LLMs), including GPT-4, Mistral, and LLaMA-2, to improve QDA tasks in RE. Our study evaluates LLMs’ performance in inductive (zero-shot) and deductive (one-shot, few-shot) annotation tasks, revealing that GPT-4 achieves substantial agreement with human analysts in deductive settings, with Cohen’s Kappa scores exceeding 0.7, while zero-shot performance remains limited. Detailed, context-rich prompts significantly improve annotation accuracy and consistency, particularly in deductive scenarios, and GPT-4 demonstrates high reliability across repeated runs. These findings highlight the potential of LLMs to support QDA in RE by reducing manual effort while maintaining annotation quality. The structured labels automatically provide traceability of requirements and can be directly utilized as classes in domain models, facilitating systematic software design.
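
For readers unfamiliar with the agreement metric mentioned above, here is a minimal sketch of how Cohen's Kappa is computed for two annotators. The label names and the two toy annotation lists are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations (illustrative only).
human = ["functional", "functional", "quality", "constraint", "quality", "functional"]
model = ["functional", "functional", "quality", "quality", "quality", "functional"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

In this toy case the annotators agree on five of six items, and kappa comes out just above the 0.7 level that the abstract describes as substantial agreement.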

[AI-40] Rethinking Label-specific Features for Label Distribution Learning

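Quick read: This paper aims to fix the inaccurate instance representations that label ambiguity causes in Label Distribution Learning (LDL). The key idea is to introduce Structural Anchor Points (SAPs) that capture interactions among different clusters, improving the LIFT-based strategy for constructing Label-specific Features (LSFs). The resulting LIFT-SAP method integrates the distance and direction of each instance relative to the SAPs, making the feature representation more robust and comprehensive. The paper further proposes the LDL-LIFT-SAP algorithm, which unifies the label description degrees predicted from different LSF spaces into a single consistent label distribution and thereby improves overall performance.

Link: https://arxiv.org/abs/2504.19374
Authors: Suping Xu, Chuyi Dai, Lin Shang, Changbin Shao, Xibei Yang, Witold Pedrycz
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 11 Pages, 5 figures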

Abstract:Label distribution learning (LDL) is an emerging learning paradigm designed to capture the relative importance of labels for each instance. Label-specific features (LSFs), constructed by LIFT, have proven effective for learning tasks with label ambiguity by leveraging clustering-based prototypes for each label to re-characterize instances. However, directly introducing LIFT into LDL tasks can be suboptimal, as the prototypes it collects primarily reflect intra-cluster relationships while neglecting interactions among distinct clusters. Additionally, constructing LSFs using multi-perspective information, rather than relying solely on Euclidean distance, provides a more robust and comprehensive representation of instances, mitigating noise and bias that may arise from a single distance perspective. To address these limitations, we introduce Structural Anchor Points (SAPs) to capture inter-cluster interactions. This leads to a novel LSFs construction strategy, LIFT-SAP, which enhances LIFT by integrating both distance and direction information of each instance relative to SAPs. Furthermore, we propose a novel LDL algorithm, Label Distribution Learning via Label-specifIc FeaTure with SAPs (LDL-LIFT-SAP), which unifies multiple label description degrees predicted from different LSF spaces into a cohesive label distribution. Extensive experiments on 15 real-world datasets demonstrate the effectiveness of LIFT-SAP over LIFT, as well as the superiority of LDL-LIFT-SAP compared to seven other well-established algorithms.

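[AI-41] Doxing via the Lens: Revealing Privacy Leakage in Image Geolocation for Agentic Multi-Modal Large Reasoning Model

Quick read: This paper addresses the privacy leakage that Generative AI can cause in multi-modal large reasoning models, in particular the risk that their visual reasoning abilities inadvertently reveal a user's geographic location. The key to its approach is building a dataset of 50 real-world images, systematically evaluating ChatGPT o3 on image geolocation, and identifying the visual cues that most affect the model's inference accuracy, such as street layout and front yard design. Targeted occlusion experiments also confirm that masking critical features effectively reduces geolocation accuracy, which informs the design of future privacy protections.

Link: https://arxiv.org/abs/2504.19373
Authors: Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Yue Zhao, Zhen Xiang, Chaowei Xiao
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: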

Abstract:The increasing capabilities of agentic multi-modal large reasoning models, such as ChatGPT o3, have raised critical concerns regarding privacy leakage through inadvertent image geolocation. In this paper, we conduct the first systematic and controlled study on the potential privacy risks associated with visual reasoning abilities of ChatGPT o3. We manually collect and construct a dataset comprising 50 real-world images that feature individuals alongside privacy-relevant environmental elements, capturing realistic and sensitive scenarios for analysis. Our experimental evaluation reveals that ChatGPT o3 can predict user locations with high precision, achieving street-level accuracy (within one mile) in 60% of cases. Through analysis, we identify key visual cues, including street layout and front yard design, that significantly contribute to the model inference success. Additionally, targeted occlusion experiments demonstrate that masking critical features effectively mitigates geolocation accuracy, providing insights into potential defense mechanisms. Our findings highlight an urgent need for privacy-aware development for agentic multi-modal large reasoning models, particularly in applications involving private imagery.

[AI-42] Neurosymbolic Association Rule Mining from Tabular Data

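Quick read: This paper addresses the rule explosion that high-dimensional datasets cause in Association Rule Mining (ARM), which increases execution time and hurts downstream task performance. The key to its solution is Aerial+, a novel neurosymbolic ARM method that uses an under-complete autoencoder to build a neural representation of the data, capturing associations between features, and extracts rules from it through the model's reconstruction mechanism, yielding more concise, higher-quality rule sets.

Link: https://arxiv.org/abs/2504.19354
Authors: Erkan Karabulut, Paul Groth, Victoria Degeler
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: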

Abstract:Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model’s reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.
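
As background for the rule-quality discussion above, the sketch below computes the two classic ARM measures, support and confidence, on a tiny table. This is not Aerial+ itself (the autoencoder-based extraction is not reproduced here); the rows and the `feature=value` encoding are assumptions made for illustration.

```python
def support(rows, itemset):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    sup_antecedent = support(rows, antecedent)
    if sup_antecedent == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = set(antecedent) | set(consequent)
    return support(rows, both) / sup_antecedent


# Hypothetical tabular data, one set of "feature=value" items per row.
rows = [
    {"color=red", "size=small"},
    {"color=red", "size=small"},
    {"color=red", "size=large"},
    {"color=blue", "size=large"},
]
rule_support = support(rows, {"color=red", "size=small"})
rule_confidence = confidence(rows, {"color=red"}, {"size=small"})
print(rule_support, round(rule_confidence, 3))
```

With many features, the number of candidate itemsets grows combinatorially, which is the rule explosion the abstract refers to.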

[AI-43] Flow Along the K-Amplitude for Generative Modeling

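Quick read: This paper addresses the trade-off between multi-scale information control and generation quality in conventional generative models, in particular the need to precisely steer information at different scales in tasks such as image generation and molecule assembly. The key to its solution is K-Flow, a novel generative learning paradigm built on the K-amplitude decomposition, which performs flow matching with the scaling parameter treated as time and thereby enables controllable generation across scales.

Link: https://arxiv.org/abs/2504.19353
Authors: Weitao Du, Shuning Chang, Jiasheng Tang, Yu Rong, Fan Wang, Shengchao Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: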

Abstract:In this work, we propose a novel generative learning paradigm, K-Flow, an algorithm that flows along the K-amplitude. Here, K is a scaling parameter that organizes frequency bands (or projected coefficients), and amplitude describes the norm of such projected coefficients. By incorporating the K-amplitude decomposition, K-Flow enables flow matching across the scaling parameter as time. We discuss three venues and six properties of K-Flow, from theoretical foundations, energy and temporal dynamics, and practical applications, respectively. Specifically, from the practical usage perspective, K-Flow allows steerable generation by controlling the information at different scales. To demonstrate the effectiveness of K-Flow, we conduct experiments on unconditional image generation, class-conditional image generation, and molecule assembly generation. Additionally, we conduct three ablation studies to demonstrate how K-Flow steers scaling parameter to effectively control the resolution of image generation.

[AI-44] PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies ICRA2025

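Quick read: This paper tackles robust dexterous manipulation in unstructured domestic environments, in particular the shortcomings of existing haptic-oblivious control strategies, which rely only on vision and proprioception and struggle with occlusions, visual complexity, and precise contact interaction control. The key to its solution is PolyTouch, a novel robot finger that integrates camera-based tactile sensing, acoustic sensing, and peripheral visual sensing and provides high-resolution tactile feedback across multiple temporal scales, which makes learning complex manipulation tasks more efficient.

Link: https://arxiv.org/abs/2504.19341
Authors: Jialiang Zhao, Naveen Kuppuswamy, Siyuan Feng, Benjamin Burchfiel, Edward Adelson
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Nominated for the best paper award at ICRA 2025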

Abstract:Achieving robust dexterous manipulation in unstructured domestic environments remains a significant challenge in robotics. Even with state-of-the-art robot learning methods, haptic-oblivious control strategies (i.e. those relying only on external vision and/or proprioception) often fall short due to occlusions, visual complexities, and the need for precise contact interaction control. To address these limitations, we introduce PolyTouch, a novel robot finger that integrates camera-based tactile sensing, acoustic sensing, and peripheral visual sensing into a single design that is compact and durable. PolyTouch provides high-resolution tactile feedback across multiple temporal scales, which is essential for efficiently learning complex manipulation tasks. Experiments demonstrate an at least 20-fold increase in lifespan over commercial tactile sensors, with a design that is both easy to manufacture and scalable. We then use this multi-modal tactile feedback along with visuo-proprioceptive observations to synthesize a tactile-diffusion policy from human demonstrations; the resulting contact-aware control policy significantly outperforms haptic-oblivious policies in multiple contact-aware manipulation policies. This paper highlights how effectively integrating multi-modal contact sensing can hasten the development of effective contact-aware manipulation policies, paving the way for more reliable and versatile domestic robots. More information can be found at this https URL

[AI-45] NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic AI

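Quick read: This paper addresses the performance bottlenecks that existing hardware (e.g., CPUs, GPUs, TPUs) faces when running Neuro-Symbolic AI (NSAI) workloads, which stem from heterogeneous computing kernels, high memory intensity, and unique memory access patterns, as well as the incompatibility with existing machine learning accelerators caused by the variety of operation types and scales in current NSAI algorithms. The key to its solution is NSFlow, an FPGA-based acceleration framework. Its core features are a design architecture generator that identifies workload data dependencies and produces optimized dataflow architectures, and a reconfigurable array with flexible compute units, re-organizable memory, and mixed-precision capabilities, enabling efficient, scalable, and versatile NSAI acceleration.

Link: https://arxiv.org/abs/2504.19323
Authors: Hanchen Yang, Zishen Wan, Ritik Raj, Joongun Park, Ziwei Li, Ananda Samajdar, Arijit Raychowdhury, Tushar Krishna
Affiliation: unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: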

Abstract:Neuro-Symbolic AI (NSAI) is an emerging paradigm that integrates neural networks with symbolic reasoning to enhance the transparency, reasoning capabilities, and data efficiency of AI systems. Recent NSAI systems have gained traction due to their exceptional performance in reasoning tasks and human-AI collaborative scenarios. Despite these algorithmic advancements, executing NSAI tasks on existing hardware (e.g., CPUs, GPUs, TPUs) remains challenging, due to their heterogeneous computing kernels, high memory intensity, and unique memory access patterns. Moreover, current NSAI algorithms exhibit significant variation in operation types and scales, making them incompatible with existing ML accelerators. These challenges highlight the need for a versatile and flexible acceleration framework tailored to NSAI workloads. In this paper, we propose NSFlow, an FPGA-based acceleration framework designed to achieve high efficiency, scalability, and versatility across NSAI systems. NSFlow features a design architecture generator that identifies workload data dependencies and creates optimized dataflow architectures, as well as a reconfigurable array with flexible compute units, re-organizable memory, and mixed-precision capabilities. Evaluating across NSAI workloads, NSFlow achieves 31x speedup over Jetson TX2, more than 2x over GPU, 8x speedup over TPU-like systolic array, and more than 3x over Xilinx DPU. NSFlow also demonstrates enhanced scalability, with only 4x runtime increase when symbolic workloads scale by 150x. To the best of our knowledge, NSFlow is the first framework to enable real-time generalizable NSAI algorithms acceleration, demonstrating a promising solution for next-generation cognitive systems.

[AI-46] Logic-Based Artificial Intelligence Algorithms Supporting Categorical Semantics

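Quick read: This paper addresses how to design artificial intelligent agents that reason symbolically about objects more richly structured than sets. The key to its solution is categorical logic: using Johnstone's sequent calculus of terms- and formulae-in-context, it develops forward chaining and normal form algorithms for the rules of Horn logic in cartesian categories, and adapts first-order unification to support multi-sorted theories, contexts, and fragments of first-order logic. This makes reasoning possible in semantic categories that do not support classical logic or even all of its connectives.

Link: https://arxiv.org/abs/2504.19320
Authors: Ralph Wojtowicz
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages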

Abstract:This paper seeks to apply categorical logic to the design of artificial intelligent agents that reason symbolically about objects more richly structured than sets. Using Johnstone’s sequent calculus of terms- and formulae-in-context, we develop forward chaining and normal form algorithms for reasoning about objects in cartesian categories with the rules for Horn logic. We also adapt first-order unification to support multi-sorted theories, contexts, and fragments of first-order logic. The significance of these reformulations rests in the fact that they can be applied to reasoning about objects in semantic categories that do not support classical logic or even all its connectives.
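
To make the term "forward chaining" concrete, here is a minimal sketch for propositional Horn clauses over ordinary sets. It is a deliberately simplified, classical version and does not reproduce the paper's categorical algorithms (no sorts, contexts, or unification); the facts and rules are invented for illustration.

```python
def forward_chain(facts, rules):
    """Derive every atom entailed by `facts` under Horn rules.

    Each rule is a pair (body, head): if every atom in `body` is known,
    `head` becomes known. Repeats until no rule adds anything new.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


# Hypothetical rule base (illustrative only).
rules = [
    (("a", "b"), "c"),  # a AND b -> c
    (("c",), "d"),      # c -> d
    (("e",), "f"),      # e -> f (never fires, since e is not derivable)
]
derived = forward_chain({"a", "b"}, rules)
print(sorted(derived))
```

The loop terminates because each pass either adds a new atom or stops, and the set of atoms mentioned in the rules is finite.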

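[AI-47] Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling

Quick read: This paper aims to generate accurate function calls efficiently in resource-constrained settings. Large Language Models (LLMs) can automate this process, but they are computationally expensive and hard to deploy on edge devices. The authors therefore study Small Language Models (SLMs) as an alternative, since they are computationally efficient, respond quickly, and suit edge deployment. The key to the approach is evaluating how well SLMs generate function calls across diverse domains under zero-shot, few-shot, and fine-tuning setups, and using prompt injection experiments to analyze model robustness, which yields practical guidance for real-world use.

Link: https://arxiv.org/abs/2504.19277
Authors: Ishan Kavathekar, Raghav Donakanti, Ponnurangam Kumaraguru, Karthik Vaidhyanathan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted at EASE 2025 AI Models and Data Evaluation track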

Abstract:Function calling is a complex task with widespread applications in domains such as information retrieval, software engineering and automation. For example, a query to book the shortest flight from New York to London on January 15 requires identifying the correct parameters to generate accurate function calls. Large Language Models (LLMs) can automate this process but are computationally expensive and impractical in resource-constrained settings. In contrast, Small Language Models (SLMs) can operate efficiently, offering faster response times, and lower computational demands, making them potential candidates for function calling on edge devices. In this exploratory empirical study, we evaluate the efficacy of SLMs in generating function calls across diverse domains using zero-shot, few-shot, and fine-tuning approaches, both with and without prompt injection, while also providing the finetuned models to facilitate future applications. Furthermore, we analyze the model responses across a range of metrics, capturing various aspects of function call generation. Additionally, we perform experiments on an edge device to evaluate their performance in terms of latency and memory usage, providing useful insights into their practical applicability. Our findings show that while SLMs improve from zero-shot to few-shot and perform best with fine-tuning, they struggle significantly with adhering to the given output format. Prompt injection experiments further indicate that the models are generally robust and exhibit only a slight decline in performance. While SLMs demonstrate potential for the function call generation task, our results also highlight areas that need further refinement for real-time functioning.
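
The abstract notes that SLMs often fail to follow the required output format. The sketch below shows one way such a check could look, assuming the model is asked to reply with a JSON object of the form `{"name": ..., "arguments": {...}}`. The `book_flight` schema, its argument names, and the `validate_call` helper are hypothetical and are not taken from the paper or its benchmark.

```python
import json

# Hypothetical function schema: required argument names and their types.
BOOK_FLIGHT = {
    "name": "book_flight",
    "required": {"origin": str, "destination": str, "date": str},
}


def validate_call(raw, schema):
    """Check a raw model reply against the schema; return (ok, reason)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "not a JSON object"
    if call.get("name") != schema["name"]:
        return False, "wrong function name"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "missing arguments object"
    for key, expected_type in schema["required"].items():
        if key not in args:
            return False, f"missing argument: {key}"
        if not isinstance(args[key], expected_type):
            return False, f"wrong type for: {key}"
    return True, "ok"


good = ('{"name": "book_flight", "arguments": {"origin": "New York", '
        '"destination": "London", "date": "January 15"}}')
chatty = "Sure! book_flight(New York, London)"
print(validate_call(good, BOOK_FLIGHT))
print(validate_call(chatty, BOOK_FLIGHT))
```

A check like this only covers format adherence; whether the argument values are semantically correct needs a separate comparison against a reference call.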

[AI-48] Balancing Creativity and Automation: The Influence of AI on Modern Film Production and Dissemination

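Quick read: This paper addresses the ethical and practical challenges raised by the use of Artificial Intelligence (AI) in film production, in particular how to define the relationship between human creators and AI, how to balance creativity with automation, and how to set ethical guidelines. The key to its solution is positioning AI as an “embodiment tool” rather than an independent “alterity partner”, which preserves human authorship and artistic integrity, together with concrete measures such as international regulatory frameworks and a Human Control Index (HCI) to quantify AI involvement.

Link: https://arxiv.org/abs/2504.19275
Authors: Yiren Xu
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 19 pages, 1 figure, 2 tables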

Abstract:The integration of Artificial Intelligence (AI) into film production has revolutionized efficiency and creativity, yet it simultaneously raises critical ethical and practical challenges. This study explores the dual impact of AI on modern cinema through three objectives: defining the optimal human-AI relationship, balancing creativity with automation, and developing ethical guidelines. By employing a mixed-method approach combining theoretical frameworks (auteur theory, human-technology relations) and case studies (The Safe Zone, Fast & Furious 7, The Brutalist), the research reveals that positioning AI as an “embodiment tool” rather than an independent “alterity partner” preserves human authorship and artistic integrity. Key findings highlight the risks of surveillance capitalism in AI-driven markets and the ethical dilemmas of deepfake technology. The study concludes with actionable recommendations, including international regulatory frameworks and a Human Control Index (HCI) to quantify AI involvement. These insights aim to guide filmmakers, policymakers, and scholars in navigating the evolving AI-cinema landscape while safeguarding cultural diversity and ethical standards.

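[AI-49] TeleSparse: Practical Privacy-Preserving Verification of Deep Neural Networks

Quick read: This paper addresses the difficulty of verifying the integrity of deep learning inference without access to model weights or sensitive training data, in particular the heavy computational overhead of applying Zero-knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs) to modern neural networks such as transformers and large vision models. The key to its solution is TeleSparse, a ZK-friendly post-processing mechanism built on two strategies: sparsifying the neural network to reduce circuit constraints, which improves proof efficiency without compromising accuracy or security, and optimizing activation ranges through neural teleportation to shrink the lookup tables required for non-linear functions. With an accuracy loss of about 1%, the method reduces prover memory usage by 67% and proof generation time by 46%.

Link: https://arxiv.org/abs/2504.19274
Authors: Mohammad M Maheri, Hamed Haddadi, Alex Davidson
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: This paper has been accepted to the Privacy Enhancing Technologies Symposium (PETS) 2025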

Abstract:Verification of the integrity of deep learning inference is crucial for understanding whether a model is being applied correctly. However, such verification typically requires access to model weights and (potentially sensitive or private) training data. So-called Zero-knowledge Succinct Non-Interactive Arguments of Knowledge (ZK-SNARKs) would appear to provide the capability to verify model inference without access to such sensitive data. However, applying ZK-SNARKs to modern neural networks, such as transformers and large vision models, introduces significant computational overhead. We present TeleSparse, a ZK-friendly post-processing mechanism to produce practical solutions to this problem. TeleSparse tackles two fundamental challenges inherent in applying ZK-SNARKs to modern neural networks: (1) Reducing circuit constraints: Over-parameterized models result in numerous constraints for ZK-SNARK verification, driving up memory and proof generation costs. We address this by applying sparsification to neural network models, enhancing proof efficiency without compromising accuracy or security. (2) Minimizing the size of lookup tables required for non-linear functions, by optimizing activation ranges through neural teleportation, a novel adaptation for narrowing activation functions’ range. TeleSparse reduces prover memory usage by 67% and proof generation time by 46% on the same model, with an accuracy trade-off of approximately 1%. We implement our framework using the Halo2 proving system and demonstrate its effectiveness across multiple architectures (Vision-transformer, ResNet, MobileNet) and datasets (ImageNet, CIFAR-10, CIFAR-100). This work opens new directions for ZK-friendly model design, moving toward scalable, resource-efficient verifiable deep learning.

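[AI-50] The Convergent Ethics of AI? Analyzing Moral Foundation Priorities in Large Language Models with a Multi-Framework Approach

Quick read: This paper addresses how to systematically assess the ethical reasoning capabilities of Large Language Models (LLMs) deployed in consequential decision-making contexts. The key to its solution is the Priorities in Reasoning and Intrinsic Moral Evaluation (PRIME) framework, which evaluates the moral priorities of LLMs along core ethical dimensions: consequentialist-deontological reasoning, moral foundations theory, and Kohlberg's developmental stages. The study uses a dual-protocol approach that combines direct questioning with analysis of responses to established ethical dilemmas, revealing shared patterns and limitations in the moral judgments of current LLMs.

Link: https://arxiv.org/abs/2504.19255
Authors: Chad Coleman, W. Russell Neuman, Ali Dasdan, Safinah Ali, Manan Shah
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 25 pages, 8 figures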

Abstract:As large language models (LLMs) are increasingly deployed in consequential decision-making contexts, systematically assessing their ethical reasoning capabilities becomes a critical imperative. This paper introduces the Priorities in Reasoning and Intrinsic Moral Evaluation (PRIME) framework–a comprehensive methodology for analyzing moral priorities across foundational ethical dimensions including consequentialist-deontological reasoning, moral foundations theory, and Kohlberg’s developmental stages. We apply this framework to six leading LLMs through a dual-protocol approach combining direct questioning and response analysis to established ethical dilemmas. Our analysis reveals striking patterns of convergence: all evaluated models demonstrate strong prioritization of care/harm and fairness/cheating foundations while consistently underweighting authority, loyalty, and sanctity dimensions. Through detailed examination of confidence metrics, response reluctance patterns, and reasoning consistency, we establish that contemporary LLMs (1) produce decisive ethical judgments, (2) demonstrate notable cross-model alignment in moral decision-making, and (3) generally correspond with empirically established human moral preferences. This research contributes a scalable, extensible methodology for ethical benchmarking while highlighting both the promising capabilities and systematic limitations in current AI moral reasoning architectures–insights critical for responsible development as these systems assume increasingly significant societal roles.

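[AI-51] Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements

Quick read: This paper addresses the key challenges that remain in generative adversarial network (GAN)-based voice conversion (VC) systems: training stability, linguistic consistency, and perceptual naturalness. The key lies in the strong feature-mapping capability of Generative Adversarial Networks (GANs), which makes converted speech more realistic and natural. Through a systematic review of existing methods, the paper analyzes technical obstacles and evaluates recent advances, aiming to offer theoretical support and practical guidance for building more robust and efficient voice conversion systems.

Link: https://arxiv.org/abs/2504.19197
Authors: Sandipan Dhar, Nanda Dulal Jana, Swagatam Das
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 19 pages, 12 figures, 1 table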

Abstract:Voice conversion (VC) stands as a crucial research area in speech synthesis, enabling the transformation of a speaker’s vocal characteristics to resemble another while preserving the linguistic content. This technology has broad applications, including automated movie dubbing, speech-to-singing conversion, and assistive devices for pathological speech rehabilitation. With the increasing demand for high-quality and natural-sounding synthetic voices, researchers have developed a wide range of VC techniques. Among these, generative adversarial network (GAN)-based approaches have drawn considerable attention for their powerful feature-mapping capabilities and potential to produce highly realistic speech. Despite notable advancements, challenges such as ensuring training stability, maintaining linguistic consistency, and achieving perceptual naturalness continue to hinder progress in GAN-based VC systems. This systematic review presents a comprehensive analysis of the voice conversion landscape, highlighting key techniques, key challenges, and the transformative impact of GANs in the field. The survey categorizes existing methods, examines technical obstacles, and critically evaluates recent developments in GAN-based VC. By consolidating and synthesizing research findings scattered across the literature, this review provides a structured understanding of the strengths and limitations of different approaches. The significance of this survey lies in its ability to guide future research by identifying existing gaps, proposing potential directions, and offering insights for building more robust and efficient VC systems. Overall, this work serves as an essential resource for researchers, developers, and practitioners aiming to advance the state-of-the-art (SOTA) in voice conversion technology.

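[AI-52] A Design Framework for operationalizing Trustworthy Artificial Intelligence in Healthcare: Requirements, Tradeoffs and Challenges for its Clinical Adoption

Quick read: This paper addresses the obstacles to widespread clinical adoption of medical Artificial Intelligence (AI), in particular ethical, regulatory, and trust issues that go beyond technical performance. The key to its solution is a design framework that helps developers embed Trustworthy AI (TAI) principles into medical AI systems, so that these systems meet high standards for human agency and oversight, algorithmic robustness, privacy and data governance, transparency, bias and discrimination avoidance, and accountability.

Link: https://arxiv.org/abs/2504.19179
Authors: Pedro A. Moreno-Sánchez, Javier Del Ser, Mark van Gils, Jussi Hernesniemi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: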

Abstract:Artificial Intelligence (AI) holds great promise for transforming healthcare, particularly in disease diagnosis, prognosis, and patient care. The increasing availability of digital medical data, such as images, omics, biosignals, and electronic health records, combined with advances in computing, has enabled AI models to approach expert-level performance. However, widespread clinical adoption remains limited, primarily due to challenges beyond technical performance, including ethical concerns, regulatory barriers, and lack of trust. To address these issues, AI systems must align with the principles of Trustworthy AI (TAI), which emphasize human agency and oversight, algorithmic robustness, privacy and data governance, transparency, bias and discrimination avoidance, and accountability. Yet, the complexity of healthcare processes (e.g., screening, diagnosis, prognosis, and treatment) and the diversity of stakeholders (clinicians, patients, providers, regulators) complicate the integration of TAI principles. To bridge the gap between TAI theory and practical implementation, this paper proposes a design framework to support developers in embedding TAI principles into medical AI systems. Thus, for each stakeholder identified across various healthcare processes, we propose a disease-agnostic collection of requirements that medical AI systems should incorporate to adhere to the principles of TAI. Additionally, we examine the challenges and tradeoffs that may arise when applying these principles in practice. To ground the discussion, we focus on cardiovascular diseases, a field marked by both high prevalence and active AI innovation, and demonstrate how TAI principles have been applied and where key obstacles persist.

[AI-53] A Dynamic Fuzzy Rule and Attribute Management Framework for Fuzzy Inference Systems in High-Dimensional Data

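Quick read: This paper addresses the challenge that high-dimensional data poses for neuro-fuzzy inference systems: how to streamline complex fuzzy models while preserving performance and interpretability. The key to its solution is the Adaptive Dynamic Attribute and Rule (ADAR) framework, which combines dual weighting mechanisms (assigning adaptive importance to both attributes and rules) with automated growth and pruning strategies to optimize the model dynamically. This reduces model complexity while improving interpretability and predictive accuracy.

Link: https://arxiv.org/abs/2504.19148
Authors: Ke Liu, Jing Ma, Edmund M-K Lai
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: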

Abstract:This paper presents an Adaptive Dynamic Attribute and Rule (ADAR) framework designed to address the challenges posed by high-dimensional data in neuro-fuzzy inference systems. By integrating dual weighting mechanisms (assigning adaptive importance to both attributes and rules) together with automated growth and pruning strategies, ADAR adaptively streamlines complex fuzzy models without sacrificing performance or interpretability. Experimental evaluations on four diverse datasets, namely Auto MPG (7 variables), Beijing PM2.5 (10 variables), Boston Housing (13 variables), and Appliances Energy Consumption (27 variables), show that ADAR-based models achieve consistently lower Root Mean Square Error (RMSE) compared to state-of-the-art baselines. On the Beijing PM2.5 dataset, for instance, ADAR-SOFENN attained an RMSE of 56.87 with nine rules, surpassing traditional ANFIS [12] and SOFENN [16] models. Similarly, on the high-dimensional Appliances Energy dataset, ADAR-ANFIS reached an RMSE of 83.25 with nine rules, outperforming established fuzzy logic approaches and interpretability-focused methods such as APLR. Ablation studies further reveal that combining rule-level and attribute-level weight assignment significantly reduces model overlap while preserving essential features, thereby enhancing explainability. These results highlight ADAR’s effectiveness in dynamically balancing rule complexity and feature importance, paving the way for scalable, high-accuracy, and transparent neuro-fuzzy systems applicable to a range of real-world scenarios.

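[AI-54] ChiseLLM: Unleashing the Power of Reasoning LLMs for Chisel Agile Hardware Development

Quick read: This paper addresses the weak syntax correctness and design variability of Large Language Models (LLMs) on Chisel code generation in Generative AI-assisted Hardware Construction Language (HCL) development. The key to its solution is ChiseLLM, which comprises data processing and transformation, prompt-guided reasoning trace synthesis, and domain-adapted model training. High-quality datasets and prompt enhancement guide the model toward structured thinking patterns, which markedly improves the syntax correctness and design variability of the generated Chisel code.

Link: https://arxiv.org/abs/2504.19144
Authors: Bowei Wang, Jiaran Gao, Yelai Feng, Renzhi Chen, Shanshan Li, Lei Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Software Engineering (cs.SE)
Comments: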

Abstract:The growing demand for Domain-Specific Architecture (DSA) has driven the development of Agile Hardware Development Methodology (AHDM). Hardware Construction Language (HCL) like Chisel offers high-level abstraction features, making it an ideal language for HCL-Based AHDM. While Large Language Models (LLMs) excel in code generation tasks, they still face challenges with Chisel generation, particularly regarding syntax correctness and design variability. Recent reasoning models have significantly enhanced code generation capabilities through test-time scaling techniques. However, we found that reasoning models without domain adaptation cannot bring substantial benefits to Chisel code generation tasks. This paper presents ChiseLLM, a solution comprising data processing and transformation, prompt-guided reasoning trace synthesis, and domain-adapted model training. We constructed high-quality datasets from public RTL code resources and guided the model to adopt structured thinking patterns through prompt enhancement methods. Experiments demonstrate that our ChiseLLM-7B and ChiseLLM-32B models improved syntax correctness by 18.85% and 26.32% respectively over base models, while increasing variability design ability by 47.58% compared to baseline reasoning models. Our datasets and models are publicly available, providing high-performance, cost-effective models for HCL-Based AHDM, and offering an effective baseline for future research. Github repository: this https URL

[AI-55] BQSched: A Non-intrusive Scheduler for Batch Concurrent Queries via Reinforcement Learning ICDE’25

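[Quick Read]: This paper addresses the optimization of overall completion time (makespan) when scheduling batch concurrent queries, aiming to make SQL-based operational data processing in large enterprises more efficient. The key to the solution is BQSched, a non-intrusive scheduler based on reinforcement learning (RL). Its core contributions are an attention-based state representation that captures complex query patterns and IQ-PPO, an auxiliary-task-enhanced proximal policy optimization (PPO) algorithm that exploits the rich signals of individual query completion in logs. It further adds three optimization strategies: adaptive masking to prune the action space, scheduling-gain-based query clustering to handle large query sets, and an incremental simulator to reduce sampling cost.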

Link: https://arxiv.org/abs/2504.19142
Authors: Chenhao Xu,Chunyu Chen,Jinglin Peng,Jiannan Wang,Jun Gao
Affiliations: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDE '25

Abstract:Most large enterprises build predefined data pipelines and execute them periodically to process operational data using SQL queries for various tasks. A key issue in minimizing the overall makespan of these pipelines is the efficient scheduling of concurrent queries within the pipelines. Existing tools mainly rely on simple heuristic rules due to the difficulty of expressing the complex features and mutual influences of queries. The latest reinforcement learning (RL) based methods have the potential to capture these patterns from feedback, but it is non-trivial to apply them directly due to the large scheduling space, high sampling cost, and poor sample utilization. Motivated by these challenges, we propose BQSched, a non-intrusive Scheduler for Batch concurrent Queries via reinforcement learning. Specifically, BQSched designs an attention-based state representation to capture the complex query patterns, and proposes IQ-PPO, an auxiliary task-enhanced proximal policy optimization (PPO) algorithm, to fully exploit the rich signals of Individual Query completion in logs. Based on the RL framework above, BQSched further introduces three optimization strategies, including adaptive masking to prune the action space, scheduling gain-based query clustering to deal with large query sets, and an incremental simulator to reduce sampling cost. To our knowledge, BQSched is the first non-intrusive batch query scheduler via RL. Extensive experiments show that BQSched can significantly improve the efficiency and stability of batch query scheduling, while also achieving remarkable scalability and adaptability in both data and queries. For example, across all DBMSs and scales tested, BQSched reduces the overall makespan of batch queries on TPC-DS benchmark by an average of 34% and 13%, compared with the commonly used heuristic strategy and the adapted RL-based scheduler, respectively. 
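
To make the makespan objective concrete, here is a minimal sketch of how submission order changes the overall finish time of a batch. It is not BQSched itself: it assumes fixed, independent query run times and a fixed number of parallel connections, whereas the abstract stresses that real queries influence each other. The function name `makespan` and the run times are hypothetical.

```python
import heapq

def makespan(durations, slots):
    """Finish time of the last query when queries are started in list order
    and each one takes the earliest free slot (simplified model: run times
    are fixed and do not depend on what else is running)."""
    free_at = [0.0] * slots              # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

# Hypothetical query run times in seconds (not taken from the paper).
queries = [2, 3, 4, 9]
fifo = makespan(queries, slots=2)
longest_first = makespan(sorted(queries, reverse=True), slots=2)
print(fifo, longest_first)
```

With these numbers, submitting the longest query first shortens the batch from 12 to 9 seconds. A learned scheduler aims to find such orderings when simple rules like "longest first" no longer hold.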

[AI-56] Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments

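[Quick Read]: This paper addresses task-robust adaptation in sequential decision-making, that is, how to improve a policy's generalization and robustness across different tasks. Existing risk-averse strategies (for example, the conditional value-at-risk principle) prioritize difficult tasks but require costly, intensive evaluation. To improve efficiency, robust active task sampling uses risk-predictive models as a surrogate for policy evaluation, which lowers the computational cost. The paper models this optimization pipeline as a Markov decision process and proposes an easy-to-implement method, Posterior and Diversity Synergized Task Sampling (PDTS), to support fast and robust sequential decision-making.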

Link: https://arxiv.org/abs/2504.19139
Authors: Yun Qu,Qi (Cheems) Wang,Yixiu Mao,Yiqin Lv,Xiangyang Ji
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Task robust adaptation is a long-standing pursuit in sequential decision-making. Some risk-averse strategies, e.g., the conditional value-at-risk principle, are incorporated in domain randomization or meta reinforcement learning to prioritize difficult tasks in optimization, which demand costly intensive evaluations. The efficiency issue prompts the development of robust active task sampling to train adaptive policies, where risk-predictive models are used to surrogate policy evaluation. This work characterizes the optimization pipeline of robust active task sampling as a Markov decision process, posits theoretical and practical insights, and constitutes robustness concepts in risk-averse scenarios. Importantly, we propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making. Extensive experiments show that PDTS unlocks the potential of robust active task sampling, significantly improves the zero-shot and few-shot adaptation robustness in challenging tasks, and even accelerates the learning process under certain scenarios. Our project website is at this https URL.
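
The conditional value-at-risk principle mentioned in the abstract can be illustrated with a small empirical estimate: average the returns of the worst fraction of tasks. This is a sketch of the risk measure only, not of PDTS. The function name `cvar`, the rounding rule for the tail size, and the return values are assumptions made for illustration.

```python
def cvar(returns, alpha):
    """Empirical CVaR: mean return over the worst `alpha` fraction of tasks
    (a lower return is treated as a harder task)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    worst_first = sorted(returns)
    k = max(1, int(round(alpha * len(worst_first))))  # tail size, at least one task
    tail = worst_first[:k]
    return sum(tail) / len(tail)

# Hypothetical per-task returns of one policy (not from the paper).
task_returns = [10.0, 2.0, 8.0, 4.0, 6.0, 1.0, 9.0, 3.0, 7.0, 5.0]
print(cvar(task_returns, 0.2))   # mean of the two hardest tasks
print(cvar(task_returns, 1.0))   # plain average over all tasks
```

Evaluating every task to find this tail is the expensive step; the abstract's risk-predictive models stand in for that evaluation.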

[AI-57] Beyond Levels of Driving Automation: A Triadic Framework of Human-AI Collaboration in On-Road Mobility

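[Quick Read]: This paper addresses how human users and AI should collaborate in real time in dynamic driving contexts. Existing classifications (e.g., SAE Levels of Automation) focus only on who controls the vehicle and leave the collaboration mechanism unspecified. The key to the solution is a triadic human-AI collaboration framework with three AI roles that dynamically adapt to human needs: Advisor, Co-Pilot, and Guardian. This lays a foundation for adaptive, role-based human-AI collaboration strategies in automated vehicles.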

Link: https://arxiv.org/abs/2504.19120
Authors: Gaojian Huang,Yantong Jin,Wei-Hsiang Lo
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The goal of the current study is to introduce a triadic human-AI collaboration framework for the automated vehicle domain. Previous classifications (e.g., SAE Levels of Automation) focus on defining automation levels based on who controls the vehicle. However, it remains unclear how human users and AI should collaborate in real-time, especially in dynamic driving contexts, where roles can shift frequently. To fill the gap, this study proposes a triadic human-AI collaboration framework with three AI roles (i.e., Advisor, Co-Pilot, and Guardian) that dynamically adapt to human needs. Overall, the study lays a foundation for developing adaptive, role-based human-AI collaboration strategies in automated vehicles.

[AI-58] VeriDebug: A Unified LLM for Verilog Debugging via Contrastive Embedding and Guided Correction

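[Quick Read]: This paper addresses the limited automation of Verilog debugging, in particular the lack of research applying large language models (LLMs) to it. The key to the solution is VeriDebug, which combines contrastive representation and guided correction to debug Verilog code automatically. VeriDebug unifies bug detection and correction through a shared parameter space and learns bug patterns and fixes simultaneously. According to the abstract, it reaches 64.7 bug-fixing accuracy (Acc1), compared with 11.3 for the existing open-source state of the art and 36.6 for GPT-3.5-turbo.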

Link: https://arxiv.org/abs/2504.19099
Authors: Ning Wang,Bingkun Yao,Jie Zhou,Yuchen Hu,Xi Wang,Nan Guan,Zhe Jiang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in debugging for various programming languages. However, the application of LLMs to Verilog debugging remains insufficiently explored. Here, we present VeriDebug, an approach that integrates contrastive representation and guided correction capabilities for automated Verilog debugging. Unlike existing methods, VeriDebug employs an embedding-based technique to accurately retrieve internal information, followed by bug-fixing. VeriDebug unifies Verilog bug detection and correction through a shared parameter space. By simultaneously learning bug patterns and fixes, it streamlines debugging via contrastive embedding and guided correction. Empirical results show the efficacy of VeriDebug in enhancing Verilog debugging. Our VeriDebugLoc, Type model achieves 64.7 accuracy in bug fixing (Acc1), a significant improvement from the existing open-source SOTAs 11.3. This performance not only outperforms open-source alternatives but also exceeds larger closed-source models like GPT-3.5-turbo (36.6), offering a more accurate alternative to conventional debugging methods.

[AI-59] CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges

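[Quick Read]: This paper addresses the underexplored reasoning ability of large language models (LLMs) in domains that require cryptographic expertise, specifically decryption tasks. The key to the solution is CipherBank, a comprehensive benchmark of 2,358 carefully designed problems covering 262 unique plaintexts across 5 domains and 14 subdomains, with 9 distinct encryption algorithms ranging from classical ciphers to custom cryptographic techniques. It is designed to evaluate reasoning in privacy-sensitive, real-world scenarios. Evaluating a range of state-of-the-art LLMs on CipherBank reveals significant capability gaps on classical decryption tasks and yields observations on where LLM cryptographic reasoning could improve.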

Link: https://arxiv.org/abs/2504.19093
Authors: Yu Li,Qizhi Pei,Mengyuan Sun,Honglin Lin,Chenlin Ming,Xin Gao,Jiang Wu,Conghui He,Lijun Wu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Work in progress

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
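
For readers unfamiliar with what a classical decryption task looks like, the sketch below uses a Caesar shift. The abstract does not list which nine algorithms CipherBank contains, so the choice of cipher, the function name `caesar`, and the sample plaintext are all assumptions made for illustration.

```python
import string

def caesar(text, shift):
    """Shift ASCII letters by `shift` positions (wrapping around the alphabet),
    leaving digits, spaces and punctuation unchanged."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)

plaintext = "Meet at 9pm"                 # hypothetical plaintext
ciphertext = caesar(plaintext, 3)         # encrypt with key 3
recovered = caesar(ciphertext, -3)        # decrypt by shifting back
print(ciphertext, "->", recovered)
```

A benchmark item in this style would presumably hand the model the ciphertext and ask it to recover the plaintext, which requires applying the inverse shift to each letter consistently.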

[AI-60] Improved Molecular Generation through Attribute-Driven Integrative Embeddings and GAN Selectivity

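[Quick Read]: This paper addresses how to efficiently generate molecules with specific properties, motivated by growing demand for tailored molecules in fields such as drug discovery and chemical engineering. The key to the solution is combining a transformer-based vector embedding generator with a modified generative adversarial network (GAN). The embedding generator uses a new molecular descriptor that integrates Morgan fingerprints with global molecular attributes, so the transformer captures both local functional groups and broader molecular characteristics. A modified GAN generator loss steers generation toward molecules with the desired properties.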

Link: https://arxiv.org/abs/2504.19040
Authors: Nandan Joshi,Erhan Guven
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing demand for molecules with tailored properties in fields such as drug discovery and chemical engineering has driven advancements in computational methods for molecular design. Machine learning-based approaches for de-novo molecular generation have recently garnered significant attention. This paper introduces a transformer-based vector embedding generator combined with a modified Generative Adversarial Network (GAN) to generate molecules with desired properties. The embedding generator utilizes a novel molecular descriptor, integrating Morgan fingerprints with global molecular attributes, enabling the transformer to capture local functional groups and broader molecular characteristics. Modifying the GAN generator loss function ensures the generation of molecules with specific desired properties. The transformer achieves a reconversion accuracy of 94% while translating molecular descriptors back to SMILES strings, validating the utility of the proposed embeddings for generative tasks. The approach is validated by generating novel odorant molecules using a labeled dataset of odorant and non-odorant compounds. With the modified range-loss function, the GAN exclusively generates odorant molecules. This work underscores the potential of combining novel vector embeddings with transformers and modified GAN architectures to accelerate the discovery of tailored molecules, offering a robust tool for diverse molecular design applications.

[AI-61] Improving Pretrained YAMNet for Enhanced Speech Command Detection via Transfer Learning

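[Quick Read]: This paper addresses the need for accuracy and efficiency in speech command recognition systems, a key component of user interaction in smart applications. The key to the solution is transfer learning from the pretrained YAMNet model: the authors adapt and train YAMNet to detect and interpret speech commands in audio signals. Using the annotated Speech Commands dataset (speech_commands_v0.01) with data augmentation and feature extraction, the final model reaches a recognition accuracy of 95.28%.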

Link: https://arxiv.org/abs/2504.19030
Authors: Sidahmed Lachenani,Hamza Kheddar,Mohamed Ouldzmirli
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:This work addresses the need for enhanced accuracy and efficiency in speech command recognition systems, a critical component for improving user interaction in various smart applications. Leveraging the robust pretrained YAMNet model and transfer learning, this study develops a method that significantly improves speech command recognition. We adapt and train a YAMNet deep learning model to effectively detect and interpret speech commands from audio signals. Using the extensively annotated Speech Commands dataset (speech_commands_v0.01), our approach demonstrates the practical application of transfer learning to accurately recognize a predefined set of speech commands. The dataset is meticulously augmented, and features are strategically extracted to boost model performance. As a result, the final model achieved a recognition accuracy of 95.28%, underscoring the impact of advanced machine learning techniques on speech command recognition. This achievement marks substantial progress in audio processing technologies and establishes a new benchmark for future research in the field.

[AI-62] DiCE-Extended: A Robust Approach to Counterfactual Explanations in Machine Learning

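[Quick Read]: This paper addresses the difficulty existing counterfactual (CF) explanation methods have in balancing proximity, diversity, and robustness, which limits their real-world applicability. The key to the solution is DiCE-Extended, a framework that applies multi-objective optimization, a robustness metric based on the Dice-Sorensen coefficient, and weighted loss components (lambda_p, lambda_d, lambda_r) to improve the stability and validity of CF explanations while maintaining interpretability.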

Link: https://arxiv.org/abs/2504.19027
Authors: Volkan Bakir,Polat Goktas,Sureyya Akyuz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: MCO 2025, 5th International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences

Abstract:Explainable artificial intelligence (XAI) has become increasingly important in decision-critical domains such as healthcare, finance, and law. Counterfactual (CF) explanations, a key approach in XAI, provide users with actionable insights by suggesting minimal modifications to input features that lead to different model outcomes. Despite significant advancements, existing CF generation methods often struggle to balance proximity, diversity, and robustness, limiting their real-world applicability. A widely adopted framework, Diverse Counterfactual Explanations (DiCE), emphasizes diversity but lacks robustness, making CF explanations sensitive to perturbations and domain constraints. To address these challenges, we introduce DiCE-Extended, an enhanced CF explanation framework that integrates multi-objective optimization techniques to improve robustness while maintaining interpretability. Our approach introduces a novel robustness metric based on the Dice-Sorensen coefficient, ensuring stability under small input variations. Additionally, we refine CF generation using weighted loss components (lambda_p, lambda_d, lambda_r) to balance proximity, diversity, and robustness. We empirically validate DiCE-Extended on benchmark datasets (COMPAS, Lending Club, German Credit, Adult Income) across multiple ML backends (Scikit-learn, PyTorch, TensorFlow). Results demonstrate improved CF validity, stability, and alignment with decision boundaries compared to standard DiCE-generated explanations. Our findings highlight the potential of DiCE-Extended in generating more reliable and interpretable CFs for high-stakes applications. Future work will explore adaptive optimization techniques and domain-specific constraints to further enhance CF generation in real-world scenarios.
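
The two ingredients named in the abstract, the Dice-Sorensen coefficient and a loss weighted by lambda_p, lambda_d and lambda_r, can be sketched as follows. The abstract does not say what the coefficient is computed over or how each loss term is defined and signed, so comparing sets of changed features and combining the terms by a plain weighted sum are assumptions. The feature names are hypothetical.

```python
def dice_sorensen(a, b):
    """Dice-Sorensen coefficient 2|A ∩ B| / (|A| + |B|) between two collections,
    treated as sets. Returns 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def weighted_loss(proximity, diversity, robustness, lambda_p, lambda_d, lambda_r):
    """Plain weighted sum of three loss terms (assumed form, see lead-in)."""
    return lambda_p * proximity + lambda_d * diversity + lambda_r * robustness

# Features changed by the counterfactual for an input and for a slightly
# perturbed copy of that input (hypothetical feature names).
cf_original = {"income", "hours_per_week", "education"}
cf_perturbed = {"income", "education", "age"}
score = dice_sorensen(cf_original, cf_perturbed)   # 2 * 2 / (3 + 3)
print(round(score, 3))
```

Under this reading, a score near 1 means a small input perturbation leaves the explanation largely unchanged, which is the stability property the abstract describes.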

[AI-63] GLaMoR: Consistency Checking of OWL Ontologies using Graph Language Models

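[Quick Read]: This paper addresses ontology consistency checking in semantic reasoning, which becomes computationally expensive and less efficient as OWL ontologies grow. The key to the solution is GLaMoR (Graph Language Model for Reasoning), which transforms OWL ontologies into graph-structured data and adapts the Graph Language Model (GLM) architecture for consistency checking. On ontologies from the NCBO BioPortal repository, GLaMoR reaches 95% accuracy while running 20 times faster than classical reasoners.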

Link: https://arxiv.org/abs/2504.19023
Authors: Justin Mücke,Ansgar Scherp
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Semantic reasoning aims to infer new knowledge from existing knowledge, with OWL ontologies serving as a standardized framework for organizing information. A key challenge in semantic reasoning is verifying ontology consistency. However, state-of-the-art reasoners are computationally expensive, and their efficiency decreases as ontology sizes grow. While classical machine learning models have been explored for consistency checking, they struggle to capture complex relationships within ontologies. Large language models (LLMs) have shown promising results for simple reasoning tasks but perform poorly on structured reasoning. The recently introduced Graph Language Model (GLM) offers a way to simultaneously process graph-structured data and text. This paper proposes GLaMoR (Graph Language Model for Reasoning), a reasoning pipeline that transforms OWL ontologies into graph-structured data and adapts the GLM architecture for consistency checking. We evaluate GLaMoR on ontologies from the NCBO BioPortal repository, converting them into triples suitable for model input. Our results show that the GLM outperforms all baseline models, achieving 95% accuracy while being 20 times faster than classical reasoners. The Code is accessible under: this https URL

[AI-64] Sparks: Multi-Agent Artificial Intelligence Model Discovers Protein Design Principles

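[Quick Read]: This paper addresses the limitation that most AI systems in scientific research only resurface knowledge latent in their training data rather than discovering new scientific principles autonomously. The key to the solution is Sparks, a multi-modal multi-agent AI model that independently executes the full discovery cycle, from hypothesis generation through experiment design to iterative refinement, and produces generalizable principles and a report. Sparks combines generative sequence design, high-accuracy structure prediction, and physics-aware property models, with paired generation-and-reflection agents enforcing self-correction and reproducibility. Applied to protein science, it identified two previously unknown phenomena.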

Link: https://arxiv.org/abs/2504.19017
Authors: Alireza Ghafarollahi,Markus J. Buehler
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Abstract:Advances in artificial intelligence (AI) promise autonomous discovery, yet most systems still resurface knowledge latent in their training data. We present Sparks, a multi-modal multi-agent AI model that executes the entire discovery cycle that includes hypothesis generation, experiment design and iterative refinement to develop generalizable principles and a report without human intervention. Applied to protein science, Sparks uncovered two previously unknown phenomena: (i) a length-dependent mechanical crossover whereby beta-sheet-biased peptides surpass alpha-helical ones in unfolding force beyond ~80 residues, establishing a new design principle for peptide mechanics; and (ii) a chain-length/secondary-structure stability map revealing unexpectedly robust beta-sheet-rich architectures and a “frustration zone” of high variance in mixed alpha/beta folds. These findings emerged from fully self-directed reasoning cycles that combined generative sequence design, high-accuracy structure prediction and physics-aware property models, with paired generation-and-reflection agents enforcing self-correction and reproducibility. The key result is that Sparks can independently conduct rigorous scientific inquiry and identify previously unknown scientific principles.

[AI-65] $PINN – a Domain Decomposition Method for Bayesian Physics-Informed Neural Networks

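[Quick Read]: This paper addresses how to efficiently quantify epistemic and aleatoric uncertainty in large multi-scale problems when initial and boundary data are noisy and sparse. The key to the solution is $PINN, a new Bayesian method that combines local Bayesian Physics-Informed Neural Networks (BPINNs) with domain decomposition to compute global uncertainty in partial differential equations (PDEs). It enforces solution continuity across subdomains by imposing flux continuity at the interfaces of neighboring subdomains.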

Link: https://arxiv.org/abs/2504.19013
Authors: Júlia Vicens Figueres,Juliette Vanderhaeghen,Federica Bragone,Kateryna Morozovska,Khemraj Shukla
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
Comments: 37 pages, 22 figures

Abstract:Physics-Informed Neural Networks (PINNs) are a novel computational approach for solving partial differential equations (PDEs) with noisy and sparse initial and boundary data. However, efficient quantification of epistemic and aleatoric uncertainties in big multi-scale problems remains challenging. We propose $PINN, a novel method of computing global uncertainty in PDEs using a Bayesian framework, by combining local Bayesian Physics-Informed Neural Networks (BPINN) with domain decomposition. The solution continuity across subdomains is obtained by imposing the flux continuity across the interface of neighboring subdomains. To demonstrate the effectiveness of $PINN, we conduct a series of computational experiments on PDEs in 1D and 2D spatial domains. Although we have adopted conservative PINNs (cPINNs), the method can be seamlessly extended to other domain decomposition techniques. The results indicate that the proposed method recovers the global uncertainty by computing the local uncertainty exactly, and more efficiently, as the uncertainty in each subdomain can be computed concurrently. The robustness of $PINN is verified by adding uncorrelated random noise to the training data up to 15% and testing for different domain sizes.

[AI-66] Feature Fusion Revisited: Multimodal CTR Prediction for MMCTR Challenge WWW2025

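[Quick Read]: This paper addresses the high latency that large model size causes when Multimodal Large Language Models (MLLMs) are applied to recommendation systems. The key to the solution is improving the efficiency of multimodal representation learning for information retrieval tasks. The authors experimented with several approaches and won Task 2 (Multimodal CTR Prediction) of the challenge.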

Link: https://arxiv.org/abs/2504.18961
Authors: Junjie Zhou
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: A technical report for the MMCTR Challenge held by EReL@MIR Workshop at WWW 2025

Abstract:With the rapid advancement of Multimodal Large Language Models (MLLMs), an increasing number of researchers are exploring their application in recommendation systems. However, the high latency associated with large models presents a significant challenge for such use cases. The EReL@MIR workshop provided a valuable opportunity to experiment with various approaches aimed at improving the efficiency of multimodal representation learning for information retrieval tasks. As part of the competition’s requirements, participants were mandated to submit a technical report detailing their methodologies and findings. Our team was honored to receive the award for Task 2 - Winner (Multimodal CTR Prediction). In this technical report, we present our methods and key findings. Additionally, we propose several directions for future work, particularly focusing on how to effectively integrate recommendation signals into multimodal representations. The codebase for our implementation is publicly available at: this https URL, and the trained model weights can be accessed at: this https URL.

[AI-67] Application of the Brain Drain Optimization Algorithm to the N-Queens Problem

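[Quick Read]: This paper addresses the N-Queens problem, a classic combinatorial optimization problem that asks for N queens to be placed on a chessboard so that no two attack each other. The key to the solution is Brain Drain Optimization (BRADO), a swarm-based metaheuristic inspired by the emigration of intellectual elites. A designed cost function guides the search, configurations are tuned with a TOPSIS-based multicriteria decision-making process, and BRADO achieves better solution quality than the algorithms it is compared against (PSO, GA, ICA, ILS, and LS).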

Link: https://arxiv.org/abs/2504.18953
Authors: Sahar Ramezani Jolfaei,Sepehr Khodadadi Hossein Abadi
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces the application of the Brain Drain Optimization algorithm – a swarm-based metaheuristic inspired by the emigration of intellectual elites – to the N-Queens problem. The N-Queens problem, a classic combinatorial optimization problem, serves as a challenge for applying the BRADO. A designed cost function guides the search, and the configurations are tuned using a TOPSIS-based multicriteria decision making process. BRADO consistently outperforms alternatives in terms of solution quality, achieving fewer threats and better objective function values. To assess BRADO’s efficacy, it is benchmarked against several established metaheuristic algorithms, including Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Imperialist Competitive Algorithm (ICA), Iterated Local Search (ILS), and basic Local Search (LS). The study highlights BRADO’s potential as a general-purpose solver for combinatorial problems, opening pathways for future applications in other domains of artificial intelligence.
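
The abstract measures solution quality in "threats". A common way to score an N-Queens candidate, sketched below, is to count attacking pairs of queens. The paper's actual cost function is not given in the abstract, so this particular definition, the one-queen-per-row encoding, and the name `count_threats` are assumptions.

```python
def count_threats(cols):
    """Number of attacking queen pairs, where cols[r] is the column of the
    queen in row r (one queen per row, so rows never clash)."""
    n = len(cols)
    threats = 0
    for r1 in range(n):
        for r2 in range(r1 + 1, n):
            same_column = cols[r1] == cols[r2]
            same_diagonal = abs(cols[r1] - cols[r2]) == r2 - r1
            if same_column or same_diagonal:
                threats += 1
    return threats

print(count_threats([1, 3, 0, 2]))   # a valid 4-queens placement
print(count_threats([0, 1, 2, 3]))   # all queens on one diagonal
```

A metaheuristic such as BRADO, PSO or GA would then search over `cols` vectors to drive this count to zero.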

[AI-68] Use of Metric Learning for the Recognition of Handwritten Digits and its Application to Increase the Outreach of Voice-based Communication Platforms

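[Quick Read]: This paper addresses the difficulty of field-based data collection with digital devices in development programmes, caused by factors such as device cost and insufficient training. The key to the solution is paper-based data collection with automated digitization through optical character recognition (OCR) and optical mark recognition (OMR). The work provides a large dataset of handwritten digits together with deep learning models and methods that are effective in real-world environments, and deploys them in a maternal and child health and nutrition awareness project for rural women Self Help Group (SHG) members in north India, where the digitized phone numbers were used to push almost 4 million phone calls.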

Link: https://arxiv.org/abs/2504.18948
Authors: Devesh Pant,Dibyendu Talukder,Deepak Kumar,Rachit Pandey,Aaditeshwar Seth,Chetan Arora
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 Pages, 7 Figures, ACM COMPASS 2022

Abstract:Initiation, monitoring, and evaluation of development programmes can involve field-based data collection about project activities. This data collection through digital devices may not always be feasible though, for reasons such as unaffordability of smartphones and tablets by field-based cadre, or shortfalls in their training and capacity building. Paper-based data collection has been argued to be more appropriate in several contexts, with automated digitization of the paper forms through OCR (Optical Character Recognition) and OMR (Optical Mark Recognition) techniques. We contribute with providing a large dataset of handwritten digits, and deep learning based models and methods built using this data, that are effective in real-world environments. We demonstrate the deployment of these tools in the context of a maternal and child health and nutrition awareness project, which uses IVR (Interactive Voice Response) systems to provide awareness information to rural women SHG (Self Help Group) members in north India. Paper forms were used to collect phone numbers of the SHG members at scale, which were digitized using the OCR tools developed by us, and used to push almost 4 million phone calls. The data, model, and code have been released in the open-source domain.

[AI-69] GPU accelerated program synthesis: Enumerate semantics not syntax!

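[Quick Read]: This paper addresses how to use GPUs to accelerate search in program synthesis. The key to the solution is GPU-friendly programming techniques that use the semantics of formulae to minimise data movement and reduce data-dependent branching, which lets the synthesiser scale to significantly larger synthesis problems and run much faster than the previous CPU-based state of the art.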

Link: https://arxiv.org/abs/2504.18943
Authors: Martin Berger,Nathanaël Fijalkow,Mojtaba Valizadeh
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 10 pages

Abstract:Program synthesis is an umbrella term for generating programs and logical formulae from specifications. With the remarkable performance improvements that GPUs enable for deep learning, a natural question arose: can we also implement a search-based program synthesiser on GPUs to achieve similar performance improvements? In this article we discuss our insights on this question, based on recent works. The goal is to build a synthesiser running on GPUs which takes as input positive and negative example traces and returns a logical formula accepting the positive and rejecting the negative traces. With GPU-friendly programming techniques – using the semantics of formulae to minimise data movement and reduce data-dependent branching – our synthesiser scales to significantly larger synthesis problems, and operates much faster than the previous CPU-based state-of-the-art. We believe the insights that make our approach GPU-friendly have wide potential for enhancing the performance of other formal methods (FM) workloads.
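
The idea in the title, enumerating semantics rather than syntax, can be sketched on a much simpler logic than the one the paper targets. The example below assumes propositional formulas over Boolean variables, with each example given as a variable assignment. It runs on the CPU in plain Python; the function name `synthesise` and its search strategy are illustrative assumptions, not the authors' algorithm. Each formula is represented by a bitmask recording which examples it accepts, and only one formula is kept per distinct bitmask.

```python
def synthesise(variables, positives, negatives, max_rounds=4):
    """Search for a formula that is true on every positive example and false
    on every negative one. A formula's 'semantics' is the bitmask of examples
    it accepts; formulas with an already-seen bitmask are discarded."""
    examples = positives + negatives
    full = (1 << len(examples)) - 1
    target = (1 << len(positives)) - 1      # low bits = positives, high bits = negatives

    seen = {}                               # bitmask -> one formula with that semantics
    for v in variables:
        mask = 0
        for i, ex in enumerate(examples):
            if ex[v]:
                mask |= 1 << i
        seen.setdefault(mask, v)

    for _ in range(max_rounds):
        if target in seen:
            return seen[target]
        items = list(seen.items())
        new = {}
        for m, f in items:                  # negation is a bitwise complement
            new.setdefault(full & ~m, f"!{f}")
        for m1, f1 in items:                # and / or are bitwise and / or
            for m2, f2 in items:
                new.setdefault(m1 & m2, f"({f1} & {f2})")
                new.setdefault(m1 | m2, f"({f1} | {f2})")
        before = len(seen)
        for m, f in new.items():
            seen.setdefault(m, f)
        if len(seen) == before:             # no new semantics: search space exhausted
            break
    return seen.get(target)

positives = [{"a": True, "b": False}]
negatives = [{"a": True, "b": True}, {"a": False, "b": False}]
formula = synthesise(["a", "b"], positives, negatives)
print(formula)
```

Because connectives become bitwise operations on fixed-width masks, combining two candidates needs no branching on the formula's structure, which is the kind of property the abstract describes as GPU-friendly. With n examples there are at most 2^n distinct masks, so deduplicating by semantics bounds the search even though the space of syntactic formulas is infinite.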

[AI-70] AI Chatbots for Mental Health: Values and Harms from Lived Experiences of Depression

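[Quick Read]: This paper addresses the potential harms of LLM-based chatbots in mental health settings, in particular how understanding the values of people with lived experience can help identify and mitigate those harms. The key to the solution is Zenny, a GPT-4o-based technology probe that lets participants engage with depression self-management scenarios. Interviews with 17 individuals with lived experience of depression surfaced five key values: informational support, emotional support, personalization, privacy, and crisis management. These values inform design recommendations for mental health AI chatbots.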

Link: https://arxiv.org/abs/2504.18932
Authors: Dong Whi Yoo,Jiayue Melissa Shi,Violeta J. Rodriguez,Koustuv Saha
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in LLMs enable chatbots to interact with individuals on a range of queries, including sensitive mental health contexts. Despite uncertainties about their effectiveness and reliability, the development of LLMs in these areas is growing, potentially leading to harms. To better identify and mitigate these harms, it is critical to understand how the values of people with lived experiences relate to the harms. In this study, we developed a technology probe, a GPT-4o based chatbot called Zenny, enabling participants to engage with depression self-management scenarios informed by previous research. We used Zenny to interview 17 individuals with lived experiences of depression. Our thematic analysis revealed key values: informational support, emotional support, personalization, privacy, and crisis management. This work explores the relationship between lived experience values, potential harms, and design recommendations for mental health AI chatbots, aiming to enhance self-management support while minimizing risks.

[AI-71] Advanced Longitudinal Control and Collision Avoidance for High-Risk Edge Cases in Autonomous Driving

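[Quick Read]: This paper addresses the limited attention that Advanced Driver Assistance Systems (ADAS) and Advanced Driving Systems (ADS) pay to following vehicles, a gap that can lead to chain-reaction collisions in high-speed, dense traffic. The key to the solution is a longitudinal control and collision avoidance algorithm that integrates adaptive cruising with emergency braking. It uses deep reinforcement learning (DRL) to account for both leading and following vehicles, together with a data preprocessing framework that calibrates real-world sensor data to make training more robust and reliable. In simulated three-vehicle deceleration scenarios it reaches a 99% success rate, against 36.77% for the Federal Highway Administration baseline.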

Link: https://arxiv.org/abs/2504.18931
Authors: Dianwei Chen,Yaobang Gong,Xianfeng Yang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advanced Driver Assistance Systems (ADAS) and Advanced Driving Systems (ADS) are key to improving road safety, yet most existing implementations focus primarily on the vehicle ahead, neglecting the behavior of following vehicles. This shortfall often leads to chain reaction collisions in high speed, densely spaced traffic particularly when a middle vehicle suddenly brakes and trailing vehicles cannot respond in time. To address this critical gap, we propose a novel longitudinal control and collision avoidance algorithm that integrates adaptive cruising with emergency braking. Leveraging deep reinforcement learning, our method simultaneously accounts for both leading and following vehicles. Through a data preprocessing framework that calibrates real-world sensor data, we enhance the robustness and reliability of the training process, ensuring the learned policy can handle diverse driving conditions. In simulated high risk scenarios (e.g., emergency braking in dense traffic), the algorithm effectively prevents potential pile up collisions, even in situations involving heavy duty vehicles. Furthermore, in typical highway scenarios where three vehicles decelerate, the proposed DRL approach achieves a 99% success rate far surpassing the standard Federal Highway Administration speed concepts guide, which reaches only 36.77% success under the same conditions.

[AI-72] Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity

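[Quick Read]: This paper addresses how to assess how well a Transformer compresses data and how to compare the information content of the learned distribution with that of the target distribution. The usual criterion is the target distribution itself, but it is typically unknown, and computing entropy often incurs exponential cost. Using a controlled experimental setup, the paper shows that Transformers have an inductive bias in data compression: beyond approaching the target distribution, they favor lower-entropy distributions, and the tendency grows with model size. This prevents perfect alignment with the target distribution and instead compresses its information content further. The paper also shows that the feed-forward network (FFN) module plays a critical role in driving this bias, and uses dynamic sparsity to characterize redundancy in the parameters.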

链接: https://arxiv.org/abs/2504.18929
作者: Ruifeng Ren,Yong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compression has been a critical lens to understand the success of Transformers. In the past, we have typically taken the target distribution as a criterion to evaluate a model’s compression performance. Nevertheless, it often remains challenging to precisely assess how well the model achieves compression and to compare the information content of the learned distribution with that of the target distribution during compression, as the target distribution is typically unknown and entropy computation often incurs exponential cost. In this work, we explore these issues under a controlled experimental setup. We find that Transformers exhibit a unique inductive bias in data compression: beyond approaching the target distribution, they tend to favor learning lower-entropy distributions, with this tendency becoming more pronounced as the model size increases. This preference prevents Transformers from perfectly aligning with the target distribution, instead further compressing its information content. Furthermore, we show that the FFN module plays a critical role in driving this bias. In addition, while models remove informational redundancy from data during compression, they also exhibit redundancy within their parameters, which enables compression and can be characterized through dynamic sparsity. However, the dynamic sparsity patterns in Transformers, particularly in attention and FFN modules, demand further exploration. To this end, we show that larger Transformers show stronger preferences for bypassing attention computations via residual connections and have a lower proportion of active neurons. Interestingly, we also find that training instability in larger models strongly correlates with sudden increases in dead neurons. Our work contributes to a deeper understanding of Transformers through the lens of entropy and dynamic sparsity.
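
The two quantities the abstract above relies on, the entropy of a distribution and the proportion of active neurons, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the example distributions, and the choice of "strictly positive ReLU pre-activation" as the definition of an active neuron are all assumptions made here for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def active_fraction(pre_activations):
    """Fraction of units whose pre-activation is strictly positive.

    Assumption: a ReLU unit counts as "active" when its input is > 0.
    """
    return sum(1 for x in pre_activations if x > 0) / len(pre_activations)


# Hypothetical distributions, chosen only to show the comparison.
target = [0.25, 0.25, 0.25, 0.25]   # uniform target distribution
learned = [0.70, 0.10, 0.10, 0.10]  # more peaked, hence lower entropy

entropy_gap = shannon_entropy(target) - shannon_entropy(learned)
```

A positive `entropy_gap` corresponds to the situation described in the abstract, where the learned distribution carries less information than the target.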

[AI-73] UnifyFL: Enabling Decentralized Cross-Silo Federated Learning

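[Quick read]: This paper addresses the lack of an effective mechanism for collaboration between organizations in Federated Learning (FL), in particular the balance between trust and resource efficiency. Existing approaches either rely on a trusted third-party aggregator (multilevel FL), which may be biased or unreliable, or share local models directly, which requires significant computational resources for validation; both involve a clear trade-off. The key to the solution is UnifyFL, a trust-based cross-silo FL framework that uses decentralized orchestration and distributed storage, offers synchronous and asynchronous modes to handle stragglers, and achieves performance comparable to ideal multilevel centralized FL while balancing trust and resource use.

Link: https://arxiv.org/abs/2504.18916
Authors: Sarang S, Druva Dhakshinamoorthy, Aditya Shiva Sharma, Yuvraj Singh Bhadauria, Siddharth Chaitra Vivek, Arihant Bansal, Arnab K. Paul
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, 7 tables. Accepted at the 26th ACM/IFIP International Middleware Conference (MIDDLEWARE 2025)

Click to view abstract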

Abstract:Federated Learning (FL) is a decentralized machine learning (ML) paradigm in which models are trained on private data across several devices called clients and combined at a single node called an aggregator rather than aggregating the data itself. Many organizations employ FL to have better privacy-aware ML-driven decision-making capabilities. However, organizations often operate independently rather than collaborate to enhance their FL capabilities due to the lack of an effective mechanism for collaboration. The challenge lies in balancing trust and resource efficiency. One approach relies on trusting a third-party aggregator to consolidate models from all organizations (multilevel FL), but this requires trusting an entity that may be biased or unreliable. Alternatively, organizations can bypass a third party by sharing their local models directly, which requires significant computational resources for validation. Both approaches reflect a fundamental trade-off between trust and resource constraints, with neither offering an ideal solution. In this work, we develop a trust-based cross-silo FL framework called UnifyFL, which uses decentralized orchestration and distributed storage. UnifyFL provides flexibility to the participating organizations and presents synchronous and asynchronous modes to handle stragglers. Our evaluation on a diverse testbed shows that UnifyFL achieves a performance comparable to the ideal multilevel centralized FL while allowing trust and optimal use of resources.
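
For readers unfamiliar with the aggregation step that the abstract describes ("combined at a single node called an aggregator"), here is a minimal sketch of sample-size-weighted model averaging in the style of the well-known FedAvg scheme. The abstract does not say which aggregation rule UnifyFL uses, so this is a generic illustration; `fed_avg` and its arguments are hypothetical names.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of equal-length lists of floats (one per client).
    client_sizes:   number of training samples held by each client.
    This is a generic FedAvg-style rule, not necessarily what UnifyFL does.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients; the second holds three times as much data.
global_model = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a cross-silo setting each "client" would be an organization's own aggregated model rather than a single device, but the averaging arithmetic is the same.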

[AI-74] Transformer-Empowered Actor-Critic Reinforcement Learning for Sequence-Aware Service Function Chain Partitioning

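[Quick read]: This paper addresses the challenge of partitioning Service Function Chains (SFCs) across multi-domain network infrastructures in 6G networks under stringent latency constraints and limited resource availability. Conventional optimization-based methods scale poorly, while existing data-driven approaches struggle to balance computational efficiency with effective modeling of the dependencies inherent in SFCs. The key to the solution is a Transformer-based reinforcement learning framework that uses self-attention to model the complex dependencies among VNFs, enabling coordinated and parallelized decision-making, with the ε-LoPe exploration strategy and Asymptotic Return Normalization improving training stability and convergence.

Link: https://arxiv.org/abs/2504.18902
Authors: Cyril Shih-Huan Hsu, Anestis Dalgkitsis, Chrysa Papagianni, Paola Grosso
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract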

Abstract:In the forthcoming era of 6G networks, characterized by unprecedented data rates, ultra-low latency, and extensive connectivity, effective management of Virtualized Network Functions (VNFs) is essential. VNFs are software-based counterparts of traditional hardware devices that facilitate flexible and scalable service provisioning. Service Function Chains (SFCs), structured as ordered sequences of VNFs, are pivotal in orchestrating complex network services. Nevertheless, partitioning SFCs across multi-domain network infrastructures presents substantial challenges due to stringent latency constraints and limited resource availability. Conventional optimization-based methods typically exhibit low scalability, whereas existing data-driven approaches often fail to adequately balance computational efficiency with the capability to effectively account for dependencies inherent in SFCs. To overcome these limitations, we introduce a Transformer-empowered actor-critic framework specifically designed for sequence-aware SFC partitioning. By utilizing the self-attention mechanism, our approach effectively models complex inter-dependencies among VNFs, facilitating coordinated and parallelized decision-making processes. Additionally, we enhance training stability and convergence using the ε-LoPe exploration strategy as well as Asymptotic Return Normalization. Comprehensive simulation results demonstrate that the proposed methodology outperforms existing state-of-the-art solutions in terms of long-term acceptance rates, resource utilization efficiency, and scalability, while achieving rapid inference. This study not only advances intelligent network orchestration by delivering a scalable and robust solution for SFC partitioning within emerging 6G environments, but also bridges recent advancements in Large Language Models (LLMs) with the optimization of next-generation networks.

[AI-75] SPD Learning for Covariance-Based Neuroimaging Analysis: Perspectives, Methods, and Challenges

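[Quick read]: This paper addresses the inherent challenges of decoding task-specific signatures from neuroimaging data, such as low signal-to-noise ratios in raw electrophysiological recordings, cross-session non-stationarity, and limited sample sizes. The key to the solution is to work with covariance-based neuroimaging data: equipping the space of symmetric positive definite (SPD) matrices with a Riemannian metric (e.g., affine-invariant or log-Euclidean) yields a Riemannian manifold, and the paper unifies methods operating on this manifold so that the geometry of the SPD manifold is used systematically to process covariance features and improve brain imaging analysis.

Link: https://arxiv.org/abs/2504.18882
Authors: Ce Ju, Reinmar J. Kobler, Antoine Collas, Motoaki Kawanabe, Cuntai Guan, Bertrand Thirion
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures, 2 tables; This paper has been submitted for possible publication, and currently under review

Click to view abstract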

Abstract:Neuroimaging provides a critical framework for characterizing brain activity by quantifying connectivity patterns and functional architecture across modalities. While modern machine learning has significantly advanced our understanding of neural processing mechanisms through these datasets, decoding task-specific signatures must contend with inherent neuroimaging constraints, for example, low signal-to-noise ratios in raw electrophysiological recordings, cross-session non-stationarity, and limited sample sizes. This review focuses on machine learning approaches for covariance-based neuroimaging data, where often symmetric positive definite (SPD) matrices under full-rank conditions encode inter-channel relationships. By equipping the space of SPD matrices with Riemannian metrics (e.g., affine-invariant or log-Euclidean), their space forms a Riemannian manifold enabling geometric analysis. We unify methodologies operating on this manifold under the SPD learning framework, which systematically leverages the SPD manifold’s geometry to process covariance features, thereby advancing brain imaging analytics.
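
As a reminder of what the two metrics named in the abstract look like, the standard distance formulas on the SPD manifold are given below. The symbols are choices made here, not taken from the abstract: A and B denote SPD matrices, log is the matrix logarithm, and the norm is the Frobenius norm.

```latex
\[
d_{\mathrm{AI}}(A, B) = \left\lVert \log\!\left(A^{-1/2} B A^{-1/2}\right) \right\rVert_F,
\qquad
d_{\mathrm{LE}}(A, B) = \left\lVert \log(A) - \log(B) \right\rVert_F
\]
```

The affine-invariant distance is unchanged under congruence transformations of A and B by the same invertible matrix, while the log-Euclidean distance is cheaper to compute because each matrix logarithm is taken once and ordinary Euclidean operations are applied afterwards.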

[AI-76] Reshaping MOFs Text Mining with a Dynamic Multi-Agent Framework of Large Language Agents

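[Quick read]: This paper addresses the mining of synthesis conditions for Metal-Organic Frameworks (MOFs), in particular the challenge of identifying the precise synthesis conditions for a specific MOF among the many possibilities. The key to the solution is to use Large Language Models (LLMs), specifically gpt-4o-mini, as core agents that integrate several MOF-related agents (synthesis, attribute, and chemical information agents), resulting in MOFh6, a tool intended to streamline MOF synthesis and improve research efficiency.

Link: https://arxiv.org/abs/2504.18880
Authors: Zuhong Lin, Daoyuan Ren, Kai Ran, Sun Jing, Xiaotiang Huang, Haiyang He, Pengxu Pan, Xiaohang Zhang, Ying Fang, Tianying Wang, Minli Wu, Zhanglin Li, Xiaochuan Zhang, Haipu Li, Jingjing Yao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract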

Abstract:The mining of synthesis conditions for metal-organic frameworks (MOFs) is a significant focus in materials science. However, identifying the precise synthesis conditions for specific MOFs within the vast array of possibilities presents a considerable challenge. Large Language Models (LLMs) offer a promising solution to this problem. We leveraged the capabilities of LLMs, specifically gpt-4o-mini, as core agents to integrate various MOF-related agents, including synthesis, attribute, and chemical information agents. This integration culminated in the development of MOFh6, an LLM tool designed to streamline the MOF synthesis process. MOFh6 allows users to query in multiple formats, such as submitting scientific literature, or inquiring about specific MOF codes or structural properties. The tool analyzes these queries to provide optimal synthesis conditions and generates model files for density functional theory pre-modeling. We believe MOFh6 will improve the efficiency of MOF synthesis for researchers.

[AI-77] TSRM: A Lightweight Temporal Feature Encoding Architecture for Time Series Forecasting and Imputation

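[Quick read]: This paper addresses multivariate time series forecasting and imputation and proposes a temporal feature encoding architecture called the Time Series Representation Model (TSRM). The key to the solution is a stack of CNN-based representation layers, each dedicated to an independent representation learning task so as to capture diverse temporal patterns, combined with an attention-based feature extraction layer and a merge layer that aggregates the extracted features. The design is inspired by the Transformer encoder and built around self-attention; it outperforms existing methods on most of the benchmark datasets considered while significantly reducing the number of learnable parameters.

Link: https://arxiv.org/abs/2504.18878
Authors: Robert Leppich, Michael Stenger, Daniel Grillmeyer, Vanessa Borst, Samuel Kounev
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract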

Abstract:We introduce a temporal feature encoding architecture called Time Series Representation Model (TSRM) for multivariate time series forecasting and imputation. The architecture is structured around CNN-based representation layers, each dedicated to an independent representation learning task and designed to capture diverse temporal patterns, followed by an attention-based feature extraction layer and a merge layer, designed to aggregate extracted features. The architecture is fundamentally based on a configuration that is inspired by a Transformer encoder, with self-attention mechanisms at its core. The TSRM architecture outperforms state-of-the-art approaches on most of the seven established benchmark datasets considered in our empirical evaluation for both forecasting and imputation tasks. At the same time, it significantly reduces complexity in the form of learnable parameters. The source code is available at this https URL.

[AI-78] Generative to Agentic AI: Survey, Conceptualization, and Challenges

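[Quick read]: This paper addresses the limited understanding of how Generative AI (GenAI) differs from Agentic AI and explores the potential and challenges of Agentic AI for complex tasks. The key to the solution is a systematic comparison of the two technologies' core characteristics, evolution, and limitations, showing how Agentic AI remedies the shortcomings of GenAI, followed by an in-depth discussion of recent developments and practical concerns in Agentic AI, which yields directions for future research and warnings about potential risks.

Link: https://arxiv.org/abs/2504.18875
Authors: Johannes Schneider
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract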

Abstract:Agentic Artificial Intelligence (AI) builds upon Generative AI (GenAI). It constitutes the next major step in the evolution of AI with much stronger reasoning and interaction capabilities that enable more autonomous behavior to tackle complex tasks. Since the initial release of ChatGPT (3.5), Generative AI has seen widespread adoption, giving users firsthand experience. However, the distinction between Agentic AI and GenAI remains less well understood. To address this gap, our survey is structured in two parts. In the first part, we compare GenAI and Agentic AI using existing literature, discussing their key characteristics, how Agentic AI remedies limitations of GenAI, and the major steps in GenAI’s evolution toward Agentic AI. This section is intended for a broad audience, including academics in both social sciences and engineering, as well as industry professionals. It provides the necessary insights to comprehend novel applications that are possible with Agentic AI but not with GenAI. In the second part, we deep dive into novel aspects of Agentic AI, including recent developments and practical concerns such as defining agents. Finally, we discuss several challenges that could serve as a future research agenda, while cautioning against risks that can emerge when exceeding human intelligence.

[AI-79] Why you shouldn’t fully trust ChatGPT: A synthesis of this AI tool’s error rates across disciplines and the software engineering lifecycle

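[Quick read]: This paper addresses the reliability of Generative AI across domains and across the Software Engineering (SE) lifecycle, in particular how its error rates are distributed and how they vary. The key to the solution is a Multivocal Literature Review (MLR) that systematically collects and analyzes data from academic studies, reports, benchmarks, and grey literature, classifies error types (factual, reasoning, coding, and interpretive) by domain and SE phase, and visualizes the error distributions with boxplots, giving an evidence-based view of how ChatGPT's performance differs by task, domain, and model version.

Link: https://arxiv.org/abs/2504.18858
Authors: Vahid Garousi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract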

Abstract:Context: ChatGPT and other large language models (LLMs) are widely used across healthcare, business, economics, engineering, and software engineering (SE). Despite their popularity, concerns persist about their reliability, especially their error rates across domains and the software development lifecycle (SDLC). Objective: This study synthesizes and quantifies ChatGPT’s reported error rates across major domains and SE tasks aligned with SDLC phases. It provides an evidence-based view of where ChatGPT excels, where it fails, and how reliability varies by task, domain, and model version (GPT-3.5, GPT-4, GPT-4-turbo, GPT-4o). Method: A Multivocal Literature Review (MLR) was conducted, gathering data from academic studies, reports, benchmarks, and grey literature up to 2025. Factual, reasoning, coding, and interpretive errors were considered. Data were grouped by domain and SE phase and visualized using boxplots to show error distributions. Results: Error rates vary across domains and versions. In healthcare, rates ranged from 8% to 83%. Business and economics saw error rates drop from ~50% with GPT-3.5 to 15-20% with GPT-4. Engineering tasks averaged 20-30%. Programming success reached 87.5%, though complex debugging still showed over 50% errors. In SE, requirements and design phases showed lower error rates (~5-20%), while coding, testing, and maintenance phases had higher variability (10-50%). Upgrades from GPT-3.5 to GPT-4 improved reliability. Conclusion: Despite improvements, ChatGPT still exhibits non-negligible error rates varying by domain, task, and SDLC phase. Full reliance without human oversight remains risky, especially in critical settings. Continuous evaluation and critical validation are essential to ensure reliability and trustworthiness. 

[AI-80] Imitation Learning for Autonomous Driving: Insights from Real-World Testing

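[Quick read]: This paper addresses the challenges of deploying deep-learning-based autonomous driving systems in the real world, in particular ensuring that a Deep Neural Network (DNN) produces accurate, fast, and smooth steering commands at a high sampling frequency. The key to the solution is an incremental design approach that progressively increases model capacity and enlarges the dataset to cope with real-world driving scenarios, together with a comparison of several designs (a PD system, CNN, CNN-LSTM, and CNN-NODE) to identify the most effective approach for real-time autonomous driving.

Link: https://arxiv.org/abs/2504.18847
Authors: Hidayet Ersin Dursun, Yusuf Güven, Tufan Kumbasar
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: In International Congress on Human-Computer Interaction, Optimization and Robotic Applications, 2025

Click to view abstract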

Abstract:This work focuses on the design of a deep learning-based autonomous driving system deployed and tested on the real-world MIT Racecar to assess its effectiveness in driving scenarios. The Deep Neural Network (DNN) translates raw image inputs into real-time steering commands in an end-to-end learning fashion, following the imitation learning framework. The key design challenge is to ensure that DNN predictions are accurate and fast enough, at a high sampling frequency, and result in smooth vehicle operation under different operating conditions. In this study, we design and compare various DNNs, to identify the most effective approach for real-time autonomous driving. In designing the DNNs, we adopted an incremental design approach that involved enhancing the model capacity and dataset to address the challenges of real-world driving scenarios. We designed a PD system, CNN, CNN-LSTM, and CNN-NODE, and evaluated their performance on the real-world MIT Racecar. While the PD system handled basic lane following, it struggled with sharp turns and lighting variations. The CNN improved steering but lacked temporal awareness, which the CNN-LSTM addressed as it resulted in smooth driving performance. The CNN-NODE performed similarly to the CNN-LSTM in handling driving dynamics, yet with slightly better driving performance. The findings of this research highlight the importance of iterative design processes in developing robust DNNs for autonomous driving applications. The experimental video is available at this https URL.

[AI-81] Introducing Interval Neural Networks for Uncertainty-Aware System Identification

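[Quick read]: This paper addresses the lack of uncertainty quantification (UQ) in deep learning (DL) models for system identification (SysID), which undermines model reliability and safety. The key to the solution is the Interval Neural Network (INN): the learnable parameters of a pre-trained neural network are converted into interval-valued parameters without relying on probabilistic assumptions, and interval arithmetic is used to produce prediction intervals (PIs) that capture the target coverage. The paper also extends Long Short-Term Memory (LSTM) and Neural Ordinary Differential Equations (Neural ODEs) to Interval LSTM (ILSTM) and Interval NODE (INODE) architectures, proposes a DL training framework that combines a UQ loss function with parameterization tricks, and introduces the concept of "elasticity" to characterize the underlying causes of uncertainty.

Link: https://arxiv.org/abs/2504.18845
Authors: Mehmet Ali Ferah, Tufan Kumbasar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: In International Congress on Human-Computer Interaction, Optimization and Robotic Applications, 2025

Click to view abstract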

Abstract:System Identification (SysID) is crucial for modeling and understanding dynamical systems using experimental data. While traditional SysID methods emphasize linear models, their inability to fully capture nonlinear dynamics has driven the adoption of Deep Learning (DL) as a more powerful alternative. However, the lack of uncertainty quantification (UQ) in DL-based models poses challenges for reliability and safety, highlighting the necessity of incorporating UQ. This paper introduces a systematic framework for constructing and learning Interval Neural Networks (INNs) to perform UQ in SysID tasks. INNs are derived by transforming the learnable parameters (LPs) of pre-trained neural networks into interval-valued LPs without relying on probabilistic assumptions. By employing interval arithmetic throughout the network, INNs can generate Prediction Intervals (PIs) that capture target coverage effectively. We extend Long Short-Term Memory (LSTM) and Neural Ordinary Differential Equations (Neural ODEs) into Interval LSTM (ILSTM) and Interval NODE (INODE) architectures, providing the mathematical foundations for their application in SysID. To train INNs, we propose a DL framework that integrates a UQ loss function and parameterization tricks to handle constraints arising from interval LPs. We introduce the novel concept of “elasticity” for underlying uncertainty causes and validate ILSTM and INODE in SysID experiments, demonstrating their effectiveness.
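
The idea of propagating interval-valued parameters with interval arithmetic, as described in the abstract above, can be sketched for a single neuron. This is a textbook interval-arithmetic illustration, not the paper's INN construction or training procedure; all function names and the example numbers are assumptions made here.

```python
def interval_add(a, b):
    """Sum of two intervals given as (low, high) tuples."""
    return (a[0] + b[0], a[1] + b[1])


def interval_mul(a, b):
    """Product of two intervals: min and max over the endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def interval_neuron(weights, inputs, bias):
    """Pre-activation interval of one neuron with interval weights and bias."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc = interval_add(acc, interval_mul(w, x))
    return acc


def interval_relu(a):
    """ReLU is monotone, so it can be applied to each endpoint."""
    return (max(0.0, a[0]), max(0.0, a[1]))


# Hypothetical interval weights, point-or-interval inputs, and interval bias.
weights = [(1.0, 2.0), (-1.0, 0.0)]
inputs = [(1.0, 1.0), (2.0, 3.0)]
bias = (0.0, 0.5)

pre_activation = interval_neuron(weights, inputs, bias)
prediction_interval = interval_relu(pre_activation)
```

The width of the resulting interval grows with the width of the weight intervals, which is why a training objective is needed to keep the prediction intervals both narrow and well calibrated.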

[AI-82] Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning

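[Quick read]: This paper addresses the sensitivity of Large Language Models (LLMs) to subtle adversarial perturbations during In-context Learning (ICL) and their unpredictable behavior under linguistic variation. The key to the solution is MMT4NL, a framework inspired by software testing principles that evaluates the trustworthiness of ICL using adversarial perturbations and software testing techniques. Its core idea is to craft metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed ICL prompts, treating the LLM as software whose functionality is validated by testing.

Link: https://arxiv.org/abs/2504.18827
Authors: Teeradaj Racharak, Chaiyong Ragkhitwetsagul, Chommakorn Sontesadisai, Thanwadee Sunetnanta
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract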

Abstract:In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce a software testing-inspired framework, called MMT4NL, for evaluating the trustworthiness of in-context learning by utilizing adversarial perturbations and software testing techniques. It includes diverse evaluation aspects of linguistic capabilities for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just like testing the software. Finally, we demonstrate applications of MMT4NL on the sentiment analysis and question-answering tasks. Our experiments could reveal various linguistic bugs in state-of-the-art LLMs.
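
The metamorphic-testing idea in the abstract above can be sketched as follows: apply a meaning-preserving transformation to each test input and flag cases where the predicted label changes. This is not MMT4NL itself; `toy_sentiment` is a hypothetical keyword-based stand-in for an LLM prompt, and the two transformations are illustrative choices.

```python
def toy_sentiment(text):
    """Hypothetical stand-in for an ICL prompt: keyword-count sentiment."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "terrible"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"


def find_violations(model, inputs, transforms):
    """Return (input, transform name) pairs where the label is not invariant."""
    violations = []
    for text in inputs:
        expected = model(text)
        for name, transform in transforms.items():
            if model(transform(text)) != expected:
                violations.append((text, name))
    return violations


# Transformations that should not change the sentiment of a sentence.
transforms = {
    "uppercase": str.upper,
    "typo": lambda t: t.replace("terrible", "terribel"),
}
inputs = ["the plot was terrible", "a great film"]

violations = find_violations(toy_sentiment, inputs, transforms)
```

Here the toy model survives the uppercase transformation but flips its label under the typo, which is the kind of linguistic bug such a test harness is meant to surface.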

[AI-83] Preserving Seasonal and Trend Information: A Variational Autoencoder-Latent Space Arithmetic Based Approach for Non-stationary Learning

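[Quick read]: This paper addresses the poor performance of conventional AI models on non-stationary time series: such models usually assume a stationary training environment and therefore fail to capture the trend and seasonal information in non-stationary data. The key to the solution is to enforce stationary behavior within the latent space while preserving trend and seasonal information. The method uses differencing, time-series decomposition, and Latent Space Arithmetic (LSA), storing the key information as embeddings in the latent space of a Variational Autoencoder (VAE).

Link: https://arxiv.org/abs/2504.18819
Authors: Hassan Wasswa, Aziida Nanyonga, Timothy Lynar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract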

Abstract:AI models have garnered significant research attention towards predictive task automation. However, a stationary training environment is an underlying assumption for most models and such models simply do not work on non-stationary data since a stationary relationship is learned. The existing solutions propose making data stationary prior to model training and evaluation. This leads to loss of trend and seasonal patterns which are vital components for learning temporal dependencies of the system under study. This research aims to address this limitation by proposing a method for enforcing stationary behaviour within the latent space while preserving trend and seasonal information. The method deploys techniques including Differencing, Time-series decomposition, and Latent Space Arithmetic (LSA), to learn information vital for efficient approximation of trend and seasonal information which is then stored as embeddings within the latent space of a Variational Autoencoder (VAE). The approach’s ability to preserve trend and seasonal information was evaluated on two time-series non-stationary datasets. For predictive performance evaluation, four deep learning models were trained on the latent vector representations of the datasets after application of the proposed method and all models produced competitive results in comparison with state-of-the-art techniques using RMSE as the performance metric.
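
Differencing, one of the techniques named in the abstract above, is easy to show in isolation. The sketch below covers only classical differencing and its inverse, not the paper's VAE or latent space arithmetic; the function names and the example series are assumptions made here. It also shows why the removed information has to be kept somewhere: without the initial values, the original series cannot be restored.

```python
def difference(series, lag=1):
    """Lagged differences: series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def undifference(diffs, initial, lag=1):
    """Invert `difference`, given the first `lag` values of the original."""
    restored = list(initial)
    for d in diffs:
        restored.append(restored[-lag] + d)
    return restored


# Hypothetical series with an accelerating upward trend.
series = [10, 12, 15, 19, 24]

first_diff = difference(series)        # removes the level
second_diff = difference(first_diff)   # removes the linear trend in the differences
restored = undifference(first_diff, series[:1])
```

With `lag` set to the seasonal period, the same two functions perform seasonal differencing and its inverse.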

[AI-84] Zero-Day Botnet Attack Detection in IoV: A Modular Approach Using Isolation Forests and Particle Swarm Optimization

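[Quick read]: This paper addresses the security threats that increased interconnectivity brings to the Internet of Vehicles (IoV), in particular malware and cyberattacks targeting Connected and Autonomous Vehicles (CAVs). The key to the solution is an edge-based Intrusion Detection System (IDS) with a meta-ensemble classifier architecture: multiple Isolation Forest (IF) models are trained on Multi-access Edge Computing (MEC) servers and aggregated with a Particle Swarm Optimization (PSO) based stacking strategy, producing a robust meta-classifier that recognizes known (N-day) attacks and detects unseen (zero-day) attacks.

Link: https://arxiv.org/abs/2504.18814
Authors: Abdelaziz Amara korba, Nour Elislem Karabadji, Yacine Ghamri-Doudane
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract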

Abstract:The Internet of Vehicles (IoV) is transforming transportation by enhancing connectivity and enabling autonomous driving. However, this increased interconnectivity introduces new security vulnerabilities. Bot malware and cyberattacks pose significant risks to Connected and Autonomous Vehicles (CAVs), as demonstrated by real-world incidents involving remote vehicle system compromise. To address these challenges, we propose an edge-based Intrusion Detection System (IDS) that monitors network traffic to and from CAVs. Our detection model is based on a meta-ensemble classifier capable of recognizing known (N-day) attacks and detecting previously unseen (zero-day) attacks. The approach involves training multiple Isolation Forest (IF) models on Multi-access Edge Computing (MEC) servers, with each IF specialized in identifying a specific type of botnet attack. These IFs, either trained locally or shared by other MEC nodes, are then aggregated using a Particle Swarm Optimization (PSO) based stacking strategy to construct a robust meta-classifier. The proposed IDS has been evaluated on a vehicular botnet dataset, achieving an average detection rate of 92.80% for N-day attacks and 77.32% for zero-day attacks. These results highlight the effectiveness of our solution in detecting both known and emerging threats, providing a scalable and adaptive defense mechanism for CAVs within the IoV ecosystem.

[AI-85] Clones in the Machine: A Feminist Critique of Agency in Digital Cloning

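[Quick read]: This paper addresses the ethical issues raised by digital cloning in academic research, in particular its neglect of informed consent, agency, and representation. Digital clones replicate user data to simulate behavior and are often seen as scalable tools for behavioral insights, but the paper argues that this framing obscures deeper ethical concerns. The key to the solution is to establish decentralized data repositories and dynamic consent models that promote ethical, context-aware AI practices and challenge the reductionist logic of AI solutionism.

Link: https://arxiv.org/abs/2504.18807
Authors: Siân Brooke
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: ACM CHI Conference on Human Factors in Computing Systems 2025

Click to view abstract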

Abstract:This paper critiques digital cloning in academic research, highlighting how it exemplifies AI solutionism. Digital clones, which replicate user data to simulate behavior, are often seen as scalable tools for behavioral insights. However, this framing obscures ethical concerns around consent, agency, and representation. Drawing on feminist theories of agency, the paper argues that digital cloning oversimplifies human complexity and risks perpetuating systemic biases. To address these issues, it proposes decentralized data repositories and dynamic consent models, promoting ethical, context-aware AI practices that challenge the reductionist logic of AI solutionism.

[AI-86] Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation

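[Quick read]: This paper addresses the delays in bug triage and resolution caused by unclear, incomplete, or ambiguous information in bug reports. The key to the solution is to use instruction fine-tuned Large Language Models (LLMs) to automatically transform casual, unstructured bug reports into high-quality, structured reports that follow a standard template. Comparing models on CTQRS, ROUGE, METEOR, and SBERT shows that fine-tuning is effective at improving bug report quality.

Link: https://arxiv.org/abs/2504.18804
Authors: Jagrit Acharya, Gouri Ginde
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract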

Abstract:Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we explore whether Instruction fine-tuned Large Language Models (LLMs) can automatically transform casual, unstructured bug reports into high-quality, structured bug reports adhering to a standard template. We evaluate three open-source instruction-tuned LLMs (Qwen 2.5, Mistral, and Llama 3.2) against ChatGPT-4o, measuring performance on established metrics such as CTQRS, ROUGE, METEOR, and SBERT. Our experiments show that fine-tuned Qwen 2.5 achieves a CTQRS score of 77%, outperforming fine-tuned Mistral (71%), Llama 3.2 (63%), and ChatGPT in 3-shot learning (75%). Further analysis reveals that Llama 3.2 shows higher accuracy in detecting missing fields, particularly Expected Behavior and Actual Behavior, while Qwen 2.5 demonstrates superior performance in capturing Steps-to-Reproduce, with an F1 score of 76%. Additional testing of the models on other popular projects (e.g., Eclipse, GCC) demonstrates that our approach generalizes well, achieving up to 70% CTQRS in unseen projects’ bug reports. These findings highlight the potential of instruction fine-tuning in automating structured bug report generation, reducing manual effort for developers and streamlining the software maintenance process.
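
Of the metrics listed in the abstract above, ROUGE is the simplest to illustrate. The sketch below computes a unigram-overlap F1 in the spirit of ROUGE-1; it is a simplified assumption-laden version (whitespace tokenization, lowercasing, no stemming), not the evaluation code used in the paper, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated and a reference text.

    Simplified: whitespace tokens, lowercased, clipped counts, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated summary versus reference summary of a bug report.
score = rouge1_f1("app crashes on login", "the app crashes on login")
```

In this example every generated token appears in the reference (precision 1.0) while one reference token is missing (recall 0.8), so the F1 is 8/9.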

[AI-87] Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots

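[Quick read]: This paper addresses the poor performance of conventional reinforcement learning algorithms on complex navigation tasks with sparse rewards and evaluates Hierarchical Reinforcement Learning (HRL) as an alternative. The key idea is that HRL can exploit the hierarchy of a task by creating sub-goals and using a termination function, which improves learning efficiency and performance. The experiments compare PPO with HRL, manual with automatic sub-goal creation, and different termination frequencies, and highlight the advantages of HRL on complex tasks.

Link: https://arxiv.org/abs/2504.18794
Authors: Brendon Johnson, Alfredo Weitzenfeld
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract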

Abstract:Hierarchical reinforcement learning (HRL) is hypothesized to be able to take advantage of the inherent hierarchy in robot learning tasks with sparse reward schemes, in contrast to more traditional reinforcement learning algorithms. In this research, hierarchical reinforcement learning is evaluated and contrasted with standard reinforcement learning in complex navigation tasks. We evaluate unique characteristics of HRL, including their ability to create sub-goals and the termination function. We constructed experiments to test the differences between PPO and HRL, different ways of creating sub-goals, manual vs automatic sub-goal creation, and the effects of the frequency of termination on performance. These experiments highlight the advantages of HRL and how it achieves these advantages.

[AI-88] Evaluating AI-Driven Automated Map Digitization in QGIS

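[Quick read]: This paper addresses the heavy manual effort that traditional map digitization requires. The key to its solution is Deepness (Deep Neural Remote Sensing), a deep-learning-driven tool that digitizes maps automatically. Deepness runs as a plugin for QGIS and aims to make digitization more efficient with less human intervention.

Link: https://arxiv.org/abs/2504.18777
Authors: Diana Febrita
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to 2025 Indiana Geographic Information Council (IGIC) Conference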

Abstract:Map digitization is an important process that converts maps into digital formats that can be used for further analysis. This process typically requires a deep human involvement because of the need for interpretation and decision-making when translating complex features. With the advancement of artificial intelligence, there is an alternative to conducting map digitization with the help of machine learning techniques. Deepness, or Deep Neural Remote Sensing, is an advanced AI-driven tool designed and integrated as a plugin in QGIS application. This research focuses on assessing the effectiveness of Deepness in automated digitization. This study analyses AI-generated digitization results from Google Earth imagery and compares them with digitized outputs from OpenStreetMap (OSM) to evaluate performance.

[AI-89] Dynamic Action Interpolation: A Universal Approach for Accelerating Reinforcement Learning with Expert Guidance

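[Quick read]: This paper addresses the low sample efficiency of Reinforcement Learning (RL) in early training, where many environment interactions are needed before performance becomes good. The key to its solution is Dynamic Action Interpolation (DAI), a general and simple framework that interpolates between expert actions and RL actions using a time-varying weight α(t). DAI integrates into any Actor-Critic algorithm with only a few lines of code changed and needs no additional networks or loss functions.

Link: https://arxiv.org/abs/2504.18766
Authors: Wenjun Cao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: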

Abstract:Reinforcement learning (RL) suffers from severe sample inefficiency, especially during early training, requiring extensive environmental interactions to perform competently. Existing methods tend to solve this by incorporating prior knowledge, but introduce significant architectural and implementation complexity. We propose Dynamic Action Interpolation (DAI), a universal yet straightforward framework that interpolates expert and RL actions via a time-varying weight α(t), integrating into any Actor-Critic algorithm with just a few lines of code and without auxiliary networks or additional losses. Our theoretical analysis shows that DAI reshapes state visitation distributions to accelerate value function learning while preserving convergence guarantees. Empirical evaluations across MuJoCo continuous control tasks demonstrate that DAI improves early-stage performance by over 160% on average and final performance by more than 50%, with the Humanoid task showing a 4× improvement early on and a 2× gain at convergence. These results challenge the assumption that complex architectural modifications are necessary for sample-efficient reinforcement learning.
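
The interpolation at the heart of DAI can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the abstract only says that expert and RL actions are mixed through a time-varying weight α(t), so the linear schedule, the `warmup_steps` parameter, the function names, and the choice to let α(t) weight the RL action are all assumptions made here.

```python
from typing import List, Sequence


def alpha(t: int, warmup_steps: int) -> float:
    """Hypothetical linear schedule: 0.0 at t=0, rising to 1.0 at warmup_steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, t / warmup_steps))


def interpolate_action(
    expert_action: Sequence[float],
    rl_action: Sequence[float],
    t: int,
    warmup_steps: int,
) -> List[float]:
    """Blend the two actions element-wise; the weight on the RL action grows with t."""
    w = alpha(t, warmup_steps)
    return [(1.0 - w) * e + w * r for e, r in zip(expert_action, rl_action)]


# Halfway through the assumed warm-up, the executed action is the midpoint.
blended = interpolate_action([1.0, 0.0], [0.0, 1.0], t=50, warmup_steps=100)
```

Under this assumed schedule the agent executes the expert action at t = 0 and the learned policy's action once t reaches `warmup_steps`; the Actor-Critic update itself is left unchanged.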

[AI-90] A Vision for Auto Research with LLM Agents

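[Quick read]: This paper addresses the fragmented workflows, uneven methodological expertise, and cognitive overload that scientific research faces. The key to its solution is an agent-based automated research framework in which Large Language Models (LLMs) and collaborating modular agents automate, coordinate, and optimize the full research lifecycle.

Link: https://arxiv.org/abs/2504.18765
Authors: Chengwei Liu, Chong Wang, Jiayue Cao, Jingquan Ge, Kun Wang, Lvye Zhang, Ming-Ming Cheng, Penghai Zhao, Tianlin Li, Xiaojun Jia, Xiang Li, Xinfeng Li, Yang Liu, Yebo Feng, Yihao Huang, Yijia Xu, Yuqiang Sun, Zhenhong Zhou, Zhengzi Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: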

Abstract:This paper introduces Agent-Based Auto Research, a structured multi-agent framework designed to automate, coordinate, and optimize the full lifecycle of scientific research. Leveraging the capabilities of large language models (LLMs) and modular agent collaboration, the system spans all major research phases, including literature review, ideation, methodology planning, experimentation, paper writing, peer review response, and dissemination. By addressing issues such as fragmented workflows, uneven methodological expertise, and cognitive overload, the framework offers a systematic and scalable approach to scientific inquiry. Preliminary explorations demonstrate the feasibility and potential of Auto Research as a promising paradigm for self-improving, AI-driven research processes.

[AI-91] TLoRA: Tri-Matrix Low-Rank Adaptation of Large Language Models

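[Quick read]: This paper addresses the parameter inefficiency of fine-tuning Large Language Models (LLMs), that is, how to reduce the number of trainable parameters while preserving model performance. The key to its solution is TLoRA, a tri-matrix low-rank adaptation method that decomposes the weight update into two fixed random matrices and one trainable matrix, combined with a learnable layer-wise scaling factor, achieving parameter-efficient adaptation with minimal added computational overhead.

Link: https://arxiv.org/abs/2504.18735
Authors: Tanvir Islam
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: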

Abstract:We propose TLoRA, a novel tri-matrix low-rank adaptation method that decomposes weight updates into three matrices: two fixed random matrices and one trainable matrix, combined with a learnable, layer-wise scaling factor. This tri-matrix design enables TLoRA to achieve highly efficient parameter adaptation while introducing minimal additional computational overhead. Through extensive experiments on the GLUE benchmark, we demonstrate that TLoRA achieves comparable performance to existing low-rank methods such as LoRA and Adapter-based techniques, while requiring significantly fewer trainable parameters. Analyzing the adaptation dynamics, we observe that TLoRA exhibits Gaussian-like weight distributions, stable parameter norms, and scaling factor variability across layers, further highlighting its expressive power and adaptability. Additionally, we show that TLoRA closely resembles LoRA in its eigenvalue distributions, parameter norms, and cosine similarity of updates, underscoring its ability to effectively approximate LoRA’s adaptation behavior. Our results establish TLoRA as a highly efficient and effective fine-tuning method for LLMs, offering a significant step forward in resource-efficient model adaptation.
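
The tri-matrix decomposition can be made concrete with a small sketch. It is illustrative only and assumes details the abstract does not state: that the two fixed random matrices are the outer factors `A` (d_out × r) and `C` (r × d_in), that the trainable matrix `B` is the r × r middle factor and starts at zero, and that the update is `scale * A @ B @ C`. All names are hypothetical.

```python
import random


def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def random_matrix(rows, cols, rng):
    """Fixed (non-trainable) Gaussian matrix; the distribution is an assumption."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def tri_matrix_update(a_fixed, b_trainable, c_fixed, scale):
    """Weight update delta_W = scale * A @ B @ C (assumed ordering of the factors)."""
    abc = matmul(matmul(a_fixed, b_trainable), c_fixed)
    return [[scale * v for v in row] for row in abc]


d_out, d_in, rank = 6, 4, 2
rng = random.Random(0)
A = random_matrix(d_out, rank, rng)        # fixed
C = random_matrix(rank, d_in, rng)         # fixed
B = [[0.0] * rank for _ in range(rank)]    # trainable, zero-initialized (assumption)
scale = 1.0                                # learnable per-layer scalar

delta_w = tri_matrix_update(A, B, C, scale)

# Trainable parameters per adapted layer under these assumptions.
tri_trainable = rank * rank + 1
lora_trainable = rank * (d_out + d_in)
```

With these toy shapes the tri-matrix variant trains 5 values against 20 for a rank-2 LoRA pair, which illustrates why the trainable parameter count no longer depends on the layer width.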

[AI-92] World Food Atlas Project

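[Quick read]: This paper addresses how to better understand and control diet as the pandemic keeps people at home worldwide. The key to its solution is building a comprehensive World Food Atlas (WFA), approached through two efforts: the Food Knowledge Graph (FKG) and the FoodLog Athl and RecipeLog applications. FKG extracts knowledge from recipes and food nutrition data and represents relationships between foods and ingredients as a graph, while FoodLog Athl and RecipeLog collect people's detailed dietary records, together providing data for building the WFA.

Link: https://arxiv.org/abs/2504.18727
Authors: Ali Rostami, Z Xie, A Ishino, Y Yamakata, K Aizawa, Ramesh Jain
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: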

Abstract:A coronavirus pandemic is forcing people to be “at home” all over the world. In a life of hardly ever going out, we would have realized how the food we eat affects our bodies. What can we do to know our food more and control it better? To give us a clue, we are trying to build a World Food Atlas (WFA) that collects all the knowledge about food in the world. In this paper, we present two of our trials. The first is the Food Knowledge Graph (FKG), which is a graphical representation of knowledge about food and ingredient relationships derived from recipes and food nutrition data. The second is the FoodLog Athl and the RecipeLog that are applications for collecting people’s detailed records about food habit. We also discuss several problems that we try to solve to build the WFA by integrating these two ideas.

[AI-93] MODP: Multi Objective Directional Prompting KDD2025

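[Quick read]: This paper addresses the subjectivity and approximation inherent in conventional prompt engineering for large language models (LLMs), in particular the neglect of the LLM's intrinsic behavior during prompt development. The key to its solution is MODP (Multi-Objective Directional Prompting), a framework built on two core concepts: multi-objectivity, which treats the LLM's intrinsic behavior as an additional objective in prompt development, and directional prompting, a metrics-driven method for producing robust, high-precision prompts.

Link: https://arxiv.org/abs/2504.18722
Authors: Aashutosh Nema, Samaksh Gulati, Evangelos Giakoumakis, Bipana Thapaliya
Affiliations: Unknown
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, submission to KDD 2025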

Abstract:Recent advances in large language models (LLMs) have led to their popularity across multiple use-cases. However, prompt engineering, the process for optimally utilizing such models, remains approximation-driven and subjective. Most of the current research on prompt engineering focuses on task-specific optimization, while neglecting the behavior of the LLM under consideration during prompt development. This paper introduces MODP – Multi Objective Directional Prompting, a framework based on two key concepts: 1) multi-objectivity: the importance of considering an LLM’s intrinsic behavior as an additional objective in prompt development, and 2) directional prompting: a metrics-driven method for prompt engineering to ensure development of robust and high-precision prompts. We demonstrate the effectiveness of our proposed ideas on a summarization task, using a synthetically created dataset, achieving a 26% performance gain over initial prompts. Finally, we apply MODP to develop prompts for Dell’s Next Best Action support tool, which is now in production and is used by more than 10,000 internal support agents and serving millions of customers worldwide.

[AI-94] Explicit neural network classifiers for non-separable data

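[Quick read]: This paper addresses how feedforward neural networks can implement a feature map that separates concentric data. The key to its solution is a full characterization of a large class of feedforward networks in terms of truncation maps, which is then applied to show that a ReLU neural network can construct such a separating feature map.

Link: https://arxiv.org/abs/2504.18710
Authors: Patrícia Muñoz Ewald
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: 10 pages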

Abstract:We fully characterize a large class of feedforward neural networks in terms of truncation maps. As an application, we show how a ReLU neural network can implement a feature map which separates concentric data.
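
To see why a ReLU feature map can separate concentric data at all, consider the toy sketch below. It is not the paper's truncation-map construction: it swaps in a much simpler stand-in, the identity |x| = relu(x) + relu(-x), so that a single hidden ReLU layer computes the L1 norm of a 2-D point. The ring radii, the threshold, and every function name are assumptions chosen for illustration.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def l1_feature(point) -> float:
    """One hidden ReLU layer: |x| = relu(x) + relu(-x), summed over coordinates."""
    return sum(relu(c) + relu(-c) for c in point)


def ring(radius: float, count: int):
    """Evenly spaced points on a circle of the given radius around the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


inner = ring(1.0, 16)   # class 0: small circle
outer = ring(3.0, 16)   # class 1: large circle around it
threshold = 2.0         # any value between sqrt(2) and 3 works for these radii

inner_separated = all(l1_feature(p) < threshold for p in inner)
outer_separated = all(l1_feature(p) > threshold for p in outer)
```

No straight line in the input plane splits the two rings, but after the feature map a single threshold on the scalar feature does, because the L1 norm stays within [1, √2] on the inner ring and within [3, 3√2] on the outer one.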

[AI-95] Technical Challenges in Maintaining Tax Prep Software with Large Language Models

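[Quick read]: This paper addresses the tedious, error-prone maintenance of tax preparation software as tax law keeps changing. The key to its solution is using Large Language Models (LLMs) such as ChatGPT and Llama to extract code differentials from publications issued by the U.S. Internal Revenue Service (IRS) and automatically integrate them with the prior version of the code, automating tax preparation software maintenance.

Link: https://arxiv.org/abs/2504.18693
Authors: Sina Gogani-Khiabani, Varsha Dewangan, Nina Olson, Ashutosh Trivedi, Saeid Tizpaz-Niari
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted to 14th Annual IRS/TPC Joint Research Conference on Tax Administration (IRS-TPC 2024)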

Abstract:As the US tax law evolves to adapt to ever-changing politico-economic realities, tax preparation software plays a significant role in helping taxpayers navigate these complexities. The dynamic nature of tax regulations poses a significant challenge to accurately and timely maintaining tax software artifacts. The state-of-the-art in maintaining tax prep software is time-consuming and error-prone as it involves manual code analysis combined with an expert interpretation of tax law amendments. We posit that the rigor and formality of tax amendment language, as expressed in IRS publications, makes it amenable to automatic translation to executable specifications (code). Our research efforts focus on identifying, understanding, and tackling technical challenges in leveraging Large Language Models (LLMs), such as ChatGPT and Llama, to faithfully extract code differentials from IRS publications and automatically integrate them with the prior version of the code to automate tax prep software maintenance.

[AI-96] From Prompts to Propositions: A Logic-Based Lens on Student-LLM Interactions

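[Quick read]: This paper addresses how to systematically analyze students' prompting behavior when they use generative AI on computational tasks, specifically identifying patterns in prompt evolution, detecting struggling students, and gaining insight into effective and ineffective strategies. The key to its solution is Prompt2Constraints, a new method that translates students' prompts into logical constraints, representing the intent of each prompt in a succinct, quantifiable form and enabling in-depth analysis of how prompts evolve.

Link: https://arxiv.org/abs/2504.18691
Authors: Ali Alfageeh, Sadegh AlMahdi Kazemi Zarkouei, Daye Nam, Daniel Prol, Matin Amoozadeh, Souti Chattopadhyay, James Prather, Paul Denny, Juho Leinonen, Michael Hilton, Sruti Srinivasa Ragavan, Mohammad Amin Alipour
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: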

Abstract:Background and Context. The increasing integration of large language models (LLMs) in computing education presents an emerging challenge in understanding how students use LLMs and craft prompts to solve computational tasks. Prior research has used both qualitative and quantitative methods to analyze prompting behavior, but these approaches lack scalability or fail to effectively capture the semantic evolution of prompts. Objective. In this paper, we investigate whether students prompts can be systematically analyzed using propositional logic constraints. We examine whether this approach can identify patterns in prompt evolution, detect struggling students, and provide insights into effective and ineffective strategies. Method. We introduce Prompt2Constraints, a novel method that translates students prompts into logical constraints. The constraints are able to represent the intent of the prompts in succinct and quantifiable ways. We used this approach to analyze a dataset of 1,872 prompts from 203 students solving introductory programming tasks. Findings. We find that while successful and unsuccessful attempts tend to use a similar number of constraints overall, when students fail, they often modify their prompts more significantly, shifting problem-solving strategies midway. We also identify points where specific interventions could be most helpful to students for refining their prompts. Implications. This work offers a new and scalable way to detect students who struggle in solving natural language programming tasks. This work could be extended to investigate more complex tasks and integrated into programming tools to provide real-time support.

[AI-97] Transformational Creativity in Science: A Graphical Theory

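[Quick read]: This paper addresses how to theorize and explain transformational creativity in science. Its core is a graphical theory that synthesizes Boden's view that transformational creativity arises from changes in the "enabling constraints" of a conceptual space with Kuhn's theory of paradigm shifts. The key to its solution is proving that modifications to the axioms of the graphical model have the most transformative potential, and using the framework to capture historical instances of transformational creativity.

Link: https://arxiv.org/abs/2504.18687
Authors: Samuel Schapiro, Jonah Black, Lav R. Varshney
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: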

Abstract:Creative processes are typically divided into three types: combinatorial, exploratory, and transformational. Here, we provide a graphical theory of transformational scientific creativity, synthesizing Boden’s insight that transformational creativity arises from changes in the “enabling constraints” of a conceptual space and Kuhn’s structure of scientific revolutions as resulting from paradigm shifts. We prove that modifications made to axioms of our graphical model have the most transformative potential and then illustrate how several historical instances of transformational creativity can be captured by our framework.

[AI-98] Proof-of-TBI – Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction

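[Quick read]: This paper addresses the difficulty of diagnosing Mild Traumatic Brain Injury (TBI), whose signs in medical imaging are subtle and often ambiguous. The key to its solution is Proof-of-TBI, a diagnosis support system that combines multiple fine-tuned vision-language models with the OpenAI-o3 reasoning large language model (LLM), which evaluates their predictions in a consensus-based decision process to produce the final diagnosis. Custom prompt engineering and LLM agents orchestrate the interaction between the vision-language models and the reasoning model, keeping the final decision process transparent, reliable, and automated.

Link: https://arxiv.org/abs/2504.18671
Authors: Ross Gore, Eranga Bandara, Sachin Shetty, Alberto E. Musto, Pratip Rana, Ambrosio Valencia-Romero, Christopher Rhea, Lobat Tayebi, Heather Richter, Atmaram Yarlagadda, Donna Edmonds, Steven Wallace, Donna Broshek
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: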

Abstract:Mild Traumatic Brain Injury (TBI) detection presents significant challenges due to the subtle and often ambiguous presentation of symptoms in medical imaging, making accurate diagnosis a complex task. To address these challenges, we propose Proof-of-TBI, a medical diagnosis support system that integrates multiple fine-tuned vision-language models with the OpenAI-o3 reasoning large language model (LLM). Our approach fine-tunes multiple vision-language models using a labeled dataset of TBI MRI scans, training them to diagnose TBI symptoms effectively. The predictions from these models are aggregated through a consensus-based decision-making process. The system evaluates the predictions from all fine-tuned vision language models using the OpenAI-o3 reasoning LLM, a model that has demonstrated remarkable reasoning performance, to produce the most accurate final diagnosis. The LLM Agents orchestrates interactions between the vision-language models and the reasoning LLM, managing the final decision-making process with transparency, reliability, and automation. This end-to-end decision-making workflow combines the vision-language model consortium with the OpenAI-o3 reasoning LLM, enabled by custom prompt engineering by the LLM agents. The prototype for the proposed platform was developed in collaboration with the U.S. Army Medical Research team in Newport News, Virginia, incorporating five fine-tuned vision-language models. The results demonstrate the transformative potential of combining fine-tuned vision-language model inputs with the OpenAI-o3 reasoning LLM to create a robust, secure, and highly accurate diagnostic system for mild TBI prediction. To the best of our knowledge, this research represents the first application of fine-tuned vision-language models integrated with a reasoning LLM for TBI prediction tasks.

[AI-99] M2R2: MulitModal Robotic Representation for Temporal Action Segmentation

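[Quick read]: This paper addresses two problems in multimodal Temporal Action Segmentation (TAS): features that are hard to reuse and limited visual information. Robotics methods typically rely on proprioceptive information while computer vision mainly uses exteroceptive sensors. Existing multimodal TAS models fuse features inside the model, which makes learned features hard to reuse across models, and pretrained vision-only feature extractors perform poorly when object visibility is limited. The proposed solution is M2R2, a multimodal feature extractor designed for TAS that combines proprioceptive and exteroceptive information and introduces a novel pretraining strategy so that learned features can be reused across multiple TAS models, improving segmentation performance.

Link: https://arxiv.org/abs/2504.18662
Authors: Daniel Sliwowski, Dongheui Lee
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures, 2 tables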

Abstract:Temporal action segmentation (TAS) has long been a key area of research in both robotics and computer vision. In robotics, algorithms have primarily focused on leveraging proprioceptive information to determine skill boundaries, with recent approaches in surgical robotics incorporating vision. In contrast, computer vision typically relies on exteroceptive sensors, such as cameras. Existing multimodal TAS models in robotics integrate feature fusion within the model, making it difficult to reuse learned features across different models. Meanwhile, pretrained vision-only feature extractors commonly used in computer vision struggle in scenarios with limited object visibility. In this work, we address these challenges by proposing M2R2, a multimodal feature extractor tailored for TAS, which combines information from both proprioceptive and exteroceptive sensors. We introduce a novel pretraining strategy that enables the reuse of learned features across multiple TAS models. Our method achieves state-of-the-art performance on the REASSEMBLE dataset, a challenging multimodal robotic assembly dataset, outperforming existing robotic action segmentation models by 46.6%. Additionally, we conduct an extensive ablation study to evaluate the contribution of different modalities in robotic TAS tasks.

[AI-100] The Big Send-off: High Performance Collectives on GPU-based Supercomputers

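[Quick read]: This paper addresses the collective communication bottleneck in training Large Language Models (LLMs) on GPU-based supercomputers. Existing libraries such as RCCL and Cray-MPICH show critical limitations on Frontier: Cray-MPICH underutilizes network and compute resources, while RCCL has severe scalability problems. The proposed solution is PCCL, whose key is highly optimized implementations of all-gather and reduce-scatter tailored to distributed deep learning workloads, designed to make maximal use of all available network and compute resources and to scale efficiently to thousands of GPUs.

Link: https://arxiv.org/abs/2504.18658
Authors: Siddharth Singh, Mahua Singh, Abhinav Bhatele
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: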

Abstract:We evaluate the current state of collective communication on GPU-based supercomputers for large language model (LLM) training at scale. Existing libraries such as RCCL and Cray-MPICH exhibit critical limitations on systems such as Frontier – Cray-MPICH underutilizes network and compute resources, while RCCL suffers from severe scalability issues. To address these challenges, we introduce PCCL, a communication library with highly optimized implementations of all-gather and reduce-scatter operations tailored for distributed deep learning workloads. PCCL is designed to maximally utilize all available network and compute resources and to scale efficiently to thousands of GPUs. It achieves substantial performance improvements, delivering 6-33x speedups over RCCL and 28-70x over Cray-MPICH for all-gather on 2048 GCDs of Frontier. These gains translate directly to end-to-end performance: in large-scale GPT-3-style training, PCCL provides up to 60% and 40% speedups over RCCL for 7B and 13B parameter models, respectively.
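
For readers unfamiliar with the two collectives named above, the sketch below simulates their semantics on plain Python lists, with one list per rank. It says nothing about how PCCL, RCCL, or Cray-MPICH implement them; the function names are hypothetical, and the reduction is assumed to be a sum over buffers whose length divides evenly by the number of ranks.

```python
from typing import List


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Each rank contributes one shard; every rank ends with all shards concatenated."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]


def reduce_scatter(buffers: List[List[float]]) -> List[List[float]]:
    """Sum the full-size buffers element-wise, then give rank r the r-th chunk."""
    world_size = len(buffers)
    length = len(buffers[0])
    if length % world_size != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = length // world_size
    summed = [sum(values) for values in zip(*buffers)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


# Two simulated ranks.
gathered = all_gather([[1, 2], [3, 4]])
scattered = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
```

In sharded training setups, all-gather typically reassembles parameters before a layer runs and reduce-scatter sums gradients while leaving each rank with only its own shard, which is why these two operations dominate communication time.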

[AI-101] Exploring a Large Language Model for Transforming Taxonomic Data into OWL: Lessons Learned and Implications for Ontology Development

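[Quick read]: This paper addresses the management of scientific names in ontologies that represent species taxonomies; because taxonomies keep evolving, manually maintaining thousands of scientific names becomes increasingly difficult. The key to its solution is using ChatGPT-4 to automate development of the :Organism module in the Agricultural Product Types Ontology (APTO) by extracting data from the GBIF Backbone API and generating OWL files. Two approaches were explored: issuing a series of prompts for ChatGPT-4 to execute tasks via the BrowserOP plugin, and directing ChatGPT-4 to design a Python algorithm for analogous tasks. The first showed scalability limitations; the second overcame them but struggled with typographical errors in data handling.

Link: https://arxiv.org/abs/2504.18651
Authors: Filipi Miranda Soares, Antonio Mauro Saraiva, Luís Ferreira Pires, Luiz Olavo Bonino da Silva Santos, Dilvan de Abreu Moreira, Fernando Elias Corrêa, Kelly Rosa Braghetto, Debora Pignatari Drucker, Alexandre Cláudio Botazzo Delbem
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 31 pages, 6 Figures, accepted for publication in Data Intelligence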

Abstract:Managing scientific names in ontologies that represent species taxonomies is challenging due to the ever-evolving nature of these taxonomies. Manually maintaining these names becomes increasingly difficult when dealing with thousands of scientific names. To address this issue, this paper investigates the use of ChatGPT-4 to automate the development of the :Organism module in the Agricultural Product Types Ontology (APTO) for species classification. Our methodology involved leveraging ChatGPT-4 to extract data from the GBIF Backbone API and generate OWL files for further integration in APTO. Two alternative approaches were explored: (1) issuing a series of prompts for ChatGPT-4 to execute tasks via the BrowserOP plugin and (2) directing ChatGPT-4 to design a Python algorithm to perform analogous tasks. Both approaches rely on a prompting method where we provide instructions, context, input data, and an output indicator. The first approach showed scalability limitations, while the second approach used the Python algorithm to overcome these challenges, but it struggled with typographical errors in data handling. This study highlights the potential of Large language models like ChatGPT-4 to streamline the management of species names in ontologies. Despite certain limitations, these tools offer promising advancements in automating taxonomy-related tasks and improving the efficiency of ontology development.

[AI-102] A Gradient-Optimized TSK Fuzzy Framework for Explainable Phishing Detection

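[Quick read]: This paper addresses the difficulty existing phishing URL detection methods have in achieving both accuracy and explainability: they either fail to detect novel attacks or operate as opaque black-box models. The key to its solution is a detection system based on a first-order Takagi-Sugeno-Kang (TSK) fuzzy inference model whose parameters are tuned with gradient-based optimization, specifically the Adam optimizer, achieving high detection accuracy while keeping the decision process interpretable.

Link: https://arxiv.org/abs/2504.18636
Authors: Lohith Srikanth Pentapalli, Jon Salisbury, Josette Riep, Kelly Cohen
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 14 pages, 5 figures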

Abstract:Phishing attacks represent an increasingly sophisticated and pervasive threat to individuals and organizations, causing significant financial losses, identity theft, and severe damage to institutional reputations. Existing phishing detection methods often struggle to simultaneously achieve high accuracy and explainability, either failing to detect novel attacks or operating as opaque black-box models. To address this critical gap, we propose a novel phishing URL detection system based on a first-order Takagi-Sugeno-Kang (TSK) fuzzy inference model optimized through gradient-based techniques. Our approach intelligently combines the interpretability and human-like reasoning capabilities of fuzzy logic with the precision and adaptability provided by gradient optimization methods, specifically leveraging the Adam optimizer for efficient parameter tuning. Experiments conducted using a comprehensive dataset of over 235,000 URLs demonstrate rapid convergence, exceptional predictive performance (accuracy averaging 99.95% across 5 cross-validation folds, with a perfect AUC i.e. 1.00). Furthermore, optimized fuzzy rules and membership functions improve interoperability, clearly indicating how the model makes decisions - an essential feature for cybersecurity applications. This high-performance, transparent, and interpretable phishing detection framework significantly advances current cybersecurity defenses, providing practitioners with accurate and explainable decision-making tools.
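
A first-order TSK model computes a firing strength per rule from membership functions and returns the strength-weighted average of per-rule linear consequents. The sketch below shows only that forward pass, assuming Gaussian membership functions and product conjunction; the rule dictionaries, parameter values, and function names are hypothetical, and the paper's actual features, rule base, and Adam training loop are not reproduced.

```python
import math
from typing import Dict, List, Sequence


def gaussian_membership(x: float, center: float, sigma: float) -> float:
    """Degree to which x belongs to a fuzzy set centered at `center`."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def tsk_predict(features: Sequence[float], rules: List[Dict]) -> float:
    """First-order TSK output: firing-strength-weighted mean of linear consequents."""
    weighted_sum = 0.0
    total_strength = 0.0
    for rule in rules:
        # Product of per-feature memberships gives the rule's firing strength.
        strength = 1.0
        for x, (center, sigma) in zip(features, rule["memberships"]):
            strength *= gaussian_membership(x, center, sigma)
        # First-order consequent: a linear function of the inputs.
        consequent = rule["bias"] + sum(w * x for w, x in zip(rule["weights"], features))
        weighted_sum += strength * consequent
        total_strength += strength
    return weighted_sum / total_strength if total_strength > 0.0 else 0.0


# Two toy rules over one feature: "low -> score 0" and "high -> score 1".
rules = [
    {"memberships": [(0.0, 0.25)], "weights": [0.0], "bias": 0.0},
    {"memberships": [(1.0, 0.25)], "weights": [0.0], "bias": 1.0},
]
midpoint_score = tsk_predict([0.5], rules)
```

Because every operation here is differentiable in the centers, widths, weights, and biases, the parameters can be tuned by gradient descent, and each tuned rule remains readable as an "if the feature is roughly X then the score is Y" statement.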

[AI-103] Research on Personalized Medical Intervention Strategy Generation System based on Group Relative Policy Optimization and Time-Series Data Fusion

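[Quick read]: This paper addresses the challenge of generating personalized intervention plans from high-dimensional, heterogeneous time-series information, as electronic medical records, wearables, and other multi-source medical data grow in volume and variety. The key to its solution is combining Group Relative Policy Optimization (GRPO) with time-series data fusion. Relative policy constraints among groups adaptively balance individual and group gains, and a multi-layer neural network group-codes patient characteristics to improve the robustness and interpretability of decisions. A multi-channel neural network with a self-attention mechanism rapidly fuses multi-modal heterogeneous time series, a differentiable gating network screens and aggregates key features, and a collaborative search combining a genetic algorithm with Monte Carlo tree search performs global optimization.

Link: https://arxiv.org/abs/2504.18631
Authors: Dingxin Lu, Shurui Wu, Xinyi Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: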

Abstract:With the timely formation of personalized intervention plans based on high-dimensional heterogeneous time series information becoming an important challenge in the medical field today, electronic medical records, wearables, and other multi-source medical data are increasingly generated and diversified. In this work, we develop a system to generate personalized medical intervention strategies based on Group Relative Policy Optimization (GRPO) and Time-Series Data Fusion. First, by incorporating relative policy constraints among the groups during policy gradient updates, we adaptively balance individual and group gains. To improve the robustness and interpretability of decision-making, a multi-layer neural network structure is employed to group-code patient characteristics. Second, for the rapid multi-modal fusion of multi-source heterogeneous time series, a multi-channel neural network combined with a self-attention mechanism is used for dynamic feature extraction. Key feature screening and aggregation are achieved through a differentiable gating network. Finally, a collaborative search process combining a genetic algorithm and Monte Carlo tree search is proposed to find the ideal intervention strategy, achieving global optimization. Experimental results show significant improvements in accuracy, coverage, and decision-making benefits compared with existing methods.

[AI-104] A Cognitive-Mechanistic Human Reliability Analysis Framework: A Nuclear Power Plant Case Study

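[Quick read]: This paper addresses the limited cognitive-mechanism modeling in traditional Human Reliability Analysis (HRA) methods and the impracticality of human-in-the-loop experiments for advanced nuclear power plants. The key to its solution is a cognitive-mechanistic framework (COGMIF) that integrates an ACT-R-based Human Digital Twin (HDT) with TimeGAN-augmented simulation to generate high-fidelity operator behavior data, which is then used to enhance the IDHEAS-ECA method and enable mechanism-informed estimation of Human Error Probabilities (HEPs).

Link: https://arxiv.org/abs/2504.18604
Authors: Xingyu Xiao, Peng Chen, Jiejuan Tong, Shunshun Liu, Hongru Zhao, Jun Zhao, Qianqian Jia, Jingang Liang, Haitao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: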

Abstract:Traditional human reliability analysis (HRA) methods, such as IDHEAS-ECA, rely on expert judgment and empirical rules that often overlook the cognitive underpinnings of human error. Moreover, conducting human-in-the-loop experiments for advanced nuclear power plants is increasingly impractical due to novel interfaces and limited operational data. This study proposes a cognitive-mechanistic framework (COGMIF) that enhances the IDHEAS-ECA methodology by integrating an ACT-R-based human digital twin (HDT) with TimeGAN-augmented simulation. The ACT-R model simulates operator cognition, including memory retrieval, goal-directed procedural reasoning, and perceptual-motor execution, under high-fidelity scenarios derived from a high-temperature gas-cooled reactor (HTGR) simulator. To overcome the resource constraints of large-scale cognitive modeling, TimeGAN is trained on ACT-R-generated time-series data to produce high-fidelity synthetic operator behavior datasets. These simulations are then used to drive IDHEAS-ECA assessments, enabling scalable, mechanism-informed estimation of human error probabilities (HEPs). Comparative analyses with SPAR-H and sensitivity assessments demonstrate the robustness and practical advantages of the proposed COGMIF. Finally, procedural features are mapped onto a Bayesian network to quantify the influence of contributing factors, revealing key drivers of operational risk. This work offers a credible and computationally efficient pathway to integrate cognitive theory into industrial HRA practices.

[AI-105] Toward Personalizing Quantum Computing Education: An Evolutionary LLM-Powered Approach

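[Quick read]: This paper addresses the challenges quantum computing education faces from the subject's complexity and the limitations of current tools. The key to its solution is an Intelligent Teaching Assistant built on a knowledge-graph-augmented architecture with two specialized Large Language Model (LLM) agents: a Teaching Agent for dynamic interaction and a Lesson Planning Agent for generating lesson plans. The system tracks and stores student interactions in a knowledge graph to support reasoning about and adapting personalized learning pathways, and uses the dual-agent architecture and a user-facing tag system to improve coordination and user control.

Link: https://arxiv.org/abs/2504.18603
Authors: Iizalaarab Elhaimeur, Nikos Chrisochoides
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: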

Abstract:Quantum computing education faces significant challenges due to its complexity and the limitations of current tools; this paper introduces a novel Intelligent Teaching Assistant for quantum computing education and details its evolutionary design process. The system combines a knowledge-graph-augmented architecture with two specialized Large Language Model (LLM) agents: a Teaching Agent for dynamic interaction, and a Lesson Planning Agent for lesson plan generation. The system is designed to adapt to individual student needs, with interactions meticulously tracked and stored in a knowledge graph. This graph represents student actions, learning resources, and relationships, aiming to enable reasoning about effective learning pathways. We describe the implementation of the system, highlighting the challenges encountered and the solutions implemented, including introducing a dual-agent architecture where tasks are separated, all coordinated through a central knowledge graph that maintains system awareness, and a user-facing tag system intended to mitigate LLM hallucination and improve user control. Preliminary results illustrate the system’s potential to capture rich interaction data, dynamically adapt lesson plans based on student feedback via a tag system in simulation, and facilitate context-aware tutoring through the integrated knowledge graph, though systematic evaluation is required.

[AI-106] The Philosophic Turn for AI Agents: Replacing centralized digital rhetoric with decentralized truth-seeking

【速读】:该论文试图解决AI决策支持系统在规模应用中可能对人类自主性(autonomy)和代理权(agency)造成的威胁问题。随着AI技术的快速发展,个体越来越依赖AI代理来应对生活中的复杂决策,这可能导致个体在面对复杂选择时失去代理权,或因外部控制的选择架构而削弱自主性。论文提出的解决方案关键在于在AI设计中引入哲学转向,使AI系统能够促进去中心化的真理探索和开放性探究,模仿苏格拉底式对话方法,从而通过增强个体和集体的适应性学习,使用户保持对其判断的控制,实现代理权的增强而不损害自主性。

链接: https://arxiv.org/abs/2504.18601
作者: Philipp Koralus
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the face of rapidly advancing AI technology, individuals will increasingly rely on AI agents to navigate life’s growing complexities, raising critical concerns about maintaining both human agency and autonomy. This paper addresses a fundamental dilemma posed by AI decision-support systems: the risk of either becoming overwhelmed by complex decisions, thus losing agency, or having autonomy compromised by externally controlled choice architectures reminiscent of “nudging” practices. While the “nudge” framework, based on the use of choice-framing to guide individuals toward presumed beneficial outcomes, initially appeared to preserve liberty, at AI-driven scale, it threatens to erode autonomy. To counteract this risk, the paper proposes a philosophic turn in AI design. AI should be constructed to facilitate decentralized truth-seeking and open-ended inquiry, mirroring the Socratic method of philosophical dialogue. By promoting individual and collective adaptive learning, such AI systems would empower users to maintain control over their judgments, augmenting their agency without undermining autonomy. The paper concludes by outlining essential features for autonomy-preserving AI systems, sketching a path toward AI systems that enhance human judgment rather than undermine it.

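[AI-107] QuantBench: Benchmarking AI Methods for Quantitative Investment

Quick read: This paper addresses the lack of a standardized benchmark platform, aligned with industry practice, for artificial intelligence (AI) in quantitative investment, a gap that slows research progress and limits the practical use of academic innovations. The proposed solution is QuantBench. Its key features are standardization that follows quantitative investment industry practice, support for integrating a variety of AI algorithms, and coverage of the full quantitative investment pipeline, which together give researchers and practitioners a common basis for evaluation and collaboration.

Link: https://arxiv.org/abs/2504.18600
Authors: Saizhuo Wang,Hao Kong,Jiadong Guo,Fengrui Hua,Yiyan Qi,Wanyun Zhou,Jiahao Zheng,Xinyu Wang,Lionel M. Ni,Jian Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP)
Comments: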

Abstract:The field of artificial intelligence (AI) in quantitative investment has seen significant advancements, yet it lacks a standardized benchmark aligned with industry practices. This gap hinders research progress and limits the practical application of academic innovations. We present QuantBench, an industrial-grade benchmark platform designed to address this critical need. QuantBench offers three key strengths: (1) standardization that aligns with quantitative investment industry practices, (2) flexibility to integrate various AI algorithms, and (3) full-pipeline coverage of the entire quantitative investment process. Our empirical studies using QuantBench reveal some critical research directions, including the need for continual learning to address distribution shifts, improved methods for modeling relational financial data, and more robust approaches to mitigate overfitting in low signal-to-noise environments. By providing a common ground for evaluation and fostering collaboration between researchers and practitioners, QuantBench aims to accelerate progress in AI for quantitative investment, similar to the impact of benchmark platforms in computer vision and natural language processing.

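[AI-108] BadMoE: Backdooring Mixture-of-Experts LLMs via Optimizing Routing Triggers and Infecting Dormant Experts

Quick read: This paper addresses the largely unstudied vulnerability of Mixture-of-Experts (MoE) large language models (LLMs) to backdoor attacks. The key idea is to exploit the "dormant experts" (underutilized experts) in an MoE model: by optimizing routing triggers and injecting poisoned training data, the attack turns these dormant experts into dominating experts and thereby controls the model's output. The attack, named BadMoE, has three core steps: identifying dormant experts unrelated to the target task, constructing a routing-aware loss to optimize the triggers that activate these experts, and promoting the dormant experts to dominating roles through poisoned training data.

Link: https://arxiv.org/abs/2504.18598
Authors: Qingyue Wang,Qi Pang,Xixun Lin,Shuai Wang,Daoyuan Wu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: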

Abstract:Mixture-of-Experts (MoE) have emerged as a powerful architecture for large language models (LLMs), enabling efficient scaling of model capacity while maintaining manageable computational costs. The key advantage lies in their ability to route different tokens to different “expert” networks within the model, enabling specialization and efficient handling of diverse input. However, the vulnerabilities of MoE-based LLMs still have barely been studied, and the potential for backdoor attacks in this context remains largely unexplored. This paper presents the first backdoor attack against MoE-based LLMs where the attackers poison “dormant experts” (i.e., underutilized experts) and activate them by optimizing routing triggers, thereby gaining control over the model’s output. We first rigorously prove the existence of a few “dominating experts” in MoE models, whose outputs can determine the overall MoE’s output. We also show that dormant experts can serve as dominating experts to manipulate model predictions. Accordingly, our attack, namely BadMoE, exploits the unique architecture of MoE models by 1) identifying dormant experts unrelated to the target task, 2) constructing a routing-aware loss to optimize the activation triggers of these experts, and 3) promoting dormant experts to dominating roles via poisoned training data.

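[AI-109] EnviroPiNet: A Physics-Guided AI Model for Predicting Biofilter Performance

Quick read: This paper addresses the difficulty of predicting the performance of environmental biotechnology systems such as drinking water biofilters. These systems depend on complex interactions between microbial communities and their physical-chemical environment, while the available datasets are high-dimensional, sparse, and lacking in diversity, so they do not fully capture system behaviour. The key to the solution is applying Buckingham Pi theory for dimensionality reduction, which identifies physically meaningful dimensionless variables that improve both predictive accuracy and interpretability. Using these variables, the authors build the Environmental Buckingham Pi Neural Network (EnviroPiNet), which reaches an R² of 0.9236 on the test set and significantly outperforms traditional data-driven methods.

Link: https://arxiv.org/abs/2504.18595
Authors: Uzma,Fabien Cholet,Domenic Quinn,Cindy Smith,Siming You,William Sloan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: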

Abstract:Environmental biotechnologies, such as drinking water biofilters, rely on complex interactions between microbial communities and their surrounding physical-chemical environments. Predicting the performance of these systems is challenging due to high-dimensional, sparse datasets that lack diversity and fail to fully capture system behaviour. Accurate predictive models require innovative, science-guided approaches. In this study, we present the first application of Buckingham Pi theory to modelling biofilter performance. This dimensionality reduction technique identifies meaningful, dimensionless variables that enhance predictive accuracy and improve model interpretability. Using these variables, we developed the Environmental Buckingham Pi Neural Network (EnviroPiNet), a physics-guided model benchmarked against traditional data-driven methods, including Principal Component Analysis (PCA) and autoencoder neural networks. Our findings demonstrate that the EnviroPiNet model achieves an R^2 value of 0.9236 on the testing dataset, significantly outperforming PCA and autoencoder methods. The Buckingham Pi variables also provide insights into the physical and chemical relationships governing biofilter behaviour, with implications for system design and optimization. This study highlights the potential of combining physical principles with AI approaches to model complex environmental systems characterized by sparse, high-dimensional datasets.
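To make the Buckingham Pi step concrete: dimensionless groups correspond to null-space vectors of the matrix that lists each variable's exponents in the base dimensions. The sketch below is only an illustration under assumed inputs. It uses a textbook pipe-flow variable set (density, velocity, length, viscosity) rather than the paper's biofilter variables, and `nullspace` and `DIMENSION_MATRIX` are hypothetical names introduced here.

```python
from fractions import Fraction


def nullspace(matrix):
    """Return a basis of the null space of `matrix` (a list of rows),
    computed with exact Gauss-Jordan elimination over fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivots = []  # pivot column of each reduced row, in row order
    r = 0
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        # Clear column c in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    # Each free column yields one null-space vector (one Pi group).
    basis = []
    for f in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[f] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            vec[p] = -rows[row_idx][f]
        basis.append(vec)
    return basis


# Assumed example variables (columns): density, velocity, length, viscosity.
# Rows give each variable's exponent in mass, length, and time.
DIMENSION_MATRIX = [
    [1, 0, 0, 1],     # mass
    [-3, 1, 1, -1],   # length
    [0, -1, 0, -1],   # time
]

pi_groups = nullspace(DIMENSION_MATRIX)
print(pi_groups)
```

For this variable set the single basis vector has exponents (-1, -1, -1, 1), that is viscosity / (density × velocity × length), the inverse of the Reynolds number. In a pipeline like the one the abstract describes, such groups would replace the raw measurements as model inputs.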

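[AI-110] A Simple DropConnect Approach to Transfer-based Targeted Attack

Quick read: This paper studies transfer-based black-box attacks, in which adversarial examples generated with a single surrogate model are applied directly to target models, and addresses the fact that Attack Success Rates (ASRs) remain low in the targeted setting. Adversarial examples produced by existing methods tend to overfit the surrogate model and fail to mislead other models. The key to the proposed solution is to mitigate perturbation co-adaptation with DropConnect: diverse variants of the surrogate model are created at each optimization iteration, which improves the transferability of the adversarial examples.

Link: https://arxiv.org/abs/2504.18594
Authors: Tongrui Su,Qingbin Li,Shengyu Zhu,Wei Chen,Xueqi Cheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: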

Abstract:We study the problem of transfer-based black-box attack, where adversarial samples generated using a single surrogate model are directly applied to target models. Compared with untargeted attacks, existing methods still have lower Attack Success Rates (ASRs) in the targeted setting, i.e., the obtained adversarial examples often overfit the surrogate model but fail to mislead other models. In this paper, we hypothesize that the pixels or features in these adversarial examples collaborate in a highly dependent manner to maximize the success of an adversarial attack on the surrogate model, which we refer to as perturbation co-adaptation. Then, we propose to Mitigate perturbation Co-adaptation by DropConnect (MCD) to enhance transferability, by creating diverse variants of surrogate model at each optimization iteration. We conduct extensive experiments across various CNN- and Transformer-based models to demonstrate the effectiveness of MCD. In the challenging scenario of transferring from a CNN-based model to Transformer-based models, MCD achieves 13% higher average ASRs compared with state-of-the-art baselines. MCD boosts the performance of self-ensemble methods by bringing in more diversification across the variants while reserving sufficient semantic information for each variant. In addition, MCD attains the highest performance gain when scaling the compute of crafting adversarial examples.
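The idea of creating a different surrogate variant at each iteration can be sketched with plain DropConnect, which zeroes individual weights at random (as opposed to dropout, which zeroes activations). The snippet below is a toy illustration, not the authors' MCD implementation. The layer, the drop probability, and the helper names `dropconnect_variant` and `linear` are assumptions, and the abstract does not say which layers are masked or whether the remaining weights are rescaled.

```python
import random


def dropconnect_variant(weights, drop_prob, rng):
    """Return a copy of `weights` in which each entry is independently
    set to zero with probability `drop_prob` (no rescaling applied)."""
    return [[0.0 if rng.random() < drop_prob else w for w in row]
            for row in weights]


def linear(weights, x):
    """Apply a bias-free linear layer to the input vector `x`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


rng = random.Random(0)
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]   # toy surrogate layer (assumed values)
x = [1.0, 2.0, 3.0]

# One freshly masked surrogate variant per (hypothetical) attack iteration.
variants = [dropconnect_variant(W, drop_prob=0.3, rng=rng) for _ in range(4)]
outputs = [linear(v, x) for v in variants]
for step, out in enumerate(outputs):
    print(step, out)
```

In an actual attack loop, the gradient of the adversarial loss would be taken through the masked variant at each step, so the perturbation cannot rely on any fixed set of connections in the surrogate.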

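[AI-111] Severity Classification of Chronic Obstructive Pulmonary Disease in Intensive Care Units: A Semi-Supervised Approach Using MIMIC-III Dataset

Quick read: This paper aims to make severity assessment of chronic obstructive pulmonary disease (COPD) in the intensive care unit (ICU) more precise, in order to improve clinical management. The key to its solution is a machine learning classification framework trained on key ICU parameters from the MIMIC-III database, such as blood gas measurements and vital signs, combined with semi-supervised learning to make use of unlabeled data and improve model performance. In the experiments, a random forest classifier separates mild-to-moderate from severe COPD cases with 92.51% accuracy and a ROC AUC of 0.98, offering an efficient and accurate tool for rapid COPD assessment in ICU settings.

Link: https://arxiv.org/abs/2504.18593
Authors: Akram Shojaei,Mehdi Delrobaei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: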

Abstract:Chronic obstructive pulmonary disease (COPD) represents a significant global health burden, where precise severity assessment is particularly critical for effective clinical management in intensive care unit (ICU) settings. This study introduces an innovative machine learning framework for COPD severity classification utilizing the MIMIC-III critical care database, thereby expanding the applications of artificial intelligence in critical care medicine. Our research developed a robust classification model incorporating key ICU parameters such as blood gas measurements and vital signs, while implementing semi-supervised learning techniques to effectively utilize unlabeled data and enhance model performance. The random forest classifier emerged as particularly effective, demonstrating exceptional discriminative capability with 92.51% accuracy and 0.98 ROC AUC in differentiating between mild-to-moderate and severe COPD cases. This machine learning approach provides clinicians with a practical, accurate, and efficient tool for rapid COPD severity evaluation in ICU environments, with significant potential to improve both clinical decision-making processes and patient outcomes. Future research directions should prioritize external validation across diverse patient populations and integration with clinical decision support systems to optimize COPD management in critical care settings.

[AI-112] A multilevel approach to accelerate the training of Transformers

Quick read: This paper addresses the slow training of Transformer architectures. The key to its solution is to interpret these architectures as ordinary differential equations (ODEs) and to accelerate training by varying the discretization of the resulting ODE Transformers.

Link: https://arxiv.org/abs/2504.18590
Authors: Guillaume Lauga (OCKHAM),Maël Chaumette (OCKHAM),Edgar Desainte-Maréville (OCKHAM),Étienne Lasalle (OCKHAM),Arthur Lebeurrier (OCKHAM)
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Abstract:In this article, we investigate the potential of multilevel approaches to accelerate the training of transformer architectures. Using an ordinary differential equation (ODE) interpretation of these architectures, we propose an appropriate way of varying the discretization of these ODE Transformers in order to accelerate the training. We validate our approach experimentally by a comparison with the standard training procedure.

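[AI-113] Dynamic QoS Prediction via a Non-Negative Tensor Snowflake Factorization

Quick read: This paper addresses the large amount of unobserved QoS data that arises in Web services as the numbers of users and services grow, which significantly affects users' choice of services. The key to its solution is a Non-negative Snowflake Factorization of tensors model. The model designs a snowflake core tensor to strengthen its learning capability and learns its parameters with a single latent factor-based, nonnegative multiplication update on tensor (SLF-NMUT), so that it captures dynamic user-service interaction patterns more accurately and predicts missing QoS data better.

Link: https://arxiv.org/abs/2504.18588
Authors: YongHui Xia,Lan Wang,Hao Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: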

Abstract:Dynamic quality of service (QoS) data exhibit rich temporal patterns in user-service interactions, which are crucial for a comprehensive understanding of user behavior and service conditions in Web service. As the number of users and services increases, there is a large amount of unobserved QoS data, which significantly affects users’ choice of services. To predict unobserved QoS data, we propose a Non-negative Snowflake Factorization of tensors model. This method designs a snowflake core tensor to enhance the model’s learning capability. Additionally, it employs a single latent factor-based, nonnegative multiplication update on tensor (SLF-NMUT) for parameter learning. Empirical results demonstrate that the proposed model more accurately learns dynamic user-service interaction patterns, thereby yielding improved predictions for missing QoS data.

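[AI-114] Training Large Language Models to Reason via EM Policy Gradient

Quick read: This paper aims to improve the reasoning ability of Large Language Models (LLMs), specifically by optimizing expected return to strengthen reasoning on complex tasks. The key to its solution is an off-policy policy gradient algorithm, EM Policy Gradient, which frames the reasoning task as an Expectation-Maximization (EM) optimization problem and alternates between sampling diverse rationale trajectories and reward-guided fine-tuning, yielding simpler and more efficient policy updates. Unlike PPO and GRPO, EM Policy Gradient avoids complex importance weights and heuristic clipping, offering a simpler and more principled off-policy policy gradient method.

Link: https://arxiv.org/abs/2504.18587
Authors: Tianbing Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: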

Abstract:Recently, foundation models such as OpenAI’s O1 and O3, along with DeepSeek’s R1, have demonstrated strong reasoning capacities and problem-solving skills acquired through large-scale reinforcement learning (RL), with wide applications in mathematics, coding, science, intelligent agents, and virtual assistants. In this work, we introduce an off-policy reinforcement learning algorithm, EM Policy Gradient, aimed at enhancing LLM reasoning by optimizing expected return over reasoning trajectories. We frame the reasoning task as an Expectation-Maximization (EM) optimization problem, alternating between sampling diverse rationale trajectories and performing reward-guided fine-tuning. Unlike PPO and GRPO, which rely on complex importance weights and heuristic clipping, our method provides a simpler, more principled off-policy policy gradient approach, eliminating these complexities while maintaining strong performance. We evaluate the effectiveness of EM Policy Gradient on the GSM8K and MATH (HARD) datasets, where it achieves performance comparable to or slightly surpassing the state-of-the-art GRPO, while offering additional advantages in scalability, simplicity, and reasoning conciseness. Moreover, models fine-tuned with our method exhibit cognitive behaviors, such as sub-problem decomposition, self-verification, and backtracking, highlighting its potential to enhance both the interpretability and robustness of LLM reasoning.

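[AI-115] WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Quick read: This paper addresses the security of web navigation AI agents against indirect prompt injections, in which an attacker plants hidden instructions that lead the agent to act against the legitimate user's intent. The key to its solution is a new benchmark, WASP (Web Agent Security against Prompt injection attacks), which introduces realistic web agent hijacking objectives and an isolated environment for testing them without affecting real users or the live web. The authors also run baseline attacks against three popular web agentic systems. The results show that even models with advanced reasoning capabilities and instruction hierarchy mitigations are susceptible to low-effort, human-written prompt injections, although current agents are still far from capable enough to complete the attacker's goals end-to-end.

Link: https://arxiv.org/abs/2504.18575
Authors: Ivan Evtimov,Arman Zharmagambetov,Aaron Grattafiori,Chuan Guo,Kamalika Chaudhuri
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: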

Abstract:Web navigation AI agents use language-and-vision foundation models to enhance productivity but these models are known to be susceptible to indirect prompt injections that get them to follow instructions different from the legitimate user’s. Existing explorations of this threat applied to web agents often focus on a single isolated adversarial goal, test with injected instructions that are either too easy or not truly malicious, and often give the adversary unreasonable access. In order to better focus adversarial research, we construct a new benchmark called WASP (Web Agent Security against Prompt injection attacks) that introduces realistic web agent hijacking objectives and an isolated environment to test them in that does not affect real users or the live web. As part of WASP, we also develop baseline attacks against three popular web agentic systems (VisualWebArena, Claude Computer Use, and Operator) instantiated with various state-of-the-art models. Our evaluation shows that even AI agents backed by models with advanced reasoning capabilities and by models with instruction hierarchy mitigations are susceptible to low-effort human-written prompt injections. However, the realistic objectives in WASP also allow us to observe that agents are currently not capable enough to complete the goals of attackers end-to-end. Agents begin executing the adversarial instruction between 16 and 86% of the time but only achieve the goal between 0 and 17% of the time. Based on these findings, we argue that adversarial researchers should demonstrate stronger attacks that more consistently maintain control over the agent given realistic constraints on the adversary’s power.

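[AI-116] Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism

Quick read: This paper addresses the performance gap between sequence models (Transformers and state space models, SSMs) on in-context retrieval, in particular the weakness of SSMs on algorithmic tasks that require looking back at past information. The key contribution is the discovery and analysis of a Gather-and-Aggregate (GA) mechanism shared by both architectures: a Gather Head extracts the relevant information from the context, and an Aggregate Head integrates it into a final representation. The study shows that GA concentrates in a few heads, which become critical bottlenecks for retrieval, and that SSMs implement GA with smoother attention patterns rather than the sharp token transitions seen in Transformers, which explains the gap. Introducing attention into the GA heads of an SSM effectively improves its retrieval ability.

Link: https://arxiv.org/abs/2504.18574
Authors: Aviv Bick,Eric Xing,Albert Gu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: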

Abstract:SSMs offer efficient processing of long sequences with fixed state sizes, but struggle with algorithmic tasks like retrieving past context. In this work, we examine how such in-context retrieval operates within Transformer- and SSM-based language models. We find that both architectures develop the same fundamental Gather-and-Aggregate (GA) mechanism. A Gather Head first identifies and extracts relevant information from the context, which an Aggregate Head then integrates into a final representation. Across both model types, GA concentrates in just a few heads, making them critical bottlenecks even for benchmarks that require a basic form of retrieval. For example, disabling a single Gather or Aggregate Head of a pruned Llama-3.1-8B degrades its ability to retrieve the correct answer letter in MMLU, reducing accuracy from 66% to 25%. This finding suggests that in-context retrieval can obscure the limited knowledge demands of certain tasks. Despite strong MMLU performance with retrieval intact, the pruned model fails on other knowledge tests. Similar GA dependencies exist in GSM8K, BBH, and dialogue tasks. Given the significance of GA in performance, we show that retrieval challenges in SSMs manifest in how they implement GA, leading to smoother attention patterns rather than the sharp token transitions that effective GA relies on. Thus, while a gap exists between Transformers and SSMs in implementing in-context retrieval, it is confined to a few heads, not the entire model. This insight suggests a unified explanation for performance differences between Transformers and SSMs while also highlighting ways to combine their strengths. For example, in pretrained hybrid models, attention components naturally take on the role of Aggregate Heads. Similarly, in a pretrained pure SSM, replacing a single GA head with an attention-based variant significantly improves retrieval.

[AI-117] BELL: Benchmarking the Explainability of Large Language Models

Quick read: This paper addresses the lack of transparency in the decision-making of Large Language Models (LLMs), which raises significant concerns about trust, bias, and model performance. The key to the proposed solution is a standardized benchmarking technique, "Benchmarking the Explainability of Large Language Models", designed to evaluate the explainability of large language models.

Link: https://arxiv.org/abs/2504.18572
Authors: Syed Quiser Ahmed,Bharathi Vokkaliga Ganesh,Jagadish Babu P,Karthick Selvaraj,ReddySiva Naga Parvathi Devi,Sravya Kappala
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models have demonstrated remarkable capabilities in natural language processing, yet their decision-making processes often lack transparency. This opaqueness raises significant concerns regarding trust, bias, and model performance. To address these issues, understanding and evaluating the interpretability of LLMs is crucial. This paper introduces a standardised benchmarking technique, Benchmarking the Explainability of Large Language Models, designed to evaluate the explainability of large language models.

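[AI-118] Large Language Model Empowered Privacy-Protected Framework for PHI Annotation in Clinical Notes

Quick read: This paper addresses the de-identification of Protected Health Information (PHI) in medical data to reduce the risk of confidentiality breaches. Existing rule-based and learning-based methods are limited in generalizability and need large amounts of annotated data, while Large Language Models (LLMs), despite their strengths, bring privacy risks and high computational costs. The key to the proposed solution is the LPPA framework, which fine-tunes LLMs locally on synthetic clinical notes to balance strong privacy protection with high PHI annotation accuracy, providing a scalable and efficient approach to protecting patient privacy.

Link: https://arxiv.org/abs/2504.18569
Authors: Guanchen Wu,Linzhi Zheng,Han Xie,Zhen Xiang,Jiaying Lu,Darren Liu,Delgersuren Bold,Bo Li,Xiao Hu,Carl Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Shorter version published in MedInfo 2025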

Abstract:The de-identification of private information in medical data is a crucial process to mitigate the risk of confidentiality breaches, particularly when patient personal details are not adequately removed before the release of medical records. Although rule-based and learning-based methods have been proposed, they often struggle with limited generalizability and require substantial amounts of annotated data for effective performance. Recent advancements in large language models (LLMs) have shown significant promise in addressing these issues due to their superior language comprehension capabilities. However, LLMs present challenges, including potential privacy risks when using commercial LLM APIs and high computational costs for deploying open-source LLMs locally. In this work, we introduce LPPA, an LLM-empowered Privacy-Protected PHI Annotation framework for clinical notes, targeting the English language. By fine-tuning LLMs locally with synthetic notes, LPPA ensures strong privacy protection and high PHI annotation accuracy. Extensive experiments demonstrate LPPA’s effectiveness in accurately de-identifying private information, offering a scalable and efficient solution for enhancing patient privacy protection.

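[AI-119] Feature Selection via GANs (GANFS): Enhancing Machine Learning Models for DDoS Mitigation

Quick read: This paper addresses the difficulty of detecting Distributed Denial of Service (DDoS) attacks in large-scale networked systems, in particular the challenge of finding informative features in high-dimensional, redundant network traffic data. Traditional feature selection methods fall short in scalability and adaptability in complex attack environments. The key to the proposed solution is Generative Adversarial Network-based Feature Selection (GANFS), which uses adversarial learning to identify the most informative features and applies a perturbation-based sensitivity analysis to the discriminator, so that feature importance can be ranked without full supervision.

Link: https://arxiv.org/abs/2504.18566
Authors: Harsh Patel
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: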

Abstract:Distributed Denial of Service (DDoS) attacks represent a persistent and evolving threat to modern networked systems, capable of causing large-scale service disruptions. The complexity of such attacks, often hidden within high-dimensional and redundant network traffic data, necessitates robust and intelligent feature selection techniques for effective detection. Traditional methods such as filter-based, wrapper-based, and embedded approaches, each offer strengths but struggle with scalability or adaptability in complex attack environments. In this study, we explore these existing techniques through a detailed comparative analysis and highlight their limitations when applied to large-scale DDoS detection tasks. Building upon these insights, we introduce a novel Generative Adversarial Network-based Feature Selection (GANFS) method that leverages adversarial learning dynamics to identify the most informative features. By training a GAN exclusively on attack traffic and employing a perturbation-based sensitivity analysis on the Discriminator, GANFS effectively ranks feature importance without relying on full supervision. Experimental evaluations using the CIC-DDoS2019 dataset demonstrate that GANFS not only improves the accuracy of downstream classifiers but also enhances computational efficiency by significantly reducing feature dimensionality. These results point to the potential of integrating generative learning models into cybersecurity pipelines to build more adaptive and scalable detection systems.
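The perturbation-based sensitivity analysis mentioned in the abstract can be illustrated roughly as follows: nudge one input feature at a time and measure how much the discriminator's score moves. This is a sketch under assumptions, not the paper's procedure. `toy_discriminator` is a hypothetical stand-in for a trained GAN discriminator, and the perturbation size, the sample values, and the averaging scheme are all illustrative choices.

```python
import math


def rank_features_by_sensitivity(score_fn, samples, delta=0.1):
    """Rank features by the mean absolute change in `score_fn` when each
    feature is shifted by `delta`, one feature at a time."""
    n_features = len(samples[0])
    sensitivity = [0.0] * n_features
    for x in samples:
        base = score_fn(x)
        for j in range(n_features):
            perturbed = list(x)
            perturbed[j] += delta
            sensitivity[j] += abs(score_fn(perturbed) - base)
    sensitivity = [s / len(samples) for s in sensitivity]
    # Most sensitive feature first.
    ranking = sorted(range(n_features), key=lambda j: sensitivity[j], reverse=True)
    return ranking, sensitivity


def toy_discriminator(x):
    """Hypothetical stand-in for a trained discriminator: a sigmoid over a
    fixed linear score in which feature 1 has zero weight."""
    z = 3.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]
    return 1.0 / (1.0 + math.exp(-z))


# Assumed toy "traffic" samples with three features each.
samples = [[0.1, 0.5, 0.2], [0.0, -0.3, 0.4], [-0.2, 0.8, 0.1]]
ranking, sensitivity = rank_features_by_sensitivity(toy_discriminator, samples)
print(ranking, sensitivity)
```

Here the ranking puts feature 0 first, feature 2 second, and the ignored feature 1 last with zero sensitivity. A feature selector built this way would keep the top-ranked features and pass only those to the downstream classifier.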

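[AI-120] RepliBench: Evaluating the autonomous replication capabilities of language model agents

Quick read: This paper addresses the safety risk posed by uncontrollable autonomous replication of language model agents. The key to its solution is RepliBench, a suite of evaluations that measures autonomous replication capabilities by decomposing them into four core domains: obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on that compute for long periods. The study creates 20 novel task families comprising 86 individual tasks and benchmarks 5 frontier models to assess their autonomous replication capabilities across scenarios.

Link: https://arxiv.org/abs/2504.18565
Authors: Sid Black,Asa Cooper Stickland,Jake Pencharz,Oliver Sourbut,Michael Schmatz,Jay Bailey,Ollie Matthews,Ben Millwood,Alex Remedios,Alan Cooney
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: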

Abstract:Uncontrollable autonomous replication of language model agents poses a critical safety risk. To better understand this risk, we introduce RepliBench, a suite of evaluations designed to measure autonomous replication capabilities. RepliBench is derived from a decomposition of these capabilities covering four core domains: obtaining resources, exfiltrating model weights, replicating onto compute, and persisting on this compute for long periods. We create 20 novel task families consisting of 86 individual tasks. We benchmark 5 frontier models, and find they do not currently pose a credible threat of self-replication, but succeed on many components and are improving rapidly. Models can deploy instances from cloud compute providers, write self-propagating programs, and exfiltrate model weights under simple security setups, but struggle to pass KYC checks or set up robust and persistent agent deployments. Overall the best model we evaluated (Claude 3.7 Sonnet) has a 50% pass@10 score on 15/20 task families, and a 50% pass@10 score for 9/20 families on the hardest variants. These findings suggest autonomous replication capability could soon emerge with improvements in these remaining areas or with human assistance.
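For readers unfamiliar with the pass@10 figures quoted above, the snippet below sketches the estimator commonly used in code-generation evaluation: given n attempts of which c succeed, it estimates the probability that at least one of k randomly drawn attempts succeeds. The abstract does not state how RepliBench computes pass@10, so treating this estimator as the one used is an assumption, and `pass_at_k` is a name introduced here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n attempts with c successes: the probability
    that a random size-k subset of the attempts contains a success."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so every subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only, not results from the paper.
print(pass_at_k(n=20, c=1, k=10))
print(pass_at_k(n=10, c=0, k=10))
```

With 20 attempts and a single success, half of all size-10 subsets contain that success, so the estimate is 0.5. With no successes the estimate is 0.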

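[AI-121] DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization

Quick read: This paper addresses the limited effectiveness of existing attacks against safety-aligned Large Language Models (LLMs) shielded by guardrails, focusing on dual-jailbreaking, that is, attacks that target both the LLM and the guardrail. The key to its solution is DualBreach, a target-driven dual-jailbreaking framework built on a Target-driven Initialization (TDI) strategy and a Multi-Target Optimization (MTO) method. It constructs initial prompts dynamically and uses approximate gradients to adapt the prompts jointly across guardrails and LLMs, achieving a high dual-jailbreaking success rate with fewer queries.

Link: https://arxiv.org/abs/2504.18564
Authors: Xinzhe Huang,Kedong Xiu,Tianhang Zheng,Churui Zeng,Wangze Ni,Zhan Qiin,Kui Ren,Chun Chen
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 figures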

Abstract:Recent research has focused on exploring the vulnerabilities of Large Language Models (LLMs), aiming to elicit harmful and/or sensitive content from LLMs. However, due to the insufficient research on dual-jailbreaking – attacks targeting both LLMs and Guardrails, the effectiveness of existing attacks is limited when attempting to bypass safety-aligned LLMs shielded by guardrails. Therefore, in this paper, we propose DualBreach, a target-driven framework for dual-jailbreaking. DualBreach employs a Target-driven Initialization (TDI) strategy to dynamically construct initial prompts, combined with a Multi-Target Optimization (MTO) method that utilizes approximate gradients to jointly adapt the prompts across guardrails and LLMs, which can simultaneously save the number of queries and achieve a high dual-jailbreaking success rate. For black-box guardrails, DualBreach either employs a powerful open-sourced guardrail or imitates the target black-box guardrail by training a proxy model, to incorporate guardrails into the MTO process. We demonstrate the effectiveness of DualBreach in dual-jailbreaking scenarios through extensive evaluation on several widely-used datasets. Experimental results indicate that DualBreach outperforms state-of-the-art methods with fewer queries, achieving significantly higher success rates across all settings. More specifically, DualBreach achieves an average dual-jailbreaking success rate of 93.67% against GPT-4 with Llama-Guard-3 protection, whereas the best success rate achieved by other methods is 88.33%. Moreover, DualBreach only uses an average of 1.77 queries per successful dual-jailbreak, outperforming other state-of-the-art methods. For the purpose of defense, we propose an XGBoost-based ensemble defensive mechanism named EGuard, which integrates the strengths of multiple guardrails, demonstrating superior performance compared with Llama-Guard-3. 
Cite as: arXiv:2504.18564 [cs.CR], https://doi.org/10.48550/arXiv.2504.18564

[AI-122] Deep Learning with Pretrained Internal World Layers: A Gemma 3-Based Modular Architecture for Wildfire Prediction

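[Quick read]: This paper addresses wildfire occurrence prediction, in particular the weak generalization and limited interpretability of models trained on limited data. The key idea is to treat the intermediate layers of a large pretrained Transformer (here Gemma 3) as an "internal world". A modular architecture maps tabular wildfire features into the hidden dimension expected by the model's mid-layer Transformer blocks, keeps those Transformer sub-layers frozen to preserve their pretrained representations, and trains only the small input and output networks. This reduces the number of trainable parameters and the risk of overfitting while retaining the model's broad knowledge, which improves predictive accuracy and robustness.

Link: https://arxiv.org/abs/2504.18562
Authors: Ayoub Jadouli, Chaker El Amrani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)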

Abstract:Deep learning models, especially large Transformers, carry substantial “memory” in their intermediate layers – an internal world that encodes a wealth of relational and contextual knowledge. This work harnesses that internal world for wildfire occurrence prediction by introducing a modular architecture built upon Gemma 3, a state-of-the-art multimodal model. Rather than relying on Gemma 3’s original embedding and positional encoding stacks, we develop a custom feed-forward module that transforms tabular wildfire features into the hidden dimension required by Gemma 3’s mid-layer Transformer blocks. We freeze these Gemma 3 sub-layers – thus preserving their pretrained representation power – while training only the smaller input and output networks. This approach minimizes the number of trainable parameters and reduces the risk of overfitting on limited wildfire data, yet retains the benefits of Gemma 3’s broad knowledge. Evaluations on a Moroccan wildfire dataset demonstrate improved predictive accuracy and robustness compared to standard feed-forward and convolutional baselines. Ablation studies confirm that the frozen Transformer layers consistently contribute to better representations, underscoring the feasibility of reusing large-model mid-layers as a learned internal world. Our findings suggest that strategic modular reuse of pretrained Transformers can enable more data-efficient and interpretable solutions for critical environmental applications such as wildfire risk management.

[AI-123] RDI: An adversarial robustness evaluation metric for deep neural networks based on sample clustering features

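[Quick read]: This paper addresses the evaluation of the adversarial robustness of deep neural networks (DNNs). Existing evaluation methods depend on specific attack algorithms, are computationally expensive, or lack accuracy. The key contribution is a new robustness metric, the Robustness Difference Index (RDI), which is based on sample clustering features: it quantifies model robustness from the intra-class and inter-class distances of feature vectors separated by the decision boundary. RDI is attack-independent and computationally efficient.

Link: https://arxiv.org/abs/2504.18556
Authors: Jialei Song, Xingquan Zuo, Feiyang Wang, Hai Huang, Tianle Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)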

Abstract:Deep neural networks (DNNs) are highly susceptible to adversarial samples, raising concerns about their reliability in safety-critical tasks. Currently, methods of evaluating adversarial robustness are primarily categorized into attack-based and certified robustness evaluation approaches. The former not only relies on specific attack algorithms but also is highly time-consuming, while the latter due to its analytical nature, is typically difficult to implement for large and complex models. A few studies evaluate model robustness based on the model’s decision boundary, but they suffer from low evaluation accuracy. To address the aforementioned issues, we propose a novel adversarial robustness evaluation metric, Robustness Difference Index (RDI), which is based on sample clustering features. RDI draws inspiration from clustering evaluation by analyzing the intra-class and inter-class distances of feature vectors separated by the decision boundary to quantify model robustness. It is attack-independent and has high computational efficiency. Experiments show that, RDI demonstrates a stronger correlation with the gold-standard adversarial robustness metric of attack success rate (ASR). The average computation time of RDI is only 1/30 of the evaluation method based on the PGD attack. Our open-source code is available at: this https URL.
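
The abstract describes RDI only in words (intra-class versus inter-class distances of feature vectors) and does not give its formula. The sketch below is therefore not RDI itself. It is a generic clustering-style separation score, with a hypothetical function name `separation_score` and toy 2-D features, meant only to illustrate how such distances can be turned into one attack-independent number.

```python
import math
from statistics import mean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return [mean(p[k] for p in points) for k in range(dims)]


def separation_score(features_by_class):
    """Mean inter-centroid distance divided by mean distance of samples to their own centroid.

    Larger values mean tighter, better separated classes. Illustrative only:
    this is an assumed stand-in, not the RDI definition from the paper.
    """
    centroids = {c: centroid(pts) for c, pts in features_by_class.items()}
    intra = mean(
        math.dist(p, centroids[c])
        for c, pts in features_by_class.items()
        for p in pts
    )
    labels = list(centroids)
    inter = mean(
        math.dist(centroids[a], centroids[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    )
    return inter / intra


# Toy feature vectors: two tight, distant classes versus two loose, nearby ones.
tight = {"cat": [[0, 0], [0, 1]], "dog": [[10, 0], [10, 1]]}
loose = {"cat": [[0, 0], [0, 4]], "dog": [[3, 0], [3, 4]]}
print(separation_score(tight), separation_score(loose))
```

The score needs at least two classes and at least one sample that is not exactly on its class centroid, otherwise the ratio is undefined.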

[AI-124] Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review

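[Quick read]: This paper addresses the challenges of evaluating the quality of synthetic health data: the lack of consensus on evaluation methods, improper use of evaluation metrics, limited input from domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. The key contribution is a set of guidelines for generating and evaluating synthetic data, intended to ensure its reliability, relevance, and appropriate use, and thereby support its effective use in practice.

Link: https://arxiv.org/abs/2504.18544
Authors: Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)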

Abstract:Generating synthetic tabular data can be challenging, however evaluation of their quality is just as challenging, if not more. This systematic review sheds light on the critical importance of rigorous evaluation of synthetic health data to ensure reliability, relevance, and their appropriate use. Based on screening of 1766 papers and a detailed review of 101 papers we identified key challenges, including lack of consensus on evaluation methods, improper use of evaluation metrics, limited input from domain experts, inadequate reporting of dataset characteristics, and limited reproducibility of results. In response, we provide several guidelines on the generation and evaluation of synthetic data, to allow the community to unlock and fully harness the transformative potential of synthetic data and accelerate innovation.

[AI-125] Sharp higher order convergence rates for the Adam optimizer

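[Quick read]: This paper studies the convergence speed of gradient-descent-based optimization methods used to train deep neural networks, comparing several optimizers. The key result is that, in a neighborhood of a strict local minimizer, Adam converges at the same rate as the momentum method, $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$, whereas RMSprop only converges at the rate $(x - 1)(x + 1)^{-1}$, where $x$ is the condition number of the Hessian of the objective function at the local minimizer.

Link: https://arxiv.org/abs/2504.19426
Authors: Steffen Dereich, Arnulf Jentzen, Adrian Riekert
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 27 pages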

Abstract:Gradient descent based optimization methods are the methods of choice to train deep neural networks in machine learning. Beyond the standard gradient descent method, also suitable modified variants of standard gradient descent involving acceleration techniques such as the momentum method and/or adaptivity techniques such as the RMSprop method are frequently considered optimization methods. These days the most popular of such sophisticated optimization schemes is presumably the Adam optimizer that has been proposed in 2014 by Kingma and Ba. A highly relevant topic of research is to investigate the speed of convergence of such optimization methods. In particular, in 1964 Polyak showed that the standard gradient descent method converges in a neighborhood of a strict local minimizer with rate $(x - 1)(x + 1)^{-1}$ while momentum achieves the (optimal) strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ where $x \in (1,\infty)$ is the condition number (the ratio of the largest and the smallest eigenvalue) of the Hessian of the objective function at the local minimizer. It is the key contribution of this work to reveal that Adam also converges with the strictly faster convergence rate $(\sqrt{x} - 1)(\sqrt{x} + 1)^{-1}$ while RMSprop only converges with the convergence rate $(x - 1)(x + 1)^{-1}$.
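
To get a feel for how far apart the two rates quoted in the abstract are, the following sketch evaluates both for a few condition numbers. It is a numerical illustration only: the function names are made up for this example, and `steps_to_shrink` assumes, for simplicity, that the rate acts as an exact per-step contraction factor.

```python
import math


def plain_rate(x: float) -> float:
    """Rate (x - 1)/(x + 1), quoted for gradient descent and RMSprop."""
    return (x - 1.0) / (x + 1.0)


def accelerated_rate(x: float) -> float:
    """Rate (sqrt(x) - 1)/(sqrt(x) + 1), quoted for momentum and Adam."""
    root = math.sqrt(x)
    return (root - 1.0) / (root + 1.0)


def steps_to_shrink(rate: float, factor: float = 1e-6) -> int:
    """Smallest n with rate**n <= factor (simplifying assumption: exact contraction)."""
    return math.ceil(math.log(factor) / math.log(rate))


for x in (10.0, 100.0, 10_000.0):
    slow, fast = plain_rate(x), accelerated_rate(x)
    print(
        f"x={x:>7.0f}  plain={slow:.4f} ({steps_to_shrink(slow)} steps)  "
        f"accelerated={fast:.4f} ({steps_to_shrink(fast)} steps)"
    )
```

For $x = 100$ the two rates are $99/101 \approx 0.980$ and $9/11 \approx 0.818$, so the gap widens quickly as the Hessian becomes more ill-conditioned.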

[AI-126] Machine Learning-Based Modeling of the Anode Heel Effect in X-ray Beam Monte Carlo Simulations

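[Quick read]: This paper addresses the limited accuracy of Monte Carlo simulations in X-ray imaging, specifically the non-uniform beam intensity distribution and dosimetric errors caused by the anode heel effect. The key contribution is an AI-based model that dynamically adjusts the beam weights on the anode and cathode sides of the X-ray tube, reproducing the asymmetry of clinical X-ray beams and yielding more accurate dose distributions and image quality.

Link: https://arxiv.org/abs/2504.19155
Authors: Hussein Harb, Didier Benoit, Axel Rannou, Chi-Hieu Pham, Valentin Tissot, Bahaa Nasr, Julien Bert
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures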

Abstract:This study enhances Monte Carlo simulation accuracy in X-ray imaging by developing an AI-driven model for the anode heel effect, achieving improved beam intensity distribution and dosimetric precision. Through dynamic adjustments to beam weights on the anode and cathode sides of the X-ray tube, our machine learning model effectively replicates the asymmetry characteristic of clinical X-ray beams. Experimental results reveal dose rate increases of up to 9.6% on the cathode side and reductions of up to 12.5% on the anode side, for energy levels between 50 and 120 kVp. These experimentally optimized beam weights were integrated into the OpenGATE and GGEMS Monte Carlo toolkits, significantly advancing dosimetric simulation accuracy and the image quality which closely resembles the clinical imaging. Validation with fluence and dose actors demonstrated that the AI-based model closely mirrors clinical beam behavior, providing substantial improvements in dose consistency and accuracy over conventional X-ray models. This approach provides a robust framework for improving X-ray dosimetry, with potential applications in dose optimization, imaging quality enhancement, and radiation safety in both clinical and research settings.

[AI-127] AI Recommendations and Non-instrumental Image Concerns

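[Quick read]: This paper asks why people often underuse AI recommendations when collaborating with AI, and identifies non-instrumental image concerns as a key reason. Even when how others perceive them carries no monetary consequences, participants disregard AI advice out of concern for their image, which lowers task performance. The key is to identify this non-instrumental, perception-driven factor so that it can be addressed and human-AI collaboration made more effective.

Link: https://arxiv.org/abs/2504.19047
Authors: David Almog
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)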

Abstract:There is growing enthusiasm about the potential for humans and AI to collaborate by leveraging their respective strengths. Yet in practice, this promise often falls short. This paper uses an online experiment to identify non-instrumental image concerns as a key reason individuals underutilize AI recommendations. I show that concerns about how one is perceived, even when those perceptions carry no monetary consequences, lead participants to disregard AI advice and reduce task performance.

[AI-128] Enhancing Cochlear Implant Signal Coding with Scaled Dot-Product Attention

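[Quick read]: This paper addresses the limited adaptability and precision of traditional cochlear implant (CI) coding strategies, with the goal of improving hearing restoration. The key idea is to generate electrodograms with a deep learning (DL) model. Audio reconstructed from them is evaluated for intelligibility, and the model's short-time objective intelligibility (STOI) score comes close to that of the traditional advanced combination encoder (ACE) strategy while offering potential advantages in flexibility and adaptability.

Link: https://arxiv.org/abs/2504.19046
Authors: Billel Essaid, Hamza Kheddar, Noureddine Batel
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)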

Abstract:Cochlear implants (CIs) play a vital role in restoring hearing for individuals with severe to profound sensorineural hearing loss by directly stimulating the auditory nerve with electrical signals. While traditional coding strategies, such as the advanced combination encoder (ACE), have proven effective, they are constrained by their adaptability and precision. This paper investigates the use of deep learning (DL) techniques to generate electrodograms for CIs, presenting our model as an advanced alternative. We compared the performance of our model with the ACE strategy by evaluating the intelligibility of reconstructed audio signals using the short-time objective intelligibility (STOI) metric. The results indicate that our model achieves a STOI score of 0.6031, closely approximating the 0.6126 score of the ACE strategy, and offers potential advantages in flexibility and adaptability. This study underscores the benefits of incorporating artificial intelligence (AI) into CI technology, such as enhanced personalization and efficiency.

[AI-129] Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider

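[Quick read]: This paper addresses the high computational cost of traditional simulation frameworks such as Geant4 in experimental nuclear and particle physics, particularly for Cherenkov detectors, where simulating optical photon transport through complex geometries and reflective surfaces is a major bottleneck. The key contribution is an open, standalone fast simulation tool for Detection of Internally Reflected Cherenkov Light (DIRC) detectors, focused on the High-Performance DIRC (hpDIRC) at the future Electron-Ion Collider (EIC). The tool integrates generative models tailored to particle identification (PID) tasks and offers a scalable, GPU-accelerated alternative to full Geant4-based simulation, enabling efficient generation of large, high-fidelity datasets.

Link: https://arxiv.org/abs/2504.19042
Authors: James Giroux, Michael Martinez, Cristiano Fanelli
Affiliations: Unknown
Categories: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)
Comments: 45 pages, 27 figures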

Abstract:The integration of Deep Learning (DL) into experimental nuclear and particle physics has driven significant progress in simulation and reconstruction workflows. However, traditional simulation frameworks such as Geant4 remain computationally intensive, especially for Cherenkov detectors, where simulating optical photon transport through complex geometries and reflective surfaces introduces a major bottleneck. To address this, we present an open, standalone fast simulation tool for Detection of Internally Reflected Cherenkov Light (DIRC) detectors, with a focus on the High-Performance DIRC (hpDIRC) at the future Electron-Ion Collider (EIC). Our framework incorporates a suite of generative models tailored to accelerate particle identification (PID) tasks by offering a scalable, GPU-accelerated alternative to full Geant4-based simulations. Designed with accessibility in mind, our simulation package enables both DL researchers and physicists to efficiently generate high-fidelity large-scale datasets on demand, without relying on complex traditional simulation stacks. This flexibility supports the development and benchmarking of novel DL-driven PID methods. Moreover, this fast simulation pipeline represents a critical step toward enabling EIC-wide PID strategies that depend on virtually unlimited simulated samples, spanning the full acceptance of the hpDIRC.

[AI-130] Predicting Stress in Two-phase Random Materials and Super-Resolution Method for Stress Images by Embedding Physical Information

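[Quick read]: This paper addresses stress prediction errors caused by stress concentration at phase interfaces in two-phase random materials (TRMs) with complex microstructures. Existing deep learning methods are limited by the pixel count of the microstructure images and struggle to produce stress images with enough resolution to reveal stress concentration regions. The proposed framework combines a Multiple Compositions U-net (MC U-net) with a stress super-resolution method (SRPINN) based on a Mixed Physics-Informed Neural Network (MPINN). The MC U-net uses phase interface information to reduce prediction errors at phase boundaries, while SRPINN adds physical constraints so that stress images can be super-resolved without paired training images, enabling multiscale analysis of stress concentration regions at phase boundaries.

Link: https://arxiv.org/abs/2504.18854
Authors: Tengfei Xing, Xiaodan Ren, Jie Li
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)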

Abstract:Stress analysis is an important part of material design. For materials with complex microstructures, such as two-phase random materials (TRMs), material failure is often accompanied by stress concentration. Phase interfaces in two-phase materials are critical for stress concentration. Therefore, the prediction error of stress at phase boundaries is crucial. In practical engineering, the pixels of the obtained material microstructure images are limited, which limits the resolution of stress images generated by deep learning methods, making it difficult to observe stress concentration regions. Existing Image Super-Resolution (ISR) technologies are all based on data-driven supervised learning. However, stress images have natural physical constraints, which provide new ideas for new ISR technologies. In this study, we constructed a stress prediction framework for TRMs. First, the framework uses a proposed Multiple Compositions U-net (MC U-net) to predict stress in low-resolution material microstructures. By considering the phase interface information of the microstructure, the MC U-net effectively reduces the problem of excessive prediction errors at phase boundaries. Secondly, a Mixed Physics-Informed Neural Network (MPINN) based method for stress ISR (SRPINN) was proposed. By introducing the constraints of physical information, the new method does not require paired stress images for training and can increase the resolution of stress images to any multiple. This enables a multiscale analysis of the stress concentration regions at phase boundaries. Finally, we performed stress analysis on TRMs with different phase volume fractions and loading states through transfer learning. The results show the proposed stress prediction framework has satisfactory accuracy and generalization ability.

Machine Learning

[LG-0] Socially-Aware Autonomous Driving: Inferring Yielding Intentions for Safer Interactions

Link: https://arxiv.org/abs/2504.20004
Authors: Jing Wang, Yan Jin, Hamid Taghavifar, Fei Ding, Chongfeng Wei
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Since the emergence of autonomous driving technology, it has advanced rapidly over the past decade. It is becoming increasingly likely that autonomous vehicles (AVs) would soon coexist with human-driven vehicles (HVs) on the roads. Currently, safety and reliable decision-making remain significant challenges, particularly when AVs are navigating lane changes and interacting with surrounding HVs. Therefore, precise estimation of the intentions of surrounding HVs can assist AVs in making more reliable and safe lane change decision-making. This involves not only understanding their current behaviors but also predicting their future motions without any direct communication. However, distinguishing between the passing and yielding intentions of surrounding HVs still remains ambiguous. To address the challenge, we propose a social intention estimation algorithm rooted in Directed Acyclic Graph (DAG), coupled with a decision-making framework employing Deep Reinforcement Learning (DRL) algorithms. To evaluate the method’s performance, the proposed framework can be tested and applied in a lane-changing scenario within a simulated environment. Furthermore, the experiment results demonstrate how our approach enhances the ability of AVs to navigate lane changes safely and efficiently on roads.

[LG-1] Emergence and scaling laws in SGD learning of shallow neural networks

Link: https://arxiv.org/abs/2504.19983
Authors: Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 100 pages

Abstract:We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^{P} a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}_p^*\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_p a_p^2=1$. We focus on the challenging "extensive-width" regime $P\gg 1$ and permit diverging condition number in the second-layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
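
The teacher model in this abstract can be made concrete with a small sketch. Everything specific here is an assumption chosen for illustration: the signal directions are taken to be the first $P$ standard basis vectors (one valid orthonormal choice), the activation is the probabilists' Hermite polynomial $He_4$ (an even function whose lowest Hermite degree is 4), and the sizes `d`, `P`, `beta` are arbitrary.

```python
import math
import random


def he4(z: float) -> float:
    """Probabilists' Hermite polynomial He_4(z) = z^4 - 6 z^2 + 3 (an even function)."""
    return z**4 - 6.0 * z**2 + 3.0


def power_law_coefficients(P: int, beta: float) -> list:
    """a_p proportional to p^(-beta), rescaled so that sum_p a_p^2 = 1."""
    raw = [p ** (-beta) for p in range(1, P + 1)]
    norm = math.sqrt(sum(a * a for a in raw))
    return [a / norm for a in raw]


def teacher(x: list, a: list) -> float:
    """sum_p a_p * sigma(<x, v_p>) with v_p = e_p, so <x, v_p> is simply x[p]."""
    return sum(a_p * he4(x[p]) for p, a_p in enumerate(a))


d, P, beta = 64, 16, 1.0          # illustrative sizes, not taken from the paper
a = power_law_coefficients(P, beta)
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]   # one draw of x ~ N(0, I_d)

print("sum of a_p^2:", sum(v * v for v in a))
print("second-layer condition number a_1 / a_P:", a[0] / a[-1])
print("teacher output:", teacher(x, a))
```

With this power law the ratio $a_1 / a_P$ equals $P^{\beta}$, which is the diverging second-layer condition number the abstract refers to as $P$ grows.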

[LG-2] Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets

Link: https://arxiv.org/abs/2504.19981
Authors: Adam Younsi, Abdalgader Abubaker, Mohamed El Amine Seddik, Hakim Hacid, Salem Lahlou
Categories: Machine Learning (cs.LG)

Abstract:Achieving both accuracy and diverse reasoning remains challenging for Large Language Models (LLMs) in complex domains like mathematics. A key bottleneck is evaluating intermediate reasoning steps to guide generation without costly human annotations. To address this, we first introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a similarity-based data augmentation technique, effectively capturing step-level reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning step level. Unlike traditional reinforcement learning focused on maximizing a single reward, GFlowNets naturally sample diverse, high-quality solutions proportional to their rewards, as measured by our PRM. Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks (e.g., +2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective generalization to unseen datasets (+9.4% absolute on SAT MATH). Our work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.
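
The contrast the abstract draws between maximizing a single reward and sampling in proportion to reward can be seen in a toy example. The sketch below does not train a GFlowNet. It only draws from the target distribution such a sampler aims for, using three hypothetical candidate paths with made-up PRM scores.

```python
import random
from collections import Counter


def sample_proportional(rewards: dict, rng: random.Random) -> str:
    """Draw one candidate with probability reward / sum of all rewards."""
    names = list(rewards)
    return rng.choices(names, weights=[rewards[n] for n in names], k=1)[0]


# Hypothetical PRM scores for three candidate reasoning paths.
rewards = {"path_a": 0.6, "path_b": 0.3, "path_c": 0.1}

rng = random.Random(0)
draws = 10_000
counts = Counter(sample_proportional(rewards, rng) for _ in range(draws))

greedy = max(rewards, key=rewards.get)   # reward maximization always picks this one
print("greedy pick:", greedy)
print("sampled frequencies:", {k: counts[k] / draws for k in rewards})
```

Greedy selection returns the same path every time, while proportional sampling still surfaces lower-scored paths at a rate matching their reward, which is where the diversity comes from.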

[LG-3] Transfer Learning Under High-Dimensional Network Convolutional Regression Model

Link: https://arxiv.org/abs/2504.19979
Authors: Liyuan Wang, Jiachen Chen, Kathryn L. Lunetta, Danyang Huang, Huimin Cheng, Debarghya Mukherjee
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Transfer learning enhances model performance by utilizing knowledge from related domains, particularly when labeled data is scarce. While existing research addresses transfer learning under various distribution shifts in independent settings, handling dependencies in networked data remains challenging. To address this challenge, we propose a high-dimensional transfer learning framework based on network convolutional regression (NCR), inspired by the success of graph convolutional networks (GCNs). The NCR model incorporates random network structure by allowing each node’s response to depend on its features and the aggregated features of its neighbors, capturing local dependencies effectively. Our methodology includes a two-step transfer learning algorithm that addresses domain shift between source and target networks, along with a source detection mechanism to identify informative domains. Theoretically, we analyze the lasso estimator in the context of a random graph based on the Erdos-Renyi model assumption, demonstrating that transfer learning improves convergence rates when informative sources are present. Empirical evaluations, including simulations and a real-world application using Sina Weibo data, demonstrate substantial improvements in prediction accuracy, particularly when labeled data in the target domain is limited.
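
As a rough picture of the data-generating structure described here, the sketch below simulates responses that depend on a node's own features and on an aggregate of its neighbors' features over an Erdos-Renyi graph. The abstract does not specify the aggregation or the coefficients, so the neighbor mean, the names `beta` and `gamma`, the noise level, and all sizes are assumptions made for illustration.

```python
import random


def erdos_renyi(n: int, p: float, rng: random.Random) -> list:
    """Symmetric 0/1 adjacency matrix, each edge present independently with probability p."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return adj


def neighbor_mean(adj: list, X: list) -> list:
    """Average the feature vectors of each node's neighbors (zeros for isolated nodes)."""
    dims = len(X[0])
    out = []
    for row in adj:
        nbrs = [j for j, edge in enumerate(row) if edge]
        if nbrs:
            out.append([sum(X[j][k] for j in nbrs) / len(nbrs) for k in range(dims)])
        else:
            out.append([0.0] * dims)
    return out


def dot(u: list, v: list) -> float:
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
n, dims = 30, 3
adj = erdos_renyi(n, 0.2, rng)
X = [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(n)]
Z = neighbor_mean(adj, X)

# Hypothetical sparse coefficients for own features (beta) and neighbor features (gamma).
beta, gamma = [1.0, 0.0, -0.5], [0.5, 0.0, 0.0]
y = [dot(X[i], beta) + dot(Z[i], gamma) + rng.gauss(0.0, 0.1) for i in range(n)]
print(y[:3])
```

A lasso fit on the stacked design `[X, Z]` would then be the natural estimator to study, which is the setting the abstract's theory addresses.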

[LG-4] Robust Federated Personalised Mean Estimation for the Gaussian Mixture Model

Link: https://arxiv.org/abs/2504.19955
Authors: Malhar A. Managoli, Vinod M. Prabhakaran, Suhas Diggavi
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:Federated learning with heterogeneous data and personalization has received significant recent attention. Separately, robustness to corrupted data in the context of federated learning has also been studied. In this paper we explore combining personalization for heterogeneous data with robustness, where a constant fraction of the clients are corrupted. Motivated by this broad problem, we formulate a simple instantiation which captures some of its difficulty. We focus on the specific problem of personalized mean estimation where the data is drawn from a Gaussian mixture model. We give an algorithm whose error depends almost linearly on the ratio of corrupted to uncorrupted samples, and show a lower bound with the same behavior, albeit with a gap of a constant factor.

[LG-5] Accelerating Mixture-of-Experts Training with Adaptive Expert Replication

Link: https://arxiv.org/abs/2504.19925
Authors: Athinagoras Skiadopoulos, Mark Zhao, Swapnil Gandhi, Thomas Norrie, Shrijeet Mukherjee, Christos Kozyrakis
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:Mixture-of-Experts (MoE) models have become a widely adopted solution to continue scaling model sizes without a corresponding linear increase in compute. During MoE model training, each input token is dynamically routed to a subset of experts – sparsely-activated feed-forward networks – within each transformer layer. The distribution of tokens assigned to each expert varies widely and rapidly over the course of training. To handle the wide load imbalance across experts, current systems are forced to either drop tokens assigned to popular experts, degrading convergence, or frequently rebalance resources allocated to each expert based on popularity, incurring high state migration overheads. To break this performance-accuracy tradeoff, we introduce SwiftMoE, an adaptive MoE training system. The key insight of SwiftMoE is to decouple the placement of expert parameters from their large optimizer state. SwiftMoE statically partitions the optimizer of each expert across all training nodes. Meanwhile, SwiftMoE dynamically adjusts the placement of expert parameters by repurposing existing weight updates, avoiding migration overheads. In doing so, SwiftMoE right-sizes the GPU resources allocated to each expert, on a per-iteration basis, with minimal overheads. Compared to state-of-the-art MoE training systems, DeepSpeed and FlexMoE, SwiftMoE is able to achieve a 30.5% and 25.9% faster time-to-convergence, respectively.

[LG-6] Convergence Analysis of Asynchronous Federated Learning with Gradient Compression for Non-Convex Optimization

Link: https://arxiv.org/abs/2504.19903
Authors: Diying Yang, Yingwei Hou, Danyang Xiao, Weigang Wu
Categories: Machine Learning (cs.LG)

Abstract:Gradient compression is an effective technique for reducing communication costs in federated learning (FL), and error feedback (EF) is usually adopted to remedy the compression errors. However, there remains a lack of systematic study on these techniques in asynchronous FL. In this paper, we fill this gap by analyzing the convergence behaviors of FL under different frameworks. We firstly consider a basic asynchronous FL framework AsynFL, and provide an improved convergence analysis that relies on fewer assumptions and yields a superior convergence rate than prior studies. Then, we consider a variant framework with gradient compression, AsynFLC. We show sufficient conditions for its convergence to the optimum, indicating the interaction between asynchronous delay and compression rate. Our analysis also demonstrates that asynchronous delay amplifies the variance caused by compression, thereby hindering convergence, and such an impact is exacerbated by high data heterogeneity. Furthermore, we study the convergence of AsynFLC-EF, the framework that further integrates EF. We prove that EF can effectively reduce the variance of gradient estimation despite asynchronous delay, which enables AsynFLC-EF to match the convergence rate of AsynFL. We also show that the impact of asynchronous delay on EF is limited to slowing down the higher-order convergence term. Experimental results substantiate our analytical findings very well.
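
The error feedback mechanism mentioned in the abstract is easiest to see with a concrete compressor. The sketch below uses top-k sparsification, a common choice that the abstract does not name, and a hypothetical `ErrorFeedbackCompressor` class. It shows a single client and leaves out asynchrony entirely, so it illustrates EF itself rather than the paper's AsynFLC-EF framework.

```python
def top_k(vec: list, k: int) -> list:
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


class ErrorFeedbackCompressor:
    """Adds the previously dropped residual to each new gradient before compressing."""

    def __init__(self, dim: int, k: int) -> None:
        self.k = k
        self.residual = [0.0] * dim

    def compress(self, grad: list) -> list:
        corrected = [g + r for g, r in zip(grad, self.residual)]
        sent = top_k(corrected, self.k)
        # Whatever was not transmitted is remembered for the next round.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent


ef = ErrorFeedbackCompressor(dim=4, k=1)
grads = [[1.0, 0.6, 0.0, 0.0], [1.0, 0.6, 0.0, 0.0]]
sent_total = [0.0] * 4
for g in grads:
    sent = ef.compress(g)
    sent_total = [t + s for t, s in zip(sent_total, sent)]
    print("sent:", sent, "residual:", ef.residual)
```

In the second round the accumulated residual makes the second coordinate the largest, so it is finally transmitted. At every step, the sum of what was sent plus the current residual equals the sum of all gradients seen so far, which is why no gradient information is permanently lost.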

[LG-7] Hierarchical Uncertainty-Aware Graph Neural Network

Link: https://arxiv.org/abs/2504.19820
Authors: Yoonhyuk Choi, Chong-Kwon Kim
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Abstract:Recent research on graph neural networks (GNNs) has explored mechanisms for capturing local uncertainty and exploiting graph hierarchies to mitigate data sparsity and leverage structural properties. However, the synergistic integration of these two approaches remains underexplored. In this work, we introduce a novel architecture, the Hierarchical Uncertainty-Aware Graph Neural Network (HU-GNN), which unifies multi-scale representation learning, principled uncertainty estimation, and self-supervised embedding diversity within a single end-to-end framework. Specifically, HU-GNN adaptively forms node clusters and estimates uncertainty at multiple structural scales from individual nodes to higher levels. These uncertainty estimates guide a robust message-passing mechanism and attention weighting, effectively mitigating noise and adversarial perturbations while preserving predictive accuracy on both node- and graph-level tasks. We also offer key theoretical contributions, including a probabilistic formulation, rigorous uncertainty-calibration guarantees, and formal robustness bounds. Finally, by incorporating recent advances in graph contrastive learning, HU-GNN maintains diverse, structurally faithful embeddings. Extensive experiments on standard benchmarks demonstrate that our model achieves state-of-the-art robustness and interpretability.

[LG-8] Digital Twin-based Out-of-Distribution Detection in Autonomous Vessels

Link: https://arxiv.org/abs/2504.19816
Authors: Erblin Isaku, Hassan Sartaj, Shaukat Ali
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 34 pages, 12 figures, 11 tables

Abstract:An autonomous vessel (AV) is a complex cyber-physical system (CPS) with software enabling many key functionalities, e.g., navigation software enables an AV to autonomously or semi-autonomously follow a path to its destination. Digital twins of such AVs enable advanced functionalities such as running what-if scenarios, performing predictive maintenance, and enabling fault diagnosis. Due to technological improvements, real-time analyses using continuous data from vessels’ real-time operations have become increasingly possible. However, the literature has little explored developing advanced analyses in real-time data in AVs with digital twins built with machine learning techniques. To this end, we present a novel digital twin-based approach (ODDIT) to detect future out-of-distribution (OOD) states of an AV before reaching them, enabling proactive intervention. Such states may indicate anomalies requiring attention (e.g., manual correction by the ship master) and assist testers in scenario-centered testing. The digital twin consists of two machine-learning models predicting future vessel states and whether the predicted state will be OOD. We evaluated ODDIT with five vessels across waypoint and zigzag maneuvering under simulated conditions, including sensor and actuator noise and environmental disturbances i.e., ocean current. ODDIT achieved high accuracy in detecting OOD states, with AUROC and TNR@TPR95 scores reaching 99% across multiple vessels.
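
For readers unfamiliar with the TNR@TPR95 metric reported here, the sketch below shows one common way to compute it from detector scores. It assumes that OOD states are the positive class, that a higher score means "more likely OOD", and that a sample is flagged when its score is at or above the threshold. The paper may use a different convention, and the scores are made up.

```python
import math


def tnr_at_tpr(pos_scores: list, neg_scores: list, tpr: float = 0.95) -> float:
    """True negative rate at the highest threshold that still flags at least `tpr` of positives."""
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must be flagged; the epsilon guards against float round-off.
    needed = max(1, math.ceil(tpr * len(ranked) - 1e-12))
    threshold = ranked[needed - 1]
    true_negatives = sum(1 for s in neg_scores if s < threshold)
    return true_negatives / len(neg_scores)


# Hypothetical detector scores: positives are OOD states, negatives are in-distribution.
ood_scores = list(range(11, 31))   # 20 scores: 11..30
id_scores = list(range(1, 21))     # 20 scores: 1..20
print(tnr_at_tpr(ood_scores, id_scores))
```

In this toy case 19 of the 20 OOD scores must be flagged, which puts the threshold at 12, and 11 of the 20 in-distribution scores fall below it.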

[LG-9] Dynamic Tsetlin Machine Accelerators for On-Chip Training at the Edge using FPGAs

Link: https://arxiv.org/abs/2504.19797
Authors: Gang Mao, Tousif Rahman, Sidharth Maheshwari, Bob Pattison, Zhuang Shao, Rishad Shafik, Alex Yakovlev
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:The increased demand for data privacy and security in machine learning (ML) applications has put impetus on effective edge training on Internet-of-Things (IoT) nodes. Edge training aims to leverage speed, energy efficiency and adaptability within the resource constraints of the nodes. Deploying and training Deep Neural Networks (DNNs)-based models at the edge, although accurate, posit significant challenges from the back-propagation algorithm’s complexity, bit precision trade-offs, and heterogeneity of DNN layers. This paper presents a Dynamic Tsetlin Machine (DTM) training accelerator as an alternative to DNN implementations. DTM utilizes logic-based on-chip inference with finite-state automata-driven learning within the same Field Programmable Gate Array (FPGA) package. Underpinned on the Vanilla and Coalesced Tsetlin Machine algorithms, the dynamic aspect of the accelerator design allows for a run-time reconfiguration targeting different datasets, model architectures, and model sizes without resynthesis. This makes the DTM suitable for targeting multivariate sensor-based edge tasks. Compared to DNNs, DTM trains with fewer multiply-accumulates, devoid of derivative computation. It is a data-centric ML algorithm that learns by aligning Tsetlin automata with input data to form logical propositions enabling efficient Look-up-Table (LUT) mapping and frugal Block RAM usage in FPGA training implementations. The proposed accelerator offers 2.54x more Giga operations per second per Watt (GOP/s per W) and uses 6x less power than the next-best comparable design.

[LG-10] Heterophily-informed Message Passing

Link: https://arxiv.org/abs/2504.19785
Authors: Haishan Wang, Arno Solin, Vikas Garg
Categories: Machine Learning (cs.LG)
Comments: Appearing in Transactions on Machine Learning Research (TMLR) 2025

Abstract:Graph neural networks (GNNs) are known to be vulnerable to oversmoothing due to their implicit homophily assumption. We mitigate this problem with a novel scheme that regulates the aggregation of messages, modulating the type and extent of message passing locally thereby preserving both the low and high-frequency components of information. Our approach relies solely on learnt embeddings, obviating the need for auxiliary labels, thus extending the benefits of heterophily-aware embeddings to broader applications, e.g., generative modelling. Our experiments, conducted across various data sets and GNN architectures, demonstrate performance enhancements and reveal heterophily patterns across standard classification benchmarks. Furthermore, application to molecular generation showcases notable performance improvements on chemoinformatics benchmarks.

[LG-11] If Concept Bottlenecks are the Question, are Foundation Models the Answer?

Link: https://arxiv.org/abs/2504.19774
Authors: Nicola Debole, Pietro Barbiero, Francesco Giannini, Andrea Passeggini, Stefano Teso, Emanuele Marconato
Categories: Machine Learning (cs.LG)

Abstract:Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. CBMs work by first mapping inputs (e.g., images) to high-level concepts (e.g., visible objects and their properties) and then using these to solve a downstream task (e.g., tagging or scoring an image) in an interpretable manner. Their performance and interpretability, however, hinge on the quality of the concepts they learn. The go-to strategy for ensuring good quality concepts is to leverage expert annotations, which are expensive to collect and seldom available in applications. Researchers have recently addressed this issue by introducing “VLM-CBM” architectures that replace manual annotations with weak supervision from foundation models. It is, however, unclear how doing so affects the quality of the learned concepts. To answer this question, we put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics. Our results show that, depending on the task, VLM supervision can differ noticeably from expert annotations, and that concept accuracy and quality are not strongly correlated. Our code is available at this https URL.

[LG-12] FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs DATE2025

Link: https://arxiv.org/abs/2504.19746
Authors: Xilong Xie, Liang Wang, Limin Xiao, Meng Han, Lin Sun, Shuai Zheng, Xiangrong Xu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: DATE 2025

Abstract:Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce memory consumption of LLMs. However, advanced single-precision quantization methods experience significant accuracy degradation when quantizing to ultra-low bits. Existing mixed-precision quantization methods quantize weights in coarse-grained groups. Employing high precision for group data leads to substantial memory overhead, whereas low precision severely impacts model accuracy. To address this issue, we propose FineQ, a software-hardware co-design for low-bit fine-grained mixed-precision quantization of LLMs. First, FineQ partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters, thus achieving a balance between model accuracy and memory overhead. Then, we propose an outlier protection mechanism within clusters that uses 3 bits to represent outliers and introduce an encoding scheme for index and data concatenation to enable aligned memory access. Finally, we introduce an accelerator utilizing temporal coding that effectively supports the quantization algorithm while simplifying the multipliers in the systolic array. FineQ achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width. Meanwhile, the accelerator achieves up to 1.79x energy efficiency and reduces the area of the systolic array by 61.2%.

[LG-13] Graph Fourier Transformer with Structure-Frequency Information

Link: https://arxiv.org/abs/2504.19740
Authors: Yonghui Zhai, Yang Zhang, Minghao Shang, Lihua Pang, Yaxin Ren
Categories: Machine Learning (cs.LG); Graphics (cs.GR)

Abstract:Graph Transformers (GTs) have shown advantages in numerous graph structure tasks, but their self-attention mechanism ignores the generalization bias of graphs. Existing methods mainly compensate for this bias through position encoding, attention bias, and relative distance, yet they still perform sub-optimally and fall short because they consider generalization bias only from the structural perspective. To address this, this paper proposes Grafourierformer, which combines GT with an inductive bias containing frequency-structure information by applying the Graph Fourier Transform to the attention matrix. Specifically, eigenvalues of the graph Laplacian matrix are used to construct an eigenvalue matrix mask, which reflects node positions and structural relationships with neighboring nodes so that the model can account for structural characteristics over a node's range and focus on local graph details. The inverse Fourier transform is then employed to extract high-frequency and low-frequency node features, calculate the low-frequency and high-frequency energy, and construct a node frequency-energy matrix that filters the eigenvalue matrix mask. This allows attention heads to incorporate both graph structural information and node frequency information, adaptively distinguish global trends from local details, and effectively suppress interference from redundant information. Extensive experiments on various benchmarks show that Grafourierformer consistently outperforms GNN and GT-based models in graph classification and node classification tasks, with ablation experiments further validating the effectiveness and necessity of the method. Code is available at this https URL

[LG-14] Intelligent4DSE: Optimizing High-Level Synthesis Design Space Exploration with Graph Neural Networks and Large Language Models

Link: https://arxiv.org/abs/2504.19649
Authors: Lei Xu, Shanshan Wang, Emmanuel Casseau, Chenglong Xiao
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)

Abstract:High-level synthesis (HLS) design space exploration (DSE) is an optimization process in electronic design automation (EDA) that systematically explores high-level design configurations to achieve Pareto-optimal hardware implementations balancing performance, area, and power (PPA). To optimize this process, HLS prediction tasks often employ message-passing neural networks (MPNNs), leveraging complex architectures to achieve high accuracy. These predictors serve as evaluators in the DSE process, effectively bypassing the time-consuming estimations traditionally required by HLS tools. However, existing models often prioritize structural complexity and minimization of training loss, overlooking task-specific characteristics. Additionally, while evolutionary algorithms are widely used in DSE, they typically require extensive domain-specific knowledge to design effective crossover and mutation operators. To address these limitations, we propose CoGNNs-LLMEA, a framework that integrates a graph neural network with task-adaptive message passing and a large language model-enhanced evolutionary algorithm. As a predictive model, CoGNNs directly leverages intermediate representations generated from source code after compiler front-end processing, enabling prediction of quality of results (QoR) without invoking HLS tools. Due to its strong adaptability to tasks, CoGNNs can be tuned to predict post-HLS and post-implementation outcomes, effectively bridging the gap between high-level abstractions and physical implementation characteristics. CoGNNs achieves state-of-the-art prediction accuracy in post-HLS QoR prediction, reducing mean prediction errors by 2.8× for latency and 3.4× for resource utilization compared to baseline models.

[LG-15] A Unified Benchmark of Federated Learning with Kolmogorov-Arnold Networks for Medical Imaging

Link: https://arxiv.org/abs/2504.19639
Authors: Youngjoon Lee, Jinu Gong, Joonhyuk Kang
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 5 pages

Abstract:Federated Learning (FL) enables model training across decentralized devices without sharing raw data, thereby preserving privacy in sensitive domains like healthcare. In this paper, we evaluate Kolmogorov-Arnold Network (KAN) architectures against traditional MLPs across six state-of-the-art FL algorithms on a blood cell classification dataset. Notably, our experiments demonstrate that KAN can effectively replace MLP in federated environments, achieving superior performance with simpler architectures. Furthermore, we analyze the impact of key hyperparameters (grid size and network architecture) on KAN performance under varying degrees of Non-IID data distribution. Additionally, our ablation studies reveal that optimizing KAN width while maintaining minimal depth yields the best performance in federated settings. As a result, these findings establish KAN as a promising alternative for privacy-preserving medical imaging applications in distributed healthcare. To the best of our knowledge, this is the first comprehensive benchmark of KAN in FL settings for medical imaging tasks.

[LG-16] LODAP: On-Device Incremental Learning Via Lightweight Operations and Data Pruning

Link: https://arxiv.org/abs/2504.19638
Authors: Biqing Duan, Qing Wang, Di Liu, Wei Zhou, Zhenli He, Shengfa Miao
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)

Abstract:Incremental learning that learns new classes over time after the model’s deployment is becoming increasingly crucial, particularly for industrial edge systems, where it is difficult to communicate with a remote server to conduct computation-intensive learning and where more classes are expected to be learned after deployment on edge devices. In this paper, we propose LODAP, a new on-device incremental learning framework for edge systems. The key part of LODAP is a new module, namely the Efficient Incremental Module (EIM). EIM is composed of normal convolutions and lightweight operations. During incremental learning, EIM exploits some lightweight operations, called adapters, to effectively and efficiently learn features for new classes so that it can improve the accuracy of incremental learning while reducing model complexity as well as training overhead. The efficiency of LODAP is further enhanced by a data pruning strategy that significantly reduces the training data, thereby lowering the training overhead. We conducted extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets. Experimental results show that LODAP improves the accuracy by up to 4.32% over existing methods while reducing model complexity by around 50%. In addition, evaluations on real edge systems demonstrate its applicability for on-device machine learning. The code is available at this https URL.

[LG-17] Diffusion Stochastic Learning Over Adaptive Competing Networks

Link: https://arxiv.org/abs/2504.19635
Authors: Yike Zhao, Haoyuan Cai, Ali H. Sayed
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:This paper studies a stochastic dynamic game between two competing teams, each consisting of a network of collaborating agents. Unlike fully cooperative settings, where all agents share a common objective, each team in this game aims to minimize its own distinct objective. In the adversarial setting, their objectives could be conflicting as in zero-sum games. Throughout the competition, agents share strategic information within their own team while simultaneously inferring and adapting to the strategies of the opposing team. We propose diffusion learning algorithms to address two important classes of this network game: i) a zero-sum game characterized by weak cross-team subgraph interactions, and ii) a general non-zero-sum game exhibiting strong cross-team subgraph interactions. We analyze the stability performance of the proposed algorithms under reasonable assumptions and illustrate the theoretical results through experiments on Cournot team competition and decentralized GAN training.

[LG-18] Rulebook: bringing co-routines to reinforcement learning environments

Link: https://arxiv.org/abs/2504.19625
Authors: Massimo Fioravanti, Samuele Pasini, Giovanni Agosta
Categories: Programming Languages (cs.PL); Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) algorithms, due to their reliance on external systems to learn from, require digital environments (e.g., simulators) with very simple interfaces, which in turn significantly constrain the implementation of such environments. In particular, these environments are implemented either as separate processes or as state machines, leading to synchronization and communication overheads in the first case, and to unstructured programming in the second. We propose a new domain-specific, co-routine-based, compiled language, called Rulebook, designed to automatically generate the state machine required to interact with machine learning (ML) algorithms and similar applications, with no performance overhead. Rulebook allows users to express programs without needing to be aware of the specific interface required by the ML components. By decoupling the execution model of the program from the syntactical encoding of the program, and thus without the need for manual state management, Rulebook allows the creation of larger and more sophisticated environments at a lower development cost.

[LG-19] AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis

Link: https://arxiv.org/abs/2504.19621
Authors: Haroui Ma, Francesco Quinzan, Theresa Willem, Stefan Bauer
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)

Abstract:Machine learning (ML) systems for medical imaging have demonstrated remarkable diagnostic capabilities, but their susceptibility to biases poses significant risks, since biases may negatively impact generalization performance. In this paper, we introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. Our method leverages the concept of counterfactual invariance, measuring the extent to which a model’s predictions remain unchanged under hypothetical changes to sensitive attributes. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases without requiring direct access to counterfactual data. Through experiments on synthetic datasets and large-scale real-world medical imaging datasets, including CheXpert and MIMIC-CXR, we demonstrate that our approach aligns closely with counterfactual fairness principles and outperforms standard baselines. This work provides a robust tool to ensure that ML diagnostic systems generalize well, e.g., across demographic groups, offering a critical step towards AI safety in healthcare. Code: this https URL.

[LG-20] Soft-Label Caching and Sharpening for Communication-Efficient Federated Distillation

Link: https://arxiv.org/abs/2504.19602
Authors: Kitsuya Azuma, Takayuki Nishio, Yuichi Kitagawa, Wakako Nakano, Takahito Tanimura
Categories: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) enables collaborative model training across decentralized clients, enhancing privacy by keeping data local. Yet conventional FL, relying on frequent parameter-sharing, suffers from high communication overhead and limited model heterogeneity. Distillation-based FL approaches address these issues by sharing predictions (soft-labels) instead, but they often involve redundant transmissions across communication rounds, reducing efficiency. We propose SCARLET, a novel framework integrating synchronized soft-label caching and an enhanced Entropy Reduction Aggregation (Enhanced ERA) mechanism. SCARLET minimizes redundant communication by reusing cached soft-labels, achieving up to 50% reduction in communication costs compared to existing methods while maintaining accuracy. Enhanced ERA can be tuned to adapt to non-IID data variations, ensuring robust aggregation and performance in diverse client scenarios. Experimental evaluations demonstrate that SCARLET consistently outperforms state-of-the-art distillation-based FL methods in terms of accuracy and communication efficiency. The implementation of SCARLET is publicly available at this https URL.
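The two ideas named in the abstract, reusing cached soft-labels and lowering the entropy of aggregated predictions, can be illustrated with a small sketch. This is not the SCARLET protocol or its Enhanced ERA mechanism, whose details the abstract does not give. It assumes a generic power-based sharpening rule (raise each probability to 1/T and renormalize) and a hypothetical `SoftLabelCache` class keyed by a public sample id; all names here are illustrative.

```python
import math


def average_soft_labels(client_probs):
    """Element-wise mean of per-client probability vectors for one sample."""
    n_clients = len(client_probs)
    n_classes = len(client_probs[0])
    return [sum(p[k] for p in client_probs) / n_clients for k in range(n_classes)]


def sharpen(probs, temperature):
    """Assumed sharpening rule: p_k^(1/T), renormalized.

    For temperature < 1 this concentrates mass on the largest entries.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [v / total for v in powered]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class SoftLabelCache:
    """Hypothetical cache: compute a soft-label once per sample id, then reuse it."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, sample_id, compute):
        # Only call `compute` (standing in for a transmission) on a cache miss.
        if sample_id not in self._store:
            self._store[sample_id] = compute()
        return self._store[sample_id]


if __name__ == "__main__":
    avg = average_soft_labels([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]])
    cache = SoftLabelCache()
    label = cache.get_or_compute("public-sample-0", lambda: sharpen(avg, 0.5))
    print(label, entropy(avg), entropy(label))
```

In this toy setup a second request for the same sample id returns the stored vector without recomputation, which is the kind of redundancy a caching scheme aims to avoid across rounds.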

[LG-21] Quantifying Memory Utilization with Effective State-Size

Link: https://arxiv.org/abs/2504.19561
Authors: Rom N. Parnichkun, Neehal Tumma, Armin W. Thomas, Alessandro Moro, Qi An, Taiji Suzuki, Atsushi Yamashita, Michael Poli, Stefano Massaroli
Categories: Machine Learning (cs.LG)

Abstract:The need to develop a general framework for architecture analysis is becoming increasingly important, given the expanding design space of sequence models. To this end, we draw insights from classical signal processing and control theory, to develop a quantitative measure of memory utilization: the internal mechanisms through which a model stores past information to produce future outputs. This metric, which we call effective state-size (ESS), is tailored to the fundamental class of systems with input-invariant and input-varying linear operators, encompassing a variety of computational units such as variants of attention, convolutions, and recurrences. Unlike prior work on memory utilization, which either relies on raw operator visualizations (e.g. attention maps), or simply the total memory capacity (i.e. cache size) of a model, our metrics provide highly interpretable and actionable measurements. In particular, we show how ESS can be leveraged to improve initialization strategies, inform novel regularizers and advance the performance-efficiency frontier through model distillation. Furthermore, we demonstrate that the effect of context delimiters (such as end-of-speech tokens) on ESS highlights cross-architectural differences in how large language models utilize their available memory to recall information. Overall, we find that ESS provides valuable insights into the dynamics that dictate memory utilization, enabling the design of more efficient and effective sequence models.

[LG-22] Towards Faster and More Compact Foundation Models for Molecular Property Prediction

Link: https://arxiv.org/abs/2504.19538
Authors: Yasir Ghunaim, Andrés Villa, Gergo Ignacz, Gyorgy Szekely, Motasem Alfarra, Bernard Ghanem
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Abstract:Advancements in machine learning for molecular property prediction have improved accuracy but at the expense of higher computational cost and longer training times. Recently, the Joint Multi-domain Pre-training (JMP) foundation model has demonstrated strong performance across various downstream tasks with reduced training time over previous models. Despite JMP’s advantages, fine-tuning it on molecular datasets ranging from small-scale to large-scale requires considerable time and computational resources. In this work, we investigate strategies to enhance efficiency by reducing model size while preserving performance. To better understand the model’s efficiency, we analyze the layer contributions of JMP and find that later interaction blocks provide diminishing returns, suggesting an opportunity for model compression. We explore block reduction strategies by pruning the pre-trained model and evaluating its impact on efficiency and accuracy during fine-tuning. Our analysis reveals that removing two interaction blocks results in a minimal performance drop, reducing the model size by 32% while increasing inference throughput by 1.3x. These results suggest that JMP-L is over-parameterized and that a smaller, more efficient variant can achieve comparable performance with lower computational cost. Our study provides insights for developing lighter, faster, and more scalable foundation models for molecular and materials discovery. The code is publicly available at: this https URL.

[LG-23] Euclidean Distance Matrix Completion via Asymmetric Projected Gradient Descent

Link: https://arxiv.org/abs/2504.19530
Authors: Yicheng Li, Xinghua Sun
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)

Abstract:This paper proposes and analyzes a gradient-type algorithm based on Burer-Monteiro factorization, called the Asymmetric Projected Gradient Descent (APGD), for reconstructing the point set configuration from partial Euclidean distance measurements, known as the Euclidean Distance Matrix Completion (EDMC) problem. By paralleling the incoherence matrix completion framework, we show for the first time that a global convergence guarantee with exact recovery can be established for this routine given \mathcal{O}(\mu^2 r^3 \kappa^2 n \log n) Bernoulli random observations without any sample splitting. Rather than leveraging the tangent space Restricted Isometry Property (RIP) and local curvature of the low-rank embedding manifold, as in some very recent works, our proof provides new upper bounds to replace the random graph lemma under the EDMC setting. APGD works surprisingly well: numerical experiments demonstrate exact linear convergence behavior in sample-rich regimes. When the sample size is limited, however, its performance deteriorates quickly compared with that obtained by optimizing the s-stress function, i.e., the standard but unexplained non-convex approach for EDMC. While virtually matching our theoretical prediction, this unusual phenomenon might indicate that: (i) the power of implicit regularization is weakened when specified in the APGD case; (ii) the stabilization of such a new gradient direction requires substantially more samples than the information-theoretic limit would suggest.
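For readers unfamiliar with the EDMC setting, the sketch below builds the objects the abstract refers to; it does not implement APGD. It relies on the standard identity D_ij = G_ii + G_jj - 2 G_ij linking the squared distance matrix D to the Gram matrix G of the points, and models "Bernoulli random observations" as keeping each off-diagonal pair independently with probability p. The function names and the `seed` parameter are illustrative assumptions.

```python
import random


def gram(points):
    """Gram matrix G with G[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


def squared_edm(points):
    """Squared Euclidean distance matrix via D_ij = G_ii + G_jj - 2 G_ij."""
    g = gram(points)
    n = len(points)
    return [[g[i][i] + g[j][j] - 2 * g[i][j] for j in range(n)] for i in range(n)]


def bernoulli_mask(n, p, seed=0):
    """Symmetric observation mask: each pair i < j is observed with probability p."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                mask[i][j] = mask[j][i] = True
    return mask


if __name__ == "__main__":
    pts = [(0, 0), (3, 4), (6, 8)]
    d = squared_edm(pts)
    observed = bernoulli_mask(len(pts), p=0.5)
    # A completion algorithm would see only d[i][j] where observed[i][j] is True.
    print(d, observed)
```

Because the Gram matrix of points in r dimensions has rank at most r, recovering the points from partial distances can be posed as a low-rank completion problem, which is the connection to matrix completion that the abstract draws on.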

[LG-24] Identification and Estimation of Long-Term Treatment Effects with Monotone Missing

Link: https://arxiv.org/abs/2504.19527
Authors: Qinwei Yang, Ruocheng Guo, Shasha Han, Peng Wu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Estimating long-term treatment effects has a wide range of applications in various domains. A key feature in this context is that collecting long-term outcomes typically involves a multi-stage process and is subject to monotone missing, where individuals missing at an earlier stage remain missing at subsequent stages. Despite its prevalence, monotone missing has been rarely explored in previous studies on estimating long-term treatment effects. In this paper, we address this gap by introducing the sequential missingness assumption for identification. We propose three novel estimation methods, including inverse probability weighting, sequential regression imputation, and sequential marginal structural model (SeqMSM). Considering that the SeqMSM method may suffer from high variance due to severe data sparsity caused by monotone missing, we further propose a novel balancing-enhanced approach, BalanceNet, to improve the stability and accuracy of the estimation methods. Extensive experiments on two widely used benchmark datasets demonstrate the effectiveness of our proposed methods.

[LG-25] Negative Imaginary Neural ODEs: Learning to Control Mechanical Systems with Stability Guarantees

Link: https://arxiv.org/abs/2504.19497
Authors: Kanghong Shi, Ruigang Wang, Ian R. Manchester
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We propose a neural control method to provide guaranteed stabilization for mechanical systems using a novel negative imaginary neural ordinary differential equation (NINODE) controller. Specifically, we employ neural networks with desired properties as state-space function matrices within a Hamiltonian framework to ensure the system possesses the NI property. This NINODE system can serve as a controller that asymptotically stabilizes an NI plant under certain conditions. For mechanical plants with colocated force actuators and position sensors, we demonstrate that all the conditions required for stability can be translated into regularity constraints on the neural networks used in the controller. We illustrate the utility, effectiveness, and stability guarantees of the NINODE controller through an example involving a nonlinear mass-spring system.

[LG-26] Stability Enhancement in Reinforcement Learning via Adaptive Control Lyapunov Function

Link: https://arxiv.org/abs/2504.19473
Authors: Donghe Chen, Han Wang, Lin Cheng, Shengping Gong
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 10 pages, 8 figures

Abstract:Reinforcement Learning (RL) has shown promise in control tasks but faces significant challenges in real-world applications, primarily due to the absence of safety guarantees during the learning process. Existing methods often struggle with ensuring safe exploration, leading to potential system failures and restricting applications primarily to simulated environments. Traditional approaches such as reward shaping and constrained policy optimization can fail to guarantee safety during initial learning stages, while model-based methods using Control Lyapunov Functions (CLFs) or Control Barrier Functions (CBFs) may hinder efficient exploration and performance. To address these limitations, this paper introduces Soft Actor-Critic with Control Lyapunov Function (SAC-CLF), a framework that enhances stability and safety through three key innovations: (1) a task-specific CLF design method for safe and optimal performance; (2) dynamic adjustment of constraints to maintain robustness under unmodeled dynamics; and (3) improved control input smoothness while ensuring safety. Experimental results on a classical nonlinear system and satellite attitude control demonstrate the effectiveness of SAC-CLF in overcoming the shortcomings of existing methods.

[LG-27] Geometry-Informed Neural Operator Transformer

Link: https://arxiv.org/abs/2504.19452
Authors: Qibang Liu, Vincient Zhong, Hadi Meidani, Diab Abueidda, Seid Koric, Philippe Geubelle
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Machine-learning-based surrogate models offer significant computational efficiency and faster simulations compared to traditional numerical methods, especially for problems requiring repeated evaluations of partial differential equations. This work introduces the Geometry-Informed Neural Operator Transformer (GINOT), which integrates the transformer architecture with the neural operator framework to enable forward predictions for arbitrary geometries. GINOT encodes the surface point cloud of a geometry using a sampling and grouping mechanism combined with an attention mechanism, ensuring invariance to point order and padding while maintaining robustness to variations in point density. The geometry information is seamlessly integrated with query points in the solution decoder through the attention mechanism. The performance of GINOT is validated on multiple challenging datasets, showcasing its high accuracy and strong generalization capabilities for complex and arbitrary 2D and 3D geometries.

[LG-28] R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference ICLR2025

Link: https://arxiv.org/abs/2504.19449
Authors: Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, Steven Li
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Large Language Models (LLMs), while demonstrating remarkable capabilities across various applications, present significant challenges during inference due to their substantial model size, especially when deployed on edge devices. Activation sparsity offers a promising solution to reduce computation and memory movement, enabling more efficient inference, particularly for small-batch on-device applications. However, current approaches face limitations with non-ReLU activation functions, which are foundational to most advanced LLMs, or require heavy continual training. Additionally, the difficulty in predicting active channels and limited achievable sparsity ratios constrain the effectiveness of activation sparsity-based methods. In this paper, we introduce R-Sparse, a training-free activation sparsity approach capable of achieving high sparsity levels in advanced LLMs. We conducted two preliminary investigations into how different components contribute to the output within a single linear layer and found two key observations: (i) the non-sparse components of the input function can be regarded as a few bias terms, and (ii) the full computation can be effectively approximated by an appropriate combination of input channels and weight singular values. Building on this, we replace the linear layers in LLMs with a rank-aware sparse inference method that leverages the sparsity of input channels and singular value components, eliminating the need for active channel prediction required by output sparsity-based approaches. Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity, resulting in a significant 43% end-to-end efficiency improvement with customized kernels.

[LG-29] Learning High-dimensional Gaussians from Censored Data

Link: https://arxiv.org/abs/2504.19446
Authors: Arnab Bhattacharyya, Constantinos Daskalakis, Themis Gouleakis, Yuhao Wang
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where in each sample, some of the variable values are missing. We suppose that the variables are missing not at random (MNAR). The missingness model, denoted by S(y), is the function that maps any point y \in R^d to the subsets of its coordinates that are seen. In this work, we assume that it is known. We study the following two settings: (i) Self-censoring: An observation x is generated by first sampling the true value y from a d-dimensional Gaussian N(\mu*, \Sigma*) with unknown \mu* and \Sigma*. For each coordinate i, there exists a set S_i \subseteq R^d such that x_i = y_i if and only if y_i \in S_i. Otherwise, x_i is missing and takes a generic value (e.g., “?”). We design an algorithm that learns N(\mu*, \Sigma*) up to total variation (TV) distance \epsilon, using poly(d, 1/\epsilon) samples, assuming only that each pair of coordinates is observed with sufficiently high probability. (ii) Linear thresholding: An observation x is generated by first sampling y from a d-dimensional Gaussian N(\mu*, \Sigma) with unknown \mu* and known \Sigma, and then applying the missingness model S where S(y) = \{i \in [d] : v_i^T y = b_i\} for some v_1, …, v_d \in R^d and b_1, …, b_d \in R. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates and that any small subset of coordinates is observed with sufficiently high probability.

[LG-30] Graph-based Semi-supervised and Unsupervised Methods for Local Clustering

Link: https://arxiv.org/abs/2504.19419
Authors: Zhaiming Shen, Sung Ha Kang
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Local clustering aims to identify specific substructures within a large graph without requiring full knowledge of the entire graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data points are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state-of-the-art results in the low-label-rate regime.

[LG-31] Observational Learning with a Budget

Link: https://arxiv.org/abs/2504.19396
Authors: Shuo Wu, Pawan Poojary, Randall Berry
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Submitted to ISIT 2025 Conference, 11 pages, 4 figures

Abstract:We consider a model of Bayesian observational learning in which a sequence of agents receives a private signal about an underlying binary state of the world. Each agent makes a decision based on its own signal and its observations of previous agents. A central planner seeks to improve the accuracy of these signals by allocating a limited budget to enhance signal quality across agents. We formulate and analyze the budget allocation problem and propose two optimal allocation strategies. At least one of these strategies is shown to maximize the probability of achieving a correct information cascade.

[LG-32] Bi-directional Model Cascading with Proxy Confidence

Link: https://arxiv.org/abs/2504.19391
Authors: David Warren, Mark Dras
Categories: Machine Learning (cs.LG)

Abstract:Model Cascading, recently applied successfully to LLMs, is a simple but powerful technique that improves the efficiency of inference by selectively applying models of varying sizes. Models are used in sequence from smallest to largest, only deferring samples to large, costly models when smaller models are not sufficiently confident. Existing approaches to deferral use only limited small model confidence estimates because of the inaccessibility of the large model, although large model confidence is known to be important. We therefore propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously through the use of a proxy for the large model. This requires a richer representation of model confidence to enable comparative calibration: we use an analysis of hidden states to improve post-invocation confidence of the small model, which in itself improves cascading results over prior approaches. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model. We examine the proposed cascading system over challenging, multiple-choice datasets, finding improvements over standard cascading baselines reflected in reductions in deferrals to more costly models.
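
The deferral idea described in this abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `cascade` function, the `Model` and `Proxy` types, the `margin` parameter, and the rule of comparing the two confidence scores directly are all assumptions made for the example. The abstract does not say how the confidences are calibrated or combined.

```python
from typing import Callable, Tuple

# Hypothetical interfaces: a model maps an input to (answer, confidence in [0, 1]);
# a proxy cheaply estimates the large model's confidence without invoking it.
Model = Callable[[str], Tuple[str, float]]
Proxy = Callable[[str], float]


def cascade(x: str, small: Model, large: Model, proxy: Proxy,
            margin: float = 0.0) -> Tuple[str, bool]:
    """Return (answer, deferred).

    Defer to the large model only when the proxy's estimate of the large
    model's confidence exceeds the small model's confidence by more than
    `margin` (an assumed decision rule).
    """
    answer, small_conf = small(x)      # post-invocation confidence of the small model
    est_large_conf = proxy(x)          # pre-invocation estimate for the large model
    if est_large_conf - small_conf > margin:
        return large(x)[0], True
    return answer, False
```

A plain one-directional cascade would instead compare `small_conf` with a fixed threshold; the comparison above uses both sides, which is the "bi-directional" aspect the abstract refers to.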

[LG-33] HyperController: A Hyperparameter Controller for Fast and Stable Training of Reinforcement Learning Neural Networks

Link: https://arxiv.org/abs/2504.19382
Authors: Jonathan Gornet, Yiannis Kantaros, Bruno Sinopoli
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We introduce Hyperparameter Controller (HyperController), a computationally efficient algorithm for hyperparameter optimization during training of reinforcement learning neural networks. HyperController optimizes hyperparameters quickly while also maintaining improvement of the reinforcement learning neural network, resulting in faster training and deployment. It achieves this by modeling the hyperparameter optimization problem as an unknown Linear Gaussian Dynamical System, which is a system with a state that linearly changes. It then learns an efficient representation of the hyperparameter objective function using the Kalman filter, which is the optimal one-step predictor for a Linear Gaussian Dynamical System. To demonstrate the performance of HyperController, it is applied as a hyperparameter optimizer during training of reinforcement learning neural networks on a variety of OpenAI Gymnasium environments. In four out of the five Gymnasium environments, HyperController achieves highest median reward during evaluation compared to other algorithms. The results exhibit the potential of HyperController for efficient and stable training of reinforcement learning neural networks.
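
For readers unfamiliar with the Kalman filter as a one-step predictor, a minimal scalar sketch is given below. It assumes a known system x_{t+1} = a x_t + w_t, y_t = c x_t + v_t with noise variances q and r, whereas the abstract describes an unknown system that HyperController must learn. The function name and all parameters are hypothetical and are not taken from the paper.

```python
def kalman_one_step_predictions(observations, a, c, q, r, x0=0.0, p0=1.0):
    """Return the predicted next observation after each measurement.

    Assumed scalar model: x_{t+1} = a * x_t + w_t,  w_t ~ N(0, q)
                          y_t     = c * x_t + v_t,  v_t ~ N(0, r)
    """
    x_pred, p_pred = x0, p0            # prior mean and variance of the state
    predictions = []
    for y in observations:
        # Measurement update: correct the prediction with the new observation.
        gain = p_pred * c / (c * c * p_pred + r)
        x_filt = x_pred + gain * (y - c * x_pred)
        p_filt = (1.0 - gain * c) * p_pred
        # Time update: propagate one step ahead.
        x_pred = a * x_filt
        p_pred = a * a * p_filt + q
        predictions.append(c * x_pred)  # one-step-ahead prediction of y
    return predictions
```

In a hyperparameter setting one could imagine `observations` being noisy rewards for a given configuration, but that mapping is an assumption here, not something the abstract specifies.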

[LG-34] O(1/k) Finite-Time Bound for Non-Linear Two-Time-Scale Stochastic Approximation

Link: https://arxiv.org/abs/2504.19375
Authors: Siddharth Chandak
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: Submitted to IEEE Transactions on Automatic Control

Abstract:Two-time-scale stochastic approximation is an algorithm with coupled iterations which has found broad applications in reinforcement learning, optimization and game control. While several prior works have obtained a mean square error bound of O(1/k) for linear two-time-scale iterations, the best known bound in the non-linear contractive setting has been O(1/k^{2/3}). In this work, we obtain an improved bound of O(1/k) for non-linear two-time-scale stochastic approximation. Our result applies to algorithms such as gradient descent-ascent and two-time-scale Lagrangian optimization. The key step in our analysis involves rewriting the original iteration in terms of an averaged noise sequence which decays sufficiently fast. Additionally, we use an induction-based approach to show that the iterates are bounded in expectation.
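
As a concrete picture of coupled iterations on two time scales, here is a small sketch of noisy gradient descent-ascent on the toy saddle function f(x, y) = 0.5 x^2 + x y - 0.5 y^2, whose saddle point is (0, 0). The objective, the step-size schedules, and the noise level are assumptions chosen for illustration; they are not the schedules analyzed in the paper.

```python
import random


def two_time_scale_gda(num_steps=20000, noise_std=0.1, seed=0):
    """Noisy gradient descent-ascent with a slow and a fast step size.

    Toy objective (assumed): f(x, y) = 0.5 * x**2 + x * y - 0.5 * y**2,
    minimized over x and maximized over y.
    """
    rng = random.Random(seed)
    x, y = 1.0, -1.0
    for k in range(num_steps):
        alpha = 1.0 / (k + 10)               # slow step size (x iterate)
        beta = 1.0 / (k + 10) ** (2.0 / 3)   # fast step size (y iterate)
        grad_x = x + y + rng.gauss(0.0, noise_std)   # noisy df/dx
        grad_y = x - y + rng.gauss(0.0, noise_std)   # noisy df/dy
        x, y = x - alpha * grad_x, y + beta * grad_y
    return x, y
```

Because `beta` decays more slowly than `alpha`, the y iterate tracks its best response to the current x, while x drifts slowly toward the saddle point.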

[LG-35] Ethical Challenges of Using Artificial Intelligence in Judiciary

Link: https://arxiv.org/abs/2504.19284
Authors: Angel Mary John, Aiswarya M. U., Jerrin Thomas Panachakel
Categories: Machine Learning (cs.LG)
Comments: 2023 IEEE MetroXRAINE 2023

Abstract:Artificial intelligence (AI) has emerged as a ubiquitous concept in numerous domains, including the legal system. AI has the potential to revolutionize the functioning of the judiciary and the dispensation of justice. Incorporating AI into the legal system offers the prospect of enhancing decision-making for judges, lawyers, and legal professionals, while concurrently providing the public with more streamlined, efficient, and cost-effective services. The integration of AI into the legal landscape offers manifold benefits, encompassing tasks such as document review, legal research, contract analysis, case prediction, and decision-making. By automating laborious and error-prone procedures, AI has the capacity to alleviate the burden associated with these arduous tasks. Consequently, courts around the world have begun embracing AI technology as a means to enhance the administration of justice. However, alongside its potential advantages, the use of AI in the judiciary poses a range of ethical challenges. These ethical quandaries must be duly addressed to ensure the responsible and equitable deployment of AI systems. This article delineates the principal ethical challenges entailed in employing AI within the judiciary and provides recommendations to effectively address these issues.

[LG-36] Navigating AI Policy Landscapes: Insights into Human Rights Considerations Across IEEE Regions

Link: https://arxiv.org/abs/2504.19264
Authors: Angel Mary John, Jerrin Thomas Panachakel, Anusha S.P
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 2024 IEEE 12th Region 10 Humanitarian Technology Conference (R10-HTC). IEEE, 2024

Abstract:This paper explores the integration of human rights considerations into AI regulatory frameworks across different IEEE regions - specifically the United States (Region 1-6), Europe (Region 8), China (part of Region 10), and Singapore (part of Region 10). While all acknowledge the transformative potential of AI and the necessity of ethical guidelines, their regulatory approaches significantly differ. Europe exhibits a rigorous framework with stringent protections for individual rights, while the U.S. promotes innovation with less restrictive regulations. China emphasizes state control and societal order in its AI strategies. In contrast, Singapore’s advisory framework encourages self-regulation and aligns closely with international norms. This comparative analysis underlines the need for ongoing global dialogue to harmonize AI regulations that safeguard human rights while promoting technological advancement, reflecting the diverse perspectives and priorities of each region.

[LG-37] Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

Link: https://arxiv.org/abs/2504.19259
Authors: Adwait Datar, Nihat Ay
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry - the exponential family (\theta coordinates) and the mixture family (\eta coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the parameter space. In continuous time, we prove that the convergence rates of GD in the \theta and \eta coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in \eta and \theta coordinates can be scaled to 2c and \frac{2}{c}, respectively, for any c > 0, while NGD maintains a fixed convergence rate of 2, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.

[LG-38] HetGL2R: Learning to Rank Critical Road Segments via Attributed Heterogeneous Graph Random Walks

Link: https://arxiv.org/abs/2504.19199
Authors: Ming Xu, Jinrong Xiang, Zilong Xie, Xiangfu Meng
Categories: Machine Learning (cs.LG)

Abstract:Accurately identifying critical nodes with high spatial influence in road networks is essential for enhancing the efficiency of traffic management and urban planning. However, existing node importance ranking methods mainly rely on structural features and topological information, often overlooking critical factors such as origin-destination (OD) demand and route information. This limitation leaves considerable room for improvement in ranking accuracy. To address this issue, we propose HetGL2R, an attributed heterogeneous graph learning approach for ranking node importance in road networks. This method introduces a tripartite graph (trip graph) to model the structure of the road network, integrating OD demand, route choice, and various structural features of road segments. Based on the trip graph, we design an embedding method to learn node representations that reflect the spatial influence of road segments. The method consists of a heterogeneous random walk sampling algorithm (HetGWalk) and a Transformer encoder. HetGWalk constructs multiple attribute-guided graphs based on the trip graph to enrich the diversity of semantic associations between nodes. It then applies a joint random walk mechanism to convert both topological structures and node attributes into sequences, enabling the encoder to capture spatial dependencies more effectively among road segments. Finally, a listwise ranking strategy is employed to evaluate node importance. To validate the performance of our method, we construct two synthetic datasets using SUMO based on simulated road networks. Experimental results demonstrate that HetGL2R significantly outperforms baselines in incorporating OD demand and route choice information, achieving more accurate and robust node ranking. Furthermore, we conduct a case study using real-world taxi trajectory data from Beijing, further verifying the practicality of the proposed method.

[LG-39] Newton-Puiseux Analysis for Interpretability and Calibration of Complex-Valued Neural Networks

Link: https://arxiv.org/abs/2504.19176
Authors: Piotr Migus
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Complex-valued neural networks (CVNNs) excel where phase matters, yet their multi-sheeted decision surfaces defy standard explainability and calibration tools. We propose a Newton-Puiseux framework that fits a local polynomial surrogate to a high-uncertainty input and analytically decomposes this surrogate into fractional-power series. The resulting Puiseux expansions, dominant Puiseux coefficients, and phase-aligned curvature descriptors deliver closed-form estimates of robustness and over-confidence that gradient- or perturbation-based methods (saliency, LIME, SHAP) cannot provide. On a controlled \mathbb{C}^2 helix the surrogate attains RMSE 0.09 while recovering the number of decision sheets; quartic coefficients predict adversarial flip radii within 10^{-3}. On the real-world MIT-BIH arrhythmia corpus, Puiseux-guided, phase-aware temperature scaling lowers expected calibration error from 0.087 to 0.034, contributing to the advancement of CVNNs. Full code, pre-trained weights, and scripts are at this https URL.

[LG-40] Reliable Thermal Monitoring of Electric Machines through Machine Learning

Link: https://arxiv.org/abs/2504.19141
Authors: Panagiotis Kakosimos
Categories: Machine Learning (cs.LG)
Comments: 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Abstract:The electrification of powertrains is rising as the objective for a more viable future is intensified. To ensure continuous and reliable operation without undesirable malfunctions, it is essential to monitor the internal temperatures of machines and keep them within safe operating limits. Conventional modeling methods can be complex and usually require expert knowledge. With the amount of data collected these days, it is possible to use information models to assess thermal behaviors. This paper investigates artificial intelligence techniques for monitoring the cooling efficiency of induction machines. Experimental data was collected under specific operating conditions, and three machine-learning models have been developed. The optimal configuration for each approach was determined through rigorous hyperparameter searches, and the models were evaluated using a variety of metrics. The three solutions performed well in monitoring the condition of the machine even under transient operation, highlighting the potential of data-driven methods in improving the thermal management.

[LG-41] Harmonizing Generalization and Personalization in Ring-topology Decentralized Federated Learning

Link: https://arxiv.org/abs/2504.19103
Authors: Shunxin Guo, Jiaqi Lv, Xin Geng
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:We introduce Ring-topology Decentralized Federated Learning (RDFL) for distributed model training, aiming to avoid the inherent risks of centralized failure in server-based FL. However, RDFL faces the challenge of low information-sharing efficiency due to the point-to-point communication manner when handling inherent data heterogeneity. Existing studies to mitigate data heterogeneity focus on personalized optimization of models, ignoring that the lack of shared information constraints can lead to large differences among models, weakening the benefits of collaborative learning. To tackle these challenges, we propose a Divide-and-conquer RDFL framework (DRDFL) that uses a feature generation model to extract personalized information and invariant shared knowledge from the underlying data distribution, ensuring both effective personalization and strong generalization. Specifically, we design a PersonaNet module that encourages class-specific feature representations to follow a Gaussian mixture distribution, facilitating the learning of discriminative latent representations tailored to local data distributions. Meanwhile, the Learngene module is introduced to encapsulate shared knowledge through an adversarial classifier to align latent representations and extract globally invariant information. Extensive experiments demonstrate that DRDFL outperforms state-of-the-art methods in various data heterogeneity settings.

[LG-42] Score-Debiased Kernel Density Estimation ICLR2025

Link: https://arxiv.org/abs/2504.19084
Authors: Elliot L. Epstein, Rajat Dwaraknath, Thanawat Sornwanee, John Winnicki, Jerry Weihong Liu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICLR 2025 Workshop on Frontiers of Probabilistic Inference

Abstract:We propose a novel method for density estimation that leverages an estimated score function to debias kernel density estimation (SD-KDE). In our approach, each data point is adjusted by taking a single step along the score function with a specific choice of step size, followed by standard KDE with a modified bandwidth. The step size and modified bandwidth are chosen to remove the leading order bias in the KDE. Our experiments on synthetic tasks in 1D and 2D and on MNIST demonstrate that our proposed SD-KDE method significantly reduces the mean integrated squared error compared to the standard Silverman KDE, even with noisy estimates in the score function. These results underscore the potential of integrating score-based corrections into nonparametric density estimation.
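
The shift-then-smooth idea can be sketched in one dimension as below. The abstract does not state the exact step size or the modified bandwidth, so this sketch makes its own assumption. A Gaussian KDE with bandwidth h has leading bias (h^2/2) p''(x), and moving each sample by \delta times its score changes the sample density by about -\delta p''(x), so \delta = h^2/2 cancels the leading term. The bandwidth is left unchanged here, and the function names are hypothetical.

```python
import math


def gaussian_kde(points, x, h):
    """Standard Gaussian kernel density estimate at a single location x."""
    norm = 1.0 / (len(points) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in points)


def score_debiased_kde(points, x, h, score):
    """Move each sample one step along the score, then apply a standard KDE.

    `score` is a callable returning d/dx log p at a sample (exact or estimated).
    The step size h^2 / 2 is an assumption derived from the leading-order bias.
    """
    step = 0.5 * h * h
    shifted = [p + step * score(p) for p in points]
    return gaussian_kde(shifted, x, h)
```

For standard normal samples the score is `lambda v: -v`, so the shift pulls points toward the mean and offsets the variance inflation that kernel smoothing introduces.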

[LG-43] Atlantes: A system of GPS transformers for global-scale real-time maritime intelligence ICLR

Link: https://arxiv.org/abs/2504.19036
Authors: Henry Herzog, Joshua Hansen, Yawen Zhang, Patrick Beukema
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 10 figures, ICLR CCAI 2025, spotlight talk

Abstract:Unsustainable exploitation of the oceans exacerbated by global warming is threatening coastal communities worldwide. Accurate and timely monitoring of maritime activity is an essential step to effective governance and to inform future policy. In support of this complex global-scale effort, we built Atlantes, a deep learning based system that provides the first-ever real-time view of vessel behavior at global scale. Atlantes leverages a series of bespoke transformers to distill a high volume, continuous stream of GPS messages emitted by hundreds of thousands of vessels into easily quantifiable behaviors. The combination of low latency and high performance enables operationally relevant decision-making and successful interventions on the high seas where illegal and exploitative activity is too common. Atlantes is already in use by hundreds of organizations worldwide. Here we provide an overview of the model and infrastructure that enables this system to function efficiently and cost-effectively at global scale and in real time.

[LG-44] On learning functions over biological sequence space: relating Gaussian process priors, regularization, and gauge fixing

Link: https://arxiv.org/abs/2504.19034
Authors: Samantha Petti, Carlos Martí-Gómez, Justin B. Kinney, Juannan Zhou, David M. McCandlish
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)

Abstract:Mappings from biological sequences (DNA, RNA, protein) to quantitative measures of sequence functionality play an important role in contemporary biology. We are interested in the related tasks of (i) inferring predictive sequence-to-function maps and (ii) decomposing sequence-function maps to elucidate the contributions of individual subsequences. Because each sequence-function map can be written as a weighted sum over subsequences in multiple ways, meaningfully interpreting these weights requires “gauge-fixing,” i.e., defining a unique representation for each map. Recent work has established that most existing gauge-fixed representations arise as the unique solutions to L_2-regularized regression in an overparameterized “weight space” where the choice of regularizer defines the gauge. Here, we establish the relationship between regularized regression in overparameterized weight space and Gaussian process approaches that operate in “function space,” i.e., the space of all real-valued functions on a finite set of sequences. We disentangle how weight space regularizers both impose an implicit prior on the learned function and restrict the optimal weights to a particular gauge. We also show how to construct regularizers that correspond to arbitrary explicit Gaussian process priors combined with a wide variety of gauges. Next, we derive the distribution of gauge-fixed weights implied by the Gaussian process posterior and demonstrate that even for long sequences this distribution can be efficiently computed for product-kernel priors using a kernel trick. Finally, we characterize the implicit function space priors associated with the most common weight space regularizers. Overall, our framework unifies and extends our ability to infer and interpret sequence-function relationships.

[LG-45] Smooth Approximations of the Rounding Function

Link: https://arxiv.org/abs/2504.19026
Authors: Stanislav Semenov
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 9 pages, 1 figure, submitted to arXiv

Abstract:We propose novel smooth approximations to the classical rounding function, suitable for differentiable optimization and machine learning applications. Our constructions are based on two approaches: (1) localized sigmoid window functions centered at each integer, and (2) normalized weighted sums of sigmoid derivatives representing local densities. The first method approximates the step-like behavior of rounding through differences of shifted sigmoids, while the second method achieves smooth interpolation between integers via density-based weighting. Both methods converge pointwise to the classical rounding function as the sharpness parameter k tends to infinity, and allow controlled trade-offs between smoothness and approximation accuracy. We demonstrate that by restricting the summation to a small set of nearest integers, the computational cost remains low without sacrificing precision. These constructions provide fully differentiable alternatives to hard rounding, which are valuable in contexts where gradient-based methods are essential.
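
One possible reading of the two constructions is sketched below. The formulas are assumptions inferred from the abstract's wording, not taken from the paper: approach (1) weights each integer n by the sigmoid window \sigma(k(x - n + 0.5)) - \sigma(k(x - n - 0.5)), and approach (2) takes a normalized average of nearby integers weighted by the sigmoid derivative at k(x - n). The function names and the `radius` parameter, which restricts the sum to the nearest integers, are hypothetical.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _nearest_integers(x, radius):
    """Integers within `radius` of the integer closest to x."""
    center = math.floor(x + 0.5)
    return range(center - radius, center + radius + 1)


def smooth_round_window(x, k=50.0, radius=2):
    """Approach (1): sum of integers weighted by sigmoid windows on [n - 0.5, n + 0.5]."""
    return sum(
        n * (_sigmoid(k * (x - n + 0.5)) - _sigmoid(k * (x - n - 0.5)))
        for n in _nearest_integers(x, radius)
    )


def smooth_round_density(x, k=50.0, radius=2):
    """Approach (2): normalized average weighted by the sigmoid derivative at k * (x - n)."""
    ns = list(_nearest_integers(x, radius))
    # log of sigma'(z) = sigma(z) * (1 - sigma(z)), evaluated in a stable form
    logs = [-abs(k * (x - n)) - 2.0 * math.log1p(math.exp(-abs(k * (x - n)))) for n in ns]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]  # rescaled to avoid underflow
    return sum(n * w for n, w in zip(ns, weights)) / sum(weights)
```

Larger values of `k` make both functions hug the hard rounding function more closely, at the cost of steeper gradients near half-integers.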

[LG-46] Towards minimax optimal algorithms for Active Simple Hypothesis Testing

Link: https://arxiv.org/abs/2504.19014
Authors: Sushant Vijayan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the Active Simple Hypothesis Testing (ASHT) problem, a simpler variant of the Fixed Budget Best Arm Identification problem. In this work, we provide a novel game-theoretic formulation of the upper bounds of the ASHT problem. This formulation allows us to leverage tools of differential games and Partial Differential Equations (PDEs) to propose an approximately optimal algorithm that is computationally tractable compared to prior work. However, the optimal algorithm still suffers from a curse of dimensionality, so we instead use a novel link to Blackwell Approachability to propose an algorithm that is far more efficient computationally. We show that this new algorithm, although not proven to be optimal, is always better than static algorithms in all instances of ASHT and is numerically observed to attain the optimal exponent in various instances.

[LG-47] Unveiling and Mitigating Adversarial Vulnerabilities in Iterative Optimizers

Link: https://arxiv.org/abs/2504.19000
Authors: Elad Sofer, Tomer Shaked, Caroline Chaux, Nir Shlezinger
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Under review for publication in the IEEE

Abstract:Machine learning (ML) models are often sensitive to carefully crafted yet seemingly unnoticeable perturbations. Such adversarial examples are considered to be a property of ML models, often associated with their black-box operation and sensitivity to features learned from data. This work examines the adversarial sensitivity of non-learned decision rules, and particularly of iterative optimizers. Our analysis is inspired by the recent developments in deep unfolding, which cast such optimizers as ML models. We show that non-learned iterative optimizers share the sensitivity to adversarial examples of ML models, and that attacking iterative optimizers effectively alters the optimization objective surface in a manner that modifies the minima sought. We then leverage the ability to cast iteration-limited optimizers as ML models to enhance robustness via adversarial training. For a class of proximal gradient optimizers, we rigorously prove how their learning affects adversarial sensitivity. We numerically back our findings, showing the vulnerability of various optimizers, as well as the robustness induced by unfolding and adversarial training.

[LG-48] Speaker Retrieval in the Wild: Challenges, Effectiveness and Robustness

Link: https://arxiv.org/abs/2504.18950
Authors: Erfan Loweimi, Mengjie Qian, Kate Knill, Mark Gales
Categories: Sound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 13 pages, 10 figures, 10 tables, 76 references

Abstract:There is a growing abundance of publicly available or company-owned audio/video archives, highlighting the increasing importance of efficient access to desired content and information retrieval from these archives. This paper investigates the challenges, solutions, effectiveness, and robustness of speaker retrieval systems developed “in the wild”, which involves addressing two primary challenges: extraction of task-relevant labels from limited metadata for system development and evaluation, as well as the unconstrained acoustic conditions encountered in the archive, ranging from quiet studios to adverse noisy environments. While we focus on the publicly available BBC Rewind archive (spanning 1948 to 1979), our framework addresses the broader issue of speaker retrieval on extensive and possibly aged archives with no control over the content and acoustic conditions. Typically, these archives offer a brief and general file description, mostly inadequate for specific applications like speaker retrieval, and manual annotation of such large-scale archives is unfeasible. We explore various aspects of system development (e.g., speaker diarisation, embedding extraction, query selection) and analyse the challenges, possible solutions, and their functionality. To evaluate the performance, we conduct systematic experiments in both a clean setup and against various distortions simulating real-world applications. Our findings demonstrate the effectiveness and robustness of the developed speaker retrieval systems, establishing the versatility and scalability of the proposed framework for a wide range of applications beyond the BBC Rewind corpus.

[LG-49] Meta-Learning in Self-Play Regret Minimization

Link: https://arxiv.org/abs/2504.18917
Authors: David Sychrovský, Martin Schmid, Michal Šustr, Michael Bowling
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Regret minimization is a general approach to online optimization which plays a crucial role in many algorithms for approximating Nash equilibria in two-player zero-sum games. The literature mainly focuses on solving individual games in isolation. However, in practice, players often encounter a distribution of similar but distinct games, for example when trading correlated assets on the stock market or when refining the strategy in subgames of a much larger game. Recently, offline meta-learning was used to accelerate one-sided equilibrium finding on such distributions. We build upon this, extending the framework to the more challenging self-play setting, which is the basis for most state-of-the-art equilibrium approximation algorithms for domains at scale. When selecting the strategy, our method uniquely integrates information across all decision states, promoting global communication as opposed to the traditional local regret decomposition. Empirical evaluation on normal-form games and river poker subgames shows our meta-learned algorithms considerably outperform other state-of-the-art regret minimization algorithms.
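
As background for the self-play setting, the sketch below runs plain regret matching for both players of a two-player zero-sum matrix game and returns the average strategies, which approach a Nash equilibrium. It illustrates the classical baseline only, not the meta-learned algorithm from the paper. The function names and the example payoff matrix used in the note that follows are made up for illustration.

```python
def regret_matching_strategy(cum_regret):
    """Play actions in proportion to positive cumulative regret (uniform if none)."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n


def self_play(payoff, iterations=20000):
    """payoff[i][j] is the row player's payoff; the column player receives its negative."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    regret_row, regret_col = [0.0] * n_rows, [0.0] * n_cols
    sum_row, sum_col = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        s_row = regret_matching_strategy(regret_row)
        s_col = regret_matching_strategy(regret_col)
        # Expected payoff of each pure action against the opponent's current strategy.
        u_row = [sum(payoff[i][j] * s_col[j] for j in range(n_cols)) for i in range(n_rows)]
        u_col = [-sum(payoff[i][j] * s_row[i] for i in range(n_rows)) for j in range(n_cols)]
        v_row = sum(s_row[i] * u_row[i] for i in range(n_rows))
        v_col = sum(s_col[j] * u_col[j] for j in range(n_cols))
        for i in range(n_rows):
            regret_row[i] += u_row[i] - v_row
            sum_row[i] += s_row[i]
        for j in range(n_cols):
            regret_col[j] += u_col[j] - v_col
            sum_col[j] += s_col[j]
    return [s / iterations for s in sum_row], [s / iterations for s in sum_col]


def exploitability(payoff, row, col):
    """Sum of both players' best-response gains; zero exactly at a Nash equilibrium."""
    best_row = max(sum(payoff[i][j] * col[j] for j in range(len(col))) for i in range(len(row)))
    best_col = max(-sum(payoff[i][j] * row[i] for i in range(len(row))) for j in range(len(col)))
    return best_row + best_col
```

For the payoff matrix `[[2, -1], [-1, 1]]`, the unique equilibrium has both players choosing their first action with probability 2/5, and the averaged strategies from `self_play` move toward that point as the iteration count grows.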

[LG-50] Factor Analysis with Correlated Topic Model for Multi-Modal Data AISTATS2025

Link: https://arxiv.org/abs/2504.18914
Authors: Małgorzata Łazęcka, Ewa Szczurek
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: AISTATS 2025

Abstract:Integrating various data modalities brings valuable insights into underlying phenomena. Multimodal factor analysis (FA) uncovers shared axes of variation underlying different simple data modalities, where each sample is represented by a vector of features. However, FA is not suited for structured data modalities, such as text or single cell sequencing data, where multiple data points are measured per sample and exhibit a clustering structure. To overcome this challenge, we introduce FACTM, a novel, multi-view and multi-structure Bayesian model that combines FA with correlated topic modeling and is optimized using variational inference. Additionally, we introduce a method for rotating latent factors to enhance interpretability with respect to binary features. On text and video benchmarks as well as real-world music and COVID-19 datasets, we demonstrate that FACTM outperforms other methods in identifying clusters in structured data, and integrating them with simple modalities via the inference of shared, interpretable factors.

[LG-51] TSCAN: Context-Aware Uplift Modeling via Two-Stage Training for Online Merchant Business Diagnosis

Link: https://arxiv.org/abs/2504.18881
Authors: Hangtao Zhang, Zhe Li, Kairui Zhang
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 7 figures

Abstract:A primary challenge in individual treatment effect (ITE) estimation is sample selection bias. Traditional approaches utilize treatment regularization techniques such as the Integral Probability Metrics (IPM), re-weighting, and propensity score modeling to mitigate this bias. However, these regularizations may introduce undesirable information loss and limit the performance of the model. Furthermore, treatment effects vary across different external contexts, and the existing methods are insufficient in fully interacting with and utilizing these contextual features. To address these issues, we propose a Context-Aware uplift model based on the Two-Stage training approach (TSCAN), comprising CAN-U and CAN-D sub-models. In the first stage, we train an uplift model, called CAN-U, which includes the treatment regularizations of IPM and propensity score prediction, to generate a complete dataset with counterfactual uplift labels. In the second stage, we train a model named CAN-D, which utilizes an isotonic output layer to directly model uplift effects, thereby eliminating the reliance on the regularization components. CAN-D adaptively corrects the errors estimated by CAN-U through reinforcing the factual samples, while avoiding the negative impacts associated with the aforementioned regularizations. Additionally, we introduce a Context-Aware Attention Layer throughout the two-stage process to manage the interactions between treatment, merchant, and contextual features, thereby modeling the varying treatment effect in different contexts. We conduct extensive experiments on two real-world datasets to validate the effectiveness of TSCAN. Ultimately, the deployment of our model for real-world merchant diagnosis on one of China’s largest online food ordering platforms validates its practical utility and impact.

[LG-52] Approximating Nash Equilibria in General-Sum Games via Meta-Learning

Link: https://arxiv.org/abs/2504.18868
Authors: David Sychrovský, Christopher Solinas, Revan MacQueen, Kevin Wang, James R. Wright, Nathan R. Sturtevant, Michael Bowling
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Nash equilibrium is perhaps the best-known solution concept in game theory. Such a solution assigns a strategy to each player which offers no incentive to unilaterally deviate. While a Nash equilibrium is guaranteed to always exist, the problem of finding one in general-sum games is PPAD-complete, generally considered intractable. Regret minimization is an efficient framework for approximating Nash equilibria in two-player zero-sum games. However, in general-sum games, such algorithms are only guaranteed to converge to a coarse-correlated equilibrium (CCE), a solution concept where players can correlate their strategies. In this work, we use meta-learning to minimize the correlations in strategies produced by a regret minimizer. This encourages the regret minimizer to find strategies that are closer to a Nash equilibrium. The meta-learned regret minimizer is still guaranteed to converge to a CCE, but we give a bound on the distance to Nash equilibrium in terms of our meta-loss. We evaluate our approach in general-sum imperfect information games. Our algorithms provide significantly better approximations of Nash equilibria than state-of-the-art regret minimization techniques.
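
As a minimal illustration of the regret-minimization baseline this abstract builds on (not the meta-learned method itself), the sketch below runs plain regret matching in self-play on a small two-player zero-sum matrix game and measures how far the average strategies are from a Nash equilibrium. The payoff matrix `GAME`, the iteration count, and all function names are hypothetical choices made for this example.

```python
def regret_matching_strategy(regrets):
    """Mix actions in proportion to positive cumulative regret (uniform if none)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def self_play(payoff, iterations=10000):
    """Return both players' average strategies after regret-matching self-play.

    payoff[i][j] is the row player's utility; the column player gets -payoff[i][j].
    """
    rows, cols = range(len(payoff)), range(len(payoff[0]))
    row_regret, col_regret = [0.0 for _ in rows], [0.0 for _ in cols]
    row_sum, col_sum = [0.0 for _ in rows], [0.0 for _ in cols]
    for _ in range(iterations):
        x = regret_matching_strategy(row_regret)
        y = regret_matching_strategy(col_regret)
        # Expected utility of each pure action against the opponent's current mix.
        row_u = [sum(payoff[i][j] * y[j] for j in cols) for i in rows]
        col_u = [-sum(payoff[i][j] * x[i] for i in rows) for j in cols]
        row_value = sum(x[i] * row_u[i] for i in rows)
        col_value = sum(y[j] * col_u[j] for j in cols)
        for i in rows:
            row_regret[i] += row_u[i] - row_value
            row_sum[i] += x[i]
        for j in cols:
            col_regret[j] += col_u[j] - col_value
            col_sum[j] += y[j]
    return [s / iterations for s in row_sum], [s / iterations for s in col_sum]


def exploitability(payoff, x, y):
    """Sum of both players' best-response values; zero exactly at a Nash equilibrium."""
    rows, cols = range(len(payoff)), range(len(payoff[0]))
    best_row = max(sum(payoff[i][j] * y[j] for j in cols) for i in rows)
    best_col = max(-sum(payoff[i][j] * x[i] for i in rows) for j in cols)
    return best_row + best_col


# Hypothetical 2x2 zero-sum game whose equilibrium is non-uniform.
GAME = [[2.0, -1.0], [-1.0, 1.0]]

if __name__ == "__main__":
    x_avg, y_avg = self_play(GAME)
    print("row strategy:", x_avg)
    print("exploitability:", exploitability(GAME, x_avg, y_avg))
```

In zero-sum games the average strategies approach an equilibrium, which is the guarantee that no longer holds in the general-sum setting the paper targets.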

[LG-53] Diffeomorphic Obstacle Avoidance for Contractive Dynamical Systems via Implicit Representations

Link: https://arxiv.org/abs/2504.18860
Authors: Ken-Joel Simmoteit, Philipp Schillinger, Leonel Rozo
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted at R:SS 2025

Abstract:Ensuring safety and robustness of robot skills is becoming crucial as robots are required to perform increasingly complex and dynamic tasks. The former is essential when performing tasks in cluttered environments, while the latter is relevant to overcome unseen task situations. This paper addresses the challenge of ensuring both safety and robustness in dynamic robot skills learned from demonstrations. Specifically, we build on neural contractive dynamical systems to provide robust extrapolation of the learned skills, while designing a full-body obstacle avoidance strategy that preserves contraction stability via diffeomorphic transforms. This is particularly crucial in complex environments where implicit scene representations, such as Signed Distance Fields (SDFs), are necessary. To this end, our framework, called Signed Distance Field Diffeomorphic Transform, leverages SDFs and flow-based diffeomorphisms to achieve contraction-preserving obstacle avoidance. We thoroughly evaluate our framework on synthetic datasets and several real-world robotic tasks in a kitchen environment. Our results show that our approach locally adapts the learned contractive vector field while staying close to the learned dynamics and without introducing highly-curved motion paths, thus outperforming several state-of-the-art methods.

[LG-54] Theoretical Framework for Tempered Fractional Gradient Descent: Application to Breast Cancer Classification

Link: https://arxiv.org/abs/2504.18849
Authors: Omar Naifar
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:This paper introduces Tempered Fractional Gradient Descent (TFGD), a novel optimization framework that synergizes fractional calculus with exponential tempering to enhance gradient-based learning. Traditional gradient descent methods often suffer from oscillatory updates and slow convergence in high-dimensional, noisy landscapes. TFGD addresses these limitations by incorporating a tempered memory mechanism, where historical gradients are weighted by fractional coefficients $|w_j| = \binom{\alpha}{j}$ and exponentially decayed via a tempering parameter $\lambda$. Theoretical analysis establishes TFGD’s convergence guarantees: in convex settings, it achieves an $\mathcal{O}(1/K)$ rate with alignment coefficient $d_{\alpha,\lambda} = (1 - e^{-\lambda})^{-\alpha}$, while stochastic variants attain $\mathcal{O}(1/k^\alpha)$ error decay. The algorithm maintains $\mathcal{O}(n)$ time complexity equivalent to SGD, with memory overhead scaling as $\mathcal{O}(d/\lambda)$ for parameter dimension $d$. Empirical validation on the Breast Cancer Wisconsin dataset demonstrates TFGD’s superiority, achieving 98.25% test accuracy (vs. 92.11% for SGD) and $2\times$ faster convergence. The tempered memory mechanism proves particularly effective in medical classification tasks, where feature correlations benefit from stable gradient averaging. These results position TFGD as a robust alternative to conventional optimizers in both theoretical and applied machine learning.
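
To make the tempered memory mechanism more concrete, here is a minimal sketch that computes weights of the form |binom(alpha, j)| * exp(-lambda * j) and uses them to combine a short gradient history. The abstract does not give the exact update rule, so the truncated memory, the normalization by the weight sum, and the names `tempered_fractional_weights` and `tempered_step` are assumptions made purely for illustration.

```python
import math


def tempered_fractional_weights(alpha, lam, memory):
    """Weights |binom(alpha, j)| * exp(-lam * j) for j = 0 .. memory - 1."""
    weights = []
    coeff = 1.0  # generalized binomial coefficient binom(alpha, 0)
    for j in range(memory):
        weights.append(abs(coeff) * math.exp(-lam * j))
        # Recurrence: binom(alpha, j + 1) = binom(alpha, j) * (alpha - j) / (j + 1)
        coeff *= (alpha - j) / (j + 1)
    return weights


def tempered_step(params, grad_history, lr, alpha, lam):
    """One descent step along a weighted average of past gradients.

    grad_history is ordered newest first; normalizing by the weight sum
    is an assumption of this sketch, not something stated in the abstract.
    """
    weights = tempered_fractional_weights(alpha, lam, len(grad_history))
    norm = sum(weights)
    direction = [
        sum(w * g[k] for w, g in zip(weights, grad_history)) / norm
        for k in range(len(params))
    ]
    return [p - lr * d for p, d in zip(params, direction)]


if __name__ == "__main__":
    print(tempered_fractional_weights(alpha=0.5, lam=0.1, memory=4))
```

Larger values of `lam` shrink the contribution of old gradients faster, which is the role the abstract assigns to the tempering parameter.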

[LG-55] Frequency-Integrated Transformer for Arbitrary-Scale Super-Resolution

Link: https://arxiv.org/abs/2504.18818
Authors: Xufei Wang, Fei Ge, Jinchen Zhu, Mingjian Zhang, Qi Wu, Jifeng Ren, Shizhuang Weng
Subjects: Machine Learning (cs.LG)
Comments: 11 pages, 8 figures

Abstract:Methods based on implicit neural representation have demonstrated remarkable capabilities in arbitrary-scale super-resolution (ASSR) tasks, but they neglect the potential value of the frequency domain, leading to sub-optimal performance. We propose a novel network called Frequency-Integrated Transformer (FIT) to incorporate and utilize frequency information to enhance ASSR performance. FIT employs a Frequency Incorporation Module (FIM) to introduce frequency information in a lossless manner and a Frequency Utilization Self-Attention module (FUSAM) to efficiently leverage frequency information by exploiting the spatial-frequency interrelationship and the global nature of frequency. FIM enriches detail characterization by incorporating frequency information through a combination of Fast Fourier Transform (FFT) with real-imaginary mapping. In FUSAM, Interaction Implicit Self-Attention (IISA) achieves cross-domain information synergy by interacting spatial and frequency information in subspace, while Frequency Correlation Self-attention (FCSA) captures the global context by computing correlation in frequency. Experimental results demonstrate that FIT yields superior performance compared to existing methods across multiple benchmark datasets. Visual feature maps demonstrate the superiority of FIM in enriching detail characterization. Frequency error maps validate that IISA improves frequency fidelity. Local attribution maps validate that FCSA effectively captures global context.

[LG-56] Nonconvex Linear System Identification with Minimal State Representation

Link: https://arxiv.org/abs/2504.18791
Authors: Uday Kiran Reddy Tadipatri, Benjamin D. Haeffele, Joshua Agterberg, Ingvar Ziemann, René Vidal
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Accepted to 7th Annual Conference on Learning for Dynamics and Control (L4DC) 2025. The full version including appendix

Abstract:Low-order linear System IDentification (SysID) addresses the challenge of estimating the parameters of a linear dynamical system from finite samples of observations and control inputs with minimal state representation. Traditional approaches often utilize Hankel-rank minimization, which relies on convex relaxations that can require numerous, costly singular value decompositions (SVDs) to optimize. In this work, we propose two nonconvex reformulations to tackle low-order SysID: (i) Burer-Monteiro (BM) factorization of the Hankel matrix for efficient nuclear norm minimization, and (ii) optimizing directly over system parameters for real, diagonalizable systems with an atomic norm style decomposition. These reformulations circumvent the need for repeated heavy SVD computations, significantly improving computational efficiency. Moreover, we prove that optimizing directly over the system parameters yields lower statistical error rates, and lower sample complexities that do not scale linearly with trajectory length as in Hankel-nuclear norm minimization. Additionally, while our proposed formulations are nonconvex, we provide theoretical guarantees of achieving global optimality in polynomial time. Finally, we demonstrate algorithms that solve these nonconvex programs and validate our theoretical claims on synthetic data.

[LG-57] ALF: Advertiser Large Foundation Model for Multi-Modal Advertiser Understanding

Link: https://arxiv.org/abs/2504.18785
Authors: Santosh Rajagopalan, Jonathan Vronsky, Songbai Yan, S. Alireza Golestaneh, Shubhra Chandra, Min Zhou
Subjects: Machine Learning (cs.LG)

Abstract:We present ALF (Advertiser Large Foundation model), a multi-modal transformer architecture for understanding advertiser behavior and intent across text, image, video and structured data modalities. Through contrastive learning and multi-task optimization, ALF creates unified advertiser representations that capture both content and behavioral patterns. Our model achieves state-of-the-art performance on critical tasks including fraud detection, policy violation identification, and advertiser similarity matching. In production deployment, ALF reduces false positives by 90% while maintaining 99.8% precision on abuse detection tasks. The architecture’s effectiveness stems from its novel combination of multi-modal transformations, inter-sample attention mechanism, spectrally normalized projections, and calibrated probabilistic outputs.

[LG-58] Performance of Machine Learning Classifiers for Anomaly Detection in Cyber Security Applications

Link: https://arxiv.org/abs/2504.18771
Authors: Markus Haug, Gissel Velarde
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:This work empirically evaluates machine learning models on two imbalanced public datasets (KDDCUP99 and Credit Card Fraud 2013). The method includes data preparation, model training, and evaluation, using an 80/20 (train/test) split. Models tested include eXtreme Gradient Boosting (XGB), Multi Layer Perceptron (MLP), Generative Adversarial Network (GAN), Variational Autoencoder (VAE), and Multiple-Objective Generative Adversarial Active Learning (MO-GAAL), with XGB and MLP further combined with Random-Over-Sampling (ROS) and Self-Paced-Ensemble (SPE). Evaluation involves 5-fold cross-validation and imputation techniques (mean, median, and IterativeImputer) with 10%, 20%, 30%, and 50% missing data. Findings show XGB and MLP outperform generative models. IterativeImputer results are comparable to mean and median, but not recommended for large datasets due to increased complexity and execution time. The code used is publicly available on GitHub (this http URL).

[LG-59] High-order Graph Neural Networks with Common Neighbor Awareness for Link Prediction

Link: https://arxiv.org/abs/2504.18758
Authors: Ling Wang, Minglian Han
Subjects: Machine Learning (cs.LG)
Comments: Accepted By ICAISISAS 2025

Abstract:Link prediction is a fundamental task in dynamic graph learning (DGL), inherently shaped by the topology of the dynamic graph (DG). Recent advancements in dynamic graph neural networks (DGNN), primarily by modeling the relationships among nodes via a message passing scheme, have significantly improved link prediction performance. However, DGNNs rely heavily on pairwise node interactions and thus neglect the common neighbor interaction in DGL. To address this limitation, we propose a High-order Graph Neural Network with Common Neighbor Awareness (HGNN-CNA) for link prediction, built on two ideas: a) estimating a correlation score by considering multi-hop common neighbors to capture the complex interaction between nodes; b) fusing the correlation into the message-passing process to consider common neighbor interaction directly in DGL. Experimental results on three real DGs demonstrate that the proposed HGNN-CNA achieves a significant accuracy gain over several state-of-the-art models on the link prediction task.
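
The idea of a multi-hop common-neighbor score can be illustrated with a very small sketch. This is a plain counting heuristic on a static adjacency dictionary, not the HGNN-CNA model; the graph, the `hops` parameter, and the function names are hypothetical and chosen only for the example.

```python
def neighbors_within(adj, node, hops):
    """Return all nodes reachable from `node` in at most `hops` steps, excluding itself."""
    frontier, seen = {node}, {node}
    for _ in range(hops):
        # Expand one hop and keep only nodes not visited before.
        frontier = {v for u in frontier for v in adj.get(u, ())} - seen
        seen |= frontier
    return seen - {node}


def common_neighbor_score(adj, u, v, hops=1):
    """Count nodes that lie within `hops` steps of both u and v."""
    near_u = neighbors_within(adj, u, hops) - {v}
    near_v = neighbors_within(adj, v, hops) - {u}
    return len(near_u & near_v)


# Hypothetical undirected toy graph.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

if __name__ == "__main__":
    print(common_neighbor_score(GRAPH, "a", "d", hops=1))
    print(common_neighbor_score(GRAPH, "a", "e", hops=2))
```

Nodes `a` and `e` share no direct neighbor, yet the two-hop score is non-zero, which is the kind of higher-order signal a purely pairwise scheme would miss.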

[LG-60] Non-Asymptotic Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Link: https://arxiv.org/abs/2504.18743
Authors: Zaiwei Chen
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments: 63 pages and 4 figures

Abstract:This work presents the first finite-time analysis for the last-iterate convergence of average-reward Q-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes, which serve as local clocks for each state-action pair. We show that the iterates generated by this Q-learning algorithm converge at a rate of O(1/k) (in the mean-square sense) to the optimal relative Q-function in the span seminorm. Moreover, by adding a centering step to the algorithm, we further establish pointwise mean-square convergence to a centered optimal relative Q-function, also at a rate of O(1/k). To prove these results, we show that adaptive stepsizes are necessary, as without them, the algorithm fails to converge to the correct target. In addition, adaptive stepsizes can be interpreted as a form of implicit importance sampling that counteracts the effects of asynchronous updates. Technically, the use of adaptive stepsizes makes each Q-learning update depend on the entire sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates. The tools developed in this work are likely to be broadly applicable to the analysis of general SA algorithms with adaptive stepsizes.

[LG-61] Multimodal graph representation learning for website generation based on visual sketch

Link: https://arxiv.org/abs/2504.18729
Authors: Tung D. Vu, Chung Hoang, Truong-Son Hy
Subjects: Machine Learning (cs.LG)

Abstract:The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle with accurately interpreting the intricate visual details and structural relationships inherent in webpage designs, leading to limitations in automation and efficiency. In this paper, we propose a novel method that leverages multimodal graph representation learning to address these challenges. By integrating both visual and structural information from design sketches, our approach enhances the accuracy and efficiency of code generation, particularly in producing semantically correct and structurally sound HTML code. A comprehensive evaluation demonstrates significant improvements in both accuracy and efficiency over existing techniques, highlighting the potential of our method to revolutionize design-to-code automation. Code available at this https URL

[LG-62] Appa: Bending Weather Dynamics with Latent Diffusion Models for Global Data Assimilation

Link: https://arxiv.org/abs/2504.18720
Authors: Gérôme Andry, François Rozet, Sacha Lewin, Omer Rochman, Victor Mangeleer, Matthias Pirlet, Elise Faulx, Marilaure Grégoire, Gilles Louppe
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:Deep learning has transformed weather forecasting by improving both its accuracy and computational efficiency. However, before any forecast can begin, weather centers must identify the current atmospheric state from vast amounts of observational data. To address this challenging problem, we introduce Appa, a score-based data assimilation model producing global atmospheric trajectories at 0.25-degree resolution and 1-hour intervals. Powered by a 1.5B-parameter spatio-temporal latent diffusion model trained on ERA5 reanalysis data, Appa can be conditioned on any type of observations to infer the posterior distribution of plausible state trajectories, without retraining. Our unified probabilistic framework flexibly tackles multiple inference tasks – reanalysis, filtering, and forecasting – using the same model, eliminating the need for task-specific architectures or training procedures. Experiments demonstrate physical consistency on a global scale and good reconstructions from observations, while showing competitive forecasting skills. Our results establish latent score-based data assimilation as a promising foundation for future global atmospheric modeling systems.

[LG-63] Active Few-Shot Learning for Vertex Classification Starting from an Unlabeled Dataset IJCNN2025

Link: https://arxiv.org/abs/2504.18696
Authors: Felix Burr, Marcel Hoffmann, Ansgar Scherp
Subjects: Machine Learning (cs.LG)
Comments: Accepted at IJCNN 2025

Abstract:Despite the ample availability of graph data, obtaining vertex labels is a tedious and expensive task. Therefore, it is desirable to learn from a few labeled vertices only. Existing few-shot learners assume a class oracle, which provides labeled vertices for a desired class. However, such an oracle is not available in a real-world setting, i.e., when drawing a vertex for labeling it is unknown to which class the vertex belongs. Few-shot learners are often combined with prototypical networks, while classical semi-supervised vertex classification uses discriminative models, e.g., Graph Convolutional Networks (GCN). In this paper, we train our models by iteratively prompting a human annotator with vertices to annotate. We perform three experiments where we continually relax our assumptions. First, we assume a class oracle, i.e., the human annotator is provided with an equal number of vertices to label for each class. We denote this as "Balanced Sampling". In the subsequent experiment, "Unbalanced Sampling", we replace the class oracle with k-medoids clustering and draw vertices to label from the clusters. In the last experiment, "Unknown Number of Classes", we no longer assume that we know the number and distribution of classes. Our results show that prototypical models outperform discriminative models in all experiments when fewer than 20 samples per class are available. While dropping the assumption of the class oracle for the "Unbalanced Sampling" experiment reduces the performance of the GCN by 9%, the prototypical network loses only 1% on average. For the "Unknown Number of Classes" experiment, the average performance for both models decreased further by 1%. Source code: this https URL

[LG-64] A Unified MDL-based Binning and Tensor Factorization Framework for PDF Estimation

Link: https://arxiv.org/abs/2504.18686
Authors: Mustafa Musab, Joseph K. Chege, Arie Yeredor, Martin Haardt
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Probability (math.PR); Machine Learning (stat.ML)

Abstract:Reliable density estimation is fundamental for numerous applications in statistics and machine learning. In many practical scenarios, data are best modeled as mixtures of component densities that capture complex and multimodal patterns. However, conventional density estimators based on uniform histograms often fail to capture local variations, especially when the underlying distribution is highly nonuniform. Furthermore, the inherent discontinuity of histograms poses challenges for tasks requiring smooth derivatives, such as gradient-based optimization, clustering, and nonparametric discriminant analysis. In this work, we present a novel non-parametric approach for multivariate probability density function (PDF) estimation that utilizes minimum description length (MDL)-based binning with quantile cuts. Our approach builds upon tensor factorization techniques, leveraging the canonical polyadic decomposition (CPD) of a joint probability tensor. We demonstrate the effectiveness of our method on synthetic data and a challenging real dry bean classification dataset.

[LG-65] Exploring the Potential of Latent Embeddings for Sea Ice Characterization using ICESat-2 Data

Link: https://arxiv.org/abs/2504.18668
Authors: Daehyeon Han, Morteza Karimzadeh
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 4 pages, 4 figures

Abstract:The Ice, Cloud, and Elevation Satellite-2 (ICESat-2) provides high-resolution measurements of sea ice height. Recent studies have developed machine learning methods on ICESat-2 data, primarily focusing on surface type classification. However, the heavy reliance on manually collected labels requires significant time and effort for supervised learning, as it involves cross-referencing track measurements with overlapping background optical imagery. Additionally, the coincidence of ICESat-2 tracks with background images is relatively rare due to the different overpass patterns and atmospheric conditions. To address these limitations, this study explores the potential of unsupervised autoencoders trained on unlabeled data to derive latent embeddings. We develop autoencoder models based on Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) to reconstruct topographic sequences from ICESat-2 and derive embeddings. We then apply Uniform Manifold Approximation and Projection (UMAP) to reduce dimensions and visualize the embeddings. Our results show that embeddings from autoencoders preserve the overall structure but generate relatively more compact clusters compared to the original ICESat-2 data, indicating the potential of embeddings to reduce the number of labeled samples required.

[LG-66] Unsupervised outlier detection to improve bird audio dataset labels

Link: https://arxiv.org/abs/2504.18650
Authors: Bruce Collins
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 27 pages, 9 figures

Abstract:The Xeno-Canto bird audio repository is an invaluable resource for those interested in vocalizations and other sounds made by birds around the world. This is particularly the case for machine learning researchers attempting to improve on the bird species recognition accuracy of classification models. However, the task of extracting labeled datasets from the recordings found in this crowd-sourced repository faces several challenges. One challenge of particular significance to machine learning practitioners is that one bird species label is applied to each audio recording, but frequently other sounds are also captured, including other bird species, other animal sounds, and anthropogenic and other ambient sounds. These non-target bird species sounds can result in dataset labeling discrepancies referred to as label noise. In this work we present a cleaning process consisting of audio preprocessing followed by dimensionality reduction and unsupervised outlier detection (UOD) to reduce the label noise in a dataset derived from Xeno-Canto recordings. We investigate three neural network dimensionality reduction techniques: two flavors of convolutional autoencoders and variational deep embedding (VaDE; Jiang, 2017). While both methods show some degree of effectiveness at detecting outliers for most bird species datasets, we found significant variation in the performance of the methods from one species to the next. We believe that the results of this investigation demonstrate that the application of our cleaning process can meaningfully reduce the label noise of bird species datasets derived from the Xeno-Canto audio repository, but results vary across species.

[LG-67] Periodic Online Testing for Sparse Systolic Tensor Arrays

Link: https://arxiv.org/abs/2504.18628
Authors: Christodoulos Peltekis, Chrysostomos Nicopoulos, Giorgos Dimitrakopoulos
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: International Conference on Modern Circuits and Systems Technologies (MOCAST) 2025

Abstract:Modern Machine Learning (ML) applications often benefit from structured sparsity, a technique that efficiently reduces model complexity and simplifies handling of sparse data in hardware. Sparse systolic tensor arrays - specifically designed to accelerate these structured-sparse ML models - play a pivotal role in enabling efficient computations. As ML is increasingly integrated into safety-critical systems, it is of paramount importance to ensure the reliability of these systems. This paper introduces an online error-checking technique capable of detecting and locating permanent faults within sparse systolic tensor arrays before computation begins. The new technique relies on merely four test vectors and exploits the weight values already loaded within the systolic array to comprehensively test the system. Fault-injection campaigns within the gate-level netlist, while executing three well-established Convolutional Neural Networks (CNN), validate the efficiency of the proposed approach, which is shown to achieve very high fault coverage, while incurring minimal performance and area overheads.

[LG-68] A Hybrid Framework for Real-Time Data Drift and Anomaly Identification Using Hierarchical Temporal Memory and Statistical Tests

Link: https://arxiv.org/abs/2504.18599
Authors: Subhadip Bandyopadhyay, Joy Bose, Sujoy Roy Chowdhury
Subjects: Machine Learning (cs.LG)
Comments: 26 pages, 9 figures

Abstract:Data drift is the phenomenon where the generating model behind the data changes over time. Due to data drift, any model built on past training data becomes less relevant and less accurate over time. Thus, detecting and controlling for data drift is critical in machine learning models. Hierarchical Temporal Memory (HTM) is a machine learning model developed by Jeff Hawkins, inspired by how the human brain processes information. It is a biologically inspired model of memory that is similar in structure to the neocortex, and whose performance is claimed to be comparable to state-of-the-art models in detecting anomalies in time series data. Another unique benefit of HTMs is their independence from the training and testing cycle; all learning takes place online with streaming data, and no separate training and testing cycle is required. In the sequential learning paradigm, the Sequential Probability Ratio Test (SPRT) offers some unique benefits for online learning and inference. This paper proposes a novel hybrid framework combining HTM and SPRT for real-time data drift detection and anomaly identification. Unlike existing data drift methods, our approach eliminates frequent retraining and ensures low false positive rates. HTMs currently work with one-dimensional or univariate data. In a second study, we also propose an application of HTM in a multidimensional supervised scenario for anomaly detection by combining the outputs of multiple HTM columns, one for each dimension of the data, through a neural network. Experimental evaluations demonstrate that the proposed method outperforms conventional drift detection techniques like the Kolmogorov-Smirnov (KS) test, Wasserstein distance, and Population Stability Index (PSI) in terms of accuracy, adaptability, and computational efficiency. Our experiments also provide insights into optimizing hyperparameters for real-time deployment in domains such as Telecom.
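
For readers unfamiliar with the SPRT component, the following is a minimal sketch of Wald's classical test applied to a stream of values, deciding between a reference mean and a shifted mean under a Gaussian model with known variance. It does not reproduce the paper's HTM-based pipeline; the Gaussian assumption, the parameter defaults, and the function name `sprt_gaussian` are choices made for this illustration.

```python
import math


def sprt_gaussian(samples, mu0, mu1, sigma, alpha=0.05, beta=0.05):
    """Sequentially test H0: mean = mu0 against H1: mean = mu1.

    alpha is the target false-alarm rate and beta the target miss rate.
    Returns a (decision, samples_used) pair.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1 (drift)
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0 (no drift)
    llr = 0.0
    for n, x in enumerate(samples, start=1):
        # Log-likelihood ratio increment for one Gaussian observation.
        llr += (mu1 - mu0) / sigma ** 2 * (x - (mu0 + mu1) / 2)
        if llr >= upper:
            return "drift", n
        if llr <= lower:
            return "no drift", n
    return "undecided", len(samples)


if __name__ == "__main__":
    stream = [0.1, -0.2, 0.9, 1.1, 1.3, 0.8, 1.2, 1.0, 0.9, 1.1]
    print(sprt_gaussian(stream, mu0=0.0, mu1=1.0, sigma=1.0))
```

Because the test stops as soon as either threshold is crossed, it needs no fixed batch size, which is what makes it attractive for streaming settings like the one described above.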

[LG-69] PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

Link: https://arxiv.org/abs/2504.18583
Authors: Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Comments: 15 pages, 6 figures

Abstract:The autoregressive nature of large language models (LLMs) limits inference speed. Each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding alleviates this issue using a draft-then-verify approach to accelerate token generation. However, the overhead introduced during the draft phase and the training cost of the draft model limit the efficiency and adaptability of speculative decoding. In this work, we introduce PARallel Draft (PARD), a novel speculative decoding method that enables low-cost adaptation of autoregressive draft models into parallel draft models. PARD enhances inference efficiency by predicting multiple future tokens in a single forward pass of the draft phase, and incorporates a conditional drop token method to accelerate training. Its target-independence property allows a single draft model to be applied to an entire family of different models, minimizing the adaptation cost. Our proposed conditional drop token method improves draft model training efficiency by 3x. On our optimized inference framework, PARD accelerates LLaMA3.1-8B inference by 4.08x, achieving 311.5 tokens per second.

[LG-70] Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging

Link: https://arxiv.org/abs/2504.18580
Authors: Shi Jie Yu, Sehyun Choi
Subjects: Machine Learning (cs.LG)

Abstract:Checkpoint merging is a technique for combining multiple model snapshots into a single superior model, potentially reducing training time for large language models. This paper explores checkpoint merging in the context of parameter-efficient fine-tuning (PEFT), where only small adapter modules (e.g. LoRA) are trained. We propose Metrics-Weighted Averaging (MWA), a simple yet effective method to merge model checkpoints by weighting their parameters according to performance metrics. In particular, we investigate weighting by training loss and by training steps, under the intuition that lower-loss or later-step checkpoints are more valuable. We introduce a formula with a penalty factor to adjust weight distribution, requiring only one hyperparameter regardless of the number of checkpoints. Experiments on three fine-tuning tasks (mathematical reasoning, preference alignment, and general instruction tuning) show that MWA consistently produces merged models that outperform the naive uniform average of checkpoints. Notably, loss-weighted merging often yields the best results, delivering up to 5% higher task accuracy than the baseline uniform merge and even surpassing the final individual checkpoint’s performance. These findings validate checkpoint merging for PEFT and demonstrate that a metric-driven weighting heuristic can efficiently boost model performance with minimal computational overhead.
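
A loss-weighted merge of this kind could look roughly like the sketch below. The abstract does not state the weighting formula, so the exponential form `exp(-penalty * loss)`, the single `penalty` hyperparameter as used here, and the representation of a checkpoint as a dictionary of flat parameter lists are all assumptions for illustration rather than the authors' method.

```python
import math


def loss_based_weights(losses, penalty=1.0):
    """Turn per-checkpoint losses into normalized weights (lower loss, higher weight).

    penalty = 0 reduces to the uniform average; larger values concentrate
    weight on the lowest-loss checkpoints.
    """
    best = min(losses)
    scores = [math.exp(-penalty * (loss - best)) for loss in losses]
    total = sum(scores)
    return [s / total for s in scores]


def merge_checkpoints(checkpoints, weights):
    """Weighted average of parameters; each checkpoint maps a name to a list of floats."""
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(size)
        ]
    return merged


if __name__ == "__main__":
    # Hypothetical adapter weights from two checkpoints and their training losses.
    ckpts = [{"lora_a": [0.0, 2.0]}, {"lora_a": [2.0, 4.0]}]
    weights = loss_based_weights([1.0, 0.5], penalty=2.0)
    print(weights, merge_checkpoints(ckpts, weights))
```

With PEFT only the small adapter tensors need to be averaged, which is why such a merge adds little computational overhead.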

[LG-71] ZipR1: Reinforcing Token Sparsity in MLLMs

Link: https://arxiv.org/abs/2504.18579
Authors: Feng Chen, Yefei He, Lequan Lin, Jing Liu, Bohan Zhuang, Qi Wu
Subjects: Machine Learning (cs.LG)
Comments: work in process

Abstract:Sparse attention mechanisms aim to reduce computational overhead by selectively processing a subset of salient tokens while preserving model performance. Despite the effectiveness of such designs, how to actively encourage token sparsity of well-posed MLLMs remains under-explored, which fundamentally limits the achievable acceleration effect during inference. In this paper, we propose a simple RL-based post-training method named ZipR1 that treats the token reduction ratio as the efficiency reward and answer accuracy as the performance reward. In this way, our method can jointly alleviate the computation and memory bottlenecks via directly optimizing the inference-consistent efficiency-performance tradeoff. Experimental results demonstrate that ZipR1 can reduce the token ratio of Qwen2/2.5-VL from 80% to 25% with a minimal accuracy reduction on 13 image and video benchmarks.

[LG-72] An Artificial Intelligence-Based Framework for Predicting Emergency Department Overcrowding: Development and Evaluation Study

Link: https://arxiv.org/abs/2504.18578
Authors: Orhun Vural, Bunyamin Ozaydin, Khalid Y. Aram, James Booth, Brittany F. Lindsey, Abdulaziz Ahmed
Categories: Machine Learning (cs.LG)

Abstract:Background: Emergency department (ED) overcrowding remains a major challenge, causing delays in care and increased operational strain. Hospital management often reacts to congestion after it occurs. Machine learning predictive modeling offers a proactive approach by forecasting patient flow metrics, such as waiting count, to improve resource planning and hospital efficiency. Objective: This study develops machine learning models to predict ED waiting room occupancy at two time scales. The hourly model forecasts the waiting count six hours ahead (e.g., a 1 PM prediction for 7 PM), while the daily model estimates the average waiting count for the next 24 hours (e.g., a 5 PM prediction for the following day’s average). These tools support staffing decisions and enable earlier interventions to reduce overcrowding. Methods: Data from a partner hospital’s ED in the southeastern United States were used, integrating internal metrics and external features. Eleven machine learning algorithms, including traditional and deep learning models, were trained and evaluated. Feature combinations were optimized, and performance was assessed across varying patient volumes and hours. Results: TSiTPlus achieved the best hourly prediction (MAE: 4.19, MSE: 29.32). The mean hourly waiting count was 18.11, with a standard deviation of 9.77. Accuracy varied by hour, with MAEs ranging from 2.45 (11 PM) to 5.45 (8 PM). Extreme case analysis at one, two, and three standard deviations above the mean showed MAEs of 6.16, 10.16, and 15.59, respectively. For daily predictions, XCMPlus performed best (MAE: 2.00, MSE: 6.64), with a daily mean of 18.11 and standard deviation of 4.51. Conclusions: These models accurately forecast ED waiting room occupancy and support proactive resource allocation. Their implementation has the potential to improve patient flow and reduce overcrowding in emergency care settings. 

[LG-73] Intelligent Detection of Non-Essential IoT Traffic on the Home Gateway

Link: https://arxiv.org/abs/2504.18571
Authors: Fabio Palmese, Anna Maria Mandalari, Hamed Haddadi, Alessandro Enrico Cesare Redondi
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Paper accepted for publication at 10th International Workshop on Traffic Measurements for Cybersecurity (WTMC 2025)

Abstract:The rapid expansion of Internet of Things (IoT) devices, particularly in smart home environments, has introduced considerable security and privacy concerns due to their persistent connectivity and interaction with cloud services. Despite advancements in IoT security, effective privacy measures remain uncovered, with existing solutions often relying on cloud-based threat detection that exposes sensitive data or outdated allow-lists that inadequately restrict non-essential network traffic. This work presents ML-IoTrim, a system for detecting and mitigating non-essential IoT traffic (i.e., not influencing the device operations) by analyzing network behavior at the edge, leveraging Machine Learning to classify network destinations. Our approach includes building a labeled dataset based on IoT device behavior and employing a feature-extraction pipeline to enable a binary classification of essential vs. non-essential network destinations. We test our framework in a consumer smart home setup with IoT devices from five categories, demonstrating that the model can accurately identify and block non-essential traffic, including previously unseen destinations, without relying on traditional allow-lists. We implement our solution on a home access point, showing the framework has strong potential for scalable deployment, supporting near-real-time traffic classification in large-scale IoT environments with hundreds of devices. This research advances privacy-aware traffic control in smart homes, paving the way for future developments in IoT device privacy.

[LG-74] Residual-Evasive Attacks on ADMM in Distributed Optimization

Link: https://arxiv.org/abs/2504.18570
Authors: Sabrina Bruckmeier, Huadong Mo, James Qin
Categories: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 10 pages, 12 figures, 2 tables

Abstract:This paper presents two attack strategies designed to evade detection in ADMM-based systems by preventing significant changes to the residual during the attacked iteration. While many detection algorithms focus on identifying false data injection through residual changes, we show that our attacks remain undetected by keeping the residual largely unchanged. The first strategy uses a random starting point combined with Gram-Schmidt orthogonalization to ensure stealth, with potential for refinement by enhancing the orthogonal component to increase system disruption. The second strategy builds on the first, targeting financial gains by manipulating reactive power and pushing the system to its upper voltage limit, exploiting operational constraints. The effectiveness of the proposed attack-resilient mechanism is demonstrated through case studies on the IEEE 14-bus system. A comparison of the two strategies, along with commonly used naive attacks, reveals trade-offs between simplicity, detectability, and effectiveness, providing insights into ADMM system vulnerabilities. These findings underscore the need for more robust monitoring algorithms to protect against advanced attack strategies.
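The Gram-Schmidt step mentioned in the first strategy can be illustrated generically. The sketch below assumes that the residual only reacts to components along a known set of "protected" directions, which is a simplification and not the paper's ADMM model. It draws a random starting vector and removes those components. The names `gram_schmidt`, `orthogonal_perturbation`, and `protected` are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors: List[Vector], tol: float = 1e-12) -> List[Vector]:
    """Return an orthonormal basis for the span of `vectors` (modified Gram-Schmidt)."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that are linearly dependent on earlier ones
            basis.append([wi / norm for wi in w])
    return basis


def orthogonal_perturbation(protected: List[Vector], dim: int, seed: int = 0) -> Vector:
    """Random vector with its components along the protected directions removed."""
    rng = random.Random(seed)
    delta = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random starting point
    for b in gram_schmidt(protected):
        coeff = dot(delta, b)
        delta = [di - coeff * bi for di, bi in zip(delta, b)]
    return delta


# Illustrative directions only; a real system would derive them from its residual.
protected = [[1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
delta = orthogonal_perturbation(protected, dim=4)
```

Under the stated assumption, `delta` has no component along any protected direction, so quantities that depend only on those directions are left unchanged.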

[LG-75] Curiosity Driven Exploration to Optimize Structure-Property Learning in Microscopy

Link: https://arxiv.org/abs/2504.20011
Authors: Aditya Vatsavai, Ganesh Narasimha, Yongtao Liu, Jan-Chi Yang, Hiroshu Funakubo, Maxim Ziatdinov, Rama Vasudevan
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures

Abstract:Rapidly determining structure-property correlations in materials is an important challenge in better understanding fundamental mechanisms and greatly assists in materials design. In microscopy, imaging data provides a direct measurement of the local structure, while spectroscopic measurements provide relevant functional property information. Deep kernel active learning approaches have been utilized to rapidly map local structure to functional properties in microscopy experiments, but are computationally expensive for multi-dimensional and correlated output spaces. Here, we present an alternative lightweight curiosity algorithm which actively samples regions with unexplored structure-property relations, utilizing a deep-learning based surrogate model for error prediction. We show that the algorithm outperforms random sampling for predicting properties from structures, and provides a convenient tool for efficient mapping of structure-property relationships in materials science.

[LG-76] Graph Neural Network Prediction of Nonlinear Optical Properties

Link: https://arxiv.org/abs/2504.19987
Authors: Yomn Alkabakibi, Congwei Xie, Artem R. Oganov
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Optics (physics.optics)
Comments: 7 pages, 2 figures, 2 tables

Abstract:Nonlinear optical (NLO) materials for generating lasers via second harmonic generation (SHG) are highly sought in today’s technology. However, discovering novel materials with considerable SHG is challenging due to the time-consuming and costly nature of both experimental methods and first-principles calculations. In this study, we present a deep learning approach using the Atomistic Line Graph Neural Network (ALIGNN) to predict NLO properties. Sourcing data from the Novel Opto-Electronic Materials Discovery (NOEMD) database and using the Kurtz-Perry (KP) coefficient as the key target, we developed a robust model capable of accurately estimating nonlinear optical responses. Our results demonstrate that the model achieves 82.5% accuracy at a tolerated absolute error up to 1 pm/V and relative error not exceeding 0.5. This work highlights the potential of deep learning in accelerating the discovery and design of advanced optical materials with desired properties.

[LG-77] On Stopping Times of Power-one Sequential Tests: Tight Lower and Upper Bounds

Link: https://arxiv.org/abs/2504.19952
Authors: Shubhada Agrawal, Aaditya Ramdas
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages

Abstract:We prove two lower bounds for stopping times of sequential tests between general composite nulls and alternatives. The first lower bound is for the setting where the type-1 error level $\alpha$ approaches zero, and equals $\log(1/\alpha)$ divided by a certain infimum KL divergence, termed $\operatorname{KL_{inf}}$. The second lower bound applies to the setting where $\alpha$ is fixed and $\operatorname{KL_{inf}}$ approaches 0 (meaning that the null and alternative sets are not separated) and equals $c \operatorname{KL_{inf}}^{-1} \log \log \operatorname{KL_{inf}}^{-1}$ for a universal constant $c > 0$. We also provide a sufficient condition for matching the upper bounds and show that this condition is met in several special cases. Given past work, these upper and lower bounds are unsurprising in their form; our main contribution is the generality in which they hold, for example, not requiring reference measures or compactness of the classes.

[LG-78] Interpretable machine learning-guided design of Fe-based soft magnetic alloys

Link: https://arxiv.org/abs/2504.19787
Authors: Aditi Nachnani, Kai K. Li-Caldwell, Saptarshi Biswas, Prince Sharma, Gaoyuan Ouyang, Prashant Singh
Categories: Materials Science (cond-mat.mtrl-sci); Other Condensed Matter (cond-mat.other); Machine Learning (cs.LG)
Comments: 24 pages, 6 figures, 1 table

Abstract:We present a machine-learning guided approach to predict saturation magnetization (MS) and coercivity (HC) in Fe-rich soft magnetic alloys, particularly Fe-Si-B systems. ML models trained on experimental data reveal that increasing Si and B content reduces MS from 1.81 T (DFT ~2.04 T) to ~1.54 T (DFT ~1.56 T) in Fe-Si-B, which is attributed to decreased magnetic density and structural modifications. Experimental validation of ML-predicted magnetic saturation on Fe-1Si-1B (2.09 T), Fe-5Si-5B (2.01 T) and Fe-10Si-10B (1.54 T) alloy compositions further supports our findings. These trends are consistent with density functional theory (DFT) predictions, which link increased electronic disorder and band broadening to lower MS values. Experimental validation on selected alloys confirms the predictive accuracy of the ML model, with good agreement across compositions. Beyond predictive accuracy, detailed uncertainty quantification and model interpretability, including feature importance and partial dependence analysis, reveal that MS is governed by a nonlinear interplay between Fe content, early transition metal ratios, and annealing temperature, while HC is more sensitive to processing conditions such as ribbon thickness and thermal treatment windows. The ML framework was further applied to Fe-Si-B/Cr/Cu/Zr/Nb alloys in a pseudo-quaternary compositional space, which shows comparable magnetic properties to NANOMET (Fe84.8Si0.5B9.4Cu0.8P3.5C1), FINEMET (Fe73.5Si13.5B9Cu1Nb3), NANOPERM (Fe88Zr7B4Cu1), and HITPERM (Fe44Co44Zr7B4Cu1). Our findings demonstrate the potential of the ML framework for accelerated search of high-performance, Co- and Ni-free, soft magnetic materials.

[LG-79] Neuronal correlations shape the scaling behavior of memory capacity and nonlinear computational capability of recurrent neural networks

Link: https://arxiv.org/abs/2504.19657
Authors: Shotaro Takasu, Toshio Aoyagi
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 19 pages, 8 figures

Abstract:Reservoir computing is a powerful framework for real-time information processing, characterized by its high computational ability and quick learning, with applications ranging from machine learning to biological systems. In this paper, we demonstrate that the memory capacity of a reservoir recurrent neural network scales sublinearly with the number of readout neurons. To elucidate this phenomenon, we develop a theoretical framework for analytically deriving memory capacity, attributing the decaying growth of memory capacity to neuronal correlations. In addition, numerical simulations reveal that once memory capacity becomes sublinear, increasing the number of readout neurons successively enables nonlinear processing at progressively higher polynomial orders. Furthermore, our theoretical framework suggests that neuronal correlations govern not only memory capacity but also the sequential growth of nonlinear computational capabilities. Our findings establish a foundation for designing scalable and cost-effective reservoir computing, providing novel insights into the interplay among neuronal correlations, linear memory, and nonlinear processing.

[LG-80] QFDNN: A Resource-Efficient Variational Quantum Feature Deep Neural Networks for Fraud Detection and Loan Prediction

Link: https://arxiv.org/abs/2504.19632
Authors: Subham Das, Ashtakala Meghanath, Bikash K. Behera, Shahid Mumtaz, Saif Al-Kuwari, Ahmed Farouk
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 8 tables

Abstract:Social financial technology focuses on trust, sustainability, and social responsibility, which require advanced technologies to address complex financial tasks in the digital era. With the rapid growth in online transactions, automating credit card fraud detection and loan eligibility prediction has become increasingly challenging. Classical machine learning (ML) models have been used to solve these challenges; however, these approaches often encounter scalability problems, overfitting, and high computational costs due to the complexity and high dimensionality of financial data. Quantum computing (QC) and quantum machine learning (QML) provide a promising solution for efficiently processing high-dimensional datasets and enabling real-time identification of subtle fraud patterns. However, existing quantum algorithms lack robustness in noisy environments and fail to optimize performance with reduced feature sets. To address these limitations, we propose a quantum feature deep neural network (QFDNN), a novel, resource-efficient, and noise-resilient quantum model that optimizes feature representation while requiring fewer qubits and simpler variational circuits. The model is evaluated using credit card fraud detection and loan eligibility prediction datasets, achieving competitive accuracies of 82.2% and 74.4%, respectively, with reduced computational overhead. Furthermore, we test QFDNN against six noise models, demonstrating its robustness across various error conditions. Our findings highlight QFDNN's potential to enhance trust and security in social financial technology by accurately detecting fraudulent transactions while supporting sustainability through its resource-efficient design and minimal computational overhead.

[LG-81] Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities

Link: https://arxiv.org/abs/2504.19596
Authors: Xi Fu, Wei-Bang Jiang, Yi Ding, Cuntai Guan
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures

Abstract:Multimodal physiological signals, such as EEG, ECG, EOG, and EMG, are crucial for healthcare and brain-computer interfaces. While existing methods rely on specialized architectures and dataset-specific fusion strategies, they struggle to learn universal representations that generalize across datasets and handle missing modalities at inference time. To address these issues, we propose PhysioOmni, a foundation model for multimodal physiological signal analysis that models both homogeneous and heterogeneous features to decouple multimodal signals and extract generic representations while maintaining compatibility with arbitrary missing modalities. PhysioOmni trains a decoupled multimodal tokenizer, enabling masked signal pre-training via modality-invariant and modality-specific objectives. To ensure adaptability to diverse and incomplete modality combinations, the pre-trained encoders undergo resilient fine-tuning with prototype alignment on downstream datasets. Extensive experiments on four downstream tasks, emotion recognition, sleep stage classification, motor prediction, and mental workload detection, demonstrate that PhysioOmni achieves state-of-the-art performance while maintaining strong robustness to missing modalities. Our code and model weights will be released.

[LG-82] Two-parameter superposable S-curves

Link: https://arxiv.org/abs/2504.19488
Authors: Vijay Prakash S
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:The straight-line equation $y=mx$ with slope $m$, when singularly perturbed as $ay^3+y=mx$ with a positive parameter $a$, results in S-shaped curves or S-curves on a real plane. As $a\rightarrow 0$, we get back $y=mx$, which is a cumulative distribution function of a continuous uniform distribution that describes the occurrence of every event in an interval to be equally probable. As $a\rightarrow\infty$, the derivative of $y$ has finite support only at $y=0$, resembling a degenerate distribution. Based on these arguments, in this work, we propose that these S-curves can represent maximum entropy uniform distribution to a zero entropy single value. We also argue that these S-curves are superposable as they are only parametrically nonlinear but fundamentally linear. So far, the superposed forms have been used to capture the patterns of natural systems such as nonlinear dynamics of biological growth and kinetics of enzyme reactions. Here, we attempt to use the S-curve and its superposed form as a statistical model. We fit the models on a classical dataset containing flower measurements of iris plants and analyze their usefulness in pattern recognition. Based on these models, we claim that any non-uniform pattern can be represented as a singular perturbation to uniform distribution. However, our parametric estimation procedure has some limitations, such as sensitivity to initial conditions depending on the data at hand.
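The perturbed equation $ay^3+y=mx$ defines $y$ only implicitly, so evaluating the S-curve means solving a cubic at each $x$. A minimal sketch follows, using bisection instead of any procedure from the paper. It relies on the fact that $ay^3+y$ is strictly increasing for $a \ge 0$, so the real root is unique and lies in $[-|mx|, |mx|]$. The function name `s_curve` and the iteration count are illustrative choices.

```python
def s_curve(x: float, m: float, a: float, iterations: int = 100) -> float:
    """Solve a*y**3 + y = m*x for the unique real y (requires a >= 0)."""
    if a < 0:
        raise ValueError("a must be non-negative")
    target = m * x
    # g(y) = a*y**3 + y is increasing and |g(y)| >= |y|, so the root is bracketed here.
    lo, hi = -abs(target), abs(target)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if a * mid ** 3 + mid < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# a = 0 recovers the straight line; larger a flattens the curve away from the origin.
line_value = s_curve(3.0, m=0.5, a=0.0)
curve_values = [s_curve(x, m=1.0, a=1.0) for x in (-2.0, 0.0, 2.0)]
```

Bisection is chosen here for robustness near $a \to 0$, where a closed-form cubic solution can lose precision.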

[LG-83] Optimal Sequential Recommendations: Exploiting User and Item Structure

Link: https://arxiv.org/abs/2504.19476
Authors: Mina Karzand, Guy Bresler
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 91 pages, 7 figures

Abstract:We consider an online model for recommendation systems, with each user being recommended an item at each time-step and providing ‘like’ or ‘dislike’ feedback. A latent variable model specifies the user preferences: both users and items are clustered into types. The model captures structure in both the item and user spaces, as used by item-item and user-user collaborative filtering algorithms. We study the situation in which the type preference matrix has i.i.d. entries. Our main contribution is an algorithm that simultaneously uses both item and user structures, proved to be near-optimal via corresponding information-theoretic lower bounds. In particular, our analysis highlights the sub-optimality of using only one of item or user structure (as is done in most collaborative filtering algorithms).

[LG-84] Model uncertainty quantification using feature confidence sets for outcome excursions

Link: https://arxiv.org/abs/2504.19464
Authors: Junting Ren, Armin Schwartzman
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:When implementing prediction models for high-stakes real-world applications such as medicine, finance, and autonomous systems, quantifying prediction uncertainty is critical for effective risk management. Traditional approaches to uncertainty quantification, such as confidence and prediction intervals, provide probability coverage guarantees for the expected outcomes $f(\boldsymbol{x})$ or the realized outcomes $f(\boldsymbol{x})+\epsilon$. Instead, this paper introduces a novel, model-agnostic framework for quantifying uncertainty in continuous and binary outcomes using confidence sets for outcome excursions, where the goal is to identify a subset of the feature space where the expected or realized outcome exceeds a specific value. The proposed method constructs data-dependent inner and outer confidence sets that aim to contain the true feature subset for which the expected or realized outcomes of these features exceed a specified threshold. We establish theoretical guarantees for the probability that these confidence sets contain the true feature subset, both asymptotically and for finite sample sizes. The framework is validated through simulations and applied to real-world datasets, demonstrating its utility in contexts such as housing price prediction and time to sepsis diagnosis in healthcare. This approach provides a unified method for uncertainty quantification that is broadly applicable across various continuous and binary prediction models.

[LG-85] Composable and adaptive design of machine learning interatomic potentials guided by Fisher-information analysis

Link: https://arxiv.org/abs/2504.19372
Authors: Weishi Wang, Mark K. Transtrum, Vincenzo Lordi, Vasily V. Bulatov, Amit Samanta
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Numerical Analysis (math.NA); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
Comments: 18 pages, 7 figures, and 6 tables

Abstract:An adaptive physics-informed model design strategy for machine-learning interatomic potentials (MLIPs) is proposed. This strategy follows an iterative reconfiguration of composite models from single-term models, followed by a unified training procedure. A model evaluation method based on the Fisher information matrix (FIM) and multiple-property error metrics is proposed to guide model reconfiguration and hyperparameter optimization. Combining the model reconfiguration and the model evaluation subroutines, we provide an adaptive MLIP design strategy that balances flexibility and extensibility. In a case study of designing models against a structurally diverse niobium dataset, we managed to obtain an optimal configuration with 75 parameters generated by our framework that achieved a force RMSE of 0.172 eV/Å and an energy RMSE of 0.013 eV/atom.

[LG-86] Metric Similarity and Manifold Learning of Circular Dichroism Spectra of Proteins

Link: https://arxiv.org/abs/2504.19355
Authors: Gionni Marchetti
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)
Comments: 13 pages, 16 figures

Abstract:We present a machine learning analysis of circular dichroism spectra of globular proteins from the SP175 database, using the optimal transport-based $1$-Wasserstein distance $\mathcal{W}_1$ (with order $p=1$) and the manifold learning algorithm $t$-SNE. Our results demonstrate that $\mathcal{W}_1$ is consistent with both Euclidean and Manhattan metrics while exhibiting robustness to noise. On the other hand, $t$-SNE uncovers meaningful structure in the high-dimensional data. The clustering in the $t$-SNE embedding is primarily determined by proteins with distinct secondary structure compositions: one cluster predominantly contains $\beta$-rich proteins, while the other consists mainly of proteins with mixed $\alpha/\beta$ and $\alpha$-helical content.
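For one-dimensional distributions, the 1-Wasserstein distance equals the area between the two cumulative distribution functions, which makes it easy to compute on a shared grid. The sketch below assumes non-negative intensities that are normalized to unit mass. The abstract does not say how signed circular dichroism spectra are preprocessed, so that step is left out, and the function name `wasserstein_1` is illustrative.

```python
from typing import Sequence


def wasserstein_1(grid: Sequence[float], p: Sequence[float], q: Sequence[float]) -> float:
    """1-Wasserstein distance between two discrete distributions on a shared grid.

    Uses W_1 = integral of |F_p - F_q|, with each CDF piecewise constant
    between consecutive grid points.
    """
    if not (len(grid) == len(p) == len(q)):
        raise ValueError("grid, p, and q must have the same length")
    if any(b <= a for a, b in zip(grid, grid[1:])):
        raise ValueError("grid must be strictly increasing")
    if min(p) < 0 or min(q) < 0 or sum(p) <= 0 or sum(q) <= 0:
        raise ValueError("weights must be non-negative with positive total mass")
    total_p, total_q = sum(p), sum(q)
    cdf_p = cdf_q = 0.0
    distance = 0.0
    for i in range(len(grid) - 1):
        cdf_p += p[i] / total_p
        cdf_q += q[i] / total_q
        distance += abs(cdf_p - cdf_q) * (grid[i + 1] - grid[i])
    return distance


# All mass at 0 versus all mass at 2: the mass has to travel a distance of 2.
grid = [0.0, 1.0, 2.0]
far = wasserstein_1(grid, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
same = wasserstein_1(grid, [1.0, 2.0, 1.0], [1.0, 2.0, 1.0])
```

Unlike a pointwise Euclidean or Manhattan comparison, this distance grows with how far intensity has to move along the wavelength axis, not only with how much the values differ at each point.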

[LG-87] The Double Descent Behavior in Two Layer Neural Network for Binary Classification

Link: https://arxiv.org/abs/2504.19351
Authors: Chathurika S Abeykoon, Aleksandr Beknazaryan, Hailin Sang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Recent studies observed a surprising concept on model test error called the double descent phenomenon, where the increasing model complexity decreases the test error first and then the error increases and decreases again. To observe this, we work on a two layer neural network model with a ReLU activation function designed for binary classification under supervised learning. Our aim is to observe and investigate the mathematical theory behind the double descent behavior of model test error for varying model sizes. We quantify the model size by the ratio of number of training samples to the dimension of the model. Due to the complexity of the empirical risk minimization procedure, we use the Convex Gaussian Min Max Theorem to find a suitable candidate for the global training loss.

[LG-88] Contextual Online Uncertainty-Aware Preference Learning for Human Feedback

Link: https://arxiv.org/abs/2504.19342
Authors: Nan Lu, Ethan X. Fang, Junwei Lu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm in artificial intelligence to align large models with human preferences. In this paper, we propose a novel statistical framework to simultaneously conduct the online decision-making and statistical inference on the optimal model using human preference data based on dynamic contextual information. Our approach introduces an efficient decision strategy that achieves both the optimal regret bound and the asymptotic distribution of the estimators. A key challenge in RLHF is handling the dependent online human preference outcomes with dynamic contexts. To address this, in the methodological aspect, we propose a two-stage algorithm starting with $\epsilon$-greedy followed by exploitations; in the theoretical aspect, we tailor anti-concentration inequalities and matrix martingale concentration techniques to derive the uniform estimation rate and asymptotic normality of the estimators using dependent samples from both stages. Extensive simulation results demonstrate that our method outperforms state-of-the-art strategies. We apply the proposed framework to analyze the human preference data for ranking large language models on the Massive Multitask Language Understanding dataset, yielding insightful results on the performance of different large language models for medical anatomy knowledge.
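The two-stage structure (an $\epsilon$-greedy phase followed by pure exploitation) can be illustrated on a much simpler problem than the one in the paper. The sketch below uses a non-contextual Bernoulli bandit as a stand-in. It does not model contexts or pairwise preferences, and `two_stage_policy`, `explore_rounds`, and the toy success probabilities are hypothetical.

```python
import random
from typing import List, Tuple


def two_stage_policy(true_means: List[float], horizon: int, explore_rounds: int,
                     epsilon: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Stage 1: epsilon-greedy for `explore_rounds` steps. Stage 2: always greedy."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total reward per arm
    choices = []
    for t in range(horizon):
        estimates = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]
        greedy = max(range(k), key=lambda i: estimates[i])
        if t < explore_rounds and rng.random() < epsilon:
            arm = rng.randrange(k)   # explore uniformly at random
        else:
            arm = greedy             # exploit the current best estimate
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices, counts


choices, counts = two_stage_policy([0.2, 0.5, 0.8], horizon=500,
                                   explore_rounds=100, epsilon=0.3)
```

Estimates keep updating in the second stage, so samples from both stages feed the final estimator, loosely mirroring the abstract's mention of dependent samples from both stages.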

[LG-89] The effect of the number of parameters and the number of local feature patches on loss landscapes in distributed quantum neural networks

Link: https://arxiv.org/abs/2504.19239
Authors: Yoshiaki Kawase
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 9 pages + Appendices

Abstract:Quantum neural networks hold promise for tackling computationally challenging tasks that are intractable for classical computers. However, their practical application is hindered by significant optimization challenges, arising from complex loss landscapes characterized by barren plateaus and numerous local minima. These problems become more severe as the number of parameters or qubits increases, hampering effective training. To mitigate these optimization challenges, particularly for quantum machine learning applied to classical data, we employ an approach of distributing overlapping local patches across multiple quantum neural networks, processing each patch with an independent quantum neural network, and aggregating their outputs for prediction. In this study, we investigate how the number of parameters and patches affects the loss landscape geometry of this distributed quantum neural network architecture via Hessian analysis and loss landscape visualization. Our results confirm that increasing the number of parameters tends to lead to deeper and sharper loss landscapes. Crucially, we demonstrate that increasing the number of patches significantly reduces the largest Hessian eigenvalue at minima. This finding suggests that our distributed patch approach acts as a form of implicit regularization, promoting optimization stability and potentially enhancing generalization. Our study provides valuable insights into optimization challenges and highlights that the distributed patch approach is a promising strategy for developing more trainable and practical quantum machine learning models for classical data tasks.

[LG-90] Test Set Sizing for the Ridge Regression

Link: https://arxiv.org/abs/2504.19231
Authors: Alexander Dubbs
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:We derive the ideal train/test split for the ridge regression to high accuracy in the limit that the number of training rows m becomes large. The split must depend on the ridge tuning parameter, alpha, but we find that the dependence is weak and can asymptotically be ignored; all parameters vanish except for m and the number of features, n. This is the first time that such a split is calculated mathematically for a machine learning model in the large data limit. The goal of the calculations is to maximize “integrity,” so that the measured error in the trained model is as close as possible to what it theoretically should be. This paper’s result for the ridge regression split matches prior art for the plain vanilla linear regression split to the first two terms asymptotically, and it appears that practically there is no difference.

[LG-91] Global Climate Model Bias Correction Using Deep Learning

Link: https://arxiv.org/abs/2504.19145
Authors: Abhishek Pasula, Deepak N. Subramani
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 40 pages, 15 figures

Abstract:Climate change affects ocean temperature, salinity and sea level, impacting monsoons and ocean productivity. Future projections by Global Climate Models based on shared socioeconomic pathways from the Coupled Model Intercomparison Project (CMIP) are widely used to understand the effects of climate change. However, CMIP models have significant bias compared to reanalysis in the Bay of Bengal for the time period when both projections and reanalysis are available. For example, there is a 1.5°C root mean square error (RMSE) in the sea surface temperature (SST) projections of the climate model CNRM-CM6 compared to the Ocean Reanalysis System (ORAS5). We develop a suite of data-driven deep learning models for bias correction of climate model projections and apply it to correct SST projections of the Bay of Bengal. We propose the use of three different deep neural network architectures: convolutional encoder-decoder UNet, Bidirectional LSTM and ConvLSTM. We also use a baseline linear regression model and the Equi-Distant Cumulative Density Function (EDCDF) bias correction method for comparison and for evaluating the impact of the new deep learning models. All bias correction models are trained using pairs of monthly CMIP6 projections and the corresponding month’s ORAS5 as input and output. Historical data (1950-2014) and future projection data (2015-2020) of CNRM-CM6 are used for training and validation, including hyperparameter tuning. Testing is performed on future projection data from 2021 to 2024. Detailed analysis of the three deep neural models has been completed. We found that the UNet architecture trained using a climatology-removed CNRM-CM6 projection as input and climatology-removed ORAS5 as output gives the best bias-corrected projections. Our novel deep learning-based method for correcting CNRM-CM6 data achieves a 15% reduction in RMSE compared to EDCDF.

[LG-92] Inverse-Transpilation: Reverse-Engineering Quantum Compiler Optimization Passes from Circuit Snapshots

Link: https://arxiv.org/abs/2504.19113
Authors: Satwik Kundu, Swaroop Ghosh
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Circuit compilation, a crucial process for adapting quantum algorithms to hardware constraints, often operates as a “black box,” with limited visibility into the optimization techniques used by proprietary systems or advanced open-source frameworks. Due to fundamental differences in qubit technologies, efficient compiler design is an expensive process, further exposing these systems to various security threats. In this work, we take a first step toward evaluating one such challenge affecting compiler confidentiality, specifically, reverse-engineering compilation methodologies. We propose a simple ML-based framework to infer underlying optimization techniques by leveraging structural differences observed between original and compiled circuits. The motivation is twofold: (1) enhancing transparency in circuit optimization for improved cross-platform debugging and performance tuning, and (2) identifying potential intellectual property (IP)-protected optimizations employed by commercial systems. Our extensive evaluation across thousands of quantum circuits shows that a neural network performs the best in detecting optimization passes, with individual pass F1-scores reaching as high as 0.96. Thus, our initial study demonstrates the viability of this threat to compiler confidentiality and underscores the need for active research in this area.

[LG-93] QFGN: A Quantum Approach to High-Fidelity Implicit Neural Representations

Link: https://arxiv.org/abs/2504.19053
Authors: Hongni Jin, Gurinder Singh, Kenneth M. Merz Jr
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Implicit neural representations have shown potential in various applications. However, accurately reconstructing the image or providing clear details via image super-resolution remains challenging. This paper introduces Quantum Fourier Gaussian Network (QFGN), a quantum-based machine learning model for better signal representations. The frequency spectrum is well balanced by penalizing the low-frequency components, leading to the improved expressivity of quantum circuits. The results demonstrate that with minimal parameters, QFGN outperforms the current state-of-the-art (SOTA) models. Despite noise on hardware, the model achieves accuracy comparable to that of SIREN, highlighting the potential applications of quantum machine learning in this field.

[LG-94] Geometry-aware Active Learning of Spatiotemporal Dynamic Systems

Link: https://arxiv.org/abs/2504.19012
Authors: Xizhuo (Cici) Zhang, Bing Yao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Rapid developments in advanced sensing and imaging have significantly enhanced information visibility, opening opportunities for predictive modeling of complex dynamic systems. However, sensing signals acquired from such complex systems are often distributed across 3D geometries and rapidly evolving over time, posing significant challenges in spatiotemporal predictive modeling. This paper proposes a geometry-aware active learning framework for modeling spatiotemporal dynamic systems. Specifically, we propose a geometry-aware spatiotemporal Gaussian Process (G-ST-GP) to effectively integrate the temporal correlations and geometric manifold features for reliable prediction of high-dimensional dynamic behaviors. In addition, we develop an adaptive active learning strategy to strategically identify informative spatial locations for data collection and further maximize the prediction accuracy. This strategy achieves the adaptive trade-off between the prediction uncertainty in the G-ST-GP model and the space-filling design guided by the geodesic distance across the 3D geometry. We implement the proposed framework to model the spatiotemporal electrodynamics in a 3D heart geometry. Numerical experiments show that our framework outperforms traditional methods lacking the mechanism of geometric information incorporation or effective data collection.

[LG-95] Learning Stochastic Thermodynamics Directly from Correlation and Trajectory-Fluctuation Currents

Link: https://arxiv.org/abs/2504.19007
Authors: Jinghao Lyu, Kyle J. Ray, James P. Crutchfield
Categories: Statistical Mechanics (cond-mat.stat-mech); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO); Machine Learning (stat.ML)
Comments: 11 pages, 6 appendices (10 pages), 4 figures; this https URL

Abstract:Markedly increased computational power and data acquisition have led to growing interest in data-driven inverse dynamics problems. These seek to answer a fundamental question: What can we learn from time series measurements of a complex dynamical system? For small systems interacting with external environments, the effective dynamics are inherently stochastic, making it crucial to properly manage noise in data. Here, we explore this for systems obeying Langevin dynamics and, using currents, we construct a learning framework for stochastic modeling. Currents have recently gained increased attention for their role in bounding entropy production (EP) from thermodynamic uncertainty relations (TURs). We introduce a fundamental relationship between the cumulant currents there and standard machine-learning loss functions. Using this, we derive loss functions for several key thermodynamic functions directly from the system dynamics without the (common) intermediate step of deriving a TUR. These loss functions reproduce results derived both from TURs and other methods. More significantly, they open a path to discover new loss functions for previously inaccessible quantities. Notably, this includes access to per-trajectory entropy production, even if the observed system is driven far from its steady-state. We also consider higher order estimation. Our method is straightforward and unifies dynamic inference with recent approaches to entropy production estimation. Taken altogether, this reveals a deep connection between diffusion models in machine learning and entropy production estimation in stochastic thermodynamics.

[LG-96] Modeling Regime Structure and Informational Drivers of Stock Market Volatility via the Financial Chaos Index

Link: https://arxiv.org/abs/2504.18958
Authors: Masoud Ataei
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)

Abstract:This paper investigates the structural dynamics of stock market volatility through the Financial Chaos Index, a tensor- and eigenvalue-based measure designed to capture realized volatility via mutual fluctuations among asset prices. Motivated by empirical evidence of regime-dependent volatility behavior and perceptual time dilation during financial crises, we develop a regime-switching framework based on the Modified Lognormal Power-Law distribution. Analysis of the FCIX from January 1990 to December 2023 identifies three distinct market regimes, low-chaos, intermediate-chaos, and high-chaos, each characterized by differing levels of systemic stress, statistical dispersion and persistence characteristics. Building upon the segmented regime structure, we further examine the informational forces that shape forward-looking market expectations. Using sentiment-based predictors derived from the Equity Market Volatility tracker, we employ an elastic net regression model to forecast implied volatility, as proxied by the VIX index. Our findings indicate that shifts in macroeconomic, financial, policy, and geopolitical uncertainty exhibit strong predictive power for volatility dynamics across regimes. Together, these results offer a unified empirical perspective on how systemic uncertainty governs both the realized evolution of financial markets and the anticipatory behavior embedded in implied volatility measures.

[LG-97] A Langevin sampling algorithm inspired by the Adam optimizer

Link: https://arxiv.org/abs/2504.18911
Authors: Benedict Leimkuhler, René Lohmann, Peter Whalley
Categories: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We present a framework for adaptive-stepsize MCMC sampling based on time-rescaled Langevin dynamics, in which the stepsize variation is dynamically driven by an additional degree of freedom. Our approach augments the phase space by an additional variable which in turn defines a time reparameterization. The use of an auxiliary relaxation equation allows accumulation of a moving average of a local monitor function and provides for precise control of the timestep while circumventing the need to modify the drift term in the physical system. Our algorithm is straightforward to implement and can be readily combined with any off-the-peg fixed-stepsize Langevin integrator. As a particular example, we consider control of the stepsize by monitoring the norm of the log-posterior gradient, which takes inspiration from the Adam optimizer, the stepsize being automatically reduced in regions of steep change of the log posterior and increased on plateaus, improving numerical stability and convergence speed. As in Adam, the stepsize variation depends on the recent history of the gradient norm, which enhances stability and improves accuracy compared to more immediate control approaches. We demonstrate the potential benefit of this method–both in accuracy and in stability–in numerical experiments including Neal’s funnel and a Bayesian neural network for classification of MNIST data.

[LG-98] ReLU integral probability metric and its applications

Link: https://arxiv.org/abs/2504.18897
Authors: Yuha Park, Kunwoong Kim, Insung Kong, Yongdai Kim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 49 pages, 9 figures

Abstract:We propose a parametric integral probability metric (IPM) to measure the discrepancy between two probability measures. The proposed IPM leverages a specific parametric family of discriminators, such as single-node neural networks with ReLU activation, to effectively distinguish between distributions, making it applicable in high-dimensional settings. By optimizing over the parameters of the chosen discriminator class, the proposed IPM demonstrates that its estimators have good convergence rates and can serve as a surrogate for other IPMs that use smooth nonparametric discriminator classes. We present an efficient algorithm for practical computation, offering a simple implementation and requiring fewer hyperparameters. Furthermore, we explore its applications in various tasks, such as covariate balancing for causal inference and fair representation learning. Across such diverse applications, we demonstrate that the proposed IPM provides strong theoretical guarantees, and empirical experiments show that it achieves comparable or even superior performance to other methods.
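An integral probability metric has the general form $\sup_{f \in \mathcal{F}} |\mathbb{E}_P f - \mathbb{E}_Q f|$; with single-node ReLU discriminators, each $f$ is $x \mapsto \max(0, w \cdot x + b)$. The following is only a rough sketch of an empirical estimate. The constraints on `w` and `b` and the random-search optimizer are assumptions made for illustration; the abstract does not specify them, and the paper's own algorithm is likely different.

```python
import math
import random

def relu_ipm_estimate(xs, ys, n_candidates=500, seed=0):
    """Crude estimate of sup_{w,b} |mean relu(w.x + b) on xs - same on ys|.

    Assumptions (not from the abstract): w lies on the unit sphere,
    b lies in [-1, 1], and the supremum is approximated by random search.
    """
    rng = random.Random(seed)
    dim = len(xs[0])

    def mean_relu(points, w, b):
        # Average output of one ReLU node over a sample.
        return sum(max(0.0, sum(wi * pi for wi, pi in zip(w, p)) + b)
                   for p in points) / len(points)

    best = 0.0
    for _ in range(n_candidates):
        # Draw a random direction and normalise it to unit length.
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(wi * wi for wi in w)) or 1.0
        w = [wi / norm for wi in w]
        b = rng.uniform(-1.0, 1.0)
        best = max(best, abs(mean_relu(xs, w, b) - mean_relu(ys, w, b)))
    return best

# Toy data: Q is P shifted by 1.5 along the first coordinate.
rng = random.Random(1)
p_sample = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
q_sample = [[rng.gauss(1.5, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
same = relu_ipm_estimate(p_sample, p_sample)      # identical samples
shifted = relu_ipm_estimate(p_sample, q_sample)   # shifted samples
```

Comparing a sample with itself gives exactly zero, while the shifted sample gives a clearly positive value. In practice the search over `w` and `b` would be replaced by a proper optimizer.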

[LG-99] A Dictionary of Closed-Form Kernel Mean Embeddings

Link: https://arxiv.org/abs/2504.18830
Authors: François-Xavier Briol, Alexandra Gessner, Toni Karvonen, Maren Mahsereci
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Computation (stat.CO)

Abstract:Kernel mean embeddings – integrals of a kernel with respect to a probability distribution – are essential in Bayesian quadrature, but also widely used in other computational tools for numerical integration or for statistical inference based on the maximum mean discrepancy. These methods often require, or are enhanced by, the availability of a closed-form expression for the kernel mean embedding. However, deriving such expressions can be challenging, limiting the applicability of kernel-based techniques when practitioners do not have access to a closed-form embedding. This paper addresses this limitation by providing a comprehensive dictionary of known kernel mean embeddings, along with practical tools for deriving new embeddings from known ones. We also provide a Python library that includes minimal implementations of the embeddings.

[LG-100] Local Polynomial Lp-norm Regression

Link: https://arxiv.org/abs/2504.18695
Authors: Ladan Tazik (1), James Stafford (2), John Braun (1) ((1) Dept. of Computer Science, Mathematics, Physics and Statistics, University of British Columbia, Okanagan campus, (2) Dept. of Statistical Sciences, University of Toronto)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Other Statistics (stat.OT)

Abstract:The local least squares estimator for a regression curve cannot provide optimal results when non-Gaussian noise is present. Both theoretical and empirical evidence suggests that residuals often exhibit distributional properties different from those of a normal distribution, making it worthwhile to consider estimation based on other norms. It is suggested that $L_p$-norm estimators be used to minimize the residuals when these exhibit non-normal kurtosis. In this paper, we propose a local polynomial $L_p$-norm regression that replaces weighted least squares estimation with weighted $L_p$-norm estimation for fitting the polynomial locally. We also introduce a new method for estimating the parameter $p$ from the residuals, enhancing the adaptability of the approach. Through numerical and theoretical investigation, we demonstrate our method’s superiority over local least squares in one-dimensional data and show promising outcomes for higher dimensions, specifically in 2D.

[LG-101] Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: $\sqrt{T}$-Regret

Link: https://arxiv.org/abs/2504.18657
Authors: Benjamin Schiffer, Lucas Janson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Understanding how to efficiently learn while adhering to safety constraints is essential for using online reinforcement learning in practical applications. However, proving rigorous regret bounds for safety-constrained reinforcement learning is difficult due to the complex interaction between safety, exploration, and exploitation. In this work, we seek to establish foundations for safety-constrained reinforcement learning by studying the canonical problem of controlling a one-dimensional linear dynamical system with unknown dynamics. We study the safety-constrained version of this problem, where the state must with high probability stay within a safe region, and we provide the first safe algorithm that achieves regret of $\tilde{O}_T(\sqrt{T})$. Furthermore, the regret is with respect to the baseline of truncated linear controllers, a natural baseline of non-linear controllers that are well-suited for safety-constrained linear systems. In addition to introducing this new baseline, we also prove several desirable continuity properties of the optimal controller in this baseline. In showing our main result, we prove that whenever the constraints impact the optimal controller, the non-linearity of our controller class leads to a faster rate of learning than in the unconstrained setting.

[LG-102] Statistical Inference for Clustering-based Anomaly Detection

Link: https://arxiv.org/abs/2504.18633
Authors: Nguyen Thi Minh Phu, Duong Tan Loc, Vo Nguyen Le Duy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference for CLustering-based Anomaly Detection), a novel statistical framework for testing the clustering-based AD results. The key strength of SI-CLAD lies in its ability to rigorously control the probability of falsely identifying anomalies, maintaining it below a pre-specified significance level $\alpha$ (e.g., $\alpha = 0.05$). By analyzing the selection mechanism inherent in clustering-based AD and leveraging the Selective Inference (SI) framework, we prove that false detection control is attainable. Moreover, we introduce a strategy to boost the true detection rate, enhancing the overall performance of SI-CLAD. Extensive experiments on synthetic and real-world datasets provide strong empirical support for our theoretical findings, showcasing the superior performance of the proposed method.

[LG-103] Explainable Deep-Learning Based Potentially Hazardous Asteroids Classification Using Graph Neural Networks

Link: https://arxiv.org/abs/2504.18605
Authors: Baimam Boukar Jean Jacques
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract:Classifying potentially hazardous asteroids (PHAs) is crucial for planetary defense and deep space navigation, yet traditional methods often overlook the dynamical relationships among asteroids. We introduce a Graph Neural Network (GNN) approach that models asteroids as nodes with orbital and physical features, connected by edges representing their similarities, using a NASA dataset of 958,524 records. Despite an extreme class imbalance with only 0.22% of the dataset with the hazardous label, our model achieves an overall accuracy of 99% and an AUC of 0.99, with a recall of 78% and an F1-score of 37% for hazardous asteroids after applying the Synthetic Minority Oversampling Technique. Feature importance analysis highlights albedo, perihelion distance, and semi-major axis as main predictors. This framework supports planetary defense missions and confirms AI’s potential in enabling autonomous navigation for future missions such as NASA’s NEO Surveyor and ESA’s Ramses, offering an interpretable and scalable solution for asteroid hazard assessment.
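The recall (78%) and F1-score (37%) quoted for the hazardous class together determine the precision, which the abstract does not report. Since $F_1 = 2PR/(P+R)$, solving for $P$ gives $P = F_1 R / (2R - F_1)$. The short calculation below uses the rounded figures from the abstract, so the result is approximate, and it assumes both figures refer to the same evaluation set.

```python
def precision_from_f1_and_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover precision P."""
    return f1 * recall / (2 * recall - f1)

recall, f1 = 0.78, 0.37      # rounded values quoted in the abstract
positive_rate = 0.0022       # 0.22% of records carry the hazardous label

precision = precision_from_f1_and_recall(f1, recall)
# Accuracy of a trivial classifier that always predicts "not hazardous".
trivial_accuracy = 1 - positive_rate

print(f"implied precision: {precision:.3f}")
print(f"all-negative baseline accuracy: {trivial_accuracy:.4f}")
```

The implied precision is roughly 24%, meaning about three in four asteroids flagged as hazardous would be false alarms. The baseline accuracy of 99.78% shows why overall accuracy says little under this class imbalance.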

[LG-104] Parameter Tuning of the Firefly Algorithm by Three Tuning Methods: Standard Monte Carlo, Quasi-Monte Carlo and Latin Hypercube Sampling Methods

Link: https://arxiv.org/abs/2504.18545
Authors: Geethu Joy, Christian Huyck, Xin-She Yang
Categories: Computation (stat.CO); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 21 pages

Abstract:There are many different nature-inspired algorithms in the literature, and almost all such algorithms have algorithm-dependent parameters that need to be tuned. The proper setting and parameter tuning should be carried out to maximize the performance of the algorithm under consideration. This work is the extension of the recent work on parameter tuning by Joy et al. (2024) presented at the International Conference on Computational Science (ICCS 2024), and the Firefly Algorithm (FA) is tuned using three different methods: the Monte Carlo method, the Quasi-Monte Carlo method and the Latin Hypercube Sampling. The FA with the tuned parameters is then used to solve a set of six different optimization problems, and the possible effect of parameter setting on the quality of the optimal solutions is analyzed. Rigorous statistical hypothesis tests have been carried out, including Student’s t-tests, F-tests, non-parametric Friedman tests and ANOVA. Results show that the performance of the FA is not influenced by the tuning methods used. In addition, the tuned parameter values are largely independent of the tuning methods used. This indicates that the FA can be flexible and equally effective in solving optimization problems, and any of the three tuning methods can be used to tune its parameters effectively.
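The three tuning methods named in the abstract differ in how they spread candidate parameter settings over the search space. The following is a minimal sketch on the unit hypercube, not the authors' code: the function names are illustrative, and the Halton sequence is only one possible quasi-Monte Carlo choice, since the abstract does not say which low-discrepancy sequence is used.

```python
import random

def monte_carlo(n, dim, rng):
    """Plain Monte Carlo: n independent uniform points in [0, 1)^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]

def latin_hypercube(n, dim, rng):
    """Latin Hypercube Sampling: each axis is cut into n equal strata,
    and every stratum is hit exactly once per axis."""
    columns = []
    for _ in range(dim):
        # One jittered point per stratum, then shuffle the stratum order.
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)
        columns.append(col)
    # Transpose the per-axis columns into n points of dimension dim.
    return [list(point) for point in zip(*columns)]

def halton(n, dim):
    """Halton sequence (assumed QMC choice); supports dim <= 6 here."""
    primes = [2, 3, 5, 7, 11, 13][:dim]

    def radical_inverse(i, base):
        # Mirror the base-`base` digits of i around the radix point.
        result, f = 0.0, 1.0 / base
        while i > 0:
            result += f * (i % base)
            i //= base
            f /= base
        return result

    return [[radical_inverse(i, b) for b in primes] for i in range(1, n + 1)]

rng = random.Random(0)
lhs_points = latin_hypercube(10, 3, rng)
mc_points = monte_carlo(10, 3, rng)
qmc_points = halton(10, 3)
```

Each point would then be rescaled to the actual parameter ranges of the Firefly Algorithm before the algorithm is run at that setting.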

[LG-105] Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

Link: https://arxiv.org/abs/2504.18539
Authors: Sungnyun Kim, Sungwoo Cho, Sangmin Bae, Kangwook Jang, Se-Young Yun
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 22 pages, 6 figures, 14 tables

Abstract:Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework particularly designed to handle audio-visual joint corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, with corrupted input frames. Specifically, we suggest a unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities, by predicting clean audio targets with corrupted videos, and clean video targets with corrupted audios. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption.

Information Retrieval

[IR-0] Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

Link: https://arxiv.org/abs/2504.20006
Authors: Sahel Sharifymoghaddam, Shivani Upadhyay, Nandan Thakur, Ronak Pradeep, Jimmy Lin
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 8 figures, 3 tables

Abstract:Battles, or side-by-side comparisons in so-called arenas that elicit human preferences, have emerged as a popular approach to assessing the output quality of LLMs. Recently, this idea has been extended to retrieval-augmented generation (RAG) systems. While undoubtedly representing an advance in evaluation, battles have at least two drawbacks, particularly in the context of complex information-seeking queries: they are neither explanatory nor diagnostic. Recently, the nugget evaluation methodology has emerged as a promising approach to evaluate the quality of RAG answers. Nuggets decompose long-form LLM-generated answers into atomic facts, highlighting important pieces of information necessary in a “good” response. In this work, we apply our AutoNuggetizer framework to analyze data from roughly 7K Search Arena battles provided by LMArena in a fully automatic manner. Our results show a significant correlation between nugget scores and human preferences, showcasing promise in our approach to explainable and diagnostic system evaluations.

[IR-1] How Cohesive Are Community Search Results on Online Social Networks?: An Experimental Evaluation

Link: https://arxiv.org/abs/2504.19489
Authors: Yining Zhao, Sourav S Bhowmick, Nastassja L. Fischer, SH Annabel Chen
Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract:Recently, numerous community search methods for large graphs have been proposed, at the core of which is defining and measuring cohesion. This paper experimentally evaluates the effectiveness of these community search algorithms w.r.t. cohesiveness in the context of online social networks. Social communities are formed and developed under the influence of group cohesion theory, which has been extensively studied in social psychology. However, current generic methods typically measure cohesiveness using structural or attribute-based approaches and overlook domain-specific concepts such as group cohesion. We introduce five novel psychology-informed cohesiveness measures, based on the concept of group cohesion from social psychology, and propose a novel framework called CHASE for evaluating eight representative CS algorithms w.r.t. these measures on online social networks. Our analysis reveals that there is no clear correlation between structural and psychological cohesiveness, and no algorithm effectively identifies psychologically cohesive communities in online social networks. This study provides new insights that could guide the development of future community search methods.

[IR-2] Scalable Substructure Discovery Algorithm For Homogeneous Multilayer Networks

Link: https://arxiv.org/abs/2504.19328
Authors: Arshdeep Singh, Abhishek Santra, Sharma Chakravarthy
Categories: Social and Information Networks (cs.SI); Databases (cs.DB); Information Retrieval (cs.IR)

Abstract:Graph mining analyzes real-world graphs to find core substructures (connected subgraphs) in applications modeled as graphs. Substructure discovery is a process that involves identifying meaningful patterns, structures, or components within a large data set. These substructures can be of various types, such as frequent patterns, motifs, or other relevant features within the data. To model complex data sets – with multiple types of entities and relationships – multilayer networks (or MLNs) have been shown to be more effective as compared to simple and attributed graphs. Analysis algorithms on MLNs using the decoupling approach have been shown to be both efficient and accurate. Hence, this paper focuses on substructure discovery in homogeneous multilayer networks (one type of MLN) using a novel decoupling-based approach. In this approach, each layer is processed independently, and then the results from two or more layers are composed to identify substructures in the entire MLN. The algorithm is designed and implemented, including the composition part, using one of the distributed processing frameworks (the Map/Reduce paradigm) to provide scalability. After establishing the correctness, we analyze the speedup and response time of the proposed algorithm and approach through extensive experimental analysis on large synthetic and real-world data sets with diverse graph characteristics.

[IR-3] AlphaFuse: Learn ID Embeddings for Sequential Recommendation in Null Space of Language Embeddings

Link: https://arxiv.org/abs/2504.19218
Authors: Guoqing Hu, An Zhang, Shuo Liu, Zhibo Cai, Xun Yang, Xiang Wang
Categories: Information Retrieval (cs.IR)

Abstract:Recent advancements in sequential recommendation have underscored the potential of Large Language Models (LLMs) for enhancing item embeddings. However, existing approaches face three key limitations: 1) the degradation of the semantic space when high-dimensional language embeddings are mapped to lower-dimensional ID embeddings, 2) the underutilization of language embeddings, and 3) the reliance on additional trainable parameters, such as an adapter, to bridge the gap between the semantic and behavior spaces. In this paper, we introduce AlphaFuse, a simple but effective language-guided learning strategy that addresses these challenges by learning ID embeddings within the null space of language embeddings. Specifically, we decompose the semantic space of language embeddings via Singular Value Decomposition (SVD), distinguishing it into a semantic-rich row space and a semantic-sparse null space. Collaborative signals are then injected into the null space, while preserving the rich semantics of the row space. AlphaFuse prevents degradation of the semantic space, integrates the retained language embeddings into the final item embeddings, and eliminates the need for auxiliary trainable modules, enabling seamless adaptation to any sequential recommendation framework. We validate the effectiveness and flexibility of AlphaFuse through extensive experiments on three benchmark datasets, including cold-start user and long-tail settings, showcasing significant improvements in both discriminative and diffusion-based generative sequential recommenders. Our code and datasets are available at this https URL.

[IR-4] Relative Contrastive Learning for Sequential Recommendation with Similarity-based Positive Pair Selection

Link: https://arxiv.org/abs/2504.19178
Authors: Zhikai Wang, Yanyan Shen, Zexi Zhang, Li He, Yichun Li, Hao Gu, Yinghua Zhang
Categories: Information Retrieval (cs.IR)
Comments: The code can be found at this https URL

Abstract:Contrastive Learning (CL) enhances the training of sequential recommendation (SR) models through informative self-supervision signals. Existing methods often rely on data augmentation strategies to create positive samples and promote representation invariance. Some strategies such as item reordering and item substitution may inadvertently alter user intent. Supervised Contrastive Learning (SCL) based methods offer an alternative to augmentation-based CL methods by selecting same-target sequences (interaction sequences with the same target item) to form positive samples. However, SCL-based methods suffer from the scarcity of same-target sequences and consequently lack enough signals for contrastive learning. In this work, we propose to use similar sequences (with different target items) as additional positive samples and introduce a Relative Contrastive Learning (RCL) framework for sequential recommendation. RCL comprises a dual-tiered positive sample selection module and a relative contrastive learning module. The former module selects same-target sequences as strong positive samples and selects similar sequences as weak positive samples. The latter module employs a weighted relative contrastive loss, ensuring that each sequence is represented closer to its strong positive samples than its weak positive samples. We apply RCL on two mainstream deep learning-based SR models, and our empirical results reveal that RCL achieves an average improvement of 4.88% over the state-of-the-art SR methods on five public datasets and one private dataset.

[IR-5] LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations

Link: https://arxiv.org/abs/2504.19076
Authors: Laura Dietz, Oleg Zendel, Peter Bailey, Charles Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, Nick Craswell
Categories: Information Retrieval (cs.IR)

Abstract:Large Language Models (LLMs) are increasingly used to evaluate information retrieval (IR) systems, generating relevance judgments traditionally made by human assessors. Recent empirical studies suggest that LLM-based evaluations often align with human judgments, leading some to suggest that human judges may no longer be necessary, while others highlight concerns about judgment reliability, validity, and long-term impact. As IR systems begin incorporating LLM-generated signals, evaluation outcomes risk becoming self-reinforcing, potentially leading to misleading conclusions. This paper examines scenarios where LLM-evaluators may falsely indicate success, particularly when LLM-based judgments influence both system development and evaluation. We highlight key risks, including bias reinforcement, reproducibility challenges, and inconsistencies in assessment methodologies. To address these concerns, we propose tests to quantify adverse effects, guardrails, and a collaborative framework for constructing reusable test collections that integrate LLM judgments responsibly. By providing perspectives from academia and industry, this work aims to establish best practices for the principled use of LLMs in IR evaluation.

Attachment Download

Click to download today's complete paper list