This post contains the latest paper list fetched from arXiv.org on 2025-03-20. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Overview (2025-03-20)
473 papers updated today, including:
- Natural Language Processing: 61 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 119 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 166 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 120 papers (Machine Learning, cs.LG)
Natural Language Processing
[NLP-0] TULIP: Towards Unified Language-Image Pretraining
【Quick Read】: This paper addresses the shortcomings of existing image-text contrastive models such as CLIP and SigLIP on high-fidelity image understanding tasks like counting, depth estimation, and fine-grained object recognition. By optimizing for language alignment, these models tend to prioritize high-level semantics at the expense of visual understanding, while vision-centric models excel at processing visual information but struggle with language, limiting their flexibility on language-driven tasks. The paper introduces TULIP, an open-source, drop-in replacement for existing CLIP-like models. Its key idea is to combine generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Scaled to over 1B parameters, the method outperforms existing state-of-the-art models across multiple benchmarks, establishing a new SOTA for zero-shot performance on ImageNet-1K, delivering up to a 2x improvement over SigLIP in linear probing for few-shot classification on RxRx1, and achieving over 3x higher scores than SigLIP on MMVP.
Link: https://arxiv.org/abs/2503.15485
Authors: Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan
Affiliations: University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a 2x enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over 3x higher scores than SigLIP on MMVP. Our code/checkpoints are available at this https URL
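To make the training recipe above concrete, here is a minimal PyTorch-style sketch of how image-text, image-image, and text-text contrastive terms can be combined with reconstruction regularizers. This is an illustration of the general idea, not TULIP's released code; the loss weights `lambda_ii`, `lambda_tt`, and `lambda_rec` are hypothetical knobs.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i].
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def tulip_style_loss(img, txt, img_aug, txt_para,
                     img_recon_err, txt_recon_err,
                     lambda_ii=1.0, lambda_tt=1.0, lambda_rec=0.1):
    loss = info_nce(img, txt)                             # CLIP-style alignment
    loss = loss + lambda_ii * info_nce(img, img_aug)      # image-image contrast
    loss = loss + lambda_tt * info_nce(txt, txt_para)     # text-text contrast
    loss = loss + lambda_rec * (img_recon_err + txt_recon_err)  # keep detail
    return loss
```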
[NLP-1] Value Profiles for Encoding Human Variation
【Quick Read】: This paper tackles the problem of effectively modeling individual human variation in rating tasks, in support of personalized AI systems, pluralistic model alignment, and computational social science. Its key solution is to represent individuals with value profiles: natural-language descriptions of underlying values compressed from in-context demonstrations, paired with a steerable decoder model that predicts ratings conditioned on a value profile or other rater information. The paper also introduces an information-theoretic methodology for measuring the predictive information in rater representations, finding that demonstrations carry the most information, followed by value profiles and then demographics. Because value profiles use a compressed natural-language format, however, they offer distinct advantages in scrutability, interpretability, and steerability. Further analysis shows that value profiles effectively preserve the useful information in demonstrations (70% information preservation), and that clustering value profiles explains rater variation better than the most predictive demographic groupings. Finally, the decoder models are shown to change ratings interpretably according to semantic profile differences, to be well calibrated, and to help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer a novel and effective way to describe individual variation beyond traditional demographic or group-based approaches.
Link: https://arxiv.org/abs/2503.15484
Authors: Taylor Sorensen, Pushkar Mishra, Roma Patel, Michael Henry Tessler, Michiel Bakker, Georgina Evans, Iason Gabriel, Noah Goodman, Verena Rieser
Affiliations: Department of Computer Science, University of Washington; Google DeepMind
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles – natural language descriptions of underlying values compressed from in-context demonstrations – along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.
[NLP-2] What Makes a Reward Model a Good Teacher? An Optimization Perspective
【Quick Read】: This paper questions whether accuracy alone fully captures what makes a reward model an effective teacher. The conventional view treats accuracy as the core measure of a reward model's quality, but from an optimization perspective the paper shows that the reward variance the model induces matters just as much: if the induced variance is too low, the RLHF objective suffers from a flat optimization landscape, so even a perfectly accurate reward model can lead to extremely slow optimization and underperform less accurate models that induce higher variance. The paper further shows that a reward model that works well for one language model can induce low variance, and thus a flat objective landscape, for another. The key takeaway is that a reward model must not only be accurate but also induce sufficient variance for efficient optimization; experiments corroborate the theory, demonstrating the interplay between reward variance, accuracy, and the rate of reward maximization.
Link: https://arxiv.org/abs/2503.15477
Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Affiliations: Princeton Language and Intelligence, Princeton University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Code available at this https URL
Abstract:The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
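As a rough illustration of the central quantity, the sketch below Monte-Carlo estimates the reward variance a reward model induces under a given policy. `policy.sample` and `reward_model.score` are assumed interfaces, and the paper's formal definition may differ in detail.

```python
import torch

def induced_reward_variance(reward_model, policy, prompts, n_samples=16):
    """Estimate the reward variance a reward model induces under a policy.
    Low variance signals a flat RLHF objective landscape and hence slow
    optimization, regardless of the reward model's accuracy."""
    per_prompt = []
    for prompt in prompts:
        # Sample several completions from the current policy.
        completions = [policy.sample(prompt) for _ in range(n_samples)]
        rewards = torch.tensor([reward_model.score(prompt, c)
                                for c in completions])
        per_prompt.append(rewards.var(unbiased=True))
    return torch.stack(per_prompt).mean()
```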
[NLP-3] Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification
【Quick Read】: This paper addresses the limitations of traditional text classification methods in handling complex linguistic structures and semantic dependencies, while also tackling the difficulty existing deep learning models face in balancing interpretability, computational efficiency, and long-range contextual understanding. Its key innovation is the Dynamic Bidirectional Elman with Attention Network (DBEAN), an architecture that combines bidirectional temporal modeling with self-attention to dynamically weight critical segments of the input, improving contextual representation while maintaining computational efficiency.
Link: https://arxiv.org/abs/2503.15469
Authors: ZhengLin Lai, MengYao Liao, Dong Xu
Affiliations: College of Computer Science and Software Engineering, Shenzhen University; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure
Abstract:Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.
[NLP-4] From 1000000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment
【Quick Read】: This paper addresses the way traditional large language model (LLM) alignment overlooks the diversity of user values and needs. Its key contribution is a comprehensive framework for personalized alignment that establishes a systematic preference space spanning psychological and behavioral dimensions, together with diverse persona representations for robust preference inference in real-world scenarios. On this foundation, the paper introduces AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develops two complementary alignment approaches: in-context alignment, which conditions directly on persona representations, and preference-bridged alignment, which models intermediate preference distributions. Experiments show an average 17.06% accuracy gain across four benchmarks, strong adaptation to novel preferences, robustness to limited user data, and precise preference controllability, validating the framework's effectiveness and advancing truly user-adaptive AI systems.
Link: https://arxiv.org/abs/2503.15463
Authors: Jia-Nan Li, Jian Guan, Songhao Wu, Wei Wu, Rui Yan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: in-context alignment directly conditioning on persona representations and preference-bridged alignment modeling intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our framework’s effectiveness, advancing toward truly user-adaptive AI systems.
[NLP-5] Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems
【Quick Read】: This paper addresses the biases related to race, gender, and social determinants of health that medical question-answering (QA) systems powered by Retrieval-Augmented Generation (RAG) models may introduce. The key to the solution is a systematic evaluation of bias in RAG-based LLMs that examines demographic-sensitive queries and measures retrieval discrepancies. Using datasets such as MMLU and MedMCQA, the study analyzes retrieval overlap and correctness disparities, underscoring the need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.
Link: https://arxiv.org/abs/2503.15454
Authors: Yuelyu Ji, Hang Zhang, Yanshan Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Medical QA systems powered by Retrieval-Augmented Generation (RAG) models support clinical decision-making but may introduce biases related to race, gender, and social determinants of health. We systematically evaluate biases in RAG-based LLM by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets like MMLU and MedMCQA, we analyze retrieval overlap and correctness disparities. Our findings reveal substantial demographic disparities within RAG pipelines, emphasizing the critical need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.
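One simple way to quantify the retrieval discrepancies the study measures is the top-k overlap between results for demographic paraphrases of the same clinical question. The sketch below uses Jaccard overlap and assumes a `retrieve(query, k)` function returning document IDs; it is an illustration, not necessarily the authors' exact metric.

```python
def retrieval_overlap(retrieve, query_variants, k=10):
    """Mean Jaccard overlap of top-k retrieved document IDs between the
    first query variant and each demographic paraphrase of it."""
    result_sets = [set(retrieve(q, k)) for q in query_variants]
    base = result_sets[0]
    overlaps = []
    for other in result_sets[1:]:
        union = base | other
        overlaps.append(len(base & other) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps)

# Example: how much does what a RAG pipeline retrieves change when only
# the patient demographic mentioned in the query changes?
# retrieval_overlap(my_retriever, [
#     "Recommended hypertension treatment for a Black male patient?",
#     "Recommended hypertension treatment for a white female patient?",
# ])
```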
[NLP-6] SkyLadder: Better and Faster Pretraining via Context Window Scheduling ICLR2025
【Quick Read】: This paper addresses the finding that, under a fixed compute (token) budget, models pretrained with long context windows typically underperform their short-context counterparts. The key solution is SkyLadder, a strategy that transitions gradually from short to long context windows during pretraining. SkyLadder preserves standard benchmark performance while matching or exceeding baselines on long-context tasks, yielding gains of up to 3.7% on common benchmarks and up to 22% faster training, effectively balancing long-context capability against pretraining efficiency.
Link: https://arxiv.org/abs/2503.15450
Authors: Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan
Affiliations: National University of Singapore; Sea AI Lab; City University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments: 22 pages. Accepted to ICLR 2025 Workshop on Open Science for Foundation Models
Abstract:Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at this https URL.
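The short-to-long transition can be expressed as a schedule over training steps. The sketch below is one plausible linear ramp with illustrative window sizes; the paper's actual schedule shape and hyperparameters may differ.

```python
def context_window_schedule(step, total_steps,
                            short_ctx=512, long_ctx=32_768,
                            warmup_frac=0.8):
    """Grow the context window linearly over the first `warmup_frac`
    of training, then hold it at the full length (a SkyLadder-style
    short-to-long transition, sketched with assumed defaults)."""
    progress = min(step / (warmup_frac * total_steps), 1.0)
    return int(short_ctx + progress * (long_ctx - short_ctx))

# 10% into training the window is still modest:
# context_window_schedule(10_000, 100_000) -> 4544
```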
[NLP-7] VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning
【Quick Read】: This paper addresses the barriers limiting NLP adoption in interdisciplinary fields, especially protein engineering, where the main challenges are data collection, task benchmarking, and practical application. The key solution is VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of protein language models (PLMs). VenusFactory serves both the computer science and biology communities with a choice of command-line execution or a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs, with all implementations released as open source.
Link: https://arxiv.org/abs/2503.15438
Authors: Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu, Guisheng Fan, Liang Hong, Bingxin Zhou
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; East China University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 12 pages, 1 figure, 8 tables
Abstract:Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced on this https URL.
[NLP-8] Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data
【Quick Read】: This paper addresses the inefficiency of patient recruitment for clinical trials caused by complex eligibility criteria and labor-intensive chart reviews. Traditional text-only approaches are limited by weak reasoning capabilities, information loss when converting visual records to text, and the lack of a generic electronic health record (EHR) integration for extracting patient data. The paper proposes a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching from unprocessed documents extracted from EHRs. Its key elements are the new reasoning-LLM paradigm for assessing even the most complex eligibility criteria, the visual capabilities of the latest LLMs for interpreting medical records directly without lossy image-to-text conversion, and multimodal embeddings for efficient medical record search. This approach markedly improves screening efficiency and accuracy: in real-world use, reviewers could assess overall eligibility in under 9 minutes per patient on average, an 80% improvement over traditional manual chart review.
Link: https://arxiv.org/abs/2503.15374
Authors: Anatole Callies (Inato), Quentin Bodinier (Inato), Philippe Ravaud (Inato; Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE, Paris, France; Centre d'épidémiologie clinique, AP-HP, Hôpital Hôtel Dieu, Paris, France), Kourosh Davarpanah (Inato)
Affiliations: Center for Research in Epidemiology and Statistics (CRESS), Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE; Centre d'épidémiologie clinique, AP-HP, Hôpital Hôtel Dieu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews. Prior research using text-only models have struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs. Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search. The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93%. In real-world trials, the pipeline yielded an accuracy of 87%, undermined by the difficulty to replicate human decision-making when medical records lack sufficient information. Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80% improvement over traditional manual chart reviews. Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.
[NLP-9] SemEval-2025 Task 1: AdMIRe – Advancing Multimodal Idiomaticity Representation ACL2025 SEMEVAL-2025
【Quick Read】: This paper tackles idiom comprehension, a long-standing challenge in NLP, and in particular the limited ability of large language models (LLMs) to represent idiomaticity in multimodal contexts. Through the datasets and tasks of SemEval-2025 Task 1: AdMIRe, it evaluates and improves models' ability to interpret idiomatic expressions across multiple languages. The key finding is that mixture-of-experts setups combining pretrained LLMs and vision-language models, with multiple queries used to smooth over weaknesses in these models' representations of idiomaticity, can reach human-level performance.
Link: https://arxiv.org/abs/2503.15358
Authors: Thomas Pickard, Aline Villavicencio, Maggie Mi, Wei He, Dylan Phelps, Carolina Scarton, Marco Idiart
Affiliations: University of Sheffield, UK; University of Exeter, UK; Federal University of Rio Grande do Sul, Brazil
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint; SemEval-2025 proceedings to appear at ACL 2025
Abstract:Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models’ representations of idiomaticity.
[NLP-10] Optimizing Decomposition for Optimal Claim Verification
【Quick Read】: This paper addresses a flaw in the existing Decompose-Then-Verify paradigm for evaluating the factuality of long-form text: decomposition and verification are treated in isolation, ignoring their interaction and potential misalignment. Existing decomposition policies, typically hand-crafted demonstrations, align poorly with downstream verifiers in terms of atomicity, a newly proposed metric quantifying information density, leading to suboptimal verification. The paper formulates finding the optimal decomposition policy for optimal verification as a bilevel optimization problem and proposes dynamic decomposition, a reinforcement learning framework that uses verifier feedback to learn a policy for dynamically decomposing claims to the verifier-preferred level of atomicity. Experiments show that dynamic decomposition improves verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across different verifiers, datasets, and input-claim atomicities.
Link: https://arxiv.org/abs/2503.15354
Authors: Yining Lu, Noah Ziems, Hy Dang, Meng Jiang
Affiliations: University of Notre Dame
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity – a novel metric quantifying information density – leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.
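Schematically, decompose-then-verify with verifier feedback looks like the loop below. The paper trains the decomposition policy with reinforcement learning rather than this naive search over atomicity levels; `decompose(claim, level)` and `verify(subclaim)` are assumed interfaces.

```python
def dynamic_decompose(claim, decompose, verify, max_levels=3):
    """Pick the atomicity level whose subclaims the verifier is most
    confident about (a schematic stand-in for the learned RL policy)."""
    best_conf, best_subclaims = -1.0, [claim]
    for level in range(max_levels):
        subclaims = decompose(claim, level)  # finer splits at higher levels
        if not subclaims:
            continue
        conf = sum(verify(s) for s in subclaims) / len(subclaims)
        if conf > best_conf:
            best_conf, best_subclaims = conf, subclaims
    return best_subclaims, best_conf
```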
[NLP-11] SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models
【Quick Read】: This paper addresses the weak generalization of existing embedding-based intent clustering methods to new datasets: they typically rely on a few labeled examples or unsupervised fine-tuning per dataset to optimize results, which limits their applicability across datasets. To make existing embedders more generalizable to new domain datasets without further fine-tuning, the paper proposes Selection and Pooling with Large Language Models (SPILL). The key idea is to view clustering as a small-scale selection problem, solved in two stages. First, an existing embedder produces an embedding for each utterance (the seed). Second, a distance metric selects a pool of candidate utterances close to the seed; because the embedder is not optimized for the new dataset, a large language model (LLM) then further filters the candidates to those sharing the seed's intent. The selected candidates are pooled with the seed to obtain a refined embedding. Experiments show the method generally outperforms using the embedder directly and achieves results comparable to other state-of-the-art studies, demonstrating its efficiency and strength. This indicates that existing embedders can be improved without additional fine-tuning, making them more adaptable to new domain datasets, and that framing clustering as a small-scale selection problem opens the possibility of using LLMs to customize clustering to the user's goals.
Link: https://arxiv.org/abs/2503.15351
Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Affiliations: Leiden University; Radboud University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user’s goals.
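The two-stage select-and-pool procedure translates into a short sketch. Here `embed` is the frozen embedder and `same_intent` stands in for the LLM judgment of stage two; both are assumed interfaces, and choices such as the distance metric and pool size follow the description above rather than the authors' exact configuration.

```python
import numpy as np

def spill_refine(seed_text, candidates, embed, same_intent, pool_size=10):
    """Refine a seed utterance's embedding by pooling it with nearby
    candidates that an LLM confirms share its intent (SPILL-style)."""
    seed_vec = embed(seed_text)
    cand_vecs = np.stack([embed(c) for c in candidates])
    # Stage 1: distance-based pool of candidates closest to the seed.
    dists = np.linalg.norm(cand_vecs - seed_vec, axis=1)
    pool_idx = np.argsort(dists)[:pool_size]
    # Stage 2: the LLM filters the pool down to same-intent utterances.
    kept = [cand_vecs[i] for i in pool_idx
            if same_intent(seed_text, candidates[i])]
    # Pool survivors with the seed to obtain the refined embedding.
    return (np.mean(np.stack([seed_vec] + kept), axis=0)
            if kept else seed_vec)
```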
[NLP-12] Inside-Out: Hidden Factual Knowledge in LLMs
【Quick Read】: This work asks whether large language models (LLMs) encode more factual knowledge in their parameters than they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated the phenomenon. The paper first proposes a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher. Depending on the information used to score answer candidates, either the model's observable token-level probabilities or its intermediate computations, this yields external and internal knowledge; hidden knowledge arises when internal knowledge exceeds external knowledge. The key contribution is a framework for distinguishing and measuring these kinds of knowledge, validated in a case study applying it to three popular open-weights LLMs in a closed-book QA setup. The results show that LLMs consistently encode more factual knowledge internally than they express externally, with an average gap of 40%, and that some knowledge is so deeply hidden that a model can know an answer internally yet fail to generate it even once across large-scale repeated sampling of 1,000 answers, revealing fundamental limitations in LLMs' generation capabilities.
Link: https://arxiv.org/abs/2503.15299
Authors: Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpector, Jonathan Herzig, Roi Reichart
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
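The paper's definition of knowledge for a single question maps directly to code: the fraction of (correct, incorrect) answer pairs where the correct answer scores higher. In the sketch below, `score` is an assumed callable; scoring with observable token-level probabilities yields external knowledge, while scoring from intermediate computations yields internal knowledge, and a positive gap between the two is hidden knowledge.

```python
from itertools import product

def knowledge_score(correct_answers, incorrect_answers, score):
    """Fraction of correct-incorrect answer pairs in which the correct
    answer is ranked higher by `score` (assumes non-empty lists)."""
    pairs = list(product(correct_answers, incorrect_answers))
    wins = sum(score(c) > score(i) for c, i in pairs)
    return wins / len(pairs)

# hidden_knowledge = knowledge_score(cs, ws, internal_score) \
#                  - knowledge_score(cs, ws, external_score)
```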
[NLP-13] TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification
【Quick Read】: This paper addresses the reliability and traceability of LLM-generated text in high-stakes domains such as healthcare, law, and news, where understanding where and how content is created is essential. It introduces the Text pROVEnance (TROVE) challenge, whose key idea is to trace each sentence of a target text back to specific source sentences within potentially long or multi-document inputs, with fine-grained annotations of relationships (quotation, compression, inference, and others) that provide a deep understanding of how each target sentence is formed. The authors construct a dataset covering 11 diverse scenarios in English and Chinese and use a three-stage annotation pipeline (sentence retrieval, GPT provenance, and human provenance) to ensure data quality. Evaluating 11 LLMs under direct prompting and retrieval-augmented paradigms, the study finds that retrieval is essential for robust performance, that larger models perform better on complex relationship classification, and that while closed-source models often lead, open-source models show significant promise, particularly with retrieval augmentation.
Link: https://arxiv.org/abs/2503.15289
Authors: Junnan Zhu, Min Xiao, Yining Wang, Feifei Zhai, Yu Zhou, Chengqing Zong
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China; Unisound AI Technology Co., Ltd
Subjects: Computation and Language (cs.CL)
Comments: 15 pages
Abstract:LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.
[NLP-14] MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration NAACL2025
【Quick Read】: This paper aims to improve faithfulness in long-form generation tasks such as summarization and question answering, i.e., removing factual inconsistencies from model outputs. Its key move is to extend multi-agent, multi-model collaboration to refinement for generation tasks: multiple instances and types of large language models (LLMs) collaborate iteratively on subtasks of the refinement process such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. The study finds that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing, and that reframing critiquing and refinement as reranking rather than generation further improves multi-agent performance. These insights are consolidated into a final recipe called Multi-Agent Multi-Model Refinement (MAMM-Refine), which significantly boosts performance on three summarization datasets and on long-form question answering, demonstrating the recipe's effectiveness and generalizability.
Link: https://arxiv.org/abs/2503.15272
Authors: David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal
Affiliations: UNC Chapel Hill
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025, 18 pages. Code: this https URL
Abstract:Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final “recipe” called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.
[NLP-15] BigO(Bench) – Can LLMs Generate Code with Controlled Time and Space Complexity?
【Quick Read】: This paper addresses the gap in current evaluations, which rarely test whether generative language models can understand and produce code under computational-complexity constraints. The key contribution is BigO(Bench), a novel coding benchmark that includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, together with 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred time- and space-complexity labels, plus runtime and memory-footprint values across a large range of input sizes. This fills a gap in current evaluations by systematically examining models' ability to comprehend and generate code at specified complexity levels.
Link: https://arxiv.org/abs/2503.15242
Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
Affiliations: Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments:
Abstract:We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes a set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.
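As a toy version of profiling-based complexity inference, one can fit the slope of log(runtime) against log(input size): a slope near 1 suggests O(n), near 2 suggests O(n^2). BigO(Bench)'s shipped tooling is more sophisticated; this sketch only conveys the idea.

```python
import math

def infer_polynomial_degree(input_sizes, runtimes):
    """Least-squares slope of log(runtime) vs. log(n), a crude
    empirical estimate of a function's polynomial degree."""
    xs = [math.log(n) for n in input_sizes]
    ys = [math.log(t) for t in runtimes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Runtime quadrupling whenever n doubles gives a slope of ~2 -> O(n^2).
```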
[NLP-16] Exploring Large Language Models for Word Games: Who is the Spy?
【Quick Read】: This paper explores how large language models (LLMs) can be effectively applied to rule-based, situational word games, proposing a training-free framework. Taking the classic game "Shei Shi Wo Di" ("Who is the Spy") as an example, it designs a Chain-of-Thought (CoT)-based scheduling framework that enables LLMs to excel at tasks such as inferring role words and disguising their identities. The key is the CoT mechanism, which equips LLMs with strong situational reasoning so they can meet the social interaction and role-play demands of structured game environments. Experiments confirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets.
Link: https://arxiv.org/abs/2503.15235
Authors: Chentian Wei, Jiewei Chen, Jinzhu Xu
Affiliations: Institute for Network Sciences and Cyberspace; School of Software; Department of Computer Science and Technology; Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. “Shei Shi Wo Di” or “Who is the Spy” in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework’s performance based on game success rates and the accuracy of the LLM agents’ analytical results. Experimental results affirm the framework’s effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at this https URL.
[NLP-17] Model Hubs and Beyond: Analyzing Model Popularity Performance and Documentation
【Quick Read】: This paper addresses the difficulty users face when choosing models for downstream tasks: whether model popularity (download counts, likes, recency) actually reflects performance, and how the comprehensiveness of model documentation correlates with both. The key to the solution is a comprehensive evaluation of 500 sentiment analysis models on Hugging Face, involving massive annotation effort (nearly 80,000 human annotations) alongside extensive model training and evaluation. The findings show that popularity does not necessarily correlate with performance, that roughly 80% of the analyzed models lack detailed information about the model, training, and evaluation processes in their model cards, and that about 88% of model authors overstate their models' performance. Based on these findings, the paper provides a checklist to guide users in choosing good models for downstream tasks.
Link: https://arxiv.org/abs/2503.15222
Authors: Pritam Kadasi, Sriman Reddy, Srivathsa Vamsi Chaturvedula, Rudranshu Sen, Agnish Saha, Soumavo Sikdar, Sayani Sarkar, Suhani Mittal, Rohit Jindal, Mayank Singh
Affiliations: IIT Gandhinagar
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ICWSM'25
Abstract:With the massive surge in ML models on platforms like Hugging Face, users often lose track and struggle to choose the best model for their downstream tasks, frequently relying on model popularity indicated by download counts, likes, or recency. We investigate whether this popularity aligns with actual model performance and how the comprehensiveness of model documentation correlates with both popularity and performance. In our study, we evaluated a comprehensive set of 500 Sentiment Analysis models on Hugging Face. This evaluation involved massive annotation efforts, with human annotators completing nearly 80,000 annotations, alongside extensive model training and evaluation. Our findings reveal that model popularity does not necessarily correlate with performance. Additionally, we identify critical inconsistencies in model card reporting: approximately 80% of the models analyzed lack detailed information about the model, training, and evaluation processes. Furthermore, about 88% of model authors overstate their models’ performance in the model cards. Based on our findings, we provide a checklist of guidelines for users to choose good models for downstream tasks.
[NLP-18] Entity-aware Cross-lingual Claim Detection for Automated Fact-checking
【Quick Read】: This paper tackles a key task in automated fact-checking: identifying claims that require verification, particularly amid the multilingual and multimodal data flooding social media platforms. While fine-tuning pre-trained multilingual language models has advanced the multilingual side, their ability to transfer cross-lingual knowledge for detecting claims spreading on social platforms remains under-explored. The paper introduces EX-Claim, an entity-aware cross-lingual claim detection model that generalizes to claims written in any language. Its key innovation is leveraging entity information derived from named entity recognition (NER) and entity linking (EL) to improve language-level performance on both seen and unseen languages. Experiments on three datasets from different social media platforms show that the model significantly outperforms the baselines across 27 languages and achieves the highest rate of knowledge transfer, even with limited training data.
Link: https://arxiv.org/abs/2503.15220
Authors: Rrubaa Panchendrarajan, Arkaitz Zubiaga
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Identifying claims requiring verification is a critical task in automated fact-checking, especially given the proliferation of misinformation on social media platforms. Despite significant progress in the task, there remain open challenges such as dealing with multilingual and multimodal data prevalent in online discourse. Addressing the multilingual challenge, recent efforts have focused on fine-tuning pre-trained multilingual language models. While these models can handle multiple languages, their ability to effectively transfer cross-lingual knowledge for detecting claims spreading on social media remains under-explored. In this paper, we introduce EX-Claim, an entity-aware cross-lingual claim detection model that generalizes well to handle claims written in any language. The model leverages entity information derived from named entity recognition and entity linking techniques to improve the language-level performance of both seen and unseen languages during training. Extensive experiments conducted on three datasets from different social media platforms demonstrate that our proposed model significantly outperforms the baselines, across 27 languages, and achieves the highest rate of knowledge transfer, even with limited training data.
[NLP-19] When Pigs Get Sick: Multi-Agent AI for Swine Disease Detection
【Quick Read】: This paper addresses the limited effectiveness of swine disease surveillance, which is critical to sustainable global agriculture but is undermined by scarce veterinary resources, delayed case identification, and variable diagnostic accuracy. The key solution is an AI-powered multi-agent diagnostic system that uses Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs as either knowledge retrieval queries or symptom-based diagnostic queries, the system ensures targeted information retrieval and precise diagnostic reasoning; an adaptive questioning protocol systematically collects relevant clinical signs, and a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses into robust disease predictions and treatment recommendations. Comprehensive evaluations covering query classification, disease diagnosis, and knowledge retrieval show high accuracy, rapid response times, and consistent reliability.
Link: https://arxiv.org/abs/2503.15204
Authors: Tittaya Mairittha, Tanakon Sawanglok, Panuwit Raden, Sorrawit Treesuk
Affiliations: AXONS
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: 14 pages, 2 figures
Abstract:Swine disease surveillance is critical to the sustainability of global agriculture, yet its effectiveness is frequently undermined by limited veterinary resources, delayed identification of cases, and variability in diagnostic accuracy. To overcome these barriers, we introduce a novel AI-powered, multi-agent diagnostic system that leverages Retrieval-Augmented Generation (RAG) to deliver timely, evidence-based disease detection and clinical guidance. By automatically classifying user inputs into either Knowledge Retrieval Queries or Symptom-Based Diagnostic Queries, the system ensures targeted information retrieval and facilitates precise diagnostic reasoning. An adaptive questioning protocol systematically collects relevant clinical signs, while a confidence-weighted decision fusion mechanism integrates multiple diagnostic hypotheses to generate robust disease predictions and treatment recommendations. Comprehensive evaluations encompassing query classification, disease diagnosis, and knowledge retrieval demonstrate that the system achieves high accuracy, rapid response times, and consistent reliability. By providing a scalable, AI-driven diagnostic framework, this approach enhances veterinary decision-making, advances sustainable livestock management practices, and contributes substantively to the realization of global food security.
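The confidence-weighted decision fusion step can be sketched as a weighted vote over the hypotheses returned by different diagnostic agents. This is a generic illustration of the mechanism described above, not the system's actual implementation.

```python
from collections import defaultdict

def fuse_diagnoses(hypotheses):
    """Combine (disease, confidence) pairs from multiple agents into a
    normalized ranking by summing confidence mass per disease."""
    scores = defaultdict(float)
    for disease, confidence in hypotheses:
        scores[disease] += confidence
    total = sum(scores.values())
    return sorted(((d, s / total) for d, s in scores.items()),
                  key=lambda kv: kv[1], reverse=True)

# fuse_diagnoses([("PRRS", 0.8), ("swine flu", 0.5), ("PRRS", 0.6)])
# -> [('PRRS', ~0.74), ('swine flu', ~0.26)]
```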
[NLP-20] A Review on Large Language Models for Visual Analytics
【Quick Read】: This paper provides a comprehensive review of the key challenges and opportunities in integrating large language models (LLMs) with visual analytics. Its central concern is how LLMs' natural language understanding and generation, dialogue systems, and related capabilities can enhance data interpretation, visualization techniques, and interactive exploration. The key lies in systematically evaluating the functionality and limitations of existing tools (such as LIDA, Chat2VIS, Julius AI, and Zoho Analytics) and specialized multimodal models (such as ChartLlama and CharXIV), and in examining how the taxonomy of LLM tasks, from natural language understanding to text-to-media transformation, supports data exploration, visualization enhancement, automated report generation, and insight extraction. The paper also stresses that effective and ethical integration requires addressing threats such as privacy concerns and skill degradation, and improving methodology to cope with weaknesses such as high computational demands and potential biases.
Link: https://arxiv.org/abs/2503.15176
Authors: Navya Sonal Agarwal, Sanjay Kumar Sonbhadra
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper provides a comprehensive review of the integration of Large Language Models (LLMs) with visual analytics, addressing their foundational concepts, capabilities, and wide-ranging applications. It begins by outlining the theoretical underpinnings of visual analytics and the transformative potential of LLMs, specifically focusing on their roles in natural language understanding, natural language generation, dialogue systems, and text-to-media transformations. The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities. Key tools and platforms including LIDA, Chat2VIS, Julius AI, and Zoho Analytics, along with specialized multimodal models such as ChartLlama and CharXIV, are critically evaluated. The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction. The taxonomy of LLM tasks, ranging from natural language understanding (NLU), natural language generation (NLG), to dialogue systems and text-to-media transformations, is systematically explored. This review provides a SWOT analysis of integrating Large Language Models (LLMs) with visual analytics, highlighting strengths like accessibility and flexibility, weaknesses such as computational demands and biases, opportunities in multimodal integration and user collaboration, and threats including privacy concerns and skill degradation. It emphasizes addressing ethical considerations and methodological improvements for effective integration.
[NLP-21] Comparing Llama3 and DeepSeek R1 on Biomedical Text Classification Tasks
【Quick Read】: This study compares two open-source large language models (LLMs), Llama3-70B and DeepSeekR1-distill-Llama3-70B, on six biomedical text classification tasks: four using social media data and two using clinical notes from electronic health records, all in zero-shot settings. Performance is measured with precision, recall, and F1 scores, along with 95% confidence intervals. The key findings are that DeepSeekR1-distill-Llama3-70B generally achieves better precision on most tasks with mixed results on recall, and that while the zero-shot LLMs reach high F1 on some tasks, they grossly underperform on others for both data sources. The study therefore argues that model selection should be guided by the specific requirements of the health-related text classification task, especially precision-recall trade-offs, and that when annotated data is available, supervised classification approaches may be more reliable than zero-shot LLMs.
Link: https://arxiv.org/abs/2503.15169
Authors: Yuting Guo, Abeed Sarker
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages
Abstract:This study compares the performance of two open-source large language models (LLMs)-Llama3-70B and DeepSeekR1-distill-Llama3-70B-on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.
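The per-task 95% confidence intervals reported in the study can be obtained with a standard bootstrap over test examples, as sketched below. This is a common recipe, not necessarily the authors' exact procedure.

```python
import random

def f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def f1_with_bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate of F1 plus a percentile bootstrap (1 - alpha) CI."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return f1(y_true, y_pred), (lo, hi)
```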
[NLP-22] Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU
【速读】: 该论文旨在解决在预训练大模型中选择性移除特定概念(machine unlearning)的问题,特别关注超bolic空间中概念移除的有效性。此前的研究主要集中于欧几里得空间中的对比学习模型,而超bolic空间的应用尚未被充分探索。论文的关键在于将Alignment Calibration方法适配到MERU模型中,该模型通过在超bolic空间嵌入图像和文本来更好地捕捉语义层次结构。解决方案的关键包括引入超bolic-specific组件,如蕴涵校准(entailment calibration)和范数正则化(norm regularization),以利用超bolic空间的独特属性。实验结果表明,超bolic几何在概念移除方面具有显著优势,在实现接近完美的遗忘同时保持保留概念的合理性能,尤其适用于多概念移除场景。与欧几里得模型的对比分析揭示了两种方法在无学习动态上的根本差异,其中超bolic无学习重新组织了语义层次结构,而欧几里得方法仅断开跨模态关联。这些发现不仅推动了机器无学习技术的发展,还为影响多模态模型中概念表示和移除的几何特性提供了见解。
Link: https://arxiv.org/abs/2503.15166
Authors: Àlex Pujol Vidal, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund
Affiliations: Aalborg University, Denmark; University of Barcelona and Computer Vision Center, Spain; Milestone Systems, Denmark; Pioneer Center for Artificial Intelligence, Denmark
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Preprint
Abstract:Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code available at this https URL
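For geometric intuition, the sketch below computes geodesic distance in the Lorentz (hyperboloid) model that MERU-style embeddings live in, plus one plausible form of the norm regularizer named above. MERU's released code treats curvature and numerical stability more carefully, so treat this as an assumption-laden illustration.

```python
import torch

def lorentz_inner(x, y):
    # Lorentzian inner product: <x, y>_L = -x0*y0 + sum_i xi*yi.
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_distance(x, y, curvature=1.0):
    """Geodesic distance on the hyperboloid of curvature -`curvature`."""
    inner = torch.clamp(-lorentz_inner(x, y) * curvature, min=1.0 + 1e-7)
    return torch.acosh(inner) / curvature ** 0.5

def norm_regularizer(embeddings):
    # Penalize drift away from the hyperboloid origin -- one plausible
    # reading of the paper's "norm regularization" component.
    return embeddings[..., 1:].norm(dim=-1).mean()
```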
[NLP-23] EmoGRACE: Aspect-based emotion analysis for social media data
【Quick Read】: While sentiment analysis has advanced from sentence level to aspect level (identifying the concrete terms a sentiment refers to), the counterpart field of Aspect-based Emotion Analysis (ABEA) faces dataset bottlenecks and the greater complexity of emotion classes compared with binary sentiment. To close these gaps, the paper's key contribution is a first ABEA training dataset of 2,621 English tweets and a BERT-based model fine-tuned for the ABEA subtasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). Annotation followed the hierarchical emotion theory of Shaver et al., using group annotation and majority voting to keep labels consistent. Fine-tuning the state-of-the-art ABSA model GRACE for ABEA yielded a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction; the small training set combined with the increased task complexity was identified as the main limiting factor, causing overfitting and limited generalization to new data.
Link: https://arxiv.org/abs/2503.15133
Authors: Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch
Affiliations: plus.ac.at; it-u.at
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps, by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.
[NLP-24] Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors
【Quick Read】: This paper addresses the difficulty of detecting harmful content generated by large language models (LLMs), specifically how to accurately identify machine-generated content in the online information space so that its credibility can be assessed. Humans can no longer reliably distinguish high-quality machine-generated text from authentic human writing, so automated detection is crucial. The key is a robust fine-tuning process for LLM-based detectors that makes them more robust against obfuscation and more generalizable to out-of-distribution data.
Link: https://arxiv.org/abs/2503.15128
Authors: Dominik Macko, Robert Moro, Ivan Srba
Affiliations: Kempelen Institute of Intelligent Technologies, Slovakia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means to accurately detect machine-generated content. It would enable to identify such content in online information space, thus providing an additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.
[NLP-25] Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces
【Quick Read】: This paper addresses the persistent errors in automatic speech recognition (ASR) transcripts and the limitations of confidence scores for detecting them. The key contribution is an evaluation of how reliable current confidence scores are for error detection, via a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. The results show that although confidence scores correlate with transcription accuracy, their error detection performance is limited: classifiers frequently miss errors or produce many false positives, undermining their practical utility. Confidence-based error detection neither improved correction efficiency nor was perceived as helpful by participants, highlighting the need for more sophisticated approaches to improve user interaction with and explainability of ASR results.
Link: https://arxiv.org/abs/2503.15124
Authors: Korbinian Kuhn, Verena Kersken, Gottfried Zimmermann
Affiliations: Stuttgart Media University, Stuttgart, Germany
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 1 figure, to be published in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)
Abstract:Despite advances in Automatic Speech Recognition (ASR), transcription errors persist and require manual correction. Confidence scores, which indicate the certainty of ASR results, could assist users in identifying and correcting errors. This study evaluates the reliability of confidence scores for error detection through a comprehensive analysis of end-to-end ASR models and a user study with 36 participants. The results show that while confidence scores correlate with transcription accuracy, their error detection performance is limited. Classifiers frequently miss errors or generate many false positives, undermining their practical utility. Confidence-based error detection neither improved correction efficiency nor was perceived as helpful by participants. These findings highlight the limitations of confidence scores and the need for more sophisticated approaches to improve user interaction and explainability of ASR results.
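The baseline whose limits the study documents is easy to reproduce: flag every word whose confidence falls below a threshold, then score the flags against reference error positions. A sketch, with a hypothetical threshold:

```python
def flag_low_confidence(confidences, threshold=0.5):
    """Candidate transcription errors: indices of words whose ASR
    confidence falls below an (assumed) threshold."""
    return [i for i, c in enumerate(confidences) if c < threshold]

def detection_metrics(flagged_idx, error_idx):
    """Precision/recall of confidence-based flags against reference
    error positions (illustrative evaluation)."""
    flagged, errors = set(flagged_idx), set(error_idx)
    tp = len(flagged & errors)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(errors) if errors else 0.0
    return precision, recall
```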
[NLP-26] Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification AAAI2025
【Quick Read】: This paper investigates how to efficiently adapt large language models (LLMs) to aspect-based sentiment classification (ABSC) via model editing. The key is to use causal interventions on hidden neural states to determine which intermediate representations in the LLM are essential for the ABSC prediction; by intervening on and restoring each component, the study finds that a distinct set of mid-layer representations is critical for detecting the sentiment polarity of given aspect words. Editing only these critical components dramatically reduces the number of trainable parameters while achieving performance competitive with the strongest current methods, yielding a more efficient and interpretable fine-tuning strategy.
Link: https://arxiv.org/abs/2503.15117
Authors: Shichen Li, Zhongqing Wang, Zheyu Zhao, Yue Zhang, Peifeng Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: AAAI 2025
Abstract:Model editing aims at selectively updating a small subset of a neural model’s parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce computational costs to adapt to large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing to serve an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the prediction of the model. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.
[NLP-27] Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
【Quick Read】: This study addresses safety evaluation for large foundation models, presenting the first comprehensive safety assessment of the content generated by the DeepSeek models. The key to the solution is a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough examination of the safety capabilities of Chinese-developed models. Systematically analyzing DeepSeek's language models, multimodal large language models, and text-to-image models for unsafe content generation, the study finds that despite strong general capabilities, the models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. The work offers important insights for understanding and improving the safety of large foundation models, with open-source code supporting reproducibility.
Link: https://arxiv.org/abs/2503.15092
Authors: Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek’s latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at this https URL.
[NLP-28] A Data-driven Investigation of Euphemistic Language: Comparing the usage of “slave” and “servant” in 19th century US newspapers
【Quick Read】: This study examines how the terms "slave" and "servant" were used differently in 19th-century US newspapers and what those differences reveal about the surrounding social discourse. The key is a computational approach combining word embeddings with statistical analysis (the log-odds ratio) to systematically identify and compare the semantic associations of the two terms and the discourse words over-represented in Southern versus Northern newspapers. FastText embeddings were used to accommodate OCR errors, and text reprints were excluded to account for 19th-century reprint culture. The study finds that "slave" is associated with socio-economic, legal, and administrative vocabulary, while "servant" is linked to religious words in Northern newspapers and to domestic and familial words in Southern ones. It further finds that slave discourse words from Southern newspapers are more prevalent in Northern newspapers, whereas each side's servant discourse words remain prevalent within their own region. The combination of computational tools and statistical analysis quantifies how newspapers in different regions constructed distinct discourses around enslaved African Americans.
Link: https://arxiv.org/abs/2503.15057
Authors: Jaihyun Park, Ryan Cordell
Affiliations: Nanyang Technological University; University of Illinois at Urbana-Champaign
Subjects: Computation and Language (cs.CL)
Comments: The 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH)
点击查看摘要
Abstract:This study investigates the usage of “slave” and “servant” in the 19th century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embedding and excluded text reprints to consider text reprint culture in the 19th century. Word2vec embedding was used to find semantically close words to “slave” and “servant” and log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that “slave” is associated with socio-economic, legal, and administrative words, however, “servant” is linked to religious words in the Northern newspapers while Southern newspapers associated “servant” with domestic and familial words. We further found that slave discourse words in Southern newspapers are more prevalent in Northern newspapers while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th century US.
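The log-odds-ratio step can be sketched in a few lines. The toy corpora and the add-one smoothing below are assumptions for illustration; the study may use a different prior (for example, the informative Dirichlet prior common in this line of work).

```python
# Minimal log-odds-ratio sketch for finding words over-represented in one
# corpus relative to another (toy data; add-one smoothing is an assumption).
from collections import Counter
import math

south = "the servant tended the house and family kitchen".split()
north = "the servant prayed at church and the mission".split()

c_s, c_n = Counter(south), Counter(north)
n_s, n_n = sum(c_s.values()), sum(c_n.values())
vocab = set(c_s) | set(c_n)
V = len(vocab)

def log_odds(word):
    # log odds of the word in each corpus, with add-one smoothing
    p_s = (c_s[word] + 1) / (n_s + V)
    p_n = (c_n[word] + 1) / (n_n + V)
    return math.log(p_s / (1 - p_s)) - math.log(p_n / (1 - p_n))

for w in sorted(vocab, key=log_odds, reverse=True):
    print(f"{w:10s} {log_odds(w):+.2f}")  # positive => Southern-leaning
```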
[NLP-29] ELTEX: A Framework for Domain-Driven Synthetic Data Generation
【Quick Read】: This paper addresses the limited performance of large language models (LLMs) in specialized domains such as cybersecurity, caused by the scarcity of domain-specific training data. It proposes ELTEX (Efficient LLM Token Extraction), a domain-driven framework whose key idea is to systematically integrate explicit domain-indicator extraction with dynamic prompting so that critical domain knowledge is preserved throughout generation. Experiments on blockchain-related cyberattack detection show that ELTEX substantially improves Gemma-2B, reaching performance competitive with GPT-4 on both standard classification metrics and uncertainty calibration while using far fewer computational resources. The study thus demonstrates that domain-driven synthetic data generation can effectively close the performance gap between resource-efficient models and larger architectures.
Link: https://arxiv.org/abs/2503.15055
Authors: Arina Razmyslovich,Kseniia Murasheva,Sofia Sedlova,Julien Capitaine,Eugene Dmitriev
Institutions: Distributed Networks Institute (DNI); Technologies Mésozoïques
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX’s effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.
[NLP-30] SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection
【Quick Read】: This paper tackles the misuse of synthetic text generated by large language models (LLMs), and in particular the challenge faced by Machine-Generated Text (MGT) detectors: the lack of systematically generated, high-quality datasets for training. The key to the solution is five novel data augmentation frameworks that generate synthetic user dialogue data through structured prompting, reducing the cost of traditional data collection. The approach yields 14 new dialogue datasets, whose usefulness is validated by benchmarking against seven MGT detection models; in particular, mixed datasets built with the proposed frameworks improve detector generalization. The paper also simulates online dialogue detection, examining how chat-history length affects detection accuracy and benchmarking online detection performance under limited chat history. The datasets are openly available for download.
Link: https://arxiv.org/abs/2503.15044
Authors: Haoyi Li,Angela Yifei Yuan,Soyeon Caren Han,Christopher Leckie
Institutions: The University of Melbourne
Subjects: Computation and Language (cs.CL)
Comments: 9 pages
Abstract:The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from this https URL.
[NLP-31] LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones? NAACL2025
【Quick Read】: This position paper addresses the neglect of Arab cultural diversity in the development of Arabic-specific large language models. The author argues that, despite substantial cultural variation within the Arab world, current multilingual and Arabic-specific models are often developed under an assumption of cultural homogeneity that fails to reflect this diversity. The key contribution is a set of preliminary ideas for building systems that better represent the cultural diversity within the Arab world, together with a call for the NLP community to pay closer attention to cultural differences among speakers of the same language when developing multilingual models.
Link: https://arxiv.org/abs/2503.15003
Authors: Amr Keleg
Institutions: Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the C3NLP workshop (Co-located with NAACL 2025)
Abstract:Large language models (LLMs) have the potential of being useful tools that can automate tasks and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, as well as the views of the Arabs. Yet, Arabs are sometimes assumed to share the same culture. In this position paper, I discuss the limitations of this assumption and provide preliminary thoughts for how to build systems that can better represent the cultural diversity within the Arab world. The invalidity of the cultural homogeneity assumption might seem obvious, yet, it is widely adopted in developing multilingual and Arabic-specific LLMs. I hope that this paper will encourage the NLP community to be considerate of the cultural diversity within various communities speaking the same language.
[NLP-32] Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
【Quick Read】: This paper addresses the inconsistencies in Multiple-Choice Question Answering (MCQA) evaluation strategies, which can lead to underestimating the capabilities of large language models (LLMs) and to misleading model comparisons. The key is a systematic analysis of whether existing answer extraction methods agree with human judgment, and of how answer constraints in the prompt affect evaluation across domains. The experiments show that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. The paper further reveals a fundamental trade-off between adding format constraints to the prompt to simplify answer extraction and letting the model generate free-form text to improve reasoning. It calls for standardized evaluation methodologies and more reliable, consistent MCQA evaluation practices (a toy extractor sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14996
Authors: Francesco Maria Molfese,Luca Moroni,Luca Gioffrè,Alessandro Scirè,Simone Conia,Roberto Navigli
Institutions: Sapienza University of Rome; Babelscape
Subjects: Computation and Language (cs.CL)
Comments: 17 pages (9 main), 11 figures, 21 tables
Abstract:One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model’s answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
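A minimal sketch of rule-based answer extraction shows how free-form reasoning can defeat it. The regex patterns and example outputs below are hypothetical, not the extractors analyzed in the paper.

```python
# Sketch of a rule-based MCQA answer extractor and a case where it misfires.
import re

PATTERNS = [
    r"answer is\s*\(?([A-D])\)?",   # e.g. "the answer is (B)"
    r"^\s*\(?([A-D])\)?[).:]",      # output that starts with "B)" or "(B)."
]

def extract_choice(output: str):
    for pat in PATTERNS:
        m = re.search(pat, output, flags=re.IGNORECASE | re.MULTILINE)
        if m:
            return m.group(1).upper()
    return None  # extraction failure, often scored as wrong

# Free-form reasoning that mentions a wrong option first evades the rules:
print(extract_choice("A) is tempting, but careful reading shows C must hold."))  # -> "A" (wrong)
print(extract_choice("Therefore the answer is (C)."))                            # -> "C"
```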
[NLP-33] Inspecting the Representation Manifold of Differentially-Private Text
【Quick Read】: This paper examines the geometric distortion that Differential Privacy (DP) introduces into the representation space of text, in particular how structure and complexity are preserved or traded off. Word-level methods, while strengthening privacy, sharply inflate the dimension of the representation manifold and fail to preserve the semantic complexity of human writing. The key finding concerns sentence-level methods: masked paraphrasing preserves structural complexity better than causal paraphrasing, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space. The paper reaches these conclusions by estimating the intrinsic dimension of paraphrased text across privacy budgets, revealing how sentence-level methods balance privacy and utility and how they shape the geometry of the representation space (a toy intrinsic-dimension sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14991
Authors: Stefan Arnold
Institutions: Friedrich-Alexander-Universität Erlangen-Nürnberg
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity in the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.
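Intrinsic dimension, the quantity the paper estimates, can be computed for a point cloud of embeddings with, for example, the TwoNN estimator (Facco et al., 2017). The sketch below uses random vectors in place of real paraphrase embeddings and is not tied to the paper's exact estimator.

```python
# TwoNN intrinsic-dimension sketch: for each point take the ratio
# mu = r2/r1 of its two nearest-neighbor distances; the maximum-likelihood
# estimate of the intrinsic dimension is d = N / sum(log mu_i).
import numpy as np

def two_nn_id(X: np.ndarray) -> float:
    # brute-force pairwise distances (fine for small N; use a KD-tree otherwise)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)       # exclude self-distance
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]            # ratio of 2nd to 1st NN distance
    return len(X) / np.log(mu).sum()

rng = np.random.default_rng(0)
flat = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 128))  # 3-dim manifold in 128-d
print(two_nn_id(flat))  # close to 3
```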
[NLP-34] ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming
【Quick Read】: This paper targets the efficient GPU programming of dense operations such as matrix multiplication (GEMM) and multi-head attention (MHA) in the era of large language models (LLMs). Traditional programming against low-level interfaces such as CUDA or SYCL is powerful but neither user-friendly nor portable, while the existing Triton, although it offers a higher-level abstraction, lowers directly from the workgroup (threadblock) level to the per-thread level, a premature lowering that fails to exploit the hierarchical structure of modern GPUs. The paper observes that modern GPUs are hierarchical both physically and logically, with SIMD units able to operate on tiles directly at warp or warp-group granularity (e.g., blocked loads and blocked MMA), which calls for a multi-level compilation flow aligned with this hierarchy.
The key to the solution is ML-Triton, which introduces a multi-level compilation flow and programming interface that lowers progressively from the workgroup level to the warp level and then to hardware intrinsics, mapping the GPU hierarchy more faithfully. In addition, ML-Triton extends the Triton language with user-set compiler hints and warp-level programming, so researchers can obtain performance close to expert-written kernels out of the box without waiting for compiler updates. Experiments show the approach reaches above 95% (geometric mean) of the performance of hand-written kernels on Intel GPUs.
Link: https://arxiv.org/abs/2503.14985
Authors: Dewei Wang,Wei Zhu,Liyang Ling,Ettore Tiotto,Quintin Wang,Whitney Tsang,Julian Opperman,Jacky Deng
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tilebased approach. While traditional GPU programming often relies on low level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton starts at the workgroup (aka threadblock) level, and directly lowers to per-thread level. And then attempt to coalesce and amend through a series of passes, promoting information from low-level representation. We believe this is pre-mature lowering based on the below observations. 1. GPU has a hierarchical structure both physically and logically. Modern GPUs often feature SIMD units capable of directly operating on tiles on a warp or warpgroup basis, such as blocked load and blocked MMA. 2. Multi-level gradual lowering can make compiler decoupled and clean by separating considerations inter and intra a logical layer. 3. Kernel developers often need fine control to get good performance on the latest hardware. FlashAttention2 advocates explicit data partition between warps to make a performance boost. In this context, we propose ML-Triton which features multi-level compilation flow and programming interface. Our approach begins at the workgroup level and progressively lowers to the warp and intrinsic level, implementing a multilevel lowering align with the hierarchical nature of GPU. Additionally, we extend triton language to support user-set compiler hint and warp level programming, enabling researchers to get good out-of-the box performance without awaiting compiler updates. Experimental results demonstrate that our approach achieves performance above 95% of expert-written kernels on Intel GPU, as measured by the geometric mean.
[NLP-35] Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection KDD2025
【Quick Read】: This paper addresses the difficulty of detecting illicit-drug-related content on social media caused by the vast amount of coded jargon. Prior approaches were limited to extracting term lists, which suffer from two fundamental problems: they are trivially evaded via word substitution, and they cannot distinguish whether euphemisms such as "pot" or "crack" are used in a drug sense or a benign one. The paper argues that drug content moderation should be based on context rather than a banlist, noting that manually annotated data is expensive and quickly becomes obsolete. The key is the JEDIS framework, which combines distant supervision with delexicalization so that the model can be trained without human-labeled data while remaining robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in F1-score and detection coverage, and qualitative analysis confirms its robustness to the pitfalls of existing approaches (a toy delexicalization sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14926
Authors: Minkyoo Song,Eugene Jang,Jaehan Kim,Seungwon Shin
Institutions: Korea Advanced Institute of Science & Technology (Daejeon, Republic of Korea); S2W Inc. (Seongnam, Republic of Korea)
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication in the KDD 2025 Research Track
Abstract:In light of rising drug-related concerns and the increasing role of social media, sales and discussions of illicit drugs have become commonplace online. Social media platforms hosting user-generated content must therefore perform content moderation, which is a difficult task due to the vast amount of jargon used in drug discussions. Previous works on drug jargon detection were limited to extracting a list of terms, but these approaches have fundamental problems in practical application. First, they are trivially evaded using word substitutions. Second, they cannot distinguish whether euphemistic terms such as “pot” or “crack” are being used as drugs or in their benign meanings. We argue that drug content moderation should be done using contexts rather than relying on a banlist. However, manually annotated datasets for training such a task are not only expensive but also prone to becoming obsolete. We present JEDIS, a framework for detecting illicit drug jargon terms by analyzing their contexts. JEDIS utilizes a novel approach that combines distant supervision and delexicalization, which allows JEDIS to be trained without human-labeled data while being robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in terms of F1-score and detection coverage in drug jargon detection. We also conduct qualitative analysis that demonstrates JEDIS is robust against pitfalls faced by existing approaches.
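The delexicalization idea can be sketched as masking the candidate term so that only context drives the decision. The placeholder token and the toy keyword "classifier" below are assumptions; JEDIS's actual distantly supervised model is not reproduced here.

```python
# Delexicalization sketch: mask the candidate term so a classifier must rely
# on context, not the word itself (toy rule-based stand-in for the model).
import re

def delexicalize(sentence: str, term: str, placeholder: str = "[TERM]") -> str:
    return re.sub(rf"\b{re.escape(term)}\b", placeholder, sentence, flags=re.IGNORECASE)

DRUG_CONTEXT = {"dealer", "gram", "score", "plug", "stash"}  # hypothetical cue words

def looks_like_drug_context(delexed: str) -> bool:
    tokens = set(delexed.lower().split())
    return len(tokens & DRUG_CONTEXT) > 0

for s in ["my dealer has the best pot left in town",
          "grandma's pot of stew simmered all day"]:
    d = delexicalize(s, "pot")
    print(d, "->", looks_like_drug_context(d))
```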
[NLP-36] MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models
【Quick Read】: This paper addresses high-quality data selection for pretraining large language models (LLMs), specifically for the mathematical reasoning domain. Traditional data selection methods mostly target general data and overlook the particular characteristics of domain-related data. The key is MASS (Mathematical data Selection framework using the Skill graph), which builds a skill graph capturing mathematical skills and their interrelations from a reference dataset, uses it to assign quality scores to the target dataset, and selects a top-ranked subset for pretraining. The method both reduces the number of training tokens required (by 50%-70%) and improves model performance (by 3.3%-5.9% over the original datasets), making pretraining more efficient and more effective.
Link: https://arxiv.org/abs/2503.14917
Authors: Jiazheng Li,Lu Yu,Qing Cui,Zhiqiang Zhang,Jun Zhou,Yanfang Ye,Chuxu Zhang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a \textbfMAthematical data \textbfSelection framework using the \textbfSkill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens - ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.
[NLP-37] Deep Contrastive Unlearning for Language Models
【Quick Read】: This paper addresses the privacy risks and copyright violations that arise because large language models are trained on text corpora containing copyrighted content and user-generated data. Specifically, it targets the "right to be forgotten": removing the information carried by particular training samples from a model without degrading its predictive quality, a task made especially hard by the black-box nature of language models.
Most existing work focuses on mitigating the influence of the forgotten samples on the model's outputs, without explicitly considering the geometric distribution of samples in the model's latent space. The key is Deep Contrastive Unlearning for fine-Tuning (DeepCUT), a machine-unlearning framework that achieves unlearning by directly optimizing the model's latent space. Experiments on real-world datasets show DeepCUT delivers consistent and significant improvements over baseline methods, confirming its effectiveness and efficiency.
Link: https://arxiv.org/abs/2503.14900
Authors: Estrid He,Tabinda Sarwar,Ibrahim Khalil,Xun Yi,Ke Wang
Institutions: School of Computing Technologies, RMIT University; Department of Electrical and Electronic Engineering, School of Engineering, RMIT University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The past a few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating human-like languages. Large language models achieve success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users’ privacy and violating copyright protections. Thus, to safeguard individuals’ “right to be forgotten”, there has been increasing interests in machine unlearning – the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of those forgot samples upon a model’s outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods.
[NLP-38] Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
【Quick Read】: This paper targets object hallucinations that undermine the truthfulness of responses from multimodal large language models (MLLMs) on vision-language tasks. It identifies the model's over-susceptibility to specific image frequency features (both low- and high-frequency) as a key cause of hallucinations. The key is Multi-Frequency Perturbations (MFP), a simple, cost-effective, pluggable method that exploits both low- and high-frequency features of images to perturb visual feature representations and explicitly suppresses redundant frequency-domain features during inference, thereby mitigating object hallucinations. As a training-time method, MFP can also be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark (a toy frequency-split sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14895
Authors: Shuo Li,Jiajun Sun,Guodong Zheng,Xiaoran Fan,Yujiong Shen,Yi Lu,Zhiheng Xi,Yuming Yang,Wenming Tan,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model’s over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.
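A minimal sketch of the frequency decomposition that MFP-style methods operate on: split an image into low- and high-frequency bands with an FFT radius mask. The cutoff radius and the random image are illustrative; this is not the authors' implementation.

```python
# Split an image into low- and high-frequency components with a circular
# mask in the (shifted) frequency domain; the two bands sum back exactly.
import numpy as np

def frequency_split(img: np.ndarray, radius: int = 8):
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low_mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * ~low_mask)).real
    return low, high

img = np.random.rand(64, 64)   # stand-in for an input image
low, high = frequency_split(img)
print(np.allclose(low + high, img))  # True: the bands reconstruct the image
```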
[NLP-39] MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer
【Quick Read】: This paper addresses the gap between existing Chain-of-Thought (CoT) methods for large language models (LLMs) on mathematical reasoning and the strategies humans actually use. Conventional approaches generate the CoT and answer for a given problem directly, ignoring how humans often reason by recalling analogous cases and leveraging their solutions. Inspired by this cognitive process, the paper proposes MetaLadder, whose key idea is to introduce meta-problems, problems structurally or semantically analogous to the target, together with their CoT solutions, prompting the LLM to recall and reflect on them before tackling the target problem. A problem-restating mechanism further improves the model's comprehension of the target problem by regenerating the original question, enabling reasoning transfer from analogous problems in the spirit of human "learning from examples" and generalization. Experiments on mathematical benchmarks show the method significantly boosts problem-solving accuracy, outperforming standard CoT-based methods (a 10.3% accuracy gain) and other baselines (a toy prompt-assembly sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14891
Authors: Honglin Lin,Zhuoshi Pan,Yu Li,Qizhi Pei,Xin Gao,Mengzhang Cai,Conghui He,Lijun Wu
Institutions: Shanghai AI Laboratory; Tsinghua University; Renmin University of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose \textbfMetaLadder, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model’s comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like “learning from examples” and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs’ problem-solving accuracy, largely outperforming standard CoT-based methods (\textbf10.3% accuracy gain) and other methods. Our code and data has been released at this https URL.
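A MetaLadder-style prompt can be sketched as plain string assembly: recall an analogous meta-problem with its CoT solution, ask for a restatement of the target, then ask for the solution. The template wording below is an assumption, not the paper's actual prompt.

```python
# Hypothetical MetaLadder-style prompt assembly (illustrative template only).
def build_metaladder_prompt(meta_problem: str, meta_cot: str, target: str) -> str:
    return (
        "Here is a problem analogous to the one you must solve:\n"
        f"Meta-problem: {meta_problem}\n"
        f"Its step-by-step solution: {meta_cot}\n\n"
        "Restate the target problem in your own words, then solve it step by "
        "step, transferring the reasoning above where it applies.\n"
        f"Target problem: {target}\n"
    )

print(build_metaladder_prompt(
    meta_problem="A train travels 120 km in 2 hours. What is its speed?",
    meta_cot="Speed = distance / time = 120 / 2 = 60 km/h.",
    target="A cyclist covers 45 km in 3 hours. What is the speed?",
))
```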
[NLP-40] LogLLaMA: Transformer-based log anomaly detection with LLaMA
【Quick Read】: This paper addresses log anomaly detection, i.e., distinguishing anomalous log messages from normal ones. It proposes LogLLaMA, a framework whose key idea is to exploit the power of large language models: the model is first finetuned on normal log messages from three large-scale datasets to learn their patterns, so that it can generate subsequent log messages given preceding ones; the generative model is then further trained with reinforcement learning (RL) to identify anomalous log messages. Experiments show LogLLaMA outperforms state-of-the-art approaches for anomaly detection on the BGL, Thunderbird, and HDFS datasets (a simplified scoring sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14849
Authors: Zhuoyi Yang,Ian G. Harris
Institutions: University of California, Irvine
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 8 pages, 5 figures
Abstract:Log anomaly detection refers to the task that distinguishes the anomalous log messages from normal log messages. Transformer-based large language models (LLMs) are becoming popular for log anomaly detection because of their superb ability to understand complex and long language patterns. In this paper, we propose LogLLaMA, a novel framework that leverages LLaMA2. LogLLaMA is first finetuned on normal log messages from three large-scale datasets to learn their patterns. After finetuning, the model is capable of generating successive log messages given previous log messages. Our generative model is further trained to identify anomalous log messages using reinforcement learning (RL). The experimental results show that LogLLaMA outperforms the state-of-the-art approaches for anomaly detection on BGL, Thunderbird, and HDFS datasets.
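A simplified way to see how a language model finetuned on normal logs can flag anomalies is to score each line by its average next-token negative log-likelihood and threshold it. This is a common perplexity-based baseline, deliberately omitting LogLLaMA's RL stage; GPT-2 below is only a stand-in for the finetuned LLaMA2.

```python
# Perplexity-style anomaly scoring sketch (baseline idea, not LogLLaMA itself).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a model finetuned on normal logs
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

@torch.no_grad()
def nll(line: str) -> float:
    ids = tok(line, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)   # HF shifts labels internally
    return out.loss.item()      # mean next-token negative log-likelihood

THRESHOLD = 5.0  # would be tuned on held-out normal logs
for line in ["connection established from node 12",
             "kernel panic: unexpected trap 0xdeadbeef"]:
    score = nll(line)
    print(f"{score:5.2f} {'ANOMALY' if score > THRESHOLD else 'ok':7s} {line}")
```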
[NLP-41] The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval
【Quick Read】: This lab aims to identify and counter online disinformation and manipulation by advancing innovative techniques for multilingual, multi-platform information verification pipelines. The proposed tasks cover both core steps of the verification pipeline and auxiliary tasks: subjectivity identification (Task 1), claim normalization (Task 2), fact-checking numerical claims (Task 3), and scientific web discourse processing (Task 4). The key challenge lies in solving difficult classification and retrieval problems at both the document and span levels, particularly in multilingual settings.
Link: https://arxiv.org/abs/2503.14828
Authors: Firoj Alam,Julia Maria Struß,Tanmoy Chakraborty,Stefan Dietze,Salim Hafid,Katerina Korre,Arianna Muti,Preslav Nakov,Federico Ruggeri,Sebastian Schellhammer,Vinay Setty,Megha Sundriyal,Konstantin Todorov,Venktesh V
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: misinformation, factuality, fact-checking, fact-checkers, check-worthiness, Social Media Platforms
Abstract:The CheckThat! lab aims to advance the development of innovative technologies designed to identify and counteract online disinformation and manipulation efforts across various languages and platforms. The first five editions focused on key tasks in the information verification pipeline, including check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, the lab has expanded its scope to address auxiliary tasks that support research and decision-making in verification. In the 2025 edition, the lab revisits core verification tasks while also considering auxiliary challenges. Task 1 focuses on the identification of subjectivity (a follow-up from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets fact-checking numerical claims, and Task 4 explores scientific web discourse processing. These tasks present challenging classification and retrieval problems at both the document and span levels, including multilingual settings.
[NLP-42] MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models ICLR2025
【Quick Read】: This paper addresses the inadequate safety and trustworthiness evaluation of multimodal foundation models (MMFMs). Existing multimodal benchmarks mostly assess helpfulness or cover only limited perspectives such as fairness and privacy, neglecting a comprehensive assessment across safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. The key is MMDT (Multimodal DecodingTrust), the first unified platform that evaluates the safety and trustworthiness of MMFMs from all of these perspectives by designing diverse evaluation scenarios and red-teaming algorithms for each, producing challenging data and a high-quality benchmark. Evaluating a range of multimodal models with MMDT reveals a series of vulnerabilities and areas for improvement along each dimension, paving the way for safer and more reliable MMFMs and systems.
Link: https://arxiv.org/abs/2503.14827
Authors: Chejian Xu,Jiawei Zhang,Zhaorun Chen,Chulin Xie,Mintong Kang,Yujin Potter,Zhun Wang,Zhuowen Yuan,Alexander Xiong,Zidi Xiong,Chenhui Zhang,Lingzhi Yuan,Yi Zeng,Peiyang Xu,Chengquan Guo,Andy Zhou,Jeffrey Ziwei Tan,Xuandong Zhao,Francesco Pinto,Zhen Xiang,Yu Gai,Zinan Lin,Dan Hendrycks,Bo Li,Dawn Song
Institutions: University of Illinois at Urbana-Champaign; University of Chicago; University of California, Berkeley; Harvard University; Massachusetts Institute of Technology; Virginia Tech; University of Georgia; Microsoft Corporation; Center for AI Safety
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: ICLR 2025
Abstract:Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at this https URL.
[NLP-43] FACTSEVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text
【Quick Read】: This paper addresses the lack of transparent reasoning and the limited source diversity of existing fact-verification tools for automatically checking AI-generated content. Prior work treats fact verification as binary classification or linear regression; while useful as automated guardrails inside systems, such tools cannot offer a trustworthy user experience. The key is FactsEvidence, an interactive and transparent tool for user-driven verification of complex text. It breaks down complex input text, visualizes the credibility of individual claims, and pairs explanations of model decisions with attribution to multiple, diverse evidence sources, empowering users to understand, verify, selectively trust, and effectively use machine-generated text.
Link: https://arxiv.org/abs/2503.14797
Authors: Varich Boonsanong,Vidhisha Balachandran,Xiaochuang Han,Shangbin Feng,Lucy Lu Wang,Yulia Tsvetkov
Institutions: University of Washington; Microsoft Research
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop FactsEvidence - an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users a breakdown of complex input texts to visualize the credibility of individual claims along with an explanation of model decisions and attribution to multiple, diverse evidence sources. FactsEvidence aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.
[NLP-44] Language Independent Named Entity Recognition via Orthogonal Transformation of Word Vectors
【Quick Read】: This paper addresses cross-lingual named entity recognition (NER): using a model trained on English to detect named entities in a target language (such as Arabic) without any additional training or fine-tuning on the target language. The key is an orthogonal linear transformation matrix that maps the target language's word embeddings into the source (English) embedding space, enabling a Bidirectional LSTM/CRF model to perform cross-lingual NER (a toy Procrustes sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14755
Authors: Omar E. Rakha,Hazem M. Abbas
Institutions: Dept. Computer and Systems Engineering, Faculty of Engineering, Ain Shams University, Cairo 11571, Egypt
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Paper was initially released in 2017 but was never published
Abstract:Word embeddings have been a key building block for NLP in which models relied heavily on word embeddings in many different tasks. In this paper, a model is proposed based on using Bidirectional LSTM/CRF with word embeddings to perform named entity recognition for any language. This is done by training a model on a source language (English) and transforming word embeddings from the target language into word embeddings of the source language by using an orthogonal linear transformation matrix. Evaluation of the model shows that by training a model on an English dataset the model was capable of detecting named entities in an Arabic dataset without neither training or fine tuning the model on an Arabic language dataset.
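The orthogonal mapping itself is the classic Procrustes solution: given matched embedding matrices X (target language) and Y (English), the orthogonal W minimizing ||XW - Y||_F is W = U V^T, where U, S, V^T is the SVD of X^T Y. The sketch below verifies this on synthetic vectors; real use would substitute embeddings of a bilingual seed dictionary.

```python
# Orthogonal Procrustes sketch: learn a rotation mapping one embedding space
# onto another from matched pairs (random vectors stand in for real embeddings).
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200
true_Q = np.linalg.qr(rng.normal(size=(d, d)))[0]     # hidden rotation
X = rng.normal(size=(n, d))                           # "target-language" vectors
Y = X @ true_Q + 0.01 * rng.normal(size=(n, d))       # noisy "English" counterparts

U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt                                            # learned orthogonal map
print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))  # small residual
print(np.allclose(W.T @ W, np.eye(d), atol=1e-8))     # W is orthogonal
```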
[NLP-45] Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence
【Quick Read】: This paper addresses the mismatch between the confidence that large language models (LLMs) verbalize and their actual error rates, i.e., the lack of effective uncertainty quantification. The key is uncertainty distillation: held-out data is used to map initial uncertainty estimates to meaningful probabilities, and examples annotated with verbalized probabilities are created for supervised fine-tuning, teaching the LLM to output calibrated semantic confidences and thus express its uncertainty more accurately (a toy calibration sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14749
Authors: Sophia Hager,David Mueller,Kevin Duh,Nicholas Andrews
Institutions: Johns Hopkins University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model’s confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model’s confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate our method yields verbalized confidences that correlate with observed error rates with a small fine-tuned language model as well as with larger instruction-tuned models, and find that our semantic uncertainty correlates well with lexical uncertainty on short answers.
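The "map raw scores to meaningful probabilities on held-out data" step can be sketched with simple histogram binning: replace each raw confidence with the empirical accuracy of its bin, then verbalize those values for fine-tuning. The binning scheme and toy data below are assumptions; the paper's exact mapping is not reproduced.

```python
# Histogram-binning calibration sketch: raw confidence -> empirical accuracy.
import numpy as np

def calibrate_by_binning(raw_conf, correct, n_bins=10):
    raw_conf, correct = np.asarray(raw_conf, dtype=float), np.asarray(correct)
    edges = np.linspace(0, 1, n_bins + 1)
    bin_idx = np.clip(np.digitize(raw_conf, edges) - 1, 0, n_bins - 1)
    calibrated = raw_conf.copy()          # fall back to raw score for empty bins
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            calibrated[mask] = correct[mask].mean()  # bin's observed accuracy
    return calibrated  # verbalize these (e.g., "about 70% confident") for SFT

rng = np.random.default_rng(0)
raw = rng.uniform(size=1000)                                    # miscalibrated scores
correct = rng.uniform(size=1000) < np.clip(raw * 0.5 + 0.2, 0, 1)
print(calibrate_by_binning(raw, correct)[:5].round(2))
```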
[NLP-46] Strategic resource allocation in memory encoding: An efficiency principle shaping language processing
【Quick Read】: This paper asks how the limited capacity of working memory is used efficiently to support human linguistic behavior. It proposes strategic resource allocation as an efficiency principle: during sentence processing, memory-encoding resources are dynamically and strategically allocated to prioritize novel, unexpected information, strengthening its representations so that it is less susceptible to memory decay and interference. The key to the argument is a resource-rational perspective: the principle arises naturally from two functional assumptions about working memory, its limited capacity and the noisiness of its representations. Naturalistic corpus data provide converging evidence for strategic resource allocation in the context of dependency locality on both the production and comprehension sides, where non-local dependencies with less predictable antecedents show a reduced locality effect. The results also reveal considerable cross-linguistic variability, underscoring the need to examine how this universal efficiency principle interacts with language-specific phrase structures.
Link: https://arxiv.org/abs/2503.14728
Authors: Weijie Xu,Richard Futrell
Institutions: University of California, Irvine
Subjects: Computation and Language (cs.CL)
Comments: manuscript under review
Abstract:How is the limited capacity of working memory efficiently used to support human linguistic behaviors? In this paper, we investigate strategic resource allocation as an efficiency principle for memory encoding in sentence processing. The idea is that working memory resources are dynamically and strategically allocated to prioritize novel and unexpected information, enhancing their representations to make them less susceptible to memory decay and interference. Theoretically, from a resource-rational perspective, we argue that this efficiency principle naturally arises from two functional assumptions about working memory, namely, its limited capacity and its noisy representation. Empirically, through naturalistic corpus data, we find converging evidence for strategic resource allocation in the context of dependency locality from both the production and the comprehension side, where non-local dependencies with less predictable antecedents are associated with reduced locality effect. However, our results also reveal considerable cross-linguistic variability, highlighting the need for a closer examination of how strategic resource allocation, as a universal efficiency principle, interacts with language-specific phrase structures.
[NLP-47] Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement
【Quick Read】: This paper asks how to improve the performance of first-language-based, general-purpose language models on the morphosyntactic analysis of second-language (L2) Korean. The key is an enhanced L2 Korean Universal Dependencies (UD) treebank, expanded with 5,454 manually annotated sentences and with annotation guidelines revised to better align with the UD framework. Three Korean language models are fine-tuned on this treebank and show significantly improved performance on both in-domain and out-of-domain L2 Korean datasets.
Link: https://arxiv.org/abs/2503.14718
Authors: Hakyung Sung,Gyu-Ho Shin
Institutions: University of Oregon; University of Illinois Chicago
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.
[NLP-48] HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
【Quick Read】: This paper addresses the tension between the performance gap and the resource cost of large multimodal models (LMMs) built on a single transformer. Most existing LMMs model the visual and textual modalities separately; native LMMs based on a single transformer are promising but typically resource-intensive and noticeably behind their compositional counterparts. The key is a simple yet efficient construction: first, a new early-fusion LMM that fuses multimodal inputs at an early stage and responds to visual instructions auto-regressively; second, an efficient training recipe that harnesses the prior knowledge of pretrained models, addressing both the performance limitations and the resource consumption. The proposed model markedly improves single-transformer LMM performance and substantially narrows the gap with compositional LMMs.
Link: https://arxiv.org/abs/2503.14694
Authors: Rui Yang,Lin Song,Yicheng Xiao,Runhui Huang,Yixiao Ge,Ying Shan,Hengshuang Zhao
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
[NLP-49] Generating Medically-Informed Explanations for Depression Detection using LLMs
【Quick Read】: This paper targets early depression detection from social media data, aiming at models that are both accurate and explainable so as to support timely intervention. The key is LLM-MTD (Large Language Model for Multi-Task Depression Detection), which uses a pretrained large language model to simultaneously classify social media posts for depression and generate explanatory text grounded in medical diagnostic criteria. The model is trained in a multi-task learning framework with a combined loss that optimizes both classification accuracy and explanation quality. On the benchmark Reddit Self-Reported Depression Dataset (RSDD), the method achieves state-of-the-art detection performance, with significant gains in AUPRC and other key metrics over traditional machine learning and fine-tuned BERT, and human evaluation confirms the relevance, completeness, and medical accuracy of the generated explanations, highlighting the approach's improved interpretability (a toy multi-task loss sketch follows the abstract below).
Link: https://arxiv.org/abs/2503.14671
Authors: Xiangyong Chen,Xiaochuan Lin
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Early detection of depression from social media data offers a valuable opportunity for timely intervention. However, this task poses significant challenges, requiring both professional medical knowledge and the development of accurate and explainable models. In this paper, we propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that leverages a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. We train our model using a multi-task learning framework with a combined loss function that optimizes both classification accuracy and explanation quality. We evaluate LLM-MTD on the benchmark Reddit Self-Reported Depression Dataset (RSDD) and compare its performance against several competitive baseline methods, including traditional machine learning and fine-tuned BERT. Our experimental results demonstrate that LLM-MTD achieves state-of-the-art performance in depression detection, showing significant improvements in AUPRC and other key metrics. Furthermore, human evaluation of the generated explanations reveals their relevance, completeness, and medical accuracy, highlighting the enhanced interpretability of our approach. This work contributes a novel methodology for depression detection that combines the power of large language models with the crucial aspect of explainability.
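The combined objective can be sketched as a weighted sum of a classification loss and a token-level generation loss. The weighting scheme and toy tensors below are assumptions; the paper's exact loss is not reproduced.

```python
# Multi-task loss sketch: classification + explanation generation.
import torch
import torch.nn.functional as F

def multi_task_loss(cls_logits, cls_labels, gen_logits, gen_labels, lam=0.5):
    l_cls = F.cross_entropy(cls_logits, cls_labels)   # depressed vs. not
    # token-level cross-entropy over the generated explanation
    l_gen = F.cross_entropy(gen_logits.view(-1, gen_logits.size(-1)),
                            gen_labels.view(-1), ignore_index=-100)
    return l_cls + lam * l_gen   # lam trades accuracy against explanation quality

cls_logits = torch.randn(4, 2)                 # batch of 4 posts
cls_labels = torch.tensor([0, 1, 1, 0])
gen_logits = torch.randn(4, 16, 32000)         # 16 explanation tokens, 32k vocab
gen_labels = torch.randint(0, 32000, (4, 16))
print(multi_task_loss(cls_logits, cls_labels, gen_logits, gen_labels))
```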
[NLP-50] ConQuer: A Framework for Concept-Based Quiz Generation
【Quick Read】: This paper addresses the challenge of generating high-quality quizzes: ensuring the quality of AI-generated quizzes and their educational impact on students while keeping generation efficient. The key is ConQuer, a concept-based quiz generation framework that enhances generation by drawing on external knowledge sources and quantitatively evaluates the generated quizzes along comprehensive dimensions with LLMs as judges. Ablation studies further confirm the effectiveness of each component of the framework.
Link: https://arxiv.org/abs/2503.14662
Authors: Yicheng Fu,Zikui Wang,Liuxin Yang,Meiqing Huo,Zhongdongming Dai
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Quizzes play a crucial role in education by reinforcing students’ understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at this https URL.
[NLP-51] RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
【Quick Read】: This paper addresses the challenges of efficient retrieval-augmented generation (RAG) serving, in particular the performance-optimization difficulties caused by the rapid emergence of many RAG variants and the substantial differences in their workload characteristics. The key contribution is RAGO (Retrieval-Augmented Generation Optimizer), a systematic optimization framework. The approach first builds RAGSchema, a structured abstraction that captures the wide variation among RAG algorithms and serves as a foundation for performance optimization; it then analyzes several representative RAG workloads with distinct RAGSchema, revealing significant performance variability among them; finally, RAGO is designed to address this variability and meet diverse performance requirements. Evaluation shows that, compared with RAG systems built on LLM-system extensions, RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency.
Link: https://arxiv.org/abs/2503.14649
Authors: Wenqi Jiang,Suvinay Subramanian,Cat Graves,Gustavo Alonso,Amir Yazdanbakhsh,Vidushi Dadu
Institutions: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
[NLP-52] An Explainable Framework for Misinformation Identification via Critical Question Answering
【Quick Read】: This paper addresses the lack of explainability in natural-language fact-checking and reason-checking systems. Current natural-language misinformation detection relies mainly on sequence classification, producing opaque systems in which the reasons behind a misinformation verdict are unclear. While some explainable approaches exist for automated fact-checking, automated reason-checking systems still lack such capabilities.
The key is a new explainable framework for detecting both factual and rational misinformation, grounded in the theory of Argumentation Schemes and Critical Questions. To this end, the authors create and release NLAS-CQ, the first corpus combining 3,566 textbook-like natural language argumentation scheme instances with 4,687 corresponding answers to critical questions about these arguments. On this basis, they implement and validate a framework that combines classification with question answering to analyze arguments in search of misinformation and provides explanations to the human user in the form of critical questions.
Link: https://arxiv.org/abs/2503.14626
Authors: Ramon Ruiz-Dolz,John Lawrence
Institutions: Centre for Argument Technology (ARG-tech); University of Dundee
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Natural language misinformation detection approaches have been, to date, largely dependent on sequence classification methods, producing opaque systems in which the reasons behind classification as misinformation are unclear. While an effort has been made in the area of automated fact-checking to propose explainable approaches to the problem, this is not the case for automated reason-checking systems. In this paper, we propose a new explainable framework for both factual and rational misinformation detection based on the theory of Argumentation Schemes and Critical Questions. For that purpose, we create and release NLAS-CQ, the first corpus combining 3,566 textbook-like natural language argumentation scheme instances and 4,687 corresponding answers to critical questions related to these arguments. On the basis of this corpus, we implement and validate our new framework which combines classification with question answering to analyse arguments in search of misinformation, and provides the explanations in form of critical questions to the human user.
[NLP-53] Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations
【Quick Read】: This paper addresses the need for systems that predict trends on social networking services (SNS), specifically by improving the naturalness of interactions generated in a virtual SNS environment. The key is a search extension generation mechanism that mimics human search behavior; simulation experiments in a virtual SNS environment built on large language models (LLMs) confirm that the mechanism markedly improves the naturalness of the generated posts and replies.
Link: https://arxiv.org/abs/2503.14620
Authors: Hikaru Shimadzu,Takehito Utsuro,Daisuke Kitayama
Institutions: School of Informatics, Kogakuin University; Deg. Prog. Sys.&Inf. Eng., Grad. Sch. Sci.&Tech., University of Tsukuba
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:
Abstract:In the 2023 edition of the White Paper on Information and Communications, it is estimated that the population of social networking services in Japan will exceed 100 million by 2022, and the influence of social networking services in Japan is growing significantly. In addition, marketing using SNS and research on the propagation of emotions and information on SNS are being actively conducted, creating the need for a system for predicting trends in SNS interactions. We have already created a system that simulates the behavior of various communities on SNS by building a virtual SNS environment in which agents post and reply to each other in a chat community created by agents using a LLMs. In this paper, we evaluate the impact of the search extension generation mechanism used to create posts and replies in a virtual SNS environment using a simulation system on the ability to generate posts and replies. As a result of the evaluation, we confirmed that the proposed search extension generation mechanism, which mimics human search behavior, generates the most natural exchange.
[NLP-54] Unique Hard Attention: A Tale of Two Sides
【Quick Read】: This paper investigates the expressive power of transformers and how different attention mechanisms affect it. The core question is the relationship between transformers with only leftmost-hard attention and Linear Temporal Logic (LTL). The key results show that transformers with only leftmost-hard attention can express only a strictly weaker fragment of LTL, whereas models with both leftmost- and rightmost-hard attention are equivalent to full LTL. The paper further shows that leftmost-hard-attention models are equivalent to soft attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality (a tiny tie-breaking illustration follows the abstract below).
Link: https://arxiv.org/abs/2503.14615
Authors: Selim Jerad,Anej Svete,Jiaoda Li,Ryan Cotterell
Institutions: ETH Zurich
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:
Abstract:Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention – in that case, they correspond to a \emphstrictly weaker fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to \emphsoft attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.
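The leftmost/rightmost distinction at the heart of the paper is easy to state concretely: with tied attention scores, unique hard attention must break the tie one way or the other. A tiny numeric illustration (toy scores, not a transformer):

```python
# With tied scores, "unique hard attention" must pick a single maximizer.
import numpy as np

scores = np.array([0.2, 0.9, 0.5, 0.9, 0.1])   # positions 1 and 3 tie for the max

leftmost = int(np.argmax(scores))                           # argmax keeps the first max -> 1
rightmost = len(scores) - 1 - int(np.argmax(scores[::-1]))  # -> 3
print(leftmost, rightmost)
```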
[NLP-55] Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
【Quick Read】: This survey addresses the complex and evolving challenge of evaluating machine-generated image captions. As Multimodal Large Language Models (MLLMs) have made captioning a core task, the need for robust and reliable evaluation metrics has grown, yet existing metrics fall short in accuracy, robustness, and adaptability. The key lies in a comprehensive analysis of the evolution, strengths, and limitations of existing metrics, assessed along multiple dimensions such as correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. The survey also examines the new challenges posed by the longer, more detailed captions produced by MLLMs and the adaptability of current metrics to these stylistic variations, exposing the limitations of standard evaluation approaches and suggesting promising directions for future research in image captioning assessment.
Link: https://arxiv.org/abs/2503.14604
Authors: Sara Sarto,Marcella Cornia,Rita Cucchiara
Institutions: University of Modena and Reggio Emilia; IIT-CNR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Repo GitHub: this https URL
Abstract:The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
[NLP-56] Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
【Quick Read】: This paper addresses the scarcity of high-quality digitized Arabic data for developing enterprise-grade Arabic large language models (LLMs). The key is a strategy combining data synthesis with iterative post-training: the Arabic training corpus is expanded through synthetic data generation and human-in-the-loop annotation, and an iterative post-training recipe is designed to better align the model with human preferences, a critical aspect for enterprise use cases. The outcome is an open-weight 7B model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
Link: https://arxiv.org/abs/2503.14603
Authors: Yazeed Alnumay,Alexandre Barbet,Anna Bialas,William Darling,Shaan Desai,Joan Devassy,Kyle Duffy,Stephanie Howe,Olivia Lasche,Justin Lee,Anirudh Shrinivason,Jennifer Tracey
Institutions: Cohere
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post training recipe that is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect to enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
[NLP-57] Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Link: https://arxiv.org/abs/2503.14559
Authors: Weixiong Lin,Chen Ju,Haicheng Wang,Shengchao Hu,Shuai Xiao,Mengting Chen,Yuheng Jiao,Mingshuai Yao,Jinsong Lan,Qingwen Liu,Ying Chen
Institutions: Alibaba Group; Shanghai Jiao Tong University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
[NLP-58] Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs
【Quick Read】: This study examines AI adoption among Finnish healthcare small and medium-sized enterprises (SMEs) through semi-structured interviews with six health-tech companies, identifying three categories of AI engagement: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). The key issue raised is that, although SMEs recognize AI's potential, most remain in early adoption stages. The proposed solution centers on regulatory reforms, talent development, and inter-company collaboration to overcome the main barriers, including regulatory complexity, technical expertise gaps, and financial constraints, offering valuable insights for healthcare organizations, policymakers, and researchers.
Link: https://arxiv.org/abs/2503.14527
Authors: Mohammed Alnajjar,Khalid Alnajjar,Mika Hämäläinen
Institutions: Sparkka Oy; Rootroo Ltd; Metropolia University of Applied Sciences
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:
Abstract:This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI’s potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.
[NLP-59] Policy Frameworks for Transparent Chain-of-Thought Reasoning in Large Language Models
Link: https://arxiv.org/abs/2503.14521
Authors: Yihang Chen,Haikang Deng,Kaiqiao Han,Qingyue Zhao
Institutions: Department of Computer Science, University of California, Los Angeles
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
[NLP-60] Solla: Towards a Speech-Oriented LLM That Hears Acoustic Context
【Quick Read】: This paper addresses the limitation of existing large language models (LLMs) that, when handling multimodal inputs, rely primarily on text instructions and overlook scenarios where spoken instructions and audio are mixed as input. The key is the Solla framework, which understands speech-based questions while concurrently perceiving the acoustic context. Solla integrates an audio tagging module to identify and represent audio events, together with an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and comparable models, the authors build SA-Eval, a new benchmark with three tasks (audio event classification, audio captioning, and audio question answering) and diverse speech instructions in various speaking styles across easy and hard difficulty levels, reflecting real-world acoustic conditions. Experiments show Solla performs on par with or better than baselines on both the easy and hard test sets, confirming its effectiveness at jointly understanding speech and audio.
Link: https://arxiv.org/abs/2503.15338
Authors: Junyi Ao,Dekun Chen,Xiaohai Tian,Wenjie Feng,Jun Zhang,Lu Lu,Yuxuan Wang,Haizhou Li,Zhizheng Wu
Institutions: School of Data Science, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen; Bytedance
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments:
Abstract:Large Language Models (LLMs) have recently shown remarkable ability to process not only text but also multimodal inputs such as speech and audio. However, most existing models primarily focus on analyzing input signals using text instructions, overlooking scenarios in which speech instructions and audio are mixed and serve as inputs to the model. To address these challenges, we introduce Solla, a novel framework designed to understand speech-based questions and hear the acoustic context concurrently. Solla incorporates an audio tagging module to effectively identify and represent audio events, as well as an ASR-assisted prediction method to improve comprehension of spoken content. To rigorously evaluate Solla and other publicly available models, we propose a new benchmark dataset called SA-Eval, which includes three tasks: audio event classification, audio captioning, and audio question answering. SA-Eval has diverse speech instruction with various speaking styles, encompassing two difficulty levels, easy and hard, to capture the range of real-world acoustic conditions. Experimental results show that Solla performs on par with or outperforms baseline models on both the easy and hard test sets, underscoring its effectiveness in jointly understanding speech and audio.
zh
计算机视觉
[CV-0] Cube: A Roblox View of 3D Intelligence
【速读】: 本文旨在构建一个用于3D智能的基础模型,支持开发者在Roblox平台中生成包括3D物体与场景、角色绑定以及行为脚本等所有方面的体验。论文提出了此类3D基础模型的三个关键设计需求,并介绍了迈向实现这一目标的第一步。解决方案的关键在于开发一种针对3D几何形状的核心数据类型及其对应的3D形状标记化方案,该方案能够应用于文本到形状生成、形状到文本生成以及文本到场景生成等任务。此外,通过与现有的大型语言模型(LLMs)协作,这些应用可进一步执行场景分析和推理。最终,论文讨论了构建完全统一的3D智能基础模型的发展路径。
链接: https://arxiv.org/abs/2503.15475
作者: Foundation AI Team Roblox:Kiran Bhat,Nishchaie Khanna,Karun Channa,Tinghui Zhou,Yiheng Zhu,Xiaoxia Sun,Charles Shang,Anirudh Sudarshan,Maurice Chu,Daiqing Li,Kangle Deng,Jean-Philippe Fauconnier,Tijmen Verhulsdonck,Maneesh Agrawala,Kayvon Fatahalian,Alexander Weiss,Christian Reiser,Ravi Kiran Chirravuri,Ravali Kandur,Alejandro Pelaez,Akash Garg,Michael Palleschi,Jessica Wang,Skylar Litz,Leon Liu,Anying Li,David Harmon,Derek Liu,Liangjun Feng,Denis Goupil,Lukas Kuczynski,Jihyun Yoon,Naveen Marri,Peiye Zhuang,Yinan Zhang,Brian Yin,Haomiao Jiang,Marcel van Workum,Thomas Lane,Bryce Erickson,Salil Pathare,Kyle Price,Anupam Singh,David Baszucki
机构: Foundation AI (Foundation AI); Roblox (Roblox)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code and model weights can be found at: this https URL
点击查看摘要
Abstract:Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.
zh
[CV-1] Toward task-driven satellite image super-resolution
【速读】:该论文试图解决的问题是如何评估现有的超分辨率重建算法在实际图像分析任务中的有效性,并通过任务驱动的方式优化这些算法以更好地服务于自动化图像分析。论文的关键解决方案在于提出了一种方法论框架,用于判断现有计算机视觉模型是否适用于评价超分辨率重建算法,并探索如何通过特定任务驱动的方式训练这些模型,从而为选择合适的计算机视觉任务建立坚实的基础,推动真实世界中超分辨率技术能力的提升。
链接: https://arxiv.org/abs/2503.15474
作者: Maciej Ziaja,Pawel Kowaleczko,Daniel Kostrzewa,Nicolas Longépé,Michal Kawulok
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE IGARSS 2024
点击查看摘要
Abstract:Super-resolution is aimed at reconstructing high-resolution images from low-resolution observations. State-of-the-art approaches underpinned with deep learning allow for obtaining outstanding results, generating images of high perceptual quality. However, it often remains unclear whether the reconstructed details are close to the actual ground-truth information and whether they constitute a more valuable source for image analysis algorithms. In the reported work, we address the latter problem, and we present our efforts toward learning super-resolution algorithms in a task-driven way to make them suitable for generating high-resolution images that can be exploited for automated image analysis. In the reported initial research, we propose a methodological approach for assessing the existing models that perform computer vision tasks in terms of whether they can be used for evaluating super-resolution reconstruction algorithms, as well as training them in a task-driven way. We support our analysis with experimental study and we expect it to establish a solid foundation for selecting appropriate computer vision tasks that will advance the capabilities of real-world super-resolution.
zh
[CV-2] EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
【速读】:该论文旨在解决现有以第一人称视角(Egocentric)视频语言预训练方法中缺乏三维理解能力的问题。大多数先前的工作依赖于一维文本或二维视觉提示(如边界框),这些方式本质上无法充分捕捉三维空间信息。为填补这一空白,论文提出了一种名为EgoDTM(Egocentric Depth- and Text-aware Model)的方法,其关键在于通过大规模三维感知视频预训练与视频-文本对比学习相结合的方式进行联合训练。EgoDTM引入了一个轻量级的三维感知解码器,利用深度估计模型生成的伪深度图来高效学习三维感知能力,并通过结合基础模型增强原始简短描述中的手部与物体视觉线索,进一步促进三维感知视频预训练。实验结果验证了EgoDTM在多种下游任务上的优越性能,体现了其卓越的三维感知视觉理解能力。
链接: https://arxiv.org/abs/2503.15470
作者: Boshen Xu,Yuting Mei,Xinbi Liu,Sipeng Zheng,Qin Jin
机构: Renmin University of China (中国人民大学); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code will be released at: this https URL
点击查看摘要
Abstract:Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM’s superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Our code will be released at this https URL.
zh
[CV-3] FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers
【速读】:该论文旨在解决扩散模型(Diffusion Models, DM)在实际部署中的两大关键挑战:其一,经典基于卷积U-Net的DM模型与采用Transformer骨干网络的新一代DiT模型(如PixArt系列、Hunyuan等)在后训练量化(Post-Training Quantization, PTQ)方法上的兼容性不足;其二,现有的整数(Integer, INT)量化方法在低比特设置下未能很好地匹配DiT模型的权重和激活分布,而浮点量化(Floating-Point Quantization, FPQ)虽潜力巨大但尚未得到充分研究。论文的关键解决方案是提出FP4DiT,这是一种基于浮点量化的PTQ方法,实现了W4A6精度,并通过扩展和泛化自适应舍入技术来优化权重量化,同时引入稳健的在线激活量化技术以应对DiT模型中输入补丁数据依赖性的激活特性。实验结果表明,FP4DiT在W4A6和W4A8精度下优于基于整数的PTQ方法,并在PixArt-\alpha、PixArt-\Sigma和Hunyuan等DiT模型上生成了具有竞争力的文本到图像(Text-to-Image, T2I)视觉内容。
链接: https://arxiv.org/abs/2503.15465
作者: Ruichen Chen,Keith G. Mills,Di Niu
机构: Department of Electrical and Computer Engineering (电气与计算机工程系), University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The code is available at this https URL
点击查看摘要
Abstract:Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn’t align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-α, PixArt-Σ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.
zh
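为直观理解上述浮点量化(FPQ)的基本思路,下面给出一个极简的模拟量化草图:先按张量最大值缩放,再把数值就近映射到 FP4 (E2M1) 可表示的网格上。其中 E2M1 的取值集合与逐张量缩放方式均为演示用假设,并非 FP4DiT 论文的官方实现。

```python
import numpy as np

# E2M1(1 位符号、2 位指数、1 位尾数)可表示的非负幅值集合(演示用假设)
FP4_E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: np.ndarray) -> np.ndarray:
    """模拟 FP4 量化:逐张量缩放 -> 最近邻映射到可表示幅值 -> 反缩放还原。"""
    scale = np.abs(x).max() / FP4_E2M1_POS[-1] + 1e-12
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_E2M1_POS).argmin(axis=-1)  # 最近邻取整
    return FP4_E2M1_POS[idx] * np.sign(x) * scale

w = np.random.randn(4, 4).astype(np.float32)
print("最大量化误差:", np.abs(w - fake_quant_fp4(w)).max())
```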
[CV-4] Di[M]O: Distilling Masked Diffusion Models into One-step Generator
【速读】:该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在推理过程中步骤过多导致速度较慢的问题。论文提出了一种名为Di\mathtt[M]O的新方法,通过将多步的MDMs蒸馏为一步生成器来实现高效的生成式建模。解决方案的关键在于解决两个核心挑战:(1) 针对利用中间步骤信息进行一步生成的不可行性,提出了基于令牌级分布匹配的方法,即通过“按策略框架”优化模型输出logits,并借助辅助模型的帮助;(2) 对初始分布熵不足的问题,设计了一种令牌初始化策略,在注入随机性的同时保持与教师训练分布的相似性。实验结果表明,Di\mathtt[M]O在类别条件和文本条件图像生成任务上均表现出色,其性能可媲美多步教师模型的输出,同时显著减少了推理时间。据我们所知,这是首次成功实现掩码扩散模型一步蒸馏的工作,并且首次将离散蒸馏应用于文本到图像生成领域,为高效的生成式建模开辟了新的途径。
链接: https://arxiv.org/abs/2503.15457
作者: Yuanzhi Zhu,Xi Wang,Stéphane Lathuilière,Vicky Kalogeiton
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di[M]O, a novel approach that distills masked diffusion models into a one-step generator. Di[M]O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an ‘on-policy framework’ with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di[M]O’s effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.
zh
[CV-5] MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
【速读】:该论文致力于解决文本条件下的流式运动生成问题,即根据可变长度的历史动作和输入文本预测下一步的人体姿态。现有方法如扩散模型受限于预定义的动作长度,而基于 GPT 的方法则因离散化的非因果标记化面临延迟响应和误差累积的问题。为了解决这些问题,论文提出了一种名为 MotionStreamer 的新框架,其关键在于将连续因果潜在空间引入概率自回归模型中。连续潜在变量减轻了离散化引起的信 息丢失,并在长期自回归生成过程中有效减少了误差累积。此外,通过建立当前动作潜在变量与历史动作潜在变量之间的时序因果依赖关系,模型能够充分利用可用信息实现准确的在线动作解码。实验表明,该方法在性能上优于现有方法,并支持多轮生成、长期生成和动态动作组合等多种应用。
链接: https://arxiv.org/abs/2503.15451
作者: Lixing Xiao,Shunlin Lu,Huaijin Pi,Ke Fan,Liang Pan,Yueer Zhou,Ziyong Feng,Xiaowei Zhou,Sida Peng,Jingbo Wang
机构: Zhejiang University (浙江大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学,深圳); The University of Hong Kong (香港大学); Shanghai Jiao Tong University (上海交通大学); DeepGlint; Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: this https URL
zh
[CV-6] V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception ICRA2025
【速读】:本文旨在解决基于激光雷达的车到一切(LiDAR-based Vehicle-to-Everything, V2X)协同感知在跨域泛化(Domain Generalization, DG)方面的挑战,具体针对3D目标检测任务,研究如何仅通过在源域数据上的训练,使模型在未见过的目标域上仍能保持高性能。现有协同感知算法大多局限于同一数据集的训练与测试,其跨域泛化能力尚未得到充分探索。为此,本文提出了两个关键技术:一是Cooperative Mixup Augmentation based Generalization (CMAG),通过模拟未见的协同场景增强模型的泛化能力;二是Cooperation Feature Consistency (CFC) 约束,用于对齐由CMAG生成的广义协同特征与源域原始早期融合特征,以确保广义特征表示的一致性和鲁棒性。实验结果表明,所提方法在保持源域性能的同时,显著提升了模型在其他未见数据集上的泛化表现。
链接: https://arxiv.org/abs/2503.15435
作者: Baolu Li,Zongzhe Xu,Jinlong Li,Xinyu Liu,Jianwu Fang,Xiaopeng Li,Hongkai Yu
机构: Cleveland State University (克利夫兰州立大学); Carnegie Mellon University (卡内基梅隆大学); Texas A&M University (德克萨斯农工大学); Xi’an Jiaotong University (西安交通大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by ICRA 2025
点击查看摘要
Abstract:LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.
zh
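CMAG 的核心思想之一是用 mixup 式混合来模拟未见过的协同场景。下面是一个与论文实现无关的通用特征级 mixup 草图,其中 BEV 特征的形状与 Beta 分布参数均为演示假设:

```python
import numpy as np

def mixup_features(feat_a: np.ndarray, feat_b: np.ndarray, alpha: float = 0.8):
    """对两个协同智能体的 BEV 特征做线性混合,模拟未见过的协同分布(示意)。"""
    lam = np.random.beta(alpha, alpha)   # mixup 常用的 Beta 采样
    return lam * feat_a + (1.0 - lam) * feat_b, lam

bev_ego  = np.random.randn(64, 100, 100)   # (C, H, W) 自车特征,形状为假设
bev_coop = np.random.randn(64, 100, 100)   # 协同车特征
mixed, lam = mixup_features(bev_ego, bev_coop)
print("lam =", round(lam, 3), "mixed shape:", mixed.shape)
```

论文中的 CFC 约束在此基础上进一步要求混合后的融合特征与源域早期融合特征保持一致(可理解为在两者之间再加一项一致性损失)。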
[CV-7] Visual Position Prompt for MLLM based Visual Grounding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理需要精确定位的任务(如视觉定位)时,难以将坐标与图像内的空间信息精确对齐的问题。这一挑战源于两个关键因素:首先,MLLMs 缺乏显式的空间参考,使得文本描述与图像具体位置的关联变得困难;其次,其特征提取过程更侧重全局上下文而非细粒度的空间细节,导致定位能力较弱。为解决此问题,论文提出了一种名为 VPP-LLaVA 的 MLLM,它引入了视觉位置提示(Visual Position Prompt, VPP)以增强其定位能力。VPP 包含两种互补机制:全局 VPP 通过在输入图像上叠加可学习的轴向嵌入提供结构化的空间线索;局部 VPP 则通过引入位置感知查询,聚焦于细粒度的定位,从而建议可能的对象位置。此外,论文还构建了一个包含 0.6M 样本的 VPP-SFT 数据集,用于高效模型训练。尽管使用较少的训练样本(相比其他依赖大规模数据集的 MLLMs,如 MiniGPT-v2 的约 21M 样本),VPP-LLaVA 在标准定位基准测试中仍达到了最先进的性能。
链接: https://arxiv.org/abs/2503.15426
作者: Wei Tang,Yanpeng Sun,Qinying Gu,Zechao Li
机构: School of Computer Science and Engineering, Nanjing University of Science and Technology (南京理工大学计算机科学与工程学院); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model’s performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ( \sim 21M samples). The code and VPP-SFT dataset will be available at this https URL upon acceptance.
zh
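全局 VPP 的做法可以粗略理解为:在视觉特征图上叠加可学习的轴状位置嵌入,为模型提供结构化的空间参照。下面用 PyTorch 给出一个极简示意,其中通道数、分辨率与“行/列分解”的具体形式均为演示假设,并非 VPP-LLaVA 的实际结构:

```python
import torch
import torch.nn as nn

class GlobalVPP(nn.Module):
    """在特征图上叠加可学习的行/列(轴状)嵌入 —— 对全局 VPP 思路的极简示意。"""
    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        self.row = nn.Parameter(torch.zeros(1, channels, h, 1))  # 行方向空间提示
        self.col = nn.Parameter(torch.zeros(1, channels, 1, w))  # 列方向空间提示

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return feat + self.row + self.col  # 广播叠加,注入结构化空间线索

feat = torch.randn(2, 256, 24, 24)  # 假设的视觉特征 (B, C, H, W)
print(GlobalVPP(256, 24, 24)(feat).shape)
```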
[CV-8] LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
【速读】:该论文旨在解决现有隐式神经表示(Implicit Neural Representations, INRs)框架在内存效率、分辨率适应性以及计算效率方面的局限性,特别是依赖全局潜在向量或在多尺度信息建模上的不足。论文的关键创新在于提出了一种名为LIFT的新型高效框架,通过元学习机制捕获多尺度信息。LIFT结合多个并行的局部隐函数与分层潜在生成器,生成统一的潜在表示,涵盖局部、中间及全局特征,从而实现平滑的局部区域过渡,提升表达能力的同时保持推理效率。进一步地,通过引入ReLIFT这一增强版本,利用残差连接与高表达性的频率编码,有效解决了同类方法中存在的收敛-容量差距问题,提供了一种高效且强大的解决方案以提升模型容量并加速收敛。实验结果表明,LIFT在生成建模与分类任务中达到当前最优性能(SOTA),并显著降低了计算成本。
链接: https://arxiv.org/abs/2503.15420
作者: Amirhossein Kazerouni,Soroush Mehraban,Michael Brudno,Babak Taati
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); University Health Network (大学健康网络)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.
zh
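局部隐函数的基本形态是“坐标(经频率编码)+ 局部潜向量 -> 信号值”的小型 MLP。下面是一个示意草图,其中网络宽度、频率数量与标准 sin/cos 位置编码均为常见做法层面的假设,并非 LIFT/ReLIFT 的原始实现:

```python
import torch
import torch.nn as nn

def freq_encode(x: torch.Tensor, n_freqs: int = 6) -> torch.Tensor:
    """标准 sin/cos 频率编码,对应文中“高表达性频率编码”的一种常见做法(示意)。"""
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin((2 ** k) * torch.pi * x), torch.cos((2 ** k) * torch.pi * x)]
    return torch.cat(out, dim=-1)

class LocalINR(nn.Module):
    """单个局部隐函数:2D 坐标 + 该局部区域的潜向量 -> RGB 值。"""
    def __init__(self, latent_dim: int = 32, n_freqs: int = 6):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 2 * (1 + 2 * n_freqs) + latent_dim   # 坐标编码维度 + 潜向量维度
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 3))

    def forward(self, coords: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        z = latent.expand(coords.shape[0], -1)        # 同一区域共享潜向量
        return self.net(torch.cat([freq_encode(coords, self.n_freqs), z], dim=-1))

coords = torch.rand(1024, 2)    # 归一化像素坐标
latent = torch.randn(1, 32)     # 该局部区域的潜向量
print(LocalINR()(coords, latent).shape)
```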
[CV-9] Temporal Regularization Makes Your Video Generator Stronger
【速读】:该论文试图解决视频生成中时间一致性(temporal coherence)与多样性难以兼顾的问题。解决方案的关键在于提出了一种名为FluxFlow的数据级时间增强策略,它通过施加受控的时间扰动来提升时间质量,而无需对模型架构进行修改,从而在保持空间保真度的同时显著提高时间一致性和多样性。
链接: https://arxiv.org/abs/2503.15417
作者: Harold Haodong Chen,Haojian Huang,Xianfeng Wu,Yexin Liu,Yajing Bai,Wen-Jie Shu,Harry Yang,Ser-Nam Lim
机构: Everlyn AI (亿维智联); HKUST (香港科技大学); UCF (中佛罗里达大学); HKU (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project: this https URL
点击查看摘要
Abstract:Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time, and introduce FluxFlow for initial investigation, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
zh
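数据层面的时间扰动可以简单到“随机交换相邻帧”。下面的草图只展示这一类受控扰动的接口形态,交换次数与具体扰动方式均为演示假设,FluxFlow 的策略细节请以原文为准:

```python
import torch

def temporal_perturb(video: torch.Tensor, num_swaps: int = 2) -> torch.Tensor:
    """对帧序列施加受控时间扰动:随机交换相邻帧若干次(数据层增强示意)。"""
    v = video.clone()
    t = v.shape[0]
    for _ in range(num_swaps):
        i = torch.randint(0, t - 1, (1,)).item()
        v[[i, i + 1]] = v[[i + 1, i]]   # 交换第 i 与 i+1 帧
    return v

clip = torch.randn(16, 3, 64, 64)  # (T, C, H, W) 假设的视频片段
print(temporal_perturb(clip).shape)
```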
[CV-10] Automated Processing of eXplainable Artificial Intelligence Outputs in Deep Learning Models for Fault Diagnostics of Large Infrastructures
【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型在处理图像以识别大型基础设施组件健康状态时可能存在的偏差问题,以及这些模型可能依赖于非因果的捷径(non-causal shortcuts)。为应对这些问题,论文提出了一种新的框架,该框架结合了后验解释(post-hoc explanations)与半监督学习(semi-supervised learning),通过自动识别偏离正确分类图像解释的异常解释(anomalous explanations),从而可能指示模型的异常行为。这一方法显著减少了维护决策者的工作负担,仅需手动重新分类被标记为具有异常解释的图像。解决方案的关键在于利用GradCAM生成的解释和深度半监督异常检测技术,结合特定领域的数据(如电力基础设施监测中的绝缘子壳图像),实现了对两类故障图像平均分类准确率提升8%,并且只需手动重新分类15%的图像。此外,与基于忠实性度量(faithfulness metric)的现有方法相比,该框架在F₁分数上表现出一致性的优势,并成功识别了因非因果捷径导致的正确分类情况。
链接: https://arxiv.org/abs/2503.15415
作者: Giovanni Floreale,Piero Baraldi,Enrico Zio,Olga Fink
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Deep Learning (DL) models processing images to recognize the health state of large infrastructure components can exhibit biases and rely on non-causal shortcuts. eXplainable Artificial Intelligence (XAI) can address these issues but manually analyzing explanations generated by XAI techniques is time-consuming and prone to errors. This work proposes a novel framework that combines post-hoc explanations with semi-supervised learning to automatically identify anomalous explanations that deviate from those of correctly classified images and may therefore indicate model abnormal behaviors. This significantly reduces the workload for maintenance decision-makers, who only need to manually reclassify images flagged as having anomalous explanations. The proposed framework is applied to drone-collected images of insulator shells for power grid infrastructure monitoring, considering two different Convolutional Neural Networks (CNNs), GradCAM explanations and Deep Semi-Supervised Anomaly Detection. The average classification accuracy on two faulty classes is improved by 8% and maintenance operators are required to manually reclassify only 15% of the images. We compare the proposed framework with a state-of-the-art approach based on the faithfulness metric: the experimental results obtained demonstrate that the proposed framework consistently achieves F_1 scores larger than those of the faithfulness-based approach. Additionally, the proposed framework successfully identifies correct classifications that result from non-causal shortcuts, such as the presence of ID tags printed on insulator shells.
zh
[CV-11] Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis
【速读】:本文旨在解决生成式新颖视图合成方法(GNVS)在训练过程中因单目设置导致场景尺度模糊性所带来的影响。传统多视角数据集缺乏度量标定,使得相机位置的尺度具有歧义性,而以往方法通过各种经验性的归一化预处理步骤承认了这一问题,但未直接分析错误场景尺度对其应用的影响。论文的关键在于提出了一种端到端框架,用于与GNVS模型联合估计场景尺度,并定义新的指标来衡量生成视图的尺度不一致性。实验表明,所提方法能够有效减少生成视图的尺度不一致性,且避免了以往尺度归一化方法的复杂性和局限性,同时提升了生成图像的质量。
链接: https://arxiv.org/abs/2503.15412
作者: Fereshteh Forghani,Jason J. Yu,Tristan Aumentado-Armstrong,Konstantinos G. Derpanis,Marcus A. Brubaker
机构: York University (约克大学); Vector Institute for AI (向量人工智能研究所); Samsung AI Centre Toronto (三星多伦多人工智能中心); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.
zh
[CV-12] Visual Persona: Foundation Model for Full-Body Human Customization CVPR2025
【速读】:该论文试图解决的问题是如何通过单张野外(in-the-wild)人体图像及文本描述,生成多样化且高质量的个性化全身定制图像,同时确保生成图像在身体结构和场景变化上与文本描述一致。现有方法主要关注面部身份的保留,而忽视了全身外观细节的表达。为解决这一挑战,论文的关键在于提出了一种名为Visual Persona的基础模型,并构建了一个包含58万配对图像的大型数据集Visual Persona-500K,用于训练模型以捕捉一致的全身外观。此外,论文设计了一种基于Transformer的编码器-解码器架构,将输入图像分割为不同的身体区域,提取局部外观特征并独立投影为密集的身份嵌入,从而指导扩散模型生成定制化图像。这种创新性的方法显著提升了生成图像的质量和多样性。
链接: https://arxiv.org/abs/2503.15406
作者: Jisu Nam,Soowon Son,Zhan Xu,Jing Shi,Difan Liu,Feng Liu,Aashish Misraa,Seungryong Kim,Yang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, Project page is available at this https URL
点击查看摘要
Abstract:We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
zh
[CV-13] Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement CVPR2025
【速读】:本文旨在提升视觉Transformer (Vision Transformer, ViT) 在对抗性迁移攻击中的表现,具体解决的问题是如何通过改进代理模型的前向传播(Forward Propagation Refinement, FPR)来增强对抗样本的可迁移性。现有方法主要集中在后向传播的代理优化,而本文另辟蹊径,聚焦于ViT的两个关键模块:注意力图(attention maps)和标记嵌入(token embeddings)。为实现这一目标,提出了注意力图多样化(Attention Map Diversification, AMD)以增加特定注意力图的多样性,并在后向传播中隐式缓解梯度消失问题;同时引入动量标记嵌入(Momentum Token Embedding, MTE),通过累积历史嵌入信息来稳定注意力与多层感知机(MLP)块的前向更新。实验表明,所提FPR方法在跨不同CNN和ViT模型的对抗性迁移任务中平均优于当前最佳的后向代理优化方法达7.0%。
链接: https://arxiv.org/abs/2503.15404
作者: Yuchen Ren,Zhengyu Zhao,Chenhao Lin,Bo Yang,Lu Zhou,Zhe Liu,Chao Shen
机构: Xi’an Jiaotong University (西安交通大学); Information Engineering University (信息工程大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Zhejiang Lab (之江实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: CVPR2025
点击查看摘要
Abstract:Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at this https URL.
zh
[CV-14] Towards efficient keyword spotting using spike-based time difference encoders
【速读】:该论文旨在解决在边缘设备中关键词 spotting 的低功耗限制问题,特别是在语音助手广泛应用的背景下。传统方法受限于目标嵌入系统的极端低功耗约束,而该研究通过探索时间差编码器(Temporal Difference Encoder, TDE)在关键词 spotting 中的性能提出了解决方案。解决方案的关键在于利用 TDE 模型,该模型通过编码瞬时频率和脉冲计数的时间差异,在类脑处理器上实现高效的关键词识别。研究比较了三种基于尖峰神经网络(Spiking Neural Networks, SNNs)的架构,其中 TDE 网络在准确率(89%)和能耗(比循环 CuBa-LIF 网络少 92% 的突触操作)方面表现出色,并且其结果具有高度可解释性,与数据集中关键词的频率及时标特征密切相关。这表明 TDE 是一种有前景的神经元模型,适用于可扩展的事件驱动时空模式处理。
链接: https://arxiv.org/abs/2503.15402
作者: Alejandro Pequeño-Zurro,Lyes Khacef,Stefano Panzeri,Elisabetta Chicca
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 26 pages, 9 figures
点击查看摘要
Abstract:Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Networks (SNNs) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers with variation in its hidden layer composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits have a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on the accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than the feedforward CuBa-LIF network (71%) and close to the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same amount of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
zh
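TDE 对两路输入脉冲时间差的编码可以用一个极简模型示意:当促进脉冲先于触发脉冲到达时,输出响应随二者时间差指数衰减。下面的草图仅表达这一直觉,时间常数与“促进在前才有输出”的门控规则均为演示假设,并非论文中完整的神经元动力学:

```python
import numpy as np

def tde_response(t_fac: float, t_trig: float, tau: float = 0.02) -> float:
    """极简 TDE 示意:响应强度随 (t_trig - t_fac) 指数衰减,从而编码时间差。"""
    dt = t_trig - t_fac
    return float(np.exp(-dt / tau)) if dt >= 0 else 0.0  # 促进脉冲须先到达

print(tde_response(0.010, 0.012))  # 时间差 2 ms  -> 响应较强
print(tde_response(0.010, 0.050))  # 时间差 40 ms -> 响应较弱
```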
[CV-15] EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models CVPR2025
【速读】:该论文旨在解决多模态大型语言模型在部署过程中因模型复杂性带来的显著挑战,特别是在资源受限设备上的应用难题。解决方案的关键在于提出了一种针对大型视觉-语言模型的自动剪枝方法,以提升多模态推理效率。与传统依赖原始模型训练数据来为不同网络组件选择合适剪枝比例的方法不同,本文方法仅利用少量样本搜索理想的剪枝策略,通过最大化其在未知训练数据上的泛化能力同时保持模型精度,从而实现大型视觉-语言模型在准确性和效率之间的最优权衡。具体而言,研究者基于结构风险最小化原则定义剪枝策略的泛化差距,并结合任务性能和泛化能力迭代搜索最佳剪枝策略,优化视觉投影器以扩展具有更高性能上限的搜索空间。实验结果表明,在ScienceQA等四个数据集上,采用仅64个样本进行剪枝策略搜索的EfficientLLaVA达到了83.05%的准确率,且相比密集型LLaVA-v1.5-7B模型实现了1.8倍的速度提升。
链接: https://arxiv.org/abs/2503.15369
作者: Yinan Liang,Ziwei Wang,Xiuwei Xu,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpus. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large visual language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a 1.8× speedup compared to the dense LLaVA-v1.5-7B model.
zh
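基于小样本的剪枝策略搜索,骨架上就是“采样候选策略 -> 小样本打分 -> 保留最优”的循环。下面的随机搜索草图中,evaluate 只是一个占位函数(真实方法需要运行剪枝后的模型,并按论文中基于结构风险最小化的泛化间隔目标打分),层数与候选剪枝比例也均为演示假设:

```python
import random

def evaluate(policy, samples):
    """占位:返回某个逐层剪枝比例组合在小样本集上的得分(此处用随机函数代替)。"""
    return -sum((r - 0.5) ** 2 for r in policy) + random.gauss(0, 0.01)

def search_policy(num_layers=8, candidates=(0.2, 0.4, 0.6), iters=100, samples=None):
    best, best_score = None, float("-inf")
    for _ in range(iters):
        policy = [random.choice(candidates) for _ in range(num_layers)]  # 每层一个比例
        score = evaluate(policy, samples)
        if score > best_score:
            best, best_score = policy, score
    return best, best_score

policy, score = search_policy()
print("搜索到的剪枝策略:", policy)
```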
[CV-16] Boosting HDR Image Reconstruction via Semantic Knowledge Transfer
【速读】:该论文旨在解决从多个低动态范围(LDR)图像恢复高动态范围(HDR)图像时,当LDR图像存在明显退化和缺失内容的情况下所面临的挑战。论文的关键在于利用场景特定的语义先验来修复严重退化的区域。然而,这些先验通常是从sRGB标准动态范围(SDR)图像中提取的,域/格式差距成为将其应用于HDR成像的重大挑战。为了解决这一问题,作者提出了一种通用框架,通过自蒸馏将来自SDR域的语义知识转移到现有的HDR重建方法中以增强其性能。具体而言,提出的框架首先引入了语义先验引导的重建模型(SPGRM),利用SDR图像的语义知识解决初始HDR重建结果中的不适定问题;随后采用自蒸馏机制,通过语义知识约束颜色和内容信息,使基线与SPGRM之间的外部输出对齐;此外,为了转移内部特征的语义知识,还使用语义知识对齐模块(SKAM)利用互补掩码填充缺失的语义内容。大量实验表明,该方法能够显著提升现有HDR成像的质量。
链接: https://arxiv.org/abs/2503.15361
作者: Qingsen Yan,Tao Hu,Genggeng Chen,Wei Dong,Yanning Zhang
机构: The School of Computer Science, Northwestern Polytechnical University (西北工业大学计算机学院); Xi’an University of Architecture and Technology (西安建筑科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recovering High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB Standard Dynamic Range (SDR) images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a semantic knowledge alignment module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our method can significantly improve the HDR imaging quality of existing methods.
zh
[CV-17] Leveraging Perfect Multimodal Alignment and Gaussian Assumptions for Cross-modal Transfer
【速读】:该论文旨在解决多模态对齐(Multimodal Alignment)的问题,具体目标是在一个联合潜在向量空间中构建两个表示相同概念的不同模态映射到同一向量的能力。论文将此问题形式化为一个逆问题,并证明在特定条件下可以实现完美对齐(Perfect Alignment)。解决方案的关键在于假设语义类别在潜在空间中由高斯混合模型表示,通过将数据点从表示空间投影到分别代表每个模态的子空间中,从而实现跨模态迁移(Cross-modal Transfer),且无需针对新模态进行任何带标签的微调(Unsupervised Cross-modal Transfer)。实验结果验证了所提出方法的有效性。
链接: https://arxiv.org/abs/2503.15352
作者: Abhi Kamboj,Minh N. Do
机构: University of Illinois (伊利诺伊大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Multimodal alignment aims to construct a joint latent vector space where two modalities representing the same concept map to the same vector. We formulate this as an inverse problem and show that under certain conditions perfect alignment can be achieved. We then address a specific application of alignment referred to as cross-modal transfer. Unsupervised cross-modal transfer aims to leverage a model trained with one modality to perform inference on another modality, without any labeled fine-tuning on the new modality. Assuming that semantic classes are represented as a mixture of Gaussians in the latent space, we show how cross-modal transfer can be performed by projecting the data points from the representation space onto different subspaces representing each modality. Our experiments on synthetic multimodal Gaussian data verify the effectiveness of our perfect alignment and cross-modal transfer method. We hope these findings inspire further exploration of the applications of perfect alignment and the use of Gaussian models for cross-modal learning.
zh
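在“语义类为共享潜空间中的高斯混合”这一假设下,跨模态迁移可以理解为:把潜向量正交投影到目标模态对应的子空间,再按最近的(同样投影后的)高斯均值分类。下面是一个玩具级数值示意,子空间基与类均值均为随机构造,且采用等协方差的简化假设:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
basis_a = np.linalg.qr(rng.standard_normal((d, 4)))[0]  # 模态 A 子空间的正交基(假设)
means = rng.standard_normal((3, d))                     # 三个语义类的高斯均值(假设)

def project(z, basis):
    """把潜向量正交投影到某一模态的子空间。"""
    return basis @ (basis.T @ z)

def classify(z, basis):
    """在投影后的子空间中按最近高斯均值分类(等协方差简化)。"""
    zp = project(z, basis)
    return int(np.argmin([np.linalg.norm(zp - project(m, basis)) for m in means]))

z = means[1] + 0.1 * rng.standard_normal(d)  # 在类 1 附近采样一个潜向量
print("预测类别:", classify(z, basis_a))
```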
[CV-18] ruthLens:A Training-Free Paradigm for DeepFake Detection
【速读】:该论文旨在解决由先进AI模型生成的合成图像激增所引发的识别和理解操纵视觉内容的重大挑战。当前的假图像检测方法主要依赖于关注准确性的二元分类模型,但往往忽视了解释性,导致用户无法清楚了解为何某张图像是真实还是伪造的。为弥合这一差距,论文引入了TruthLens,这是一种新颖的无需训练的框架,将深度伪造检测重新构想为视觉问答(Visual Question-Answering, VQA)任务。其关键是利用最先进的大型视觉语言模型(Large Vision-Language Models, LVLMs)来观察和描述视觉伪影,并结合大型语言模型(Large Language Models, LLMs)如GPT-4的推理能力,将证据分析和聚合为明智决策。通过采用多模态方法,TruthLens不仅无缝整合视觉和语义推理以分类图像的真实性,还为其决策提供可解释的说明,从而增强信任并提供有价值的见解,揭示信号合成内容的伪影。广泛的评估表明,TruthLens在具有挑战性的数据集上超越传统方法,同时保持高度的准确性与强大的可解释性。通过将深度伪造检测重塑为推理驱动的过程,TruthLens确立了对抗合成媒体的新范式,结合尖端性能与可解释性以应对日益增长的视觉虚假信息威胁。
链接: https://arxiv.org/abs/2503.15342
作者: Ritabrata Chakraborty,Rajatsubhra Chakraborty,Ali Khaleghi Rahimian,Thomas MacDougall
机构: Manipal University Jaipur; University of North Carolina Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The proliferation of synthetic images generated by advanced AI models poses significant challenges in identifying and understanding manipulated visual content. Current fake image detection methods predominantly rely on binary classification models that focus on accuracy while often neglecting interpretability, leaving users without clear insights into why an image is deemed real or fake. To bridge this gap, we introduce TruthLens, a novel training-free framework that reimagines deepfake detection as a visual question-answering (VQA) task. TruthLens utilizes state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts and combines this with the reasoning capabilities of large language models (LLMs) like GPT-4 to analyze and aggregate evidence into informed decisions. By adopting a multimodal approach, TruthLens seamlessly integrates visual and semantic reasoning to not only classify images as real or fake but also provide interpretable explanations for its decisions. This transparency enhances trust and provides valuable insights into the artifacts that signal synthetic content. Extensive evaluations demonstrate that TruthLens outperforms conventional methods, achieving high accuracy on challenging datasets while maintaining a strong emphasis on explainability. By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm in combating synthetic media, combining cutting-edge performance with interpretability to address the growing threats of visual disinformation.
zh
[CV-19] Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport CVPR2025
【速读】:本文旨在解决开放词汇多标签图像识别中的两个关键挑战:(1) CLIP 的全局预训练目标破坏了局部语义,导致区域预测不可靠;(2) 忽视了图像区域与候选标签之间的匹配特性,仅依赖平均池化等简单特征聚合方法,从而产生由无关区域引起的错误预测。为解决这些问题,论文提出了 RAM(Recover And Match)框架,其关键在于通过引入 Ladder Local Adapter (LLA) 恢复局部语义以聚焦局部区域,并利用 Knowledge-Constrained Optimal Transport (KCOT) 将任务建模为最优传输问题以抑制对非真实标签的无意义匹配,从而有效提升模型性能。
链接: https://arxiv.org/abs/2503.15337
作者: Hao Tan,Zichang Tan,Jun Li,Ajian Liu,Jun Wan,Zhen Lei
机构: SAIS, UCAS (上海科技大学信息科学与技术学院,中国科学院大学); MAIS, Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); SIAT, Chinese Academy of Sciences (深圳先进技术研究院,中国科学院); Sangfor Technologies Inc. (深信服科技股份有限公司); SAI, UCAS (人工智能学院,中国科学院大学); CAIR, HKISI, Chinese Academy of Sciences (香港中文大学深圳研究院,中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision. Recent studies explore the transfer of powerful vision-language models such as CLIP. However, these approaches face two critical challenges: (1) The local semantics of CLIP are disrupted due to its global pre-training objectives, resulting in unreliable regional predictions. (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. To tackle the first problem, we propose Ladder Local Adapter (LLA) to enforce refocusing on local regions, recovering local semantics in a memory-friendly way. For the second issue, we propose Knowledge-Constrained Optimal Transport (KCOT) to suppress meaningless matching to non-GT labels by formulating the task as an optimal transport problem. As a result, RAM achieves state-of-the-art performance on various datasets from three distinct domains, and shows great potential to boost the existing methods. Code: this https URL.
zh
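把区域-标签匹配建模为最优传输问题后,通常用熵正则的 Sinkhorn 迭代求解软匹配计划。下面给出一个通用 Sinkhorn 草图,其中均匀边际与余弦距离代价均为演示假设,未包含论文中 KCOT 的知识约束部分:

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost: torch.Tensor, eps: float = 0.05, iters: int = 50) -> torch.Tensor:
    """熵正则最优传输的 Sinkhorn 迭代(均匀边际),返回区域-标签软匹配矩阵。"""
    m, n = cost.shape
    K = torch.exp(-cost / eps)
    a = torch.full((m,), 1.0 / m)   # 区域侧边际
    b = torch.full((n,), 1.0 / n)   # 标签侧边际
    v = torch.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

regions = torch.randn(16, 512)                  # 16 个图像区域特征(假设)
labels = torch.randn(5, 512)                    # 5 个候选标签的文本特征(假设)
cost = 1 - F.normalize(regions, dim=-1) @ F.normalize(labels, dim=-1).T  # 余弦距离
plan = sinkhorn(cost)
print(plan.shape, float(plan.sum()))            # 传输计划,总质量约为 1
```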
[CV-20] SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes
【速读】:该论文试图解决城市语义分割中纹理网格(textured meshes)未被充分探索的问题。传统研究主要集中在图像或点云的语义分割,而忽略了纹理网格这一富含空间信息的表示形式。论文的关键在于引入SUM Parts,这是一个针对城市纹理网格的大规模数据集,包含21个类别且覆盖约2.5平方公里,并带有部件级语义标签。为实现这一目标,论文开发了一种自有的标注工具,支持基于面和纹理的高效交互式选择标注,从而解决了纹理网格语义分割的标注难题。此外,论文还对该数据集上的3D语义分割及交互式标注方法进行了全面评估。
链接: https://arxiv.org/abs/2503.15300
作者: Weixiao Gao,Liangliang Nan,Hugo Ledoux
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 24 figures
点击查看摘要
Abstract:Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at this https URL.
zh
[CV-21] DCA: Dividing and Conquering Amnesia in Incremental Object Detection AAAI2025
【速读】:本文旨在解决增量目标检测(Incremental Object Detection, IOD)中的灾难性遗忘问题,特别是在基于Transformer的检测框架中,现有方法通过改进知识蒸馏和示例重放取得了一定进展,但对内在遗忘机制的研究尚不充分。研究发现,基于Transformer的IOD存在定位与识别之间的遗忘不平衡现象:定位能力较少受到遗忘影响且能够泛化到新类别,而识别则容易发生灾难性遗忘。为此,论文提出了一种分而治之的遗忘(Divide-and-Conquer Amnesia, DCA)策略,将基于Transformer的IOD重新设计为先定位后识别的过程,从而有效保持并迁移定位能力,并专注于解决脆弱的识别问题。为减少识别过程中的特征漂移,利用预训练语言模型编码的语义知识,在增量任务中锚定类别表示于统一特征空间内,并通过双路分类器融合以及以查询形式嵌入类别语义特征至识别解码过程来实现这一目标。实验表明,该方法在长周期增量场景下达到最先进的性能,例如在MS-COCO的四步设置中显著提升了最终平均精度(AP)达6.9%。关键在于通过DCA策略分离定位与识别任务,并结合语义知识增强识别模块的稳定性。
链接: https://arxiv.org/abs/2503.15295
作者: Aoting Zhang,Dongbao Yang,Chang Liu,Xiaopeng Hong,Miao Shang,Yu Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.
zh
[CV-22] Test-Time Backdoor Detection for Object Detection Models CVPR2025
【速读】:该论文旨在解决对象检测模型在测试阶段检测被投毒样本(即包含预定义触发器的样本)的问题。与图像分类任务不同,对象检测的独特特性(尤其是其输出众多目标的特点)以及复杂的攻击效果(如“幽灵”物体的出现或“消失”物体的现象),使得现有防御方法难以有效应对。为了解决这一问题,论文提出了一种名为TRAnsformation Consistency Evaluation (TRACE) 的新方法。该方法的关键在于基于两个有趣的观察:(1) 投毒样本在不同背景下表现出显著更高的检测结果一致性;(2) 清洁样本在引入不同的焦点信息时表现出更高的检测一致性。TRACE通过对每个测试样本进行前景和背景变换,并通过计算目标置信度的方差来评估变换一致性,从而实现黑盒、通用的后门检测。实验表明,TRACE相比最先进的防御方法在AUROC指标上提升了30%,并对自适应攻击具有鲁棒性。
链接: https://arxiv.org/abs/2503.15293
作者: Hangtao Zhang,Yichen Wang,Shihui Yan,Chenyu Zhu,Ziqi Zhou,Linshan Hou,Shengshan Hu,Minghui Li,Yanjun Zhang,Leo Yu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
点击查看摘要
Abstract:Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate prediction. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection – particularly its output of numerous objects – pose fresh challenges for backdoor detection. The complex attack effects (e.g., “ghost” object emergence or “vanishing” object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds. (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in objects confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.
zh
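TRACE 的判别信号是“多种变换下检测结果的一致性”。下面的草图用目标置信度的方差来量化这一点:检测器与变换均为玩具替身,仅示意接口形态;按论文的观察,投毒样本在变换下更一致,即方差越小越可疑。

```python
import numpy as np

def consistency_score(detector, image, transforms) -> float:
    """对同一张图施加多种前景/背景变换后重复检测,
    以目标置信度的方差衡量检测一致性(方差小 -> 更可疑,示意实现)。"""
    confs = []
    for t in transforms:
        dets = detector(t(image))                       # 假设返回 [(label, conf), ...]
        confs.append(max([c for _, c in dets], default=0.0))
    return float(np.var(confs))

# 玩具检测器与变换,仅用于演示调用方式
detector = lambda img: [("car", float(img.mean() % 1))]
transforms = [lambda x: x, lambda x: x + 0.1, lambda x: x * 0.9]
print("一致性得分(方差):", consistency_score(detector, np.random.rand(8, 8), transforms))
```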
[CV-23] PAPI-Reg: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image
【速读】:该论文旨在解决跨模态数据融合中激光雷达点云与相机图像精确配准的问题,传统方法依赖耗时的外部标定板或特定环境特征,而现有基于直接配准的方法因点云与图像之间的领域差距难以同时保证高精度与实时性。论文的关键解决方案在于提出一种框架,将点云投影到多个二维表示以匹配相机图像,从而更有效地利用点云的几何特性并缩小点云与图像间的领域差距。此外,通过引入多尺度特征提取网络和基于块到像素的匹配网络,进一步应对跨模态差异及点云与图像重叠区域有限带来的挑战,实现更有效的监督与更高精度的配准。实验验证表明,该方法在KITTI和nuScenes数据集上实现了实时性能及超过99%的注册精度。
链接: https://arxiv.org/abs/2503.15285
作者: Yuanchao Yue,Zhengxin Li,Wei Zhang,Hui Yuan
机构: Key Laboratory of Machine Intelligence and System Control, Ministry of Education (智能与系统控制重点实验室, 教育部); School of Control Science and Engineering, Shandong University (控制科学与工程学院, 山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristic of LiDAR point clouds more effectively but also bridge the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99%.
zh
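把点云投影为 2D 表示是这类框架的第一步。下面给出针孔模型下“点云转深度图”的极简草图,假设点已变换到相机坐标系,且忽略遮挡时的像素覆盖顺序;内参与点云均为演示数据:

```python
import numpy as np

def project_to_depth_map(points: np.ndarray, K: np.ndarray, h: int, w: int) -> np.ndarray:
    """用针孔内参 K 把相机坐标系下的点云投影为 2D 深度图(示意)。"""
    z = points[:, 2]
    valid = z > 0                                  # 只保留相机前方的点
    uvw = (K @ points[valid].T).T                  # (N, 3) 齐次像素坐标
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.zeros((h, w), dtype=np.float32)
    keep = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[keep], u[keep]] = z[valid][keep]       # 简化:不处理遮挡覆盖顺序
    return depth

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pts = np.random.rand(1000, 3) * [10, 10, 20]       # 假设已在相机坐标系
print(project_to_depth_map(pts, K, 480, 640).shape)
```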
[CV-24] EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds
【速读】:该论文旨在解决跨模态数据配准中因计算约束导致精度损失以及跨模态特征差异引起的匹配困难问题。现有方法通常通过下采样原始点云和图像数据来缓解计算压力,但不可避免地牺牲了精度;同时,不同模态采用的特征提取器产生的高维特征需要特定技术来弥合跨模态差异以实现有效匹配。论文的关键解决方案在于利用原始点云和图像的边缘信息进行跨模态配准,通过提取边缘点和边缘像素保留原始数据的重要信息,在保证计算效率的同时提升配准准确性。此外,引入基于注意力机制的特征交换模块消除跨模态差异,并结合最优匹配层优化对应关系识别,从而实现更精确且高效的跨模态配准。
链接: https://arxiv.org/abs/2503.15284
作者: Yuanchao Yue,Hui Yuan,Qinglong Miao,Xiaolong Mao,Raouf Hamzaoui,Peter Eisert
机构: School of Control Science and Engineering, Shandong University, Jinan, Shandong, China (山东大学控制科学与工程学院,济南,中国); School of software, Shandong University, Jinan, Shandong, China (山东大学软件学院,济南,中国); School of Engineering and Sustainable Development, De Montfort University, LE1 9BH Leicester, U.K. (德蒙福特大学工程与可持续发展学院,英国莱斯特LE1 9BH); Institut für Informatik, Humboldt-Universität zu Berlin, Germany (柏林洪堡大学计算机研究所,德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems’ accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.
zh
[CV-25] TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models
【速读】:该论文试图解决Text-and-Image-To-Image (TI2I) 领域中现有方法在利用多图像输入时效果不佳的问题,具体表现为部分方法仅关注特定元素(如对象或风格),而另一些方法在处理复杂多图像指令时生成质量下降。为克服这些挑战,论文提出了一种无需额外训练的Training-Free Text-and-Image-to-Image (TF-TI2I) 方法。其关键在于利用MM-DiT架构,并通过从参考图像中提取精简的视觉表示以及引入Reference Contextual Masking技术,实现对与文本指令相关的视觉信息的选择性共享。此外,Winner-Takes-All模块进一步缓解了分布偏移问题,确保每个视觉标记都能优先参考最相关的上下文信息。为评估TI2I方法,论文还提出了FG-TI2I Bench基准,验证了所提方法在多种基准测试中的鲁棒性能。
链接: https://arxiv.org/abs/2503.15283
作者: Teng-Fang Hsiao,Bo-Kai Ruan,Yi-Lun Wu,Tzu-Ling Lin,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking – this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.
[CV-26] Challenges and Trends in Egocentric Vision: A Survey
Quick Read: This survey systematically reviews research on egocentric vision understanding, analyzing the field's main tasks, challenges, and trends while providing resources and directions for future work. It decomposes egocentric scenes into four main task categories: subject understanding, object understanding, environment understanding, and hybrid understanding, and examines the sub-tasks within each. It also summarizes high-quality datasets as important resources for follow-up research. The key contributions lie in the systematic task taxonomy, the in-depth analysis of technical difficulties, and the outlook on future directions, promoting the broad adoption of egocentric vision in augmented reality (AR), virtual reality (VR), and embodied intelligence.
Link: https://arxiv.org/abs/2503.15275
Authors: Xiang Li, Heqian Qiu, Lanxiao Wang, Hanwen Zhang, Chenghao Qi, Linfeng Han, Huiyu Xiong, Hongliang Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.
[CV-27] DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Quick Read: This paper addresses the limitations of existing auto-regressive approaches to structured mesh generation, which are constrained by limited face counts and mesh incompleteness. The proposed DeepMesh framework rests on two key innovations: (1) an efficient pre-training strategy with a novel tokenization algorithm and improved data curation and processing; and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation, aligning outputs with human preferences via Direct Preference Optimization (DPO). A scoring standard combining human evaluation with 3D metrics is designed to collect the preference pairs needed for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with rich detail and precise topology, surpassing state-of-the-art methods in both precision and quality.
Link: https://arxiv.org/abs/2503.15265
Authors: Ruowen Zhao, Junliang Ye, Zhengyi Wang, Guangce Liu, Yiwen Chen, Yikai Wang, Jun Zhu
Affiliations: Tsinghua University; Nanyang Technological University; ShengShu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Triangle meshes play a crucial role in 3D applications for efficient manipulation and rendering. While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). We design a scoring standard that combines human evaluation with 3D metrics to collect preference pairs for DPO, ensuring both visual appeal and geometric accuracy. Conditioned on point clouds and images, DeepMesh generates meshes with intricate details and precise topology, outperforming state-of-the-art methods in both precision and quality. Project page: this https URL
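The DPO step described above follows the standard direct-preference-optimization objective. Below is a minimal PyTorch sketch of that generic loss, assuming sequence log-likelihoods for a preferred and a rejected mesh are already computed; DeepMesh's tokenization and scoring details are omitted, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on sequence log-likelihoods.

    logp_*     : summed token log-probs under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    """
    # Log-likelihood ratios of the trained policy vs. the frozen reference
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # The preferred sample should out-score the rejected one
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# toy usage with random log-likelihoods for a batch of 4 preference pairs
lp_c, lp_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```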
[CV-28] LEGION: Learning to Ground and Explain for Synthetic Image Detection
Quick Read: This paper targets two gaps in synthetic image detection: existing methods lack artifact-level textual interpretability and focus too narrowly on image manipulation detection, while current datasets rely on outdated generators and lack fine-grained annotations. The paper introduces the SynthScars dataset and the LEGION framework. SynthScars contains 12,236 fully synthetic images with high-quality human-expert annotations, spanning diverse content types, multiple artifact categories, and fine-grained labels. LEGION is a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation; beyond detection, it is further explored as a controller inside image refinement pipelines to guide the generation of higher-quality, more realistic images. Experiments show that LEGION outperforms existing methods across multiple benchmarks, beating the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score, and images refined under its guidance align better with human preferences.
Link: https://arxiv.org/abs/2503.15264
Authors: Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Beihang University; Sun Yat-Sen University; SenseTime Research; OpenDataLab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, while current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.
[CV-29] DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation
Quick Read: To reduce the heavy annotation cost of fully supervised medical image segmentation, this paper proposes a weakly supervised scheme that combines Deep Extreme Point Tracing (DEPT) with Feature-Guided Extreme Point Masking (FGEPM). The key idea is to generate pseudo labels by finding, on a cost matrix built from the feature map, the lowest-cost path connecting all extreme points, and then to refine these pseudo labels progressively with an iterative training strategy so that the network improves continuously. Experiments on two public datasets show that the method approaches fully supervised performance and outperforms several existing weakly supervised methods.
Link: https://arxiv.org/abs/2503.15260
Authors: Lei Shi, Xi Fang, Naiyu Wang, Junxing Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automatic medical image segmentation plays a crucial role in computer aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.
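The abstract states that pseudo labels come from the lowest-cost path connecting the extreme points on a feature-map-based cost matrix. One standard way to compute such a path is Dijkstra's algorithm on the pixel grid; the sketch below illustrates that idea under simplified assumptions (4-connectivity, a random cost map, illustrative names), not the paper's exact procedure:

```python
import heapq
import numpy as np

def lowest_cost_path(cost, start, goal):
    """Dijkstra on a 2D cost map; a path's cost is the sum of the
    per-pixel costs of the cells it enters (4-connectivity)."""
    h, w = cost.shape
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[start] = cost[start]
    pq = [(cost[start], start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist[r, c]:          # stale heap entry
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and d + cost[nr, nc] < dist[nr, nc]:
                dist[nr, nc] = d + cost[nr, nc]
                prev[(nr, nc)] = (r, c)
                heapq.heappush(pq, (dist[nr, nc], (nr, nc)))
    path, node = [goal], goal       # backtrack from goal to start
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# toy cost map standing in for feature-derived costs; four extreme points
cost = np.random.rand(64, 64).astype(np.float32)
extremes = [(5, 30), (30, 60), (60, 30), (30, 5)]   # top/right/bottom/left
boundary = []
for a, b in zip(extremes, extremes[1:] + extremes[:1]):
    boundary += lowest_cost_path(cost, a, b)        # closed pseudo-boundary
```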
[CV-30] CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification CVPR2025
Quick Read: This paper addresses the inflexibility of concept-based post-hoc explanation methods in automatically constructing accurate and sufficient linguistic explanations for global concepts and local circuits, emphasizing in particular that the inherent polysemanticity of semantic Visual Concepts (VCs) hurts model interpretability. The key contribution is a Chain-of-Explanation (CoE) approach: it automates the decoding and description of visual concepts to build global concept explanation datasets; it designs a concept polysemanticity disentanglement and filtering mechanism to identify the most contextually relevant concept atoms; it formulates Concept Polysemanticity Entropy (CPE) as a measure of model interpretability that quantifies concept uncertainty, upgrading deterministic concept modeling to uncertain concept-atom distributions; and it produces linguistic local explanations of a deep vision model's decision process by tracing concept circuits. Experiments validate the effectiveness of CPE and the superiority of CoE, with an average absolute improvement of 36% in explainability scores.
Link: https://arxiv.org/abs/2503.15234
Authors: Wenlong Yu, Qilong Wang, Chuang Liu, Dong Li, Qinghua Hu
Affiliations: Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR2025
Abstract:Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field are inflexible: they cannot automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. Particularly, the intrinsic polysemanticity in semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, and has been severely underestimated. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Finally, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement of 36% in terms of explainability scores.
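The abstract introduces Concept Polysemanticity Entropy (CPE) to quantify concept uncertainty but does not give its formula here. A natural entropy-based reading scores a concept by the Shannon entropy of its distribution over concept atoms; the snippet below is that hypothetical reading, not the paper's exact definition:

```python
import numpy as np

def concept_polysemanticity_entropy(atom_probs):
    """Shannon entropy of a concept's atom distribution: a concept whose
    activations spread over many atoms (polysemantic) scores high, while
    a monosemantic concept scores near zero."""
    p = np.asarray(atom_probs, dtype=np.float64)
    p = p / p.sum()          # normalize to a probability distribution
    p = p[p > 0]             # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

print(concept_polysemanticity_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17, monosemantic
print(concept_polysemanticity_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39, polysemantic
```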
[CV-31] GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector CVPR2025
Quick Read: This paper tackles the challenge of building accurate voxel representations for NeRF-based multi-view 3D object detection, where occlusion and the lack of 3D information make constructing 3D features from multi-view 2D images difficult. The key innovations are a voxel optimization mechanism that embeds 3D positional information to fuse multi-view features, and a double importance sampling scheme that prioritizes neural field reconstruction in object regions. An opacity optimization module enforces multi-view consistency constraints for precise voxel opacity prediction, and ray distance is incorporated as a weighting factor to minimize cumulative ray errors and further improve voxel density consistency across perspectives. These modules work together as an end-to-end neural model that establishes a new state of the art in NeRF-based multi-view 3D detection, verified by extensive experiments on ScanNet and ARKitScenes.
Link: https://arxiv.org/abs/2503.15211
Authors: Zechuan Li, Hongshan Yu, Yihao Ding, Jinhao Qiao, Basim Azam, Naveed Akhtar
Affiliations: Hunan University; The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025
Abstract:We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. The key to accurate 3D object detection is in effective voxel representation. However, due to occlusion and lack of 3D information, constructing 3D features from multi-view 2D images is challenging. Addressing that, we introduce a unique 3D positional information embedded voxel optimization mechanism to fuse multi-view features. To prioritize neural field reconstruction in object regions, we also devise a double importance sampling scheme for the NeRF branch of our detector. We additionally propose an opacity optimization module for precise voxel opacity prediction by enforcing multi-view consistency constraints. Moreover, to further improve voxel density consistency across multiple perspectives, we incorporate ray distance as a weighting factor to minimize cumulative ray errors. Our unique modules synergistically form an end-to-end neural model that establishes new state-of-the-art in NeRF-based multi-view 3D detection, verified with extensive experiments on ScanNet and ARKitScenes. Code will be available at this https URL.
[CV-32] DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Quick Read: This paper addresses the inability of existing generative models to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. The key challenge is finding an efficient, generalizable geometric representation that seamlessly connects temporal and spatial synthesis. DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, uses metric depth as this core geometric representation and decomposes the problem into two diffusion processes: DiST-T predicts future metric depth and multi-view RGB sequences directly from past observations, while DiST-S enables spatial NVS by training only on existing viewpoints under a cycle-consistency constraint. This cycle-consistency mechanism introduces a forward-backward rendering constraint that narrows the generalization gap between observed and unseen viewpoints. Because metric depth provides a view-consistent geometric representation that generalizes well to unseen perspectives, it is essential both for reliable forecasting and for accurate spatial NVS. Experiments show that DiST-4D achieves state-of-the-art performance on both temporal prediction and NVS, while remaining competitive on planning-related evaluations.
Link: https://arxiv.org/abs/2503.15208
Authors: Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao
Affiliations: THU; MEGVII; Mach Drive; SJTU; HKU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential both for reliable forecasting and for accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
[CV-33] Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization CVPR25
Quick Read: This paper addresses the risk that text-to-image diffusion models can be misused to create harmful content. Existing interventions such as concept unlearning and safety guidance fine-tune model weights or adapt hidden states in ways that are not interpretable, and they severely disturb the sampling trajectory when erasing harmful concepts from complex multi-concept prompts, limiting practical use. The proposed Detect-and-Guide (DAG) framework instead leverages the diffusion model's internal knowledge to perform self-diagnosis and fine-grained self-regulation during sampling: it first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to suppress unsafe generation. The optimization requires only a small annotated dataset yet yields precise detection maps with generalizability and concept specificity, and because DAG needs no fine-tuning of the diffusion model, it preserves generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe-generation performance, balancing harm mitigation and text-following on multi-concept real-world prompts.
Link: https://arxiv.org/abs/2503.15197
Authors: Feifei Li, Mi Zhang, Yiming Sun, Min Yang
Affiliations: School of Computer Science, Fudan University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR25
Abstract:Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.
[CV-34] Benchmarking Large Language Models for Handwritten Text Recognition
Quick Read: This paper addresses the shortcomings of traditional Handwritten Text Recognition (HTR) models, which depend on extensive manual annotation and often produce errors because layout and text processing are handled separately. It proposes using Multimodal Large Language Models (MLLMs) as a general approach that recognizes diverse handwriting styles without model-specific training, benchmarking various proprietary and open-source LLMs against Transkribus models on both modern and historical datasets, with particular attention to the models' ability to autonomously correct their own previously generated outputs. Findings: proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings; MLLMs excel at recognizing modern handwriting but show a preference for English, reflecting the composition of their pre-training data; comparisons with Transkribus show no consistent advantage for either approach; and LLMs have only limited ability to autonomously correct errors in zero-shot transcriptions.
Link: https://arxiv.org/abs/2503.15195
Authors: Giorgia Crosilla, Lukas Klic, Giovanni Colavizza
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models’ ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.
[CV-35] 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation CVPR2025
Quick Read: This paper studies how voxel query resolution affects view-transformation quality in camera-based 3D occupancy prediction. Computational constraints and the need for real-time deployment typically force lower query resolutions, which inevitably causes information loss. The proposed ProtoOcc counters this by leveraging prototypes of clustered image segments to enrich the low-resolution context: mapping 2D prototypes onto 3D voxel queries encodes high-level visual geometry and compensates for the spatial information lost at reduced resolution. A multi-perspective decoding strategy then efficiently disentangles the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experiments on the Occ3D and SemanticKITTI benchmarks show clear improvements over the baselines, and ProtoOcc remains competitive even with the voxel resolution reduced by 75%.
Link: https://arxiv.org/abs/2503.15185
Authors: Gyeongrok Oh, Sungjune Kim, Heeju Ko, Hyung-gun Chi, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sungjoon Choi, Sujin Jang, Sangpil Kim
Affiliations: Korea University; Purdue University; AI Center, DS Division, Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR2025
Abstract:The resolution of voxel queries significantly influences the quality of view transformation in camera-based 3D occupancy prediction. However, computational constraints and the practical necessity for real-time deployment require smaller query resolutions, which inevitably leads to an information loss. Therefore, it is essential to encode and preserve rich visual details within limited query sizes while ensuring a comprehensive representation of 3D occupancy. To this end, we introduce ProtoOcc, a novel occupancy network that leverages prototypes of clustered image segments in view transformation to enhance low-resolution context. In particular, the mapping of 2D prototypes onto 3D voxel queries encodes high-level visual geometries and complements the loss of spatial information from reduced query resolutions. Additionally, we design a multi-perspective decoding strategy to efficiently disentangle the densely compressed visual cues into a high-dimensional 3D occupancy scene. Experimental results on both Occ3D and SemanticKITTI benchmarks demonstrate the effectiveness of the proposed method, showing clear improvements over the baselines. More importantly, ProtoOcc achieves competitive performance against the baselines even with 75% reduced voxel resolution.
[CV-36] World Models in Artificial Intelligence: Sensing Learning and Reasoning Like a Child
Quick Read: This paper argues that although world models are widely used in AI, they lack the structured, adaptive representations that even young children develop intuitively, making it hard to move beyond mere pattern recognition. Achieving genuine reasoning requires dynamic, interpretable frameworks grounded in Piaget's theory of cognitive development.

The key to the proposed path forward is a focus on six research areas: physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition toward genuine understanding, adaptation, and reasoning.
Link: https://arxiv.org/abs/2503.15168
Authors: Javier Del Ser, Jesus L. Lobo, Heimo Müller, Andreas Holzinger
Affiliations: TECNALIA, Basque Research & Technology Alliance (BRTA); Department of Mathematics, University of the Basque Country (UPV/EHU); Medical University Graz; University of Natural Resources and Life Sciences Vienna
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 11 pages, 1 figure
Abstract:World Models help Artificial Intelligence (AI) predict outcomes, reason about its environment, and guide decision-making. While widely used in reinforcement learning, they lack the structured, adaptive representations that even young children intuitively develop. Advancing beyond pattern recognition requires dynamic, interpretable frameworks inspired by Piaget’s cognitive development theory. We highlight six key research areas – physics-informed learning, neurosymbolic learning, continual learning, causal inference, human-in-the-loop AI, and responsible AI – as essential for enabling true reasoning in AI. By integrating statistical learning with advances in these areas, AI can evolve from pattern recognition to genuine understanding, adaptation and reasoning capabilities.
[CV-37] UltraFlwr – An Efficient Federated Medical and Surgical Object Detection Framework MICCAI
Quick Read: This paper addresses the real-world edge-deployment challenges facing medical and surgical object detection models, including scarce high-quality annotated data, data-sharing restrictions, and computational constraints. The key innovation is the UltraFlwr framework, which uses Federated Learning (FL) to train models across multiple sites in a decentralized way without transmitting raw data. To further improve efficiency, it introduces YOLO-PA, a set of novel Partial Aggregation (PA) strategies designed specifically for YOLO models in FL, which reduces per-round communication overhead by up to 83% while maintaining performance comparable to Full Aggregation (FA) strategies. Experiments show that YOLO-PA not only yields better client models than client-wise centralized training and FA strategies, but also enables efficient training and deployment on resource-constrained edge devices; the work additionally establishes one of the first benchmarks for federated medical and surgical object detection, advancing the feasibility of edge-deployed detection models for time-critical, resource-constrained medical and surgical applications.
Link: https://arxiv.org/abs/2503.15161
Authors: Yang Li, Soumya Snigdha Kundu, Maxence Boels, Toktam Mahmoodi, Sebastien Ourselin, Tom Vercauteren, Prokar Dasgupta, Jonathan Shapey, Alejandro Granados
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures, under review @ MICCAI
Abstract:Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, it faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at this https URL.
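Partial Aggregation reduces communication by averaging only a subset of the model's layers across clients and keeping the rest local. The sketch below shows that mechanism in generic PyTorch; which YOLO layers are actually shared is a design choice of the paper and is not reproduced here, and all names are illustrative:

```python
import torch

def partial_fedavg(client_states, shared_keys, weights=None):
    """FedAvg restricted to a subset of parameter tensors.

    client_states : list of state_dicts from the clients
    shared_keys   : names of the layers that are communicated and
                    aggregated; every other layer stays local.
    """
    n = len(client_states)
    weights = weights or [1.0 / n] * n
    agg = {}
    for k in shared_keys:
        agg[k] = sum(w * s[k].float() for w, s in zip(weights, client_states))
    return agg  # broadcast back; clients load it with strict=False

# toy usage: two "clients" with identical two-conv architectures
def net():
    return torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.Conv2d(8, 8, 3))

c1, c2 = net().state_dict(), net().state_dict()
shared = [k for k in c1 if k.startswith("0.")]   # e.g. aggregate the first block only
global_update = partial_fedavg([c1, c2], shared)
print(sorted(global_update))                     # ['0.bias', '0.weight']
```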
[CV-38] ARC: Anchored Representation Clouds for High-Resolution INR Classification ICLR2025
Quick Read: This paper targets two weaknesses of current INR-based image classification: poor performance beyond low-resolution data and sensitivity to image-space transformations. The authors attribute both to the global, fully-connected MLP architecture of existing INRs, which lacks any mechanism for local representation: MLPs are sensitive to absolute image position and struggle with high-frequency detail. The key idea of the proposed ARC (Anchored Representation Clouds) is a novel INR architecture that explicitly anchors latent vectors locally in image space; by giving the latent vectors spatial structure, ARC captures local image content, achieving state-of-the-art implicit image classification on both low- and high-resolution images and increased robustness to image-space translation.
Link: https://arxiv.org/abs/2503.15156
Authors: Joost Luijmes, Alexander Gielisse, Roman Knyazhitskiy, Jan van Gemert
Affiliations: Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the ICLR 2025 Workshop on Neural Network Weights as a New Data Modality
Abstract:Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at this https URL.
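ARC's core idea, latent vectors anchored at image-space positions, can be illustrated with a toy module that decodes a query pixel from a distance-weighted blend of its nearest anchors. This is a simplified reading for intuition only; the paper's actual architecture, anchor layout, and decoder differ, and every name below is illustrative:

```python
import torch

class AnchoredLatents(torch.nn.Module):
    """Latents anchored at fixed image-space positions; a query pixel is
    decoded from a distance-weighted blend of its nearest anchors."""
    def __init__(self, n_anchors=64, dim=32, k=4):
        super().__init__()
        self.pos = torch.nn.Parameter(torch.rand(n_anchors, 2))      # in [0,1]^2
        self.lat = torch.nn.Parameter(torch.randn(n_anchors, dim) * 0.1)
        self.k = k
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim + 2, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))

    def forward(self, xy):                       # xy: (B, 2) query coordinates
        d = torch.cdist(xy, self.pos)            # (B, N) distances to anchors
        d_k, idx = d.topk(self.k, largest=False) # k nearest anchors per query
        w = torch.softmax(-d_k, dim=-1)          # closer anchors weigh more
        lat = (w.unsqueeze(-1) * self.lat[idx]).sum(1)
        rel = xy - (w.unsqueeze(-1) * self.pos[idx]).sum(1)  # relative offset
        return self.mlp(torch.cat([lat, rel], dim=-1))       # e.g. an RGB value

model = AnchoredLatents()
print(model(torch.rand(5, 2)).shape)  # torch.Size([5, 3])
```

Because the decoder only sees the offset to nearby anchors rather than the absolute coordinate, representations built this way are naturally less sensitive to global translation, which matches the robustness the abstract reports.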
[CV-39] PointSFDA: Source-free Domain Adaptation for Point Cloud Completion
Quick Read: This paper addresses the significant performance drop that point cloud completion methods, typically trained on synthetic datasets, suffer on out-of-distribution real-world scans. It proposes PointSFDA, a source-free domain adaptation framework that uses only a pretrained source model and unlabeled target data, avoiding the need for source data that is often inaccessible in practice. The two key contributions are: a coarse-to-fine distillation scheme that explicitly transfers the global geometry knowledge learned from the source dataset, and, since domain gaps may introduce noise, a self-supervised partial-mask consistency training strategy that learns local geometry information in the target domain. Extensive experiments validate that the method significantly improves the performance of state-of-the-art networks in cross-domain shape completion.
Link: https://arxiv.org/abs/2503.15144
Authors: Xing He, Zhe Zhu, Liangliang Nan, Honghua Chen, Jing Qin, Mingqiang Wei
Affiliations: School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, China; Urban Data Science Section, Delft University of Technology, Netherlands; School of Nursing, The Hong Kong Polytechnic University, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conventional methods for point cloud completion, typically trained on synthetic datasets, face significant challenges when applied to out-of-distribution real-world scans. In this paper, we propose an effective yet simple source-free domain adaptation framework for point cloud completion, termed PointSFDA. Unlike unsupervised domain adaptation that reduces the domain gap by directly leveraging labeled source data, PointSFDA uses only a pretrained source model and unlabeled target data for adaptation, avoiding the need for inaccessible source data in practical scenarios. Being the first source-free domain adaptation architecture for point cloud completion, our method offers two core contributions. First, we introduce a coarse-to-fine distillation solution to explicitly transfer the global geometry knowledge learned from the source dataset. Second, as noise may be introduced due to domain gaps, we propose a self-supervised partial-mask consistency training strategy to learn local geometry information in the target domain. Extensive experiments have validated that our method significantly improves the performance of state-of-the-art networks in cross-domain shape completion. Our code is available at this https URL.
[CV-40] Object-Centric Pretraining via Target Encoder Bootstrapping ICLR2025
Quick Read: This paper addresses the performance ceiling of existing object-centric representation learning methods, which rely on frozen features from pretrained non-object-centric foundation models as reconstruction targets for slot attention. Because these targets stay frozen throughout training, they cap the performance an object-centric model can reach; naively bootstrapping the target encoder causes large performance drops, since the lack of object-centric inductive biases lets the object-centric model's encoder drift away from representations useful as reconstruction targets.

The proposed OCEBO (Object-CEntric Pretraining by Target Encoder BOotstrapping) is a self-distillation framework that, for the first time, trains object-centric models from scratch on real-world data. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, explicitly enriching it with the object-centric inductive biases introduced by slot attention while removing the performance ceiling present in other models. To mitigate the slot collapse caused by random initialization of the target encoder, a novel cross-view patch filtering approach limits supervision to sufficiently informative patches. Pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to object-centric models whose frozen non-object-centric target encoders were pretrained on hundreds of millions of images. Code and pretrained models are publicly available.
Link: https://arxiv.org/abs/2503.15141
Authors: Nikola Đukić, Tim Lebailly, Tinne Tuytelaars
Affiliations: KU Leuven
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025
Abstract:Object-centric representation learning has recently been successfully applied to real-world datasets. This success can be attributed to pretrained non-object-centric foundation models, whose features serve as reconstruction targets for slot attention. However, targets must remain frozen throughout the training, which sets an upper bound on the performance object-centric models can attain. Attempts to update the target encoder by bootstrapping result in large performance drops, which can be attributed to its lack of object-centric inductive biases, causing the object-centric model’s encoder to drift away from representations useful as reconstruction targets. To address these limitations, we propose Object-CEntric Pretraining by Target Encoder BOotstrapping, a self-distillation setup for training object-centric models from scratch, on real-world data, for the first time ever. In OCEBO, the target encoder is updated as an exponential moving average of the object-centric model, thus explicitly being enriched with object-centric inductive biases introduced by slot attention while removing the upper bound on performance present in other models. We mitigate the slot collapse caused by random initialization of the target encoder by introducing a novel cross-view patch filtering approach that limits the supervision to sufficiently informative patches. When pretrained on 241k images from COCO, OCEBO achieves unsupervised object discovery performance comparable to that of object-centric models with frozen non-object-centric target encoders pretrained on hundreds of millions of images. The code and pretrained models are publicly available at this https URL.
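The target-encoder update described here is a standard exponential moving average of the online model, as in other self-distillation setups. A minimal PyTorch sketch (the momentum value is illustrative):

```python
import copy
import torch

@torch.no_grad()
def ema_update(target, online, momentum=0.996):
    """Update the target encoder as an exponential moving average of the
    online object-centric model."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

online = torch.nn.Linear(16, 16)      # stand-in for the online encoder
target = copy.deepcopy(online)
for p in target.parameters():
    p.requires_grad_(False)           # the target is never trained directly

ema_update(target, online)            # call once per training step
```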
[CV-41] VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention
Quick Read: This paper targets three failure modes of existing video generation models in multi-shot storytelling: (1) narrative fragmentation, i.e., the lack of structured storytelling; (2) visual inconsistency, i.e., difficulty maintaining visual coherence across shots; and (3) transition artifacts, where shot changes break immersion. The key contribution, VideoGen-of-Thought (VGoT), is a step-by-step framework that addresses these systematically: dynamic storyline modeling for logically coherent narration (narrative fragmentation), identity-aware cross-shot propagation to keep character traits consistent while allowing plausible variation (visual inconsistency), and adjacent latent transition mechanisms for smooth visual transitions that preserve narrative continuity (transition artifacts). With these innovations, VGoT significantly improves multi-shot video generation, performing especially well on cross-shot consistency.
Link: https://arxiv.org/abs/2503.15138
Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
Affiliations: Hong Kong University of Science and Technology; Peking University; University of Hong Kong; National University of Singapore; University of Central Florida; Everlyn AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL Webpage: this https URL
Abstract:Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative Fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which first converts the user prompt into concise shot descriptions, then elaborates them into detailed, cinematic specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, HDR lighting), ensuring logical narrative progression with self-validation. (2) Visual Inconsistency: Existing approaches struggle with maintaining visual consistency across shots. Our identity-aware cross-shot propagation generates identity-preserving portrait (IPP) tokens that maintain character fidelity while allowing trait variations (expressions, aging) dictated by the storyline. (3) Transition Artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots’ features at transition points, enabling seamless visual flow while preserving narrative continuity. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while achieving over 100% better cross-shot consistency and 10x fewer manual adjustments than alternatives.
[CV-42] Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation
Quick Read: This paper addresses a blind spot in skeleton-based temporal action segmentation (STAS): existing methods overlook the intrinsic correlations among joints and among actions within skeletal features, limiting their understanding of human movement. The proposed Text-Derived Relational Graph-Enhanced Network (TRG-Net) uses prior graphs generated by Large Language Models (LLMs) to strengthen both modeling and supervision. For modeling, Dynamic Spatio-Temporal Fusion Modeling (DSFM) combines Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively capture spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, Absolute-Relative Inter-Class Supervision (ARIS) applies contrastive learning between action features and text embeddings to regularize the absolute class distributions, and uses Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. A Spatial-Aware Enhancement Processing (SAEP) method with random joint occlusion and axial rotation further improves spatial generalization. Evaluations on four public datasets show that TRG-Net achieves state-of-the-art results.
Link: https://arxiv.org/abs/2503.15126
Authors: Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, Honghai Liu
Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.
[CV-43] GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation CVPR2025
Quick Read: This paper addresses the limited applicability of RGBD-based category-level object pose estimation, which depends on precise depth information. Focusing on RGB-based methods, it observes that geometry-guided pose regression, though strong on instance-level tasks, suffers from using the NOCS (Normalized Object Coordinate Space) map as an intermediate representation: its many-to-one correspondence with category-level pose injects redundant instance-specific information, yielding suboptimal results. The paper identifies this intra-class variation problem in NOCS-only pose regression and proposes the Intra-class Variation-Free Consensus (IVFC) map, a new coordinate representation generated from the category-level consensus model. The key is to combine the complementary strengths of the NOCS map and the IVFC map in the proposed GIVEPose framework, which performs Gradual Intra-class Variation Elimination for category-level pose estimation. Experiments on both synthetic and real-world datasets show that GIVEPose substantially outperforms existing state-of-the-art RGB-based approaches.
Link: https://arxiv.org/abs/2503.15110
Authors: Zinqin Huang, Gu Wang, Chenyangguang Zhang, Ruida Zhang, Xiu Li, Xiangyang Ji
Affiliations: Tsinghua University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025
Abstract:Recent advances in RGBD-based category-level object pose estimation have been limited by their reliance on precise depth information, restricting their broader applicability. In response, RGB-based methods have been developed. Among these methods, geometry-guided pose regression that originated from instance-level tasks has demonstrated strong performance. However, we argue that the NOCS map is an inadequate intermediate representation for geometry-guided pose regression method, as its many-to-one correspondence with category-level pose introduces redundant instance-specific information, resulting in suboptimal results. This paper identifies the intra-class variation problem inherent in pose regression based solely on the NOCS map and proposes the Intra-class Variation-Free Consensus (IVFC) map, a novel coordinate representation generated from the category-level consensus model. By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. Extensive evaluations on both synthetic and real-world datasets demonstrate that GIVEPose significantly outperforms existing state-of-the-art RGB-based approaches, achieving substantial improvements in category-level object pose estimation. Our code is available at this https URL.
[CV-44] Distilling 3D distinctive local descriptors for 6D pose estimation
Quick Read: This paper asks whether GeDi's strong zero-shot 6D pose estimation can be retained while drastically cutting the expensive inference cost that keeps it out of real-world use. The key is a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher, with an efficient large-scale training procedure that ensures robustness to occlusion and partial observation under compute and storage constraints, together with a novel loss formulation that handles the weak supervision coming from non-distinctive teacher descriptors. Validated on five BOP Benchmark datasets, the approach significantly reduces inference time while remaining competitive with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility.
Link: https://arxiv.org/abs/2503.15106
Authors: Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Affiliations: Fondazione Bruno Kessler; University of Trento; ISCRA; EuroHPC Joint Undertaking; CINECA (Italy)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Website: this https URL
Abstract:Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi's effectiveness while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: this https URL
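The distillation objective regresses teacher descriptors with a student network. The paper's exact "novel loss formulation" for handling weak supervision is not given here; the sketch below uses a weighted cosine regression as one plausible stand-in, with all names illustrative:

```python
import torch
import torch.nn.functional as F

def descriptor_distillation_loss(student_desc, teacher_desc, weights=None):
    """Regress teacher descriptors with a cosine objective; optional
    per-point weights can down-weight non-distinctive teacher
    descriptors, one plausible way to handle weak supervision."""
    sim = F.cosine_similarity(student_desc, teacher_desc, dim=-1)  # (N,)
    loss = 1.0 - sim
    if weights is not None:
        loss = loss * weights
    return loss.mean()

student = torch.randn(1024, 64, requires_grad=True)  # student descriptors per point
teacher = torch.randn(1024, 64)                      # frozen GeDi teacher outputs
w = torch.rand(1024)                                 # e.g. teacher distinctiveness
print(descriptor_distillation_loss(student, teacher, w))
```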
[CV-45] When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning CVPR2025
Quick Read: This paper tackles two key challenges in self-supervised video learning: (1) random temporal sampling introduces uncertainty that makes training harder in the absence of labels; and (2) conventional masked video modeling reconstructs masked patches in pixel space, compressing information insufficiently for downstream tasks. The proposed T-CoRe (Temporal Correspondence for video Representation learning) addresses both jointly: for challenge (1), a sandwich sampling strategy selects two auxiliary frames that squeeze the target from both sides to reduce reconstruction uncertainty; for challenge (2), an auxiliary branch added to a self-distillation architecture restores representations in latent space, producing high-level semantic representations enriched with temporal information. Experiments show that T-CoRe consistently performs well across multiple downstream tasks, validating its effectiveness for video representation learning.
Link: https://arxiv.org/abs/2503.15096
Authors: Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025
Abstract:The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two critical challenges remain: 1) Without human annotations, the random temporal sampling introduces uncertainty, increasing the difficulty of model training. 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). For challenge 1), we propose a sandwich sampling strategy that selects two auxiliary frames to reduce reconstruction uncertainty in a two-side-squeezing manner. Addressing challenge 2), we introduce an auxiliary branch into a self-distillation architecture to restore representations in the latent space, generating high-level semantic representations enriched with temporal information. Experiments show that T-CoRe consistently achieves superior performance across several downstream tasks, demonstrating its effectiveness for video representation learning. The code is available at this https URL.
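The sandwich sampling strategy picks a target frame together with two auxiliary frames that bracket it from both sides. A minimal sketch of one plausible implementation (the gap size and all names are illustrative assumptions, not the paper's exact sampler):

```python
import random

def sandwich_sample(num_frames, max_gap=8):
    """Pick a target frame plus two auxiliary frames that bracket it from
    both sides, so reconstruction of the target is 'squeezed' by its
    temporal neighbours."""
    t = random.randrange(max_gap, num_frames - max_gap)   # target frame
    left = random.randrange(t - max_gap, t)               # earlier auxiliary
    right = random.randrange(t + 1, t + max_gap + 1)      # later auxiliary
    return left, t, right

print(sandwich_sample(64))  # e.g. (20, 25, 31)
```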
[CV-46] Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLM s
Quick Read: This paper meets the demand of advanced intelligent robot navigation for a more holistic understanding of spatial environments by introducing a system that harnesses Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs (3DSGs) for indoor scenarios. The key is that LLMs provide intelligent, accurate annotation not only of object nodes but also of higher-level nodes such as room nodes, and a polling mechanism for LLM-based room classification further improves the accuracy and reliability of room-node annotation. Thorough numerical experiments show that the system integrates semantic descriptions with geometric data, creating an accurate and comprehensive environment representation instrumental for context-aware navigation and task planning.
Link: https://arxiv.org/abs/2503.15091
Authors: Yao Cheng, Zhe Han, Fengyang Jiang, Huaizhen Wang, Fengyu Zhou, Qingshan Yin, Lei Wei
Affiliations: Shandong New Generation Information Industrial Technology Research Institute; Inspur Intelligent Terminal Co., Ltd.; School of Control Science and Engineering, Shandong University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by WRC SARA 2024
Abstract:This paper addresses the high demand in advanced intelligent robot navigation for a more holistic understanding of spatial environments, by introducing a novel system that harnesses the capabilities of Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs (3DSGs) for indoor scenarios. The proposed framework constructs 3DSGs consisting of a fundamental layer with rich metric-semantic information, an object layer featuring precise point-cloud representation of object nodes as well as visual descriptors, and higher layers of room, floor, and building nodes. Thanks to the innovative application of LLMs, not only object nodes but also nodes of higher layers, e.g., room nodes, are annotated in an intelligent and accurate manner. A polling mechanism for room classification using LLMs is proposed to enhance the accuracy and reliability of the room node annotation. Thorough numerical experiments demonstrate the system’s ability to integrate semantic descriptions with geometric data, creating an accurate and comprehensive representation of the environment instrumental for context-aware navigation and task planning.
[CV-47] An Investigation of Beam Density on LiDAR Object Detection Performance
Quick Read: This paper investigates the domain gap caused by differences in LiDAR beam density, particularly the knowledge gap around dense 128-beam sensors in cross-domain scenarios. The key is a comprehensive study that evaluates different 3D object detection architectures and finds that combining voxel- and point-based approaches yields superior cross-domain performance by exploiting the strengths of both representations. Building on these findings, the paper analyzes beam-density-induced domain gaps and argues that they must be evaluated jointly with other domain shifts; contrary to conventional belief, the experiments show that detectors benefit from training on denser data and remain robust to beam-density variations at inference.
Link: https://arxiv.org/abs/2503.15087
Authors: Christoph Griesbacher, Christian Fruhwirth-Reisinger
Affiliations: Institute of Visual Computing, TU Graz; Christian Doppler Laboratory for Embedded Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVWW 2025
Abstract:Accurate 3D object detection is a critical component of autonomous driving, enabling vehicles to perceive their surroundings with precision and make informed decisions. LiDAR sensors, widely used for their ability to provide detailed 3D measurements, are key to achieving this capability. However, variations between training and inference data can cause significant performance drops when object detection models are employed in different sensor settings. One critical factor is beam density, as inference on sparse, cost-effective LiDAR sensors is often preferred in real-world applications. Despite previous work addressing the beam-density-induced domain gap, substantial knowledge gaps remain, particularly concerning dense 128-beam sensors in cross-domain scenarios. To gain better understanding of the impact of beam density on domain gaps, we conduct a comprehensive investigation that includes an evaluation of different object detection architectures. Our architecture evaluation reveals that combining voxel- and point-based approaches yields superior cross-domain performance by leveraging the strengths of both representations. Building on these findings, we analyze beam-density-induced domain gaps and argue that these domain gaps must be evaluated in conjunction with other domain shifts. Contrary to conventional beliefs, our experiments reveal that detectors benefit from training on denser data and exhibit robustness to beam density variations during inference.
[CV-48] MultiBARF: Integrating Imagery of Different Wavelength Regions by Using Neural Radiance Fields
Quick Read: This paper lowers the high data-preparation barrier of optical sensor applications, where linking observations to real-world locations and combining different image sensors normally demands sensing and image-processing expertise. The key is the proposed MultiBARF method, which replaces co-registration and geometric calibration by synthesizing pairs of two different sensor images plus depth images at assigned viewpoints. The method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, to two imagers, and experiments on visible-light and thermographic images show that it can superimpose the two color channels of those sensor images onto a Neural Radiance Field (NeRF).
Link: https://arxiv.org/abs/2503.15070
Authors: Kana Kurata, Hitoshi Niigaki, Xiaojun Wu, Ryuichi Tanida
Affiliations: NTT Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Optical sensor applications have become popular through digital transformation. Linking observed data to real-world locations and combining different image sensors is essential to make the applications practical and efficient. However, data preparation to try different sensor combinations requires high sensing and image processing expertise. To make data preparation easier for users unfamiliar with sensing and image processing, we have developed MultiBARF. This method replaces the co-registration and geometric calibration by synthesizing pairs of two different sensor images and depth images at assigned viewpoints. Our method extends Bundle Adjusting Neural Radiance Fields (BARF), a deep neural network-based novel view synthesis method, to the two imagers. Through experiments on visible light and thermographic images, we demonstrate that our method superimposes two color channels of those sensor images onto NeRF.
[CV-49] Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis
Quick Read: This paper addresses the under-explored unification of representation learning and generative modeling, noting that existing unified self-supervised learning (SSL) methods rely on an external tokenizer for semantic token reconstruction, which introduces significant training overhead. The key innovation of the proposed Sorcen framework is a synergic Contrastive-Reconstruction objective whose "Echo Contrast" mechanism exploits Sorcen's generative capability to form contrastive positive pairs directly in the semantic token space, without extra image crops or augmentations during training. Since Sorcen operates purely on precomputed tokens, it avoids online token transformation during training and greatly reduces computational cost. This approach improves unified SSL performance across several tasks while being markedly more efficient.
Link: https://arxiv.org/abs/2503.15060
Authors: Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
Affiliations: Universitat de Barcelona, Spain; NVIDIA Computing Spain; Barcelona Supercomputing Center (BSC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The source code is available in this https URL
Abstract:While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training – introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, “Echo Contrast”, leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen “generates” an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
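Echo Contrast treats the generated echo of each sample as its contrastive positive. The paper does not specify the loss form here; a symmetric InfoNCE over sample-echo pairs is one natural instantiation, sketched below with illustrative names:

```python
import torch
import torch.nn.functional as F

def echo_contrast_loss(z, z_echo, temperature=0.1):
    """Symmetric InfoNCE between each sample's representation and its
    generated 'echo': the echo of sample i is the only positive for i,
    and every other sample in the batch is a negative."""
    z = F.normalize(z, dim=-1)
    z_echo = F.normalize(z_echo, dim=-1)
    logits = z @ z_echo.t() / temperature            # (B, B) similarities
    targets = torch.arange(z.size(0), device=z.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

z = torch.randn(8, 128)        # features of the input semantic tokens
z_echo = torch.randn(8, 128)   # features of the generated echo samples
print(echo_contrast_loss(z, z_echo))
```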
[CV-50] Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation
Quick Read: This paper addresses the limited real-world adoption of diffusion- or Schrödinger-bridge-based unpaired image-to-image translation, which stems from their iterative sampling nature. The proposed Implicit Bridge Consistency Distillation (IBCD) framework enables single-step bidirectional unpaired translation without using adversarial loss. Its key is to extend consistency distillation with a diffusion implicit bridge model that connects PF-ODE trajectories between distributions, together with two improvements: (1) distribution matching for consistency distillation and (2) an adaptive weighting method based on distillation difficulty. Experiments show that IBCD achieves state-of-the-art performance on benchmark datasets within a single generation step.
Link: https://arxiv.org/abs/2503.15056
Authors: Suhyeon Lee, Kwanyoung Kim, Jong Chul Ye
Affiliations: Kim Jae Chul Graduate School of AI, KAIST; Samsung Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 16 figures
Abstract:Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at this https URL
[CV-51] DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling
Quick Read: This paper aims to break the impossible triangle among accuracy, computation time, and memory efficiency formed by existing trajectory generation approaches (scene-centric, agent-centric, and query-centric frameworks), each with distinct strengths and weaknesses. The key is Directional Rotary Position Embedding (DRoPE), a novel adaptation to autonomous driving of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE encodes relative positions efficiently, but the periodicity of its rotations limits its handling of angular information. DRoPE overcomes this by introducing a uniform identity scalar into RoPE's 2D rotary transformation, aligning rotation angles with realistic agent headings so that relative angular information is encoded naturally. Theoretical analysis and empirical comparison against state-of-the-art trajectory generation models show that DRoPE simultaneously optimizes trajectory generation accuracy, time complexity, and space complexity, with a significant reduction in space complexity, confirming both its theoretical soundness and practical effectiveness.
Link: https://arxiv.org/abs/2503.15029
Authors: Jianbo Zhao, Taiyu Ban, Zhihao Liu, Hangning Zhou, Xiyang Wang, Qibin Zhou, Hailong Qin, Mu Yang, Lei Liu, Bin Li
Affiliations: University of Science and Technology of China; Mach Drive; Temasek Laboratories, National University of Singapore
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate and efficient modeling of agent interactions is essential for trajectory generation, the core of autonomous driving systems. Existing methods, scene-centric, agent-centric, and query-centric frameworks, each present distinct advantages and drawbacks, creating an impossible triangle among accuracy, computational time, and memory efficiency. To break this limitation, we propose Directional Rotary Position Embedding (DRoPE), a novel adaptation of Rotary Position Embedding (RoPE), originally developed in natural language processing. Unlike traditional relative position embedding (RPE), which introduces significant space complexity, RoPE efficiently encodes relative positions without explicitly increasing complexity but faces inherent limitations in handling angular information due to periodicity. DRoPE overcomes this limitation by introducing a uniform identity scalar into RoPE’s 2D rotary transformation, aligning rotation angles with realistic agent headings to naturally encode relative angular information. We theoretically analyze DRoPE’s correctness and efficiency, demonstrating its capability to simultaneously optimize trajectory generation accuracy, time complexity, and space complexity. Empirical evaluations compared with various state-of-the-art trajectory generation models, confirm DRoPE’s good performance and significantly reduced space complexity, indicating both theoretical soundness and practical effectiveness. The video documentation is available at this https URL.
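RoPE's 2D rotary transformation, with agent headings supplied as the rotation angles, can be sketched as below. Note the simplification: one shared angle per agent, whereas RoPE normally applies per-pair frequencies and DRoPE additionally introduces a uniform identity scalar. The snippet only illustrates why rotated dot products depend on heading differences alone:

```python
import torch

def rotary_embed(x, theta):
    """Rotate consecutive feature pairs of x by the angle theta, the 2D
    rotary transformation RoPE applies; here real agent yaw angles are
    fed in place of position-derived angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos = torch.cos(theta).unsqueeze(-1)
    sin = torch.sin(theta).unsqueeze(-1)
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# queries/keys of 3 agents; after rotation, q.k depends only on the
# heading difference between the two agents, not on absolute headings
q = torch.randn(3, 16)
k = torch.randn(3, 16)
headings = torch.tensor([0.0, 0.5, 3.1])   # agent yaw in radians
attn_logits = rotary_embed(q, headings) @ rotary_embed(k, headings).t()
print(attn_logits.shape)                   # torch.Size([3, 3])
```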
[CV-52] Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models
Quick Read: This paper addresses the lack of a comprehensive benchmark for assessing how well Large Vision Language Models (LVLMs) detect the increasingly diverse forged media of the AIGC era. The key contribution is Forensics-Bench, a forgery detection evaluation benchmark suite comprising 63,292 carefully curated multiple-choice visual questions, covering 112 unique forgery detection types across five perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. By design, the benchmark demands comprehensive recognition, localization, and reasoning capabilities over diverse forgeries, exposing the significant challenges that comprehensive forgery detection poses to current LVLMs and motivating the community to work toward all-around forgery detectors.
Link: https://arxiv.org/abs/2503.15024
Authors: Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, Kelu Yao, Chao Li, Wenqi Shao, Ping Luo
Affiliations: The University of Hong Kong; HKU Shanghai Intelligent Computing Research Center; Shanghai AI Laboratory; Zhejiang Laboratory; Hangzhou Institute for Advanced Study; Zhejiang University; Alibaba, Beijing, China; MEGVII Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 19 figures
Abstract:Recently, the rapid development of AIGC has significantly increased the diversity of fake media spreading on the Internet, posing unprecedented threats to social security, politics, and law. To detect the increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, it still lacks a comprehensive benchmark designed to comprehensively assess LVLMs' discerning capabilities on forgery media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at this https URL.
[CV-53] Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script
【Quick Read】: This paper tackles handwritten Arabic character recognition, which is difficult because Arabic letter forms vary dynamically with context. The proposed solution is a hybrid of convolutional neural networks (CNNs) and Transformer-based architectures: an ensemble that uses confidence-based fusion to combine the strengths of custom and fine-tuned models such as EfficientNet-B7 and the Vision Transformer (ViT-B16). On the IFN/ENIT dataset it reaches 96.38% accuracy for letter classification and 97.22% for positional classification, confirming the complementary nature of CNNs and Transformers and their joint potential for robust Arabic handwriting recognition.
Link: https://arxiv.org/abs/2503.15023
Authors: Chaouki Boufenar, Mehdi Ayoub Rabiai, Boualem Nadjib Zahaf, Khelil Rafik Ouaras
Affiliations: Department of Computer Science, University of Algiers 1
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Handwritten Arabic script recognition is a challenging task due to the script’s dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
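The abstract does not spell out the fusion rule, but confidence-based fusion of two classifiers is commonly implemented by weighting each model's probability vector by its own peak confidence. A minimal NumPy sketch of that idea (the 28-class setup and the weighting rule are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

def confidence_fusion(p_cnn: np.ndarray, p_vit: np.ndarray) -> np.ndarray:
    """Weight each model's class-probability vector by its own peak confidence."""
    w_cnn, w_vit = p_cnn.max(), p_vit.max()
    return (w_cnn * p_cnn + w_vit * p_vit) / (w_cnn + w_vit)

# Hypothetical softmax outputs over 28 Arabic letter classes
rng = np.random.default_rng(1)
p_cnn = rng.dirichlet(np.ones(28))   # stand-in for EfficientNet-B7 output
p_vit = rng.dirichlet(np.ones(28))   # stand-in for ViT-B16 output
print(int(confidence_fusion(p_cnn, p_vit).argmax()))
```

The appeal of this family of rules is that the more confident model dominates on easy samples while uncertain predictions get averaged.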
[CV-54] xMOD: Cross-Modal Distillation for 2D/3D Multi-Object Discovery from 2D motion CVPR2025
【Quick Read】: This paper targets unsupervised multi-object discovery in 3D data, an important but under-explored problem: while 2D object discovery has advanced considerably, its 3D counterpart lags behind for lack of effective motion cues. The key idea is to bridge the 2D and 3D modalities by exploiting 2D motion, which is more flexible and generalizable. The framework contributes DIOD-3D, a baseline for 2D-motion-based multi-object discovery in 3D data that adds scene completion as an auxiliary task for dense localization from sparse inputs, and xMOD, a cross-modal training scheme that always uses 2D motion cues and adopts a teacher-student paradigm across the two modalities to mitigate confirmation bias. A late-fusion technique tailored to the pipeline further boosts performance when both modalities are available at inference.
Link: https://arxiv.org/abs/2503.15022
Authors: Saad Lahlali, Sandra Kara, Hejer Ammar, Florian Chabot, Nicolas Granger, Hervé Le Borgne, Quoc-Cuong Pham
Affiliations: Université Paris-Saclay, CEA, List, F-91120, Palaiseau, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025
Abstract: Object discovery, the task of localizing objects without human annotations, has gained significant attention in 2D image analysis. However, despite this growing interest, it remains under-explored in 3D data, where approaches rely exclusively on 3D motion despite its several challenges. In this paper, we present a novel framework that builds on advances in 2D object discovery based on 2D motion, exploiting the flexibility and generalizability of such motion cues to bridge the gap between 2D and 3D modalities. Our primary contributions are twofold: (i) we introduce DIOD-3D, the first baseline for multi-object discovery in 3D data using 2D motion, incorporating scene completion as an auxiliary task to enable dense object localization from sparse input data; (ii) we develop xMOD, a cross-modal training framework that integrates 2D and 3D data while always using 2D motion cues. xMOD employs a teacher-student training paradigm across the two modalities to mitigate confirmation bias by leveraging the domain gap. During inference, the model supports both RGB-only and point-cloud-only inputs. Additionally, we propose a late-fusion technique tailored to our pipeline that further enhances performance when both modalities are available at inference. We evaluate our approach extensively on synthetic (TRIP-PD) and challenging real-world datasets (KITTI and Waymo). Notably, our approach yields a substantial performance improvement over the 2D object discovery state of the art on all datasets, with gains ranging from +8.7 to +15.1 in F1@50 score. The code is available at this https URL
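The confirmation-bias mitigation rests on each modality being supervised by the other rather than by itself. A toy sketch of that cross-supervision pattern, where the linear layers stand in for the real 2D/3D networks and the pseudo-label rule is an assumption:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the two modality branches (RGB-based and point-cloud-based);
# the real networks and the motion-derived pseudo-labels are far richer.
model_2d = torch.nn.Linear(32, 2)     # 2D branch (foreground/background logits)
model_3d = torch.nn.Linear(32, 2)     # 3D branch
feat_2d = torch.randn(64, 32)         # per-pixel features from the RGB stream
feat_3d = torch.randn(64, 32)         # per-point features from the point cloud

# Each branch is supervised by the *detached* prediction of the other, so
# a branch's own errors are not directly reinforced (confirmation bias).
p2d = model_2d(feat_2d)
p3d = model_3d(feat_3d)
loss = F.cross_entropy(p3d, p2d.argmax(dim=1).detach()) \
     + F.cross_entropy(p2d, p3d.argmax(dim=1).detach())
loss.backward()
```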
[CV-55] Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene CVPR2025
【Quick Read】: This paper addresses two problems in 4D Panoptic Scene Graph (4D-PSG) generation: severe data scarcity with the resulting out-of-vocabulary issues, and the suboptimal performance caused by the pipeline nature of existing benchmark generation methods. The proposed framework leverages rich 2D visual scene annotations to enhance 4D scene learning. Its key components are a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end 4D-PSG generation; a chained scene-graph inference mechanism that iteratively infers accurate and comprehensive object and relation labels; and a 2D-to-4D transfer learning framework whose spatial-temporal scene-transcending strategy transfers dimension-invariant features from abundant 2D scene-graph annotations to 4D scenes, effectively compensating for the scarcity of 4D-PSG data.
Link: https://arxiv.org/abs/2503.15019
Authors: Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025
Abstract: The recently emerged 4D Panoptic Scene Graph (4D-PSG) provides the most advanced representation yet for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by a large margin, highlighting the effectiveness of our method.
[CV-56] Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training AAAI2025
【Quick Read】: This paper addresses the limited generalization of existing dehazing methods in real scenes, attributed to constrained feature representations and insufficient use of real-world priors. The proposed Diff-Dehazer exploits the strong generative capability of diffusion models, plugging them in as bijective mapping learners inside the classic CycleGAN framework for unpaired real-image dehazing. Physical priors are further integrated to mine real-world knowledge, and degradation is removed in both image and text modalities to fully exploit the representation power of diffusion models and improve the dehazing effect. Experiments on multiple real-world datasets verify the superior performance of the method.
Link: https://arxiv.org/abs/2503.15017
Authors: Yunwei Lan, Zhigao Cui, Chang Liu, Jialun Peng, Nian Wang, Xin Luo, Dong Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025
Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world priors. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit the diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage the diffusion prior as a bijective mapping learner within CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistical information about real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code is available at this https URL.
[CV-57] Manifold Learning for Hyperspectral Images
【Quick Read】: This paper addresses the failure of traditional feature extraction and projection techniques such as Principal Component Analysis (PCA) to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, which limits the performance of downstream neural networks. The key idea is to approximate the topology of the dataset by constructing adjacency graphs with Uniform Manifold Approximation and Projection (UMAP). This captures nonlinear correlations in the data, preserves its global structure, and improves feature separability, leading to more accurate and robust classification of Hyperspectral Images (HSI) from X-ray transmission spectroscopy.
Link: https://arxiv.org/abs/2503.15016
Authors: Fethi Harkat (EDP, DT), Tiphaine Deuberet (DT), Guillaume Gey (DT), Valérie Perrier (EDP), Kévin Polisano (SVH)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.
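For readers who want to try the projection step, the umap-learn package exposes exactly this graph-based reduction; the band count, neighbor setting, and downstream classifier below are illustrative choices, not the paper's configuration:

```python
import numpy as np
import umap  # pip install umap-learn
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical hyperspectral cube: 64x64 pixels, 96 spectral bands
rng = np.random.default_rng(0)
cube = rng.normal(size=(64, 64, 96)).astype(np.float32)
X = cube.reshape(-1, 96)                      # one sample per pixel

# Graph-based nonlinear projection; n_neighbors controls the adjacency graph
reducer = umap.UMAP(n_neighbors=15, n_components=8, random_state=0)
X_low = reducer.fit_transform(X)

# Downstream classifier on the embedded features (labels are placeholders)
y = rng.integers(0, 4, size=X_low.shape[0])
clf = KNeighborsClassifier(n_neighbors=5).fit(X_low, y)
print(clf.score(X_low, y))
```

Unlike PCA, the embedding here is driven by the neighborhood graph, which is what lets it preserve nonlinear spectral structure.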
[CV-58] Universal Scene Graph Generation CVPR2025
【Quick Read】: This paper addresses the limitation that current scene graph (SG) research is largely confined to single-modality scene modeling, failing to exploit the complementary strengths of multimodal SG representations for describing holistic scene semantics. It introduces the Universal Scene Graph (USG), a representation that fully characterizes semantic scenes from any combination of modality inputs, covering both modality-invariant and modality-specific scenes. The key component is USG-Par, a niche-targeting parser that tackles the two main bottlenecks of cross-modal object alignment and out-of-domain samples: a modular, end-to-end architecture with an object associator that relieves the modality gap for cross-modal object alignment, plus a text-centric scene contrastive learning mechanism that mitigates domain imbalance by aligning multimodal objects and relations with textual scene graphs.
Link: https://arxiv.org/abs/2503.15005
Authors: Shengqiong Wu, Hao Fei, Tat-Seng Chua
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025
Abstract:Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained intensive research in SG generation. In the real world, multiple modalities often coexist, with different types, such as images, text, video, and 3D data, expressing distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing the full utilization of the complementary strengths of different modality SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks of cross-modal object alignment and out-of-domain challenges. We design the USG-Par with modular architecture for end-to-end USG generation, in which we devise an object associator to relieve the modality gap for cross-modal object alignment. Further, we propose a text-centric scene contrasting learning mechanism to mitigate domain imbalances by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and also that our USG-Par achieves higher efficacy and performance.
[CV-59] Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning
【Quick Read】: This paper tackles the segmentation of transparent structures, which are hard to distinguish from the background; it focuses on drinking glasses, which are ubiquitous yet vary widely in shape and size. The proposed TransCaGNet modifies the zero-shot model CaGNet by swapping its segmentation backbone for the Trans4Trans architecture so it can segment transparent objects, and uses zero-shot learning to handle rare glass categories absent from training. The authors also contribute a novel synthetic dataset covering diverse environmental conditions and a real-world evaluation dataset reflecting practical deployment. Combining the semantic predictions with SAM 2 segmentation results, and assigning glasses multiple plausible categories to account for class similarity, viewpoints, and coverings, improves mean IoU by up to 13.68% and mean accuracy by up to 17.88% on the synthetic dataset, with further gains of 5.55% and 5.72% on the real-world dataset.
Link: https://arxiv.org/abs/2503.15004
Authors: Annalena Blänsdorf, Tristan Wirth, Arne Rak, Thomas Pöllabauer, Volker Knauthe, Arjan Kuijper
Affiliations: TU Darmstadt, Germany; Fraunhofer IGD, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zero-shot learning to be able to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of different environmental conditions. Additionally we capture a real-world evaluation dataset since most applications take place in the real world. Comparing our model with ZegClip we are able to show that TransCaGNet produces better mean IoU and accuracy values while ZegClip outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement of up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real-world dataset. The mean IoU is improved by up to 5.55% and the mean accuracy by up to 5.72% on the real-world dataset.
[CV-60] Low-Complexity Patch-based No-Reference Point Cloud Quality Metric exploiting Weighted Structure and Texture Features
【Quick Read】: This paper addresses no-reference point cloud quality assessment (PCQA): estimating the perceptual degradation that compression, transmission, or rendering introduces into a point cloud when no reference point cloud is available. Traditional methods struggle to capture the distinctive characteristics of point cloud data and the impact of different distortions, whereas the proposed PST-PCQA delivers efficient and accurate quality assessment through a learning-based framework.
The key to the solution is a low-complexity, learning-based design that splits the point cloud into local patches, extracts features per patch, and combines local and global features to predict the Mean Opinion Score (MOS). Its core ideas are: 1) patch-wise analysis to refine the quality assessment; 2) correlation weights that aggregate local features into an overall quality prediction; and 3) a lightweight model with few learnable parameters, suited to real-time applications and devices with limited computational capacity. Experiments across several state-of-the-art datasets show strong prediction capability, good cross-dataset generalization, and parameter efficiency.
Link: https://arxiv.org/abs/2503.15001
Authors: Michael Neri, Federica Battisti
Affiliations: Tampere University; University of Padova
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: Accepted for publication in IEEE Transactions on Broadcasting. Code at this https URL
Abstract:During the compression, transmission, and rendering of point clouds, various artifacts are introduced, affecting the quality perceived by the end user. However, evaluating the impact of these distortions on the overall quality is a challenging task. This study introduces PST-PCQA, a no-reference point cloud quality metric based on a low-complexity, learning-based framework. It evaluates point cloud quality by analyzing individual patches, integrating local and global features to predict the Mean Opinion Score. In summary, the process involves extracting features from patches, combining them, and using correlation weights to predict the overall quality. This approach allows us to assess point cloud quality without relying on a reference point cloud, making it particularly useful in scenarios where reference data is unavailable. Experimental tests on three state-of-the-art datasets show good prediction capabilities of PST-PCQA, through the analysis of different feature pooling strategies and its ability to generalize across different datasets. The ablation study confirms the benefits of evaluating quality on a patch-by-patch basis. Additionally, PST-PCQA’s light-weight structure, with a small number of parameters to learn, makes it well-suited for real-time applications and devices with limited computational capacity. For reproducibility purposes, we made code, model, and pretrained weights available at this https URL.
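The patch-then-pool idea can be sketched in a few lines; the two linear heads and softmax weighting below are assumptions standing in for the learned feature extractor and correlation weights described in the abstract:

```python
import torch

def predict_mos(patch_feats: torch.Tensor,
                score_head: torch.nn.Module,
                weight_head: torch.nn.Module) -> torch.Tensor:
    """Pool per-patch quality scores into one MOS estimate via learned weights."""
    scores = patch_feats.pipe if False else score_head(patch_feats).squeeze(-1)  # (P,)
    w = torch.softmax(weight_head(patch_feats).squeeze(-1), dim=0)               # (P,)
    return (w * scores).sum()                                                    # scalar MOS

P, D = 32, 128                          # 32 patches, 128-dim features each
patch_feats = torch.randn(P, D)         # stand-in for extracted patch features
score_head = torch.nn.Linear(D, 1)      # per-patch quality score
weight_head = torch.nn.Linear(D, 1)     # per-patch pooling weight
print(predict_mos(patch_feats, score_head, weight_head).item())
```

Weighted pooling is what lets severely distorted patches dominate the final score instead of being averaged away.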
[CV-61] TGV: Tabular Data-Guided Learning of Visual Cardiac Representations
【Quick Read】: This paper questions the standard contrastive-learning recipe in medical imaging, where pairs are formed from augmented views of the same image: in medicine one usually wants to compare patients with different phenotypes rather than multiple augmentations of one scan. The key idea is to use clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive framework, with tabular attributes guiding the training of visual representations and no joint embedding space required. Using short-axis cardiac MR images and clinical attributes from the UK Biobank, the authors show that tabular data distinguishes patient subgroups more effectively, and that on downstream tasks (fine-tuning and zero-shot prediction of cardiovascular artery disease and cardiac phenotypes) it yields stronger visual representations than conventional approaches relying on image augmentations alone or on combined image-tabular embeddings. Image encoders trained with tabular guidance also embed demographic information in their representations, enabling unimodal predictions in real-world clinical settings where comprehensive annotations are not routinely available at inference time.
Link: https://arxiv.org/abs/2503.14998
Authors: Marta Hasny, Maxime Di Folco, Keno Bressem, Julia Schnabel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Contrastive learning methods in computer vision typically rely on different views of the same image to form pairs. However, in medical imaging, we often seek to compare entire patients with different phenotypes rather than just multiple augmentations of one scan. We propose harnessing clinically relevant tabular data to identify distinct patient phenotypes and form more meaningful pairs in a contrastive learning framework. Our method uses tabular attributes to guide the training of visual representations, without requiring a joint embedding space. We demonstrate its strength using short-axis cardiac MR images and clinical attributes from the UK Biobank, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data yields stronger visual representations than conventional methods that rely solely on image augmentations or combined image-tabular embeddings. Furthermore, we demonstrate that image encoders trained with tabular guidance are capable of embedding demographic information in their representations, allowing them to use insights from tabular data for unimodal predictions, making them well-suited to real-world medical settings where extensive clinical annotations may not be routinely available at inference time. The code will be available on GitHub.
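A rough sketch of the pairing idea: treat scans whose tabular attributes place them in the same phenotype group as positives in a supervised-contrastive loss. The grouping rule and loss form are assumptions about the general shape of such an objective; the paper's actual criterion may be more nuanced.

```python
import torch
import torch.nn.functional as F

def tabular_guided_contrastive(z: torch.Tensor, group: torch.Tensor,
                               tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss whose positives come from tabular phenotype groups."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float('-inf'))                 # exclude self-similarity
    pos = group.unsqueeze(0) == group.unsqueeze(1)
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    return -log_prob[pos].mean()   # assumes each anchor has at least one positive

z = torch.randn(16, 64)                               # image-encoder embeddings
group = torch.randint(0, 4, (16,))                    # phenotype id from tabular data
print(tabular_guided_contrastive(z, group).item())
```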
[CV-62] Disentangling Modes and Interference in the Spectrogram of Multicomponent Signals
【Quick Read】: This paper studies how the spectrogram of a multicomponent signal can be decomposed into a mode part and an interference part. Two approaches identify the interference: a variational method inspired by texture-geometry decomposition in image processing, and a supervised U-Net trained on a dataset spanning diverse interference patterns and noise conditions. Once the interference component is separated, it is used to define a criterion for locally adapting the window length in the spectrogram definition, improving ridge detection in the presence of close modes. Numerical experiments illustrate the strengths and limitations of both approaches and their potential for enhancing time-frequency analysis under strong interference.
Link: https://arxiv.org/abs/2503.14990
Authors: Kévin Polisano (SVH), Sylvain Meignen (DAO), Nils Laurent (Phys-ENS), Hubert Leterme (ENSICAEN)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:In this paper, we investigate how the spectrogram of multicomponent signals can be decomposed into a mode part and an interference part. We explore two approaches: (i) a variational method inspired by texture-geometry decomposition in image processing, and (ii) a supervised learning approach using a U-Net architecture, trained on a dataset encompassing diverse interference patterns and noise conditions. Once the interference component is identified, we explain how it enables us to define a criterion to locally adapt the window length used in the definition of the spectrogram, for the sake of improving ridge detection in the presence of close modes. Numerical experiments illustrate the advantages and limitations of both approaches for spectrogram decomposition, highlighting their potential for enhancing time-frequency analysis in the presence of strong interference.
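The trade-off the adaptive window addresses is easy to reproduce: with two close modes, a short STFT window blurs them together in frequency while a long one smears them in time. A small SciPy illustration (the signal and window parameters are arbitrary):

```python
import numpy as np
from scipy.signal import stft

# Two-component signal with close modes: interference appears in the spectrogram
fs = 1000
t = np.arange(0, 2, 1 / fs)
x = np.cos(2 * np.pi * 100 * t) + np.cos(2 * np.pi * 115 * t)

# Window length trades frequency vs. time resolution; a locally adapted
# choice (as the paper proposes) would pick nperseg per region based on
# the estimated interference component.
for nperseg in (64, 256):
    f, tt, Z = stft(x, fs=fs, nperseg=nperseg)
    spec = np.abs(Z) ** 2
    print(nperseg, spec.shape)   # coarser frequency grid for the short window
```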
[CV-63] Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation
【Quick Read】: This paper targets representation learning in semi-supervised medical image segmentation (SSMIS), where labeled data is scarce and conventional networks rely on single fixed activation functions and linear modeling patterns that limit robust representation learning. Building on Kolmogorov-Arnold Networks (KANs), the proposed Semi-KAN taps their untapped potential to strengthen backbone architectures for SSMIS. The key design integrates learnable KAN modules into the tokenized intermediate representations at the encoder bottleneck and the top decoder layers of a U-Net pipeline to extract high-level semantic features, and couples a multi-branch structure with uncertainty estimation, capturing diverse pattern representations while keeping computational overhead low.
Link: https://arxiv.org/abs/2503.14983
Authors: Zanting Ye, Xiaolong Niu, Xuanbin Wu, Wenxiang Yi, Yuan Chang, Lijun Lu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures, 6 tables
Abstract: Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled data, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.
[CV-64] One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks MICCAI2024
【Quick Read】: This paper introduces one-shot medical video object segmentation, which must separate foreground from background across a whole video given only the first frame's mask annotation, to cope with the scarcity of medical video data and annotations. The proposed temporal contrastive memory network comprises image and mask encoders that learn feature representations; a temporal contrastive memory bank that pulls embeddings of adjacent frames together while pushing distant ones apart, explicitly modeling inter-frame relationships, and stores these features; and a decoder that fuses encoded image features with memory readouts for segmentation. The authors also assemble a diverse multi-source medical video dataset to benchmark the task. Experiments show state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, demonstrating generalization from scarce labels and the potential to ease the annotation burden of medical video analysis. Code is available at the link provided.
Link: https://arxiv.org/abs/2503.14979
Authors: Yaxiong Chen, Junjian Hu, Chunlei Li, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024 Workshop
Abstract:Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at this https URL.
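The adjacent-versus-distant objective can be sketched as an InfoNCE loss in which frame t's positive is frame t+1 and all other frames are negatives; this is a simplification of the described memory bank, whose storage and readout details differ:

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Pull adjacent-frame embeddings together, push distant frames apart."""
    z = F.normalize(z, dim=1)                    # (T, D) per-frame embeddings
    sim = z @ z.t() / tau                        # (T, T) cosine similarities
    sim.fill_diagonal_(float('-inf'))            # a frame is not its own pair
    targets = torch.arange(1, z.size(0))         # positive of frame t is t + 1
    return F.cross_entropy(sim[:-1], targets)

z = torch.randn(10, 128)                         # 10 frames, 128-dim embeddings
print(temporal_contrastive_loss(z).item())
```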
[CV-65] Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening
【Quick Read】: This paper targets the heavy computational overhead of SDE-based diffusion models for pansharpening (fusing high-resolution panchromatic with multispectral imagery): they achieve state-of-the-art quality but require many sampling steps, and existing remedies (efficient samplers, knowledge distillation, retraining to reduce steps, e.g., from 1,000 to fewer) typically trade away fusion quality. The proposed Optimal Transport Flow Matching (OTFM) framework incorporates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike classical OT formulations that enforce rigid distribution alignment, UOT relaxes the marginal constraints, giving the model the flexibility to accommodate the spectral and spatial disparities inherent in remote sensing data; task-specific regularization added to the UOT objective further improves the flow model's robustness. OTFM supports simulation-free training and single-step inference while strictly adhering to pansharpening constraints, and across multiple datasets it matches or exceeds earlier regression-based models and leading diffusion-based methods with only one sampling step.
Link: https://arxiv.org/abs/2503.14975
Authors: Zihan Cao, Yu Zhong, Liang-Jian Deng
Affiliations: UESTC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at this https URL.
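For intuition, entropic unbalanced OT replaces Sinkhorn's hard marginal projections with softened power updates, which is what lets mass be created or destroyed at the marginals. A didactic NumPy sketch of the standard KL-relaxed scaling iterations (not the paper's training code; hyperparameters are arbitrary):

```python
import numpy as np

def sinkhorn_unbalanced(a, b, C, eps=0.05, lam=1.0, n_iter=200):
    """Entropic unbalanced OT with KL-relaxed marginals (standard iterations)."""
    K = np.exp(-C / eps)
    fi = lam / (lam + eps)                       # relaxation exponent in (0, 1)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi                  # softened row scaling
        v = (b / (K.T @ u)) ** fi                # softened column scaling
    return u[:, None] * K * v[None, :]           # transport plan

rng = np.random.default_rng(0)
x, y = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
C = ((x[:, None] - y[None]) ** 2).sum(-1)        # squared-distance cost
P = sinkhorn_unbalanced(np.ones(5) / 5, np.ones(7) / 7, C)
print(P.sum())   # need not equal 1: marginals are only softly enforced
```

With fi = 1 (lam going to infinity) the updates reduce to balanced Sinkhorn; smaller lam loosens the marginals, which mirrors the flexibility the abstract attributes to UOT.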
[CV-66] Language-based Image Colorization: A Benchmark and Beyond
【Quick Read】: This paper addresses the color ambiguity that hampers automatic image colorization and the limited user control of existing methods, focusing on language-based colorization, whose core challenge is cross-modal alignment. It groups existing methods into two families: those that train a cross-modality network from scratch, and those that use a pre-trained cross-modality model to establish textual-visual correspondence. Building on an analysis of their limitations, the authors propose a simple yet effective method based on a distilled diffusion model that produces better results than previous, more complex methods while running 14 times faster. To the authors' knowledge this is the first comprehensive review and benchmark of the field, offering meaningful insights to the community. Code is publicly available.
Link: https://arxiv.org/abs/2503.14974
Authors: Yifan Li, Shuai Yang, Jiaying Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which require no additional guidance, struggle to generate high-quality images due to color ambiguity, and provide limited user controllability. Thanks to the emergence of cross-modality datasets and models, language-based colorization methods have been proposed to fully utilize the efficiency and flexibility of text descriptions to guide colorization. In view of the lack of a comprehensive review of the language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge of cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes a pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on a distilled diffusion model. Extensive experiments demonstrate that our simple baseline produces better results than previous complex methods with a 14-times speedup. To the best of our knowledge, this is the first comprehensive review and benchmark of the language-based image colorization field, providing meaningful insights for the community. The code is available at this https URL.
[CV-67] Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models MICCAI2024
【Quick Read】: This paper addresses the scarcity of publicly available ultrasound video datasets, which hinders the development of effective video classification models. The key solution is a latent dynamic diffusion model (LDDM) that translates abundant, readily available static ultrasound images into dynamic sequences with realistic video characteristics. Beyond strong quantitative results and visually appealing synthesized videos on the BUSV benchmark, training video classifiers on a mix of real and LDDM-synthesized videos substantially outperforms training on real data alone, indicating that the method captures the dynamics critical for discrimination. The image-to-video approach thus provides an effective data augmentation solution for ultrasound video analysis.
Link: https://arxiv.org/abs/2503.14966
Authors: Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: MICCAI 2024
Abstract:Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at this https URL.
[CV-68] Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition
【Quick Read】: This paper addresses the tendency of skeleton-based human action recognition to focus on full-body motion while overlooking the subtle hand movements that distinguish fine-grained actions. The proposed BHaRNet augments a typical body-expert model with a hand-expert model: the two streams are trained jointly with an ensemble loss that fosters cooperative specialization, in the spirit of a Mixture-of-Experts (MoE), while cross-attention via an expertized branch method and a pooling-attention module enables feature-level interaction and selective fusion of complementary information. Inspired by MMNet, the approach also extends to multimodal settings where body features guide RGB learning to capture richer contextual cues. On hand-intensive actions, accuracy improves from 86.4% to 93.0% while using fewer GFLOPs and parameters than comparable unified methods.
Link: https://arxiv.org/abs/2503.14960
Authors: Seungyeon Cho, Tae-Kyun Kim
Affiliations: School of Computing, KAIST, South Korea; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 figures, 8 pages
Abstract:Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human-robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during the spatial-pooling. In this paper, we propose BHaRNet (Body-Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies – improving from 86.4% to 93.0% in hand-intensive actions – while maintaining fewer GFLOPs and parameters than the relevant unified methods.
[CV-69] Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning MICCAI2024
【Quick Read】: This paper tackles the scarcity of dense frame annotations for medical video object segmentation, which conventional methods require in bulk. To cut costly video annotation, it uses labels from only a few video frames together with existing labeled images. The key is a two-phase framework: first, a few-shot segmentation model is learned from labeled images; then a spatiotemporal consistency relearning step on medical videos enforces coherence between consecutive frames, with additional constraints between the image model and the relearning model at both the feature and prediction levels. Experiments show the approach outperforms state-of-the-art few-shot segmentation methods, bridging richly annotated medical images and sparsely labeled medical videos to achieve strong video segmentation in this low-data regime.
Link: https://arxiv.org/abs/2503.14958
Authors: Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024
Abstract:Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at this https URL.
[CV-70] Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
【Quick Read】: This paper targets the complex reasoning demanded by video question answering (VQA), challenging models to use procedural knowledge for visual entity recognition, hypothesis generation, and contextual, causal, and counterfactual reasoning. The key solution is a neuro-symbolic reasoning module that combines neural networks with LLM-driven constrained reasoning over variables to generate interpretable answers. Results show that pairing LLMs with structured, logic-based knowledge reasoning improves procedural reasoning on the STAR benchmark and on the newly introduced dataset.
Link: https://arxiv.org/abs/2503.14957
Authors: Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, Basura Fernando
Affiliations: Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: This paper introduces a new video question-answering (VQA) dataset that challenges models to leverage procedural knowledge for complex reasoning. It requires recognizing visual entities, generating hypotheses, and performing contextual, causal, and counterfactual reasoning. To address this, we propose a neuro-symbolic reasoning module that integrates neural networks and LLM-driven constrained reasoning over variables for interpretable answer generation. Results show that combining LLMs with structured, logic-based knowledge reasoning enhances procedural reasoning on the STAR benchmark and our dataset. Code and dataset will be available at this https URL soon.
[CV-71] Depth-Aware Range Image-Based Model for Point Cloud Segmentation
【Quick Read】: This paper addresses a weakness of range-image-based point cloud segmentation (PCS): without explicit depth information, objects that are separate in 3D can touch in the range image, and models derived from color-image architectures fail to exploit the implicit yet ordered depth information that range images contain, hurting performance. The key contribution is the Depth-Aware Module (DAM), which perceives the ordered depth information by explicitly modeling interdependence among channels, and Fast FMVNet V3, which integrates DAM into the last block of each architecture stage. Experiments on SemanticKITTI, nuScenes, and SemanticPOSS show that DAM brings a significant improvement to Fast FMVNet V3 at negligible computational cost.
Link: https://arxiv.org/abs/2503.14955
Authors: Bike Chen, Antti Tikanmäki, Juha Röning
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: None
Abstract:Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, bringing difficulty for the range image-based models in correctly segmenting the objects. Moreover, previous PCS models are usually derived from the existing color image-based models and unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.
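The abstract describes DAM only as explicitly modeling the interdependence among channels; one familiar way to do that is squeeze-and-excitation-style channel reweighting, sketched below as an assumption about the general shape of such a module rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class DepthAwareModule(nn.Module):
    """Channel-reweighting block: global pooling summarizes each channel,
    a small MLP predicts per-channel gates, and features are rescaled."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.mlp(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # reweight channels

x = torch.randn(2, 64, 64, 512)                   # range-image features (B,C,H,W)
print(DepthAwareModule(64)(x).shape)
```

Placing such a block at the end of each stage, as the abstract describes for DAM, adds only a pooling and a tiny MLP per stage, consistent with the claimed negligible cost.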
[CV-72] Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
【Quick Read】: This paper targets multi-view description matching, where existing visual-semantic models learn embeddings with limited information capacity that are easily distracted by locally similar negative samples. The proposed Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE) is a two-stage framework that boosts the information capacity of sparse text via dense-text distillation. During pre-training, images are aligned with dense text to enrich the visual semantic embeddings; during fine-tuning, the model simultaneously distills dense text embeddings into sparse text embeddings and aligns images with sparse texts, increasing the information capacity of the sparse text embeddings. Evaluated on the large-scale MS-COCO and Flickr30K datasets, D2S-VSE outperforms recent state-of-the-art methods.
Link: https://arxiv.org/abs/2503.14953
Authors: Yang Liu, Wentao Feng, Zhuoyao Liu, Shudong Huang, Jiancheng Lv
Affiliations: College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view’s text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
[CV-73] USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network
【Quick Read】: This paper answers the demand for high-accuracy depth estimation in domains such as autonomous driving and augmented reality with a network that exploits multimodal data effectively. The proposed Unified Segmentation Attention Mechanism Network (USAM-Net) is a dual-pathway convolutional architecture that combines a pre-trained segmentation model (SAM) with a depth estimation model: the segmentation pathway produces semantic masks that are concatenated with the stereo images and fed into the depth pathway, focusing the model on cues critical to accurate depth perception such as object boundaries and surface textures. On the DrivingStereo dataset, USAM-Net attains a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet, which validates the approach.
Link: https://arxiv.org/abs/2503.14950
Authors: Joseph Emmanuel DL Dayo, Prospero C. Naval Jr
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.
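The integration step is simple to picture: the SAM-derived masks are concatenated channel-wise with the stereo pair before entering the depth network. A minimal sketch, where the tensor sizes and the one-channel binary masks are illustrative placeholders:

```python
import torch

# Stereo RGB pair plus per-view semantic masks from a frozen segmentation model
left = torch.randn(1, 3, 256, 512)
right = torch.randn(1, 3, 256, 512)
mask_l = torch.randint(0, 2, (1, 1, 256, 512)).float()   # placeholder SAM mask
mask_r = torch.randint(0, 2, (1, 1, 256, 512)).float()

# Channel-wise concatenation: the depth network sees 8 input channels
# (3 + 1 per view) instead of the usual 6.
x = torch.cat([left, mask_l, right, mask_r], dim=1)
print(x.shape)   # torch.Size([1, 8, 256, 512])
```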
[CV-74] ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM -Agents
【Quick Read】: This paper addresses two shortcomings of existing collaborative perception systems: inefficient user interaction and the difficulty of photorealistic multi-camera visualization. ChatStitch is the first collaborative perception system that can reveal information hidden in occluded blind spots through natural language commands integrated with external digital assets. Its key components are a multi-agent collaborative framework built on Large Language Models for handling complex or abstract commands, and SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition, for the most intuitive human perception. On the UDIS-D dataset, SV-UDIS achieves state-of-the-art results for 3-, 4-, and 5-image stitching, improving PSNR by 9%, 17%, and 21% and SSIM by 8%, 18%, and 26%, respectively.
Link: https://arxiv.org/abs/2503.14948
Authors: Hao Liang, Zhipeng Dong, Yi Yang, Mengyin Fu
Affiliations: School of Automation, Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Collaborative perception has garnered significant attention for its ability to enhance the perception capabilities of individual vehicles through the exchange of information with surrounding vehicle-agents. However, existing collaborative perception systems are limited by inefficiencies in user interaction and the challenge of multi-camera photorealistic visualization. To address these challenges, this paper introduces ChatStitch, the first collaborative perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To adeptly handle complex or abstract commands, ChatStitch employs a multi-agent collaborative framework based on Large Language Models. For achieving the most intuitive perception for humans, ChatStitch proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D, MCOV-SLAM open datasets, and our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.
[CV-75] Generating Multimodal Driving Scenes via Next-Scene Prediction
【Quick Read】: This paper addresses the limitation that existing generative models for autonomous driving (AD) capture only a narrow range of modalities, restricting controllable scene generation for comprehensive AD evaluation. The proposed multimodal generation framework covers four major data modalities, including a newly introduced map modality. Its key design combines two autoregressive components over tokenized modalities: a Temporal AutoRegressive (TAR) component that captures inter-frame dynamics per modality, and an Ordered AutoRegressive (OAR) component that aligns modalities within each scene by predicting tokens in a fixed order. An Action-aware Map Alignment (AMA) module applies an ego-action-based transformation to keep the map and ego-action modalities coherent. The framework generates complex, realistic driving scenes over long sequences while ensuring multimodal consistency and fine-grained control over scene elements.
Link: https://arxiv.org/abs/2503.14945
Authors: Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
Affiliations: School of Software Engineering, XJTU; Horizon Robotics; School of Computer and Communication Sciences, EPFL; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to maintain coherence between these modalities. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.
[CV-76] MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
【Quick Read】: This paper addresses four main limitations of image fusion methods: 1) the need for task- or dataset-specific models; 2) neglect of real-world degradations such as noise, which makes them fail on degraded inputs; 3) operation in pixel space, where attention mechanisms are computationally expensive; and 4) the lack of user interaction. The proposed unified framework for multi-task, multi-degradation, and language-guided fusion rests on two components: a practical degradation pipeline that simulates real-world degradations and generates interactive prompts to guide the model, and an all-in-one Diffusion Transformer (DiT) operating in latent space that fuses a clean image conditioned on the degraded inputs and the generated prompts, with principled modifications to the original DiT architecture to better suit the fusion task. Two variants are developed: a regression-based and a flow-matching-based version. Extensive qualitative and quantitative experiments show the approach resolves the above limitations and outperforms prior restoration+fusion and all-in-one pipelines.
Link: https://arxiv.org/abs/2503.14944
Authors: Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
Affiliations: UESTC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (\textite.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at this https URL.
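A toy version of the degradation-plus-prompt idea: sample a corruption, apply it, and emit a matching instruction for the model. The operations and prompt wording are assumptions; the paper's pipeline is more elaborate.

```python
import numpy as np
from scipy.ndimage import convolve

def degrade(img: np.ndarray, rng: np.random.Generator):
    """Apply one random degradation and return a paired guidance prompt."""
    kind = rng.choice(["noise", "blur", "low_light"])
    if kind == "noise":
        out = np.clip(img + rng.normal(0.0, 0.1, img.shape), 0.0, 1.0)
        prompt = "remove the noise, then fuse the inputs"
    elif kind == "blur":
        out = convolve(img, np.ones((5, 5)) / 25.0)   # simple box blur
        prompt = "deblur the inputs before fusion"
    else:
        out = img * 0.4                                # simulate low light
        prompt = "brighten the low-light input, then fuse"
    return out.astype(np.float32), prompt

rng = np.random.default_rng(0)
img = rng.random((64, 64)).astype(np.float32)
degraded, prompt = degrade(img, rng)
print(prompt, degraded.shape)
```

Training on such (degraded input, prompt) pairs is what lets a single model both restore and fuse, and it is also where the user-interaction capability enters: at inference the prompt can come from the user instead of the pipeline.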
[CV-77] 3D Engine-ready Photorealistic Avatars via Dynamic Textures
【Quick Read】: This paper confronts the tension between high-fidelity digital avatars and practical deployment. Digitization methods used in 3D production pipelines depend on costly capture rigs, putting them out of reach of ordinary consumers, while implicit representations (such as the volumetric ones used by NeRFs) can reconstruct humans from limited data and produce impressive videos but are incompatible with traditional rendering pipelines, limiting their use in applications such as games.
The key contribution is an end-to-end pipeline that builds explicitly represented, photorealistic avatars from standard 3D assets, using dynamically generated textures to enhance realism and visually mask deficiencies in the underlying mesh geometry. This achieves visual quality comparable to state-of-the-art 3D avatar generation methods while integrating seamlessly into existing graphics rendering pipelines.
Link: https://arxiv.org/abs/2503.14943
Authors: Yifan Wang, Ivan Molodetskikh, Ondrej Texler, Dimitar Dinev
Affiliations: Samsung Research America; Lomonosov Moscow State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As the digital and physical worlds become more intertwined, there has been a lot of interest in digital avatars that closely resemble their real-world counterparts. Current digitization methods used in 3D production pipelines require costly capture setups, making them impractical for mass usage among common consumers. Recent academic literature has found success in reconstructing humans from limited data using implicit representations (e.g., voxels used in NeRFs), which are able to produce impressive videos. However, these methods are incompatible with traditional rendering pipelines, making it difficult to use them in applications such as games. In this work, we propose an end-to-end pipeline that builds explicitly-represented photorealistic 3D avatars using standard 3D assets. Our key idea is the use of dynamically-generated textures to enhance the realism and visually mask deficiencies in the underlying mesh geometry. This allows for seamless integration with current graphics pipelines while achieving comparable visual quality to state-of-the-art 3D avatar generation methods.
[CV-78] UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation CVPR2025
【Quick Read】: This paper addresses two problems in evaluating multimodal large language models (MLLMs): the heavy human workload of hand-designing visual question-answer (QA) pairs, which limits the scale and scope of evaluation, and the biases introduced by automated MLLM-as-judge approaches. The proposed Unsupervised Peer review MLLM Evaluation (UPME) framework uses only image data, letting models automatically generate questions and peer-review one another's answers, which effectively reduces reliance on human labor. A vision-language scoring system mitigates bias along three axes: response correctness, visual understanding and reasoning, and image-text relevance. UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on ScienceQA, indicating close alignment with human-designed benchmarks and inherent human preferences.
Link: https://arxiv.org/abs/2503.14941
Authors: Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan
Affiliations: School of Electrical and Computer Engineering, Peking University; Hupan Lab; DAMO Academy, Alibaba Group; University of Notre Dame; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025
Abstract:Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design QA pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
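The reported agreement numbers are plain Pearson correlations between framework scores and human scores across models; computing one is a two-liner with SciPy (the values below are made up for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-model scores: UPME framework vs. human benchmark
upme  = np.array([0.71, 0.64, 0.58, 0.80, 0.62, 0.75])
human = np.array([0.69, 0.60, 0.55, 0.82, 0.65, 0.74])
r, p = pearsonr(upme, human)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```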
[CV-79] VisNumBench: Evaluating Number Sense of Multimodal Large Language Models
【Quick Read】: This paper asks whether multimodal large language models (MLLMs) can develop an intuitive, human-like number sense. To answer it, the authors introduce the Visual Number Benchmark (VisNumBench), comprising about 1,900 multiple-choice question-answer pairs derived from synthetic and real-world visual data, spanning seven visual numerical attributes and four types of visual numerical estimation tasks. Experiments on VisNumBench show that even stronger MLLMs (with larger parameter counts and broader general abilities) gain only modestly on number-sense tasks, and that all 17 tested models, open-source and proprietary alike, perform far below human level. The key contribution is thus a comprehensive visual number-sense benchmark that systematically exposes the limitations of current MLLMs.
Link: https://arxiv.org/abs/2503.14939
Authors: Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming
Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Shenzhen Technology University; Tsinghua Shenzhen International Graduate School
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs’ number sense abilities. All benchmark resources, including code and datasets, will be publicly available at this https URL.
[CV-80] Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification
【Quick Read】: This paper addresses Few-Shot Remote Sensing Scene Classification (FS-RSSC), where existing methods emphasize single-modality feature learning and miss the benefits of optimizing multimodal representations. The proposed Optimal Transport Adapter Tuning (OTAT) framework constructs an ideal representational space through optimal transport (OT) theory. Its core is the Optimal Transport Adapter (OTA), which uses cross-modal attention to enrich textual representations and foster information interaction and complementarity across modalities; casting the network optimization as an OT problem establishes efficient pathways for balanced information exchange. A sample-level Entropy-Aware Weighted (EAW) loss, combining difficulty-weighted similarity scores with entropy-based regularization, gives finer control over the OT optimization, improving its solvability and stability. Experiments on benchmark datasets show OTAT achieves state-of-the-art FS-RSSC performance with notable gains in model performance and generalization.
Link: https://arxiv.org/abs/2503.14938
Authors: Zhong Ji, Ci Liu, Jingren Liu, Chen Tang, Yanwei Pang, Xuelong Li
Affiliations: School of Electrical and Information Engineering, Tianjin University; Shanghai Artificial Intelligence Laboratory; Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Few-Shot Remote Sensing Scene Classification (FS-RSSC) presents the challenge of classifying remote sensing images with limited labeled samples. Existing methods typically emphasize single-modal feature learning, neglecting the potential benefits of optimizing multi-modal representations. To address this limitation, we propose a novel Optimal Transport Adapter Tuning (OTAT) framework aimed at constructing an ideal Platonic representational space through optimal transport (OT) theory. This framework seeks to harmonize rich visual information with less dense textual cues, enabling effective cross-modal information transfer and complementarity. Central to this approach is the Optimal Transport Adapter (OTA), which employs a cross-modal attention mechanism to enrich textual representations and facilitate subsequent better information interaction. By transforming the network optimization into an OT optimization problem, OTA establishes efficient pathways for balanced information exchange between modalities. Moreover, we introduce a sample-level Entropy-Aware Weighted (EAW) loss, which combines difficulty-weighted similarity scores with entropy-based regularization. This loss function provides finer control over the OT optimization process, enhancing its solvability and stability. Our framework offers a scalable and efficient solution for advancing multimodal learning in remote sensing applications. Extensive experiments on benchmark datasets demonstrate that OTAT achieves state-of-the-art performance in FS-RSSC, significantly improving the model performance and generalization.
[CV-81] FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
【Quick Read】: This paper targets the limited fine-grained motion comprehension of multimodal large language models (MLLMs). The key contributions are the FAVOR-Bench evaluation benchmark and the companion FAVOR-Train dataset. FAVOR-Bench contains 1,776 videos with structured manual annotations of various motions and supports both close-ended tasks (8,184 multiple-choice question-answer pairs across six sub-tasks) and open-ended tasks, for which the authors provide a cost-efficient, LLM-free evaluation method that improves interpretability and reproducibility, as well as a GPT-assisted one. FAVOR-Train adds 17,152 videos with fine-grained motion annotations; fine-tuning Qwen2.5-VL on it yields consistent gains on the motion-related tasks of TVBench, MotionBench, and FAVOR-Bench. Together they provide valuable tools for building models with stronger video understanding.
Link: https://arxiv.org/abs/2503.14935
Authors: Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen
Affiliations: Fudan University; The Chinese University of Hong Kong; StepFun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: FAVOR-Bench project page: this https URL
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: this https URL.
[CV-82] Shushing! Let's Imagine an Authentic Speech from the Silent Video
【Quick Read】: This paper addresses the failure of existing vision-guided speech generation methods to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody, and proposes the Consistent Video-to-Speech (CV2S) task to strengthen cross-modal consistency. The key solution is ImaginTalk, a cross-modal diffusion framework that operates in a discrete space and generates faithful speech from visual input alone. Its core components are: (1) a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information; (2) an error detector that identifies misaligned tokens, which are then refined through masked language modeling with BERT; and (3) a style diffusion transformer with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while staying synchronized with lip-aware semantic features. Together, these techniques improve the fidelity of the generated speech, the accuracy of its semantic details, and its expressiveness in timbre and emotion.
Link: https://arxiv.org/abs/2503.14928
Authors: Jiaxin Ye, Hongming Shan
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project Page: this https URL
Abstract:Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: this https URL.
[CV-83] GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
【Quick Read】: This paper addresses the data-heterogeneity challenge that arises when training on large-scale multi-source motion datasets with varying motion content. It proposes the Generative Pretrained Multi-path Motion Model (GenM³), a comprehensive framework for learning unified motion representations, built on two components:
- a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation;
- a Multi-path Motion Transformer (MMT) that improves intra-modal representations through separate modality-specific pathways (each with densely activated experts to accommodate intra-modal variation) and improves cross-modal alignment through a shared text-motion pathway.
To enable large-scale training, the authors integrate and unify 11 high-quality motion datasets (about 220 hours of motion data) and augment them with textual annotations: nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts.
Link: https://arxiv.org/abs/2503.14919
Authors: Junyu Shi, Lijiang Liu, Yong Sun, Zhiyuan Zhang, Jinni Zhou, Qiang Nie
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM^3), a comprehensive framework designed to learn unified motion representations. GenM^3 comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM^3 achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
[CV-84] Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes
【Quick Read】: This paper aims to represent 3D indoor scenes compactly, exploiting the fact that indoor scenes are dominated by man-made objects such as furniture, whose geometry is largely rectilinear. The key idea is a deep-learning-based fitting framework that represents a scene as a combination of polycuboids. Given a noisy point cloud, a transformer network first detects six types of cuboid faces; a graph neural network then validates the spatial relationships of the detected faces to form potential polycuboids; finally, each polycuboid instance is reconstructed as a set of boxes based on the aggregated face labels. To train the networks, the authors build a synthetic dataset covering diverse cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. The framework generalizes well to real-world indoor datasets including Replica, ScanNet, and scenes captured with an iPhone, and its versatility is demonstrated through practical applications such as virtual room tours and scene editing.
Link: https://arxiv.org/abs/2503.14912
Authors: Gahye Lee, Hyejeong Yoon, Jungeon Kim, Seungyong Lee
Affiliations: POSTECH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 3DV 2025
Abstract:This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.
[CV-85] Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
【Quick Read】: This paper addresses the lag of dermatology behind other medical domains in vision-language model applications, caused by existing dermatological datasets being limited in scale and depth: they offer only single-label annotations over a narrow range of diseases and lack rich textual descriptions and the clinical context needed for real-world use. To address these limitations, the paper presents Derm1M, the first large-scale vision-language dataset for dermatology, with 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology developed collaboratively with experts, Derm1M covers over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate its potential for both AI research and clinical application, the authors pretrain a family of CLIP-like models, collectively called DermLIP, on this dataset.
Link: https://arxiv.org/abs/2503.14911
Authors: Siyuan Yan, Ming Hu, Yiwen Jiang, Xieji Li, Hao Fei, Philipp Tschandl, Harald Kittler, Zongyuan Ge
Affiliations: Monash University; National University of Singapore; Medical University of Vienna
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages
Abstract:The emergence of vision-language models has transformed medical AI, enabling unprecedented advances in diagnostic capability and clinical applications. However, progress in dermatology has lagged behind other medical domains due to the lack of standard image-text pairs. Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. Built from diverse educational resources and structured around a standard ontology collaboratively developed by experts, Derm1M provides comprehensive coverage for over 390 skin conditions across four hierarchical levels and 130 clinical concepts with rich contextual information such as medical history, symptoms, and skin tone. To demonstrate Derm1M potential in advancing both AI research and clinical application, we pretrained a series of CLIP-like models, collectively called DermLIP, on this dataset. The DermLIP family significantly outperforms state-of-the-art foundation models on eight diverse datasets across multiple tasks, including zero-shot skin disease classification, clinical and artifacts concept identification, few-shot/full-shot learning, and cross-modal retrieval. Our dataset and code will be public.
[CV-86] Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift
【Quick Read】: This paper targets anomaly detection for quality control in industrial applications, where robustness to unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods handle domain shift by training generalizable models, but they often rely on prior knowledge of the target distribution and hardly generalize to backbones designed for other data modalities. The key idea is to build on memory-bank-based anomaly detection and optimize a robust Sinkhorn distance on limited target training data, improving generalization to unseen target domains. Experiments on 2D and 3D anomaly detection benchmarks with simulated distribution shifts show superior results over state-of-the-art anomaly detection and domain adaptation methods.
Link: https://arxiv.org/abs/2503.14910
Authors: Jingyi Liao, Xun Xu, Yongyi Su, Rong-Cheng Tu, Yifan Liu, Dacheng Tao, Xulei Yang
Affiliations: Institute for Infocomm Research, A*STAR; Nanyang Technological University; South China University of Technology; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models but often rely on prior knowledge of target distributions and can hardly generalise to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.
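To make the Sinkhorn distance concrete, here is a minimal log-domain entropic optimal transport implementation between a test sample's patch features and a memory bank. This is the generic Sinkhorn iteration, not the paper's robust variant; all names, shapes, and hyperparameters are assumptions.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized OT cost between point sets x (n, d) and y (m, d)."""
    cost = torch.cdist(x, y, p=2) ** 2            # (n, m) pairwise cost matrix
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                # uniform source marginal
    nu = torch.full((m,), 1.0 / m)                # uniform target marginal
    u = torch.zeros(n)
    v = torch.zeros(m)
    for _ in range(n_iters):                      # Sinkhorn updates in log space
        u = eps * (torch.log(mu) - torch.logsumexp((-cost + v[None, :]) / eps, dim=1))
        v = eps * (torch.log(nu) - torch.logsumexp((-cost + u[:, None]) / eps, dim=0))
    pi = torch.exp((-cost + u[:, None] + v[None, :]) / eps)  # transport plan
    return (pi * cost).sum()

# Distance between test-sample patch features and a memory bank of normal features.
test_feats = torch.randn(64, 128)
memory_bank = torch.randn(256, 128)
print(sinkhorn_distance(test_feats, memory_bank))
```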
[CV-87] POSTA: A Go-to Framework for Customized Artistic Poster Generation CVPR2025
【Quick Read】: This paper addresses the shortcomings of existing automatic poster design methods in text accuracy, user customization, and aesthetic appeal, which limit their applicability in artistic domains such as movies and exhibitions where both clear content delivery and visual impact are essential. The proposed solution, POSTA, is a modular framework that combines diffusion models with multimodal large language models (MLLMs) for customized artistic poster generation through three modules: Background Diffusion creates a themed background from user input; Design MLLM generates layout and typography elements that match and complement the background style; and ArtText Diffusion applies additional stylization to key text elements. The result is a visually cohesive, appealing poster, with a fully modular process that allows complete customization.
Link: https://arxiv.org/abs/2503.14908
Authors: Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, Xinchao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong; National University of Singapore; The Hong Kong University of Science and Technology
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025
Abstract:Poster design is a critical medium for visual communication. Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. The framework consists of three modules. Background Diffusion creates a themed background based on user input. Design MLLM then generates layout and typography elements that align with and complement the background style. Finally, to enhance the poster’s aesthetic appeal, ArtText Diffusion applies additional stylization to key text elements. The final result is a visually cohesive and appealing poster, with a fully modular process that allows for complete customization. To train our models, we develop the PosterArt dataset, comprising high-quality artistic posters annotated with layout, typography, and pixel-level stylized text segmentation. Our comprehensive experimental analysis demonstrates POSTA’s exceptional controllability and design diversity, outperforming existing models in both text accuracy and aesthetic quality.
[CV-88] Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation
【Quick Read】: This paper studies the authenticity assessment and detection challenges raised by the growing prevalence of synthetic images produced by rapidly advancing AI-generated content (AIGC) technologies. Existing methods can judge image authenticity and localize forgeries, but they usually lack human interpretability and do not fully address the increasing complexity of synthetic data. The paper introduces FakeVLM, a large multimodal model specialized for both general synthetic-image and DeepFake detection; it not only distinguishes real from fake images but also explains image artifacts in clear natural language, improving interpretability. The paper also presents FakeClue, a dataset of over 100,000 images in seven categories annotated with fine-grained artifact clues in natural language. FakeVLM matches expert-model performance without extra classifiers, making it a robust solution for synthetic data detection; extensive evaluations across multiple datasets confirm its superiority in both authenticity classification and artifact explanation, setting a new benchmark for synthetic image detection.
Link: https://arxiv.org/abs/2503.14905
Authors: Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, Weijia Li
Affiliations: Shanghai Artificial Intelligence Laboratory; Sun Yat-Sen University; Beihang University; Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: this https URL.
[CV-89] When Domain Generalization meets Generalized Category Discovery: An Adaptive Task-Arithmetic Driven Approach
【Quick Read】: This paper addresses a limitation of conventional Generalized Class Discovery (GCD) methods under distribution shift: they typically require access to target-domain data during training, which can be impractical. To address this, it introduces Domain Generalization in GCD (DG-GCD), a new paradigm in which only source data is available for training while the model must adapt to an unseen target domain.
The key to the proposed network, DG2CD-Net, is an episodic training strategy: the base model is fine-tuned on a mix of tasks drawn from the source domain and from synthetic domains generated by a foundation model, enhancing cross-domain generalization. Each episode is a cross-domain GCD task that combines open-set domain adaptation with a novel margin loss and representation learning to progressively refine the feature space. To capture the effect of fine-tuning on the base model, the method extends task arithmetic by adaptively weighting the local task vectors of the fine-tuned models according to their GCD performance on a validation distribution; this episodic update mechanism boosts the base model's adaptability to unseen targets. Experiments on three datasets show that DG2CD-Net outperforms existing GCD methods customized for DG-GCD.
Link: https://arxiv.org/abs/2503.14897
Authors: Vaibhav Rathore, Shubhranil B, Saikat Dutta, Sarthak Mehrotra, Zsolt Kira, Biplab Banerjee
Affiliations: IIT Bombay; IITB-Monash Research Academy; Georgia Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generalized Class Discovery (GCD) clusters base and novel classes in a target domain using supervision from a source domain with only base classes. Current methods often falter with distribution shifts and typically require access to target data during training, which can sometimes be impractical. To address this issue, we introduce the novel paradigm of Domain Generalization in GCD (DG-GCD), where only source data is available for training, while the target domain, with a distinct data distribution, remains unseen until inference. To this end, our solution, DG2CD-Net, aims to construct a domain-independent, discriminative embedding space for GCD. The core innovation is an episodic training strategy that enhances cross-domain generalization by adapting a base model on tasks derived from source and synthetic domains generated by a foundation model. Each episode focuses on a cross-domain GCD task, diversifying task setups over episodes and combining open-set domain adaptation with a novel margin loss and representation learning for optimizing the feature space progressively. To capture the effects of fine-tuning on the base model, we extend task arithmetic by adaptively weighting the local task vectors concerning the fine-tuned models based on their GCD performance on a validation distribution. This episodic update mechanism boosts the adaptability of the base model to unseen targets. Experiments across three datasets confirm that DG2CD-Net outperforms existing GCD methods customized for DG-GCD.
[CV-90] Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers
【Quick Read】: This paper addresses the substantial memory overhead that Visual Autoregressive models incur during inference from storing previously generated representations. Although compression techniques have been tried, prior work never explicitly formalized the KV-cache compression problem in this setting. The paper first formally defines KV-cache compression for Visual Autoregressive transformers, then proves a fundamental negative result: any attention-based mechanism for sequential visual token generation must use at least \Omega(n^2 d) memory when d = \Omega(\log n), where n is the number of generated tokens and d is the embedding dimension. This shows that truly sub-quadratic memory usage is impossible without additional structural constraints. The proof is constructed via a reduction from a computational lower-bound problem, using randomized embedding techniques inspired by dimensionality-reduction principles. Finally, the paper discusses how sparsity priors on visual representations affect memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
Link: https://arxiv.org/abs/2503.14881
Authors: Bo Chen, Xiaoyu Li, Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song
Affiliations: Middle Tennessee State University; Stevens Institute of Technology; Independent Researcher; The University of Hong Kong; University of Wisconsin-Madison; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at the UC, Berkeley
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least \Omega(n^2 d) memory, when d = \Omega(\log n) , where n is the number of tokens generated and d is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.
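As a back-of-the-envelope illustration of why KV-cache memory becomes prohibitive as visual token counts grow, the sketch below computes the naive (uncompressed) cache footprint. The model configuration is hypothetical, not from the paper, and this simple accounting is only motivation for compression; the paper's lower bound concerns the total memory of any attention-based sequential generation mechanism, which this arithmetic does not prove.

```python
def kv_cache_bytes(n_tokens, d_model, n_layers, bytes_per_elem=2):
    """Naive KV cache: keys + values for every token at every layer (fp16)."""
    return 2 * n_tokens * d_model * n_layers * bytes_per_elem

# Hypothetical 32-layer model with d_model = 2048 (illustrative numbers only).
for n in (1_024, 4_096, 16_384):   # token counts for increasing image resolutions
    gib = kv_cache_bytes(n, 2048, 32) / 2**30
    print(f"{n:>6} tokens -> {gib:.2f} GiB of KV cache")
```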
[CV-91] DPFlow: Adaptive Optical Flow Estimation with a Dual-Pyramid Framework CVPR2025
【Quick Read】: This paper fills two gaps in optical flow estimation for high-resolution video: (1) existing methods are usually designed for low resolution and, due to rigid architectures, fail to generalize to large inputs, commonly resorting to downscaling or input tiling that loses detail and global information; (2) there is no optical flow benchmark for judging the actual performance of existing methods on high-resolution samples. The key contributions are DPFlow, an adaptive optical flow architecture that generalizes to inputs up to 8K resolution while trained only on low-resolution samples, and Kubric-NK, a benchmark for evaluating optical flow methods at input resolutions from 1K to 8K. Together these push the boundaries of existing methods and reveal new insights into their generalization capabilities. Extensive experiments show that DPFlow achieves state-of-the-art results on MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.
Link: https://arxiv.org/abs/2503.14880
Authors: Henrique Morimitsu, Xiaobin Zhu, Roberto M. Cesar Jr., Xiangyang Ji, Xu-Cheng Yin
Affiliations: University of Science and Technology Beijing; University of São Paulo; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025. The code and dataset are available at this https URL. 24 pages, 17 figures
Abstract:Optical flow estimation is essential for video processing tasks, such as restoration and action recognition. The quality of videos is constantly increasing, with current standards reaching 8K resolution. However, optical flow methods are usually designed for low resolution and do not generalize to large inputs due to their rigid architectures. They adopt downscaling or input tiling to reduce the input size, causing a loss of details and global information. There is also a lack of optical flow benchmarks to judge the actual performance of existing methods on high-resolution samples. Previous works only conducted qualitative high-resolution evaluations on hand-picked samples. This paper fills this gap in optical flow estimation in two ways. We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. We also introduce Kubric-NK, a new benchmark for evaluating optical flow methods with input resolutions ranging from 1K to 8K. Our high-resolution evaluation pushes the boundaries of existing methods and reveals new insights about their generalization capabilities. Extensive experimental results show that DPFlow achieves state-of-the-art results on the MPI-Sintel, KITTI 2015, Spring, and other high-resolution benchmarks.
[CV-92] Efficient Personalization of Quantized Diffusion Model without Backpropagation
【Quick Read】: This paper targets the heavy computational and memory demands of diffusion models during training, fine-tuning, and inference, particularly for fine-tuning on personal data on edge devices such as mobile phones. The key idea is to quantize a diffusion model personalized via Textual Inversion and fine-tune only the personalization tokens with zeroth-order optimization, avoiding the gradient and activation storage required by backpropagation and thus sharply reducing memory. Because zeroth-order gradient estimates are very noisy when only one or a few images are available for personalization, the paper proposes Subspace Gradient, which denoises the estimated gradient by projecting it onto a subspace constructed from the token history. It also studies the influence of text embeddings on image generation, leading to Partial Uniform Timestep Sampling for sampling effective diffusion timesteps. The method matches prior approaches on image- and text-alignment scores for personalizing Stable Diffusion while using only forward passes and reducing training memory demand by up to 8.2×.
Link: https://arxiv.org/abs/2503.14868
Authors: Hoigi Seo, Wongi Jeong, Kyungryeol Lee, Se Young Chun
Affiliations: Dept. of Electrical and Computer Engineering, Seoul National University; INMC & IPAI, Seoul National University, Republic of Korea
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace that is constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed time steps sampling, dubbed Partial Uniform Timestep Sampling for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand up to 8.2\times .
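A minimal NumPy sketch of the two core ideas, a two-point zeroth-order gradient estimate (forward passes only) and denoising by projection onto a subspace spanned by the token history, is given below. The toy quadratic loss stands in for the diffusion personalization objective, and the update rule, history length, and step sizes are assumptions.

```python
import numpy as np

def zo_gradient(loss_fn, theta, sigma=1e-3, n_samples=8):
    """Two-point zeroth-order gradient estimate (no backpropagation needed)."""
    g = np.zeros_like(theta)
    for _ in range(n_samples):
        u = np.random.randn(*theta.shape)
        g += (loss_fn(theta + sigma * u) - loss_fn(theta - sigma * u)) / (2 * sigma) * u
    return g / n_samples

def project_to_subspace(g, history):
    """Denoise g by projecting onto the subspace spanned by past token iterates."""
    basis, _ = np.linalg.qr(np.stack(history, axis=1))  # orthonormal basis (d, k)
    return basis @ (basis.T @ g)

# Toy quadratic loss standing in for the diffusion personalization objective.
target = np.random.randn(64)
loss = lambda t: float(np.sum((t - target) ** 2))

token = np.zeros(64)
history = [np.random.randn(64) for _ in range(4)]  # past iterates (assumed kept)
for _ in range(100):
    g = project_to_subspace(zo_gradient(loss, token), history)
    token -= 0.05 * g
    history = history[1:] + [token.copy()]
print(f"final loss: {loss(token):.4f}")
```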
[CV-93] DVHGNN: Multi-Scale Dilated Vision HGNN for Efficient Vision Recognition
【Quick Read】: This paper addresses two key issues of the Vision Graph Neural Network (ViG): the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction, and the limitation of ordinary graphs to pairwise relations. It proposes the Dilated Vision HyperGraph Neural Network (DVHGNN), a vision architecture that leverages multi-scale hypergraphs to efficiently capture high-order correlations among objects. The key components are Clustering and Dilated HyperGraph Construction (DHGC), which adaptively captures multi-scale dependencies among data samples, and a dynamic hypergraph convolution mechanism that enables adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations on benchmark image datasets show that DVHGNN significantly outperforms state-of-the-art vision backbones; for example, DVHGNN-S achieves 83.1% top-1 accuracy on ImageNet-1K, surpassing ViG-S by 1.0% and ViHGNN-S by 0.6%.
Link: https://arxiv.org/abs/2503.14867
Authors: Caoshuo Li, Tanzhe Li, Xiaobin Hu, Donghao Luo, Taisong Jin
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China; School of Informatics, Xiamen University, China; Tencent Youtu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, Vision Graph Neural Network (ViG) has gained considerable attention in computer vision. Despite its groundbreaking innovation, Vision Graph Neural Network encounters key issues including the quadratic computational complexity caused by its K-Nearest Neighbor (KNN) graph construction and the limitation of pairwise relations of normal graphs. To address the aforementioned challenges, we propose a novel vision architecture, termed Dilated Vision HyperGraph Neural Network (DVHGNN), which is designed to leverage multi-scale hypergraph to efficiently capture high-order correlations among objects. Specifically, the proposed method tailors Clustering and Dilated HyperGraph Construction (DHGC) to adaptively capture multi-scale dependencies among the data samples. Furthermore, a dynamic hypergraph convolution mechanism is proposed to facilitate adaptive feature exchange and fusion at the hypergraph level. Extensive qualitative and quantitative evaluations of the benchmark image datasets demonstrate that the proposed DVHGNN significantly outperforms the state-of-the-art vision backbones. For instance, our DVHGNN-S achieves an impressive top-1 accuracy of 83.1% on ImageNet-1K, surpassing ViG-S by +1.0% and ViHGNN-S by +0.6%.
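For intuition on why hyperedges go beyond pairwise graphs, here is an illustrative sketch of cluster-based hyperedge construction in the spirit of DHGC: each feature cluster becomes one hyperedge joining all of its member tokens. The dilation mechanism and the exact clustering used in the paper are not reproduced; all names and shapes are assumptions.

```python
import torch

def cluster_hyperedges(tokens, n_clusters=8, n_iters=10):
    """Build hyperedges by k-means clustering of token features: each cluster is
    one hyperedge joining all of its members, a many-to-many relation that a
    plain pairwise graph cannot express."""
    n, d = tokens.shape
    centers = tokens[torch.randperm(n)[:n_clusters]].clone()
    for _ in range(n_iters):
        assign = torch.cdist(tokens, centers).argmin(dim=1)   # token -> cluster
        for k in range(n_clusters):
            members = tokens[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(dim=0)
    # Incidence matrix H (n_tokens x n_hyperedges), the standard hypergraph encoding.
    H = torch.zeros(n, n_clusters)
    H[torch.arange(n), assign] = 1.0
    return H

tokens = torch.randn(196, 64)    # e.g., 14x14 patch tokens
H = cluster_hyperedges(tokens)
print(H.shape, H.sum(dim=0))     # incidence matrix and hyperedge sizes
```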
[CV-94] Temporal-Consistent Video Restoration with Pre-trained Diffusion Models
【Quick Read】: This paper tackles video restoration (VR), recovering high-quality videos from degraded ones. Existing zero-shot methods built on pre-trained diffusion models (DMs) show promise but suffer from approximation errors in the reverse diffusion process and insufficient temporal consistency, and handling 3D video data makes VR inherently computationally intensive. The key idea is to view the reverse process of the DM as a function and cast restoration as a Maximum a Posteriori (MAP) problem that parameterizes video frames directly in the DM's seed space, eliminating the approximation errors. The paper further introduces strategies for bilevel temporal consistency: semantic consistency by exploiting clustering structure in the seed space, and pixel-level consistency by progressive warping with optical-flow refinement. Extensive experiments on multiple video restoration tasks show superior visual quality and temporal consistency compared with the state of the art.
Link: https://arxiv.org/abs/2503.14863
Authors: Hengkang Wang, Yang Liu, Huidong Liu, Chien-Chih Wang, Yanhui Guo, Hongdong Li, Bryan Wang, Ju Sun
Affiliations: University of Minnesota; Amazon.com, Inc.; ANU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show good promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, dealing with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posterior (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple video restoration tasks demonstrate superior visual quality and temporal consistency achieved by our method compared to the state-of-the-art.
[CV-95] Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task Dataset and Benchmark
【Quick Read】: This paper addresses unfair and unreliable evaluation of open-vocabulary detectors caused by variations in the vision-aware language vocabulary used for open-vocabulary learning. Recent evaluations mitigate this by incorporating object properties or adding locations and characteristics to captions, but because these depend on the specific details of each image rather than on classes, detectors still need precise human-annotated descriptions to predict accurately. The paper introduces 3F-OVD, a task that extends supervised fine-grained object detection to the open-vocabulary setting: it is intuitive yet challenging, requiring a deep understanding of fine-grained captions and careful attention to fine-grained image details to detect fine-grained objects accurately. Because qualified fine-grained detection datasets are scarce, the authors also create NEU-171K, a dataset tailored to both supervised and open-vocabulary settings, benchmark state-of-the-art detectors on it in both settings, and propose a simple yet effective post-processing technique.
Link: https://arxiv.org/abs/2503.14862
Authors: Ying Liu, Yijing Hua, Haojiang Chai, Yanbo Wang, TengQi Ye
Affiliations: Department of Software Engineering, Northeastern University, China; Articul8 AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures
Abstract:Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors can not make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.
[CV-96] Global Renewables Watch: A Temporal Dataset of Solar and Wind Energy Derived from Satellite Imagery
【Quick Read】: This paper builds a comprehensive global temporal dataset of commercial solar photovoltaic (PV) farms and onshore wind turbines. The key steps are training deep-learning segmentation models to identify renewable energy installations in quarterly high-resolution satellite imagery from Q4 2017 to Q2 2024, deploying them at global scale, and estimating each detected feature's construction date and preceding land-use type. The resulting dataset covers over 13 trillion pixels worldwide and includes 375,197 individual wind turbines and 86,410 solar PV installations. Aggregating the predictions to the country level (estimating total power capacity from construction dates, solar PV area, and turbine counts) and comparing against IRENA's most recent 2023 capacity estimates yields r² values of 0.96 for solar PV and 0.93 for onshore wind, validating the predictions' accuracy. The dataset is a high-quality open resource for assessing progress toward sustainable development goals and for informing policymakers, researchers, and stakeholders.
Link: https://arxiv.org/abs/2503.14860
Authors: Caleb Robinson, Anthony Ortiz, Allen Kim, Rahul Dodhia, Andrew Zolli, Shivaprakash K Nagaraju, James Oakleaf, Joe Kiesecker, Juan M. Lavista Ferres
Affiliations: Microsoft AI for Good Research Lab, Redmond, WA, USA; Planet Labs PBC, San Francisco, CA, USA; The Nature Conservancy (TNC), Washington D.C., USA
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a comprehensive global temporal dataset of commercial solar photovoltaic (PV) farms and onshore wind turbines, derived from high-resolution satellite imagery analyzed quarterly from the fourth quarter of 2017 to the second quarter of 2024. We create this dataset by training deep learning-based segmentation models to identify these renewable energy installations from satellite imagery, then deploy them on over 13 trillion pixels covering the world. For each detected feature, we estimate the construction date and the preceding land use type. This dataset offers crucial insights into progress toward sustainable development goals and serves as a valuable resource for policymakers, researchers, and stakeholders aiming to assess and promote effective strategies for renewable energy deployment. Our final spatial dataset includes 375,197 individual wind turbines and 86,410 solar PV installations. We aggregate our predictions to the country level – estimating total power capacity based on construction date, solar PV area, and number of windmills – and find an r^2 value of 0.96 and 0.93 for solar PV and onshore wind respectively compared to IRENA’s most recent 2023 country-level capacity estimates.
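The country-level validation boils down to a coefficient-of-determination computation; a minimal sketch follows, with made-up capacity numbers standing in for the paper's predictions and IRENA's estimates.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination between reported and predicted capacities."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical country-level capacities in GW (not the paper's data).
irena_gw = np.array([50.2, 12.1, 3.4, 88.9, 27.5])
predicted_gw = np.array([48.7, 13.0, 3.1, 91.2, 26.0])
print(f"r^2 = {r_squared(irena_gw, predicted_gw):.3f}")
```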
[CV-97] Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection
【Quick Read】: This paper argues that the potential of vision-language models (VLMs) for deepfake detection remains underexplored because their knowledge is misaligned with forensic patterns. The solution has three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features via contrastive learning against external manipulation knowledge; (2) a multi-modal prompt-tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; and (3) an iterative refinement strategy that supports evidence-based, multi-turn dialog reasoning. Concretely, the framework consists of a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a large language model (LLM): the image encoder extracts visual prompt embeddings, KFD computes correlations between image features and pristine/deepfake class embeddings for forgery classification and localization, and the resulting forgery prompt embeddings are fed to the LLM to generate textual detection responses that assist judgment. Extensive experiments on FF++, CDF2, DFD, DFDCP, and DFDC show the scheme surpasses state-of-the-art methods in generalization while supporting multi-turn dialogue.
Link: https://arxiv.org/abs/2503.14853
Authors: Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, Chip Hong Chang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misaligned of their knowledge and forensics patterns. To this end, we present a novel paradigm that unlocks VLMs’ potential capabilities through three components: (1) A knowledge-guided forgery adaptation module that aligns VLM’s semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) A multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) An iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.
[CV-98] ClimateGS: Real-Time Climate Simulation with 3D Gaussian Style Transfer
【Quick Read】: This paper targets real-time rendering of adverse climate conditions for testing autonomous-system perception and decision-making: physically-based NeRF rendering can generate realistic scene representations but suffers from slow rendering and long preprocessing, making it impractical for real-time testing and user interaction. The proposed ClimateGS framework integrates 3D Gaussian representations with physical simulation, with three key contributions: (1) a linear transformation for photorealistic 3D Gaussian style transfer that directly modifies spherical harmonics across bands for efficient, consistent style adaptation; (2) a joint training strategy for 3D style transfer that combines supervised and self-supervised learning to accelerate convergence while preserving original scene details; and (3) a real-time rendering method for climate simulation that integrates physics-based effects with 3D Gaussians for efficient, realistic rendering. Evaluations on MipNeRF360 and Tanks and Temples demonstrate real-time rendering with visual quality comparable or superior to SOTA 2D/3D methods, making ClimateGS suitable for interactive applications.
Link: https://arxiv.org/abs/2503.14845
Authors: Yuezhen Xie, Meiying Zhang, Qi Hao
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Adverse climate conditions pose significant challenges for autonomous systems, demanding reliable perception and decision-making across diverse environments. To better simulate these conditions, physically-based NeRF rendering methods have been explored for their ability to generate realistic scene representations. However, these methods suffer from slow rendering speeds and long preprocessing times, making them impractical for real-time testing and user interaction. This paper presents ClimateGS, a novel framework integrating 3D Gaussian representations with physical simulation to enable real-time climate effects rendering. The novelty of this work is threefold: 1) developing a linear transformation for 3D Gaussian photorealistic style transfer, enabling direct modification of spherical harmonics across bands for efficient and consistent style adaptation; 2) developing a joint training strategy for 3D style transfer, combining supervised and self-supervised learning to accelerate convergence while preserving original scene details; 3) developing a real-time rendering method for climate simulation, integrating physics-based effects with 3D Gaussian to achieve efficient and realistic rendering. We evaluate ClimateGS on MipNeRF360 and Tanks and Temples, demonstrating real-time rendering with comparable or superior visual quality to SOTA 2D/3D methods, making it suitable for interactive applications.
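To illustrate what a linear style transform on spherical harmonics can look like, here is a small NumPy sketch that re-mixes the color channels of every SH band and offsets only the DC (average color) term. Whether ClimateGS restricts the offset to the DC band is an assumption; the mixing matrix, tint values, and shapes are illustrative only.

```python
import numpy as np

def style_transform_sh(sh_coeffs, A, b):
    """Apply one linear color transform (A, b) to every spherical-harmonic band.

    sh_coeffs: (N, B, 3) SH color coefficients for N Gaussians over B bands.
    A: (3, 3) color mixing matrix; b: (3,) offset applied to the DC band only,
    so view-dependent bands are re-mixed without shifting their zero mean.
    """
    out = sh_coeffs @ A.T          # same channel mixing across all bands
    out[:, 0, :] += b              # offset only the DC (average color) term
    return out

# Hypothetical "cold/overcast" style: partial desaturation plus a blue tint.
rng = np.random.default_rng(0)
sh = rng.normal(size=(10_000, 16, 3))                # degree-3 SH: 16 coefficients
A = 0.6 * np.eye(3) + (0.4 / 3) * np.ones((3, 3))    # rows sum to 1: desaturate
b = np.array([-0.05, 0.0, 0.08])                     # cooler tint on base color
print(style_transform_sh(sh, A, b).shape)            # (10000, 16, 3)
```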
[CV-99] SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments
【Quick Read】: This paper addresses high-accuracy perception of dynamic traffic scenes for high-level autonomous driving, where object motion estimation and instance segmentation are usually treated as separate tasks, leading to suboptimal performance, spatio-temporal inconsistency, and inefficiency in complex scenarios due to the absence of information sharing. The proposed multi-task SemanticFlow framework simultaneously predicts scene flow and instance segmentation for full-resolution point clouds. Its novelty is threefold: (1) a coarse-to-fine multi-task scheme in which an initial coarse segmentation of static background and dynamic objects provides context for refining motion and semantic information through a shared feature-processing module; (2) a set of loss functions that improve scene flow estimation and instance segmentation while enforcing spatial and temporal consistency of both static and dynamic objects; and (3) a self-supervised learning scheme that uses the coarse segmentation to detect rigid objects and compute their transformation matrices between consecutive frames, enabling the generation of self-supervised labels. Validated on the Argoverse and Waymo datasets, the framework shows superior instance segmentation accuracy, scene flow estimation, and computational efficiency, setting a new benchmark for self-supervised dynamic scene understanding.
Link: https://arxiv.org/abs/2503.14837
Authors: Yinqi Chen, Meiying Zhang, Qi Hao, Guang Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, while can help ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.
[CV-100] On the Robustness Tradeoff in Fine-Tuning
【Quick Read】: This paper characterizes the robustness-accuracy trade-off that arises when fine-tuning pre-trained (upstream) models for downstream tasks. It systematically evaluates the robustness and accuracy of fine-tuned models across 6 benchmark datasets and 7 fine-tuning strategies, observing a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit work better for simple tasks (over 75% above the average, measured by area under the Pareto frontier on CIFAR-10 and CIFAR-100), whereas fine-tuning information-heavy layers such as attention layers via Compacter achieves a better Pareto frontier on more complex tasks (57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively). The paper also finds that robustness to out-of-distribution data closely tracks accuracy. These insights underscore the need for robustness-aware fine-tuning to ensure reliable real-world deployment.
Link: https://arxiv.org/abs/2503.14836
Authors: Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak, Yohan Beugin, Eric Pauley, Patrick McDaniel
Affiliations: University of Wisconsin-Madison
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks–over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks–57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.
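The Pareto-frontier language can be made concrete with a few lines of Python: given (accuracy, robustness) pairs for several fine-tuning strategies, keep the points no other point dominates. The numbers below are hypothetical, not the paper's measurements.

```python
def pareto_frontier(points):
    """Return the (accuracy, robustness) points not dominated by any other point."""
    frontier = []
    for acc, rob in points:
        dominated = any(a >= acc and r >= rob and (a, r) != (acc, rob)
                        for a, r in points)
        if not dominated:
            frontier.append((acc, rob))
    return sorted(frontier)

# Hypothetical (accuracy, adversarial robustness) pairs, one per strategy.
results = [(0.92, 0.31), (0.89, 0.45), (0.95, 0.22), (0.88, 0.40), (0.85, 0.50)]
print(pareto_frontier(results))
# -> [(0.85, 0.5), (0.89, 0.45), (0.92, 0.31), (0.95, 0.22)]
# (0.88, 0.40) is dropped: it is dominated by (0.89, 0.45) on both axes.
```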
[CV-101] H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection
【Quick Read】: This paper tackles continual detection of out-of-distribution (OOD) samples for Task Incremental Learning (TIL) in open-world settings. Current OOD detection methods face several issues here: relying on model outputs makes them overly dependent on model performance; suitable thresholds are hard to choose, hindering real-world deployment; and binary ID/OOD classification provides no task-level identification. The proposed Hierarchical Two-sample Tests (H2ST) method eliminates threshold selection through hypothesis testing and uses feature maps to better exploit the model's capabilities without excessive dependence on its performance. Its hierarchical architecture enables task-level detection with better performance and lower overhead than non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate H2ST's effectiveness in open-world TIL scenarios and its superiority over existing methods.
Link: https://arxiv.org/abs/2503.14832
Authors: Yuhang Liu, Wenjie Zhao, Yunhui Guo
Affiliations: University of Electronic Science and Technology of China; University of Texas at Dallas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures
Abstract:Task Incremental Learning (TIL) is a specialized form of Continual Learning (CL) in which a model incrementally learns from non-stationary data streams. Existing TIL methodologies operate under the closed-world assumption, presuming that incoming data remains in-distribution (ID). However, in an open-world setting, incoming samples may originate from out-of-distribution (OOD) sources, with their task identities inherently unknown. Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance, selecting suitable thresholds is difficult, hindering real-world deployment, and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). H2ST eliminates the need for threshold selection through hypothesis testing and utilizes feature maps to better exploit model capabilities without excessive dependence on model performance. The proposed hierarchical architecture enables task-level detection with superior performance and lower overhead compared to non-hierarchical classifier two-sample tests. Extensive experiments and analysis validate the effectiveness of H2ST in open-world TIL scenarios and its superiority to the existing methods. Code is available at this https URL.
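For intuition on threshold-free OOD detection via hypothesis testing, the sketch below runs a generic two-sample test between a task's stored features and incoming features, using per-dimension Kolmogorov-Smirnov tests with a Bonferroni correction. H2ST's actual test statistic, hierarchy, and use of feature maps differ; this is only a schematic illustration with assumed shapes.

```python
import numpy as np
from scipy.stats import ks_2samp

def two_sample_ood_test(task_features, incoming_features, alpha=0.01):
    """Flag incoming data as OOD for a task when its feature distribution differs
    significantly from that task's stored features (no score threshold needed)."""
    d = task_features.shape[1]
    # Per-dimension KS tests, combined with a Bonferroni correction over dims.
    p_values = [ks_2samp(task_features[:, j], incoming_features[:, j]).pvalue
                for j in range(d)]
    return min(p_values) < alpha / d   # True -> reject "same distribution" (OOD)

rng = np.random.default_rng(0)
task_feats = rng.normal(0.0, 1.0, size=(500, 32))
in_dist = rng.normal(0.0, 1.0, size=(200, 32))
shifted = rng.normal(0.8, 1.0, size=(200, 32))
print(two_sample_ood_test(task_feats, in_dist))   # False: consistent with task
print(two_sample_ood_test(task_feats, shifted))   # True: flagged as OOD
```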
[CV-102] Decompositional Neural Scene Reconstruction with Generative Diffusion Prior CVPR’25
【速读】:该论文旨在解决从稀疏视角输入中进行完整形状与详细纹理的3D场景分解重建问题,这是下游应用极具吸引力但极具挑战性的任务。现有方法通过语义或几何正则化来应对这一挑战,但在欠约束区域存在显著退化,并且无法恢复被遮挡的区域。论文认为解决问题的关键在于为这些区域补充缺失的信息。为此,作者提出了DP-Recon方法,采用扩散先验(Score Distillation Sampling, SDS)优化每个单独物体在新视图下的神经表示,从而为欠约束区域提供额外信息。然而,直接引入扩散先验可能导致重建与生成引导之间的潜在冲突,因此进一步引入了一种可见性引导的方法来动态调整每像素的SDS损失权重。这些组件共同提升了几何和外观恢复的效果,同时忠实于输入图像。实验结果表明,该方法在Replica和ScanNet++数据集上的表现显著优于当前最先进的方法,并能够在仅10个视角下实现比基线方法在100个视角下更好的物体重建效果。
链接: https://arxiv.org/abs/2503.14830
作者: Junfeng Ni,Yu Liu,Ruijie Lu,Zirui Zhou,Song-Chun Zhu,Yixin Chen,Siyuan Huang
机构: Tsinghua University (清华大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室, BIGAI); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR’25. Project page: this https URL
点击查看摘要
Abstract:Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic Visual effects (VFX) editing. The project page is available at this https URL.
[CV-103] Prototype Perturbation for Relaxing Alignment Constraints in Backward-Compatible Learning
【Quick Read】: This paper targets the backfilling problem in updating retrieval models: recomputing gallery embeddings is time-consuming and computationally intensive, which motivates Backward-Compatible Learning (BCL), training a new model compatible with the old one. Strong alignment between new and old embeddings, however, compromises the new model's discriminative ability, especially when classes are closely clustered and hard to distinguish in the old feature space. The key idea is to relax the alignment constraints by perturbing the old feature prototypes and aligning the new feature space to a pseudo-old feature space defined by those perturbed prototypes, preserving the new model's discriminative ability during backward-compatible learning. Two approaches compute the perturbations: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP), both of which account for the feature distributions of the old and the new models as the new model updates. Extensive experiments on landmark and commodity datasets show the approaches perform favorably against state-of-the-art BCL algorithms.
Link: https://arxiv.org/abs/2503.14824
Authors: Zikun Zhou, Yushuai Sun, Wenjie Pei, Xin Li, Yaowei Wang
Affiliations: Pengcheng Laboratory, Shenzhen, China; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The traditional paradigm to update retrieval models requires re-computing the embeddings of the gallery data, a time-consuming and computationally intensive process known as backfilling. To circumvent backfilling, Backward-Compatible Learning (BCL) has been widely explored, which aims to train a new model compatible with the old one. Many previous works focus on effectively aligning the embeddings of the new model with those of the old one to enhance the backward-compatibility. Nevertheless, such strong alignment constraints would compromise the discriminative ability of the new model, particularly when different classes are closely clustered and hard to distinguish in the old feature space. To address this issue, we propose to relax the constraints by introducing perturbations to the old feature prototypes. This allows us to align the new feature space with a pseudo-old feature space defined by these perturbed prototypes, thereby preserving the discriminative ability of the new model in backward-compatible learning. We have developed two approaches for calculating the perturbations: Neighbor-Driven Prototype Perturbation (NDPP) and Optimization-Driven Prototype Perturbation (ODPP). Particularly, they take into account the feature distributions of not only the old but also the new models to obtain proper perturbations along with new model updating. Extensive experiments on the landmark and commodity datasets demonstrate that our approaches perform favorably against state-of-the-art BCL algorithms.
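A schematic of the idea, perturbing old prototypes away from their nearest neighbors and aligning new features to the resulting pseudo-old prototypes, is sketched below in PyTorch. The perturbation rule and loss are simplified stand-ins for NDPP/ODPP, and every name and hyperparameter is an assumption.

```python
import torch
import torch.nn.functional as F

def neighbor_driven_perturbation(old_protos, step=0.1):
    """Nudge each old class prototype away from its nearest neighboring prototype,
    so closely clustered classes stay separable after alignment."""
    dists = torch.cdist(old_protos, old_protos)
    dists.fill_diagonal_(float("inf"))
    nearest = old_protos[dists.argmin(dim=1)]
    direction = F.normalize(old_protos - nearest, dim=1)
    return old_protos + step * direction

def compat_loss(new_feats, labels, pseudo_old_protos, tau=0.05):
    """Align new features to the perturbed (pseudo-old) prototypes via a
    prototype-classification cross-entropy."""
    logits = F.normalize(new_feats, dim=1) @ F.normalize(pseudo_old_protos, dim=1).T
    return F.cross_entropy(logits / tau, labels)

old_protos = torch.randn(100, 256)        # one prototype per class (old model)
pseudo_old = neighbor_driven_perturbation(old_protos)
feats, labels = torch.randn(32, 256), torch.randint(0, 100, (32,))
print(compat_loss(feats, labels, pseudo_old))
```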
[CV-104] SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
【Quick Read】: This paper studies parametric 3D edge reconstruction from calibrated multi-view images. Previous methods first reconstruct a 3D edge point set from multi-view 2D edge images and then fit 3D edges to it, but noise in the point set can cause gaps between fitted edges, and the recovered edges may not align with the input images because fitting depends only on the reconstructed point set. The proposed SketchSplat reconstructs accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting: 3D edges are represented as sketches, parametric lines and curves defined by attributes such as control points, scales, and opacity. During reconstruction, Gaussian points are iteratively sampled from the sketches and rasterized onto the 2D edge images, so the image-error gradient can be back-propagated to optimize the sketch attributes; this bridges 2D edge images and 3D edges in a differentiable manner, ensuring the edges align well with the images and yielding accurate, complete results. A set of adaptive topological operations applied alongside sketch optimization reduces the number of sketches required while maintaining high accuracy, producing more compact reconstructions. The paper also contributes an accurate 2D edge detector that improves both this and existing methods, and experiments show state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
Link: https://arxiv.org/abs/2503.14786
Authors: Haiyang Ying, Matthias Zwicker
Affiliations: University of Maryland, College Park
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.
[CV-105] RAT: Boosting Misclassification Detection Ability without Extra Data
【Quick Read】: This paper addresses detecting and intervening on incorrect predictions of deep neural networks (DNNs) in high-stakes domains such as autonomous driving and healthcare. Viewing misclassification detection through the lens of adversarial perturbation, it proposes using the robust radius (a.k.a. input-space margin) as a confidence metric and designs two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. It further proposes a training method, Radius Aware Training (RAT), to boost the model's ability to identify its own mistakes. Extensive experiments show reductions of up to 29.3% in AURC and 21.62% in FPR@95TPR compared with previous methods, improving model safety and reliability.
Link: https://arxiv.org/abs/2503.14783
Authors: Ge Yan, Tsui-Wei Weng
Affiliations: UC San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:As deep neural networks(DNN) become increasingly prevalent, particularly in high-stakes areas such as autonomous driving and healthcare, the ability to detect incorrect predictions of models and intervene accordingly becomes crucial for safety. In this work, we investigate the detection of misclassified inputs for image classification models from the lens of adversarial perturbation: we propose to use robust radius (a.k.a. input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. Furthermore, we design a training method called Radius Aware Training (RAT) to boost models’ ability to identify mistakes. Extensive experiments show our method could achieve up to 29.3% reduction on AURC and 21.62% reduction in FPR@95TPR, compared with previous methods.
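To make the robust-radius idea concrete, here is a toy binary search on perturbation magnitude along a single random direction: the largest magnitude that leaves the prediction unchanged serves as a confidence score. RR-BS and RR-Fast are certainly more careful than this (a single random direction only probes one slice of the margin), so treat the sketch as illustrative only; all names are assumptions.

```python
import torch

def robust_radius_binary_search(model, x, eps_max=1.0, n_steps=12):
    """Estimate a (directional) robust radius of input x: the largest perturbation
    magnitude along one probe direction that keeps the prediction unchanged.
    A larger radius suggests higher confidence in the prediction."""
    y0 = model(x).argmax(dim=-1)
    direction = torch.sign(torch.randn_like(x))   # crude random probe direction
    lo, hi = 0.0, eps_max
    for _ in range(n_steps):                      # binary search on magnitude
        mid = (lo + hi) / 2
        flipped = model(x + mid * direction).argmax(dim=-1) != y0
        if flipped.item():
            hi = mid                              # label flips: shrink radius
        else:
            lo = mid                              # label stable: grow radius
    return lo

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 3))
x = torch.randn(1, 16)
print(f"estimated robust radius: {robust_radius_binary_search(model, x):.4f}")
```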
[CV-106] Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
【Quick Read】: This paper addresses white balance (WB) correction in scenes with multiple illuminants, a persistent challenge in computer vision. Recent fusion-based methods use a neural network to linearly blend several sRGB versions of the input image, each rendered with a predefined WB preset, but the paper shows these are suboptimal for common multi-illuminant scenarios; moreover, existing fusion-based methods rely on sRGB WB datasets that lack dedicated multi-illuminant images, limiting both training and evaluation. The two key contributions are: (1) an efficient transformer-based model that effectively captures spatial dependencies across the sRGB WB presets, substantially improving on linear fusion techniques; and (2) a large-scale multi-illuminant dataset of over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected ground truth. The method achieves up to 100% improvement over existing techniques on the new multi-illuminant image fusion dataset.
Link: https://arxiv.org/abs/2503.14774
Authors: David Serrano-Lozano, Aditya Arora, Luis Herranz, Konstantinos G. Derpanis, Michael S. Brown, Javier Vazquez-Corral
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset.
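The linear-fusion baseline that this paper improves on can be written in a few lines: blend K WB-preset renderings with per-pixel weights that, in the actual methods, a network predicts. The sketch below is that baseline under assumed shapes, not the proposed transformer model.

```python
import numpy as np

def fuse_wb_presets(presets, weights):
    """Blend K WB-preset renderings with per-pixel weights.

    presets: (K, H, W, 3) sRGB renderings under K predefined WB settings.
    weights: (K, H, W) non-negative blending weights (network-predicted
             in practice); they are normalized to sum to 1 per pixel.
    """
    w = weights / np.clip(weights.sum(axis=0, keepdims=True), 1e-8, None)
    return (w[..., None] * presets).sum(axis=0)

K, H, W = 5, 4, 6                      # five presets, tiny image for illustration
presets = np.random.rand(K, H, W, 3)
weights = np.random.rand(K, H, W)
out = fuse_wb_presets(presets, weights)
print(out.shape)                       # (4, 6, 3)
```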
[CV-107] Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos
【速读】:该论文旨在探讨运动生物力学领域的当前运动分析工具现状,从惯性测量单元(IMUs)和基于反光标记的光学动作捕捉(MoCap)等先进技术,到计算领域新兴的人体姿态估计和人体网格恢复方法。论文的核心目标是通过对比分析验证无标记动作捕捉技术在临床环境中的适用性,证明这些无标记技术在运动学分析结果上与现有更为复杂且便携性较差的IMUs和MoCap工具处于合理范围内。论文的关键解决方案在于利用基于人体姿态估计的无标记动作捕捉,不仅能够获得与IMU和传统MoCap相一致的结果,还显著减少了设置时间和对专业知识的需求。尽管所产生数据的质量仍有提升空间,但作者认为这一折衷方案在用于小规模临床测试的低速动作范围内是可以接受的误差范围之内。
链接: https://arxiv.org/abs/2503.14760
作者: Kai Armstrong,Alexander Rodrigues,Alexander P. Willmott,Lei Zhang,Xujiong Ye
机构: School of Computer Science (计算机学院), University of Lincoln (林肯大学); School of Psychology, Sport Science, and Wellbeing (心理学、运动科学与健康学院), University of Lincoln (林肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This work aims to discuss the current landscape of kinematic analysis tools, ranging from the state-of-the-art in sports biomechanics such as inertial measurement units (IMUs) and retroreflective marker-based optical motion capture (MoCap) to more novel approaches from the field of computing such as human pose estimation and human mesh recovery. Primarily, this comparative analysis aims to validate the use of marker-less MoCap techniques in a clinical setting by showing that these marker-less techniques are within a reasonable range for kinematics analysis compared to the more cumbersome and less portable state-of-the-art tools. Not only does marker-less motion capture using human pose estimation produce results in-line with the results of both the IMU and MoCap kinematics but also benefits from a reduced set-up time and reduced practical knowledge and expertise to set up. Overall, while there is still room for improvement when it comes to the quality of the data produced, we believe that this compromise is within the room of error that these low-speed actions that are used in small clinical tests.
zh
[CV-108] RETHINED: A New Benchmark and Baseline for Real-Time High-Resolution Image Inpainting On Edge Devices
【速读】:该论文旨在解决现有图像修复方法在高分辨率场景下表现不佳且需要强大硬件支持的问题,这限制了其在边缘设备上的部署。论文提出了一种名为RETHINED的实时超高清图像修复基线方法,能够在多种移动设备上以≤30ms的延迟完成超高清图像修复。该方案的关键在于结合一个轻量级卷积神经网络(CNN)用于恢复结构,以及一种与分辨率无关的补丁替换机制来补充细致纹理。通过同时利用CNN的结构建模能力与基于补丁方法的细节还原能力,该方法实现了高分辨率图像修复的核心性能提升,并展示了比现有最先进方法快100倍的效率。此外,论文还发布了首个自由形式掩码(free-form mask)超高清(UHD)修复数据集DF8K-Inpainting。
链接: https://arxiv.org/abs/2503.14757
作者: Marcelo Sanchez,Gil Triginer,Ignacio Sarasua,Lara Raad,Coloma Ballester
机构: Crisalix; NVIDIA (英伟达); IIE, FIng, UdelaR (乌拉圭共和国大学工程学院工业研究所); UPF (庞培法布拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Existing image inpainting methods have shown impressive completion results for low-resolution images. However, most of these algorithms fail at high resolutions and require powerful hardware, limiting their deployment on edge devices. Motivated by this, we propose the first baseline for REal-Time High-resolution image INpainting on Edge Devices (RETHINED) that is able to inpaint at ultra-high resolution and can run in real time (≤30 ms) on a wide variety of mobile devices. It is a simple yet effective novel method formed by a lightweight Convolutional Neural Network (CNN) to recover structure, followed by a resolution-agnostic patch replacement mechanism to provide detailed texture. Specifically, our pipeline leverages the structural capacity of CNNs and the high-level detail of patch-based methods, which is a key component for high-resolution image inpainting. To demonstrate the real application of our method, we conduct an extensive analysis on various mobile-friendly devices and demonstrate similar inpainting performance while being 100× faster than existing state-of-the-art methods. Furthermore, we release DF8K-Inpainting, the first free-form mask UHD inpainting dataset.
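“分辨率无关的补丁替换”可以粗略理解为:先由轻量CNN在低分辨率上恢复结构,再在原分辨率上用已知区域中与粗结构最相似的补丁填充缺失区域。下面是该机制的一个简化示意(暴力最近邻搜索;函数签名与输入约定均为本文假设,非RETHINED官方实现):

```python
import numpy as np

def patch_replace(image, mask, coarse, patch=8):
    """image: (H,W,3) 原图;mask: (H,W),1 表示缺失;
    coarse: 轻量 CNN 输出的粗结构图(假设输入)。
    对每个含缺失像素的 patch,在已知区域中找与粗结构最相似的 patch 替换。"""
    H, W, _ = image.shape
    out = image.copy()
    # 收集所有完全位于已知区域的候选 patch
    cands = [image[y:y+patch, x:x+patch]
             for y in range(0, H - patch + 1, patch)
             for x in range(0, W - patch + 1, patch)
             if not mask[y:y+patch, x:x+patch].any()]
    cands = np.stack(cands)  # (N, p, p, 3)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if mask[y:y+patch, x:x+patch].any():
                q = coarse[y:y+patch, x:x+patch]
                d = ((cands - q) ** 2).reshape(len(cands), -1).sum(axis=1)
                out[y:y+patch, x:x+patch] = cands[int(d.argmin())]
    return out
```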
zh
[CV-109] SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
【速读】:该论文旨在解决现有文本驱动的三维室内场景生成方法评估中存在的局限性问题,即当前评估指标主要关注生成场景的真实感与基准场景的对比,而往往忽视了生成场景与输入文本描述的一致性,而这恰恰是衡量方法是否满足用户需求的关键因素。为了解决这一问题,论文提出了一种名为SceneEval的评估框架,其关键是不仅包含针对输入文本中明确描述的对象及其属性等显式用户需求的评估指标,还涵盖了如对象间无碰撞等隐式预期的评估,从而实现对场景质量的全面评估。此外,为了支持评估,论文还构建了一个包含100个场景描述及其标注基准场景属性的数据集SceneEval-100。通过使用SceneEval对近期的场景生成方法进行评估,研究结果表明当前方法在满足用户需求方面存在不足,进一步强调了在此方向上开展更多研究的必要性。
链接: https://arxiv.org/abs/2503.14756
作者: Hou In Ivan Tam,Hou In Derek Pun,Austin T. Wang,Angel X. Chang,Manolis Savva
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 6 figures, 6 tables
点击查看摘要
Abstract:Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text - a critical factor in determining how effectively a method meets user requirements. We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-100, a dataset of scene descriptions with annotated ground-truth scene properties. We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle at generating scenes that meet user requirements, underscoring the need for further research in this direction.
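SceneEval同时考察显式需求(文本要求的物体是否出现)与隐式预期(物体之间无碰撞)。下面用轴对齐包围盒(AABB)示意这两类检查的最小实现;其中场景数据结构为本文假设,仅用于说明思路:

```python
from itertools import combinations

def has_required_objects(scene_objects, required):
    """显式检查:场景物体类别是否覆盖文本要求的类别。"""
    present = {o["category"] for o in scene_objects}
    return all(r in present for r in required)

def aabb_overlap(a, b):
    """判断两个 AABB(min/max 为三维坐标)是否相交。"""
    return all(a["min"][i] < b["max"][i] and b["min"][i] < a["max"][i]
               for i in range(3))

def collision_free(scene_objects):
    """隐式检查:任意两物体的包围盒不应重叠。"""
    return not any(aabb_overlap(x, y)
                   for x, y in combinations(scene_objects, 2))

scene = [
    {"category": "sofa",  "min": (0, 0, 0), "max": (2, 1, 1)},
    {"category": "table", "min": (3, 0, 0), "max": (4, 1, 1)},
]
print(has_required_objects(scene, ["sofa", "table"]), collision_free(scene))
```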
zh
[CV-110] Bayesian Modeling of Zero-Shot Classifications for Urban Flood Detection
【速读】:该论文旨在解决街景数据集中城市物体和事件(如街道洪水)检测中可靠标签缺乏的问题。主要挑战在于城市事件种类繁多且稀有事件的标注不足。为应对这一难题,论文提出了一种名为BayFlood的两阶段方法。关键在于:首先利用预训练的视觉语言模型(Vision-Language Model, VLM)进行零样本分类以识别事件发生的位置;其次在此基础上构建空间贝叶斯模型。这种方法避免了大规模人工标注的需求,并通过贝叶斯模型实现了不确定性量化、位置间平滑处理以及外部数据(如积水区域)的有效整合。
链接: https://arxiv.org/abs/2503.14754
作者: Matt Franchi,Nikhil Garg,Wendy Ju,Emma Pierson
机构: Cornell University (康奈尔大学); Jacobs Technion-Cornell Institute (雅各布技术学院-康奈尔大学), Cornell Tech (康奈尔科技学院) (纽约,美国); University of California - Berkeley (加州大学伯克利分校) (伯克利,加利福尼亚州,美国)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: In review
点击查看摘要
Abstract:Street scene datasets, collected from Street View or dashboard cameras, offer a promising means of detecting urban objects and incidents like street flooding. However, a major challenge in using these datasets is their lack of reliable labels: there are myriad types of incidents, many types occur rarely, and ground-truth measures of where incidents occur are lacking. Here, we propose BayFlood, a two-stage approach which circumvents this difficulty. First, we perform zero-shot classification of where incidents occur using a pretrained vision-language model (VLM). Second, we fit a spatial Bayesian model on the VLM classifications. The zero-shot approach avoids the need to annotate large training sets, and the Bayesian model provides frequent desiderata in urban settings - principled measures of uncertainty, smoothing across locations, and incorporation of external data like stormwater accumulation zones. We comprehensively validate this two-stage approach, showing that VLMs provide strong zero-shot signal for floods across multiple cities and time periods, the Bayesian model improves out-of-sample prediction relative to baseline methods, and our inferred flood risk correlates with known external predictors of risk. Having validated our approach, we show it can be used to improve urban flood detection: our analysis reveals 113,738 people who are at high risk of flooding overlooked by current methods, identifies demographic biases in existing methods, and suggests locations for new flood sensors. More broadly, our results showcase how Bayesian modeling of zero-shot LM annotations represents a promising paradigm because it avoids the need to collect large labeled datasets and leverages the power of foundation models while providing the expressiveness and uncertainty quantification of Bayesian models.
zh
[CV-111] LipShiFT: A Certifiably Robust Shift-based Vision Transformer ICLR2025
【速读】:本文旨在解决基于Transformer架构的模型在Lipschitz常数紧界推导方面的挑战,特别是在大规模输入和高维注意力模块导致训练过程受限且结果次优的情境下。研究重点在于视觉任务中Lipschitz约束方法的实际限制,并发现基于Lipschitz的边界训练可作为有效的正则化手段,通过限制模型连续层的权重实现。关键解决方案是针对ShiftViT模型的Lipschitz连续变体,在归一化约束输入设置下解决Transformer架构的显著训练难题,并利用l₂范数估计常见图像分类数据集上的模型Lipschitz常数上界,最终证明所提方法适用于更大规模模型,并提升了Transformer架构认证鲁棒性的当前技术水平。
链接: https://arxiv.org/abs/2503.14751
作者: Rohan Menon,Nicola Franco,Stephan Günnemann
机构: Technical Univ. of Munich (慕尼黑工业大学); Fraunhofer Institute for Cognitive Systems IKS (弗劳恩霍夫认知系统研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025 Workshop: VerifAI: AI Verification in the Wild
点击查看摘要
Abstract:Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during the training process and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper bound estimate for the Lipschitz constants of this model using the l_2 norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state-of-the-art in certified robustness for transformer-based architectures.
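作为背景:对于由线性层与1-Lipschitz激活(如ReLU)堆叠的网络,整体Lipschitz常数在 l_2 范数下的一个经典上界是各层权重谱范数之积。下面用numpy演示该上界的计算(仅为原理示意;论文针对ShiftViT结构的估计更为精细):

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    """l2 意义下的 Lipschitz 上界:各层权重最大奇异值(谱范数)的乘积。
    假设层间激活均为 1-Lipschitz(如 ReLU)。"""
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.svd(W, compute_uv=False)[0]
    return bound

layers = [np.random.randn(64, 32) * 0.1, np.random.randn(10, 64) * 0.1]
print(f"Lipschitz upper bound: {lipschitz_upper_bound(layers):.4f}")
```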
zh
[CV-112] HandSplat: Embedding-Driven Gaussian Splatting for High-Fidelity Hand Rendering
【速读】:该论文旨在解决现有基于3D高斯点 splatting (3D Gaussian Splatting, 3DGS) 的手部渲染方法中存在的几何细节丢失、时间不稳定性以及点分布效率低下等问题。这些问题源于其依赖刚性骨骼运动模型而对非刚性运动建模过于简化,仅基于逐点梯度进行稠密化处理且未能考虑空间和时间的相关性。为了解决上述挑战,论文提出了一种名为HandSplat的新框架,其关键在于通过引入隐式几何和外观嵌入来增强3DGS的标准属性,从而实现更精细的非刚性运动建模,同时保留原始3DGS所捕捉的手部静态特性;此外,还设计了一种局部梯度感知的稠密化策略,动态优化高变化区域内的高斯密度,并通过姿态条件下的属性正则化促进相似姿态间属性的一致性以提升时间稳定性。这些创新显著提高了手部渲染的保真度与稳定性,并实现了实时性能。
链接: https://arxiv.org/abs/2503.14736
作者: Yilan Dong,Haohe Liu,Qing Wang,Jiahao Yang,Wenqing Wang,Gregory Slabaugh,Shanxin Yuan
机构: Queen Mary University of London (伦敦玛丽女王大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing 3D Gaussian Splatting (3DGS) methods for hand rendering rely on rigid skeletal motion with an oversimplified non-rigid motion model, which fails to capture fine geometric and appearance details. Additionally, they perform densification based solely on per-point gradients and process poses independently, ignoring spatial and temporal correlations. These limitations lead to geometric detail loss, temporal instability, and inefficient point distribution. To address these issues, we propose HandSplat, a novel Gaussian Splatting-based framework that enhances both fidelity and stability for hand rendering. To improve fidelity, we extend standard 3DGS attributes with implicit geometry and appearance embeddings for finer non-rigid motion modeling while preserving the static hand characteristic modeled by original 3DGS attributes. Additionally, we introduce a local gradient-aware densification strategy that dynamically refines Gaussian density in high-variation regions. To improve stability, we incorporate pose-conditioned attribute regularization to encourage attribute consistency across similar poses, mitigating temporal artifacts. Extensive experiments on InterHand2.6M demonstrate that HandSplat surpasses existing methods in fidelity and stability while achieving real-time performance. We will release the code and pre-trained models upon acceptance.
zh
[CV-113] ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints
【速读】:该论文试图解决的问题是如何在仅使用固定集合的刚性形状(rigid shapes)的情况下,通过文本引导生成符合语义描述的图像,这一挑战类似于解决七巧板拼图或根据语义描述排列真实物体。论文将此问题形式化为基于形状的图像生成任务,这是一种新的文本引导的图像到图像转换任务,要求重新排列输入的刚性形状集合以形成非重叠配置,并视觉化传达目标概念。
解决方案的关键在于提出的方法ShapeShift。该方法通过在可微向量图形管道中显式参数化每个形状,迭代优化其位置和方向,利用预训练扩散模型的分数蒸馏采样进行指导。为保持布局清晰,引入了一种内容感知的碰撞解决机制,在发生重叠时施加最小且语义一致的调整,确保平滑收敛至物理上有效的配置。通过结合基于扩散的语义引导与显式的几何约束,该方法能够生成空间关系明确体现文本提示的可解释组合。
链接: https://arxiv.org/abs/2503.14720
作者: Vihaan Misra,Peter Schaldenbrand,Jean Oh
机构: Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.
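“内容感知的碰撞解决”核心在于重叠发生时施加最小位移使形状分离。下面以圆形近似刚性形状,演示一种最小调整式的迭代消解;这只是纯几何直觉示意,与论文的可微矢量图形管线和扩散引导无关:

```python
import numpy as np

def resolve_overlaps(centers, radii, iters=50):
    """centers: (N,2) 圆心;radii: (N,) 半径。
    每次迭代将重叠的圆对沿连线方向各推开一半穿透量。"""
    c = centers.astype(float).copy()
    for _ in range(iters):
        moved = False
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                d = c[j] - c[i]
                dist = np.linalg.norm(d) + 1e-9
                pen = radii[i] + radii[j] - dist  # 穿透深度
                if pen > 0:
                    shift = (pen / 2) * d / dist
                    c[i] -= shift
                    c[j] += shift
                    moved = True
        if not moved:  # 已无重叠,提前收敛
            break
    return c

centers = np.array([[0.0, 0.0], [0.5, 0.0], [0.2, 0.3]])
print(resolve_overlaps(centers, np.array([0.4, 0.4, 0.4])))
```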
zh
[CV-114] ViVa-SAFELAND: a New Freeware for Safe Validation of Vision-based Navigation in Aerial Vehicles
【速读】:该论文旨在解决无人机(UAV)视觉导航策略在复杂城市环境中测试与评估的问题,同时确保符合法律法规并保障人员安全。论文的关键在于提出ViVa-SAFELAND这一开源软件库,它通过包含高分辨率航拍视频的数据集以及虚拟化无人机(Emulated Aerial Vehicle, EAV)技术,模拟真实场景下的动态障碍物(如车辆和行人),实现对视觉导航算法的标准化评估。此外,该框架支持随机化变量以进行多次试验,并有助于生成用于训练任务的图像数据集,同时为人类或自主飞行员的深度学习训练提供支持。论文通过移动物体检测和风险评估分割两个案例研究,验证了该框架在验证视觉算法方面的有效性。其创新点在于首次提供了针对复杂城市环境部署的安全验证基准平台。
链接: https://arxiv.org/abs/2503.14719
作者: Miguel S. Soriano-García,Diego A. Mercado-Ravell
机构: Center for Research in Mathematics CIMAT (中心研究数学 CIMAT), campus Zacatecas; Centro de Investigación y de Estudios Avanzados del IPN, CINVESTAV-IPN unidad Guadalajara (国家理工学院高级研究中心, CINVESTAV-IPN瓜达拉哈拉分部)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: paper under review for publication
点击查看摘要
Abstract:ViVa-SAFELAND is an open source software library, aimed to test and evaluate vision-based navigation strategies for aerial vehicles, with special interest in autonomous landing, while complying with legal regulations and people's safety. It consists of a collection of high definition aerial videos, focusing on real unstructured urban scenarios, recording moving obstacles of interest, such as cars and people. Then, an Emulated Aerial Vehicle (EAV) with a virtual moving camera is implemented in order to "navigate" inside the video, according to high-order commands. ViVa-SAFELAND provides a new, safe, simple and fair comparison baseline to evaluate and compare different visual navigation solutions under the same conditions, and to randomize variables along several trials. It also facilitates the development of autonomous landing and navigation strategies, as well as the generation of image datasets for different training tasks. Moreover, it is useful for training either human or autonomous pilots using deep learning. The effectiveness of the framework for validating vision algorithms is demonstrated through two case studies, detection of moving objects and risk assessment segmentation. To our knowledge, this is the first safe validation framework of its kind to test and compare visual navigation solutions for aerial vehicles, which is a crucial aspect for urban deployment in complex real scenarios.
zh
[CV-115] Construction Site Scaffolding Completeness Detection Based on Mask R-CNN and Hough Transform
【速读】:该论文旨在解决建筑工地脚手架完整性检查中存在的效率低下和人工成本高的问题,特别是在确保交叉支撑等关键部件完整性和正确安装方面。传统方法依赖人工检查,容易因工人图方便而忽视安全规范,导致违规现象频发。论文提出的关键解决方案是基于深度学习的计算机视觉方法,通过训练卷积神经网络(CNN)模型,利用标注好的脚手架图像数据集实现交叉支撑部件完整性的自动检测。这种方法无需人工介入,能够显著节省时间和劳动力成本,同时提供了一种非侵入式的高效检测手段,从而提升施工现场的安全性。
链接: https://arxiv.org/abs/2503.14716
作者: Pei-Hsin Lin,Jacob J. Lin,Shang-Hsien Hsieh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The 30th EG-ICE: International Conference on Intelligent Computing in Engineering
点击查看摘要
Abstract:Construction site scaffolding is essential for many building projects, and ensuring its safety is crucial to prevent accidents. The safety inspector must check the scaffolding’s completeness and integrity, where most violations occur. The inspection process includes ensuring all the components are in the right place since workers often compromise safety for convenience and disassemble parts such as cross braces. This paper proposes a deep learning-based approach to detect the scaffolding and its cross braces using computer vision. A scaffold image dataset with annotated labels is used to train a convolutional neural network (CNN) model. With the proposed approach, we can automatically detect the completeness of cross braces from images taken at construction sites, without the need for manual inspection, saving a significant amount of time and labor costs. This non-invasive and efficient solution for detecting scaffolding completeness can help improve safety in construction sites.
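呼应标题中的Hough变换:交叉支撑在图像中通常表现为一对斜率相反的斜线段。下面用OpenCV的概率Hough变换示意“在(假设已分割出的)脚手架区域内检测斜线段,并以正、负倾角是否成对出现作为完整性简化判据”的流程;阈值均为示例值,非论文参数:

```python
import cv2
import numpy as np

def detect_cross_braces(region_bgr, angle_range=(20, 70)):
    """在脚手架区域图像中用概率 Hough 变换找斜线段;
    若同时存在正、负倾角线段,则认为交叉支撑完整(简化判据)。"""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                            minLineLength=40, maxLineGap=10)
    pos, neg = False, False
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            ang = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
            if angle_range[0] < ang < angle_range[1]:
                if (y2 - y1) * (x2 - x1) > 0:
                    pos = True
                else:
                    neg = True
    return pos and neg

img = cv2.imread("scaffold_region.jpg")  # 假设为已裁剪的脚手架区域
if img is not None:
    print("cross braces complete:", detect_cross_braces(img))
```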
zh
[CV-116] ARC-Calib: Autonomous Markerless Camera-to-Robot Calibration via Exploratory Robot Motions
【速读】:该论文旨在解决传统相机到机器人标定方法中存在的两大问题:一是基于标记的传统方法通常需要人工干预进行系统设置;二是现有的无标记自主标定方法依赖于预训练的机器人跟踪模型,这限制了其在边缘设备上的应用,并且需要针对新型机器人形态进行微调。为了解决这些问题,论文提出了一种名为ARC-Calib的基于模型的无标记相机到机器人标定框架。该框架的关键在于通过引入探索性机器人运动,在摄像机图像帧中生成易于追踪的基于轨迹的视觉模式,并利用观测运动中的共面性和共线性约束,通过几何优化框架迭代精化标定结果。这种方法无需额外的环境标记布置或数据收集与模型训练,从而使其能够广泛适应多种实际自主系统。
链接: https://arxiv.org/abs/2503.14701
作者: Podshara Chanrungmaneekul,Yiting Chen,Joshua T. Grace,Aaron M. Dollar,Kaiyu Hang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures
点击查看摘要
Abstract:Camera-to-robot (also known as eye-to-hand) calibration is a critical component of vision-based robot manipulation. Traditional marker-based methods often require human intervention for system setup. Furthermore, existing autonomous markerless calibration methods typically rely on pre-trained robot tracking models that impede their application on edge devices and require fine-tuning for novel robot embodiments. To address these limitations, this paper proposes a model-based markerless camera-to-robot calibration framework, ARC-Calib, that is fully autonomous and generalizable across diverse robots and scenarios without requiring extensive data collection or learning. First, exploratory robot motions are introduced to generate easily trackable trajectory-based visual patterns in the camera’s image frames. Then, a geometric optimization framework is proposed to exploit the coplanarity and collinearity constraints from the observed motions to iteratively refine the estimated calibration result. Our approach eliminates the need for extra effort in either environmental marker setup or data collection and model training, rendering it highly adaptable across a wide range of real-world autonomous systems. Extensive experiments are conducted in both simulation and the real world to validate its robustness and generalizability.
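共面性约束的基础工具是用SVD对观测到的轨迹点拟合平面并度量残差;ARC-Calib在此类几何量之上进一步迭代优化相机-机器人外参。下面给出平面拟合这一步的最小numpy实现:

```python
import numpy as np

def fit_plane(points):
    """points: (N,3) 轨迹三维点(如相机系下的末端执行器轨迹)。
    返回平面法向量、质心以及各点到平面的 RMS 距离(共面性残差)。"""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                      # 最小奇异值方向即平面法向
    dists = (points - centroid) @ normal
    rms = np.sqrt((dists ** 2).mean())
    return normal, centroid, rms

# 示例:近似位于 z=0 平面上的点
pts = np.random.rand(100, 3)
pts[:, 2] = 0.01 * np.random.randn(100)
normal, centroid, rms = fit_plane(pts)
print(normal, rms)
```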
zh
[CV-117] SplatVoxel: History-Aware Novel View Streaming without Temporal Training
【速读】:本文研究了从稀疏视角视频生成连续高质量且时间一致的新视角的问题,现有方法在时间一致性与视觉保真度方面存在不足,导致画面闪烁和不连贯。为解决这些问题,论文引入了历史感知机制,利用先前帧重建场景以提升质量和稳定性。关键在于提出了一种混合的点云体素前馈场景重建方法,结合高斯点云传播时间信息与分层体素网格进行时间融合,同时通过扩展二维跟踪模型至三维运动的运动图高效扭曲高斯基元,并以误差感知的方式整合新观测,且无需依赖多视角视频数据集训练即可实现推理阶段的历史感知应用。该方法在静态及流式场景重建中达到当前最优性能,显著减少了时间伪影和视觉伪影,同时实现了交互速率(15fps,延迟350ms)运行。
链接: https://arxiv.org/abs/2503.14698
作者: Yiming Wang,Lucy Chai,Xuan Luo,Michael Niemeyer,Manuel Lagunas,Stephen Lombardi,Siyu Tang,Tiancheng Sun
机构: ETH Zurich (苏黎世联邦理工学院); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: this https URL
zh
[CV-118] Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂视觉推理任务中因多步推理需求而导致性能受限的问题。为了解决这一局限性,论文提出了一种名为MF-SQ-LLaVA的新方法,其关键是通过端到端训练实现隐式的自问自答机制。具体而言,该方案通过对视觉问答数据集进行增强,引入包含子问题及其答案对的推理链,并利用多任务损失函数来同时鼓励生成和回答这些中间步骤以及最终答案的预测。实验结果表明,该方法在ScienceQA和VQAv2数据集上显著优于现有的最先进模型,包括基础版LLaVA和原始SQ-LLaVA,且消融研究验证了各组成部分的有效性,人类评估进一步确认了所提出方法提升的推理过程准确性与连贯性。
链接: https://arxiv.org/abs/2503.14674
作者: Liu Jing,Amirul Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have shown remarkable progress in various multimodal tasks, yet they often struggle with complex visual reasoning that requires multi-step inference. To address this limitation, we propose MF-SQ-LLaVA, a novel approach that enhances LVLMs by enabling implicit self-questioning through end-to-end training. Our method involves augmenting visual question answering datasets with reasoning chains consisting of sub-question and answer pairs, and training the LVLM with a multi-task loss that encourages the generation and answering of these intermediate steps, as well as the prediction of the final answer. We conduct extensive experiments on the ScienceQA and VQAv2 datasets, demonstrating that MF-SQ-LLaVA significantly outperforms existing state-of-the-art models, including the base LLaVA and the original SQ-LLaVA. Ablation studies further validate the contribution of each component of our approach, and human evaluation confirms the improved accuracy and coherence of the reasoning process enabled by our method.
zh
[CV-119] hese Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models
【速读】:该论文致力于解决辐射场(radiance fields)不确定性量化的问题,特别是在高维和复杂场景下,传统方法难以高效实现不确定性估计的挑战。这种不确定性量化对于下游任务如视点规划(view planning)和场景理解(scene understanding)至关重要,尤其是在需要安全性和鲁棒性的应用场景中。论文的关键创新在于利用渲染方程的高阶矩(higher-order moments),通过概率渲染过程实现了辐射场输出(包括颜色、深度和语义预测)的高效、可微分的不确定性计算。这种方法不仅优于现有辐射场不确定性估计技术,还提供了更直接、计算效率更高的不同iable公式化方案,无需依赖蒙特卡洛采样等数值近似方法。此外,论文展示了该方法在下游任务中的实用性,例如最佳下一步视点选择(next-best-view selection)和神经辐射场训练中的主动光线采样(active ray sampling)。实验结果表明,该方法在保持简洁性的同时达到了最先进的性能。
链接: https://arxiv.org/abs/2503.14665
作者: Parker Ewen,Hao Chen,Seth Isaacson,Joey Wilson,Katherine A. Skinner,Ram Vasudevan
机构: University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for Monte Carlo sampling. Beyond uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.
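其核心观察是:体渲染本身即对样本颜色的加权期望,因此二阶矩(方差)可沿光线一并累计,且整个过程可微。下面用numpy示意单条光线上颜色均值与方差的计算;权重即体渲染权重,此处为假设输入:

```python
import numpy as np

def ray_color_moments(weights, colors):
    """weights: (S,) 体渲染权重(假设已含透射率,和≤1);
    colors: (S,3) 各采样点颜色。
    返回渲染颜色 E[c] 与方差 Var[c] = E[c^2] - E[c]^2。"""
    w = weights[:, None]
    mean = (w * colors).sum(axis=0)
    second = (w * colors ** 2).sum(axis=0)
    var = np.maximum(second - mean ** 2, 0.0)  # 数值安全:方差非负
    return mean, var

w = np.array([0.1, 0.6, 0.2])
c = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.2, 0.2, 0.8]])
mean, var = ray_color_moments(w, c)
print(mean, var)
```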
zh
[CV-120] A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising
【速读】:该论文致力于解决扩散模型在图像去噪任务中难以同时实现高视觉质量和低失真的平衡问题。针对加性高斯噪声去除这一基础任务,论文首先提出一种直观的方法来利用预训练的扩散模型,并进一步引入了一种名为线性组合去噪扩散器(Linear Combination Diffusion Denoiser, LCDD)的解决方案。LCDD的关键在于统一两种互补的推理过程:一种发挥模型的生成潜力,另一种确保信号的忠实恢复。通过利用去噪样本的内在结构,LCDD实现了最先进的性能,并通过简单的标量超参数调整提供了可控且良好的权衡。
链接: https://arxiv.org/abs/2503.14654
作者: Jonas Dornbusch,Emanuel Pfarr,Florin-Alexandru Vasluianu,Frank Werner,Radu Timofte
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, 2 tables
点击查看摘要
Abstract:Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures - one that leverages the model’s generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.
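LCDD的权衡由单个标量超参数控制:将“生成式”分支与“信号保真”分支的去噪结果线性组合。其组合步骤可概括为如下示意(两个分支各自的采样与恢复过程从略,输入均为假设):

```python
import numpy as np

def lcdd_combine(x_generative, x_fidelity, lam: float):
    """lam ∈ [0,1]:越大越偏向生成分支(高感知质量),
    越小越偏向保真分支(低失真)。两分支输出均为假设输入。"""
    assert 0.0 <= lam <= 1.0
    return lam * x_generative + (1.0 - lam) * x_fidelity

x_gen = np.random.rand(64, 64, 3)  # 生成式推理分支的去噪结果
x_fid = np.random.rand(64, 64, 3)  # 信号保真分支的去噪结果
for lam in (0.0, 0.5, 1.0):        # 扫描权衡系数
    out = lcdd_combine(x_gen, x_fid, lam)
    print(lam, out.mean())
```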
zh
[CV-121] Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer
【速读】:本文旨在解决现有视觉解释方法无法揭示 Vision Transformer (ViT) 模型内部结构中隐藏的注意力流动的问题,即无法展示 ViT 模型如何形成最终的注意力区域以支持其决策过程。为了解决这一问题,论文提出了一种名为动态累积注意力图 (Dynamic Accumulated Attention Map, DAAM) 的新视觉解释方法。该方法的关键在于设计了一个新颖的分解模块,用于通过解锁每个 ViT 块自注意力模块生成的 [class] token 来构建和存储空间特征信息,并通过分解分类分数获得通道重要性系数(对于有监督的 ViT 模型)。对于无监督的 ViT 模型,由于缺乏分类分数,提出了维度感知的重要性权重来计算通道重要性系数。这些空间特征与对应的通道重要性系数线性组合生成每个块的注意力图,通过逐块累积形成动态注意力流。论文的核心贡献在于通过引入新的分解模块和维度感知的重要性权重,可视化 ViT 模型中任意中间块的决策注意力演化动态。定量和定性分析验证了 DAAM 在解释具有全连接层分类器的 ViT 模型以及自监督 ViT 模型方面的有效性与优越能力。代码已开源。
链接: https://arxiv.org/abs/2503.14640
作者: Yi Liao,Yongsheng Gao,Weichuan Zhang
机构: Griffith University (格里菲斯大学); Shaanxi University of Science and Technology (陕西科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods can not display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at this https URL.
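DAAM的累积过程可概括为:每个块的注意力图等于该块空间特征按通道重要性系数加权求和,再沿块逐层累加形成注意力流。下面是该累积步骤的形状级numpy示意;特征与系数的具体求取见论文的分解模块,此处以随机张量代替:

```python
import numpy as np

def daam_accumulate(block_features, channel_weights):
    """block_features: list of (C,H,W) 每个 ViT 块解出的空间特征;
    channel_weights: list of (C,) 对应的通道重要性系数(假设已得到)。
    返回逐块累积、归一化后的注意力图列表。"""
    accumulated, acc = [], None
    for feat, w in zip(block_features, channel_weights):
        attn = np.einsum('chw,c->hw', feat, w)        # 通道加权求和
        attn = np.maximum(attn, 0)                     # 只保留正贡献
        acc = attn if acc is None else acc + attn      # 逐块累积
        accumulated.append(acc / (acc.max() + 1e-8))   # 归一化便于可视化
    return accumulated

feats = [np.random.rand(8, 14, 14) for _ in range(12)]
ws = [np.random.rand(8) for _ in range(12)]
maps = daam_accumulate(feats, ws)
print(len(maps), maps[0].shape)  # 12 (14, 14)
```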
zh
[CV-122] Reinforcement learning-based motion imitation for physiologically plausible musculoskeletal motor control
【速读】:该论文旨在解决如何通过模型驱动的方式理解并控制生理上精确的人体运动,特别是针对肌肉驱动的运动控制问题。传统基于强化学习的方法在简化人形模型上的表现虽已取得显著成果,但对真实人体生理性模型的控制仍面临挑战。论文的关键在于提出了一种无模型的运动模仿框架(KINESIS),其核心解决方案是结合具有80个肌肉驱动器和20个自由度的下肢骨骼肌肉模型,实现了对1.9小时运动捕捉数据的高质量模仿,并通过预训练的文本到运动生成模型实现了自然语言控制,同时可进一步微调以完成高级任务如目标导向运动。此外,KINESIS生成的肌肉活动模式与人类表面肌电图(EMG)活动高度相关,体现了其生理合理性,这使其成为研究人类运动控制理论难题(如Bernstein的冗余性问题)的一个有前景的工具。
链接: https://arxiv.org/abs/2503.14637
作者: Merkourios Simos,Alberto Silvio Chiappa,Alexander Mathis
机构: EPFL (瑞士联邦理工学院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
点击查看摘要
Abstract:How do humans move? The quest to understand human motion has broad applications in numerous fields, ranging from computer animation and motion synthesis to neuroscience, human prosthetics and rehabilitation. Although advances in reinforcement learning (RL) have produced impressive results in capturing human motion using simplified humanoids, controlling physiologically accurate models of the body remains an open challenge. In this work, we present a model-free motion imitation framework (KINESIS) to advance the understanding of muscle-based motor control. Using a musculoskeletal model of the lower body with 80 muscle actuators and 20 DoF, we demonstrate that KINESIS achieves strong imitation performance on 1.9 hours of motion capture data, is controllable by natural language through pre-trained text-to-motion generative models, and can be fine-tuned to carry out high-level tasks such as target goal reaching. Importantly, KINESIS generates muscle activity patterns that correlate well with human EMG activity. The physiological plausibility makes KINESIS a promising model for tackling challenging problems in human motor control theory, which we highlight by investigating Bernstein’s redundancy problem in the context of locomotion. Code, videos and benchmarks will be available at this https URL.
zh
[CV-123] Can Large Vision Language Models Read Maps Like a Human?
【速读】:该论文旨在解决基于像素地图的户外导航任务中,大型视觉语言模型(Large Vision-Language Models, LVLMs)在生成可读性高且准确的语言导航指令方面的能力不足问题。论文的关键创新在于提出了MapBench数据集,该数据集包含来自100张多样化地图的超过1600个基于像素的地图路径查找问题,并引入了Map空间场景图(Map Space Scene Graph, MSSG)作为索引结构,用于自然语言与LVLMs生成结果之间的转换和评估。通过设计复杂的路径规划场景,MapBench不仅挑战了现有最先进的LVLMs在零样本提示和链式思维(Chain-of-Thought, CoT)增强推理框架下的性能,还揭示了这些模型在空间推理和结构化决策能力上的显著局限性。
链接: https://arxiv.org/abs/2503.14607
作者: Shuo Xing,Zezhou Sun,Shuangyu Xie,Kaiyuan Chen,Yanjia Huang,Yuping Wang,Jiachen Li,Dezhen Song,Zhengzhong Tu
机构: Texas A&M University (德州农工大学); UC Berkeley (加州大学伯克利分校); MBZUAI; University of Michigan (密歇根大学); UC Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages
点击查看摘要
Abstract:In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based map-based outdoor navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides the Map Space Scene Graph (MSSG) as an indexing data structure for converting between natural language and map space and for evaluating LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT)-augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and dataset at this https URL.
zh
[CV-124] Effortless Active Labeling for Long-Term Test-Time Adaptation CVPR
【速读】:本文旨在解决长时测试时适应(Long-term Test-Time Adaptation, TTA)中的误差积累问题,同时降低主动标注的负担。传统方法通过在每一批次中标注少量样本来缓解此问题,但随着批次数量增加,标注成本迅速上升。为实现轻松的主动标注,本文提出每一批次中至多选择一个样本进行标注。关键在于:首先从单步优化的角度出发,在TTA背景下标注每个批次中最有价值的样本,即位于源域和目标域数据分布边界处、模型在一次迭代中最可能学会的样本;其次,引入基于特征扰动的高效策略来识别这些样本。此外,本文发现标注与未标注样本产生的梯度幅值存在显著差异,因此提出利用两个动态权重平衡其对模型优化的影响。实验结果表明,该方法在ImageNet-C、-R、-K、-A及PACS数据库上的性能优于现有最先进方法,且标注成本显著降低。
链接: https://arxiv.org/abs/2503.14564
作者: Guowei Wang,Changxing Ding
机构: South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025. Code: this https URL
点击查看摘要
Abstract:Long-term test-time adaptation (TTA) is a challenging task due to error accumulation. Recent approaches tackle this issue by actively labeling a small proportion of samples in each batch, yet the annotation burden quickly grows as the batch number increases. In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. First, we annotate the most valuable sample in each batch based on the single-step optimization perspective in the TTA context. In this scenario, the samples that border between the source- and target-domain data distributions are considered the most feasible for the model to learn in one iteration. Then, we introduce an efficient strategy to identify these samples using feature perturbation. Second, we discover that the gradient magnitudes produced by the annotated and unannotated samples have significant variations. Therefore, we propose balancing their impact on model optimization using two dynamic weights. Extensive experiments on the popular ImageNet-C, -R, -K, -A and PACS databases demonstrate that our approach consistently outperforms state-of-the-art methods with significantly lower annotation costs.
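“基于特征扰动识别边界样本”的直觉是:对特征施加小扰动后预测分布变化最大的样本最靠近决策边界,也最值得标注。下面给出与该思路一致的简化选择器(PyTorch;模型结构、扰动方式与KL判据均为本文假设,非论文原实现):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_sample_to_label(model, feats, sigma=0.01):
    """feats: (B, D) 一个批次样本的特征。
    对特征加高斯扰动,选取预测分布变化(KL 散度)最大的样本;
    这里假设 model 的分类头可直接作用于特征。"""
    probs = F.softmax(model(feats), dim=-1)
    noisy_probs = F.softmax(model(feats + sigma * torch.randn_like(feats)),
                            dim=-1)
    kl = (probs * (probs.clamp_min(1e-8).log()
                   - noisy_probs.clamp_min(1e-8).log())).sum(-1)
    return int(kl.argmax())  # 每批次只标注这一个样本

head = torch.nn.Linear(128, 10)  # 充当分类头的示例模型
feats = torch.randn(32, 128)
print(select_sample_to_label(head, feats))
```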
zh
[CV-125] SuperPC: A Single Diffusion Model for Point Cloud Completion Upsampling Denoising and Colorization
【速读】:该论文旨在解决点云(Point Cloud, PC)处理任务中诸如补全、上采样、去噪和着色等子任务通常被独立处理的问题。现有方法多采用针对单一缺陷设计的独立模型,未能有效应对点云中常见多种缺陷(如不完整、低分辨率、噪声和缺乏颜色)共存且相互影响的情况,导致误差累积及计算成本增加。论文的关键解决方案是提出SuperPC,这是一种统一的扩散模型(Unified Diffusion Model),能够同时处理上述四种任务。其核心在于采用三重条件扩散框架(Three-Level-Conditioned Diffusion Framework),并通过新颖的空间混合融合策略(Spatial-Mix-Fusion Strategy),充分利用这些缺陷之间的相关性,实现高效且协同的任务处理。实验表明,SuperPC在所有四个单独任务上均优于现有的专用模型及其组合。
链接: https://arxiv.org/abs/2503.14558
作者: Yi Du,Zhipeng Zhao,Shaoshu Su,Sharath Golluri,Haoze Zheng,Runmao Yao,Chen Wang
机构: Spatial AI & Robotics (SAIR) Lab, University at Buffalo
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Point cloud (PC) processing tasks-such as completion, upsampling, denoising, and colorization-are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.
zh
[CV-126] Synchronous vs Asynchronous Reinforcement Learning in a Real World Robot
【速读】:该论文旨在解决物理机器人在强化学习(Reinforcement Learning, RL)过程中因响应时间增加而导致性能下降的问题。传统RL算法在物理环境中需要等待智能体完成决策与梯度更新,而这些计算密集型任务通常按顺序执行,显著延长了响应时间,尤其在快速变化的环境中可能严重影响学习效果。论文的关键解决方案是引入异步RL方法,通过分离决策与梯度更新的计算,从而减少响应时间。实验结果表明,异步RL方法不仅使智能体学习速度更快,累积回报更高,还证明了具有更短响应时间的学习智能体在性能上优于响应时间较长但梯度更新次数更多的智能体。
链接: https://arxiv.org/abs/2503.14554
作者: Ali Parsaee,Fahim Shahriar,Chuxin He,Ruiqing Tan
机构: University of Alberta (艾伯塔大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Presented at Alberta Robotics Intelligent Systems Expo (RISE) Conference
点击查看摘要
Abstract:In recent times, reinforcement learning (RL) with physical robots has attracted the attention of a wide range of researchers. However, state-of-the-art RL algorithms do not consider that physical environments do not wait for the RL agent to make decisions or updates. RL agents learn by periodically conducting computationally expensive gradient updates. When decision-making and gradient update tasks are carried out sequentially by the RL agent in a physical robot, it significantly increases the agent’s response time. In a rapidly changing environment, this increased response time may be detrimental to the performance of the learning agent. Asynchronous RL methods, which separate the computation of decision-making and gradient updates, are a potential solution to this problem. However, only a few comparisons between asynchronous and synchronous RL have been made with physical robots. For this reason, the exact performance benefits of using asynchronous RL methods over synchronous RL methods are still unclear. In this study, we provide a performance comparison between asynchronous and synchronous RL using a physical robotic arm called Franka Emika Panda. Our experiments show that the agents learn faster and attain significantly more returns using asynchronous RL. Our experiments also demonstrate that the learning agent with a faster response time performs better than the agent with a slower response time, even if the agent with a slower response time performs a higher number of gradient updates.
zh
[CV-127] Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在计算机视觉任务中因数据异质性(data heterogeneity)建模不充分而导致的性能评估偏差问题。现有研究主要通过标签分布偏斜(label distribution skew)来模拟客户端之间的数据异质性,但这种方法未能全面捕捉计算机视觉任务(如分类以外的任务)中的真实数据分布差异。论文的关键解决方案在于引入了一种基于嵌入(embedding-based)的新颖数据异质性度量方法,通过利用预训练深度神经网络提取任务特定的数据嵌入,并使用Dirichlet分布对聚类后的数据点进行分配。这种方法重新定义了数据异质性的概念,并通过广泛的实验验证了不同FL方法在新定义下的性能表现,同时提出了新的基准性能指标,揭示了一系列可进一步探索的研究方向。
链接: https://arxiv.org/abs/2503.14553
作者: Kasra Borazjani,Payam Abdisarabshali,Naji Khosravan,Seyyedali Hosseinalipour
机构: Department of Electrical Engineering, University at Buffalo–SUNY, NY, USA (纽约州立大学布法罗分校电气工程系); Zillow Group, Seattle, WA, USA (西雅图,美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 9 figures, 1 table, (implementations are included at our GitHub repository: this https URL )
点击查看摘要
Abstract:Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature predominantly models data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.
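其划分流程为:先用预训练网络提取任务特定嵌入并聚类,再按Dirichlet分布将各簇样本分配给客户端。下面用sklearn与numpy给出该流程的最小示意(嵌入以随机向量代替真实深度特征,超参数为示例值):

```python
import numpy as np
from sklearn.cluster import KMeans

def dirichlet_partition(embeddings, n_clients=5, n_clusters=10,
                        alpha=0.5, seed=0):
    """按嵌入聚类后,每个簇内样本依 Dirichlet(alpha) 比例分给各客户端;
    alpha 越小,客户端间的(嵌入层面)异质性越强。"""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    client_indices = [[] for _ in range(n_clients)]
    for k in range(n_clusters):
        idx = np.where(labels == k)[0]
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_indices[cid].extend(part.tolist())
    return client_indices

emb = np.random.randn(1000, 64)  # 实际中应为任务特定的深度特征
parts = dirichlet_partition(emb)
print([len(p) for p in parts])
```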
zh
[CV-128] Fire and Smoke Datasets in 20 Years: An In-depth Review
【速读】:该论文旨在解决火灾和烟雾管理领域中数据集缺乏系统性评估的问题,并探索现有数据集在训练和测试先进深度学习模型方面的适用性和局限性。论文的关键在于通过全面分析过去20年收集的火灾和烟雾数据集的特性(如类型、规模、格式、采集方法及地理多样性),以及其成像模态(RGB、热成像、红外)和在分类、分割、检测等任务中的适用性,总结各数据集的优势与不足。此外,论文通过在多个主流算法(如ResNet-50、DeepLab-V3和YoloV8)上进行广泛的实验分析,进一步验证这些数据集在推动火灾管理研究和技术发展方面的潜力。
链接: https://arxiv.org/abs/2503.14552
作者: Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Fatemeh Afghah,Connor Peter McGrath,Danish Bhatkar,Mithilesh Anil Biradar,Abolfazl Razi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially revolutionized the momentum for developing efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still unexplored. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversities. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.
zh
[CV-129] Matching Skeleton-based Activity Representations with Heterogeneous Signals for HAR
链接: https://arxiv.org/abs/2503.14547
作者: Shuheng Li,Jiayun Zhang,Xiaohan Fu,Xiyuan Zhang,Jingbo Shang,Rajesh K. Gupta
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper is accepted by SenSys 2025
[CV-130] Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey
链接: https://arxiv.org/abs/2503.14537
作者: Liewen Liao,Weihao Yan,Ming Yang,Songan Zhang
机构: Global Institute of Future Technology, Shanghai Jiao Tong University (上海交通大学未来技术学院); Department of Automation, Shanghai Jiao Tong University (上海交通大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
[CV-131] Interpretable Unsupervised Joint Denoising and Enhancement for Real-World low-light Scenarios
【速读】:该论文致力于解决真实世界低光图像中存在的复杂退化问题,如局部过曝、亮度不足、噪声以及不均匀照明。传统有监督方法容易过拟合特定场景,而无监督方法虽具有更好的泛化能力,但由于缺乏参考图像难以建模这些退化。为了解决这些问题,论文提出了一种可解释的、零参考联合去噪与低光增强框架,专门针对真实场景设计。关键在于通过基于物理成像原理和Retinex理论的配对子图像训练策略,结合在sRGB空间中利用离散余弦变换(DCT)进行频域分解,并引入隐式引导的混合表示策略以有效分离复杂的复合退化。此外,在主干网络设计中,开发了由隐式退化表示机制引导的Retinex分解网络。实验结果证明了该方法的有效性。
链接: https://arxiv.org/abs/2503.14535
作者: Huaqiu Li,Xiaowan Hu,Haoqian Wang
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Real-world low-light images often suffer from complex degradations such as local overexposure, low brightness, noise, and uneven illumination. Supervised methods tend to overfit to specific scenarios, while unsupervised methods, though better at generalization, struggle to model these degradations due to the lack of reference images. To address this issue, we propose an interpretable, zero-reference joint denoising and low-light enhancement framework tailored for real-world scenarios. Our method derives a training strategy based on paired sub-images with varying illumination and noise levels, grounded in physical imaging principles and Retinex theory. Additionally, we leverage the Discrete Cosine Transform (DCT) to perform frequency domain decomposition in the sRGB space, and introduce an implicit-guided hybrid representation strategy that effectively separates intricate compounded degradations. In the backbone network design, we develop a Retinex decomposition network guided by implicit degradation representation mechanisms. Extensive experiments demonstrate the superiority of our method. Code will be available at this https URL.
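文中提到在sRGB空间用离散余弦变换(DCT)做频域分解。下面用scipy示意将图像分解为低频(大尺度光照结构)与高频(噪声与细节)两部分的基本做法;截止比例为示例值,并非论文所用参数:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_decompose(img_gray, keep_ratio=0.1):
    """img_gray: (H,W) 单通道图像。保留左上角 keep_ratio 比例的
    DCT 系数作为低频分量,其余归入高频分量。"""
    coef = dctn(img_gray, norm='ortho')
    H, W = coef.shape
    mask = np.zeros_like(coef)
    mask[:int(H * keep_ratio), :int(W * keep_ratio)] = 1.0
    low = idctn(coef * mask, norm='ortho')
    high = img_gray - low
    return low, high

img = np.random.rand(128, 128)
low, high = dct_decompose(img)
print(np.allclose(low + high, img))  # True:两分量按构造精确互补
```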
zh
[CV-132] SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders
链接: https://arxiv.org/abs/2503.14530
作者: Qing Li,Jiahui Geng,Derui Zhu,Fengyu Cai,Chenyang Lyu,Fakhri Karray
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Technical University of Munich (慕尼黑工业大学); Technical University of Darmstadt (达姆施塔特工业大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
[CV-133] ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作领域中由于真实世界数据收集成本高昂导致的数据扩展受限及其泛化能力不足的问题。论文的关键在于提出了一种名为ReBot的新方法,这是一种从真实到仿真再到真实的(real-to-sim-to-real)框架,用于扩展真实机器人数据集并适应目标域。ReBot通过在仿真环境中重放真实世界机器人的轨迹来多样化操作对象(真实到仿真),并将这些模拟动作与真实背景合成的图像结合,生成物理上逼真且时间一致的机器人视频(仿真到真实)。这种方法的优势在于充分利用了真实数据以缩小仿真与现实之间的差距,利用仿真环境的可扩展性,并实现了预训练VLA模型向目标域的全自动迁移。实验结果表明,ReBot显著提升了VLA模型的性能和鲁棒性。
链接: https://arxiv.org/abs/2503.14526
作者: Yu Fang,Yue Yang,Xinghao Zhu,Kaiyuan Zheng,Gedas Bertasius,Daniel Szafir,Mingyu Ding
机构: Department of Computer Science, University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校计算机科学系); Robotics and AI Institute (机器人与人工智能研究所); Department of Electrical and Computer Engineering, University of Washington (华盛顿大学电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Website: this https URL
点击查看摘要
Abstract:Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: this https URL
zh
[CV-134] Salient Temporal Encoding for Dynamic Scene Graph Generation
【速读】:该论文旨在解决使用结构化时空场景图表示动态场景这一新颖且具有挑战性的任务。论文的关键在于不仅要学习物体之间的空间关系,还要学习它们在时间上的交互作用。由于当前基准数据集中缺乏显式标注的时间关系,大多数现有的时空场景图生成方法会在跨帧的所有物体之间构建密集且抽象的时间连接。然而,并非所有的时间连接都能编码有意义的时间动态。为了解决这一问题,论文提出了一种新的时空场景图生成方法,该方法仅选择性地在与时间相关的目标对之间建立时间连接,并将时间关系作为场景图中的显式边进行表示。这种稀疏且显式的时间表示使得在场景图检测(Scene Graph Detection)任务上相比强大的基线模型提升了高达4.4%的性能。此外,研究还表明,该方法可以提升下游视觉任务的表现,在动作识别任务中相对于最先进的方法提高了0.6%的mAP。
链接: https://arxiv.org/abs/2503.14524
作者: Zhihao Zhu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections encode meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporally relevant object pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to 4.4% in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. In particular, applying our approach to action recognition yields a 0.6% gain in mAP compared to the state of the art.
zh
[CV-135] Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control ICLR’25
【速读】:该论文旨在解决基于语音驱动的3D说话人脸生成中,现有方法仅使用离散情感标签进行全局表情控制,难以实现时空域内灵活的细粒度面部控制的问题。论文的关键创新在于提出了一种基于扩散-Transformer的3D说话人脸生成模型Cafe-Talk,它同时集成了粗粒度和细粒度的多模态控制条件。为了解决多条件纠缠带来的挑战,论文设计了一个两阶段训练管道:首先仅使用语音音频和粗粒度条件进行初始训练;然后通过引入细粒度控制适配器逐步加入由动作单元(Action Units, AUs)表示的细粒度指令,避免不利的语音-唇同步效果。此外,论文设计了一种交换标签训练机制以增强细粒度条件的主导性,并提出了基于掩码的CFG技术来调节细粒度控制的发生与强度。同时,结合文本-AU对齐的文本检测器实现了自然语言用户输入,进一步支持多模态控制。实验结果表明,Cafe-Talk在唇同步和表达能力方面达到当前最佳性能,并在用户研究中获得广泛认可。
链接: https://arxiv.org/abs/2503.14517
作者: Hejia Chen,Haoxian Zhang,Shoulong Zhang,Xiaoqiang Liu,Sisi Zhuang,Yuan Zhang,Pengfei Wan,Di Zhang,Shuai Li
机构: State Key Laboratory of Virtual Reality Systems and Technology, Beihang University (北航); Kuaishou Technology (快手科技); Zhongguancun Laboratory (中关村实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICLR’25
点击查看摘要
Abstract:Speech-driven 3D talking face method should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: this https URL
zh
[CV-136] Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition
【速读】:该论文旨在解决情感识别领域中因缺乏多样化且具有泛化能力的数据集而导致的挑战。在基于身体运动的情感识别任务中,由于年龄、性别、种族、性格和疾病等多种因素的影响,身体运动数据表现出较大的个体差异性,使得构建适用于情感识别的通用数据集变得困难。论文的关键解决方案在于提出了一种新颖的神经气体网络(Neural Gas Network, NGN)算法的应用,用于合成身体运动数据,并优化数据的多样性和生成速度。NGN 算法通过学习骨骼结构拓扑,将神经元或气体粒子适配到身体关节上,进而生成新的身体姿势,并通过帧间连接形成完整的合成身体运动序列。与传统的生成对抗网络(GANs)、变分自编码器(VAEs)及基准算法相比,该方法不仅提高了生成数据的真实感和情感表达的多样性,还在生成效率上展现出优势。此外,通过引入 Fréchet Inception Distance (FID) 和分类指标(如准确率、精确率、召回率等),验证了 NGN 方法在提升模型性能方面的能力,特别是在面对未见数据时的表现优于现有方法。
链接: https://arxiv.org/abs/2503.14513
作者: Seyed Muhammad Hossein Mousavi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 18 pages
点击查看摘要
Abstract:In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person’s emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, employing Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs), offers potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data and optimizing diversity and generation speed. By learning skeletal structure topology, the NGN fits the neurons or gas particles on body joints. Generated gas particles, which form the skeletal structure later on, will be used to synthesize the new body posture. By attaching body postures over frames, the final synthetic body motion appears. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data and does so with more synthesizing speed than existing methods.
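神经气体(Neural Gas)的核心更新规则是:对每个输入样本,按“与样本距离的排名”以指数衰减的步长更新所有单元。下面给出把气体粒子拟合到一组关节坐标的经典NG实现(相对论文在骨架拓扑上的用法做了简化,超参数为常见默认值):

```python
import numpy as np

def neural_gas_fit(data, n_units=20, n_iters=2000,
                   eps=(0.5, 0.01), lam=(10.0, 0.5), seed=0):
    """data: (N,D) 训练点(如身体关节坐标集合)。
    每步随机取一个样本 x,按距离排名 k 更新单元:
    w_k += eps * exp(-k/lam) * (x - w_k);eps 与 lam 随时间指数衰减。"""
    rng = np.random.default_rng(seed)
    units = data[rng.choice(len(data), n_units, replace=False)].copy()
    for t in range(n_iters):
        frac = t / n_iters
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac
        x = data[rng.integers(len(data))]
        # 双重 argsort 得到每个单元相对样本的距离排名
        ranks = np.argsort(np.argsort(np.linalg.norm(units - x, axis=1)))
        units += (eps_t * np.exp(-ranks / lam_t))[:, None] * (x - units)
    return units

joints = np.random.rand(500, 3)  # 以随机点代替真实关节数据
print(neural_gas_fit(joints).shape)  # (20, 3)
```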
zh
[CV-137] Federated Continual 3D Segmentation With Single-round Communication
【速读】:该论文旨在解决动态联邦学习环境中传统联邦通信策略的局限性问题。在实际应用中,随着新客户端加入或任务需求变化导致标签集扩展,传统的每轮通信进行模型聚合的方法会导致计算和通信开销线性增加,并且需要严格的同步通信,这在分布式环境下难以实现。为了解决这些问题,论文提出了一种基于多模型蒸馏的一次性服务器端模型聚合的联邦持续学习策略。该方法的关键在于通过构建和更新全局模型,避免频繁的服务器通信,并在集成新数据流或接纳新客户端时,高效复用先前客户端的模型,无需在整个联邦范围内重新训练全局模型。这种方法显著降低了通信负载,放松了客户端间的同步要求,提供了一个高效且可扩展的联邦分析框架,适用于真实世界的应用场景。
链接: https://arxiv.org/abs/2503.15414
作者: Can Peng,Qianhui Men,Pramit Saha,Qianye Yang,Cheng Ouyang,J. Alison Noble
机构: University of Oxford (牛津大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Federated learning seeks to foster collaboration among distributed clients while preserving the privacy of their local data. Traditionally, federated learning methods assume a fixed setting in which client data and learning objectives remain constant. However, in real-world scenarios, new clients may join, and existing clients may expand the segmentation label set as task requirements evolve. In such a dynamic federated analysis setup, the conventional federated communication strategy of model aggregation per communication round is suboptimal. As new clients join, this strategy requires retraining, linearly increasing communication and computation overhead. It also imposes requirements for synchronized communication, which is difficult to achieve among distributed clients. In this paper, we propose a federated continual learning strategy that employs a one-time model aggregation at the server through multi-model distillation. This approach builds and updates the global model while eliminating the need for frequent server communication. When integrating new data streams or onboarding new clients, this approach efficiently reuses previous client models, avoiding the need to retrain the global model across the entire federation. By minimizing communication load and bypassing the need to put unchanged clients online, our approach relaxes synchronization requirements among clients, providing an efficient and scalable federated analysis framework suited for real-world applications. Using multi-class 3D abdominal CT segmentation as an application task, we demonstrate the effectiveness of the proposed approach.
zh
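摘要中"一次性多模型蒸馏"的服务器端聚合思路,可用如下草图说明(笔者假设的示意,非论文实现;其中无标签代理数据集 proxy_loader 以及各客户端模型共享输出空间均为假设条件):

```python
import torch
import torch.nn.functional as F

def one_shot_distill(client_models, global_model, proxy_loader,
                     epochs=1, lr=1e-3, T=2.0):
    """Sketch of one-time server-side aggregation via multi-model
    distillation: the global model matches the ensemble's averaged
    soft predictions on an unlabeled proxy set (assumed available)."""
    opt = torch.optim.Adam(global_model.parameters(), lr=lr)
    for m in client_models:
        m.eval()
    for _ in range(epochs):
        for x in proxy_loader:
            with torch.no_grad():
                # average teacher logits over all client models
                teacher = torch.stack([m(x) for m in client_models]).mean(0)
            loss = F.kl_div(F.log_softmax(global_model(x) / T, dim=1),
                            F.softmax(teacher / T, dim=1),
                            reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()
    return global_model
```

由于只在服务器端对已有客户端模型做一次蒸馏,新客户端加入时仅需把其模型加入 client_models 重新蒸馏,无需多轮同步通信,这正是摘要所述策略的要点。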
[CV-138] FedSCA: Federated Tuning with Similarity-guided Collaborative Aggregation for Heterogeneous Medical Image Segmentation
【速读】:该论文旨在解决 Transformer 基础模型(Foundation Models, FMs)在医学图像分割领域因数据集规模限制及隐私保护导致的扩展性挑战。现有方法受限于孤立医院内有限的医学影像数据量以及数据集中化的隐私约束,而联邦学习(Federated Learning, FL)与基础模型微调(FLFM)的结合虽提供了一种协作训练的潜在方案,但非独立同分布(non-IID)数据分布及联邦环境中的计算与通信约束仍限制了性能提升。论文的关键在于提出了一种名为 FedSCA 的新型 FLFM 微调框架,涵盖联邦学习的所有阶段:(1) 设计高效的参数微调策略以提升本地客户端的计算效率;(2) 使用部分低层适配器传输优化通信效率;(3) 在服务器端引入基于相似度引导的协同聚合(SGCA)以应对非 IID 数据问题。通过这些创新,FedSCA 实现了显著的性能提升,并在三个医学图像分割的联邦学习基准测试中建立了新的 SOTA 水平。
链接: https://arxiv.org/abs/2503.15390
作者: Yumin Zhang,Yan Gao,Haoran Duan,Hanqing Guo,Tejal Shah,Rajiv Ranjan,Bo Wei
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transformer-based foundation models (FMs) have recently demonstrated remarkable performance in medical image segmentation. However, scaling these models is challenging due to the limited size of medical image datasets within isolated hospitals, where data centralization is restricted due to privacy concerns. These constraints, combined with the data-intensive nature of FMs, hinder their broader application. Integrating federated learning (FL) with foundation models (FLFM) fine-tuning offers a potential solution to these challenges by enabling collaborative model training without data sharing, thus allowing FMs to take advantage of a diverse pool of sensitive medical image data across hospitals/clients. However, non-independent and identically distributed (non-IID) data among clients, paired with computational and communication constraints in federated environments, presents an additional challenge that limits further performance improvements and remains inadequately addressed in existing studies. In this work, we propose a novel FLFM fine-tuning framework, Federated tuning with Similarity-guided Collaborative Aggregation (FedSCA), encompassing all phases of the FL process. This includes (1) specially designed parameter-efficient fine-tuning (PEFT) for local client training to enhance computational efficiency; (2) partial low-level adapter transmission for communication efficiency; and (3) similarity-guided collaborative aggregation (SGCA) on the server side to address non-IID issues. Extensive experiments on three FL benchmarks for medical image segmentation demonstrate the effectiveness of our proposed FedSCA, establishing new SOTA performance.
zh
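SGCA 的"相似度引导聚合"可以用如下简化示意理解(笔者假设的实现方式:权重取余弦相似度的 softmax,非论文原式):

```python
import numpy as np

def similarity_guided_aggregate(client_params, tau=0.5):
    """Hypothetical sketch of similarity-guided collaborative
    aggregation: each client receives a personalized average of all
    clients' (flattened) adapter weights, weighted by softmax-scaled
    cosine similarity. client_params: list of 1-D numpy arrays."""
    W = np.stack(client_params)                         # (K, d)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-normalize
    sim = Wn @ Wn.T                                     # cosine similarity
    weights = np.exp(sim / tau)
    weights /= weights.sum(axis=1, keepdims=True)       # row-wise softmax
    return weights @ W                                  # (K, d) personalized

# toy usage: three clients, 4-dim adapters; the first two are similar
params = [np.array([1.0, 0, 0, 0]), np.array([0.9, 0.1, 0, 0]),
          np.array([0, 0, 1.0, 0])]
print(similarity_guided_aggregate(params).round(2))
```

直观上,数据分布相近的客户端彼此获得更大的聚合权重,从而在 non-IID 环境下避免被差异很大的客户端"拉偏"。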
[CV-139] Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images
链接: https://arxiv.org/abs/2503.15321
作者: Euclid Collaboration:G. Stevens(1),S. Fotopoulou(1),M. N. Bremer(1),T. Matamoro Zatarain(1),K. Jahnke(2),B. Margalef-Bentabol(3),M. Huertas-Company(4 and 5 and 6 and 7),M. J. Smith(8 and 9),M. Walmsley(10 and 11),M. Salvato(12),M. Mezcua(13 and 14),A. Paulino-Afonso(15 and 16),M. Siudek(5 and 13),M. Talia(17 and 18),F. Ricci(19 and 20),W. Roster(12),N. Aghanim(21),B. Altieri(22),S. Andreon(23),H. Aussel(24),C. Baccigalupi(25 and 26 and 27 and 28),M. Baldi(29 and 18 and 30),S. Bardelli(18),P. Battaglia(18),A. Biviano(26 and 25),A. Bonchi(31),E. Branchini(32 and 33 and 23),M. Brescia(34 and 35),J. Brinchmann(16 and 36),S. Camera(37 and 38 and 39),G. Cañas-Herrera(40 and 41 and 42),V. Capobianco(39),C. Carbone(43),J. Carretero(44 and 45),M. Castellano(20),G. Castignani(18),S. Cavuoti(35 and 46),K. C. Chambers(47),A. Cimatti(48),C. Colodro-Conde(4),G. Congedo(49),C. J. Conselice(11),L. Conversi(50 and 22),Y. Copin(51),A. Costille(52),F. Courbin(53 and 54),H. M. Courtois(55),M. Cropper(56),A. Da Silva(57 and 58),H. Degaudenzi(59),G. De Lucia(26),C. Dolding(56),H. Dole(21),M. Douspis(21),F. Dubath(59),X. Dupac(22),S. Dusini(60),S. Escoffier(61),M. Farina(62),S. Ferriol(51),K. George(63),C. Giocoli(18 and 30),B. R. Granett(23),A. Grazian(64),F. Grupp(12 and 63),S. V. H. Haugan(65),I. M. Hook(66),F. Hormuth(67),A. Hornstrup(68 and 69),P. Hudelot(70),M. Jhabvala(71),E. Keihänen(72),S. Kermiche(61),A. Kiessling(73),M. Kilbinger(24),B. Kubik(51),M. Kümmel(63),H. Kurki-Suonio(74 and 75),Q. Le Boulc’h(76),A. M. C. Le Brun(77),D. Le Mignant(52),P. B. Lilje(65),V. Lindholm(74 and 75),I. Lloro(78),G. Mainetti(76),D. Maino(79 and 43 and 80),E. Maiorano(18),O. Marggraf(81),M. Martinelli(20 and 82),N. Martinet(52),F. Marulli(17 and 18 and 30),R. Massey(83),S. Maurogordato(84),H. J. McCracken(70),E. Medinaceli(18),S. Mei(85 and 86),M. Melchior(87),M. Meneghetti(18 and 30),E. Merlin
机构: University of Bristol (布里斯托尔大学); European Space Agency (欧洲航天局); Institut de Física d'Altes Energies (高能物理研究所); Max Planck Institute for Extraterrestrial Physics (马克斯·普朗克地外物理研究所); INAF - Istituto Nazionale di Astrofisica (意大利国家天体物理学研究所); Laboratoire AIM Paris-Saclay (巴黎萨克雷AIM实验室); Centre de Recherche Astrophysique de Lyon (里昂天体物理学研究中心); Instituto de Astrofísica de Canarias (加那利天体物理研究所); Università degli Studi di Bologna (博洛尼亚大学); Università degli Studi di Padova (帕多瓦大学); Università degli Studi di Trieste (的里雅斯特大学); Universidad Complutense de Madrid (马德里康普顿斯大学); Universitat de Barcelona (巴塞罗那大学); University of Oxford (牛津大学); University of Nottingham (诺丁汉大学); University of Edinburgh (爱丁堡大学); University of Copenhagen (哥本哈根大学); Sorbonne Université (索邦大学); CNRS (法国国家科学研究中心); Observatoire de Paris (巴黎天文台); European Southern Observatory (欧洲南方天文台); University of California, Berkeley (加州大学伯克利分校); University of California, Santa Cruz (加州大学圣克鲁兹分校); University of Hawaii (夏威夷大学); Australian National University (澳大利亚国立大学); ETH Zurich (瑞士联邦理工学院); University of Geneva (日内瓦大学); University of Leicester (莱斯特大学); University of Durham (杜伦大学); University of St Andrews (圣安德鲁斯大学); University of Portsmouth (朴茨茅斯大学); University of Helsinki (赫尔辛基大学); Aarhus University (奥胡斯大学); University of Turku (图尔库大学); University of Western Australia (西澳大利亚大学); Universidad de La Laguna (拉古纳大学); Universidad Autónoma de Madrid (马德里自治大学); Universidad Politécnica de Madrid (马德里理工大学); Universidad de Zaragoza (萨拉戈萨大学); Universidad de Valencia (瓦伦西亚大学); Universidad de Oviedo (奥维耶多大学); Universidad de Granada (格拉纳达大学); Universidad de Sevilla (塞维利亚大学); Universidad de Córdoba (科尔多瓦大学); Universidad de Almería (阿尔梅里亚大学); Universidad de Cádiz (加的斯大学); Universidad de Huelva (韦尔瓦大学); Universidad de Extremadura (埃斯特雷马杜拉大学); Universidad de Murcia (穆尔西亚大学); Universidad de La Rioja (拉里奥哈大学); Universidad de Burgos (布尔戈斯大学); Universidad de Valladolid (巴利亚多利德大学); Universidad de Salamanca (萨拉曼卡大学); Universidad de León (莱昂大学); Universidad de Ourense (奥伦塞大学); Universidad de Vigo (比戈大学); Universidad de Santiago de Compostela (圣地亚哥德孔波斯特拉大学); Universidad de A Coruña (科鲁尼亚大学) 等
类目: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper submitted as part of the A&A Special Issue 'Euclid Quick Data Release (Q1)', 32 pages, 26 figures
[CV-140] Beacon2Science: Enhancing STEREO/HI beacon data with machine learning for efficient CME tracking
【速读】:该论文旨在解决基于STEREO/HI信标数据预测日冕物质抛射(CME)到达时间准确性较低的问题。由于信标数据存在数据间隙和质量较低的限制,其预测精度无法达到高分辨率科学数据的水平。为解决这一问题,论文提出了一种名为“Beacon2Science”的创新管道,其关键是通过增强信标数据的质量(提高信噪比和空间分辨率)、学习插值提升时间分辨率至与科学数据相当的40分钟,并通过优化模型架构和损失函数确保连续帧间的信息一致性。改进后的信标图像在CME可见性方面优于原始信标数据,且提取的轨迹更接近科学数据,平均误差减少了约一半(从1°降至约0.5°),从而显著提升了基于信标数据的CME跟踪性能。
链接: https://arxiv.org/abs/2503.15288
作者: Justin Le Louëdec,Maike Bauer,Tanja Amerstorfer,Jackie A. Davies
机构: 未知
类目: Space Physics (physics.space-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 24 pages, 11 figures, 1 table, submitted to AGU Space Weather on 14th March 2025
点击查看摘要
Abstract:Observing and forecasting coronal mass ejections (CME) in real-time is crucial due to the strong geomagnetic storms they can generate that can have a potentially damaging effect, for example, on satellites and electrical devices. With its near-real-time availability, STEREO/HI beacon data is the perfect candidate for early forecasting of CMEs. However, previous work concluded that CME arrival prediction based on beacon data could not achieve the same accuracy as with high-resolution science data due to data gaps and lower quality. We present our novel pipeline entitled “Beacon2Science”, bridging the gap between beacon and science data to improve CME tracking. Through this pipeline, we first enhance the quality (signal-to-noise ratio and spatial resolution) of beacon data. We then increase the time resolution of enhanced beacon images through learned interpolation to match science data’s 40-minute resolution. We maximize information coherence between consecutive frames with adapted model architecture and loss functions through the different steps. The improved beacon images are comparable to science data, showing better CME visibility than the original beacon data. Furthermore, we compare CMEs tracked in beacon, enhanced beacon, and science images. The tracks extracted from enhanced beacon data are closer to those from science images, with a mean average error of ~0.5° of elongation, compared to 1° with original beacon data. The work presented in this paper paves the way for its application to forthcoming missions such as Vigil and PUNCH.
zh
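摘要所述"学习插值"提升时间分辨率的训练流程,可用一个玩具级草图说明(笔者假设:网络输入相邻两帧、回归中间帧并以 L1 损失训练;真实中间帧此处用均值帧代替,仅为演示):

```python
import torch
import torch.nn as nn

class MidFrameNet(nn.Module):
    """Toy interpolation net: predicts the frame between two inputs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, f0, f1):
        # stack the two bracketing frames along the channel axis
        return self.net(torch.cat([f0, f1], dim=1))

model = MidFrameNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

# toy triplet (frame_t, frame_t+1, stand-in ground-truth middle frame)
f0, f1 = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
mid = 0.5 * (f0 + f1)                      # placeholder target
loss = loss_fn(model(f0, f1), mid)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```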
[CV-141] Texture-Aware StarGAN for CT data harmonisation
链接: https://arxiv.org/abs/2503.15058
作者: Francesco Di Feola,Ludovica Pompilio,Cecilia Assolito,Valerio Guarrasi,Paolo Soda
机构: Campus Bio-Medico University of Rome (罗马生物医学大学); Umeå University (于默奥大学); La Sapienza University of Rome (罗马大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-142] A Novel Channel Boosted Residual CNN-Transformer with Regional-Boundary Learning for Breast Cancer Detection
链接: https://arxiv.org/abs/2503.15008
作者: Aamir Mehmood,Yue Hu,Saddam Hussain Khan(Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 10 Figures, 2 Tables. arXiv admin note: substantial text overlap with arXiv:2405.12986
[CV-143] A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology
链接: https://arxiv.org/abs/2503.14933
作者: Yi Luo,Hamed Hooshangnejad,Xue Feng,Gaofeng Huang,Xiaojian Chen,Rui Zhang,Quan Chen,Wil Ngwa,Kai Ding
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 19 pages, 4 figures
[CV-144] FetalFlex: Anatomy-Guided Diffusion Model for Flexible Control on Fetal Ultrasound Image Synthesis
【速读】:该论文旨在解决胎儿超声(US)图像多平面标注数据获取困难的问题,特别是针对罕见或复杂的先天性异常,由于其低发病率和众多亚型,导致现有数据难以满足训练新手放射科医生和开发鲁棒AI模型的需求,尤其是在异常胎儿检测方面。论文提出了一种名为“Flexible Fetal US Image Generation Framework (FetalFlex)”的解决方案,其关键在于通过解剖结构和多模态信息实现跨多样平面的可控合成胎儿US图像。具体而言,FetalFlex 包含一个预对齐模块以增强可控性,并引入重绘策略确保一致的纹理和外观;同时,采用两阶段自适应采样策略逐步从粗到细优化图像质量。这种方法无需异常数据即可生成分布内正常和分布外异常的胎儿US图像,实验表明其在多项图像质量指标上达到最先进的性能,并且生成结果与专家视觉评估高度一致,同时显著提升了下游分类和异常检测任务中典型深度模型的性能。此外,FetalFlex 在解剖学层面的可控生成特性为异常模拟及像素级配对或反事实数据的创建提供了独特优势。
链接: https://arxiv.org/abs/2503.14906
作者: Yaofei Duan,Tao Tan,Zhiyuan Zhu,Yuhao Huang,Yuanji Zhang,Rui Gao,Patrick Cheong-Iao Pang,Xinru Gao,Guowei Tao,Xiang Cong,Zhou Li,Lianying Liang,Guangzhi He,Linliang Yin,Xuedong Deng,Xin Yang,Dong Ni
机构: mpu.edu.mo (澳门理工大学); szu.edu.cn (深圳大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures
点击查看摘要
Abstract:Fetal ultrasound (US) examinations require the acquisition of multiple planes, each providing unique diagnostic information to evaluate fetal development and screening for congenital anomalies. However, obtaining a comprehensive, multi-plane annotated fetal US dataset remains challenging, particularly for rare or complex anomalies owing to their low incidence and numerous subtypes. This poses difficulties in training novice radiologists and developing robust AI models, especially for detecting abnormal fetuses. In this study, we introduce a Flexible Fetal US image generation framework (FetalFlex) to address these challenges, which leverages anatomical structures and multimodal information to enable controllable synthesis of fetal US images across diverse planes. Specifically, FetalFlex incorporates a pre-alignment module to enhance controllability and introduces a repaint strategy to ensure consistent texture and appearance. Moreover, a two-stage adaptive sampling strategy is developed to progressively refine image quality from coarse to fine levels. We believe that FetalFlex is the first method capable of generating both in-distribution normal and out-of-distribution abnormal fetal US images, without requiring any abnormal data. Experiments on multi-center datasets demonstrate that FetalFlex achieved state-of-the-art performance across multiple image quality metrics. A reader study further confirms the close alignment of the generated results with expert visual assessments. Furthermore, synthetic images by FetalFlex significantly improve the performance of six typical deep models in downstream classification and anomaly detection tasks. Lastly, FetalFlex’s anatomy-level controllable generation offers a unique advantage for anomaly simulation and creating paired or counterfactual data at the pixel level. The demo is available at: this https URL.
zh
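文中的 repaint 策略与 RePaint 式采样思路相近,下面给出一个与论文实现无关的假设性草图:每个去噪步把已知区域替换为按当前噪声水平加噪的已知图像,仅让模型生成掩码之外的区域(denoise_step 为假设的黑盒单步去噪函数,噪声日程亦为玩具设定):

```python
import torch

def repaint_sample(denoise_step, x_known, mask, T=50):
    """Sketch of a RePaint-style loop (Lugmayr et al., 2022).
    denoise_step(x, t) -> x_{t-1} is an assumed black-box denoiser;
    mask == 1 marks the known (kept) region of x_known."""
    x = torch.randn_like(x_known)
    for t in reversed(range(1, T + 1)):
        alpha = 1.0 - t / T                        # toy noise schedule
        # re-noise the known region to the current noise level
        known_t = alpha * x_known + (1 - alpha) * torch.randn_like(x_known)
        x = mask * known_t + (1 - mask) * x        # merge known / unknown
        x = denoise_step(x, t)                     # one reverse step
    return mask * x_known + (1 - mask) * x

# toy usage with an identity-like stand-in "denoiser"
dummy = lambda x, t: 0.95 * x
img = torch.zeros(1, 1, 8, 8)
m = torch.zeros_like(img)
m[..., :4] = 1                                     # left half is "known"
print(repaint_sample(dummy, img, m).shape)
```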
[CV-145] Degradation Alchemy: Self-Supervised Unknown-to-Known Transformation for Blind Hyperspectral Image Fusion
链接: https://arxiv.org/abs/2503.14892
作者: He Huang,Yong Chen,Yujun Guo,Wei He
机构: LIESMARS, Wuhan University (武汉大学), Wuhan, China; School of Computer and Information Engineering, Jiangxi Normal University (江西师范大学), Nanchang, China
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-146] Involution and BSConv Multi-Depth Distillation Network for Lightweight Image Super-Resolution
链接: https://arxiv.org/abs/2503.14779
作者: Akram Khatami-Rizi,Ahmad Mahmoudi-Aznaveh
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
[CV-147] Core-Periphery Principle Guided State Space Model for Functional Connectome Classification
链接: https://arxiv.org/abs/2503.14655
作者: Minheng Chen,Xiaowei Yu,Jing Zhang,Tong Chen,Chao Cao,Yan Zhuang,Yanjun Lyu,Lu Zhang,Tianming Liu,Dajiang Zhu
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
[CV-148] Three-dimensional Reconstruction of the Lumbar Spine with Submillimeter Accuracy Using Biplanar X-ray Images
【速读】:该论文旨在解决在承重条件下通过双平面X射线图像进行脊柱三维重建精度不足的问题。当前全自动重建方法的准确性较低,未能满足临床应用标准。论文提出了一种全自动的高精度腰椎三维重建方法,其关键是结合腰椎分解与标志点检测,并采用可变形模型以及基于标志点加权的二维到三维配准(Landmark-weighted 2D-3D registration)技术。通过将CT分割的椎体模型与双平面X射线图像配准获得的金标准验证,该方法实现了0.80毫米的三维重建精度,显著优于主流方法。这一成果将有助于承重状态下腰椎疾病的临床诊断。
链接: https://arxiv.org/abs/2503.14573
作者: Wanxin Yu,Zhemin Zhu,Cong Wang,Yihang Bao,Chunjie Xia,Rongshan Cheng,Yan Yu,Tsung-Yuan Tsai
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 21 pages, 10 figures, 4 tables
点击查看摘要
Abstract:Three-dimensional reconstruction of the spine under weight-bearing conditions from biplanar X-ray images is of great importance for the clinical assessment of spinal diseases. However, the current fully automated reconstruction methods have low accuracy and fail to meet the clinical application standards. This study developed and validated a fully automated method for high-accuracy 3D reconstruction of the lumbar spine from biplanar X-ray images. The method involves lumbar decomposition and landmark detection from the raw X-ray images, followed by a deformable model and landmark-weighted 2D-3D registration approach. The reconstruction accuracy was validated by the gold standard obtained through the registration of CT-segmented vertebral models with the biplanar X-ray images. The proposed method achieved a 3D reconstruction accuracy of 0.80 mm, representing a significant improvement over the mainstream approaches. This study will contribute to the clinical diagnosis of lumbar in weight-bearing positions.
zh
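"标志点加权的 2D-3D 配准"可以用如下简化示意理解(笔者假设采用正交投影与刚体位姿参数化,非论文实现):优化旋转与平移,使 3D 模型标志点在正位/侧位两视角下的投影与观测到的 2D 标志点之间的加权残差最小:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(pose, pts3d, obs_ap, obs_lat, w):
    """Weighted reprojection residuals for two orthogonal views
    (AP: drop z after the rigid transform; LAT: drop x).
    All simplifications here are illustrative assumptions."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    p = pts3d @ R.T + pose[3:6]
    r_ap = (p[:, [0, 1]] - obs_ap) * w[:, None]    # AP-view residual
    r_lat = (p[:, [2, 1]] - obs_lat) * w[:, None]  # lateral-view residual
    return np.concatenate([r_ap.ravel(), r_lat.ravel()])

# toy data: 5 landmarks, a ground-truth pose, noisy 2D observations
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))
true = np.array([0.1, -0.05, 0.2, 1.0, -0.5, 0.3])
R = Rotation.from_rotvec(true[:3]).as_matrix()
q = pts @ R.T + true[3:]
obs_ap = q[:, [0, 1]] + 0.01 * rng.normal(size=(5, 2))
obs_lat = q[:, [2, 1]] + 0.01 * rng.normal(size=(5, 2))
w = np.array([1.0, 1.0, 0.5, 2.0, 1.0])           # landmark confidences

fit = least_squares(residuals, np.zeros(6), args=(pts, obs_ap, obs_lat, w))
print(fit.x.round(2))                              # should approach `true`
```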
[CV-149] Analysis of human visual field information using machine learning methods and assessment of their accuracy
链接: https://arxiv.org/abs/2503.14562
作者: A.I. Medvedeva,V.V. Bakutkin
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: in Russian language
[CV-150] Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk
链接: https://arxiv.org/abs/2503.14550
作者: Theodorus Dapamede,Aisha Urooj,Vedant Joshi,Gabrielle Gershon,Frank Li,Mohammadreza Chavoshi,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chad Robichaux,Chadi Ayoub,Reza Arsanjani,Laurence Sperling,Judy Gichoya,Marly van Assen,Charles W. ONeill,Imon Banerjee,Hari Trivedi
机构: Emory University (埃默里大学); Mayo Clinic (梅奥诊所); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Mayo Clinic (梅奥诊所); Mayo Clinic (梅奥诊所); Emory University (埃默里大学); Emory University (埃默里大学); Emory University (埃默里大学); Mayo Clinic (梅奥诊所); Emory University (埃默里大学); Emory University (埃默里大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
[CV-151] The Impact of Artificial Intelligence on Emergency Medicine: A Review of Recent Advances
链接: https://arxiv.org/abs/2503.14546
作者: Gustavo Correia,Victor Alves,Paulo Novais
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 2 tables, 2 figures
[CV-152] AI-Driven Rapid Identification of Bacterial and Fungal Pathogens in Blood Smears of Septic Patients
【速读】:该论文旨在解决败血症患者血液样本中细菌和酵母样真菌快速、准确分类的问题,以辅助微生物诊断并提高败血症治疗效率。传统微生物学方法耗时且成本高昂,因此研究开发了一种基于深度学习的算法,利用Gram染色显微图像实现14种细菌和3种酵母样真菌的自动识别。解决方案的关键在于结合Cellpose 3模型用于分割(segmentation)以及基于注意力机制的深度多实例学习(Attention-based Deep Multiple Instance Learning)用于分类,从而在有限的数据集上实现了较高的分类准确率(细菌77.15%,真菌71.39%),并在某些物种上达到了高达96.2%的准确度。然而,对于形态相似的物种(如Staphylococcus hominis与S. haemolyticus,以及Candida albicans的不同形态),仍存在显著的分类困难。因此,进一步优化模型和扩充训练数据集是未来研究的重点方向。
链接: https://arxiv.org/abs/2503.14542
作者: Agnieszka Sroka-Oleksiak,Adam Pardyl,Dawid Rymarczyk,Aldona Olechowska-Jarząb,Katarzyna Biegun-Drożdż,Dorota Ochońska,Michał Wronka,Adriana Borowa,Tomasz Gosiewski,Miłosz Adamczyk,Henryk Telega,Bartosz Zieliński,Monika Brzychczy-Włoch
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Sepsis is a life-threatening condition which requires rapid diagnosis and treatment. Traditional microbiological methods are time-consuming and expensive. In response to these challenges, deep learning algorithms were developed to identify 14 bacteria species and 3 yeast-like fungi from microscopic images of Gram-stained smears of positive blood samples from sepsis patients. A total of 16,637 Gram-stained microscopic images were used in the study. The analysis used the Cellpose 3 model for segmentation and Attention-based Deep Multiple Instance Learning for classification. Our model achieved an accuracy of 77.15% for bacteria and 71.39% for fungi, with ROC AUC of 0.97 and 0.88, respectively. The highest values, reaching up to 96.2%, were obtained for Cutibacterium acnes, Enterococcus faecium, Stenotrophomonas maltophilia and Nakaseomyces glabratus. Classification difficulties were observed in closely related species, such as Staphylococcus hominis and Staphylococcus haemolyticus, due to morphological similarity, and within Candida albicans due to high morphotic diversity. The study confirms the potential of our model for microbial classification, but it also indicates the need for further optimisation and expansion of the training data set. In the future, this technology could support microbial diagnosis, reducing diagnostic time and improving the effectiveness of sepsis treatment due to its simplicity and accessibility. Part of the results presented in this publication was covered by a patent application at the European Patent Office EP24461637.1 “A computer implemented method for identifying a microorganism in a blood and a data processing system therefor”.
zh
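分类所用的基于注意力的深度多示例学习,其注意力池化(Ilse et al., 2018 风格)可写成如下最小草图:a_k = softmax(wᵀ tanh(V h_k)),对包内实例嵌入加权求和后再分类(嵌入维度为假设;类别数 17 = 14 种细菌 + 3 种真菌,按摘要设定):

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Minimal attention-based MIL pooling head (sketch, assumed dims)."""
    def __init__(self, d=128, h=64, n_classes=17):
        super().__init__()
        self.V = nn.Linear(d, h)            # attention projection
        self.w = nn.Linear(h, 1)            # attention score
        self.cls = nn.Linear(d, n_classes)  # bag-level classifier
    def forward(self, H):                   # H: (n_instances, d)
        a = torch.softmax(self.w(torch.tanh(self.V(H))), dim=0)  # (n, 1)
        z = (a * H).sum(dim=0)              # attention-weighted bag embedding
        return self.cls(z), a.squeeze(-1)

# toy bag of 12 instance embeddings (e.g., segmented cell crops)
H = torch.randn(12, 128)
logits, attn = AttentionMIL()(H)
print(logits.shape, attn.sum().item())     # (17,), attention sums to 1
```

注意力权重同时给出可解释性:权重高的实例即模型判别时最"关注"的细胞片段。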
[CV-153] Vision-Language Models for Acute Tuberculosis Diagnosis: A Multimodal Approach Combining Imaging and Clinical Data
【速读】:该论文旨在解决急性结核病(TB)在资源有限环境下的自动化筛查问题。传统方法依赖放射科医生的主观判断,存在效率低下和诊断精度不足的问题,特别是在缺乏医疗资源的地区。为应对这些挑战,论文提出了一种基于视觉-语言模型(Vision-Language Model, VLM)的解决方案,结合SIGLIP用于视觉编码(visual encoding),以及Gemma-3b用于解码(decoding)。关键在于通过集成胸部X光图像与临床笔记,实现对急性结核病特异性病理特征的高效表征与精准识别。实验结果表明,该模型能够以高精度(97%)和高召回率(96%)检测出如实变、空洞和结节等主要急性结核病病理特征,并具备良好的空间定位能力和区分阳性病例的鲁棒性。这一多模态能力显著降低了对放射科医生的依赖,为急性结核病的规模化筛查提供了可靠工具。未来工作将聚焦于提升对细微病变的检测能力及缓解数据集偏差,以进一步增强模型的泛化性能与全球适用性。
链接: https://arxiv.org/abs/2503.14538
作者: Ananya Ganapthy,Praveen Shastry,Naveen Kumarasami,Anandakumar D,Keerthana R,Mounigasri M,Varshinipriya M,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Background: This study introduces a Vision-Language Model (VLM) leveraging SIGLIP and Gemma-3b architectures for automated acute tuberculosis (TB) screening. By integrating chest X-ray images and clinical notes, the model aims to enhance diagnostic accuracy and efficiency, particularly in resource-limited settings. Methods: The VLM combines visual data from chest X-rays with clinical context to generate detailed, context-aware diagnostic reports. The architecture employs SIGLIP for visual encoding and Gemma-3b for decoding, ensuring effective representation of acute TB-specific pathologies and clinical insights. Results: Key acute TB pathologies, including consolidation, cavities, and nodules, were detected with high precision (97%) and recall (96%). The model demonstrated strong spatial localization capabilities and robustness in distinguishing TB-positive cases, making it a reliable tool for acute TB diagnosis. Conclusion: The multimodal capability of the VLM reduces reliance on radiologists, providing a scalable solution for acute TB screening. Future work will focus on improving the detection of subtle pathologies and addressing dataset biases to enhance its generalizability and application in diverse global healthcare settings.
zh
[CV-154] Advancing Chronic Tuberculosis Diagnostics Using Vision-Language Models: A Multimodal Framework for Precision Analysis
链接: https://arxiv.org/abs/2503.14536
作者: Praveen Shastry,Sowmya Chowdary Muthulur,Naveen Kumarasami,Anandakumar D,Mounigasri M,Keerthana R,Kishore Prasath Venkatesh,Bargava Subramanian,Kalyan Sivasailam,Revathi Ezhumalai,Abitha Marimuthu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages , 3 figures
[CV-155] Ship Detection in Remote Sensing Imagery for Arbitrarily Oriented Object Detection
【速读】:该论文旨在解决传统船检测方法在复杂场景下存在的挑战,包括船只任意方向、复杂背景以及视角遮挡等问题。为应对这些挑战,论文提出了一种创新的船检测系统,其关键在于结合使用YOLOv8与改进的U-Net两种先进的深度学习模型。YOLOv8专注于实时处理和高精度检测,而U-Net则用于实现船实例分割以改善边界定位并处理遮挡情况。通过利用“Airbus Ship Detection”数据集评估,结果表明该方法显著提升了船检测性能,其中YOLOv8达到了88%的mAP,U-Net则实现了89%的mAP。这不仅验证了所提方案的有效性,还展示了深度学习模型在提升船检测任务中的潜力。
链接: https://arxiv.org/abs/2503.14534
作者: Bibi Erum Ayesha,T. Satyanarayana Murthy,Palamakula Ramesh Babu,Ramu Kuchipudi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This research paper presents an innovative ship detection system tailored for applications like maritime surveillance and ecological monitoring. The study employs YOLOv8 and repurposed U-Net, two advanced deep learning models, to significantly enhance ship detection accuracy. Evaluation metrics include Mean Average Precision (mAP), processing speed, and overall accuracy. The research utilizes the “Airbus Ship Detection” dataset, featuring diverse remote sensing images, to assess the models’ versatility in detecting ships with varying orientations and environmental contexts. Conventional ship detection faces challenges with arbitrary orientations, complex backgrounds, and obscured perspectives. Our approach incorporates YOLOv8 for real-time processing and U-Net for ship instance segmentation. Evaluation focuses on mAP, processing speed, and overall accuracy. The dataset is chosen for its diverse images, making it an ideal benchmark. Results demonstrate significant progress in ship detection. YOLOv8 achieves an 88% mAP, excelling in accurate and rapid ship detection. U-Net, adapted for ship instance segmentation, attains an 89% mAP, improving boundary delineation and handling occlusions. This research enhances maritime surveillance, disaster response, and ecological monitoring, exemplifying the potential of deep learning models in ship detection.
zh
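若想复现类似的检测流程,YOLOv8 部分可直接调用 ultralytics 库完成推理(以下为通用用法示意,权重文件与图像路径均为假设;论文用于实例分割的 U-Net 分支未包含在内):

```python
# minimal YOLOv8 inference sketch (assumed paths/weights)
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                  # pretrained checkpoint
results = model.predict("ship_scene.jpg", conf=0.25)  # hypothetical image
for r in results:
    for box in r.boxes:
        # xyxy coordinates and confidence for each detected ship
        print(box.xyxy.tolist(), float(box.conf))
```

实际复现时需在 Airbus Ship Detection 数据上微调检测头,并用分割网络对检测框内区域做边界精修。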
[CV-156] Spline refinement with differentiable rendering
链接: https://arxiv.org/abs/2503.14525
作者: Frans Zdyb,Albert Alonso,Julius B. Kirkegaard
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注:
[CV-157] SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning
链接: https://arxiv.org/abs/2503.14523
作者: Siyi Wu,Leyi Zhao,Haitian Ma,Xinyuan Song
机构: University of Texas at Arlington (德克萨斯大学阿灵顿分校); Indiana University (印第安纳大学); Emory University (埃默里大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
人工智能
[AI-0] Learning to Play Piano in the Real World
链接: https://arxiv.org/abs/2503.15481
作者: Yves-Simon Zeulner,Sandeep Selvaraj,Roberto Calandra
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages
点击查看摘要
Abstract:Towards the grand challenge of achieving human-level manipulation in robots, playing piano is a compelling testbed that requires strategic, precise, and flowing movements. Over the years, several works demonstrated hand-designed controllers on real world piano playing, while other works evaluated robot learning approaches on simulated piano scenarios. In this paper, we develop the first piano playing robotic system that makes use of learning approaches while also being deployed on a real world dexterous robot. Specifically, we make use of Sim2Real to train a policy in simulation using reinforcement learning before deploying the learned policy on a real world dexterous robot. In our experiments, we thoroughly evaluate the interplay between domain randomization and the accuracy of the dynamics model used in simulation. Moreover, we evaluate the robot’s performance across multiple songs with varying complexity to study the generalization of our learned policy. By providing a proof-of-concept of learning to play piano in the real world, we want to encourage the community to adopt piano playing as a compelling benchmark towards human-level manipulation. We open-source our code and show additional videos at this https URL .
[AI-1] CCDP: Composition of Conditional Diffusion Policies with Guided Sampling
链接: https://arxiv.org/abs/2503.15386
作者: Amirreza Razmjoo,Sylvain Calinon,Michael Gienger,Fan Zhang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Imitation Learning offers a promising approach to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem (which may require long-horizon history to manage failures) into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines.
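其"避开此前失败动作"的采样思想,可用一个与论文扩散分解实现无关的简化示意说明(笔者假设:对候选动作按与历史失败动作的核密度做降权后重采样,核宽 sigma 为示例设定):

```python
import numpy as np

def failure_aware_sample(propose, failures, n_cand=64, sigma=0.2, seed=0):
    """Sketch: draw candidates from a base policy `propose(n)` and
    resample them with weights that penalize proximity to past
    failed actions."""
    rng = np.random.default_rng(seed)
    cand = propose(n_cand)                                # (n, d) candidates
    if len(failures):
        F = np.asarray(failures)                          # (m, d) failures
        d2 = ((cand[:, None, :] - F[None]) ** 2).sum(-1)  # squared distances
        penalty = np.exp(-d2 / (2 * sigma ** 2)).sum(1)   # kernel density
        w = 1.0 / (1.0 + penalty)                         # downweight repeats
    else:
        w = np.ones(n_cand)
    w /= w.sum()
    return cand[rng.choice(n_cand, p=w)]

# toy usage: 2-D actions, one recorded failure near the origin
base = lambda n: np.random.default_rng(1).normal(size=(n, 2))
print(failure_aware_sample(base, failures=[[0.0, 0.0]]))
```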
[AI-2] Do Chains-of-Thoughts of Large Language Models Suffer from Hallucinations, Cognitive Biases, or Phobias in Bayesian Reasoning?
链接: https://arxiv.org/abs/2503.15268
作者: Roberto Araya
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 3 figures
点击查看摘要
Abstract:Learning to reason and carefully explain arguments is central to students’ cognitive, mathematical, and computational thinking development. This is particularly challenging in problems under uncertainty and in Bayesian reasoning. With the new generation of large language models (LLMs) capable of reasoning using Chain-of-Thought (CoT), there is an excellent opportunity to learn with them as they explain their reasoning through a dialogue with their artificial internal voice. It is an engaging and excellent opportunity to learn Bayesian reasoning. Furthermore, given that different LLMs sometimes arrive at opposite solutions, CoT generates opportunities for deep learning by detailed comparisons of reasonings. However, unlike humans, we found that they do not autonomously explain using ecologically valid strategies like natural frequencies, whole objects, and embodied heuristics. This is unfortunate, as these strategies help humans avoid critical mistakes and have proven pedagogical value in Bayesian reasoning. In order to overcome these biases and aid understanding and learning, we included prompts that induce LLMs to use these strategies. We found that LLMs with CoT incorporate them but not consistently. They show persistent biases towards symbolic reasoning and avoidance or phobia of ecologically valid strategies.
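文中强调的"自然频率"策略可用一个具体的贝叶斯算例说明(笔者补充的示例,数值为假设):把概率换算成"每 1000 人中的人数"再更新,往往比直接套用贝叶斯公式更不易出错:

```python
# natural-frequency reading of Bayes' rule (illustrative numbers)
population = 1000
prevalence = 0.01        # 10 in 1000 have the condition
sensitivity = 0.9        # 9 of those 10 test positive
false_pos = 0.05         # ~50 of the 990 healthy also test positive

sick = population * prevalence                      # 10 people
true_pos = sick * sensitivity                       # 9 people
false_positives = (population - sick) * false_pos   # 49.5 people

posterior = true_pos / (true_pos + false_positives)
print(f"P(sick | positive) ≈ {posterior:.2%}")       # ≈ 15.38%
```

以人数而非概率思考("阳性者共约 58.5 人,其中真正患病的只有 9 人"),正是论文希望通过提示词诱导 LLM 采用的生态有效策略。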
[AI-3] Automated Non-Functional Requirements Generation in Software Engineering with Large Language Models: A Comparative Study
链接: https://arxiv.org/abs/2503.15248
作者: Jomar Thomas Almonte,Santhosh Anitha Boominathan,Nathalia Nascimento
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 11 pages
点击查看摘要
Abstract:Neglecting non-functional requirements (NFRs) early in software development can lead to critical challenges. Despite their importance, NFRs are often overlooked or difficult to identify, impacting software quality. To support requirements engineers in eliciting NFRs, we developed a framework that leverages Large Language Models (LLMs) to derive quality-driven NFRs from functional requirements (FRs). Using a custom prompting technique within a Deno-based pipeline, the system identifies relevant quality attributes for each functional requirement and generates corresponding NFRs, aiding systematic integration. A crucial aspect is evaluating the quality and suitability of these generated requirements. Can LLMs produce high-quality NFR suggestions? Using 34 functional requirements - selected as a representative subset of 3,964 FRs - the LLMs inferred applicable attributes based on the ISO/IEC 25010:2023 standard, generating 1,593 NFRs. A horizontal evaluation covered three dimensions: NFR validity, applicability of quality attributes, and classification precision. Ten industry software quality evaluators, averaging 13 years of experience, assessed a subset for relevance and quality. The evaluation showed strong alignment between LLM-generated NFRs and expert assessments, with median validity and applicability scores of 5.0 (means: 4.63 and 4.59, respectively) on a 1-5 scale. In the classification task, 80.4% of LLM-assigned attributes matched expert choices, with 8.3% near misses and 11.3% mismatches. A comparative analysis of eight LLMs highlighted variations in performance, with gemini-1.5-pro exhibiting the highest attribute accuracy, while llama-3.3-70B achieved higher validity and applicability scores. These findings provide insights into the feasibility of using LLMs for automated NFR generation and lay the foundation for further exploration of AI-assisted requirements engineering.
[AI-4] A Personalized Data-Driven Generative Model of Human Motion
链接: https://arxiv.org/abs/2503.15225
作者: Angelo Di Porzio,Marco Coraggio
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 6 pages, 9 figures
点击查看摘要
Abstract:The deployment of autonomous virtual avatars (in extended reality) and robots in human group activities - such as rehabilitation therapy, sports, and manufacturing - is expected to increase as these technologies become more pervasive. Designing cognitive architectures and control strategies to drive these agents requires realistic models of human motion. However, existing models only provide simplified descriptions of human motor behavior. In this work, we propose a fully data-driven approach, based on Long Short-Term Memory neural networks, to generate original motion that captures the unique characteristics of specific individuals. We validate the architecture using real data of scalar oscillatory motion. Extensive analyses show that our model effectively replicates the velocity distribution and amplitude envelopes of the individual it was trained on, remaining different from other individuals, and outperforming state-of-the-art models in terms of similarity to human data.
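For readers who want a concrete picture, here is a minimal sketch of a personalized autoregressive LSTM generator for scalar motion, in PyTorch. The architecture details (window size, hidden width) are assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn

class MotionLSTM(nn.Module):
    """Autoregressive LSTM predicting the next motion sample from a window
    of past samples (illustrative, not the authors' exact architecture)."""
    def __init__(self, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # next-sample prediction

    @torch.no_grad()
    def generate(self, seed, steps: int):
        """Roll the model forward to synthesize new motion from a seed window."""
        window = seed.clone()             # (1, T, 1)
        samples = []
        for _ in range(steps):
            nxt = self(window)            # (1, 1)
            samples.append(nxt.item())
            window = torch.cat([window[:, 1:], nxt.unsqueeze(1)], dim=1)
        return samples
```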
[AI-5] A Unified Framework for Real-Time Failure Handling in Robotics Using Vision-Language Models Reactive Planner and Behavior Trees
链接: https://arxiv.org/abs/2503.15202
作者: Faseeh Ahmad,Hashim Ismail,Jonathan Styrud,Maj Stenmark,Volker Krueger
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Robotic systems often face execution failures due to unexpected obstacles, sensor errors, or environmental changes. Traditional failure recovery methods rely on predefined strategies or human intervention, making them less adaptable. This paper presents a unified failure recovery framework that combines Vision-Language Models (VLMs), a reactive planner, and Behavior Trees (BTs) to enable real-time failure handling. Our approach includes pre-execution verification, which checks for potential failures before execution, and reactive failure handling, which detects and corrects failures during execution by verifying existing BT conditions, adding missing preconditions and, when necessary, generating new skills. The framework uses a scene graph for structured environmental perception and an execution history for continuous monitoring, enabling context-aware and adaptive failure handling. We evaluate our framework through real-world experiments with an ABB YuMi robot on tasks like peg insertion, object sorting, and drawer placement, as well as in the AI2-THOR simulator. Compared to using pre-execution and reactive methods separately, our approach achieves higher task success rates and greater adaptability. Ablation studies highlight the importance of VLM-based reasoning, structured scene representation, and execution history tracking for effective failure recovery in robotics.
[AI-6] Foundation models may exhibit staged progression in novel CBRN threat disclosure
链接: https://arxiv.org/abs/2503.15182
作者: Kevin M Esvelt
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
*备注: 26 pages, 2 figures
点击查看摘要
Abstract:The extent to which foundation models can disclose novel chemical, biological, radiological, and nuclear (CBRN) threats to expert users is unclear due to a lack of test cases. I leveraged the unique opportunity presented by an upcoming publication describing a novel catastrophic biothreat - “Technical Report on Mirror Bacteria: Feasibility and Risks” - to conduct a small controlled study before it became public. Graduate-trained biologists tasked with predicting the consequences of releasing mirror E. coli showed no significant differences in rubric-graded accuracy using Claude Sonnet 3.5 new (n=10) or web search only (n=2); both groups scored comparably to a web baseline (28 and 43 versus 36). However, Sonnet reasoned correctly when prompted by a report author, but a smaller model, Haiku 3.5, failed even with author guidance (80 versus 5). These results suggest distinct stages of model capability: Haiku is unable to reason about mirror life even with threat-aware expert guidance (Stage 1), while Sonnet correctly reasons only with threat-aware prompting (Stage 2). Continued advances may allow future models to disclose novel CBRN threats to naive experts (Stage 3) or unskilled users (Stage 4). While mirror life represents only one case study, monitoring new models’ ability to reason about privately known threats may allow protective measures to be implemented before widespread disclosure.
[AI-7] Multi-Agent Actor-Critic with Harmonic Annealing Pruning for Dynamic Spectrum Access Systems
链接: https://arxiv.org/abs/2503.15172
作者: George Stamatelis,Angelos-Nikolaos Kanatas,George C. Alexandropoulos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: 5 pages, 3 figures, 1 table, submited to an IEEE conference
点击查看摘要
Abstract:Multi-Agent Deep Reinforcement Learning (MADRL) has emerged as a powerful tool for optimizing decentralized decision-making systems in complex settings, such as Dynamic Spectrum Access (DSA). However, deploying deep learning models on resource-constrained edge devices remains challenging due to their high computational cost. To address this challenge, in this paper, we present a novel sparse recurrent MARL framework integrating gradual neural network pruning into the independent actor global critic paradigm. Additionally, we introduce a harmonic annealing sparsity scheduler, which achieves comparable, and in certain cases superior, performance to standard linear and polynomial pruning schedulers at large sparsities. Our experimental investigation demonstrates that the proposed DSA framework can discover superior policies, under diverse training conditions, outperforming conventional DSA, MADRL baselines, and state-of-the-art pruning techniques.
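The paper does not reproduce its scheduler formula in the abstract, so the sketch below only illustrates the general idea of a sparsity schedule that oscillates while annealing toward a target; the base ramp is the standard cubic schedule of Zhu and Gupta, and the cosine modulation is an assumption.

```python
import numpy as np

def harmonic_annealing_sparsity(t, t_final, s_final, cycles=4, amp=0.1):
    """Illustrative sparsity schedule: a cubic pruning ramp modulated by a
    decaying cosine. The paper's exact 'harmonic annealing' formula is not
    reproduced here; this only sketches an oscillating anneal toward the
    target sparsity s_final."""
    frac = min(t / t_final, 1.0)
    base = s_final * (1 - (1 - frac) ** 3)                 # cubic ramp
    wave = amp * s_final * (1 - frac) * np.cos(2 * np.pi * cycles * frac)
    return float(np.clip(base + wave, 0.0, s_final))
```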
[AI-8] Volumetric Reconstruction From Partial Views for Task-Oriented Grasping
链接: https://arxiv.org/abs/2503.15167
作者: Fujian Yan,Hui Li,Hongsheng He
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Object affordance and volumetric information are essential in devising effective grasping strategies under task-specific constraints. This paper presents an approach for inferring suitable grasping strategies from limited partial views of an object. To achieve this, a recurrent generative adversarial network (R-GAN) is proposed, incorporating a recurrent generator with long short-term memory (LSTM) units so that it can process a variable number of depth scans. To determine object affordances, the AffordPose knowledge dataset is utilized as prior knowledge. Affordance retrieval is defined by volume similarity, measured via Chamfer Distance, and action similarities. A Proximal Policy Optimization (PPO) reinforcement learning model is further implemented to refine the retrieved grasp strategies for task-oriented grasping. The retrieved grasp strategies were evaluated on a dual-arm mobile manipulation robot with an overall grasping accuracy of 89% for four tasks: lift, handle grasp, wrap grasp, and press.
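Chamfer Distance, the volume-similarity measure named above, is standard and easy to state. A minimal NumPy version (the paper may use a scaled or one-sided variant):

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N,3) and b (M,3):
    mean nearest-neighbor squared distance in both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```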
[AI-9] Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models
链接: https://arxiv.org/abs/2503.15129
作者: Man Fai Wong,Chee Wei Tan
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper studies how AI-assisted programming with large language model (LLM) tools such as GitHub Copilot and Amazon CodeWhisperer improves software developers’ abilities, and how integrating crowd-sourced human feedback can enhance reinforcement learning from human feedback (RLHF) for text-to-code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback collection burden, highlighting the value of collecting high-quality human feedback. Our empirical evaluations demonstrate the efficacy of this approach, showcasing how LLM agents can be effectively trained for improved text-to-code generation. Our Bayesian optimization framework can be designed for general domain-specific languages, promoting the alignment of large language model capabilities with human feedback in AI-assisted programming for code generation.
[AI-10] Reasoning Effort and Problem Complexity: A Scaling Analysis in LLM s ICLR2025
链接: https://arxiv.org/abs/2503.15113
作者: Benjamin Estermann,Roger Wattenhofer
类目: Artificial Intelligence (cs.AI)
*备注: Published at ICLR 2025 Workshop on Reasoning and Planning for LLMs
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable text generation capabilities, and recent advances in training paradigms have led to breakthroughs in their reasoning performance. In this work, we investigate how the reasoning effort of such models scales with problem complexity. We use the infinitely scalable Tents puzzle, which has a known linear-time solution, to analyze this scaling behavior. Our results show that reasoning effort scales with problem size, but only up to a critical problem complexity. Beyond this threshold, the reasoning effort does not continue to increase, and may even decrease. This observation highlights a critical limitation in the logical coherence of current LLMs as problem complexity increases, and underscores the need for strategies to improve reasoning scalability. Furthermore, our results reveal significant performance differences between current state-of-the-art reasoning models when faced with increasingly complex logical puzzles.
[AI-11] VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making
链接: https://arxiv.org/abs/2503.15108
作者: Mohamed Salim Aissi,Clemence Grislain,Mohamed Chetouani,Olivier Sigaud,Laure Soulier,Nicolas Thome
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent’s decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.
[AI-12] Diffusion-Based Forecasting for Uncertainty-Aware Model Predictive Control
链接: https://arxiv.org/abs/2503.15095
作者: Stelios Zarifis,Ioannis Kordonis,Petros Maragos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 5 pages, 3 figures, 3 tables. This version is submitted to the 33rd European Signal Processing Conference (EUSIPCO 2025), to be held in Isola delle Femmine - Palermo - Italy, on September 8-12, 2025
点击查看摘要
Abstract:We propose Diffusion-Informed Model Predictive Control (D-I MPC), a generic framework for uncertainty-aware prediction and decision-making in partially observable stochastic systems by integrating diffusion-based time series forecasting models in Model Predictive Control algorithms. In our approach, a diffusion-based time series forecasting model is used to probabilistically estimate the evolution of the system’s stochastic components. These forecasts are then incorporated into MPC algorithms to estimate future trajectories and optimize action selection under the uncertainty of the future. We evaluate the framework on the task of energy arbitrage, where a Battery Energy Storage System participates in the day-ahead electricity market of the New York state. Experimental results indicate that our model-based approach with a diffusion-based forecaster significantly outperforms both implementations with classical forecasting methods and model-free reinforcement learning baselines.
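The decision loop can be sketched generically: sample several futures from the probabilistic forecaster, score candidate plans against each, and apply the first action of the best plan. `forecaster.sample`, `cost_fn`, and `candidate_plans` are assumed interfaces, not the paper's API.

```python
import numpy as np

def d_i_mpc_step(forecaster, cost_fn, state, candidate_plans,
                 horizon=24, n_samples=64):
    """One receding-horizon step: score each candidate action plan against
    sampled futures of the stochastic component (e.g. day-ahead prices) and
    return the first action of the plan with lowest expected cost."""
    futures = forecaster.sample(state, horizon, n_samples)   # (K, horizon)
    best_action, best_cost = None, np.inf
    for plan in candidate_plans:                             # each: (horizon,)
        expected = np.mean([cost_fn(state, plan, f) for f in futures])
        if expected < best_cost:
            best_action, best_cost = plan[0], expected
    return best_action   # apply it, observe the new state, replan next step
```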
[AI-13] StyleLoco: Generative Adversarial Distillation for Natural Humanoid Robot Locomotion
链接: https://arxiv.org/abs/2503.15082
作者: Le Ma,Ziyu Meng,Tengyu Liu,Yuhan Li,Ran Song,Wei Zhang,Siyuan Huang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:Humanoid robots are anticipated to acquire a wide range of locomotion capabilities while ensuring natural movement across varying speeds and terrains. Existing methods encounter a fundamental dilemma in learning humanoid locomotion: reinforcement learning with handcrafted rewards can achieve agile locomotion but produces unnatural gaits, while Generative Adversarial Imitation Learning (GAIL) with motion capture data yields natural movements but suffers from unstable training processes and restricted agility. Integrating these approaches proves challenging due to the inherent heterogeneity between expert policies and human motion datasets. To address this, we introduce StyleLoco, a novel two-stage framework that bridges this gap through a Generative Adversarial Distillation (GAD) process. Our framework begins by training a teacher policy using reinforcement learning to achieve agile and dynamic locomotion. It then employs a multi-discriminator architecture, where distinct discriminators concurrently extract skills from both the teacher policy and motion capture data. This approach effectively combines the agility of reinforcement learning with the natural fluidity of human-like movements while mitigating the instability issues commonly associated with adversarial training. Through extensive simulation and real-world experiments, we demonstrate that StyleLoco enables humanoid robots to perform diverse locomotion tasks with the precision of expertly trained policies and the natural aesthetics of human motion, successfully transferring styles across different movement types while maintaining stable locomotion across a broad spectrum of command inputs.
[AI-14] HAD-Gen: Human-like and Diverse Driving Behavior Modeling for Controllable Scenario Generation
链接: https://arxiv.org/abs/2503.15049
作者: Cheng Wang,Lingxin Kong,Massimiliano Tamborski,Stefano V. Albrecht
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Simulation-based testing has emerged as an essential tool for verifying and validating autonomous vehicles (AVs). However, contemporary methodologies, such as deterministic and imitation learning-based driver models, struggle to capture the variability of human-like driving behavior. Given these challenges, we propose HAD-Gen, a general framework for realistic traffic scenario generation that simulates diverse human-like driving behaviors. The framework first clusters the vehicle trajectory data into different driving styles according to safety features. It then employs maximum entropy inverse reinforcement learning on each of the clusters to learn the reward function corresponding to each driving style. Using these reward functions, the method integrates offline reinforcement learning pre-training and multi-agent reinforcement learning algorithms to obtain general and robust driving policies. Multi-perspective simulation results show that our proposed scenario generation framework can simulate diverse, human-like driving behaviors with strong generalization capability. The proposed framework achieves a 90.96% goal-reaching rate, an off-road rate of 2.08%, and a collision rate of 6.91% in the generalization test, outperforming prior approaches by over 20% in goal-reaching performance. The source code is released at this https URL.
[AI-15] GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback
链接: https://arxiv.org/abs/2503.15035
作者: Sungjae Lee,Yeonjoo Hong,Kwang In Kim
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Despite significant advancements in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, leading to failure cases that create bottlenecks in real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures the selection of physically feasible grasp candidates. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently enhances task success rates across existing policy models in the RLBench and CALVIN datasets.
[AI-16] Application of linear regression method to the deep reinforcement learning in continuous action cases
链接: https://arxiv.org/abs/2503.14976
作者: Hisato Komatsu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 6 figures
点击查看摘要
Abstract:The linear regression (LR) method offers the advantage that optimal parameters can be calculated relatively easily, although its representation capability is more limited than that of deep learning techniques. To improve deep reinforcement learning, Levine et al. proposed the Least Squares Deep Q Network (LS-DQN) method, which combines the Deep Q Network (DQN) with the LR method. However, the LS-DQN method assumes that the actions are discrete. In this study, we propose the Double Least Squares Deep Deterministic Policy Gradient (DLS-DDPG) method to address this limitation. This method combines the LR method with the Deep Deterministic Policy Gradient (DDPG) technique, one of the representative deep reinforcement learning algorithms for continuous action cases. Numerical experiments conducted in MuJoCo environments showed that the LR update improved performance in at least some tasks, although difficulties remain, such as the inability to keep the regularization terms small.
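The closed-form ingredient shared by LS-DQN and DLS-DDPG is a ridge-regression refit of a network's final linear layer over penultimate-layer features. A minimal sketch (the paper's exact targets and regularization are not reproduced here):

```python
import numpy as np

def least_squares_refit(features: np.ndarray, targets: np.ndarray,
                        reg: float = 1e-3) -> np.ndarray:
    """Ridge-regression refit of a network's final linear layer: treat the
    penultimate activations as fixed features and solve for the output
    weights in closed form, then write them back into the network."""
    # features: (n_samples, d), targets: (n_samples,)
    A = features.T @ features + reg * np.eye(features.shape[1])
    return np.linalg.solve(A, features.T @ targets)
```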
[AI-17] Behaviour Discovery and Attribution for Explainable Reinforcement Learning
链接: https://arxiv.org/abs/2503.14973
作者: Rishav Rishav,Somjit Nath,Vincent Michalski,Samira Ebrahimi Kahou
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Explaining the decisions made by reinforcement learning (RL) agents is critical for building trust and ensuring reliability in real-world applications. Traditional approaches to explainability often rely on saliency analysis, which can be limited in providing actionable insights. Recently, there has been growing interest in attributing RL decisions to specific trajectories within a dataset. However, these methods often generalize explanations to long trajectories, potentially involving multiple distinct behaviors; providing multiple, more fine-grained explanations would often improve clarity. In this work, we propose a framework for behavior discovery and action attribution to behaviors in offline RL trajectories. Our method identifies meaningful behavioral segments, enabling more precise and granular explanations associated with high-level agent behaviors. This approach is adaptable across diverse environments with minimal modifications, offering a scalable and versatile solution for behavior discovery and attribution for explainable RL.
[AI-18] A Semantic and Clean-label Backdoor Attack against Graph Convolutional Networks
链接: https://arxiv.org/abs/2503.14922
作者: Jiazhu Dai,Haoyu Sun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Graph Convolutional Networks (GCNs) have shown excellent performance in graph-structured tasks such as node classification and graph classification. However, recent research has shown that GCNs are vulnerable to a new type of threat called the backdoor attack, where the adversary can inject a hidden backdoor into the GCNs so that the backdoored model performs well on benign samples, whereas its prediction will be maliciously changed to the attacker-specified target label if the hidden backdoor is activated by the attacker-defined trigger. Clean-label backdoor attacks and semantic backdoor attacks are two new types of backdoor attacks on Deep Neural Networks (DNNs); they are more imperceptible and have posed new and serious threats. Semantic and clean-label backdoor attacks have not been fully explored for GCNs. In this paper, we propose a semantic and clean-label backdoor attack (SCLBA) against GCNs in the context of graph classification to reveal the existence of this security vulnerability in GCNs. Specifically, SCLBA conducts an importance analysis on graph samples to select one type of node as the semantic trigger, which is then inserted into the graph samples to create poisoning samples without changing their labels to the attacker-specified target label. We evaluate SCLBA on multiple datasets and the results show that SCLBA can achieve attack success rates close to 99% with poisoning rates of less than 3%, and with almost no impact on the model's performance on benign samples.
[AI-19] Envisioning an AI-Enhanced Mental Health Ecosystem ALT
链接: https://arxiv.org/abs/2503.14883
作者: Kellie Yu Hui Sim,Kenny Tsu Wei Choo
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 5 pages, 0 figures, accepted to the CHI’25 Envisioning the Future of Interactive Health Workshop, to be published in HAL
点击查看摘要
Abstract:The rapid advancement of Large Language Models (LLMs), reasoning models, and agentic AI approaches coincides with a growing global mental health crisis, where increasing demand has not translated into adequate access to professional support, particularly for underserved populations. This presents a unique opportunity for AI to complement human-led interventions, offering scalable and context-aware support while preserving human connection in this sensitive domain. We explore various AI applications in peer support, self-help interventions, proactive monitoring, and data-driven insights, using a human-centred approach that ensures AI supports rather than replaces human interaction. However, AI deployment in mental health fields presents challenges such as ethical concerns, transparency, privacy risks, and risks of over-reliance. We propose a hybrid ecosystem where AI assists but does not replace human providers, emphasising responsible deployment and evaluation. We also present some of our early work and findings in several of these AI applications. Finally, we outline future research directions for refining AI-enhanced interventions while adhering to ethical and culturally sensitive guidelines.
[AI-20] 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
链接: https://arxiv.org/abs/2503.14858
作者: Kevin Wang,Ishaan Javali,Michał Bortkiewicz,Tomasz Trzciński,Benjamin Eysenbach
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Link to project website: this https URL
点击查看摘要
Abstract:Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 - 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance by 2\times - 50\times . Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned.
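As a rough illustration of what a 1024-layer network looks like in practice, here is a deep pre-norm residual MLP in PyTorch; residual connections and normalization are the usual ingredients that make such depth trainable, though the paper's exact block design may differ.

```python
import torch
import torch.nn as nn

def make_deep_mlp(in_dim, out_dim, width=256, depth=1024):
    """Very deep residual MLP of the kind the paper scales up; each block is
    a pre-norm residual unit (illustrative block design, not the authors')."""
    class Block(nn.Module):
        def __init__(self):
            super().__init__()
            self.norm = nn.LayerNorm(width)
            self.fc = nn.Linear(width, width)
        def forward(self, x):
            return x + self.fc(torch.relu(self.norm(x)))  # pre-norm residual
    return nn.Sequential(nn.Linear(in_dim, width),
                         *[Block() for _ in range(depth)],
                         nn.Linear(width, out_dim))
```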
[AI-21] Project Jenkins: Turning Monkey Neural Data into Robotic Arm Movement and Back
链接: https://arxiv.org/abs/2503.14847
作者: Andrii Zahorodnii,Dima Yanovsky
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注: 6 pages, 5 figures, project webpage and github
点击查看摘要
Abstract:Project Jenkins explores how neural activity in the brain can be decoded into robotic movement and, conversely, how movement patterns can be used to generate synthetic neural data. Using real neural data recorded from motor and premotor cortex areas of a macaque monkey named Jenkins, we develop models for decoding (converting brain signals into robotic arm movements) and encoding (simulating brain activity corresponding to a given movement). For the interface between the brain simulation and the physical world, we utilized Koch v1.1 leader and follower robotic arms. We developed an interactive web console that allows users to generate synthetic brain data from joystick movements in real time. Our results are a step towards brain-controlled robotics, prosthetics, and enhancing normal motor function. By accurately modeling brain activity, we take a step toward flexible brain-computer interfaces that generalize beyond predefined movements. To support the research community, we provide open source tools for both synthetic data generation and neural decoding, fostering reproducibility and accelerating progress. The project is available at this https URL
[AI-22] Curiosity-Diffuser: Curiosity Guide Diffusion Models for Reliability
链接: https://arxiv.org/abs/2503.14833
作者: Zihao Liu,Xing Liu,Yizhai Zhang,Zhengxiong Liu,Panfeng Huang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:One of the bottlenecks in robotic intelligence is the instability of neural network models, which, unlike control models, lack a well-defined convergence domain and stability guarantees. This leads to risks when applying intelligence in the physical world. Specifically, imitation policies based on neural networks may generate hallucinations, leading to inaccurate behaviors that impact the safety of real-world applications. To address this issue, this paper proposes the Curiosity-Diffuser, aimed at guiding the conditional diffusion model to generate trajectories with lower curiosity, thereby improving the reliability of the policy. The core idea is to use a Random Network Distillation (RND) curiosity module to assess whether the model’s behavior aligns with the training data, and then to minimize curiosity via classifier-guidance diffusion to reduce overgeneralization during inference. Additionally, we propose a computationally efficient metric for evaluating the reliability of the policy, measuring the similarity between the generated behaviors and the training dataset, to facilitate research on reliability learning. Finally, simulations verify the effectiveness and applicability of the proposed method across a variety of scenarios, showing that Curiosity-Diffuser significantly improves task performance and produces behaviors that are more similar to the training data. The code for this work is available at: this http URL
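Random Network Distillation, the curiosity module named above, is compact enough to sketch: curiosity is the prediction error of a trained network against a fixed random target, so low error marks in-distribution behavior. The layer sizes below are assumptions.

```python
import torch
import torch.nn as nn

class RNDCuriosity(nn.Module):
    """Random Network Distillation: curiosity = prediction error of a trained
    predictor against a frozen random target net. Low error indicates
    in-distribution inputs, which can steer generation toward familiar
    behavior as in Curiosity-Diffuser."""
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters():   # target stays fixed
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def forward(self, obs):                  # per-sample curiosity score
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(dim=-1)
```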
[AI-23] Learning with Expert Abstractions for Efficient Multi-Task Continuous Control
链接: https://arxiv.org/abs/2503.14809
作者: Jeff Jewett,Sandhya Saisubramanian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6 figures. Submitted to RLC 2025. Code and experiments at this https URL
点击查看摘要
Abstract:Decision-making in complex, continuous multi-task environments is often hindered by the difficulty of obtaining accurate models for planning and the inefficiency of learning purely from trial and error. While precise environment dynamics may be hard to specify, human experts can often provide high-fidelity abstractions that capture the essential high-level structure of a task and user preferences in the target environment. Existing hierarchical approaches often target discrete settings and do not generalize across tasks. We propose a hierarchical reinforcement learning approach that addresses these limitations by dynamically planning over the expert-specified abstraction to generate subgoals to learn a goal-conditioned policy. To overcome the challenges of learning under sparse rewards, we shape the reward based on the optimal state value in the abstract model. This structured decision-making process enhances sample efficiency and facilitates zero-shot generalization. Our empirical evaluation on a suite of procedurally generated continuous control environments demonstrates that our approach outperforms existing hierarchical reinforcement learning methods in terms of sample efficiency, task completion rate, scalability to complex tasks, and generalization to novel scenarios.
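One standard way to shape rewards with an abstract model's optimal state value is potential-based shaping (Ng et al.), sketched below; whether the paper uses exactly this form is not stated in the abstract, so treat it as an illustration.

```python
def shaped_reward(r, s, s_next, abstract, v_star, gamma=0.99):
    """Potential-based reward shaping with the abstract model's optimal state
    value as the potential. `abstract` maps a concrete state to its abstract
    state; `v_star` holds the abstract optimal values. This classic form
    preserves the optimal policy."""
    phi = v_star[abstract(s)]            # potential of current abstract state
    phi_next = v_star[abstract(s_next)]
    return r + gamma * phi_next - phi
```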
[AI-24] Long Context Modeling with Ranked Memory-Augmented Retrieval
链接: https://arxiv.org/abs/2503.14800
作者: Ghadir Alselwi,Hao Xue,Shoaib Jameel,Basem Suleiman,Flora D. Salim,Imran Razzak
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Effective long-term memory management is crucial for language models handling extended contexts. We introduce a novel framework that dynamically ranks memory entries based on relevance. Unlike previous works, our model introduces novel relevance scoring and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques in information retrieval. Enhanced Ranked Memory Augmented Retrieval (ERMAR) achieves state-of-the-art results on standard benchmarks.
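A two-stage retrieve-then-rerank loop of the kind described can be sketched in a few lines; `scorer` stands in for the pointwise re-ranking model and is an assumed interface.

```python
import numpy as np

def rank_memory(query: np.ndarray, keys: np.ndarray, scorer, top_k: int = 8):
    """Two-stage memory retrieval: cheap cosine recall over key embeddings,
    then a pointwise re-ranker scores each surviving candidate."""
    sims = keys @ query / (np.linalg.norm(keys, axis=1)
                           * np.linalg.norm(query) + 1e-8)
    recall = np.argsort(-sims)[: top_k * 4]            # over-fetch candidates
    rescored = sorted(recall, key=lambda i: -scorer(query, keys[i]))
    return rescored[:top_k]                            # indices of entries
```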
[AI-25] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
链接: https://arxiv.org/abs/2503.14734
作者: NVIDIA,Johan Bjorck,Fernando Castañeda,Nikita Cherniadev,Xingye Da,Runyu Ding,Linxi “Jim” Fan,Yu Fang,Dieter Fox,Fengyuan Hu,Spencer Huang,Joel Jang,Zhenyu Jiang,Jan Kautz,Kaushil Kundalia,Lawrence Lao,Zhiqi Li,Zongyu Lin,Kevin Lin,Guilin Liu,Edith Llontop,Loic Magne,Ajay Mandlekar,Avnish Narayan,Soroush Nasiriany,Scott Reed,You Liang Tan,Guanzhi Wang,Zu Wang,Jing Wang,Qi Wang,Jiannan Xiang,Yuqi Xie,Yinzhen Xu,Zhenjia Xu,Seonghyeon Ye,Zhiding Yu,Ao Zhang,Hao Zhang,Yizhou Zhao,Ruijie Zheng,Yuke Zhu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Authors are listed alphabetically. Project leads are Linxi “Jim” Fan and Yuke Zhu
点击查看摘要
Abstract:General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
[AI-26] DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis
链接: https://arxiv.org/abs/2503.14681
作者: Chen Gong,Kecen Li,Zinan Lin,Tianhao Wang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: The first two authors contributed equally; code available at this https URL
点击查看摘要
Abstract:Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advancements, we find that inconsistent–and sometimes flawed–evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future advancements. To address the issue, this paper introduces DPImageBench for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each based on model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to thoroughly assess them. Notably, we find that a common practice of selecting downstream classifiers based on the highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects for these mistakes. (3) Platform. Despite the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we have several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between pretraining and sensitive images significantly impacts the performance of the synthetic images and does not always yield improvements. In addition, adding noise to low-dimensional features, such as the high-level characteristics of sensitive images, is less affected by the privacy budget compared to adding noise to high-dimensional features, like weight gradients. The former methods perform better than the latter under a low privacy budget.
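The noise-addition trade-off discussed at the end follows from the standard Gaussian mechanism, whose noise scale grows with the query's sensitivity and shrinks with the privacy budget. A minimal sketch (the classic calibration, stated for epsilon at most 1):

```python
import numpy as np

def gaussian_mechanism(v: np.ndarray, l2_sensitivity: float,
                       epsilon: float, delta: float) -> np.ndarray:
    """Standard Gaussian mechanism: add noise calibrated to the L2 sensitivity
    and the (epsilon, delta) budget. High-dimensional vectors like weight
    gradients tend to have larger sensitivity, hence more noise per budget."""
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return v + np.random.normal(0.0, sigma, size=v.shape)
```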
[AI-27] Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
链接: https://arxiv.org/abs/2503.14630
作者: Priscylla Silva,Evandro Costa
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Providing effective feedback is important for student learning in programming problem-solving. In this sense, Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. However, their reliability and ability to identify reasoning errors in student code remain not well understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models’ capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight the potential and limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.
[AI-28] Reducing False Ventricular Tachycardia Alarms in ICU Settings: A Machine Learning Approach ICML
链接: https://arxiv.org/abs/2503.14621
作者: Grace Funmilayo Farayola(University of Buckingham, Buckingham, UK),Akinyemi Sadeeq Akintola(Universidade NOVA de Lisboa, Lisbon, Portugal),Oluwole Fagbohun(Readrly Limited, London, UK),Chukwuka Michael Oforgu(Readrly Limited, London, UK),Bisola Faith Kayode(Independent Researcher, London, UK),Christian Chimezie(Independent Researcher, Bristol, UK),Temitope Kadri(Readrly Limited, London, UK),Abiola Oludotun(Readrly Limited, London, UK),Nelson Ogbeide(Independent Researcher, London, UK),Mgbame Michael(Hankali Intel, Lagos, Nigeria),Adeseye Ifaturoti(University of Greenwich, London, UK),Toyese Oloyede(Independent Researcher, Northampton, UK)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint, Accepted to the International Conference on Machine Learning Technologies (ICMLT 2025), Helsinki, Finland
点击查看摘要
Abstract:False arrhythmia alarms in intensive care units (ICUs) are a significant challenge, contributing to alarm fatigue and potentially compromising patient safety. Ventricular tachycardia (VT) alarms are particularly difficult to detect accurately due to their complex nature. This paper presents a machine learning approach to reduce false VT alarms using the VTaC dataset, a benchmark dataset of annotated VT alarms from ICU monitors. We extract time-domain and frequency-domain features from waveform data, preprocess the data, and train deep learning models to classify true and false VT alarms. Our results demonstrate high performance, with ROC-AUC scores exceeding 0.96 across various training configurations. This work highlights the potential of machine learning to improve the accuracy of VT alarm detection in clinical settings.
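A minimal sketch of time- and frequency-domain feature extraction from a waveform segment, of the kind described above; the specific feature set is illustrative, not the paper's.

```python
import numpy as np

def waveform_features(x: np.ndarray, fs: float = 250.0) -> dict:
    """Simple time- and frequency-domain features of a waveform segment x
    sampled at fs Hz (illustrative feature set)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return {
        "mean": float(x.mean()), "std": float(x.std()),
        "rms": float(np.sqrt((x ** 2).mean())),
        "zero_crossings": int(((x[:-1] * x[1:]) < 0).sum()),
        "dominant_freq_hz": float(freqs[1:][spec[1:].argmax()]),  # skip DC
        "spectral_energy": float(spec.sum()),
    }
```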
[AI-29] PHGNN: A Novel Prompted Hypergraph Neural Network to Diagnose Alzheimers Disease
链接: https://arxiv.org/abs/2503.14577
作者: Chenyu Liu,Luca Rossi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The accurate diagnosis of Alzheimer’s disease (AD) and prognosis of mild cognitive impairment (MCI) conversion are crucial for early intervention. However, existing multimodal methods face several challenges, from the heterogeneity of input data, to underexplored modality interactions, missing data due to patient dropouts, and limited data caused by the time-consuming and costly data collection process. In this paper, we propose a novel Prompted Hypergraph Neural Network (PHGNN) framework that addresses these limitations by integrating hypergraph based learning with prompt learning. Hypergraphs capture higher-order relationships between different modalities, while our prompt learning approach for hypergraphs, adapted from NLP, enables efficient training with limited data. Our model is validated through extensive experiments on the ADNI dataset, outperforming SOTA methods in both AD diagnosis and the prediction of MCI conversion.
[AI-30] SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas
链接: https://arxiv.org/abs/2503.14576
作者: Zihao Guo,Richard Willis,Shuqing Shi,Tristan Tomilin,Joel Z. Leibo,Yali Du
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 18 figures, 1 table
点击查看摘要
Abstract:Social dilemmas pose a significant challenge in the field of multi-agent reinforcement learning (MARL). Melting Pot is an extensive framework designed to evaluate social dilemma environments, providing an evaluation protocol that measures generalization to new social partners across various test scenarios. However, running reinforcement learning algorithms in the official Melting Pot environments demands substantial computational resources. In this paper, we introduce SocialJax, a suite of sequential social dilemma environments implemented in JAX. JAX is a high-performance numerical computing library for Python that enables significant improvements in the operational efficiency of SocialJax on GPUs and TPUs. Our experiments demonstrate that the training pipeline of SocialJax achieves a 50\times speedup in real-time performance compared to Melting Pot’s RLlib baselines. Additionally, we validate the effectiveness of baseline algorithms within the SocialJax environments. Finally, we use Schelling diagrams to verify the social dilemma properties of these environments, ensuring they accurately capture the dynamics of social dilemmas.
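The source of the speedup is JAX's ability to jit-compile and vectorize the environment step across a whole batch, replacing a Python loop with one fused call. A toy illustration (the `step` function here is not SocialJax's actual API):

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """Toy stateless transition standing in for an environment step."""
    return state + action, -jnp.abs(state)       # (next_state, reward)

batched_step = jax.jit(jax.vmap(step))           # vectorize over 1024 envs
states = jnp.zeros(1024)
actions = jnp.ones(1024)
next_states, rewards = batched_step(states, actions)
```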
[AI-31] Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation DATE
链接: https://arxiv.org/abs/2503.14572
作者: Justus Westerhoff,Golzar Atefi,Mario Koddenbrock,Alexei Figueroa,Alexander Löser,Erik Rodner,Felix A. Gers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Code: this https URL
点击查看摘要
Abstract:The capacity of a foundation model allows for adaptation to new downstream tasks. Weight imprinting is a universal and efficient method to fulfill this purpose. It has been reinvented several times, but it has not been systematically studied. In this paper, we propose a framework for imprinting, identifying three main components: generation, normalization, and aggregation. This allows us to conduct an in-depth analysis of imprinting and a comparison of the existing work. We reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. We determine those proxies through clustering and propose a novel variant of imprinting that outperforms previous work. We motivate this by the neural collapse phenomenon – an important connection that we can draw for the first time. Our results show an increase of up to 4% in challenging scenarios with complex data distributions for new classes.
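The generation-normalization-aggregation pipeline can be sketched concretely: cluster the novel class's embeddings into multiple proxies and L2-normalize them. This is a minimal sketch of the multi-proxy idea, not the paper's exact variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def imprint(embeddings: np.ndarray, n_proxies: int = 3) -> np.ndarray:
    """Weight imprinting with multiple proxies: generate proxies by clustering
    the novel class's embeddings, then normalize each proxy. Classification
    can score a query by its max cosine similarity over a class's proxies."""
    km = KMeans(n_clusters=n_proxies, n_init=10).fit(embeddings)
    proxies = km.cluster_centers_
    return proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
```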
[AI-32] Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance
链接: https://arxiv.org/abs/2503.14569
作者: Liya Guo,Zun Wang,Chang Liu,Junzhe Li,Pipi Hu,Yi Zhu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.
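The abstract's core idea admits a short sketch: at equilibrium p(x) \propto \exp(-U(x)/kT), so the score equals -\nabla U(x)/kT, and the potential-energy gradient can supervise a score model even on biased samples. The loss below is an illustration; the paper's actual objective may differ.

```python
import torch

def potential_score_loss(score_net, x, potential, kT=1.0):
    """Illustrative loss: regress a score model onto the Boltzmann score
    -grad U(x)/kT implied by the potential energy U, so biased samples can
    still supervise an unbiased score."""
    x = x.clone().requires_grad_(True)           # x: (batch, dim) configs
    grad_U, = torch.autograd.grad(potential(x).sum(), x)
    target = -grad_U / kT                        # Boltzmann score target
    return ((score_net(x) - target) ** 2).mean()
```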
[AI-33] SpecReX: Explainable AI for Raman Spectroscopy ALT AAAI
链接: https://arxiv.org/abs/2503.14567
作者: Nathan Blake,David A. Kelly,Akchunya Chanchal,Sarah Kapllani-Mucaj,Geraint Thomas,Hana Chockler
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注: AAAI Workshop on Health Intelligencee (W3PHIAI-25)
点击查看摘要
Abstract:Raman spectroscopy is becoming more common for medical diagnostics with deep learning models being increasingly used to leverage its full potential. However, the opaque nature of such models and the sensitivity of medical diagnosis together with regulatory requirements necessitate the need for explainable AI tools. We introduce SpecReX, specifically adapted to explaining Raman spectra. SpecReX uses the theory of actual causality to rank causal responsibility in a spectrum, quantified by iteratively refining mutated versions of the spectrum and testing if it retains the original classification. The explanations provided by SpecReX take the form of a responsibility map, highlighting spectral regions most responsible for the model to make a correct classification. To assess the validity of SpecReX, we create increasingly complex simulated spectra, in which a “ground truth” signal is seeded, to train a classifier. We then obtain SpecReX explanations and compare the results with another explainability tool. By using simulated spectra we establish that SpecReX localizes to the known differences between classes, under a number of conditions. This provides a foundation on which we can find the spectral features which differentiate disease classes. This is an important first step in proving the validity of SpecReX.
[AI-34] Workflow for Safe-AI
链接: https://arxiv.org/abs/2503.14563
作者: Suzana Veljanovska,Hans Dermot Doran
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The development and deployment of safe and dependable AI models is crucial in applications where functional safety is a key concern. Given the rapid advancement in AI research and the relative novelty of the safe-AI domain, there is an increasing need for a workflow that balances stability with adaptability. This work proposes a transparent, complete, yet flexible and lightweight workflow that highlights both reliability and qualifiability. The core idea is that the workflow must be qualifiable, which demands the use of qualified tools. Tool qualification is a resource-intensive process, both in terms of time and cost. We therefore place value on a lightweight workflow featuring a minimal number of tools with limited features. The workflow is built upon an extended ONNX model description allowing for validation of AI algorithms from their generation to runtime deployment. This validation is essential to ensure that models are validated before being reliably deployed across different runtimes, particularly in mixed-criticality systems. Keywords: AI workflows, safe-AI, dependable-AI, functional safety, V-model development
[AI-35] Generating Causal Explanations of Vehicular Agent Behavioural Interactions with Learnt Reward Profiles
链接: https://arxiv.org/abs/2503.14557
作者: Rhys Howard,Nick Hawes,Lars Kunze
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 8 Pages, 5 Figures, To be published in the Proceedings of the 2025 IEEE International Conference on Robotics Automation, Initial upload of accepted paper
点击查看摘要
Abstract:Transparency and explainability are important features that responsible autonomous vehicles should possess, particularly when interacting with humans, and causal reasoning offers a strong basis to provide these qualities. However, even if one assumes agents act to maximise some concept of reward, it is difficult to make accurate causal inferences of agent planning without capturing what is of importance to the agent. Thus our work aims to learn a weighting of reward metrics for agents such that explanations for agent interactions can be causally inferred. We validate our approach quantitatively and qualitatively across three real-world driving datasets, demonstrating a functional improvement over previous methods and competitive performance across evaluation metrics.
[AI-36] Designing and Deploying AI Models for Sustainable Logistics Optimization: A Case Study on Eco-Efficient Supply Chains in the USA
链接: https://arxiv.org/abs/2503.14556
作者: Reza E Rabbi Shawon,MD Rokibul Hasan,Md Anisur Rahman,Mohamed Ghandri,Iman Ahmed Lamari,Mohammed Kawsar,Rubi Akter
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid evolution of Artificial Intelligence (AI) and Machine Learning (ML) has significantly transformed logistics and supply chain management, particularly in the pursuit of sustainability and eco-efficiency. This study explores AI-based methodologies for optimizing logistics operations in the USA, focusing on reducing environmental impact, improving fuel efficiency, and minimizing costs. Key AI applications include predictive analytics for demand forecasting, route optimization through machine learning, and AI-powered fuel efficiency strategies. Supervised models such as Linear Regression, XGBoost, Support Vector Machine, and Neural Networks are applied to real-world logistics datasets to reduce carbon emissions, optimize travel routes to minimize distance and travel time, and predict future deliveries for route planning; clustering models such as K-Means and DBSCAN are likewise applied to route optimization. The study utilizes datasets from logistics companies’ databases and assesses model performance using metrics such as mean absolute error (MAE), mean squared error (MSE), and the R2 score. It also explores how these models can be deployed to various platforms for real-time logistics and supply chain use, and examines them through a case study highlighting best practices and regulatory frameworks that promote sustainability. The findings demonstrate AI’s potential to enhance logistics efficiency, reduce carbon footprints, and contribute to a more resilient and adaptive supply chain ecosystem.
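The three reported regression metrics are one-liners with scikit-learn, shown here on a held-out split:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred):
    """The three evaluation metrics the study reports."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
    }
```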
[AI-37] A Generalist Hanabi Agent
链接: https://arxiv.org/abs/2503.14555
作者: Arjun V Sudhakar,Hadi Nekoei,Mathieu Reymond,Miao Liu,Janarthanan Rajendran,Sarath Chandar
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents – agents that are themselves unable to do so. The implementation code is available at this https URL (R3D2-A-Generalist-Hanabi-Agent).
[AI-38] Sampling Decisions
链接: https://arxiv.org/abs/2503.14549
作者: Michael Chertkov,Sungsoo Ahn,Hamidreza Behjoo
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注: 6 pages, 3 figures
[AI-39] Inteligencia Artificial para la conservación y uso sostenible de la biodiversidad una visión desde Colombia (Artificial Intelligence for conservation and sustainable use of biodiversity a view from Colombia)
链接: https://arxiv.org/abs/2503.14543
作者: Juan Sebastián Cañas,Camila Parra-Guevara,Manuela Montoya-Castrillón,Julieta M Ramírez-Mejía,Gabriel-Alejandro Perilla,Esteban Marentes,Nerieth Leuro,Jose Vladimir Sandoval-Sierra,Sindy Martinez-Callejas,Angélica Díaz,Mario Murcia,Elkin A. Noguera-Urbano,Jose Manuel Ochoa-Quintero,Susana Rodríguez Buriticá,Juan Sebastián Ulloa
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
[AI-40] Accessibility Considerations in the Development of an AI Action Plan
链接: https://arxiv.org/abs/2503.14522
作者: Jennifer Mankoff,Janice Light,James Coughlan,Christian Vogler,Abraham Glasser,Gregg Vanderheiden,Laura Rice
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:We argue that there is a need for Accessibility to be represented in several important domains: - Capitalize on the new capabilities AI provides - Support for open source development of AI, which can allow disabled and disability-focused professionals to contribute, including - Development of Accessibility Apps which help realise the promise of AI in accessibility domains - Open Source Model Development and Validation to ensure that accessibility concerns are addressed in these algorithms - Data Augmentation to include accessibility in data sets used to train models - Accessible Interfaces that allow disabled people to use any AI app, and to validate its outputs - Dedicated Functionality and Libraries that can make it easy to integrate AI support into a variety of settings and apps. - Data security and privacy risks, including data collected by AI-based accessibility technologies and the possibility of disability disclosure. - Disability-specific AI risks and biases, including both direct bias (during AI use by the disabled person) and indirect bias (when AI is used by someone else on data relating to a disabled person).
[AI-41] Content ARCs: Decentralized Content Rights in the Age of Generative AI
链接: https://arxiv.org/abs/2503.14519
作者: Kar Balan,Andrew Gilbert,John Collomosse
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Image and Video Processing (eess.IV)
*备注:
[AI-42] Prompt Injection Attacks on Large Language Models in Oncology
链接: https://arxiv.org/abs/2407.18981
作者: Jan Clusmann,Dyke Ferber,Isabella C. Wiest,Carolin V. Schneider,Titus J. Brinker,Sebastian Foersch,Daniel Truhn,Jakob N. Kather
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 57 Pages, 5 Figures
点击查看摘要
Abstract:Vision-language artificial intelligence models (VLMs) possess medical knowledge and can be employed in healthcare in numerous ways, including as image interpreters, virtual scribes, and general decision support systems. However, here, we demonstrate that current VLMs applied to medical tasks exhibit a fundamental security flaw: they can be attacked by prompt injection attacks, which can be used to output harmful information just by interacting with the VLM, without any access to its parameters. We performed a quantitative study to evaluate the vulnerabilities to these attacks in four state of the art VLMs which have been proposed to be of utility in healthcare: Claude 3 Opus, Claude 3.5 Sonnet, Reka Core, and GPT-4o. Using a set of N=297 attacks, we show that all of these models are susceptible. Specifically, we show that embedding sub-visual prompts in medical imaging data can cause the model to provide harmful output, and that these prompts are non-obvious to human observers. Thus, our study demonstrates a key vulnerability in medical VLMs which should be mitigated before widespread clinical adoption.
[AI-43] An extensive simulation study evaluating the interaction of resampling techniques across multiple causal discovery contexts
链接: https://arxiv.org/abs/2503.15436
作者: Ritwick Banerjee,Bryan Andrews,Erich Kummerfeld
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Despite the accelerating presence of exploratory causal analysis in modern science and medicine, the available non-experimental methods for validating causal models are not well characterized. One of the most popular methods is to evaluate the stability of model features after resampling the data, similar to resampling methods for estimating confidence intervals in statistics. Many aspects of this approach have received little to no attention, however, such as whether the choice of resampling method should depend on the sample size, algorithms being used, or algorithm tuning parameters. We present theoretical results proving that certain resampling methods closely emulate the assignment of specific values to algorithm tuning parameters. We also report the results of extensive simulation experiments, which verify the theoretical result and provide substantial data to aid researchers in further characterizing resampling in the context of causal discovery analysis. Together, the theoretical work and simulation results provide specific guidance on how resampling methods and tuning parameters should be selected in practice.
[AI-44] Probing the topology of the space of tokens with structured prompts
链接: https://arxiv.org/abs/2503.15421
作者: Michael Robinson,Sourya Dey,Taisa Kushner
类目: Differential Geometry (math.DG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 5 figures
点击查看摘要
Abstract:This article presents a general and flexible method for prompting a large language model (LLM) to reveal its (hidden) token input embedding up to homeomorphism. Moreover, this article provides strong theoretical justification – a mathematical proof for generic LLMs – for why this method should be expected to work. With this method in hand, we demonstrate its effectiveness by recovering the token subspace of Llemma-7B. The results of this paper apply not only to LLMs but also to general nonlinear autoregressive processes.
[AI-45] A Foundational Theory for Decentralized Sensory Learning
链接: https://arxiv.org/abs/2503.15130
作者: Linus Mårtensson,Jonas M.D. Enander,Udaya B. Rongala,Henrik Jörntell
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In both neuroscience and artificial intelligence, popular functional frameworks and neural network formulations operate by making use of extrinsic error measurements and global learning algorithms. Through a set of conjectures based on evolutionary insights on the origin of cellular adaptive mechanisms, we reinterpret the core meaning of sensory signals to allow the brain to be interpreted as a negative feedback control system, and show how this could lead to local learning algorithms without the need for global error correction metrics. Thereby, a sufficiently good minimum in sensory activity can be the complete reward signal of the network, as well as being both necessary and sufficient for biological learning to arise. We show that this method of learning was likely already present in the earliest unicellular life forms on Earth. We show evidence that the same principle holds and scales to multicellular organisms where, in addition, it can lead to division of labour between cells. Available evidence shows that the evolution of the nervous system likely was an adaptation to more effectively communicate intercellular signals to support such division of labour. We therefore propose that the same learning principle that evolved already in the earliest unicellular life forms, i.e. negative feedback control of externally and internally generated sensor signals, has simply been scaled up to become a fundament of the learning we see in biological brains today. We illustrate diverse biological settings, from the earliest unicellular organisms to humans, where this operational principle appears to be a plausible interpretation of the meaning of sensor signals in biology, and how this relates to current neuroscientific theories and findings.
[AI-46] Teaching Artificial Intelligence to Perform Rapid Resolution-Invariant Grain Growth Modeling via Fourier Neural Operator
链接: https://arxiv.org/abs/2503.14568
作者: Iman Peivaste,Ahmed Makradi,Salim Belouettar
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Microstructural evolution, particularly grain growth, plays a critical role in shaping the physical, optical, and electronic properties of materials. Traditional phase-field modeling accurately simulates these phenomena but is computationally intensive, especially for large systems and fine spatial resolutions. While machine learning approaches have been employed to accelerate simulations, they often struggle with resolution dependence and generalization across different grain scales. This study introduces a novel approach utilizing Fourier Neural Operator (FNO) to achieve resolution-invariant modeling of microstructure evolution in multi-grain systems. FNO operates in the Fourier space and can inherently handle varying resolutions by learning mappings between function spaces. By integrating FNO with the phase field method, we developed a surrogate model that significantly reduces computational costs while maintaining high accuracy across different spatial scales. We generated a comprehensive dataset from phase-field simulations using the Fan Chen model, capturing grain evolution over time. Data preparation involved creating input-output pairs with a time shift, allowing the model to predict future microstructures based on current and past states. The FNO-based neural network was trained using sequences of microstructures and demonstrated remarkable accuracy in predicting long-term evolution, even for unseen configurations and higher-resolution grids not encountered during training.
[AI-47] Acceptance or Rejection of Lots while Minimizing and Controlling Type I and Type II Errors
链接: https://arxiv.org/abs/2503.14514
作者: Edson Luiz Ursini,Elaine Cristina Catapani Poletti,Loreno Menezes da Silveira,José Roberto Emiliano Leite
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:
机器学习
[LG-0] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
链接: https://arxiv.org/abs/2503.15478
作者: Yifei Zhou,Song Jiang,Yuandong Tian,Jason Weston,Sergey Levine,Sainbayar Sukhbaatar,Xian Li
类目: Machine Learning (cs.LG)
*备注: 29 pages, 16 figures
点击查看摘要
Abstract:Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs, and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.
[LG-1] Temporal Encoding Strategies for Energy Time Series Prediction
链接: https://arxiv.org/abs/2503.15456
作者: Aayam Bansal,Keertan Balaji,Zeus Lalani
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In contemporary power systems, energy consumption prediction plays a crucial role in maintaining grid stability and resource allocation enabling power companies to minimize energy waste and avoid overloading the grid. While there are several research works on energy optimization, they often fail to address the complexities of real-time fluctuations and the cyclic pattern of energy consumption. This work proposes a novel approach to enhance the accuracy of predictive models by employing sinusoidal encoding on periodic features of time-series data. To demonstrate the increase in performance, several statistical and ensemble machine learning models were trained on an energy demand dataset, using the proposed sinusoidal encoding. The performance of these models was then benchmarked against identical models trained on traditional encoding methods. The results demonstrated a 12.6% improvement in Root Mean Squared Error (from 0.5497 to 0.4802) and a 7.8% increase in the R^2 score (from 0.7530 to 0.8118), indicating that the proposed encoding better captures the cyclic nature of temporal patterns than traditional methods. The proposed methodology significantly improves prediction accuracy while maintaining computational efficiency, making it suitable for real-time applications in smart grid systems.
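下面给出周期特征正弦编码的一个极简 Python 示意(假设性代码,非该论文的官方实现;hour、dayofweek 等特征名与周期取值均为示例假设):

```python
import numpy as np
import pandas as pd

def sinusoidal_encode(values: pd.Series, period: float) -> pd.DataFrame:
    """Map a periodic feature onto the unit circle so that, e.g.,
    hour 23 and hour 0 end up close to each other."""
    angle = 2 * np.pi * values / period
    return pd.DataFrame({
        f"{values.name}_sin": np.sin(angle),
        f"{values.name}_cos": np.cos(angle),
    })

# Hypothetical energy-demand frame with raw periodic columns.
df = pd.DataFrame({"hour": [0, 6, 12, 23], "dayofweek": [0, 2, 4, 6]})
encoded = pd.concat(
    [sinusoidal_encode(df["hour"], 24), sinusoidal_encode(df["dayofweek"], 7)],
    axis=1,
)
print(encoded)
```

与直接把小时当作 0–23 的整数相比,这种编码避免了 23 点与 0 点在数值上"相距最远"的失真,这正是摘要中所说"更好地捕捉时间模式的周期性"的直观来源。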
[LG-2] Reducing Communication Overhead in Federated Learning for Network Anomaly Detection with Adaptive Client Selection
链接: https://arxiv.org/abs/2503.15448
作者: William Marfo,Deepak Tosh,Shirley Moore,Joshua Suetterlein,Joseph Manzano
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Communication overhead in federated learning (FL) poses a significant challenge for network anomaly detection systems, where diverse client configurations and network conditions impact efficiency and detection accuracy. Existing approaches attempt optimization individually but struggle to balance reduced overhead with performance. This paper presents an adaptive FL framework combining batch size optimization, client selection, and asynchronous updates for efficient anomaly detection. Using UNSW-NB15 for general network traffic and ROAD for automotive networks, our framework reduces communication overhead by 97.6% (700.0s to 16.8s) while maintaining comparable accuracy (95.10% vs. 95.12%). The Mann-Whitney U test confirms significant improvements (p < 0.05). Profiling analysis reveals efficiency gains via reduced GPU operations and memory transfers, ensuring robust detection across varying client conditions.
[LG-3] A discontinuity-capturing neural network with categorical embedding and its application to anisotropic elliptic interface problems
链接: https://arxiv.org/abs/2503.15441
作者: Wei-Fan Hu,Te-Sheng Lin,Ming-Chih Lai
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we propose a discontinuity-capturing shallow neural network with categorical embedding to represent piecewise smooth functions. The network comprises three hidden layers, a discontinuity-capturing layer, a categorical embedding layer, and a fully-connected layer. Under such a design, we show that a piecewise smooth function, even with a large number of pieces, can be approximated by a single neural network with high prediction accuracy. We then leverage the proposed network model to solve anisotropic elliptic interface problems. The network is trained by minimizing the mean squared error loss of the system. Our results show that, despite its simple and shallow structure, the proposed neural network model exhibits comparable efficiency and accuracy to traditional grid-based numerical methods.
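下面给出"类别嵌入 + 浅层网络表示分段光滑函数"这一思路的极简 PyTorch 示意(层宽、激活函数与损失均为假设,并非论文的原始网络配置;其作用仅是说明分片标签如何经嵌入层与坐标拼接后输入网络):

```python
import torch
import torch.nn as nn

class PiecewiseNet(nn.Module):
    """Minimal sketch: represent a piecewise smooth function by
    concatenating a learned embedding of the piece (sub-domain)
    label with the spatial coordinates."""
    def __init__(self, dim_in=2, n_pieces=4, emb_dim=4, width=64):
        super().__init__()
        self.embed = nn.Embedding(n_pieces, emb_dim)  # categorical embedding layer
        self.body = nn.Sequential(
            nn.Linear(dim_in + emb_dim, width), nn.Tanh(),  # width/activation assumed
            nn.Linear(width, 1),
        )

    def forward(self, x, piece_id):
        z = torch.cat([x, self.embed(piece_id)], dim=-1)
        return self.body(z)

net = PiecewiseNet()
x = torch.rand(8, 2)             # collocation points
pid = torch.randint(0, 4, (8,))  # which smooth piece each point belongs to
loss = net(x, pid).pow(2).mean() # placeholder MSE-style loss
loss.backward()
```

直观地看,嵌入向量为每个子区域提供了可学习的"身份坐标",使单个浅层网络也能在界面两侧表示不连续的取值。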
[LG-4] Exploiting Prior Knowledge in Preferential Learning of Individualized Autonomous Vehicle Driving Styles
链接: https://arxiv.org/abs/2503.15407
作者: Lukas Theiner,Sebastian Hirt,Alexander Steinke,Rolf Findeisen
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 6 figures, accepted for ECC 2025
点击查看摘要
Abstract:Trajectory planning for automated vehicles commonly employs optimization over a moving horizon - Model Predictive Control - where the cost function critically influences the resulting driving style. However, finding a suitable cost function that results in a driving style preferred by passengers remains an ongoing challenge. We employ preferential Bayesian optimization to learn the cost function by iteratively querying a passenger’s preference. Due to increasing dimensionality of the parameter space, preference learning approaches might struggle to find a suitable optimum with a limited number of experiments and expose the passenger to discomfort when exploring the parameter space. We address these challenges by incorporating prior knowledge into the preferential Bayesian optimization framework. Our method constructs a virtual decision maker from real-world human driving data to guide parameter sampling. In a simulation experiment, we achieve faster convergence of the prior-knowledge-informed learning procedure compared to existing preferential Bayesian optimization approaches and reduce the number of inadequate driving styles sampled.
[LG-5] Geometrically-Aware One-Shot Skill Transfer of Category-Level Objects
链接: https://arxiv.org/abs/2503.15371
作者: Cristiana de Farias,Luis Figueredo,Riddhiman Laha,Maxime Adjigble,Brahim Tamadazte,Rustam Stolkin,Sami Haddadin,Naresh Marturi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures
点击查看摘要
Abstract:Robotic manipulation of unfamiliar objects in new environments is challenging and requires extensive training or laborious pre-programming. We propose a new skill transfer framework, which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FM) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates a Task-Space Imitation Algorithm (TSIA) which generates smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
[LG-6] Online Imitation Learning for Manipulation via Decaying Relative Correction through Teleoperation
链接: https://arxiv.org/abs/2503.15368
作者: Cheng Pan,Hung Hon Cheng,Josie Hughes
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Teleoperated robotic manipulators enable the collection of demonstration data, which can be used to train control policies through imitation learning. However, such methods can require significant amounts of training data to develop robust policies or adapt them to new and unseen tasks. While expert feedback can significantly enhance policy performance, providing continuous feedback can be cognitively demanding and time-consuming for experts. To address this challenge, we propose to use a cable-driven teleoperation system which can provide spatial corrections with 6 degrees of freedom to the trajectories generated by a policy model. Specifically, we propose a correction method termed Decaying Relative Correction (DRC), which is based on the spatial offset vector provided by the expert, exists only temporarily, and reduces the intervention steps required by an expert. Our results demonstrate that DRC reduces the required expert intervention rate by 30% compared to a standard absolute corrective method. Furthermore, we show that integrating DRC within an online imitation learning framework rapidly increases the success rate of manipulation tasks such as raspberry harvesting and cloth wiping.
[LG-7] FedBEns: One-Shot Federated Learning based on Bayesian Ensemble
链接: https://arxiv.org/abs/2503.15367
作者: Jacopo Talpini,Marco Savi,Giovanni Neglia
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm leverages a mixture of Laplace approximations for the clients’ local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses.
[LG-8] Borsuk-Ulam and Replicable Learning of Large-Margin Halfspaces
链接: https://arxiv.org/abs/2503.15294
作者: Ari Blondal,Hamed Hatami,Pooya Hatami,Chavdar Lalov,Sivan Tretiak
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent advances in learning theory have established that, for total concepts, list replicability, global stability, differentially private (DP) learnability, and shared-randomness replicability coincide precisely with the finiteness of the Littlestone dimension. Does the same hold for partial concept classes? We answer this question by studying the large-margin half-spaces class, which has bounded Littlestone dimension and is purely DP-learnable and shared-randomness replicable even in high dimensions. We prove that the list replicability number of \gamma-margin half-spaces satisfies \[ \frac{d}{2} + 1 \le \mathrm{LR}(H_\gamma^d) \le d, \] which increases with the dimension d. This reveals a surprising separation for partial concepts: list replicability and global stability do not follow from bounded Littlestone dimension, DP-learnability, or shared-randomness replicability. By applying our main theorem, we also answer the following open problems.
- We prove that any disambiguation of an infinite-dimensional large-margin half-space to a total concept class has unbounded Littlestone dimension, answering an open question of Alon et al. (FOCS '21).
- We prove that the maximum list-replicability number of any finite set of points and homogeneous half-spaces in d-dimensional Euclidean space is d, resolving a problem of Chase et al. (FOCS '23).
- We prove that any disambiguation of the Gap Hamming Distance problem in the large gap regime has unbounded public-coin randomized communication complexity. This answers an open problem of Fang et al. (STOC '25).
We prove the lower bound via a topological argument involving the local Borsuk-Ulam theorem of Chase et al. (STOC '24). For the upper bound, we design a learning rule that relies on certain triangulations of the cross-polytope and recent results on the generalization properties of SVM.
[LG-9] Learning to quantify graph nodes
链接: https://arxiv.org/abs/2503.15267
作者: Alessio Micheli,Alejandro Moreo,Marco Podda,Fabrizio Sebastiani,William Simoni,Domenico Tortorella
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Network Quantification is the problem of estimating the class proportions in unlabeled subsets of graph nodes. When prior probability shift is at play, this task cannot be effectively addressed by first classifying the nodes and then counting the class predictions. In addition, unlike non-relational quantification on i.i.d. datapoints, Network Quantification demands enhanced flexibility to capture a broad range of connectivity patterns, resilience to the challenge of heterophily, and efficiency to scale to larger networks. To meet these stringent requirements we introduce XNQ, a novel method that synergizes the flexibility and efficiency of the unsupervised node embeddings computed by randomized recursive Graph Neural Networks, with an Expectation-Maximization algorithm that provides a robust quantification-aware adjustment to the output probabilities of a calibrated node classifier. We validate the design choices underpinning our method through comprehensive ablation experiments. In an extensive evaluation, we find that our approach consistently and significantly improves on the best Network Quantification methods to date, thereby setting the new state of the art for this challenging task. Simultaneously, it provides a training speed-up of up to 10x-100x over other graph learning based methods.
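摘要中的"quantification-aware adjustment"可以借助经典的先验偏移 EM 校正(Saerens 等人风格)来直观理解。下面是一个假设性示意,省略了 XNQ 中的随机递归图神经网络节点嵌入部分,仅演示如何在先验偏移下用 EM 迭代修正已校准分类器的输出概率:

```python
import numpy as np

def em_prior_adjust(posteriors, train_priors, n_iter=100, tol=1e-6):
    """EM re-estimation of class priors under prior probability shift
    (Saerens et al.-style); `posteriors` are calibrated P(y|x) from a
    classifier trained under `train_priors`."""
    priors = train_priors.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the ratio of current to training priors.
        w = posteriors * (priors / train_priors)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new priors are the mean of the corrected posteriors.
        new_priors = w.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:
            break
        priors = new_priors
    return priors

posteriors = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
print(em_prior_adjust(posteriors, train_priors=np.array([0.5, 0.5])))
```

这解释了为何"先分类再计数"在先验偏移下会系统性偏差,而量化方法需要对类先验本身进行重估计。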
[LG-10] ImputeGAP: A Comprehensive Library for Time Series Imputation
链接: https://arxiv.org/abs/2503.15250
作者: Quentin Nater,Mourad Khayati,Jacques Pasquier
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:With the prevalence of sensor failures, imputation–the process of estimating missing values–has emerged as the cornerstone of time series data preparation. While numerous imputation algorithms have been developed to address these data gaps, existing libraries provide limited support. Furthermore, they often lack the ability to simulate realistic patterns of time series missing data and fail to account for the impact of imputation on subsequent downstream analysis. This paper introduces ImputeGAP, a comprehensive library for time series imputation that supports a diverse range of imputation methods and modular missing data simulation catering to datasets with varying characteristics. The library includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, downstream evaluation, and compatibility with popular time series frameworks.
[LG-11] A Foundation Model for Patient Behavior Monitoring and Suicide Detection UAI2025
链接: https://arxiv.org/abs/2503.15221
作者: Rodrigo Oliver,Josué Pérez-Sabater,Leire Paz-Arbaizar,Alejandro Lancho,Antonio Artés,Pablo M. Olmos
类目: Machine Learning (cs.LG)
*备注: 10 pages (31 with appendices), 6 figures (13 with appendices); submitted to UAI 2025
点击查看摘要
Abstract:Foundation models (FMs) have achieved remarkable success across various domains, yet their adoption in healthcare remains limited. While significant advances have been made in medical imaging, genetic biomarkers, and time series from electronic health records, the potential of FMs for patient behavior monitoring through wearable devices remains underexplored. These datasets are inherently heterogeneous, multisource, and often exhibit high rates of missing data, posing unique challenges. This paper introduces a novel FM based on a modified vector quantized variational autoencoder (VQ-VAE), specifically designed to process real-world data from wearable devices. We demonstrate that our pretrained FM, trained on a broad cohort of psychiatric patients, performs downstream tasks via its latent representation without fine-tuning on a held-out cohort of suicidal patients. To illustrate this, we develop a probabilistic change-point detection algorithm for suicide detection and demonstrate the FM’s effectiveness in predicting emotional states. Our results show that the discrete latent structure of the VQ-VAE outperforms a state-of-the-art Informer architecture in unsupervised suicide detection, while matching its performance in supervised emotion prediction when the latent dimensionality is increased, though at the cost of reduced unsupervised accuracy. This trade-off highlights the need for future FMs to integrate hybrid discrete-continuous structures for balanced performance across tasks.
[LG-12] Kolmogorov-Arnold Network for Transistor Compact Modeling
链接: https://arxiv.org/abs/2503.15209
作者: Rodion Novkin,Hussam Amrouch
类目: Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
点击查看摘要
Abstract:Neural network (NN)-based transistor compact modeling has recently emerged as a transformative solution for accelerating device modeling and SPICE circuit simulations. However, conventional NN architectures, despite their widespread adoption in state-of-the-art methods, primarily function as black-box problem solvers. This lack of interpretability significantly limits their capacity to extract and convey meaningful insights into learned data patterns, posing a major barrier to their broader adoption in critical modeling tasks. This work introduces, for the first time, Kolmogorov-Arnold network (KAN) for the transistor - a groundbreaking NN architecture that seamlessly integrates interpretability with high precision in physics-based function modeling. We systematically evaluate the performance of KAN and Fourier KAN for FinFET compact modeling, benchmarking them against the golden industry-standard compact model and the widely used MLP architecture. Our results reveal that KAN and FKAN consistently achieve superior prediction accuracy for critical figures of merit, including gate current, drain charge, and source charge. Furthermore, we demonstrate and improve the unique ability of KAN to derive symbolic formulas from learned data patterns - a capability that not only enhances interpretability but also facilitates in-depth transistor analysis and optimization. This work highlights the transformative potential of KAN in bridging the gap between interpretability and precision in NN-driven transistor compact modeling. By providing a robust and transparent approach to transistor modeling, KAN represents a pivotal advancement for the semiconductor industry as it navigates the challenges of advanced technology scaling.
[LG-13] Partially Observable Reinforcement Learning with Memory Traces
链接: https://arxiv.org/abs/2503.15200
作者: Onno Eberhard,Michael Muehlebach,Claire Vernade
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with a finite window of observations quickly becomes intractable as the window length grows. In this work, we introduce memory traces. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the value errors achieved with memory traces for the class of Lipschitz continuous value estimates. We establish a close connection to the window approach, and demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control.
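memory trace 的核心即对观测序列做指数滑动平均,从而用固定大小的向量压缩整段历史。下面给出一个极简示意(λ 的取值与归一化细节均为假设,论文中的精确定义可能不同):

```python
import numpy as np

def memory_trace(observations, lam=0.9):
    """Exponential moving average of past observations: a compact,
    fixed-size summary of the full history (lambda is an assumed
    decay rate, analogous to eligibility traces)."""
    trace = np.zeros_like(observations[0], dtype=float)
    traces = []
    for obs in observations:
        trace = lam * trace + (1 - lam) * obs
        traces.append(trace.copy())
    return traces

obs_seq = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
for t, z in enumerate(memory_trace(obs_seq)):
    print(t, z)
```

与长度为 k 的滑动窗口相比,这种表示的维度不随历史长度增长,这正是摘要中样本效率优势的直观来源。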
[LG-14] Learning Topology Actions for Power Grid Control: A Graph-Based Soft-Label Imitation Learning Approach
链接: https://arxiv.org/abs/2503.15190
作者: Mohamed Hassouna,Clara Holzhüter,Malte Lehna,Matthijs de Jong,Jan Viebahn,Bernhard Sick,Christoph Scholz
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rising proportion of renewable energy in the electricity mix introduces significant operational challenges for power grid operators. Effective power grid management demands adaptive decision-making strategies capable of handling dynamic conditions. With the increase in complexity, more and more Deep Learning (DL) approaches have been proposed to find suitable grid topologies for congestion management. In this work, we contribute to this research by introducing a novel Imitation Learning (IL) approach that leverages soft labels derived from simulated topological action outcomes, thereby capturing multiple viable actions per state. Unlike traditional IL methods that rely on hard labels to enforce a single optimal action, our method constructs soft labels over actions, by leveraging effective actions that prove suitable in resolving grid congestion. To further enhance decision-making, we integrate Graph Neural Networks (GNNs) to encode the structural properties of power grids, ensuring that the topology-aware representations contribute to better agent performance. Our approach significantly outperforms state-of-the-art baselines, all of which use only topological actions, as well as feedforward and GNN-based architectures with hard labels. Most notably, it achieves 17% better performance than the greedy expert agent from which the imitation targets were derived.
[LG-15] Food Delivery Time Prediction in Indian Cities Using Machine Learning Models KR
链接: https://arxiv.org/abs/2503.15177
作者: Ananya Garg,Mohmmad Ayaan,Swara Parekh,Vikranth Udandarao
类目: Machine Learning (cs.LG)
*备注: for code implementation, check this https URL
点击查看摘要
Abstract:Accurate prediction of food delivery times significantly impacts customer satisfaction, operational efficiency, and profitability in food delivery services. However, existing studies primarily utilize static historical data and often overlook dynamic, real-time contextual factors crucial for precise prediction, particularly in densely populated Indian cities. This research addresses these gaps by integrating real-time contextual variables such as traffic density, weather conditions, local events, and geospatial data (restaurant and delivery location coordinates) into predictive models. We systematically compare various machine learning algorithms, including Linear Regression, Decision Trees, Bagging, Random Forest, XGBoost, and LightGBM, on a comprehensive food delivery dataset specific to Indian urban contexts. Rigorous data preprocessing and feature selection significantly enhanced model performance. Experimental results demonstrate that the LightGBM model achieves superior predictive accuracy, with an R2 score of 0.76 and Mean Squared Error (MSE) of 20.59, outperforming traditional baseline approaches. Our study thus provides actionable insights for improving logistics strategies in complex urban environments. The complete methodology and code are publicly available for reproducibility and further research.
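下面给出用 LightGBM 做配送时间回归的极简示意(特征与数据均为合成假设,仅演示摘要所述的建模流程,并不复现论文的数据集或 R2=0.76 的结果):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
n = 1000
# Synthetic contextual features; the column meanings mirror the kind of
# signals the paper describes (traffic, weather, distance), not its data.
X = np.column_stack([
    rng.uniform(0, 1, n),     # traffic density
    rng.uniform(0, 1, n),     # weather severity
    rng.uniform(0.5, 15, n),  # restaurant-to-customer distance (km)
])
y = 10 + 2.5 * X[:, 2] + 8 * X[:, 0] + rng.normal(0, 2, n)  # synthetic delivery time

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LGBMRegressor(n_estimators=200, learning_rate=0.05).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred), "R2:", r2_score(y_te, pred))
```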
[LG-16] Global Group Fairness in Federated Learning via Function Tracking AISTATS2025
链接: https://arxiv.org/abs/2503.15163
作者: Yves Rychener,Daniel Kuhn,Yifan Hu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
*备注: The paper is accepted to AISTATS 2025
点击查看摘要
Abstract:We investigate group fairness regularizers in federated learning, aiming to train a globally fair model in a distributed setting. Ensuring global fairness in distributed training presents unique challenges, as fairness regularizers typically involve probability metrics between distributions across all clients and are not naturally separable by client. To address this, we introduce a function-tracking scheme for the global fairness regularizer based on a Maximum Mean Discrepancy (MMD), which incurs a small communication overhead. This scheme seamlessly integrates into most federated learning algorithms while preserving rigorous convergence guarantees, as demonstrated in the context of FedAvg. Additionally, when enforcing differential privacy, the kernel-based MMD regularization enables straightforward analysis through a change of kernel, leveraging an intuitive interpretation of kernel convolution. Numerical experiments confirm our theoretical insights.
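基于 MMD 的群体公平正则项可实现如下(RBF 核与带宽均为假设;论文中该正则项通过 function tracking 在联邦客户端之间近似传递,这里只演示集中式情形下的 MMD 罚项本身):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between two samples with an RBF kernel; used here as
    a differentiable fairness penalty between the model outputs of two
    demographic groups (kernel choice and bandwidth are assumptions)."""
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Model scores for two protected groups; minimizing the penalty pushes
# their output distributions together.
scores_g0 = torch.randn(64, 1, requires_grad=True)
scores_g1 = torch.randn(48, 1) + 0.5
penalty = rbf_mmd2(scores_g0, scores_g1)
penalty.backward()
print(float(penalty))
```

由于 MMD 依赖两组分布之间的核均值差,它天然不可按客户端分解,这正是摘要中引入 function tracking 机制的动机。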
[LG-17] Preference Construction: A Bayesian Interactive Preference Elicitation Framework Based on Monte Carlo Tree Search
链接: https://arxiv.org/abs/2503.15150
作者: Yan Wang,Jiapeng Liu,Milosz Kadziński,Xiuwu Liao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present a novel preference learning framework to capture participant preferences efficiently within limited interaction rounds. It involves three main contributions. First, we develop a variational Bayesian approach to infer the participant’s preference model by estimating posterior distributions and managing uncertainty from limited information. Second, we propose an adaptive questioning policy that maximizes cumulative uncertainty reduction, formulating questioning as a finite Markov decision process and using Monte Carlo Tree Search to prioritize promising question trajectories. By considering long-term effects and leveraging the efficiency of the Bayesian approach, the policy avoids shortsightedness. Third, we apply the framework to Multiple Criteria Decision Aiding, with pairwise comparison as the preference information and an additive value function as the preference model. We integrate the reparameterization trick to address high-variance issues, enhancing robustness and efficiency. Computational studies on real-world and synthetic datasets demonstrate the framework’s practical usability, outperforming baselines in capturing preferences and achieving superior uncertainty reduction within limited interactions.
[LG-18] Machine learning surrogate models of many-body dispersion interactions in polymer melts
链接: https://arxiv.org/abs/2503.15149
作者: Zhaoxiang Shen,Raúl I. Sosa,Jakub Lengiewicz,Alexandre Tkatchenko,Stéphane P.A. Bordas
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
点击查看摘要
Abstract:Accurate prediction of many-body dispersion (MBD) interactions is essential for understanding the van der Waals forces that govern the behavior of many complex molecular systems. However, the high computational cost of MBD calculations limits their direct application in large-scale simulations. In this work, we introduce a machine learning surrogate model specifically designed to predict MBD forces in polymer melts, a system that demands accurate MBD description and offers structural advantages for machine learning approaches. Our model is based on a trimmed SchNet architecture that selectively retains the most relevant atomic connections and incorporates trainable radial basis functions for geometric encoding. We validate our surrogate model on datasets from polyethylene, polypropylene, and polyvinyl chloride melts, demonstrating high predictive accuracy and robust generalization across diverse polymer systems. In addition, the model captures key physical features, such as the characteristic decay behavior of MBD interactions, providing valuable insights for optimizing cutoff strategies. Characterized by high computational efficiency, our surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations.
[LG-19] DeCaFlow: A Deconfounding Causal Generative Model
链接: https://arxiv.org/abs/2503.15114
作者: Alejandro Almodóvar,Adrián Javaloy,Juan Parras,Santiago Zazo,Isabel Valera
类目: Machine Learning (cs.LG)
*备注: 32 pages, 22 figures. Under submission
点击查看摘要
Abstract:Causal generative models (CGMs) have recently emerged as capable approaches to simulate the causal mechanisms generating our observations, enabling causal inference. Unfortunately, existing approaches either are overly restrictive, assuming the absence of hidden confounders, or lack generality, being tailored to a particular query and graph. In this work, we introduce DeCaFlow, a CGM that accounts for hidden confounders in a single amortized training process using only observational data and the causal graph. Importantly, DeCaFlow can provably identify all causal queries with a valid adjustment set or sufficiently informative proxy variables. Remarkably, for the first time to our knowledge, we show that a confounded counterfactual query is identifiable, and thus solvable by DeCaFlow, as long as its interventional counterpart is as well. Our empirical results on diverse settings (including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries) show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box flexibility.
[LG-20] FedLWS: Federated Learning with Adaptive Layer-wise Weight Shrinking ICLR2025
链接: https://arxiv.org/abs/2503.15111
作者: Changlong Shi,Jinmeng Li,He Zhao,Dan dan Guo,Yi Chang
类目: Machine Learning (cs.LG)
*备注: Accepted in ICLR 2025
点击查看摘要
Abstract:In Federated Learning (FL), weighted aggregation of local models is conducted to generate a new global model, and the aggregation weights are typically normalized to 1. A recent study identifies the global weight shrinking effect in FL, indicating an enhancement in the global model's generalization when the sum of weights (i.e., the shrinking factor) is smaller than 1, making how to learn the shrinking factor a crucial question. However, principled approaches to learning this factor have not been carefully studied with adequate consideration of privacy concerns and layer-wise distinctions. To this end, we propose a novel model aggregation strategy, Federated Learning with Adaptive Layer-wise Weight Shrinking (FedLWS), which adaptively designs the shrinking factor in a layer-wise manner and avoids optimizing the shrinking factors on a proxy dataset. We initially explored the factors affecting the shrinking factor during the training process. Then we calculate the layer-wise shrinking factors by considering the distinctions among each layer of the global model. FedLWS can be easily incorporated with various existing methods due to its flexibility. Extensive experiments under diverse scenarios demonstrate the superiority of our method over several state-of-the-art approaches, providing a promising tool for enhancing the global model in FL.
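FedLWS 聚合步骤可示意如下(逐层收缩因子 gamma_l 在此处取占位常数;论文中自适应计算 gamma_l 的规则未在此复现):

```python
import torch

def fedlws_style_aggregate(client_states, client_weights, gammas):
    """Weighted FedAvg followed by a per-layer shrinking factor
    gamma_l <= 1. The adaptive rule FedLWS uses to choose gamma_l is
    not reproduced here; the gammas below are placeholders."""
    agg = {}
    for key in client_states[0].keys():
        avg = sum(w * s[key] for w, s in zip(client_weights, client_states))
        agg[key] = gammas[key] * avg  # layer-wise shrinking
    return agg

# Two toy "clients", each holding a two-layer model state dict.
c1 = {"fc1.weight": torch.ones(2, 2), "fc2.weight": torch.ones(1, 2)}
c2 = {"fc1.weight": 3 * torch.ones(2, 2), "fc2.weight": 3 * torch.ones(1, 2)}
global_state = fedlws_style_aggregate(
    [c1, c2], client_weights=[0.5, 0.5],
    gammas={"fc1.weight": 0.98, "fc2.weight": 0.95},
)
print(global_state["fc1.weight"])
```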
[LG-21] Control Optimal Transport and Neural Differential Equations in Supervised Learning
链接: https://arxiv.org/abs/2503.15105
作者: Minh-Nhat Phung,Minh-Binh Tran
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:From the perspective of control theory, neural differential equations (neural ODEs) have become an important tool for supervised learning. In the fundamental work of Ruiz-Balet and Zuazua (SIAM REVIEW 2023), the authors pose an open problem regarding the connection between control theory, optimal transport theory, and neural differential equations. More precisely, they inquire how one can quantify the closeness of the optimal flows in neural transport equations to the true dynamic optimal transport. In this work, we propose a construction of neural differential equations that converge to the true dynamic optimal transport in the limit, providing a significant step in solving the formerly mentioned open problem.
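作为背景,动态最优传输的 Benamou-Brenier 标准形式如下(教科书公式,非论文原文记号);论文构造的神经 ODE 流正是在极限下逼近该问题的最优解:

```latex
\[
W_2^2(\mu_0,\mu_1)
  = \min_{(\rho,\,v)} \int_0^1 \!\int_{\mathbb{R}^d} \|v(x,t)\|^2 \,\rho(x,t)\,dx\,dt
\quad \text{s.t.} \quad
\partial_t \rho + \nabla \cdot (\rho v) = 0,\qquad
\rho(\cdot,0)=\mu_0,\quad \rho(\cdot,1)=\mu_1.
\]
```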
[LG-22] Continual Contrastive Learning on Tabular Data with Out of Distribution
链接: https://arxiv.org/abs/2503.15089
作者: Achmad Ginanjar,Xue Li,Priyanka Singh,Wen Hua
类目: Machine Learning (cs.LG)
*备注: Accepted at ESANN 2025
点击查看摘要
Abstract:Out-of-distribution (OOD) prediction remains a significant challenge in machine learning, particularly for tabular data where traditional methods often fail to generalize beyond their training distribution. This paper introduces Tabular Continual Contrastive Learning (TCCL), a novel framework designed to address OOD challenges in tabular data processing. TCCL integrates contrastive learning principles with continual learning mechanisms, featuring a three-component architecture: an Encoder for data transformation, a Decoder for representation learning, and a Learner Head. We evaluate TCCL against 14 baseline models, including state-of-the-art deep learning approaches and gradient-boosted decision trees (GBDT), across eight diverse tabular datasets. Our experimental results demonstrate that TCCL consistently outperforms existing methods in both classification and regression tasks on OOD data, with particular strength in handling distribution shifts. These findings suggest that TCCL represents a significant advancement in handling OOD scenarios for tabular data.
[LG-23] Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence
链接: https://arxiv.org/abs/2503.15036
作者: Satyajeet Sahoo,Jhareswar Maiti,Virendra Kumar Tewari
类目: Machine Learning (cs.LG)
*备注: 12 pages
点击查看摘要
Abstract:An important aspect of text mining involves information retrieval in form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from difficulty in topic interpretability and reduced performance in shorter texts. Here we propose a novel Multivariate Gaussian Topic modelling (MGD) approach. In this approach topics are presented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. Using EM algorithm, the various constituent Multivariate Gaussian Distributions and their corresponding parameters are identified. Analysis of the parameters helps identify the keywords having the highest variance and mean contributions to the topic, and from these key-words topic annotations are carried out. This approach is first applied on a synthetic dataset to demonstrate the interpretability benefits vis-à-vis LDA. A real-world application of this topic model is demonstrated in analysis of risks and hazards at a petrochemical plant by applying the model on safety incident reports to identify the major latent hazards plaguing the plant. This model achieves a higher mean topic coherence of 0.436 vis-à-vis 0.294 for LDA.
[LG-24] Scalable Trajectory-User Linking with Dual-Stream Representation Networks AAAI2025
链接: https://arxiv.org/abs/2503.15002
作者: Hao Zhang,Wei Chen,Xingyu Zhao,Jianpeng Qi,Guiyuan Jiang,Yanwei Yu
类目: Machine Learning (cs.LG)
*备注: The paper has been accepted by AAAI 2025
点击查看摘要
Abstract:Trajectory-user linking (TUL) aims to match anonymous trajectories to the most likely users who generated them, offering benefits for a wide range of real-world spatio-temporal applications. However, existing TUL methods are limited by high model complexity and poor learning of the effective representations of trajectories, rendering them ineffective in handling large-scale user trajectory data. In this work, we propose a novel Scalable Trajectory-User Linking method with dual-stream representation networks for the large-scale TUL problem, named ScaleTUL. Specifically, ScaleTUL generates two views using temporal and spatial augmentations to exploit a supervised contrastive learning framework to effectively capture the irregularities of trajectories. In each view, a dual-stream trajectory encoder, consisting of a long-term encoder and a short-term encoder, is designed to learn unified trajectory representations that fuse different temporal-spatial dependencies. Then, a TUL layer is used to associate the trajectories with the corresponding users in the representation space using a two-stage training model. Experimental results on check-in mobility datasets from three real-world cities and the nationwide U.S. demonstrate the superiority of ScaleTUL over state-of-the-art baselines for large-scale TUL tasks.
[LG-25] Embedding spatial context in urban traffic forecasting with contrastive pre-training
链接: https://arxiv.org/abs/2503.14980
作者: Matthew Low,Arian Prabowo,Hao Xue,Flora Salim
类目: Machine Learning (cs.LG)
*备注: 21 pages with references, 10 figures
点击查看摘要
Abstract:Urban traffic forecasting is a commonly encountered problem, with wide-ranging applications in fields such as urban planning, civil engineering and transport. In this paper, we study the enhancement of traffic forecasting with pre-training, focusing on spatio-temporal graph methods. While various machine learning methods to solve traffic forecasting problems have been explored and extensively studied, there remains a gap for a more contextual approach: studying how relevant non-traffic data can improve prediction performance on traffic forecasting problems. We call this data spatial context. We introduce a novel method of combining road and traffic information through the notion of a traffic quotient graph, a quotient graph formed from road geometry and traffic sensors. We also define a way to encode this relationship in the form of a geometric encoder, pre-trained using contrastive learning methods and enhanced with OpenStreetMap data. We introduce and discuss ways to integrate this geometric encoder with existing graph neural network (GNN)-based traffic forecasting models, using a contrastive pre-training paradigm. We demonstrate the potential for this hybrid model to improve generalisation and performance with zero additional traffic data. Code for this paper is available at this https URL.
[LG-26] Continual Multimodal Contrastive Learning
链接: https://arxiv.org/abs/2503.14963
作者: Xiaohao Liu,Xiaobo Xia,See-Kiong Ng,Tat-Seng Chua
类目: Machine Learning (cs.LG)
*备注: 36 pages, 9 figures, 4 tables
点击查看摘要
Abstract:Multimodal contrastive learning (MCL) advances in aligning different modalities and generating multimodal representations in a joint space. By leveraging contrastive learning across diverse modalities, large-scale multimodal data enhances representational quality. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. Instead, emergent multimodal data can be used to optimize existing models gradually, i.e., models are trained on a sequence of modality pair data. We define this problem as Continual Multimodal Contrastive Learning (CMCL), an underexplored yet crucial research direction at the intersection of multimodal and continual learning. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method, which projects updated gradients from dual sides onto subspaces where any gradient is prevented from interfering with the previously learned knowledge. Two upper bounds provide theoretical insights on both stability and plasticity in our solution. Beyond our theoretical contributions, we conduct experiments on multiple datasets by comparing our method against advanced continual learning baselines. The empirical results further support our claims and demonstrate the efficacy of our method. The code will be publicly available.
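"将新梯度投影到不干扰旧知识的子空间"这一思想可用经典的正交梯度投影(OGD 风格)示意如下;CMCL 实际采用的双侧投影目标由其理论推导给出,细节有所不同:

```python
import torch

def project_out(grad, basis):
    """Remove from `grad` its components along an orthonormal basis of
    directions deemed important for previously learned tasks (OGD-style
    sketch; CMCL's actual dual-side projection differs in detail)."""
    g = grad.clone()
    for u in basis:
        g -= torch.dot(g, u) * u
    return g

g = torch.tensor([1.0, 2.0, 3.0])
protected = [torch.tensor([1.0, 0.0, 0.0])]  # directions to preserve
print(project_out(g, protected))             # -> [0., 2., 3.]
```

投影后的梯度与受保护方向正交,因此沿该方向已学得的表示在更新后保持不变,这对应摘要中的"稳定性"原则。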
[LG-27] Proceedings of the 3rd Italian Conference on Big Data and Data Science (ITADATA2024)
链接: https://arxiv.org/abs/2503.14937
作者: Nicola Bena,Claudia Diamantini,Michela Natilli,Luigi Romano,Giovanni Stilo,Valentina Pansanella,Claudio A. Ardagna,Anna Monreale,Roberto Trasarti
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Proceedings of the 3rd Italian Conference on Big Data and Data Science (ITADATA2024), held in Pisa, Italy, September 17-19, 2024. The Italian Conference on Big Data and Data Science (ITADATA2024) is the annual event supported by the CINI Big Data National Laboratory and ISTI CNR that aims to put together Italian researchers and professionals from academia, industry, government, and public administration working in the field of big data and data science, as well as related fields (e.g., security and privacy, HPC, Cloud). ITADATA2024 covered research on all theoretical and practical aspects of Big Data and data science including data governance, data processing, data analysis, data reporting, data protection, as well as experimental studies and lessons learned. In particular, ITADATA2024 focused on:
- Data spaces
- Data processing life cycle
- Machine learning and Large Language Models
- Applications of big data and data science in healthcare, finance, industry 5.0, and beyond
- Data science for social network analysis
[LG-28] Enhancing Code LLM Training with Programmer Attention
链接: https://arxiv.org/abs/2503.14936
作者: Yifan Zhang,Chen Huang,Zachary Karas,Dung Thuy Nguyen,Kevin Leach,Yu Huang
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Human attention provides valuable yet underexploited signals for code LLM training, offering a perspective beyond purely machine-driven attention. Despite the complexity and cost of collecting eye-tracking data, there has also been limited progress in systematically using these signals for code LLM training. To address both issues, we propose a cohesive pipeline spanning augmentation and reward-based fine-tuning. Specifically, we introduce (1) an eye-tracking path augmentation method to expand programmer attention datasets, (2) a pattern abstraction step that refines raw fixations into learnable attention motifs, and (3) a reward-guided strategy for integrating these insights directly into a CodeT5 supervised fine-tuning process. Our experiments yield +7.16 in CodeBLEU on the CodeXGlue benchmark for code summarization, underscoring how uniting human and machine attention can boost code intelligence. We hope this work encourages broader exploration of human-centric methods in next-generation AI4SE.
[LG-29] Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices
Link: https://arxiv.org/abs/2503.14932
Authors: Ziyao Wang,Yexiao He,Zheyu Shen,Yu Li,Guoheng Sun,Myungjin Lee,Ang Li
Subjects: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:In recent years, Large Language Models (LLMs) have demonstrated remarkable abilities in various natural language processing tasks. However, adapting these models to specialized domains using private datasets stored on resource-constrained edge devices, such as smartphones and personal computers, remains challenging due to significant privacy concerns and limited computational resources. Existing model adaptation methods either compromise data privacy by requiring data transmission or jeopardize model privacy by exposing proprietary LLM parameters. To address these challenges, we propose Prada, a novel privacy-preserving and efficient black-box LLM adaptation system using private on-device datasets. Prada employs a lightweight proxy model fine-tuned with Low-Rank Adaptation (LoRA) locally on user devices. During inference, Prada leverages the logits offset, i.e., the difference in outputs between the base and adapted proxy models, to iteratively refine outputs from a remote black-box LLM. This offset-based adaptation approach preserves both data privacy and model privacy, as there is no need to share sensitive data or proprietary model parameters. Furthermore, we incorporate speculative decoding to further speed up inference, making Prada practically deployable on bandwidth-constrained edge devices. Extensive experiments on various downstream tasks demonstrate that Prada achieves performance comparable to centralized fine-tuning methods while significantly reducing computational overhead by up to 60% and communication costs by up to 80%.
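The logits-offset idea is easy to state concretely. Below is a minimal sketch assuming simple tensors stand in for the three models' next-token logits; the function name and the `alpha` weighting are illustrative, not Prada's actual interface.

```python
import torch

def offset_adapted_logits(remote_logits, proxy_base_logits,
                          proxy_adapted_logits, alpha=1.0):
    """Black-box adaptation via logits offset: shift the remote LLM's
    next-token logits by the difference between the LoRA-adapted and
    base proxy models."""
    return remote_logits + alpha * (proxy_adapted_logits - proxy_base_logits)

# Toy next-token logits over a 5-token vocabulary.
remote = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])        # black-box LLM
proxy_base = torch.tensor([1.5, 0.8, 0.2, 0.0, -0.5])
proxy_adapted = torch.tensor([0.5, 2.0, 0.2, 0.0, -0.5])  # after local LoRA

adapted = offset_adapted_logits(remote, proxy_base, proxy_adapted)
print(torch.softmax(adapted, dim=-1))  # probability mass moves toward token 1
```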
[LG-30] ACE: A Cardinality Estimator for Set-Valued Queries VLDB
Link: https://arxiv.org/abs/2503.14929
Authors: Yufan Sheng,Xin Cao,Kaiqi Zhao,Yixiang Fang,Jianzhong Qi,Wenjie Zhang,Christian S. Jensen
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
*Comments: This paper has been accepted by PVLDB Vol 18
Click to view abstract
Abstract:Cardinality estimation is a fundamental functionality in database systems. Most existing cardinality estimators focus on handling predicates over numeric or categorical data. They have largely omitted an important data type, set-valued data, which frequently occur in contemporary applications such as information retrieval and recommender systems. The few existing estimators for such data either favor high-frequency elements or rely on a partial independence assumption, which limits their practical applicability. We propose ACE, an Attention-based Cardinality Estimator for estimating the cardinality of queries over set-valued data. We first design a distillation-based data encoder to condense the dataset into a compact matrix. We then design an attention-based query analyzer to capture correlations among query elements. To handle variable-sized queries, a pooling module is introduced, followed by a regression model (MLP) to generate final cardinality estimates. We evaluate ACE on three datasets with varying query element distributions, demonstrating that ACE outperforms the state-of-the-art competitors in terms of both accuracy and efficiency.
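A minimal sketch of that embed, self-attend, pool, and regress pipeline follows, in PyTorch. The sizes, the mean-pooling choice, and the omission of ACE's distillation-based data encoder are simplifying assumptions.

```python
import torch
import torch.nn as nn

class SetCardinalityEstimator(nn.Module):
    """ACE-like sketch: embed query elements, let self-attention capture
    their correlations, pool to a fixed-size vector, and regress the
    (log) cardinality with an MLP."""
    def __init__(self, n_elements=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_elements, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query_elements, padding_mask=None):
        x = self.embed(query_elements)                      # (B, L, dim)
        h, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        pooled = h.mean(dim=1)                              # handles variable L
        return self.mlp(pooled).squeeze(-1)                 # predicted log-cardinality

model = SetCardinalityEstimator()
queries = torch.randint(0, 1000, (8, 5))  # batch of 8 queries, 5 elements each
print(model(queries).shape)  # torch.Size([8])
```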
[LG-31] Semi-Gradient SARSA Routing with Theoretical Guarantee on Traffic Stability and Weight Convergence
Link: https://arxiv.org/abs/2503.14927
Authors: Yidan Wu,Yu Yu,Jianan Zhang,Li Jin
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*Comments: arXiv admin note: text overlap with arXiv:2404.09188
Click to view abstract
Abstract:We consider the traffic control problem of dynamic routing over parallel servers, which arises in a variety of engineering systems such as transportation and data transmission. We propose a semi-gradient, on-policy algorithm that learns an approximate optimal routing policy. The algorithm uses generic basis functions with flexible weights to approximate the value function across the unbounded state space. Consequently, the training process lacks Lipschitz continuity of the gradient, boundedness of the temporal-difference error, and a prior guarantee on ergodicity, which are the standard prerequisites in existing literature on reinforcement learning theory. To address this, we combine a Lyapunov approach and an ordinary differential equation-based method to jointly characterize the behavior of traffic state and approximation weights. Our theoretical analysis proves that the training scheme guarantees traffic state stability and ensures almost surely convergence of the weights to the approximate optimum. We also demonstrate via simulations that our algorithm attains significantly faster convergence than neural network-based methods with an insignificant approximation error.
[LG-32] pFedFair: Towards Optimal Group Fairness-Accuracy Trade-off in Heterogeneous Federated Learning
Link: https://arxiv.org/abs/2503.14925
Authors: Haoyu Lei,Shizhan Gong,Qi Dou,Farzan Farnia
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Federated learning (FL) algorithms commonly aim to maximize clients’ accuracy by training a model on their collective data. However, in several FL applications, the model’s decisions should meet a group fairness constraint to be independent of sensitive attributes such as gender or race. While such group fairness constraints can be incorporated into the objective function of the FL optimization problem, in this work, we show that such an approach would lead to suboptimal classification accuracy in an FL setting with heterogeneous client distributions. To achieve an optimal accuracy-group fairness trade-off, we propose the Personalized Federated Learning for Client-Level Group Fairness (pFedFair) framework, where clients locally impose their fairness constraints over the distributed training process. Leveraging the image embedding models, we extend the application of pFedFair to computer vision settings, where we numerically show that pFedFair achieves an optimal group fairness-accuracy trade-off in heterogeneous FL settings. We present the results of several numerical experiments on benchmark and synthetic datasets, which highlight the suboptimality of non-personalized FL algorithms and the improvements made by the pFedFair method.
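A sketch of the client-side objective follows: each client adds its own group-fairness penalty (demographic parity here) to its local loss. The penalty form and the `lam` weight are illustrative assumptions, not pFedFair's exact formulation.

```python
import torch
import torch.nn.functional as F

def demographic_parity_gap(probs, sensitive):
    """Absolute gap in mean positive-prediction rate between the two
    groups encoded in `sensitive` (0/1); assumes both groups appear."""
    rate0 = probs[sensitive == 0].mean()
    rate1 = probs[sensitive == 1].mean()
    return (rate0 - rate1).abs()

def client_loss(logits, labels, sensitive, lam=1.0):
    # Each client imposes its fairness constraint locally, expressed
    # here as a soft penalty on its own data distribution.
    probs = torch.sigmoid(logits)
    return (F.binary_cross_entropy_with_logits(logits, labels.float())
            + lam * demographic_parity_gap(probs, sensitive))

logits = torch.randn(32, requires_grad=True)
labels = torch.randint(0, 2, (32,))
sensitive = torch.randint(0, 2, (32,))
client_loss(logits, labels, sensitive).backward()
```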
[LG-33] Pseudo-Relevance Feedback Can Improve Zero-Shot LLM-Based Dense Retrieval
Link: https://arxiv.org/abs/2503.14887
Authors: Hang Li,Xiao Wang,Bevan Koopman,Guido Zuccon
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Pseudo-relevance feedback (PRF) refines queries by leveraging initially retrieved documents to improve retrieval effectiveness. In this paper, we investigate how large language models (LLMs) can facilitate PRF for zero-shot LLM-based dense retrieval, extending the recently proposed PromptReps method. Specifically, our approach uses LLMs to extract salient passage features-such as keywords and summaries-from top-ranked documents, which are then integrated into PromptReps to produce enhanced query representations. Experiments on passage retrieval benchmarks demonstrate that incorporating PRF significantly boosts retrieval performance. Notably, smaller rankers with PRF can match the effectiveness of larger rankers without PRF, highlighting PRF’s potential to improve LLM-driven search while maintaining an efficient balance between effectiveness and resource usage.
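A minimal sketch of the feedback loop described above, with the LLM abstracted as a text-in/text-out callable; the prompts and the way features are folded back into the query are assumptions, not PromptReps' exact mechanics.

```python
def prf_enhanced_query(query, top_docs, llm):
    """LLM-based pseudo-relevance feedback: extract salient features
    (keywords, a summary) from top-ranked documents and fold them into
    the query text before it is re-embedded."""
    context = "\n\n".join(top_docs[:3])
    keywords = llm(f"List the key terms in these passages:\n{context}")
    summary = llm(f"Summarize these passages in one sentence:\n{context}")
    # The enriched text replaces the raw query at embedding time.
    return f"{query}\nRelevant terms: {keywords}\nContext: {summary}"

# Usage with a stub LLM (replace with a real model client).
fake_llm = lambda prompt: "dense retrieval, feedback"
print(prf_enhanced_query("what is PRF?", ["PRF refines queries ..."], fake_llm))
```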
[LG-34] Robust Support Vector Machines for Imbalanced and Noisy Data via Benders Decomposition
Link: https://arxiv.org/abs/2503.14873
Authors: Seyed Mojtaba Mohasel,Hamidreza Koosha
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:This study introduces a novel formulation to enhance Support Vector Machines (SVMs) in handling class imbalance and noise. Unlike the conventional Soft Margin SVM, which penalizes the magnitude of constraint violations, the proposed model quantifies the number of violations and aims to minimize their frequency. To achieve this, a binary variable is incorporated into the objective function of the primal SVM formulation, replacing the traditional slack variable. Furthermore, each misclassified sample is assigned a priority and an associated constraint. The resulting formulation is a mixed-integer programming model, efficiently solved using Benders decomposition. The proposed model’s performance was benchmarked against existing models, including Soft Margin SVM, weighted SVM, and NuSVC. Two primary hypotheses were examined: 1) The proposed model improves the F1-score for the minority class in imbalanced classification tasks. 2) The proposed model enhances classification accuracy in noisy datasets. These hypotheses were evaluated using a Wilcoxon test across multiple publicly available datasets from the OpenML repository. The results supported both hypotheses (p < 0.05). In addition, the proposed model exhibited several interesting properties, such as improved robustness to noise, a decision boundary shift favoring the minority class, a reduced number of support vectors, and decreased prediction time. The open-source Python implementation of the proposed SVM model is available.
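For concreteness, here is a toy version of the count-based formulation as a mixed-integer program written with PuLP. The big-M linking, the L1 stand-in for the margin penalty, and solving the full model directly (rather than via Benders decomposition, and without the per-sample priorities) are simplifications for illustration.

```python
# pip install pulp
import pulp

def count_violation_svm(X, y, M=100.0, C=1.0):
    """Binary variable z_i marks a margin violation; minimize the
    *number* of violations plus an L1 weight penalty."""
    n, d = len(X), len(X[0])
    prob = pulp.LpProblem("count_svm", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w{j}") for j in range(d)]
    t = [pulp.LpVariable(f"t{j}", lowBound=0) for j in range(d)]  # |w_j|
    b = pulp.LpVariable("b")
    z = [pulp.LpVariable(f"z{i}", cat=pulp.LpBinary) for i in range(n)]
    prob += pulp.lpSum(z) + C * pulp.lpSum(t)   # objective
    for j in range(d):                          # linearize |w_j|
        prob += t[j] >= w[j]
        prob += t[j] >= -w[j]
    for i in range(n):  # y_i (w.x_i + b) >= 1 unless z_i = 1
        prob += y[i] * (pulp.lpSum(w[j] * X[i][j] for j in range(d)) + b) >= 1 - M * z[i]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [v.value() for v in w], b.value(), sum(v.value() for v in z)

X = [[1, 2], [2, 3], [-1, -2], [-2, -1], [0.1, 0.2]]
y = [1, 1, -1, -1, -1]  # last point is a noisy minority-side sample
w, b, violations = count_violation_svm(X, y)
print(w, b, violations)
```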
[LG-35] Evaluating Time Series Models with Knowledge Discovery SDM2025
Link: https://arxiv.org/abs/2503.14869
Authors: Li Zhang
Subjects: Machine Learning (cs.LG)
*Comments: accepted in SIAM SDM 2025 - Blue Sky Track (to appear)
Click to view abstract
Abstract:Time series data is one of the most ubiquitous data modalities, existing in diverse critical domains such as healthcare, seismology, manufacturing, and energy. In recent years, there has been increasing interest in the data mining community in developing time series deep learning models that pursue better performance. Model performance is often evaluated with metrics such as RMSE, accuracy, and F1-score. Yet time series data are often hard to interpret and are collected with unknown environmental factors, sensor configurations, latent physical mechanisms, and non-stationary evolving behavior. As a result, a model that scores better under standard metric-based evaluation may not always perform better in real-world tasks. In this blue sky paper, we explore the challenges inherent in the metric-based evaluation framework for time series data mining and propose a potential blue-sky idea: developing a knowledge-discovery-based evaluation framework that effectively utilizes domain-expert knowledge to evaluate a model. We demonstrate that an evidence-seeking explanation can potentially have stronger persuasive power than metric-based evaluation and obtain better generalization ability for time series data mining tasks.
[LG-36] Scaled Supervision is an Implicit Lipschitz Regularizer AAAI
Link: https://arxiv.org/abs/2503.14813
Authors: Zhongyu Ouyang,Chunhui Zhang,Yaning Jia,Soroush Vosoughi
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*Comments: Accepted to the International AAAI Conference on Web and Social Media (ICWSM 2025)
Click to view abstract
Abstract:In modern social media, recommender systems (RecSys) rely on the click-through rate (CTR) as the standard metric to evaluate user engagement. CTR prediction is traditionally framed as a binary classification task to predict whether a user will interact with a given item. However, this approach overlooks the complexity of real-world social modeling, where the user, item, and their interactive features change dynamically in fast-paced online environments. This dynamic nature often leads to model instability, reflected in overfitting short-term fluctuations rather than higher-level interactive patterns. While overfitting calls for more scaled and refined supervisions, current solutions often rely on binary labels that overly simplify fine-grained user preferences through the thresholding process, which significantly reduces the richness of the supervision. Therefore, we aim to alleviate the overfitting problem by increasing the supervision bandwidth in CTR training. Specifically, (i) theoretically, we formulate the impact of fine-grained preferences on model stability as a Lipschitz constraint; (ii) empirically, we discover that scaling the supervision bandwidth can act as an implicit Lipschitz regularizer, stably optimizing existing CTR models to achieve better generalizability. Extensive experiments show that this scaled supervision significantly and consistently improves the optimization process and the performance of existing CTR models, even without the need for additional hyperparameter tuning.
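The mechanical change the paper argues for, widening binary CTR targets into graded ones, is tiny in code. A sketch, where the mapping from engagement to soft labels is an illustrative choice:

```python
import torch
import torch.nn.functional as F

# Binary thresholding collapses graded engagement into 0/1; widening the
# supervision "bandwidth" keeps the graded signal as soft targets.
engagement = torch.tensor([0.02, 0.35, 0.60, 0.95])  # e.g. dwell-time quantiles
hard_labels = (engagement > 0.5).float()             # conventional CTR target
soft_labels = engagement                              # scaled supervision

logits = torch.randn(4, requires_grad=True)
loss_hard = F.binary_cross_entropy_with_logits(logits, hard_labels)
loss_soft = F.binary_cross_entropy_with_logits(logits, soft_labels)  # float targets are valid
loss_soft.backward()
```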
[LG-37] Pruning-Based TinyML Optimization of Machine Learning Models for Anomaly Detection in Electric Vehicle Charging Infrastructure
Link: https://arxiv.org/abs/2503.14799
Authors: Fatemeh Dehrouyeh,Ibrahim Shaer,Soodeh Nikan,Firouz Badrkhani Ajaei,Abdallah Shami
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: This paper has been accepted for presentation at IEEE ICC 2025. The final published version will be available in the conference proceedings. The implementation and code are available at: this https URL
Click to view abstract
Abstract:With the growing need for real-time processing on IoT devices, optimizing machine learning (ML) models’ size, latency, and computational efficiency is essential. This paper investigates a pruning method for anomaly detection in resource-constrained environments, specifically targeting Electric Vehicle Charging Infrastructure (EVCI). Using the CICEVSE2024 dataset, we trained and optimized three models-Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), and XGBoost-through hyperparameter tuning with Optuna, further refining them using SHapley Additive exPlanations (SHAP)-based feature selection (FS) and unstructured pruning techniques. The optimized models achieved significant reductions in model size and inference times, with only a marginal impact on their performance. Notably, our findings indicate that, in the context of EVCI, pruning and FS can enhance computational efficiency while retaining critical anomaly detection capabilities.
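A minimal example of the unstructured-pruning step on a placeholder model, using PyTorch's built-in pruning utilities; the 50% ratio and the toy architecture are assumptions, not the paper's tuned EVCI configuration.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Unstructured magnitude pruning of the kind used for TinyML compression:
# zero out the 50% smallest-magnitude weights of each linear layer.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the sparsity permanent

zeroed = sum((m.weight == 0).sum().item() for m in model.modules()
             if isinstance(m, nn.Linear))
print(f"zeroed weights: {zeroed}")
```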
[LG-38] A New Benchmark for Online Learning with Budget-Balancing Constraints
Link: https://arxiv.org/abs/2503.14796
Authors: Mark Braverman,Jingyi Liu,Jieming Mao,Jon Schneider,Eric Xue
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*Comments:
Click to view abstract
Abstract:The adversarial Bandit with Knapsack problem is a multi-armed bandits problem with budget constraints and adversarial rewards and costs. In each round, a learner selects an action to take and observes the reward and cost of the selected action. The goal is to maximize the sum of rewards while satisfying the budget constraint. The classical benchmark to compare against is the best fixed distribution over actions that satisfies the budget constraint in expectation. Unlike its stochastic counterpart, where rewards and costs are drawn from some fixed distribution (Badanidiyuru et al., 2018), the adversarial BwK problem does not admit a no-regret algorithm for every problem instance due to the “spend-or-save” dilemma (Immorlica et al., 2022). A key problem left open by existing works is whether there exists a weaker but still meaningful benchmark to compare against such that no-regret learning is still possible. In this work, we present a new benchmark to compare against, motivated both by real-world applications such as autobidding and by its underlying mathematical structure. The benchmark is based on the Earth Mover’s Distance (EMD), and we show that sublinear regret is attainable against any strategy whose spending pattern is within EMD $o(T^2)$ of any sub-pacing spending pattern. As a special case, we obtain results against the “pacing over windows” benchmark, where we partition time into disjoint windows of size w and allow the benchmark strategies to choose a different distribution over actions for each window while satisfying a pacing budget constraint. Against this benchmark, our algorithm obtains a regret bound of $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$. We also show a matching lower bound, proving the optimality of our algorithm in this important special case. In addition, we provide further evidence of the necessity of the EMD condition for obtaining a sublinear regret.
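The EMD between two spending patterns is easy to compute in the one-dimensional case, where it reduces to the area between cumulative spend curves. A sketch comparing a frontloaded pattern against uniform pacing (the numbers are arbitrary):

```python
import numpy as np

def spend_emd(spend_a, spend_b):
    """Earth Mover's Distance between two per-round spending patterns of
    equal total budget, via the 1-D identity: EMD equals the sum of
    absolute differences of the cumulative sums."""
    return np.abs(np.cumsum(spend_a) - np.cumsum(spend_b)).sum()

T, B = 100, 100.0
pacing = np.full(T, B / T)                              # spend B/T per round
frontload = np.r_[np.full(50, 1.5), np.full(50, 0.5)]   # same budget, shifted earlier
print(spend_emd(pacing, frontload))  # how far the pattern is from pacing
```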
[LG-39] SEEK: Self-adaptive Explainable Kernel For Nonstationary Gaussian Processes
Link: https://arxiv.org/abs/2503.14785
Authors: Nima Negarandeh,Carlos Mora,Ramin Bostanabad
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Gaussian processes (GPs) are powerful probabilistic models that define flexible priors over functions, offering strong interpretability and uncertainty quantification. However, GP models often rely on simple, stationary kernels which can lead to suboptimal predictions and miscalibrated uncertainty estimates, especially in nonstationary real-world applications. In this paper, we introduce SEEK, a novel class of learnable kernels to model complex, nonstationary functions via GPs. Inspired by artificial neurons, SEEK is derived from first principles to ensure symmetry and positive semi-definiteness, key properties of valid kernels. The proposed method achieves flexible and adaptive nonstationarity by learning a mapping from a set of base kernels. Compared to existing techniques, our approach is more interpretable and much less prone to overfitting. We conduct comprehensive sensitivity analyses and comparative studies to demonstrate that our approach is not only robust to many of its design choices, but also outperforms existing stationary/nonstationary kernels in both mean prediction accuracy and uncertainty quantification.
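One way to see why such constructions stay valid kernels: any nonnegative combination of PSD base kernels is itself symmetric and PSD. The sketch below fixes the weights rather than learning them and is far simpler than SEEK's neuron-inspired mapping:

```python
import numpy as np

def rbf(X1, X2, lengthscale):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def combined_kernel(X1, X2, weights, lengthscales):
    """Nonnegative combination of valid base kernels: symmetry and
    positive semi-definiteness are preserved by construction."""
    w = np.exp(weights) / np.exp(weights).sum()   # softmax keeps weights > 0
    return sum(wi * rbf(X1, X2, ls) for wi, ls in zip(w, lengthscales))

X = np.random.randn(20, 2)
K = combined_kernel(X, X, weights=np.array([0.0, 1.0]), lengthscales=[0.5, 2.0])
print(np.all(np.linalg.eigvalsh(K + 1e-9 * np.eye(20)) >= 0))  # PSD check
```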
[LG-40] Fake Runs Real Fixes – Analyzing xPU Performance Through Simulation
Link: https://arxiv.org/abs/2503.14781
Authors: Ioannis Zarkadas,Amanda Tomlinson,Asaf Cidon,Baris Kasikci,Ofir Weisse
Subjects: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:As models become larger, ML accelerators are a scarce resource whose performance must be continually optimized to improve efficiency. Existing performance analysis tools are coarse grained, and fail to capture model performance at the machine-code level. In addition, these tools often do not provide specific recommendations for optimizations. We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level that provides actionable optimization suggestions. Our core insight is to use a hardware-level simulator, an artifact of the hardware design process that we can re-purpose for performance analysis. xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator to gain low-level insights into the model’s performance. We implemented xPU-Shark for our in-house accelerator and used it to analyze the performance of several of our production LLMs, revealing several previously-unknown microarchitecture inefficiencies. Leveraging these insights, we optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
[LG-41] Better Private Distribution Testing by Leveraging Unverified Auxiliary Data
Link: https://arxiv.org/abs/2503.14709
Authors: Maryam Aliakbarpour,Arnav Burudgunte,Clément Canonne,Ronitt Rubinfeld
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
*Comments:
Click to view abstract
Abstract:We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, uniformity, identity, and closeness testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).
[LG-42] Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction
Link: https://arxiv.org/abs/2503.14663
Authors: Anni Zhou,Beyah Raheem,Rishikesan Kamaleswaran,Yao Xie
Subjects: Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Sepsis is a life-threatening syndrome with high morbidity and mortality in hospitals. Early prediction of sepsis plays a crucial role in facilitating early interventions for septic patients. However, early sepsis prediction systems with uncertainty quantification and adaptive learning are scarce. This paper proposes Sepsyn-OLCP, a novel online learning algorithm for early sepsis prediction that integrates conformal prediction for uncertainty quantification and Bayesian bandits for adaptive decision-making. By combining the robustness of Bayesian models with the statistical uncertainty guarantees of conformal prediction methodologies, the algorithm delivers accurate and trustworthy predictions, addressing the critical need for reliable and adaptive systems in high-stakes healthcare applications such as early sepsis prediction. We evaluate the performance of Sepsyn-OLCP in terms of regret in a stochastic bandit setting, the area under the receiver operating characteristic curve (AUROC), and the F-measure. Our results show that Sepsyn-OLCP outperforms existing individual models, increasing the AUROC of a neural network from 0.64 to 0.73 without retraining or high computational costs, and its model selection policy converges to the optimal strategy in the long run. The proposed reinforcement learning-based framework, integrated with conformal prediction techniques, thus provides uncertainty quantification and trustworthy predictions for this high-stakes healthcare application.
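The conformal-prediction ingredient can be illustrated in isolation. Below is a standard split-conformal calibration for binary risk scores; the nonconformity choice and the synthetic data are assumptions, and Sepsyn-OLCP's bandit-driven model selection is not shown.

```python
import numpy as np

def conformal_threshold(cal_scores, cal_labels, alpha=0.1):
    """Calibrate a nonconformity threshold so prediction sets cover the
    true label with probability >= 1 - alpha. Scores are model
    probabilities for class 1."""
    # Nonconformity: 1 - probability assigned to the true class.
    nonconf = np.where(cal_labels == 1, 1 - cal_scores, cal_scores)
    n = len(nonconf)
    return np.quantile(nonconf, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(score, q):
    # Include a class when its nonconformity is below the threshold.
    return [c for c, nc in [(1, 1 - score), (0, score)] if nc <= q]

rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=500)
cal_labels = (cal_scores + 0.2 * rng.normal(size=500) > 0.5).astype(int)
q = conformal_threshold(cal_scores, cal_labels)
print(prediction_set(0.9, q), prediction_set(0.5, q))  # confident vs. ambiguous
```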
[LG-43] Anomaly-Flow: A Multi-domain Federated Generative Adversarial Network for Distributed Denial-of-Service Detection
Link: https://arxiv.org/abs/2503.14618
Authors: Leonardo Henrique de Melo,Gustavo de Carvalho Bertoli,Michele Nogueira,Aldri Luiz dos Santos,Lourenço Alves Pereira Junior
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments: 8 pages, 4 figures
Click to view abstract
Abstract:Distributed denial-of-service (DDoS) attacks remain a critical threat to Internet services, causing costly disruptions. While machine learning (ML) has shown promise in DDoS detection, current solutions struggle with multi-domain environments where attacks must be detected across heterogeneous networks and organizational boundaries. This limitation severely impacts the practical deployment of ML-based defenses in real-world settings. This paper introduces Anomaly-Flow, a novel framework that addresses this critical gap by combining Federated Learning (FL) with Generative Adversarial Networks (GANs) for privacy-preserving, multi-domain DDoS detection. Our proposal enables collaborative learning across diverse network domains while preserving data privacy through synthetic flow generation. Through extensive evaluation across three distinct network datasets, Anomaly-Flow achieves an average F1-score of 0.747 , outperforming baseline models. Importantly, our framework enables organizations to share attack detection capabilities without exposing sensitive network data, making it particularly valuable for critical infrastructure and privacy-sensitive sectors. Beyond immediate technical contributions, this work provides insights into the challenges and opportunities in multi-domain DDoS detection, establishing a foundation for future research in collaborative network defense systems. Our findings have important implications for academic research and industry practitioners working to deploy practical ML-based security solutions.
[LG-44] PANDORA: Diffusion Policy Learning for Dexterous Robotic Piano Playing
Link: https://arxiv.org/abs/2503.14545
Authors: Yanjia Huang,Renjie Li,Zhengzhong Tu
Subjects: Machine Learning (cs.LG); Robotics (cs.RO); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments:
Click to view abstract
Abstract:We present PANDORA, a novel diffusion-based policy learning framework designed specifically for dexterous robotic piano performance. Our approach employs a conditional U-Net architecture enhanced with FiLM-based global conditioning, which iteratively denoises noisy action sequences into smooth, high-dimensional trajectories. To achieve precise key execution coupled with expressive musical performance, we design a composite reward function that integrates task-specific accuracy, audio fidelity, and high-level semantic feedback from a large language model (LLM) oracle. The LLM oracle assesses musical expressiveness and stylistic nuances, enabling dynamic, hand-specific reward adjustments. Further augmented by a residual inverse-kinematics refinement policy, PANDORA achieves state-of-the-art performance in the ROBOPIANIST environment, significantly outperforming baselines in both precision and expressiveness. Ablation studies validate the critical contributions of diffusion-based denoising and LLM-driven semantic feedback in enhancing robotic musicianship. Videos available at: this https URL
[LG-45] Natural Quantization of Neural Networks
Link: https://arxiv.org/abs/2503.15482
Authors: Richard Barney,Djamil Lakhdar-Hamina,Victor Galitski
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*Comments: 7 pages, 8 figures, 1 table
Click to view abstract
Abstract:We propose a natural quantization of a standard neural network, where the neurons correspond to qubits and the activation functions are implemented via quantum gates and measurements. The simplest quantized neural network corresponds to applying single-qubit rotations, with the rotation angles being dependent on the weights and measurement outcomes of the previous layer. This realization has the advantage of being smoothly tunable from the purely classical limit with no quantum uncertainty (thereby reproducing the classical neural network exactly) to a quantum case, where superpositions introduce an intrinsic uncertainty in the network. We benchmark this architecture on a subset of the standard MNIST dataset and find a regime of “quantum advantage,” where the validation error rate in the quantum realization is smaller than that in the classical model. We also consider another approach where quantumness is introduced via weak measurements of ancilla qubits entangled with the neuron qubits. This quantum neural network also allows for smooth tuning of the degree of quantumness by controlling an entanglement angle $g$, with $g=\frac{\pi}{2}$ replicating the classical regime. We find that validation error is also minimized within the quantum regime in this approach. We also observe a quantum transition, with sharp loss of the quantum network’s ability to learn at a critical point $g_c$. The proposed quantum neural networks are readily realizable in present-day quantum computers on commercial datasets.
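A classical simulation of the simplest building block, a single-qubit "neuron" whose rotation angle is set by the weighted input and whose activation is a measurement, looks as follows; the angle mapping is illustrative, not the paper's exact gate set.

```python
import numpy as np

def quantum_neuron(x, w, rng):
    """Toy single-qubit neuron: the weighted input sets a rotation angle,
    and the 'activation' is a measurement of the rotated qubit."""
    theta = np.dot(w, x)                 # rotation angle from weights
    p_one = np.sin(theta / 2.0) ** 2     # measurement probability of |1>
    return rng.random() < p_one          # stochastic binary activation

rng = np.random.default_rng(42)
x = np.array([0.5, -0.2, 0.8])
w = np.array([1.0, 0.4, 2.0])
samples = [quantum_neuron(x, w, rng) for _ in range(1000)]
print(np.mean(samples))  # converges to sin^2(theta/2) as shots grow
```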
[LG-46] Accurate transferable and verifiable machine-learned interatomic potentials for layered materials
Link: https://arxiv.org/abs/2503.15432
Authors: Johnathan D. Georgaras,Akash Ramdas,Chung Hsuan Shan,Elena Halsted,Berwyn,Tianshu Li,Felipe H. da Jornada
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*Comments: 10 pages, 5 figures
Click to view abstract
Abstract:Twisted layered van-der-Waals materials often exhibit unique electronic and optical properties absent in their non-twisted counterparts. Unfortunately, predicting such properties is hindered by the difficulty in determining the atomic structure in materials displaying large moiré domains. Here, we introduce a split machine-learned interatomic potential and dataset curation approach that separates intralayer and interlayer interactions and significantly improves model accuracy – with a tenfold increase in energy and force prediction accuracy relative to conventional models. We further demonstrate that traditional MLIP validation metrics – force and energy errors – are inadequate for moiré structures and develop a more holistic, physically-motivated metric based on the distribution of stacking configurations. This metric effectively compares the entirety of large-scale moiré domains between two structures instead of relying on conventional measures evaluated on smaller commensurate cells. Finally, we establish that one-dimensional instead of two-dimensional moiré structures can serve as efficient surrogate systems for validating MLIPs, allowing for a practical model validation protocol against explicit DFT calculations. Applying our framework to HfS2/GaS bilayers reveals that accurate structural predictions directly translate into reliable electronic properties. Our model-agnostic approach integrates seamlessly with various intralayer and interlayer interaction models, enabling computationally tractable relaxation of moiré materials, from bilayer to complex multilayers, with rigorously validated accuracy.
[LG-47] HQNN-FSP: A Hybrid Classical-Quantum Neural Network for Regression-Based Financial Stock Market Prediction
Link: https://arxiv.org/abs/2503.15403
Authors: Prashant Kumar Choudhary,Nouhaila Innan,Muhammad Shafique,Rajeev Singh
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Comments: 11 pages and 11 figures
[LG-48] Robustness of Nonlinear Representation Learning
Link: https://arxiv.org/abs/2503.15355
Authors: Simon Buchholz,Bernhard Schölkopf
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 37 pages
[LG-49] Fast MLE and MAPE-Based Device Activity Detection for Grant-Free Access via PSCA and PSCA-Net
Link: https://arxiv.org/abs/2503.15259
Authors: Bowen Tan,Ying Cui
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:
[LG-50] Online federated learning framework for classification
Link: https://arxiv.org/abs/2503.15210
Authors: Wenxing Guo,Jinhan Xie,Jianya Lu,Bei Jiang,Hongsheng Dai,Linglong Kong
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
[LG-51] Interpretability of Graph Neural Networks to Assert Effects of Global Change Drivers on Ecological Networks
Link: https://arxiv.org/abs/2503.15107
Authors: Emre Anakok,Pierre Barbillon,Colin Fontaine,Elisa Thebault
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
[LG-52] Ambient Noise Full Waveform Inversion with Neural Operators
Link: https://arxiv.org/abs/2503.15013
Authors: Caifeng Zou,Zachary E. Ross,Robert W. Clayton,Fan-Chi Lin,Kamyar Azizzadenesheli
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*Comments:
[LG-53] Robust Transmission of Punctured Text with Large Language Model-based Recovery
Link: https://arxiv.org/abs/2503.14831
Authors: Sojeong Park,Hyeonho Noh,Hyun Jong Yang
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: This work has been submitted to the IEEE for possible publication
Click to view abstract
Abstract:With the recent advancements in deep learning, semantic communication which transmits only task-oriented features, has rapidly emerged. However, since feature extraction relies on learning-based models, its performance fundamentally depends on the training dataset or tasks. For practical scenarios, it is essential to design a model that demonstrates robust performance regardless of dataset or tasks. In this correspondence, we propose a novel text transmission model that selects and transmits only a few characters and recovers the missing characters at the receiver using a large language model (LLM). Additionally, we propose a novel importance character extractor (ICE), which selects transmitted characters to enhance LLM recovery performance. Simulations demonstrate that the proposed filter selection by ICE outperforms random filter selection, which selects transmitted characters randomly. Moreover, the proposed model exhibits robust performance across different datasets and tasks and outperforms traditional bit-based communication in low signal-to-noise ratio conditions.
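A toy version of the puncture-and-recover loop: a crude frequency-based importance score stands in for the learned ICE module, and the recovery prompt would be sent to any LLM.

```python
from collections import Counter

def select_characters(text, keep_ratio=0.4):
    """Keep only the characters an importance scorer selects; here a
    frequency proxy (rarer characters are treated as more informative)
    stands in for the learned ICE module."""
    freq = Counter(text.lower())
    ranked = sorted(range(len(text)), key=lambda i: freq[text[i].lower()])
    keep = set(ranked[: int(len(text) * keep_ratio)])
    # Spaces are always kept so word boundaries survive transmission.
    return "".join(c if (i in keep or c == " ") else "_"
                   for i, c in enumerate(text))

def recovery_prompt(punctured):
    return f"Fill in the missing characters ('_') of this sentence:\n{punctured}"

punctured = select_characters("semantic communication transmits features")
print(punctured)
print(recovery_prompt(punctured))  # send to any LLM for reconstruction
```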
[LG-54] The Hardness of Validating Observational Studies with Experimental Data AISTATS2025
Link: https://arxiv.org/abs/2503.14795
Authors: Jake Fawkes,Michael O’Riordan,Athanasios Vlontzos,Oriol Corcoll,Ciarán Mark Gilligan-Lee
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: Published at AISTATS 2025
[LG-55] Variational Autoencoded Multivariate Spatial Fay-Herriot Models
Link: https://arxiv.org/abs/2503.14710
Authors: Zhenhua Wang,Paul A. Parker,Scott H. Holan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments:
[LG-56] The Exoplanet Citizen Science Pipeline: Human Factors and Machine Learning
Link: https://arxiv.org/abs/2503.14575
Authors: Oisín Creaner,Anna Preis,Cormac Ryan,Nika Gorchakova
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*Comments: Author’s manuscript version of paper accepted for publication in Proceedings of the International Astronomical Union, published by Cambridge University Press. Presented at Kavli-IAU Symposium 2024 “(Toward) Discovery of life beyond Earth and its impact”. 5 pages, 2 figures
Click to view abstract
Abstract:We present the progress of work to streamline and simplify the process of exoplanet observation by citizen scientists. International collaborations such as ExoClock and Exoplanet Watch enable citizen scientists to use small telescopes to carry out transit observations. These studies provide essential support for space missions such as JWST and ARIEL. Contributions include maintenance or recovery of ephemerides, follow up confirmation and transit time variations. Ongoing observation programs benefit from a large pool of observers, with a wide variety of experience levels. Our projects work closely with these communities to streamline their observation pipelines and enable wider participation. Two complementary approaches are taken: Star Guide applies human-centric design and community consultation to identify points of friction within existing systems and provide complementary online tools and resources to reduce barriers to entry to the observing community. Machine Learning is used to accelerate data processing and automate steps which are currently manual, providing a streamlined tool for citizen science and a scalable solution for large-scale archival research.
[LG-57] Sequence Analysis Using the Bezier Curve
Link: https://arxiv.org/abs/2503.14574
Authors: Taslim Murad,Sarwan Ali,Murray Patterson
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments:
[LG-58] Efficient Data Selection for Training Genomic Perturbation Models
Link: https://arxiv.org/abs/2503.14571
Authors: George Panagopoulos,Johannes Lutzeyer,Sofiane Ennadir,Michalis Vazirgiannis,Jun Pang
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: 19 pages
[LG-59] Identifying Critical Phases for Disease Onset with Sparse Haematological Biomarkers
Link: https://arxiv.org/abs/2503.14561
Authors: Andrea Zerio,Maya Bechler-Speicher,Tine Jess,Aleksejs Sazonovs
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments:
Click to view abstract
Abstract:Routinely collected clinical blood tests are an emerging molecular data source for large-scale biomedical research but inherently feature irregular sampling and informative observation. Traditional approaches rely on imputation, which can distort learning signals and bias predictions while lacking biological interpretability. We propose a novel methodology using Graph Neural Additive Networks (GNAN) to model biomarker trajectories as time-weighted directed graphs, where nodes represent sampling events and edges encode the time delta between events. GNAN’s additive structure enables the explicit decomposition of feature and temporal contributions, allowing the detection of critical disease-associated time points. Unlike conventional imputation-based approaches, our method preserves the temporal structure of sparse data without introducing artificial biases and provides inherently interpretable predictions by decomposing contributions from each biomarker and time interval. This makes our model clinically applicable, as well as allowing it to discover biologically meaningful disease signatures.
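Constructing the time-weighted directed graph the model consumes is straightforward; a sketch with networkx, where node features are just raw biomarker values (the real pipeline attaches richer features):

```python
import networkx as nx

def trajectory_graph(samples):
    """Build a time-weighted directed graph of the kind described above:
    one node per sampling event, edges carrying the time delta between
    consecutive events, so irregular sampling stays explicit."""
    g = nx.DiGraph()
    for i, (t, value) in enumerate(samples):
        g.add_node(i, time=t, value=value)
        if i > 0:
            g.add_edge(i - 1, i, delta_t=t - samples[i - 1][0])
    return g

# Haemoglobin-like measurements at irregular times: (day, value).
events = [(0, 13.5), (12, 12.9), (13, 12.1), (90, 11.0)]
g = trajectory_graph(events)
print(list(g.edges(data=True)))
```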
[LG-60] Machine learning algorithms to predict stroke in China based on causal inference of time series analysis
Link: https://arxiv.org/abs/2503.14512
Authors: Qizhi Zheng,Ayang Zhao,Xinzhu Wang,Yanhong Bai,Zikun Wang,Xiuying Wang,Xianzhang Zeng,Guanghui Dong
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*Comments: 17 pages
Information Retrieval
[IR-0] Optimizing Retrieval Strategies for Financial Question Answering Documents in Retrieval-Augmented Generation Systems ICLR2025
Link: https://arxiv.org/abs/2503.15191
Authors: Sejong Kim,Hyunseo Song,Hyunwoo Seo,Hyunjun Kim
Subjects: Information Retrieval (cs.IR)
*Comments: 15 pages, 3 figures, 11 tables. Accepted at ICLR 2025 Workshop on Advances in Financial AI. Code available at this https URL
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising framework to mitigate hallucinations in Large Language Models (LLMs), yet its overall performance is dependent on the underlying retrieval system. In the finance domain, documents such as 10-K reports pose distinct challenges due to domain-specific vocabulary and multi-hierarchical tabular data. In this work, we introduce an efficient, end-to-end RAG pipeline that enhances retrieval for financial documents through a three-phase approach: pre-retrieval, retrieval, and post-retrieval. In the pre-retrieval phase, various query and corpus preprocessing techniques are employed to enrich input data. During the retrieval phase, we fine-tuned state-of-the-art (SOTA) embedding models with domain-specific knowledge and implemented a hybrid retrieval strategy that combines dense and sparse representations. Finally, the post-retrieval phase leverages Direct Preference Optimization (DPO) training and document selection methods to further refine the results. Evaluations on seven financial question answering datasets (FinDER, FinQABench, FinanceBench, TATQA, FinQA, ConvFinQA, and MultiHiertt) demonstrate substantial improvements in retrieval performance, leading to more accurate and contextually appropriate generation. These findings highlight the critical role of tailored retrieval techniques in advancing the effectiveness of RAG systems for financial applications. A fully replicable pipeline is available on GitHub: this https URL.
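Score-level fusion is one common way to realize such a hybrid dense-plus-sparse strategy (the paper does not spell out its fusion rule here). A sketch with min-max normalization:

```python
import numpy as np

def hybrid_scores(dense_scores, sparse_scores, weight=0.5):
    """Convex combination of dense and sparse retrieval scores; each
    signal is min-max normalized so the mixture is scale-free."""
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return weight * norm(dense_scores) + (1 - weight) * norm(sparse_scores)

dense = [0.82, 0.75, 0.40]   # e.g. cosine similarity from an embedding model
sparse = [12.1, 3.4, 9.8]    # e.g. BM25 scores over the same 3 documents
ranked = np.argsort(-hybrid_scores(dense, sparse))
print(ranked)  # document indices, best first
```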
[IR-1] Graph-Based Re-ranking: Emerging Techniques Limitations and Opportunities
Link: https://arxiv.org/abs/2503.14802
Authors: Md Shahir Zaoad,Niamat Zawad,Priyanka Ranade,Richard Krogman,Latifur Khan,James Holt
Subjects: Information Retrieval (cs.IR)
*Comments:
Click to view abstract
Abstract:Knowledge graphs have emerged to be promising datastore candidates for context augmentation during Retrieval Augmented Generation (RAG). As a result, techniques in graph representation learning have been simultaneously explored alongside principal neural information retrieval approaches, such as two-phased retrieval, also known as re-ranking. While Graph Neural Networks (GNNs) have been proposed to demonstrate proficiency in graph learning for re-ranking, there are ongoing limitations in modeling and evaluating input graph structures for training and evaluation for passage and document ranking tasks. In this survey, we review emerging GNN-based ranking model architectures along with their corresponding graph representation construction methodologies. We conclude by providing recommendations on future research based on community-wide challenges and opportunities.
Attachment Download
Click to download today's full paper list