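Today's arXiv Papers | 2025-03-18

This post lists the latest papers retrieved from arXiv.org on 2025-03-18. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 every morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.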

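[Quick Read]: This paper addresses the tendency of large language models (LLMs) to rely on matching reasoning patterns from their training data in complex reasoning tasks, rather than proactively selecting the most suitable cognitive strategy. Existing methods usually impose fixed cognitive structures that work well on specific tasks but adapt poorly across scenarios. To overcome this limitation, the paper proposes METASCALE, a test-time scaling framework based on meta-thoughts. Its key idea is to iteratively select and evaluate candidate meta-thoughts with a multi-armed bandit algorithm using upper confidence bound (UCB) selection, guided by a reward model. A genetic algorithm further evolves high-reward meta-thoughts, dynamically refining and extending the strategy pool to improve accuracy and generalization at inference time. Experiments show that METASCALE consistently outperforms standard inference approaches, scales more effectively with increasing sampling budgets, and produces more structured, expert-level responses.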

Link: https://arxiv.org/abs/2503.13447
Authors: Qin Liu, Wenxuan Zhou, Nan Xu, James Y. Huang, Fei Wang, Sheng Zhang, Hoifung Poon, Muhao Chen
Affiliations: University of California, Davis; University of Southern California; Microsoft Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:One critical challenge for large language models (LLMs) for making complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance in specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts – adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that MetaScale consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o, surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
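The selection loop described in the abstract can be illustrated with a minimal UCB1 sketch. This is not the authors' implementation: the exploration constant `c`, the toy reward function, and the `HIDDEN_QUALITY` pool are assumptions made up for illustration, the reward model is replaced by a random stand-in, and the genetic-algorithm step is omitted.

```python
import math
import random


def ucb_select(counts, totals, step, c=1.4):
    """Return the index of the arm with the highest UCB1 score."""
    for arm, n in enumerate(counts):
        if n == 0:  # try every arm once before trusting the estimates
            return arm
    scores = [
        totals[arm] / counts[arm] + c * math.sqrt(math.log(step) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_fn, num_arms, budget, seed=0):
    """Spend `budget` pulls across arms; return pull counts and mean rewards."""
    rng = random.Random(seed)
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    for step in range(1, budget + 1):
        arm = ucb_select(counts, totals, step)
        reward = reward_fn(arm, rng)  # stand-in for a reward model score
        counts[arm] += 1
        totals[arm] += reward
    means = [t / n if n else 0.0 for t, n in zip(totals, counts)]
    return counts, means


# Hypothetical pool: each candidate "meta-thought" has a hidden mean reward.
HIDDEN_QUALITY = [0.2, 0.5, 0.8]


def toy_reward(arm, rng):
    """Bernoulli reward drawn from the arm's hidden quality (assumption)."""
    return 1.0 if rng.random() < HIDDEN_QUALITY[arm] else 0.0


counts, means = run_bandit(toy_reward, num_arms=3, budget=300)
```

The exploration bonus shrinks as an arm is pulled more often, so rarely tried candidates keep getting occasional chances while the budget concentrates on candidates with high observed reward.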

[NLP-1] Faithfulness of LLM Self-Explanations for Commonsense Tasks: Larger Is Better and Instruction-Tuning Allows Trade-Offs but Not Pareto Dominance

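[Quick Read]: This paper examines whether the explanations that large language models (LLMs) generate for themselves are faithful to their internal decision-making process, which matters for safety and oversight. Its key contribution is phi-CCT, a simplified variant of the Correlational Counterfactual Test that needs no token probabilities yet explains most of the variance of the original test, enabling a comprehensive counterfactual faithfulness analysis of 62 models from 8 families. The study finds a clear trend of faithfulness improving with model size, and examines how instruction-tuning, prompting, and explanation verbosity affect, and trade off against, the faithful representation of model decision processes.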

Link: https://arxiv.org/abs/2503.13445
Authors: Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
Affiliations: Google DeepMind; University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 38 pages, 9 figures

Abstract:As large language models (LLMs) become increasingly capable, ensuring that their self-generated explanations are faithful to their internal decision-making process is critical for safety and oversight. In this work, we conduct a comprehensive counterfactual faithfulness analysis across 62 models from 8 families, encompassing both pretrained and instruction-tuned variants and significantly extending prior studies of counterfactual tests. We introduce phi-CCT, a simplified variant of the Correlational Counterfactual Test, which avoids the need for token probabilities while explaining most of the variance of the original test. Our findings reveal clear scaling trends: larger models are consistently more faithful on our metrics. However, when comparing instruction-tuned and human-imitated explanations, we find that observed differences in faithfulness can often be attributed to explanation verbosity, leading to shifts along the true-positive/false-positive Pareto frontier. While instruction-tuning and prompting can influence this trade-off, we find limited evidence that they fundamentally expand the frontier of explanatory faithfulness beyond what is achievable with pretrained models of comparable size. Our analysis highlights the nuanced relationship between instruction-tuning, verbosity, and the faithful representation of model decision processes.

[NLP-2] xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

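[Quick Read]: This paper targets the efficiency and speed bottleneck that large language models (LLMs) face at inference time, where recent advances in reasoning, math, and coding depend on large computation budgets. Its key contribution is xLSTM 7B, a 7-billion-parameter LLM built on the xLSTM architecture, which combines the architecture's linear compute scaling with sequence length and constant memory usage with targeted optimizations for fast, efficient inference. Experiments show that xLSTM 7B performs comparably to similar-sized LLMs on downstream tasks while delivering significantly faster and more efficient inference than Llama- and Mamba-based LLMs, which the authors present as establishing it as the fastest and most efficient 7B LLM.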

Link: https://arxiv.org/abs/2503.13427
Authors: Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code available at: this https URL and this https URL

Abstract:Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM’s architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM’s potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.

[NLP-3] SuperBPE: Space Travel for Language Models

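[Quick Read]: This paper questions a core assumption of conventional language model (LM) tokenization, namely that tokens must be subwords contained within word boundaries, and asks whether it holds back modern LMs. It argues that whitespace is not a reliable delimiter of meaning, as shown by multi-word expressions (e.g., "by the way"), cross-lingual differences in how many words express a concept (e.g., "spacesuit helmet" is "raumanzughelm" in German), and languages that do not use whitespace (e.g., Chinese). To move past this limit, the paper proposes SuperBPE, a "superword" tokenizer that adds a simple pretokenization curriculum to the byte-pair encoding (BPE) algorithm: it first learns subwords, then superwords that bridge whitespace. With a 200k vocabulary, SuperBPE encodes text with up to 33% fewer tokens than BPE on average. With model size, vocabulary size, and training compute held fixed and only the vocabulary-learning algorithm changed, it achieves an average +4.0% absolute improvement across 30 downstream tasks while requiring 27% less compute at inference time. The analysis finds that SuperBPE produces segmentations that are more uniform in per-token difficulty, possibly because its tokens often capture multi-word expressions that function as a single semantic unit.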

Link: https://arxiv.org/abs/2503.13423
Authors: Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi
Affiliations: University of Washington; NVIDIA; Allen Institute for AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: preprint, code and artifacts will become available at this https URL

Abstract:The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., “by the way”), crosslingual variation in the number of words needed to express a concept (e.g., “spacesuit helmet” in German is “raumanzughelm”), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a “superword” tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying only the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better language models overall.
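The two-stage idea in the abstract (learn subwords first, then superwords that bridge whitespace) can be sketched with a toy BPE learner. This is only an illustration under simplifying assumptions, not the SuperBPE implementation: it works on characters rather than bytes, treats a single space as the only word boundary, and the function names and merge budgets are hypothetical.

```python
from collections import Counter


def most_frequent_pair(chunks):
    """Most frequent adjacent token pair inside any chunk, or None."""
    pairs = Counter()
    for chunk in chunks:
        pairs.update(zip(chunk, chunk[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def apply_merge(chunk, pair):
    """Replace each occurrence of `pair` in `chunk` with the joined token."""
    merged, i = [], 0
    while i < len(chunk):
        if i + 1 < len(chunk) and (chunk[i], chunk[i + 1]) == pair:
            merged.append(chunk[i] + chunk[i + 1])
            i += 2
        else:
            merged.append(chunk[i])
            i += 1
    return merged


def learn_merges(chunks, num_merges):
    """Greedy BPE: apply up to `num_merges` merges, never across chunks."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(chunks)
        if pair is None:
            break
        chunks = [apply_merge(c, pair) for c in chunks]
        merges.append(pair)
    return chunks, merges


def two_stage_bpe(text, subword_merges, superword_merges):
    # Stage 1: whitespace pretokenization; each later word keeps its leading
    # space, so merges cannot cross a word boundary.
    words = text.split(" ")
    chunks = [list(w if i == 0 else " " + w) for i, w in enumerate(words)]
    chunks, stage1 = learn_merges(chunks, subword_merges)
    # Stage 2: drop the boundaries so merges may bridge whitespace.
    flat = [[tok for chunk in chunks for tok in chunk]]
    flat, stage2 = learn_merges(flat, superword_merges)
    return flat[0], stage1, stage2


tokens, stage1, stage2 = two_stage_bpe("by the way by the way by the way", 20, 1)
```

On this toy input, stage 1 stops once every word is a single token, and the single stage-2 merge joins " the" and " way" into the superword " the way", which shortens the token sequence without changing the text it encodes.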

[NLP-4] DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective

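[Quick Read]: This paper addresses key challenges of automated prompt optimization in robustness, efficiency, and generalization. Existing methods are constrained by the limitations of the reflection-based paradigm, which the paper first identifies through an empirical analysis, and they struggle to scale. Its key contribution is DLPO: seven approaches inspired by traditional deep learning paradigms, integrated into text-based gradient optimization to tackle these challenges step by step and validated through extensive experiments. The authors hope the study offers guidance for future research and a fuller understanding of the challenges and potential solutions in prompt optimization. The code is open-source.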

Link: https://arxiv.org/abs/2503.13413
Authors: Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin
Affiliations: Research Center for Social Computing and Iterative Robot, Harbin Institute of Technology, China; School of Computer Science and Engineering, Central South University, China; Fudan University, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts. However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at this https URL.

[NLP-5] Using the Tools of Cognitive Science to Understand Large Language Models at Different Levels of Analysis

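[Quick Read]: This paper addresses the problem that advanced AI systems such as modern large language models are increasingly powerful but also increasingly hard to understand. It likens this to the historical difficulty of understanding the human mind and proposes borrowing methods from cognitive science. The key is a framework based on Marr's three levels of analysis (computational, algorithmic, and implementational): the paper revisits established cognitive science techniques relevant to each level and illustrates how they can yield insights into the behavior and internal organization of large language models, providing a toolkit for making sense of these new kinds of "minds".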

Link: https://arxiv.org/abs/2503.13401
Authors: Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on Marr’s three levels of analysis. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.

[NLP-6] MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

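[Quick Read]: This paper addresses the lack of benchmarks for complex multimodal reasoning in science, particularly biology. Despite progress in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks reach only college-level difficulty, while research-level benchmarks emphasize lower-level perception and fall short of the reasoning that scientific discovery requires. To fill this gap, the paper introduces MicroVQA, a visual question answering (VQA) benchmark that assesses three reasoning capabilities vital to research workflows: expert image understanding, hypothesis generation, and experiment proposal.

MicroVQA's key contribution is its construction method. It contains 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, so that the VQA samples reflect real scientific practice. The authors find that standard MCQ generation methods induce language shortcuts, and so propose a two-stage pipeline: an optimized LLM prompt first structures question-answer pairs into MCQs, and an agent-based "RefineBot" then updates them to remove the shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%. Models with smaller LLMs only slightly underperform the top models, suggesting that language-based reasoning is less challenging than multimodal reasoning, and tuning with scientific articles improves performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These findings highlight the challenges of multimodal scientific reasoning and position MicroVQA as a resource for advancing AI-driven biomedical research. MicroVQA and its project page are available at the links provided.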

Link: https://arxiv.org/abs/2503.13399
Authors: James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Affiliations: Baylor College of Medicine; Stanford University; Technical University of Denmark; University of California San Diego; Karolinska Institutet; Imperial College London; ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments: CVPR 2025 (Conference on Computer Vision and Pattern Recognition). Project page at this https URL. Benchmark at this https URL.

Abstract:Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking on state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at this https URL, and project page at this https URL.

[NLP-7] Aligned Probing: Relating Toxic Behavior and Model Internals

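[Quick Read]: This paper addresses the lack of a systematic understanding of how the behavior of language models (LMs) on toxicity relates to their internal representations. It proposes aligned probing, a novel interpretability framework that aligns model behavior (based on outputs) with internal representations (internals), and uses it to study toxicity from both perspectives together for the first time, across more than 20 OLMo, Llama, and Mistral models. The results show that LMs strongly encode the toxicity level of inputs and subsequent outputs, particularly in lower layers, and provide both correlative and causal evidence that models generate less toxic output when they strongly encode information about input toxicity. The paper also highlights the heterogeneity of toxicity, with behavior and internals varying across attributes such as Threat, and presents four case studies that underline the framework's practical impact.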

Link: https://arxiv.org/abs/2503.13390
Authors: Andreas Waldis, Vagrant Gautam, Anne Lauscher, Dietrich Klakow, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt; Information Systems Research Lab, Lucerne University of Applied Sciences and Arts; Spoken Language Systems, Saarland University; Data Science Group, University of Hamburg
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce aligned probing, a novel interpretability framework that aligns the behavior of language models (LMs), based on their outputs, and their internal representations (internals). Using this framework, we examine over 20 OLMo, Llama, and Mistral models, bridging behavioral and internal perspectives for toxicity for the first time. Our results show that LMs strongly encode information about the toxicity level of inputs and subsequent outputs, particularly in lower layers. Focusing on how unique LMs differ offers both correlative and causal evidence that they generate less toxic output when strongly encoding information about the input toxicity. We also highlight the heterogeneity of toxicity, as model behavior and internals vary across unique attributes such as Threat. Finally, four case studies analyzing detoxification, multi-prompt evaluations, model quantization, and pre-training dynamics underline the practical impact of aligned probing with further concrete insights. Our findings contribute to a more holistic understanding of LMs, both within and beyond the context of toxicity.

[NLP-8] Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

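[Quick Read]: This paper addresses the tension between the hypothesis that pretrained large language models (LLMs) need only minimal supervision during supervised fine-tuning (SFT) and the limited stability and generalizability of data selection methods in practice. For multi-modal LLMs (MLLMs) in particular, the heterogeneity and sheer volume of data sources make it hard to select high-quality multi-modal instruction data efficiently and robustly. The key solution redefines the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduces multi-modal rich scorers to evaluate each data candidate on those capabilities. To promote diversity, it takes interaction style as the diversity indicator and uses a multi-modal rich styler to identify instruction patterns, so that the scorers and styler together (mmSSR) convey high-scoring information to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR scales efficiently to millions of samples, supports customized capability acquisition under budget constraints, and enables training-free generalization to new domains for curation. Validated on 14 multi-modal benchmarks, it consistently improves over random sampling, baseline strategies, and state-of-the-art selection methods, reaching 99.1% of full performance with only 30% of the 2.6M data.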

Link: https://arxiv.org/abs/2503.13383
Authors: Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: update comparison with sota and analysis

Abstract:The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
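One way to picture "high-scoring data in diversified styles under a budget" is a round-robin pick over style groups, best score first. The abstract does not specify the actual selection rule, so everything below is an assumption for illustration: the `(id, score, style)` tuple layout, the style labels, and the round-robin policy are hypothetical and stand in for the paper's scorers and styler.

```python
from collections import defaultdict


def select_diverse_top(candidates, budget):
    """Pick up to `budget` items, cycling over styles and taking the
    highest-scoring remaining item of each style in turn.

    candidates: list of (item_id, score, style) tuples (assumed layout).
    """
    by_style = defaultdict(list)
    for cand in candidates:
        by_style[cand[2]].append(cand)
    for group in by_style.values():
        group.sort(key=lambda c: c[1], reverse=True)

    selected = []
    rank = 0  # rank 0 = best item of each style, rank 1 = second best, ...
    while len(selected) < budget:
        picked_any = False
        for style in sorted(by_style):  # fixed order keeps runs reproducible
            group = by_style[style]
            if rank < len(group) and len(selected) < budget:
                selected.append(group[rank])
                picked_any = True
        if not picked_any:  # every style group is exhausted
            break
        rank += 1
    return selected


# Hypothetical scored candidates; the scores and styles are made up.
pool = [
    ("a", 0.9, "qa"),
    ("b", 0.8, "qa"),
    ("c", 0.7, "caption"),
    ("d", 0.1, "dialog"),
]
picked = [item_id for item_id, _, _ in select_diverse_top(pool, budget=3)]
```

A pure top-k by score would return only "qa" and "caption" items here; cycling over styles trades a little score for coverage of every interaction style.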

[NLP-9] TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

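[Quick Read]: This paper addresses temporal video grounding (TVG): precisely localizing the video segments in a long video that are relevant to a given language query. It proposes TimeZero, a reasoning-guided LVLM whose key idea is to extend the inference process so that the model learns to reason about video-language relationships solely through reinforcement learning. In experiments on two benchmarks, TimeZero achieves state-of-the-art performance on Charades-STA.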

Link: https://arxiv.org/abs/2503.13377
Authors: Ye Wang, Boshen Xu, Zihao Yue, Zihan Xiao, Ziheng Wang, Liang Zhang, Dingyi Yang, Wenxuan Wang, Qin Jin
Affiliations: Renmin University of China; Beijing University of Posts and Telecommunications; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code: this https URL

Abstract:We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. This task requires precisely localizing relevant video segments within long videos based on a given language query. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. To evaluate the effectiveness of TimeZero, we conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA. Code is available at this https URL.

[NLP-10] Valid Text-to-SQL Generation with Unification-based DeepStochLog

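[Quick Read]: This paper addresses the problem that large language models translating natural language questions into SQL queries, lacking hard constraints on syntax and database schema, occasionally produce invalid, non-executable queries, which limits their use in real-life scenarios. The key solution is a neurosymbolic framework that imposes SQL syntax and schema constraints through unification-based definite clause grammars and thereby guarantees valid queries. The framework also builds a bi-directional interface to language models to leverage their natural language understanding. On the evaluated subset of SQL grammars, all output queries are valid, and the authors report that the approach improves the validity, execution accuracy, and ground-truth alignment of the underlying language model by a large margin.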

Link: https://arxiv.org/abs/2503.13342
Authors: Ying Jiao, Luc De Raedt, Giuseppe Marra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at this https URL.

[NLP-11] Reliable and Efficient Amortized Model-based Evaluation

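[Quick Read]: This paper addresses the high cost of comprehensively evaluating language models (LMs) during development and deployment. Measuring performance across many benchmarks is resource-intensive, while averaging scores on a benchmark subset is often unreliable because the average is confounded with the difficulty of the questions in that subset. Item response theory (IRT) controls for question difficulty, but difficulty is expensive to estimate. The key solution is to train a model that predicts question difficulty from question content, enabling reliable measurement at a fraction of the cost. The difficulty predictor is further used to train a question generator conditioned on a difficulty level, which supports adaptive testing: instead of a random subset, the most informative questions are chosen adaptively. Experiments on 22 common natural language benchmarks and 172 LMs show the approach is more reliable and efficient than current common practice.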

Link: https://arxiv.org/abs/2503.13335
Authors: Sang Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Comprehensive evaluations of language models (LMs) during both development and deployment phases are necessary because these models possess numerous capabilities (e.g., mathematical reasoning, legal support, or medical diagnosis) as well as safety risks (e.g., racial bias, toxicity, or misinformation). The average score across a wide range of benchmarks provides a signal that helps guide the use of these LMs in practice. Currently, holistic evaluations are costly due to the large volume of benchmark questions, making frequent evaluations impractical. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach, unfortunately, often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset. Item response theory (IRT) was designed to address this challenge, providing a reliable measurement by carefully controlling for question difficulty. Unfortunately, question difficulty is expensive to estimate. Facing this challenge, we train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost. In addition, we leverage this difficulty predictor to further improve the evaluation efficiency through training a question generator given a difficulty level. This question generator is essential in adaptive testing, where, instead of using a random subset of the benchmark questions, informative questions are adaptively chosen based on the current estimation of LLM performance. Experiments on 22 common natural language benchmarks and 172 LMs show that this approach is more reliable and efficient compared to current common practice.
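To make the IRT and adaptive-testing idea concrete, here is a minimal sketch that assumes the simplest IRT variant, the one-parameter Rasch model, where the probability of a correct answer is a logistic function of ability minus difficulty. The abstract does not say which IRT model the paper uses, and the function names, learning rate, and difficulty values below are illustrative assumptions.

```python
import math


def p_correct(theta, difficulty):
    """Rasch model: probability that ability `theta` answers correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def information(theta, difficulty):
    """Fisher information of one question at ability `theta`: p * (1 - p)."""
    p = p_correct(theta, difficulty)
    return p * (1.0 - p)


def next_question(theta, difficulties, asked):
    """Adaptive testing step: pick the unasked question that is most
    informative at the current ability estimate."""
    remaining = [i for i in range(len(difficulties)) if i not in asked]
    return max(remaining, key=lambda i: information(theta, difficulties[i]))


def estimate_ability(responses, difficulties, steps=200, lr=0.1):
    """Gradient ascent on the Rasch log-likelihood.

    responses: dict mapping question index -> 1 (correct) or 0 (incorrect).
    """
    theta = 0.0
    for _ in range(steps):
        grad = sum(
            y - p_correct(theta, difficulties[i]) for i, y in responses.items()
        )
        theta += lr * grad
    return theta


# Hypothetical question bank with known (or predicted) difficulties.
bank = [-2.0, 0.0, 1.0, 3.0]
first = next_question(0.9, bank, asked=set())
```

Under this model a question is most informative when its difficulty is close to the current ability estimate, which is why a question generator conditioned on a target difficulty is useful for adaptive testing.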

[NLP-12] Computation Mechanism Behind LLM Position Generalization

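[Quick Read]: This paper asks how to explain the flexibility large language models (LLMs) show when handling textual positions (position generalization), including tolerance to position perturbations, and which internal computational mechanisms produce it. Its key finding is that LLMs learn a counterintuitive disentanglement of attention logits: their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. The study also identifies a prevalent pattern in intermediate features that is proven to enable this effect. Because the pattern differs from the behavior of randomly initialized parameters, it appears to be learned rather than a natural result of the architecture. Based on these findings, the paper provides computational explanations and criteria for LLMs' position flexibility, an early step in linking position generalization to the internal mechanisms of modern LLMs.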

Link: https://arxiv.org/abs/2503.13305
Authors: Chi Han, Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term position generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions tolerantly, but how LLMs computationally process positional relevance remains largely unexplored. This work connects the linguistic phenomenon with LLMs’ computational mechanisms. We show how LLMs enforce certain computational mechanisms for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, this work reveals that LLMs learn a counterintuitive disentanglement of attention logits. Their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features, which we prove theoretically enables this effect. The pattern, which is different from how randomly initialized parameters would behave, suggests that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for LLMs’ position flexibilities. This work takes a pioneering step in linking position generalization with modern LLMs’ internal mechanisms.

[NLP-13] A Survey on Transformer Context Extension: Approaches and Evaluation

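[Quick Read]: This paper addresses the performance degradation of large language models (LLMs) in long-context scenarios. It systematically reviews approaches to the long-context challenge and proposes a taxonomy with four main types: positional encoding, context compression, retrieval augmented, and attention pattern. Beyond the approaches, it organizes the data, tasks, and metrics used to evaluate long context, based on existing long-context benchmarks. It closes by summarizing unresolved issues and giving the authors' views on future developments.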

Link: https://arxiv.org/abs/2503.13299
Authors: Yijun Liu, Jinzheng Yu, Yang Xu, Zhongyang Li, Qingfu Zhu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Large language models (LLMs) based on Transformer have been widely applied in the field of natural language processing (NLP), demonstrating strong performance, particularly in handling short text tasks. However, when it comes to long context scenarios, the performance of LLMs degrades due to some challenges. To alleviate this phenomenon, a number of works have been proposed recently. In this survey, we first list the challenges of applying pre-trained LLMs to process long contexts. We then systematically review the approaches related to long context and propose our taxonomy categorizing them into four main types: positional encoding, context compression, retrieval augmented, and attention pattern. In addition to the approaches, we focus on the evaluation of long context, organizing relevant data, tasks, and metrics based on existing long context benchmarks. Finally, we summarize unresolved issues in the long context domain and put forward our views on future developments.

[NLP-14] ϕ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation

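[Quick Read]: This paper addresses two problems of search-based decoding for inference-time optimization: the short-sightedness of auto-regressive generation, and a vast search space that leads to excessive exploration and insufficient exploitation. It proposes a new decoding strategy, ϕ-Decoding, which frames decoding as foresight sampling: simulated future steps are used to obtain a globally optimal step estimation. To estimate step value precisely and expressively, ϕ-Decoding approximates two distributions, via foresight and via clustering, and samples from their joint distribution to select the optimal steps. To support adaptive computation allocation, it adds in-width and in-depth pruning strategies that improve inference efficiency. Experiments across seven benchmarks show that ϕ-Decoding outperforms strong baselines in both performance and efficiency, generalizes across various LLMs, and scales across a wide range of computing budgets.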

Link: https://arxiv.org/abs/2503.13288
Authors: Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu
Affiliations: Shanghai AI Lab; Xi'an Jiaotong University; The University of Hong Kong; Peking University; National University of Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 6 figures

Abstract:Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named \phi -Decoding. To provide a precise and expressive estimation of step value, \phi -Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show \phi -Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at this https URL, and the open-source PyPI package is coming soon.

[NLP-15] LLM-Match: An Open-Sourced Patient Matching Model Based on Large Language Models and Retrieval-Augmented Generation

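[Quick read]: This paper addresses patient matching for clinical trials: linking patients to suitable trials by accurately matching their medical records against each trial's inclusion and exclusion criteria. It proposes LLM-Match, a framework with four components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a large pool of electronic health records (EHRs). Second, a prompt generation module builds input prompts from the trial eligibility criteria, the patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using the structured prompts and ground-truth labels. Fourth, an evaluation module measures the fine-tuned model on the test datasets. On four open datasets, LLM-Match outperformed all baselines.

Link: https://arxiv.org/abs/2503.13281
Authors: Xiaodi Li,Shaika Chowdhury,Chung Il Wi,Maria Vassilaki,Ken Liu,Terence T Sio,Owen Garrick,Young J Juhn,James R Cerhan,Cui Tao,Nansu Zong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 1 figure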

Abstract:Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model’s performance on the testing datasets. We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.
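
The second component, prompt generation, can be illustrated with a minimal sketch. The function name, the section headings inside the prompt, and the sample criteria are all assumptions made for illustration; the abstract does not give the actual prompt template used by LLM-Match.

```python
def build_matching_prompt(inclusion, exclusion, patient_context, system_instruction):
    """Assemble one prompt from eligibility criteria, patient context, and instructions.

    Hypothetical layout: the real LLM-Match template is not given in the abstract.
    """
    lines = [system_instruction.strip(), "", "Inclusion criteria:"]
    lines += [f"- {c}" for c in inclusion]
    lines += ["", "Exclusion criteria:"]
    lines += [f"- {c}" for c in exclusion]
    lines += ["", "Patient context:"]
    lines += [f"- {snippet}" for snippet in patient_context]
    lines += ["", "Answer with one label: eligible or not eligible."]
    return "\n".join(lines)


# Illustrative inputs only; patient_context would come from the RAG module.
prompt = build_matching_prompt(
    inclusion=["Age 18 or older", "Diagnosed with type 2 diabetes"],
    exclusion=["Currently pregnant"],
    patient_context=["54-year-old male", "Type 2 diabetes diagnosed in 2019"],
    system_instruction="You are screening a patient for a clinical trial.",
)
print(prompt)
```

Keeping inclusion and exclusion criteria in separate sections mirrors the abstract's note that both kinds of criteria are integrated into the prompt.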

[NLP-16] TablePilot: Recommending Human-Preferred Tabular Data Analysis with Large Language Models

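[Quick read]: This paper tackles the challenge of efficiently identifying and recommending the most relevant analysis queries, code, and results for a new table in tabular data analysis workflows. The difficulty comes from the complexity of tabular data, the diversity of analytical operations, and the demand for high-quality results, which together make the process tedious and time-consuming. The proposed solution is TablePilot, a framework that uses large language models to autonomously generate comprehensive, high-quality analytical results without relying on user profiles or prior interactions. TablePilot improves recommendation accuracy through key designs in analysis preparation and analysis optimization, and the paper further proposes Rec-Align to raise recommendation quality and align better with human preferences. With GPT-4o as the base model, the tuned TablePilot reaches 77.0% top-5 recommendation recall on the DART dataset, and human evaluation confirms its effectiveness in optimizing tabular data analysis workflows.

Link: https://arxiv.org/abs/2503.13262
Authors: Deyin Yi,Yihao Liu,Lang Cao,Mengyu Zhou,Haoyu Dong,Shi Han,Dongmei Zhang
Affiliations: Shanghai University of Finance and Economics; Peking University; University of Illinois Urbana-Champaign; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: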

Abstract:Tabular data analysis is crucial in many scenarios, yet efficiently identifying the most relevant data analysis queries and results for a new table remains a significant challenge. The complexity of tabular data, diverse analytical operations, and the demand for high-quality analysis make the process tedious. To address these challenges, we aim to recommend query-code-result triplets tailored for new tables in tabular data analysis workflows. In this paper, we present TablePilot, a pioneering tabular data analysis framework leveraging large language models to autonomously generate comprehensive and superior analytical results without relying on user profiles or prior interactions. The framework incorporates key designs in analysis preparation and analysis optimization to enhance accuracy. Additionally, we propose Rec-Align, a novel method to further improve recommendation quality and better align with human preferences. Experiments on DART, a dataset specifically designed for comprehensive tabular data analysis recommendation, demonstrate the effectiveness of our framework. Based on GPT-4o, the tuned TablePilot achieves 77.0% top-5 recommendation recall. Human evaluations further highlight its effectiveness in optimizing tabular data analysis workflows.
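
To make the reported metric concrete, here is a minimal sketch of recall at k under one common definition: the share of relevant items that appear among the first k recommendations. The paper's exact definition of top-5 recommendation recall may differ, and the item identifiers below are invented for illustration.

```python
def recall_at_k(recommended, relevant, k=5):
    """Fraction of relevant items that appear among the first k recommendations."""
    if not relevant:
        return 0.0
    top_k = set(recommended[:k])
    return len(top_k & set(relevant)) / len(relevant)


# Hypothetical ranked recommendations and ground-truth relevant analyses.
recommended = ["q3", "q7", "q1", "q9", "q4", "q2"]
relevant = ["q1", "q2", "q3", "q4"]
score = recall_at_k(recommended, relevant, k=5)
print(score)  # 0.75: q1, q3, q4 are in the top 5, q2 is ranked sixth
```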

[NLP-17] Can Language Models Follow Multiple Turns of Entangled Instructions?

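[Quick read]: This paper examines the limits of large language models (LLMs) in following multiple turns of instructions, especially their consistency and their ability to integrate tasks when instructions are entangled or conflicting. It covers three levels of difficulty: retrieving information from instructions, tracking and reasoning across turns, and resolving conflicts among instructions. Using a human-in-the-loop approach, the author builds MultiTurnInstruct, a dataset of about 1.1K high-quality multi-turn conversations, and finds a trade-off between capabilities. GPT models show superior memorization but are less effective on privacy-protection tasks that require selectively withholding information; larger models reason better but still struggle to resolve conflicting instructions. These gaps cannot be attributed solely to information loss, because models achieve strong BLEU scores on memorization tasks while their attention mechanisms fail to integrate multiple related instructions effectively. The paper therefore highlights key areas for improvement in complex real-world tasks involving multi-turn instructions.

Link: https://arxiv.org/abs/2503.13222
Authors: Chi Han
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages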

Abstract:Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs’ capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with around 1.1K high-quality multi-turn conversations through a human-in-the-loop approach, resulting in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks but their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions.

[NLP-18] Improving Complex Reasoning with Dynamic Prompt Corruption: A soft prompt Optimization Approach ICLR2025

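[Quick read]: This paper addresses the finding that prompt tuning (PT) yields limited improvement, and can even degrade performance, on complex reasoning tasks. The authors find that soft prompts help some instances but harm others, particularly in the later phases of reasoning, and that this is associated with information accumulation within the soft prompts and erroneous information flow patterns in the deeper layers of the model. To counter this, the paper proposes Dynamic Prompt Corruption (DPC), which dynamically adjusts the influence of soft prompts in two stages. First, Dynamic Trigger measures the impact of the soft prompts and identifies whether it is beneficial or detrimental. Second, Dynamic Corruption mitigates the negative effects by selectively masking key tokens that interfere with the reasoning process. Experiments on several LLMs and reasoning tasks show that DPC achieves 4%-8% accuracy gains over vanilla prompt tuning.

Link: https://arxiv.org/abs/2503.13208
Authors: Sinan Fan,Liang Xie,Chen Shen,Ge Teng,Xiaosong Yuan,Xiaofeng Zhang,Chenxi Huang,Wenxiao Wang,Xiaofei He,Jieping Ye
Affiliations: Zhejiang University; Hangzhou YunQi Academy of Engineering; College of Computer Science and Technology, Zhejiang University of Technology; Alibaba Cloud Computing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2025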

Abstract:Prompt-tuning (PT) for large language models (LLMs) can facilitate the performance on various conventional NLP tasks with significantly fewer trainable parameters. However, our investigation reveals that PT provides limited improvement and may even degrade the primitive performance of LLMs on complex reasoning tasks. Such a phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the later phases of reasoning. To address these challenges, we first identify an information accumulation within the soft prompts. Through detailed analysis, we demonstrate that this phenomenon is often accompanied by erroneous information flow patterns in the deeper layers of the model, which ultimately lead to incorrect reasoning outcomes. We propose a novel method called Dynamic Prompt Corruption (DPC) to take better advantage of soft prompts in complex reasoning tasks, which dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, DPC consists of two stages: Dynamic Trigger and Dynamic Corruption. First, Dynamic Trigger measures the impact of soft prompts, identifying whether it is beneficial or detrimental. Then, Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate the proposed approach through extensive experiments on various LLMs and reasoning tasks, including GSM8K, MATH, and AQuA. Experimental results demonstrate that DPC can consistently enhance the performance of PT, achieving 4%-8% accuracy gains compared to vanilla prompt tuning, highlighting the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

[NLP-19] MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

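[Quick read]: This paper addresses decision support for the complex clinical decisions in inpatient pathways, in a setting where large-scale inpatient datasets are lacking and existing medical benchmarks overlook the multifaceted nature of inpatient clinical decision-making. Its contributions are the Inpatient Pathway Decision Support (IPDS) benchmark and the Multi-Agent Inpatient Pathways (MAP) framework. MAP handles inpatient pathways with three clinical agents (a triage agent, a diagnosis agent, and a treatment agent) and adds a chief agent that oversees and guides them. MAP improved diagnosis accuracy by 25.10% over the state-of-the-art LLM HuatuoGPT2-13B and outperformed three board-certified clinicians by 10%-12% in clinical compliance, establishing a foundation for inpatient pathway systems.

Link: https://arxiv.org/abs/2503.13205
Authors: Zhen Chen,Zhihao Peng,Xusheng Liang,Cheng Wang,Peigan Liang,Linsheng Zeng,Minjie Ju,Yixuan Yuan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: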

Abstract:Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrated on medical question-answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents, including a triage agent managing the patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. It is worth noting that our MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12%, establishing a foundation for inpatient pathways systems.

[NLP-20] Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs

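[Quick read]: This paper aims to detect and quantify socioeconomic bias in large language models (LLMs) without relying on subjective human judgments. Its key idea is a framework based on Item Response Theory (IRT), which accounts for item difficulty and so improves the estimation of ideological bias. The authors fine-tune two LLM families (Meta-LLaMa 3.2-1B-Instruct and ChatGPT 3.5) to represent distinct ideological positions and use a two-stage approach: first modeling response avoidance, then estimating perceived bias in the answered responses. The results show that off-the-shelf LLMs often avoid ideological engagement rather than exhibit bias, which challenges prior claims of partisanship. The authors present the framework as empirically validated and as a contribution to AI alignment research and fairer AI governance.

Link: https://arxiv.org/abs/2503.13149
Authors: Jasmin Wachter,Michael Radloff,Maja Smolej,Katharina Kinder-Kurlanda
Affiliations: Department of AI and Cybersecurity, Department of Health Psychology; Digital Age Research Center D’ARC; University of Klagenfurt
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: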

Abstract:We introduce an Item Response Theory (IRT)-based framework to detect and quantify socioeconomic bias in large language models (LLMs) without relying on subjective human judgments. Unlike traditional methods, IRT accounts for item difficulty, improving ideological bias estimation. We fine-tune two LLM families (Meta-LLaMa 3.2-1B-Instruct and ChatGPT 3.5) to represent distinct ideological positions and introduce a two-stage approach: (1) modeling response avoidance and (2) estimating perceived bias in answered responses. Our results show that off-the-shelf LLMs often avoid ideological engagement rather than exhibit bias, challenging prior claims of partisanship. This empirically validated framework enhances AI alignment research and promotes fairer AI governance.
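
The role of item difficulty in IRT can be illustrated with the two-parameter logistic (2PL) item response function. This is a sketch under an assumption: the abstract does not say which IRT model the authors use, and 2PL is simply one standard choice. The parameter names are illustrative.

```python
import math


def p_response(theta, difficulty, discrimination=1.0):
    """2PL item response function.

    theta: latent trait of the respondent (here, a position on an ideological scale).
    difficulty: trait level at which endorsement probability is exactly 0.5.
    discrimination: how sharply the probability changes around the difficulty.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# The same respondent (theta = 0) facing an easy item and a hard item.
easy = p_response(theta=0.0, difficulty=-1.0)
hard = p_response(theta=0.0, difficulty=1.0)
print(round(easy, 3), round(hard, 3))
```

Because difficulty enters the model explicitly, the same answer pattern carries different evidence about the latent trait depending on which items were answered, which is the property the abstract contrasts with traditional methods.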

[NLP-21] Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

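[Quick read]: This paper addresses a gap in long video understanding: common approaches rely on densely sampled frame captions or end-to-end feature selectors and tend to overlook the logical relationships between textual queries and visual elements. In practice, computational constraints force coarse frame subsampling, a challenge comparable to finding a needle in a haystack. The paper proposes a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. It defines four fundamental logical dependencies (spatial co-occurrence, temporal proximity, attribute dependency, and causal order) and uses them to update the frame sampling distribution through an iterative refinement process, enabling context-aware identification of the keyframes a specific query needs. The method sets new state-of-the-art results on keyframe selection metrics and brings clear gains on downstream video question answering, supporting its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning.

Link: https://arxiv.org/abs/2503.13139
Authors: Weiyu Guo,Ziyang Chen,Shaoguang Wang,Jianxiang He,Yijie Xu,Jinhui Ye,Ying Sun,Hui Xiong
Affiliations: AI Thrust, HKUST(GZ); Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
Comments: 18 pages, under review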

Abstract:Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
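
The idea of relations updating a frame sampling distribution can be sketched in a few lines. This is a toy illustration, not the authors' algorithm: the function name, the multiplicative boost, and the fixed neighborhood radius (a crude stand-in for temporal proximity) are all assumptions, and the detector that produces the hit indices is left out entirely.

```python
def refine_distribution(weights, hits, radius=1, boost=2.0):
    """Boost frames near detected hits, then renormalize to a probability distribution.

    weights: current sampling probability per frame.
    hits: frame indices where a query-relevant relation was detected (assumed given).
    """
    updated = list(weights)
    for h in hits:
        lo = max(0, h - radius)
        hi = min(len(weights), h + radius + 1)
        for i in range(lo, hi):
            updated[i] *= boost
    total = sum(updated)
    return [w / total for w in updated]


# Start from a uniform distribution over 8 frames, then refine around frame 5.
dist = [1 / 8] * 8
dist = refine_distribution(dist, hits=[5])
print([round(p, 3) for p in dist])
```

Calling the function repeatedly with new hits gives the iterative flavor described in the abstract: probability mass gradually concentrates on the frames that keep satisfying the query's relations.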

[NLP-22] MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

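[Quick read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to reason about 3D space despite their strength in 2D visual understanding. Using large-scale, high-quality 3D scene data with open-set annotations, the authors introduce a new supervised fine-tuning dataset, Cubify Anything VQA (CA-VQA), and a new evaluation benchmark, both focused on indoor scenes and covering spatial tasks such as spatial relationship prediction, metric size and distance estimation, and 3D grounding. The paper shows that adding metric depth and multi-view inputs further improves 3D understanding, and that data alone lets the model reach depth perception comparable to dedicated monocular depth estimation models. The resulting model, MM-Spatial, is a strong generalist MLLM that achieves state-of-the-art performance on 3D spatial understanding benchmarks, including the authors' own.

Link: https://arxiv.org/abs/2503.13111
Authors: Erik Daxberger,Nina Wenzel,David Griffiths,Haiming Gang,Justin Lazarow,Gefen Kohavi,Kai Kang,Marcin Eichner,Yinfei Yang,Afshin Dehghan,Peter Grasch
Affiliations: Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: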

Abstract:Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.

[NLP-23] Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

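[Quick read]: This paper addresses the fact that inductive reasoning is understudied, which the authors attribute to the difficulty of obtaining high-quality process supervision data. Their solution uses number sequences as the source of inductive reasoning data: each sequence is packaged as an algorithmic problem whose goal is to find the general term of the sequence through a code solution. This makes it possible to verify whether the code solution holds for any term of the sequence and to inject case-based supervision signals through code unit tests. The authors build a sequence synthetic data pipeline and a training dataset, CodeSeq, and models tuned with it improve on both code and comprehensive reasoning benchmarks.

Link: https://arxiv.org/abs/2503.13109
Authors: Kedi Chen,Zhikai Lei,Fan Zhang,Yinqi Zhang,Qin Chen,Jie Zhou,Liang He,Qipeng Guo,Kai Chen,Wei Zhang
Affiliations: East China Normal University; Shanghai AI Laboratory; Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: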

Abstract:Large language models make remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while another type of reasoning mode that better aligns with human learning, inductive reasoning, is not well studied. We attribute the reason to the fact that obtaining high-quality process supervision data is challenging for inductive reasoning. Towards this end, we novelly employ number sequences as the source of inductive reasoning data. We package sequences into algorithmic problems to find the general term of each sequence through a code solution. In this way, we can verify whether the code solution holds for any term in the current sequence, and inject case-based supervision signals by using code unit tests. We build a sequence synthetic data pipeline and form a training dataset CodeSeq. Experimental results show that the models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.
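
The verification step described in the abstract, checking a code solution against known terms of a sequence, can be sketched as follows. The sequence, the candidate functions, and the helper name are invented for illustration; the actual CodeSeq pipeline is not described at this level of detail in the abstract.

```python
def general_term(n):
    """Candidate solution: n-th term (1-indexed) of 2, 5, 10, 17, 26, ..."""
    return n * n + 1


def failing_cases(candidate, known_terms):
    """Return (index, expected, got) for every known term the candidate gets wrong."""
    failures = []
    for i, expected in enumerate(known_terms, start=1):
        got = candidate(i)
        if got != expected:
            failures.append((i, expected, got))
    return failures


known = [2, 5, 10, 17, 26]
ok = failing_cases(general_term, known)          # correct candidate: no failures
bad = failing_cases(lambda n: 3 * n - 1, known)  # fits the first two terms only
print(ok, bad)
```

The failing cases are what make the supervision case-based: a wrong candidate is not just rejected, it comes back with the specific terms it misses.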

[NLP-24] REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities

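[Quick read]: This paper evaluates how well the LLM-as-a-judge paradigm works in Russian, a framework so far studied mainly in English. It introduces the Russian Error tyPes Annotation dataset (REPA), with 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair with their preferences across ten specific error types and selected an overall preference. On this basis the authors rank six generative LLMs using three rating systems based on human preferences, and evaluate eight LLM judges in zero-shot and few-shot settings. They find a notable gap between LLM judge performance in Russian and in English, but partial alignment between rankings based on human and LLM preferences suggests that, although current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.

Link: https://arxiv.org/abs/2503.13102
Authors: Alexander Pugachev,Alena Fenogenova,Vladislav Mikhailov,Ekaterina Artemova
Affiliations: Higher School of Economics; SaluteDevices; University of Oslo; Toloka AI
Categories: Computation and Language (cs.CL)
Comments: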

Abstract:Recent advances in large language models (LLMs) have introduced the novel paradigm of using LLMs as judges, where an LLM evaluates and scores the outputs of another LLM, which often correlates highly with human preferences. However, the use of LLM-as-a-judge has been primarily studied in English. In this paper, we evaluate this framework in Russian by introducing the Russian Error tyPes Annotation dataset (REPA), a dataset of 1k user queries and 2k LLM-generated responses. Human annotators labeled each response pair expressing their preferences across ten specific error types, as well as selecting an overall preference. We rank six generative LLMs across the error types using three rating systems based on human preferences. We also evaluate responses using eight LLM judges in zero-shot and few-shot settings. We describe the results of analyzing the judges and position and length biases. Our findings reveal a notable gap between LLM judge performance in Russian and English. However, rankings based on human and LLM preferences show partial alignment, suggesting that while current LLM judges struggle with fine-grained evaluation in Russian, there is potential for improvement.
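
Ranking models from pairwise preferences can be sketched with an Elo-style update. This is an assumption for illustration: the abstract mentions three rating systems but does not name them, so Elo here is one familiar example, not necessarily one the authors use. The model names, the K-factor, and the preference list are hypothetical.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Return updated (winner, loser) ratings after one pairwise preference."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta


# Each tuple is (preferred model, other model) for one annotated response pair.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
preferences = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in preferences:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Elo depends on the order in which comparisons are processed, which is one reason a study might report several rating systems side by side.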

[NLP-25] Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

[Quick read]: This paper addresses the problem of distinguishing human-written from machine-generated text in a low-resource language, Hausa. Most existing machine-generated text detectors are trained on high-resource languages such as English and French. The authors develop what they describe as the first large-scale detector for Hausa, and build a dataset from human-written text from Hausa-language media outlets paired with articles generated by the Gemini-2.0 flash model from the same headlines. They fine-tune four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) and evaluate them. AfroXLMR performs best, with 99.23% accuracy and a 99.21% F1 score.

Link: https://arxiv.org/abs/2503.13101
Authors: Babangida Sani,Aakansha Soy,Sukairaj Hafiz Imam,Ahmad Mustapha,Lukman Jibril Aliyu,Idris Abdulmumin,Ibrahim Said Ahmad,Shamsuddeen Hassan Muhammad
Affiliations: Kalinga University; Bayero University, Kano; Arewa Data Science Academy; DSFSI, University of Pretoria; Northeastern University; Imperial College London
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages like English, French, etc. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scraped seven Hausa-language media outlets for the human-generated text and used the Gemini-2.0 flash model to automatically generate the corresponding Hausa-language articles based on the human-generated article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.
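
For readers less familiar with the two reported metrics, a minimal sketch of accuracy and binary F1 follows. It assumes label 1 means machine-generated and computes F1 for that positive class only; the abstract does not say whether the paper reports binary, macro, or weighted F1, and the labels below are invented.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy over all examples and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels: 1 = machine-generated, 0 = human-written (illustrative only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```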

[NLP-26] ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning

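[Quick read]: This paper addresses the storage and compute constraints that model size imposes on deploying large language models (LLMs) to edge devices. Weight-only quantization reduces model size but degrades performance at lower bit widths; standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. The proposed method, ClusComp, clusters weight matrices into codebooks and finetunes them block-by-block. It achieves superior performance at 2-4 bits, still outperforms ultra-low-bit methods at 1 bit, and enables efficient finetuning that rivals full FP16 finetuning. It also supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.

Link: https://arxiv.org/abs/2503.13089
Authors: Baohao Liao,Christian Herold,Seyyed Hadi Hashemi,Stefan Vasilev,Shahram Khadivi,Christof Monz
Affiliations: Language Technology Lab, University of Amsterdam; eBay Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 11 figures, 18 tables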

Abstract:As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
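
The core idea of clustering weights into a codebook can be sketched with plain k-means on scalars. This is a simplification under stated assumptions: the abstract does not specify how ClusComp groups weights or sizes its codebooks, the block-by-block finetuning step is not shown, and the toy weights and function name are invented.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns (codebook, codes)."""
    ordered = sorted(values)
    # Deterministic init: evenly spaced order statistics of the data.
    codebook = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    codes = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each weight goes to its nearest codebook entry.
        codes = [min(range(k), key=lambda j: abs(v - codebook[j])) for v in values]
        # Update step: each entry becomes the mean of its assigned weights.
        for j in range(k):
            members = [v for v, c in zip(values, codes) if c == j]
            if members:
                codebook[j] = sum(members) / len(members)
    return codebook, codes


weights = [-0.52, -0.48, -0.50, 0.49, 0.51, 0.50, -0.49, 0.52]
codebook, codes = kmeans_1d(weights, k=2)
restored = [codebook[c] for c in codes]  # decompressed approximation
print(codebook, codes)
```

Storage then consists of the small codebook plus one index per weight; with two entries each index needs a single bit, and the codebook values stay in floating point, so they remain trainable.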

[NLP-27] A Framework to Assess Multilingual Vulnerabilities of LLMs

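[Quick read]: This paper addresses the safety vulnerabilities of large language models (LLMs) in low-resource languages (LRL), which arise from imbalances in training data and human evaluation resources. It proposes a framework that automatically assesses the multilingual vulnerabilities of commonly used LLMs. With it, the authors evaluate six LLMs across eight languages, and validate the automated assessments through human evaluation in two languages, showing that the framework's results align with human judgments in most cases. The findings reveal vulnerabilities in LRL, though these may pose minimal risk because they often stem from the model's poor performance, which results in incoherent responses.

Link: https://arxiv.org/abs/2503.13081
Authors: Likai Tang,Niruth Bogahawatta,Yasod Ginige,Jiarui Xu,Shixuan Sun,Surangika Ranathunga,Suranga Seneviratne
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: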

Abstract:Large Language Models (LLMs) are acquiring a wider range of capabilities, including understanding and responding in multiple languages. While they undergo safety training to prevent them from answering illegal questions, imbalances in training data and human evaluation resources can make these models more susceptible to attacks in low-resource languages (LRL). This paper proposes a framework to automatically assess the multilingual vulnerabilities of commonly used LLMs. Using our framework, we evaluated six LLMs across eight languages representing varying levels of resource availability. We validated the assessments generated by our automated framework through human evaluation in two languages, demonstrating that the framework’s results align with human judgments in most cases. Our findings reveal vulnerabilities in LRL; however, these may pose minimal risk as they often stem from the model’s poor performance, resulting in incoherent responses.

[NLP-28] Overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task

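[Quick read]: This paper addresses the difficulty of evaluating large language models (LLMs) effectively and comprehensively. As LLMs grow popular in academia and industry, evaluating their capacity is increasingly critical but still challenging. Existing methods fall into two types: manual evaluation, which is expensive, and automatic evaluation, which is limited by task format (mostly multiple-choice questions) and by evaluation criteria dominated by reference-based metrics. To advance automatic evaluation, the paper presents the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task, which focuses on generative tasks and encourages reference-free methods. It includes subtasks for dialogue generation, text expansion, summary generation, and non-factoid question answering to test different methods comprehensively.

Link: https://arxiv.org/abs/2503.13038
Authors: Junjie Chen,Haitao Li,Zhumin Chu,Yiqun Liu,Qingyao Ai
Affiliations: DCST, Tsinghua University; Quan Cheng Laboratory; Zhongguancun Laboratory
Categories: Computation and Language (cs.CL)
Comments: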

Abstract:In this paper, we provide an overview of the NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) task. As large language models (LLMs) grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we propose the AEOLLM task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as dialogue generation, text expansion, summary generation and non-factoid question answering to comprehensively test different methods. This year, we received 48 runs from 4 teams in total. This paper will describe the background of the task, the data set, the evaluation measures and the evaluation results, respectively.

[NLP-29] Halving transcription time: A fast user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis

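[Quick read]: This paper addresses the time-consuming, labor-intensive nature of data transcription in qualitative research and proposes a workflow that uses artificial intelligence (AI). Automatic speech recognition produces initial transcripts, which are then formatted to be compatible with standard content analysis software (such as NVivo or MAXQDA). Data from the study suggest the workflow can reduce transcription time by up to 46.2%. The workflow is GDPR-compliant and supports local, offline transcript generation; it is suitable for students and researchers, is particularly beneficial for non-native speakers, and adapts to a variety of learning, teaching, and research environments.

Link: https://arxiv.org/abs/2503.13031
Authors: Jakob Sponholz,Andreas Weilinghoff,Juliane Schopf
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 1 table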

Abstract:In qualitative research, data transcription is often labor-intensive and time-consuming. To expedite this process, a workflow utilizing artificial intelligence (AI) was developed. This workflow not only enhances transcription speed but also addresses the issue of AI-generated transcripts often lacking compatibility with standard content analysis software. Within this workflow, automatic speech recognition is employed to create initial transcripts from audio recordings, which are then formatted to be compatible with content analysis software such as this http URL or MAXQDA. Empirical data from a study of 12 interviews suggests that this workflow can reduce transcription time by up to 46.2%. Furthermore, by using widely used standard software, this process is suitable for both students and researchers while also being adaptable to a variety of learning, teaching, and research environments. It is also particularly beneficial for non-native speakers. In addition, the workflow is GDPR-compliant and facilitates local, offline transcript generation, which is crucial when dealing with sensitive data.

[NLP-30] Dynamic Relation Inference via Verb Embeddings

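[Quick read]: This paper addresses relation inference from images. Contrastively trained cross-modal models such as CLIP match text to images well when matching objects suffices, but struggle when the match depends on representing the relationships among the objects. The proposed method, Dynamic Relation Inference via Verb Embeddings (DRIVE), augments the COCO dataset with linguistic supervision, fine-tunes CLIP with hard-negative subject-relation-object triples and their corresponding images, and introduces a new loss function to improve relation detection. It significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings and generalizes well to unseen data.

Link: https://arxiv.org/abs/2503.13021
Authors: Omri Suissa,Muhiim Ali,Ariana Azarbal,Hui Shen,Shekhar Pradhan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: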

Abstract:CLIP has demonstrated exceptional image-text matching capabilities due to its training on contrastive learning tasks. Past research has suggested that whereas CLIP effectively matches text to images when the matching can be achieved just by matching the text with the objects in the image, CLIP struggles when the matching depends on representing the relationship among the objects in the images (i.e., inferring relations). Previous attempts to address this limitation by training CLIP on relation detection datasets with only linguistic supervision have met with limited success. In this paper, we offer insights and practical methods to advance the field of relation inference from images. This paper approaches the task of creating a model that effectively detects relations among the objects in images by producing text and image embeddings that capture relationships through linguistic supervision. To this end, we propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard negatives subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection. Evaluated on multiple CLIP-based models, our method significantly improves zero-shot relation inference accuracy in both frozen and fine-tuned settings, significantly outperforming CLIP and state-of-the-art models while generalizing well on unseen data.
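
What a hard negative looks like for a subject-relation-object triple can be sketched as below. The construction rules (swapping roles, substituting the verb) are assumptions chosen for illustration; the abstract does not describe how DRIVE actually builds its negatives, and the captions are invented.

```python
def hard_negatives(triple, alternative_relations):
    """Build captions that keep the same objects but break the relation."""
    subject, relation, obj = triple
    # Role swap: same words, opposite direction of the relation.
    negatives = [f"{obj} {relation} {subject}"]
    # Verb substitution: same objects, different relation.
    negatives += [
        f"{subject} {alt} {obj}" for alt in alternative_relations if alt != relation
    ]
    return negatives


positive = ("a dog", "chasing", "a cat")
negs = hard_negatives(positive, ["chasing", "sitting next to", "carrying"])
print(negs)
```

Negatives like these are hard because an object-matching model scores them as highly as the true caption; only a representation of the relation itself can tell them apart.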

[NLP-31] Intra-neuronal attention within language models Relationships between activation and semantics

【速读】：该论文试图解决语言模型中感知机型神经元是否能够执行基于特定激活区域分割的同质类别片段识别（即“神经元内注意力”）的问题。解决方案的关键在于确定形式化神经元能否在基于激活的分割与类别分割之间建立同态关系，尤其关注高激活水平令牌上的潜在关联性。研究结果表明，这种关系仅在极高的激活水平下勉强存在，并进一步揭示了这种神经元内注意力机制如何促进后续层神经元的类别重组过程，从而逐步形成高级别的类别抽象。

Link: https://arxiv.org/abs/2503.12992
Authors: Michael Pichat, William Pogrund, Paloma Pichat, Armanouche Gasparian, Samuel Demarchi, Corbet Alois Georgeon, Michael Veillet-Guillem
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:This study investigates the ability of perceptron-type neurons in language models to perform intra-neuronal attention; that is, to identify different homogeneous categorical segments within the synthetic thought category they encode, based on a segmentation of specific activation zones for the tokens to which they are particularly responsive. The objective of this work is therefore to determine to what extent formal neurons can establish a homomorphic relationship between activation-based and categorical segmentations. The results suggest the existence of such a relationship, albeit tenuous, only at the level of tokens with very high activation levels. This intra-neuronal attention subsequently enables categorical restructuring processes at the level of neurons in the following layer, thereby contributing to the progressive formation of high-level categorical abstractions.

[NLP-32] A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models

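Quick read: This paper tackles occupation classification, the automatic annotation of job data with standardized occupations from a taxonomy, a task hampered by data scarcity and the difficulty of manual annotation. Large language models (LLMs) are promising here, but their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. The key contribution is a multi-stage framework with inference, retrieval, and reranking stages that integrates taxonomy-guided reasoning examples to align outputs with taxonomic knowledge and thereby improve performance. Evaluations show significant gains in classification accuracy and good adaptability to multi-label skill classification. The framework outperforms existing LLM-based methods and offers a practical, scalable solution for occupation classification and related tasks across LLMs.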

Link: https://arxiv.org/abs/2503.12989
Authors: Palakorn Achananuparp, Ee-Peng Lim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show significant improvements in classification accuracy. Furthermore, we demonstrate the framework’s adaptability for multi-label skill classification. Our results indicate that the framework outperforms existing LLM-based methods, offering a practical and scalable solution for occupation classification and related tasks across LLMs.

[NLP-33] HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model

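Quick read: This paper addresses the adaptability challenge that multimodal large language models (MLLMs) face in practice because all possible instruction datasets cannot be collected at once. Existing continual instruction tuning methods also tend to trade memory efficiency for performance gains, which significantly reduces overall efficiency. The key solution is a task-specific expansion and task-general fusion framework motivated by how Centered Kernel Alignment (CKA) similarity varies across model layers when training on different datasets. The paper also analyzes information leakage in the existing benchmark and builds a new, more challenging benchmark to evaluate methods more fairly. Experiments show that the proposed method significantly outperforms existing state-of-the-art methods.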

Link: https://arxiv.org/abs/2503.12941
Authors: Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Affiliations: SAIS, UCAS; MAIS, Institute of Automation, Chinese Academy of Sciences; SAI, UCAS; Centre for Artificial Intelligence and Robotics, HKISI-CAS; Department of Computer and Information Engineering, Xiamen University of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be publicly available.
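
Since the framework is motivated by CKA similarity across layers, a small sketch of linear CKA may help readers unfamiliar with the measure. This is the standard linear-kernel form, written in plain Python for clarity; how the paper extracts activations or which kernel it uses is not stated in the abstract, so the inputs here (row-aligned activation matrices for the same examples) are an assumption.

```python
import math

def center_columns(X):
    """Subtract each column's mean so every feature is zero-centered."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(A, B):
    """Squared Frobenius norm of A^T B for row-aligned matrices A and B."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(a[i] * b[j] for a, b in zip(A, B))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between activations X (n x p) and Y (n x q) of the same n inputs.

    Returns a value in [0, 1]; raises ZeroDivisionError if either matrix
    has no variance after centering.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator
```

A value near 1 for a given layer across two tasks suggests the layer's representation is largely shared; lower values suggest task-specific behaviour.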

[NLP-34] R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

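Quick read: This paper addresses the problem that current approaches to improving the reasoning of multimodal large language models (MLLMs) only have the model passively imitate correct reasoning paths, without learning what wrong reasoning paths look like. It proposes Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that lets a model self-improve its reasoning through simple, effective, and dense step-wise rewards. StepGRPO introduces two rule-based reasoning rewards: the Step-wise Reasoning Accuracy Reward (StepRAR) and the Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards reasoning paths that contain the necessary intermediate steps via soft key-step matching, while StepRVR rewards paths that follow a well-structured, logically consistent process via a reasoning completeness and logic evaluation strategy. With these components, the resulting models show strong step-by-step reasoning across multiple benchmarks.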

Link: https://arxiv.org/abs/2503.12937
Authors: Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, Dacheng Tao
Affiliations: Nanyang Technological University; Tsinghua University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recent studies generally enhance MLLMs’ reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs’ reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
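
The "group relative" part of the name refers to scoring each sampled reasoning path against the other paths sampled for the same question. A minimal sketch of that normalization follows. The way the two step-wise rewards are combined (`path_reward`, a plain sum) and the use of the population standard deviation are assumptions for illustration; the abstract does not give these details.

```python
import statistics

def path_reward(accuracy_reward, validity_reward):
    """Hypothetical combination of StepRAR-style and StepRVR-style scores.

    Assumption: a plain sum. The abstract does not say how they are weighted.
    """
    return accuracy_reward + validity_reward

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled paths: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against std == 0
    return [(r - mean) / (std + eps) for r in rewards]
```

Paths that score above the group mean receive a positive advantage and are reinforced; paths below the mean are discouraged, which is how wrong reasoning paths contribute a learning signal.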

[NLP-35] ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

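Quick read: This paper asks how different thinking patterns affect the performance of large language models (LLMs), and in particular how that effect changes with model size. Existing research lacks a systematic understanding of this question, especially of how internal thinking patterns shape performance under the "Thinking then Responding" paradigm.

The key contribution is ThinkPatterns-21k, a dataset of 21k instruction-response (QA) pairs collected from existing instruction-following datasets. Each pair is augmented with five internal thinking patterns: one unstructured pattern (monologue) and four structured variants (decomposition, self-ask, self-debate, and self-critic). Evaluation across model sizes from 3B to 32B parameters yields two findings. First, smaller models (under 30B parameters) benefit from most structured thinking patterns, whereas larger models (32B) can lose performance with structured thinking such as decomposition. Second, unstructured monologue is broadly effective across model sizes. The authors release all datasets, checkpoints, and training logs to support further research.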

Link: https://arxiv.org/abs/2503.12918
Authors: Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo
Affiliations: Hong Kong University of Science and Technology; Peking University; Zhejiang University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated enhanced performance through the Thinking then Responding paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we released all of our datasets, checkpoints, training logs of diverse thinking patterns for reproducibility, aiming to facilitate further research in this direction.

[NLP-36] HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models

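Quick read: This paper targets hallucinations in large language models (LLMs), that is, outputs that are contextually inaccurate or factually incorrect. It proposes HICD (Hallucination-Inducing via Attention Dispersion for Contrastive Decoding). HICD selects the attention heads that are crucial to the model's prediction as inducing heads, induces hallucinations by dispersing the attention of those heads, and then contrasts the induced outputs with the original outputs during decoding to obtain the final result. The method significantly improves tasks that require contextual faithfulness, such as context completion, reading comprehension, and question answering, and improves factuality on tasks that require accurate knowledge recall. The study shows that this combination of inducing-head selection and attention dispersion produces more "contrast-effective" hallucinations than other hallucination-inducing methods, offering a promising strategy for reducing hallucinations by inducing them in a controlled way.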

Link: https://arxiv.org/abs/2503.12908
Authors: Xinyan Jiang, Hang Ye, Yongxin Zhu, Xiaoying Zheng, Zikang Chen, Jun Gong
Affiliations: Shanghai Advanced Research Institute, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at ARR - February 2025

Abstract:Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model’s prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more “contrast-effective” hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.
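
The contrast step can be illustrated with a commonly used contrastive-decoding score: amplify what the original model prefers and penalize what the hallucination-induced pass prefers. The sketch below assumes that generic form with a hypothetical weight `alpha`; the exact formula HICD uses, and any plausibility filtering it applies, are not given in the abstract and are not shown here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(original_logits, induced_logits, alpha=1.0):
    """Score each token as (1 + alpha) * log p_original - alpha * log p_induced."""
    lp_orig = log_softmax(original_logits)
    lp_ind = log_softmax(induced_logits)
    return [(1 + alpha) * o - alpha * h for o, h in zip(lp_orig, lp_ind)]

def pick_token(original_logits, induced_logits, alpha=1.0):
    """Greedy choice under the contrastive score; returns a token index."""
    scores = contrastive_scores(original_logits, induced_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)
```

With `alpha=0` this reduces to ordinary greedy decoding; larger values push the choice away from tokens that the induced pass favours.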

[NLP-37] A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation

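Quick read: This paper addresses the erroneous code that language models (LMs) can produce when used for code generation in software engineering. Rather than repairing the generated code, it repairs the underlying failures of the model itself. This approach, LM repair, is lightweight: it needs little data, costs little computation, and has limited side effects, which makes it well suited to settings with limited resources, high performance demands, or strict safety requirements.

The key contribution is STAR (Semantic Targeting for Analytical Repair), a new optimization approach. STAR carries out the main operations of LM repair within an optimization process: locating "buggy neurons", solving "neuron patches", and patching the buggy neurons. Specifically, STAR computes the deltas of the weight matrix as prior information to guide optimization and uses statistical insights to attribute the target layers and neurons. Neuron patches are computed with a semantic-based analytical formula that directly links changes in the logits to changes in the neurons by steering latent representations. Compared with prior work, STAR combines the strengths of an existing repair method (MINT) and an optimization method (SGD) while mitigating their limitations, and it can resolve multiple failures together, which significantly improves its usefulness. Evaluations show that STAR is more effective and more efficient, and that it achieves a markedly better balance between generalization and specificity, the measure of side effects.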

Link: https://arxiv.org/abs/2503.12899
Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 6 tables, under peer-review

Abstract:Language Models (LMs) are widely used in software engineering for code generation, but they may produce code with errors. Rather than repairing the generated code, an alternative way is to address the underlying failures of models. LM repair offers a lightweight solution to this challenge: it requires minimal data, reduces computational costs, and reduces the side effects. Unlike retraining, LM repair focuses on applying tailored updates to targeted neurons, making it ideal for scenarios with limited resources, high-performance demands, or strict safety requirements. In this paper, we propose Semantic Targeting for Analytical Repair (STAR), a pioneering and novel semantic-based optimization approach for repairing LLMs. STAR realizes main operations in LM repair methods in an optimization process, including locating "buggy neurons", solving "neuron patches", and patching "buggy neurons". Correspondingly, it computes the deltas of weight matrix as the prior information to guide optimization; and attributes the targeted layers and neurons leveraging statistical insights. The neuron patches are computed with a solid semantic-based analytical formula, which directly bridges the changes to logits with the deltas of neurons, by steering latent representations. Compared to the prior work of LM repair (MINT) and optimization methods (SGD), STAR integrates their strengths while mitigating their limitations. STAR supports solving multiple failures together, significantly improving the usefulness. Evaluated on three code generation tasks using popular code LMs, STAR demonstrates superior effectiveness. Additionally, STAR exhibits better efficiency. In terms of side effects, namely the balance between generalization and specificity, STAR outperforms prior work by a significant margin.

[NLP-38] DAPI: Domain Adaptive Toxicity Probe Vector Intervention for Fine-Grained Detoxification

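Quick read: This paper addresses a limitation of existing linear-probe detoxification: a single toxicity probe vector struggles to remove fine-grained subcategories of toxicity. The key solution is a category-specific toxicity probe vector approach. Multiple probe vectors are first trained for different toxicity categories; during generation, the most relevant probe vector is selected dynamically from the current context, scaled dynamically, and subtracted from the model. The method mitigates toxicity in categories that the single-vector approach fails to detoxify. Experiments report up to a 78.52% reduction in toxicity on the evaluation dataset, with fluency dropping by only 0.052%.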

Link: https://arxiv.org/abs/2503.12882
Authors: Cho Hyeonsu, Dooyoung Kim, Youngjoong Ko
Affiliations: SungKyunKwan University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages, 3 figures

Abstract:There have been attempts to utilize linear probe for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be fine-grained into various subcategories, making it difficult to remove certain types of toxicity by using a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from model. Our method successfully mitigated toxicity from categories that the single probe vector approach failed to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
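
The select, scale, and subtract steps can be sketched on a single hidden-state vector. Everything specific here is an assumption for illustration: the abstract does not say how relevance is measured or how the scale is set, so this sketch picks the category whose probe vector has the largest dot product with the hidden state and scales by that projection. The category names and `strength` parameter are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer_hidden_state(hidden, probe_vectors, strength=1.0):
    """Pick the most relevant category probe and subtract a scaled copy of it.

    probe_vectors maps a category name to a probe vector (assumed unit length).
    Returns the chosen category and the steered hidden state.
    """
    scores = {name: dot(hidden, vec) for name, vec in probe_vectors.items()}
    category = max(scores, key=scores.get)
    # Assumption: scale by the projection, and never steer toward the probe direction.
    scale = strength * max(scores[category], 0.0)
    steered = [h - scale * p for h, p in zip(hidden, probe_vectors[category])]
    return category, steered
```

With unit-length probes and `strength=1.0`, the subtraction removes the hidden state's component along the selected category direction and leaves the rest untouched.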

[NLP-39] nvBench 2.0: A Benchmark for Natural Language to Visualization under Ambiguity

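Quick read: This paper addresses the difficulty that Natural Language to Visualization (NL2VIS) systems have with ambiguous queries: users often describe their visualization needs in imprecise language that the system cannot interpret reliably. It introduces nvBench 2.0, a new benchmark designed to evaluate NL2VIS systems on ambiguous queries. The key is a controlled ambiguity-injection pipeline that starts from unambiguous seed visualizations and selectively injects ambiguities, yielding multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualizations through step-wise reasoning paths. The paper also proposes Step-NL2VIS, a large language model (LLM)-based model trained on nvBench 2.0 that improves performance in ambiguous scenarios through step-wise preference optimization.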

Link: https://arxiv.org/abs/2503.12880
Authors: Tianqi Luo, Chuhan Huang, Leixian Shen, Boyan Li, Shuyu Shen, Wei Zeng, Nan Tang, Yuyu Luo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural Language to Visualization (NL2VIS) enables users to create visualizations from natural language queries, making data insights more accessible. However, NL2VIS faces challenges in interpreting ambiguous queries, as users often express their visualization needs in imprecise language. To address this challenge, we introduce nvBench 2.0, a new benchmark designed to evaluate NL2VIS systems in scenarios involving ambiguous queries. nvBench 2.0 includes 7,878 natural language queries and 24,076 corresponding visualizations, derived from 780 tables across 153 domains. It is built using a controlled ambiguity-injection pipeline that generates ambiguous queries through a reverse-generation workflow. By starting with unambiguous seed visualizations and selectively injecting ambiguities, the pipeline yields multiple valid interpretations for each query, with each ambiguous query traceable to its corresponding visualization through step-wise reasoning paths. We evaluate various Large Language Models (LLMs) on their ability to perform ambiguous NL2VIS tasks using nvBench 2.0. We also propose Step-NL2VIS, an LLM-based model trained on nvBench 2.0, which enhances performance in ambiguous scenarios through step-wise preference optimization. Our results show that Step-NL2VIS outperforms all baselines, setting a new state-of-the-art for ambiguous NL2VIS tasks.

[NLP-40] Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English

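Quick read: This paper addresses how to adapt natural language processing (NLP) models without labeled datasets when the target data distribution differs significantly from the training data. It focuses on the dialectal setting, where models are typically trained on Standard American English (SAE) but evaluated on dialects such as Indian English or Nigerian English, whose distributions differ substantially. The key approach is to study SHOT, a test-time adaptation (TTA) technique, by fine-tuning and evaluating it on different combinations of dialectal GLUE. The findings show that SHOT is viable when labeled datasets are unavailable. The paper also proposes the concept of a dialectal gap and shows that it correlates positively with the effectiveness of SHOT. In many cases, fine-tuning on SAE yields higher performance than fine-tuning on dialectal data. The code is open source.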

Link: https://arxiv.org/abs/2503.12858
Authors: Duke Nguyen, Aditya Joshi, Flora Salim
Affiliations: University of New South Wales, Australia
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Test-time adaptation (TTA) is an excellent method which helps generalize models across domains, tasks, and distributions without the use of labeled datasets. Thus, TTA is very useful in natural language processing (NLP) in the dialectal setting, since oftentimes, models are trained on Standard American English (SAE), evaluated on Indian English or Nigerian English, of which distribution differs significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most famous TTA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data. Our code is available at this https URL
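
For readers unfamiliar with SHOT, its label-free training signal is an information-maximization objective: make each prediction confident while keeping predictions spread across classes. The sketch below computes that objective from predicted class probabilities. It is a simplified illustration; SHOT's pseudo-labeling step and its frozen classifier head are not shown, and nothing here reflects how this particular paper configured it.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(q * math.log(q + eps) for q in p)

def information_maximization_loss(probs):
    """Mean per-example entropy minus the entropy of the mean prediction.

    probs is a list of per-example class-probability lists from unlabeled
    target-domain inputs. Lower is better: confident and diverse predictions.
    """
    n = len(probs)
    mean_entropy = sum(entropy(p) for p in probs) / n
    marginal = [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]
    return mean_entropy - entropy(marginal)
```

Confident predictions that cover both classes give a loss near `-log(2)` in the two-class case, while predictions that collapse onto one class give a loss near zero, so minimizing the loss discourages collapse.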

[NLP-41] VITED: Video Temporal Evidence Distillation

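Quick read: This paper addresses weak multi-step reasoning in complex video question answering (VideoQA). Existing models uniformly sample a fixed number of frames, which can miss critical evidence, especially when that evidence is distributed nonuniformly through the video. They also cannot temporally localize visual evidence within the context of the full video, an ability that complex video questions require. The key solution is a framework that augments existing VideoQA datasets with evidence reasoning chains, built automatically by searching the video for the intervals of interest and supporting evidence that maximize the likelihood of answering a given question. The model, VITED, is trained to generate these evidence chains directly, so it can both localize evidence windows and reason across them in multiple steps over long-form video, which improves performance on long-video QA benchmarks.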

Link: https://arxiv.org/abs/2503.12855
Authors: Yujie Lu, Yale Song, William Wang, Lorenzo Torresani, Tushar Nagarajan
Affiliations: UC Santa Barbara; FAIR, Meta
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We investigate complex video question answering via chain-of-evidence reasoning – identifying sequences of temporal spans from multiple relevant parts of the video, together with visual evidence within them. Existing models struggle with multi-step reasoning as they uniformly sample a fixed number of frames, which can miss critical evidence distributed nonuniformly throughout the video. Moreover, they lack the ability to temporally localize such evidence in the broader context of the full video, which is required for answering complex questions. We propose a framework to enhance existing VideoQA datasets with evidence reasoning chains, automatically constructed by searching for optimal intervals of interest in the video with supporting evidence, that maximizes the likelihood of answering a given question. We train our model (VITED) to generate these evidence chains directly, enabling it to both localize evidence windows as well as perform multi-step reasoning across them in long-form video content. We show the value of our evidence-distilled models on a suite of long video QA benchmarks where we outperform state-of-the-art approaches that lack evidence reasoning capabilities.

[NLP-42] Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

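Quick read: This paper addresses the high computational cost of reinforcement learning (RL)-based methods for improving the reasoning of large language models (LLMs) and explores a more efficient, scalable alternative. The key solution is Direct Preference Optimization (DPO), used for LLM self-improvement through iterative preference-based learning. A single round of DPO with coarse filtering already improves mathematical reasoning significantly, particularly for strong base models. The paper also designs an iterative enhancement framework for the generator and the reward model (RM), which improve each other through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, the DPO-VP model reaches RL-level performance at much lower computational overhead. These results position DPO as a scalable, cost-effective alternative to RL and a practical way to improve LLM reasoning under resource constraints.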

Link: https://arxiv.org/abs/2503.12854
Authors: Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, Dongbin Zhao
Affiliations: Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences; Wenge Technology; Fudan University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base models. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
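
As background for the entry above, the standard DPO objective for a single preference pair can be written in a few lines. The sketch takes summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The default `beta` is an arbitrary illustrative value, and how this paper builds its pairs (coarse filtering, verifiable rewards) is not shown.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin is how much more the policy prefers the chosen response over the
    rejected one, relative to the reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy matches the reference, the loss is `log(2)`; it falls as the policy shifts probability toward the chosen response and away from the rejected one.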

[NLP-43] Modelling Child Learning and Parsing of Long-range Syntactic Dependencies

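Quick read: This paper addresses how children learn long-range syntactic dependencies during language acquisition, notably those found in object wh-questions. The key is a probabilistic child language acquisition model that learns word meanings and language-specific syntax simultaneously. The model is trained on a corpus of real child-directed speech in which each utterance is paired with a logical form as its meaning representation. After training, it can deduce the correct parse tree and word meanings for a given utterance-meaning pair and can infer the meaning from the utterance alone. The successful modelling of long-range dependencies is theoretically important because it exploits aspects of the model that are, in general, trans-context-free.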

Link: https://arxiv.org/abs/2503.12832
Authors: Louis Mahon, Mark Johnson, Mark Steedman
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This work develops a probabilistic child language acquisition model to learn a range of linguistic phenomena, most notably long-range syntactic dependencies of the sort found in object wh-questions, among other constructions. The model is trained on a corpus of real child-directed speech, where each utterance is paired with a logical form as a meaning representation. It then learns both word meanings and language-specific syntax simultaneously. After training, the model can deduce the correct parse tree and word meanings for a given utterance-meaning pair, and can infer the meaning if given only the utterance. The successful modelling of long-range dependencies is theoretically important because it exploits aspects of the model that are, in general, trans-context-free.

[NLP-44] A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

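Quick read: This paper studies how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay, in order to quantify the relationship between model performance and hyperparameters. The key contribution is an empirical law in multi-power form: it combines a power law based on the sum of learning rates with additional power laws that account for the loss reduction induced by learning rate decay. Validated extensively across model sizes and architectures, the law accurately predicts loss curves for unseen schedules after fitting on a few schedules. Minimizing the predicted final pretraining loss also yields a schedule that outperforms the widely used cosine schedule. This automatically discovered schedule resembles the recently proposed Warmup-Stable-Decay (WSD) schedule but reaches a slightly lower final loss. The results offer insight into pretraining dynamics and the design of more efficient learning rate schedules.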

Link: https://arxiv.org/abs/2503.12811
Authors: Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen
Affiliations: Department of Computer Science and Technology, Tsinghua University; Qian Xuesen College, Xi’an Jiaotong University; Simons Institute, University of California, Berkeley; Peng Cheng Laboratory
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of large language models evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al, 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.
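
To make the schedules named in the abstract concrete, the sketch below defines constant, cosine, and a simplified WSD-style schedule, and computes the sum of learning rates that the law's first power-law term is based on. The fitted law itself is not reproduced, since its formula is not in the abstract. Warmup is omitted, and the linear decay shape and `decay_fraction` default are assumptions for illustration.

```python
import math

def constant_lr(step, total_steps, peak=1.0):
    """Constant schedule: the peak rate at every step."""
    return peak

def cosine_lr(step, total_steps, peak=1.0, floor=0.0):
    """Cosine decay from peak at step 0 to floor at the last step."""
    progress = step / max(total_steps - 1, 1)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

def wsd_lr(step, total_steps, peak=1.0, floor=0.0, decay_fraction=0.2):
    """Stable phase at peak, then a linear decay to floor (warmup omitted)."""
    decay_steps = round(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak
    progress = (step - decay_start) / max(decay_steps - 1, 1)
    return peak + (floor - peak) * progress

def cumulative_lr(schedule, total_steps):
    """Sum of learning rates over training, the quantity the base power law uses."""
    return sum(schedule(step, total_steps) for step in range(total_steps))
```

Comparing `cumulative_lr` across schedules shows why they are not interchangeable: over the same number of steps, a WSD-style schedule accumulates more total learning rate than cosine while still ending at a low rate.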

[NLP-45] Leveraging Deep Neural Networks for Aspect-Based Sentiment Classification

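Quick read: This paper addresses a problem in aspect-based sentiment analysis (ABSA): conventional graph convolutional networks (GCNs) can lose crucial information when extracting syntactic features. It proposes a novel edge-enhanced graph convolutional network (EEGCN) that improves performance by preserving feature integrity. For text encoding, the method combines a bidirectional long short-term memory (Bi-LSTM) network with a self-attention-based transformer so that long-range dependencies are retained. A bidirectional GCN (Bi-GCN) then captures the relationships between entities, and an aspect-specific masking technique removes extraneous information. Together these components significantly improve aspect-based sentiment analysis and overcome the difficulties of syntactic feature extraction.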

Link: https://arxiv.org/abs/2503.12803
Authors: Chen Li, Debo Cheng, Yasuhiko Morimoto
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Aspect-based sentiment analysis seeks to determine sentiment with a high level of detail. While graph convolutional networks (GCNs) are commonly used for extracting sentiment features, their straightforward use in syntactic feature extraction can lead to a loss of crucial information. This paper presents a novel edge-enhanced GCN, called EEGCN, which improves performance by preserving feature integrity as it processes syntactic graphs. We incorporate a bidirectional long short-term memory (Bi-LSTM) network alongside a self-attention-based transformer for effective text encoding, ensuring the retention of long-range dependencies. A bidirectional GCN (Bi-GCN) with message passing then captures the relationships between entities, while an aspect-specific masking technique removes extraneous information. Extensive evaluations and ablation studies on four benchmark datasets show that EEGCN significantly enhances aspect-based sentiment analysis, overcoming issues with syntactic feature extraction and advancing the field’s methodologies.

[NLP-46] DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

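Quick read: This paper addresses the limited ability of current Multimodal Large Language Models (MLLMs) to use domain knowledge for fine-grained visual discrimination. Although these models hold expert-level knowledge, they struggle to integrate reasoning into visual perception and often generate direct responses without deeper analysis. To close this gap, the paper introduces knowledge-intensive visual grounding (KVG), a new task that requires both fine-grained perception and domain-specific knowledge integration.

The key solution is DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. The approach has two main parts: (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework that combines supervised fine-tuning for cognitive reasoning scaffolding with reinforcement learning to optimize perception-cognition synergy. For evaluation, the paper introduces KVG-Bench, a dataset spanning 10 domains with 1.3K manually curated test cases. Experiments show that DeepPerception significantly outperforms direct fine-tuning, improving accuracy on KVG-Bench by 8.08% and exceeding baseline approaches by 4.60% in cross-domain generalization. The findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research.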

Link: https://arxiv.org/abs/2503.12797
Authors: Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
Affiliations: University of Macau; Tsinghua University; Northwestern Polytechnical University; Shandong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08% accuracy improvements on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at this https URL.

[NLP-47] A Reinforcement Learning-Driven Transformer GAN for Molecular Generation

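[Quick read]: This paper addresses challenges in data-driven molecular generation, in particular the sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulty of applying generative adversarial networks (GANs) to discrete data. It proposes RL-MolGAN, a framework whose key innovation is a Transformer-based first-decoder-then-encoder structure combined with reinforcement learning (RL) and Monte Carlo tree search (MCTS) to stabilize GAN training and optimize the chemical properties of generated molecules. An extension, RL-MolWGAN, adds Wasserstein distance and mini-batch discrimination to further stabilize the GAN, yielding high-quality, diverse molecular structures with desirable chemical properties.

Link: https://arxiv.org/abs/2503.12796
Authors: Chen Li, Huidong Tang, Ye Zhu, Yoshihiro Yamanishi
Affiliations: Osaka University; Nagoya University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph)
Comments: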

Abstract: Generating molecules with desired chemical properties presents a critical challenge in fields such as chemical synthesis and drug discovery. Recent advancements in artificial intelligence (AI) and deep learning have significantly contributed to data-driven molecular generation. However, challenges persist due to the inherent sensitivity of simplified molecular input line entry system (SMILES) representations and the difficulties in applying generative adversarial networks (GANs) to discrete data. This study introduces RL-MolGAN, a novel Transformer-based discrete GAN framework designed to address these challenges. Unlike traditional Transformer architectures, RL-MolGAN utilizes a first-decoder-then-encoder structure, facilitating the generation of drug-like molecules from both de novo and scaffold-based designs. In addition, RL-MolGAN integrates reinforcement learning (RL) and Monte Carlo tree search (MCTS) techniques to enhance the stability of GAN training and optimize the chemical properties of the generated molecules. To further improve the model’s performance, RL-MolWGAN, an extension of RL-MolGAN, incorporates Wasserstein distance and mini-batch discrimination, which together enhance the stability of the GAN. Experimental results on two widely used molecular datasets, QM9 and ZINC, validate the effectiveness of our models in generating high-quality molecular structures with diverse and desirable chemical properties.

[NLP-48] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

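[Quick read]: This paper addresses two problems in retrieval-augmented generation (RAG): retrieval models struggle to retrieve useful contexts, and generation models struggle to use those contexts effectively. It introduces RAG-RL, a reasoning language model (RLM) trained specifically for RAG. The key is curriculum design in the reinforcement learning post-training process, which improves model performance. The paper also shows that stronger answer generation models can identify relevant contexts within larger sets of retrieved information, easing the burden on retrievers, and can use those contexts more effectively. RAG-RL achieves state-of-the-art (SOTA) results on two open-domain question-answering datasets.

Link: https://arxiv.org/abs/2503.12759
Authors: Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, Tong Zhang
Affiliations: University of Illinois at Urbana-Champaign; NewsBreak
Categories: Computation and Language (cs.CL)
Comments: 11 Pages, 3 Figures, Preprint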

Abstract: Recent research highlights the challenges retrieval models face in retrieving useful contexts and the limitations of generation models in effectively utilizing those contexts in retrieval-augmented generation (RAG) settings. To address these challenges, we introduce RAG-RL, the first reasoning language model (RLM) specifically trained for RAG. RAG-RL demonstrates that stronger answer generation models can identify relevant contexts within larger sets of retrieved information – thereby alleviating the burden on retrievers – while also being able to utilize those contexts more effectively. Moreover, we show that curriculum design in the reinforcement learning (RL) post-training process is a powerful approach to enhancing model performance. We benchmark our method on two open-domain question-answering datasets and achieve state-of-the-art results, surpassing previous SOTA generative reader models. In addition, we offer empirical insights into various curriculum learning strategies, providing a deeper understanding of their impact on model performance.

[NLP-49] TNCSE: Tensor’s Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings

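[Quick read]: This paper addresses the fact that unsupervised sentence embedding methods do not combine direction and norm features effectively: existing methods constrain only the direction of representations and ignore the norms of positive samples. The key to the solution is a new training objective that improves unsupervised contrastive learning by constraining the norm features between positive samples, combined with ensemble learning to build a new sentence embedding framework, TNCSE (Tensor’s Norm Constraints for Sentence Embedding). The approach improves performance on semantic textual similarity tasks and outperforms other baselines in zero-shot evaluation.

Link: https://arxiv.org/abs/2503.12739
Authors: Tianyu Zong, Bingkang Shi, Hongzhu Yi, Jungang Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: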

Abstract:Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the orientation of the samples’ representations while ignoring the features of their module lengths. To address this issue, we propose a new training objective that optimizes the training of unsupervised contrastive learning by constraining the module length features between positive samples. We combine the training objective of Tensor’s Norm Constraints with ensemble learning to propose a new Sentence Embedding representation framework, TNCSE. We evaluate seven semantic text similarity tasks, and the results show that TNCSE and derived models are the current state-of-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.
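
The idea of constraining norm as well as direction can be sketched as follows. This is only an illustrative reading of the abstract: the function names, the weight `lam`, and the specific combination (one minus cosine similarity plus a weighted norm gap) are assumptions, not the TNCSE objective itself.

```python
import math


def norm(v):
    """L2 norm of an embedding given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity: compares direction only and ignores length."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def norm_gap(u, v):
    """Absolute difference between the L2 norms of two embeddings."""
    return abs(norm(u) - norm(v))


def pair_loss(u, v, lam=0.1):
    # Hypothetical objective for one positive pair:
    # a direction term plus a weighted norm term.
    return (1.0 - cosine(u, v)) + lam * norm_gap(u, v)


# Same direction, different length: a direction-only loss sees no difference.
u = [3.0, 4.0]
v = [6.0, 8.0]
print(cosine(u, v), norm_gap(u, v), pair_loss(u, v))
```

The two vectors point the same way, so a purely directional loss treats them as identical, while the norm term still penalizes the length mismatch.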

[NLP-50] Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

[Quick read]: This paper addresses the difficulty large language models (LLMs) have in cooperating in multi-agent settings, which leads to suboptimal outcomes. Inspired by Axelrod’s Iterated Prisoner’s Dilemma (IPD) tournaments, it explores how personality traits affect LLM cooperation. The key to the approach is using representation engineering to steer LLM personality traits (such as Agreeableness and Conscientiousness) and analyzing how these traits affect the tendency to cooperate and to be exploited in IPD decision-making, revealing both the potential and the limitations of personality-based steering.

Link: https://arxiv.org/abs/2503.12722
Authors: Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh “Jord” Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín
Affiliations: AI.DA STC, ST Engineering; Singapore Management University; Apart Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Poster, Technical AI Safety Conference 2025

Abstract:As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod’s Iterated Prisoner’s Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.
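
The trade-off between cooperation and exploitability can be illustrated with a tiny Iterated Prisoner’s Dilemma loop. This is a minimal sketch: the fixed strategies below stand in for steered LLM agents, and the payoff values (5, 3, 1, 0) are the conventional Axelrod tournament values, assumed here rather than taken from the paper.

```python
# Payoffs as (row player, column player) for each pair of moves.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_cooperate(own_history, opp_history):
    return "C"


def always_defect(own_history, opp_history):
    return "D"


def tit_for_tat(own_history, opp_history):
    # Cooperate first, then copy the opponent's previous move.
    return "C" if not opp_history else opp_history[-1]


def play(strategy_a, strategy_b, rounds=10):
    """Play repeated rounds and return the total score of each player."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


print(play(always_cooperate, always_defect))  # unconditional cooperator is exploited
print(play(tit_for_tat, always_defect))       # conditional cooperator limits the loss
print(play(tit_for_tat, tit_for_tat))         # mutual cooperation
```

An unconditionally cooperative agent earns nothing against a defector, while a conditional cooperator gives up only the first round, which mirrors the exploitation risk the abstract describes.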

[NLP-51] Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility

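[Quick read]: This paper addresses the failure of language models that build semantic representations from co-occurrence information to distinguish plausible from implausible events. The key to the solution is injecting latent knowledge prompted from large language models via parameter-efficient fine-tuning: task adapters learn various physical properties and association measures, and adapter fusion composes this latent semantic knowledge on top of pre-trained AlBERT embeddings. The paper also automates auxiliary task data generation, which makes it possible to scale the approach and fine-tune the learned representations on two plausibility datasets.

Link: https://arxiv.org/abs/2503.12667
Authors: Jacob Chmura, Jonah Dauvet, Sebastian Sabry
Affiliations: Mila; McGill University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: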

Abstract:Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained AlBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at this https URL.

[NLP-52] Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding

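[Quick read]: This paper addresses the poor interpretability and low user trust that arise when large multimodal models (LMMs) in autonomous driving systems lack fine-grained spatial reasoning. The core of the solution is Logic-RAG, a new framework based on retrieval-augmented generation (RAG). Logic-RAG builds a dynamic knowledge base (KB) that describes object-object relationships in first-order logic (FOL), using a perception module, a query-to-logic embedder, and a logical inference engine, to improve LMMs’ spatial understanding in driving scenes. On visual-spatial queries, baseline LMMs reach only 55% accuracy on synthetic driving scenes and under 75% on real-world scenes; with Logic-RAG these rise to over 80% and over 90%, respectively. Even without logical inference, the fact-based context alone improves accuracy by 15%. Logic-RAG is also extensible: components can be replaced seamlessly, and domain experts can add new knowledge in both FOL and natural language.

Link: https://arxiv.org/abs/2503.12663
Authors: Imran Kabir, Md Alimoor Reza, Syed Billah
Affiliations: College of Information Sciences and Technology, Pennsylvania State University; Department of Mathematics and Computer Science, Drake University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: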

Abstract:Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs’ spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at this https URL.
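
A knowledge base of object-object facts with a simple inference step might look like the sketch below. The relation name `left_of`, the tuple encoding, and the transitivity rule are all hypothetical illustrations; the abstract does not specify Logic-RAG’s predicates or inference rules.

```python
def infer_transitive(facts, relation):
    """Forward-chain one rule: R(a, b) and R(b, c) imply R(a, c)."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for rel1, a, b in list(closed):
            for rel2, c, d in list(closed):
                if rel1 == rel2 == relation and b == c:
                    derived = (relation, a, d)
                    if derived not in closed:
                        closed.add(derived)
                        changed = True
    return closed


def holds(kb, relation, subject, obj):
    """Answer a yes/no spatial query against the closed knowledge base."""
    return (relation, subject, obj) in kb


# Facts a perception module might emit for one frame (made-up example).
facts = {
    ("left_of", "car", "bus"),
    ("left_of", "bus", "truck"),
}
kb = infer_transitive(facts, "left_of")
print(holds(kb, "left_of", "car", "truck"))
print(holds(kb, "left_of", "truck", "car"))
```

The derived fact is not stated in the input; it exists only after inference, which is the kind of context a retrieval step could then hand to the LMM.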

[NLP-53] VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures

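[Quick read]: This paper addresses the problem that large language model (LLM) agents in compound AI systems often fail to meet human standards on complex reasoning tasks, which degrades overall system performance. Resolving these failures through human intervention is hard because of the agents’ opaque reasoning, misalignment with human expectations, complex agent dependencies, and the high cost of manual inspection.

The key to the solution is VeriLA (Verifying LLM Agent failures), a human-centered evaluation framework that systematically assesses agent failures to reduce human effort and make the failures interpretable. The framework first defines clear expectations for each agent by curating human-designed agent criteria. It then develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent’s execution output. This enables granular, human-standard evaluation of each agent, offers clear guidelines for revision, and reduces human cognitive load. Case study results show that VeriLA is both interpretable and efficient, helping practitioners interact with the system more effectively; by upholding accountability in human-agent collaboration, it paves the way for more trustworthy and human-aligned compound AI systems.

Link: https://arxiv.org/abs/2503.12651
Authors: Yoo Yeon Sung, Hannah Kim, Dan Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: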

Abstract: AI practitioners increasingly use large language model (LLM) agents in compound AI systems to solve complex reasoning tasks, but these agent executions often fail to meet human standards, leading to errors that compromise the system’s overall performance. Addressing these failures through human intervention is challenging due to the agents’ opaque reasoning processes, misalignment with human expectations, the complexity of agent dependencies, and the high cost of manual inspection. This paper thus introduces a human-centered evaluation framework for Verifying LLM Agent failures (VeriLA), which systematically assesses agent failures to reduce human effort and make these agent failures interpretable to humans. The framework first defines clear expectations of each agent by curating human-designed agent criteria. Then, it develops a human-aligned agent verifier module, trained with human gold standards, to assess each agent’s execution output. This approach enables granular evaluation of each agent’s performance by revealing failures from a human standard, offering clear guidelines for revision, and reducing human cognitive load. Our case study results show that VeriLA is both interpretable and efficient in helping practitioners interact more effectively with the system. By upholding accountability in human-agent collaboration, VeriLA paves the way for more trustworthy and human-aligned compound AI systems.

[NLP-54] Online Misinformation Detection in Live Streaming Videos WSDM

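[Quick read]: This paper addresses misinformation detection in live streaming videos (MDLS). Although misinformation detection has been studied extensively in offline settings, real-time detection in live streaming has not yet been studied. The paper formulates the MDLS problem, explains its importance and challenges, proposes feasible ways to turn the problem into AI challenges, and discusses potential solutions. The key requirement is techniques that can cope with real-time constraints and complex scenes in order to detect and curb misinformation in live streams.

Link: https://arxiv.org/abs/2503.12627
Authors: Rui Cao
Affiliations: Singapore Management University, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: First prize winner in the Smart City Challenge in the 16th ACM international WSDM conference (WSDM), 2023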

Abstract:Online misinformation detection is an important issue and methods are proposed to detect and curb misinformation in various forms. However, previous studies are conducted in an offline manner. We claim a realistic misinformation detection setting that has not been studied yet is online misinformation detection in live streaming videos (MDLS). In the proposal, we formulate the problem of MDLS and illustrate the importance and the challenge of the task. Besides, we propose feasible ways of developing the problem into AI challenges as well as potential solutions to the problem.

[NLP-55] UniBERTs: Adversarial Training for Language-Universal Representations

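[Quick read]: This paper addresses the high computational cost of large pre-trained models and their limited cross-lingual generalization in multilingual natural language processing. The key solution is UniBERT, a compact multilingual language model whose training framework combines three components: masked language modeling, adversarial training, and knowledge distillation. Pre-trained on a carefully curated Wikipedia corpus covering 107 languages, UniBERT reduces the computational demands of large models while remaining competitive across NLP tasks. Experiments show that multilingual training enhanced by an adversarial objective significantly improves cross-lingual generalization: UniBERT models show an average relative improvement of 7.72% over traditional baselines, which themselves achieved an average relative improvement of only 1.17%. The work highlights the benefit of combining adversarial training and knowledge distillation to build scalable, robust language models.

Link: https://arxiv.org/abs/2503.12608
Authors: Andrei-Marius Avram, Marian Lupaşcu, Dumitru-Clementin Cercel, Ionuţ Mironică, Ştefan Trăuşan-Matu
Affiliations: National University of Science and Technology POLITEHNICA Bucharest; University of Bucharest; Adobe Research; Academy of Romanian Scientists
Categories: Computation and Language (cs.CL)
Comments: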

Abstract:This paper presents UniBERT, a compact multilingual language model that leverages an innovative training framework integrating three components: masked language modeling, adversarial training, and knowledge distillation. Pre-trained on a meticulously curated Wikipedia corpus spanning 107 languages, UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks. Comprehensive evaluations on four tasks – named entity recognition, natural language inference, question answering, and semantic textual similarity – demonstrate that our multilingual training strategy enhanced by an adversarial objective significantly improves cross-lingual generalization. Specifically, UniBERT models show an average relative improvement of 7.72% over traditional baselines, which achieved an average relative improvement of only 1.17%, with statistical analysis confirming the significance of these gains (p-value = 0.0181). This work highlights the benefits of combining adversarial training and knowledge distillation to build scalable and robust language models, thereby advancing the field of multilingual and cross-lingual natural language processing.

[NLP-56] MoECollab: Democratizing LLM Development Through Collaborative Mixture of Experts

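[Quick read]: This paper addresses the increasing centralization of large language model (LLM) development, which limits participation by organizations with few resources. It proposes MoECollab, a framework that uses a Mixture of Experts (MoE) architecture to enable distributed, collaborative LLM development. The key is decomposing a monolithic model into specialized expert modules coordinated by a trainable gating network, so that contributors can participate regardless of their computational resources. The paper provides a complete technical implementation with mathematical foundations for expert dynamics, gating mechanisms, and integration strategies, reducing computational requirements while improving accuracy and domain-specific performance.

Link: https://arxiv.org/abs/2503.12592
Authors: Harshit
Affiliations: IIIT Delhi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: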

Abstract:Large Language Model (LLM) development has become increasingly centralized, limiting participation to well-resourced organizations. This paper introduces MoECollab, a novel framework leveraging Mixture of Experts (MoE) architecture to enable distributed, collaborative LLM development. By decomposing monolithic models into specialized expert modules coordinated by a trainable gating network, our framework allows diverse contributors to participate regardless of computational resources. We provide a complete technical implementation with mathematical foundations for expert dynamics, gating mechanisms, and integration strategies. Experiments on multiple datasets demonstrate that our approach achieves accuracy improvements of 3-7% over baseline models while reducing computational requirements by 34%. Expert specialization yields significant domain-specific gains, with improvements from 51% to 88% F1 score in general classification and from 23% to 44% accuracy in news categorization. We formalize the routing entropy optimization problem and demonstrate how proper regularization techniques lead to 14% higher expert utilization rates. These results validate MoECollab as an effective approach for democratizing LLM development through architecturally-supported collaboration.
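
The gating idea can be sketched in a few lines: a softmax over gate logits weights the expert outputs, and the entropy of those weights measures how evenly experts are used. This is a generic mixture-of-experts sketch under stated assumptions; the function names are hypothetical, and the paper’s actual gating network and entropy regularizer are not given in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, expert_outputs):
    """Mix expert output vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, mixed


def routing_entropy(weights):
    """Shannon entropy of the gate weights; higher means more even usage."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Two experts, equal gate logits: each expert contributes half.
weights, mixed = route([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, mixed, routing_entropy(weights))
```

A gate that collapses onto one expert drives the entropy toward zero, which is why an entropy term is a natural handle for encouraging higher expert utilization.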

[NLP-57] RaSA: Rank-Sharing Low-Rank Adaptation ICLR2025

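[Quick read]: This paper addresses the limited expressive capacity of low-rank adaptation (LoRA), which is a bottleneck in demanding tasks such as code generation and mathematical reasoning because of the low-rank constraint. The proposed solution is Rank-Sharing Low-Rank Adaptation (RaSA), whose key idea is partial rank sharing across layers. RaSA forms a shared rank pool and applies layer-specific weighting, effectively increasing the number of ranks without a significant increase in parameter overhead, and substantially improves performance on code and math tasks.

Link: https://arxiv.org/abs/2503.12576
Authors: Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
Affiliations: Shanghai Jiao Tong University; Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: ICLR 2025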

Abstract:Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at: this https URL.
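
Why sharing ranks can raise effective rank at roughly constant cost is easiest to see with a parameter count. The accounting below is a hypothetical reading of the abstract, not the paper’s construction: it assumes every layer donates `shared` of its `rank` ranks to one pool that all layers reuse, with one scalar weight per pooled rank per layer.

```python
def lora_params(num_layers, rank, d_in, d_out):
    """Trainable parameters when each layer owns a private rank-`rank` pair."""
    return num_layers * rank * (d_in + d_out)


def shared_pool_params(num_layers, rank, shared, d_in, d_out):
    """Assumed accounting for partial rank sharing across layers."""
    private = num_layers * (rank - shared) * (d_in + d_out)
    pool_rank = num_layers * shared
    pool = pool_rank * (d_in + d_out)
    # Layer-specific weighting: one scalar per pooled rank per layer.
    weights = num_layers * pool_rank
    return private + pool + weights


def effective_rank(num_layers, rank, shared):
    """Ranks available to one layer: its private ranks plus the whole pool."""
    return (rank - shared) + num_layers * shared


layers, r, k, d = 4, 8, 2, 16
print(lora_params(layers, r, d, d))
print(shared_pool_params(layers, r, k, d, d))
print(effective_rank(layers, r, k))
```

Under these assumptions the only added cost is the per-layer scalar weights, while each layer can draw on more ranks than it could with a private budget alone.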

[NLP-58] Multi-Granular Multimodal Clue Fusion for Meme Understanding AAAI2025

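[Quick read]: This paper addresses two key problems in multimodal meme understanding (MMU): the loss of fine-grained metaphorical visual clues, and the neglect of the weak correlation between text and image. To overcome these limitations it proposes a multi-granular multimodal clue fusion model (MGMCF). First, an object-level semantic mining module extracts object-level image feature clues, giving fine-grained feature extraction and a better ability to capture metaphorical details and semantics. Second, a new global-local cross-modal interaction model uses bidirectional cross-modal attention to promote interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations. Finally, a dual-semantic guided training strategy improves the model’s understanding and alignment of multimodal representations in the semantic space. Experiments show significant improvements over state-of-the-art baselines on multiple tasks.

Link: https://arxiv.org/abs/2503.12560
Authors: Li Zheng, Hao Fei, Ting Dai, Zuquan Peng, Fei Li, Huisheng Ma, Chong Teng, Donghong Ji
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2025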

Abstract:With the continuous emergence of various social media platforms frequently used in daily life, the multimodal meme understanding (MMU) task has been garnering increasing attention. MMU aims to explore and comprehend the meanings of memes from various perspectives by performing tasks such as metaphor recognition, sentiment analysis, intention detection, and offensiveness detection. Despite making progress, limitations persist due to the loss of fine-grained metaphorical visual clue and the neglect of multimodal text-image weak correlation. To overcome these limitations, we propose a multi-granular multimodal clue fusion model (MGMCF) to advance MMU. Firstly, we design an object-level semantic mining module to extract object-level image feature clues, achieving fine-grained feature clue extraction and enhancing the model’s ability to capture metaphorical details and semantics. Secondly, we propose a brand-new global-local cross-modal interaction model to address the weak correlation between text and images. This model facilitates effective interaction between global multimodal contextual clues and local unimodal feature clues, strengthening their representations through a bidirectional cross-modal attention mechanism. Finally, we devise a dual-semantic guided training strategy to enhance the model’s understanding and alignment of multimodal representations in the semantic space. Experiments conducted on the widely-used MET-MEME bilingual dataset demonstrate significant improvements over state-of-the-art baselines. Specifically, there is an 8.14% increase in precision for offensiveness detection task, and respective accuracy enhancements of 3.53%, 3.89%, and 3.52% for metaphor recognition, sentiment analysis, and intention detection tasks. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing MMU.

[NLP-59] AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

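[Quick read]: This paper addresses the context-length limit that multimodal large language models (MLLMs) face when processing long videos. Existing methods compress videos by exploiting visual redundancy uniformly, with promising results, but quantitative analysis shows that redundancy varies significantly across time and across model layers, which calls for a more flexible compression strategy. The paper proposes AdaReTaKe, a training-free method whose key idea is to allocate compression ratios among time steps and layers, with theoretical guarantees, to reduce visual redundancy flexibly. With AdaReTaKe, MLLMs’ processing capacity rises from 256 to 2048 frames while preserving critical information, and the method outperforms existing approaches on several datasets, with gains of 5.9% and 6.0% for 7B and 72B models on LVBench, the longest dataset.

Link: https://arxiv.org/abs/2503.12559
Authors: Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen; Huawei Technologies Co., Ltd.; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: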

Abstract:Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at this https URL.

[NLP-60] From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations NAACL2025

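[Quick read]: This paper addresses a core challenge for large language models (LLMs) in multi-turn dialogue: staying coherent while adapting to user-specific information. It focuses on the "persona knowledge gap", the discrepancy between a model’s internal understanding and the knowledge needed for coherent, personalized conversation. Prior work has recognized the problem, but computational methods for detecting and resolving it remain underexplored. The paper proposes Conversation Preference Elicitation and Recommendation (CPER), a framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER has three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses to accumulated user context. On two real-world datasets, human evaluators preferred CPER’s responses 42% more often than baseline models on CCPE-M and 27% more often on ESConv, and qualitative analysis attributes this to better contextual relevance and coherence, especially in longer (12+ turn) conversations.

Link: https://arxiv.org/abs/2503.12556
Authors: Sarvesh Baskar, Tanmay Tulsidas Verelakar, Srinivasan Parthasarathy, Manas Gaur
Affiliations: BITS Pilani, Goa; University of Maryland Baltimore County, Baltimore, MD, USA; Ohio State University, Columbus, OH, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 Figure, Oral Presentation at NAACL 2025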

Abstract: In multi-turn dialogues, large language models (LLMs) face a critical challenge of ensuring coherence while adapting to user-specific information. This study introduces the persona knowledge gap, the discrepancy between a model’s internal understanding and the knowledge required for coherent, personalized conversations. While prior research has recognized these gaps, computational methods for their identification and resolution remain underexplored. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context. We evaluate CPER on two real-world datasets: CCPE-M for preferential movie recommendations and ESConv for mental health support. Using A/B testing, human evaluators preferred CPER’s responses 42% more often than baseline models in CCPE-M and 27% more often in ESConv. A qualitative human evaluation confirms that CPER’s responses are preferred for maintaining contextual relevance and coherence, particularly in longer (12+ turn) conversations.

[NLP-61] Basic Category Usage in Vision Language Models

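[Quick read]: This paper investigates whether vision-language models (VLMs) behave like humans in basic level categorization, and whether they acquire the more complex features of human cognitive categorization. The key is an analysis of two open-source VLMs (Llama 3.2 Vision Instruct and Molmo 7B-D) to test whether they prefer the same basic level categories as humans, including nuanced human patterns such as the biological versus non-biological basic level effect and the expert basic level shift. The results show that the models’ behavior is consistent with human categorization preferences, suggesting that they acquire cognitive categorization behaviors from the human data on which they are trained.

Link: https://arxiv.org/abs/2503.12530
Authors: Hunter Sawyer, Jesse Roberts, Kyle Moore
Affiliations: Tennessee Tech University; Vanderbilt University
Categories: Computation and Language (cs.CL)
Comments: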

Abstract:The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level categorization consistent with human behavior. Moreover, the models’ preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well established expert basic level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.

[NLP-62] Investigating Human-Aligned Large Language Model Uncertainty

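[Quick read]: This paper addresses how to quantify large language model uncertainty in order to facilitate model control and modulate user trust. The key is evaluating a variety of uncertainty measures to find those that correlate with human group-level uncertainty. The study finds that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. Some individually strong measures become less human-like as model size grows, but multiple linear regression shows that combining several uncertainty measures gives comparable human alignment with reduced dependence on model size.

Link: https://arxiv.org/abs/2503.12528
Authors: Kyle Moore, Jesse Roberts, Daryl Watson, Pamela Wisniewski
Affiliations: Vanderbilt University; Tennessee Tech University
Categories: Computation and Language (cs.CL)
Comments: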

Abstract: Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. We find that some strong measures decrease in human-similarity with model size, but, by multiple linear regression, we find that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.
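
One plausible way to compute a top-k entropy is sketched below. The definition used here (keep the k most probable tokens, renormalize, take Shannon entropy) is an assumption for illustration; the abstract does not state the paper’s exact formula.

```python
import math


def top_k_entropy(probs, k):
    """Shannon entropy (in nats) of the k largest probabilities, renormalized."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("top-k probabilities must have positive mass")
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


# A next-token distribution with two equally likely leading candidates.
next_token_probs = [0.4, 0.4, 0.1, 0.1]
print(top_k_entropy(next_token_probs, 2))
print(top_k_entropy(next_token_probs, 4))
```

Restricting to the top k ignores the long tail of the vocabulary, so the measure reflects uncertainty among the candidates the model is actually choosing between.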

[NLP-63] EXAONE Deep: Reasoning Enhanced Language Models

[Quick read]: This paper aims to improve performance on reasoning tasks, especially math and coding benchmarks. The key to the solution is training mainly on a reasoning-specialized dataset that incorporates long streams of thought processes. Trained on this data, the EXAONE Deep series performs well at every scale: the smaller models (2.4B and 7.8B parameters) outperform other models of comparable size, and the largest 32B model is competitive with leading open-weight models. All models are openly available for research purposes and can be downloaded.

Link: https://arxiv.org/abs/2503.12524
Authors: LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2412.04862, arXiv:2408.03541

Abstract:We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from this https URL

[NLP-64] CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences ICLR2025

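[Quick read]: This paper addresses the heavy inference burden caused by growing key-value (KV) cache demand when large language models (LLMs) process long sequences. Existing KV cache eviction methods help, but they fail to allocate resources rationally across layers with different attention patterns. The paper proposes Cascading and Adaptive KV cache Eviction (CAKE), which frames KV cache eviction as a "cake-slicing problem". The key is assessing layer-specific preferences from attention dynamics in both spatial and temporal dimensions, allocating cache size to layers accordingly, and managing memory constraints in a cascading manner. CAKE also introduces a new eviction indicator that accounts for the shifting importance of tokens over time, addressing existing methods’ neglect of temporal dynamics. Experiments show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across models and memory constraints, particularly in low-memory settings.

Link: https://arxiv.org/abs/2503.12491
Authors: Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Affiliations: Shanghai Jiao Tong University; Ant Group; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICLR 2025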

Abstract:Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a “cake-slicing problem.” CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at this https URL.
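
The "cake-slicing" framing can be sketched as splitting one global cache budget across layers in proportion to a per-layer preference score, then keeping only the highest-scoring tokens within each layer. Everything here is a simplified assumption: the preference scores and token scores are placeholders, and CAKE’s actual preference metric, cascading schedule, and eviction indicator are not reproduced.

```python
def allocate_budget(preferences, total_budget):
    """Split an integer cache budget across layers proportionally to preference."""
    total_pref = sum(preferences)
    raw = [total_budget * p / total_pref for p in preferences]
    sizes = [int(x) for x in raw]
    # Hand out slots lost to rounding, largest fractional part first.
    leftover = total_budget - sum(sizes)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - sizes[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        sizes[i] += 1
    return sizes


def keep_tokens(token_scores, keep):
    """Return the positions of the `keep` highest-scoring tokens, in order."""
    ranked = sorted(
        range(len(token_scores)), key=lambda i: token_scores[i], reverse=True
    )
    return sorted(ranked[:keep])


# Three layers with placeholder preference scores and 10 cache slots overall.
sizes = allocate_budget([1.0, 2.0, 3.0], 10)
print(sizes)
print(keep_tokens([0.1, 0.9, 0.5, 0.7], 2))
```

Layers with higher preference receive larger slices while the total stays within the global budget, which is the global view of allocation the abstract emphasizes.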

[NLP-65] HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs

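[Quick read]: This paper addresses the distinctive challenges that Hong Kong Cantonese poses for natural language processing, namely its cultural nuances and the lack of dedicated evaluation datasets. The key contribution is the HKCanto-Eval benchmark, which evaluates large language models (LLMs) on Cantonese language understanding tasks and extends to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances specific to Hong Kong and includes questions designed to probe the models' underlying linguistic metaknowledge, providing a robust framework for assessing language models in realistic scenarios. The findings indicate that although proprietary models generally outperform open-weight models, current models still show significant limitations in Cantonese-specific linguistic and cultural knowledge, which underscores the need for more targeted training data and evaluation methods.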

Link: https://arxiv.org/abs/2503.12440
Authors: Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
Affiliations: Kyushu University; The Education University of Hong Kong; The University of Hong Kong; Votee Limited
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at this https URL

[NLP-66] CorpusStudio: Surfacing Emergent Patterns in a Corpus of Prior Work while Writing

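[Quick read]: This paper addresses the difficulty of understanding and applying the implicit writing norms of communities such as the scientific community. Writers usually absorb these norms gradually by reading papers and receiving feedback, but find them hard to externalize and apply to their own writing. The paper proposes two new writing support concepts: (1) an ordered distribution over section titles in a corpus, and (2) many contextually relevant sentences retrieved according to the user's draft and cursor location, with recurring words algorithmically highlighted to help users spot emergent norms. Study results (N=16) show that participants used these concepts to revise document structure and content and, after reviewing many examples, gained confidence in aligning with or breaking norms. This demonstrates the value of reifying distributions over other authors' writing choices during the writing process.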

Link: https://arxiv.org/abs/2503.12436
Authors: Hai Dang, Chelse Swoopes, Daniel Buschek, Elena L. Glassman
Affiliations: University of Bayreuth; Harvard University
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 19 pages, 12 figures, 1 table, ACM CHI 2025

Abstract:Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading papers and receiving feedback on their writing. However, it is difficult to both externalize this knowledge and apply it to one’s own writing. We propose two new writing support concepts that reify document and sentence-level patterns in a given text corpus: (1) an ordered distribution over section titles and (2) given the user’s draft and cursor location, many retrieved contextually relevant sentences. Recurring words in the latter are algorithmically highlighted to help users see any emergent norms. Study results (N=16) show that participants revised the structure and content using these concepts, gaining confidence in aligning with or breaking norms after reviewing many examples. These results demonstrate the value of reifying distributions over other authors’ writing choices during the writing process.

[NLP-67] Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs

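[Quick read]: This paper examines the relationship between conversational grounding and task success in task-oriented dialogue. It finds that conversational friction, meaning disruptions in conversational flow caused by diverging beliefs and assumptions among participants, correlates significantly with task success. The central task is identifying this friction: large language models (LLMs) can detect overt cases, but they struggle with subtler, more context-dependent instances that require pragmatic or domain-specific reasoning.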

Link: https://arxiv.org/abs/2503.12370
Authors: Rupak Sarkar, Neha Srikanth, Taylor Hudson, Rachel Rudinger, Claire Bonial, Philip Resnik
Affiliations: University of Maryland, College Park; Oak Ridge Associated Universities; Army Research Lab
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. We find that although LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances requiring pragmatic or domain-specific reasoning.

[NLP-68] IPCGRL: Language-Instructed Reinforcement Learning for Procedural Level Generation

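[Quick read]: This paper addresses the limited controllability and generalizability of deep reinforcement learning (DRL) agents that use text instructions for procedural content generation (PCG). Its key contribution is IPCGRL, an instruction-based PCG method that incorporates a sentence embedding model and fine-tunes task-specific embedding representations to compress game-level conditions effectively. Optimizing these task-specific embeddings yields better controllability and generalization, and the method extends the modality of conditional input, giving procedural content generation a more flexible and expressive interaction framework.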

Link: https://arxiv.org/abs/2503.12358
Authors: In-Chang Baek, Sung-Hyun Kim, Seo-yung Lee, Dong-Hyun Lee, Kyung-Joong Kim
Affiliations: Gwangju Institute of Science and Technology (GIST); Dongseo University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 9 figures, 3 tables

Abstract:Recent research has highlighted the significance of natural language in enhancing the controllability of generative models. While various efforts have been made to leverage natural language for content generation, research on deep reinforcement learning (DRL) agents utilizing text-based instructions for procedural content generation remains limited. In this paper, we propose IPCGRL, an instruction-based procedural content generation method via reinforcement learning, which incorporates a sentence embedding model. IPCGRL fine-tunes task-specific embedding representations to effectively compress game-level conditions. We evaluate IPCGRL in a two-dimensional level generation task and compare its performance with a general-purpose embedding method. The results indicate that IPCGRL achieves up to a 21.4% improvement in controllability and a 17.2% improvement in generalizability for unseen instructions. Furthermore, the proposed method extends the modality of conditional input, enabling a more flexible and expressive interaction framework for procedural content generation.

[NLP-69] Numerical Words and Linguistic Loops: The Perpetual Four-Letter Routine

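[Quick read]: This paper explores a curious linguistic property linking the number of letters in a word to the spelled-out form of that number. Take any word, count its letters, spell out the count, and count the letters again. Repeated on a dataset of 100,000 random words, this process always converges to the number four (4), which the paper calls the Linguistic Loop (LL) constant. An analysis of 73 languages that use the Latin alphabet reveals distinct patterns: 28 languages are LL-positive and follow the property, 31 are LL-negative and do not, and another 13 show more nuanced tendencies. The paper's contribution is to identify and describe this property of number words in Latin-alphabet languages and to raise questions about the linguistic and cognitive mechanisms behind it.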

Link: https://arxiv.org/abs/2503.12357
Authors: Krishna Chaitanya Polavaram
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, 2 tables

Abstract:This study presents a fascinating linguistic property related to the number of letters in words and their corresponding numerical values. By selecting any arbitrary word, counting its constituent letters, and subsequently spelling out the resulting count and tallying the letters anew, an unanticipated pattern is observed. Remarkably, this iterative sequence, conducted on a dataset of 100,000 random words, invariably converges to the numeral four (4), termed the Linguistic Loop (LL) constant. Examining 73 languages utilizing the Latin alphabet, this research reveals distinctive patterns. Among them, 28 languages exhibit LL-positive behavior adhering to the established property, while 31 languages deviate as LL-negative. Additionally, 13 languages display nuanced tendencies: eight feature two LL constants (bi-positivity), and five feature three constants (tri-positivity). This discovery highlights a linguistic quirk within Latin alphabet-based language number-word representations, uncovering an intriguing facet across diverse alphabetic systems. It also raises questions about the underlying linguistic and cognitive mechanisms responsible for this phenomenon.
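
The iteration described in the abstract is easy to reproduce for English. The following is a minimal sketch, assuming English number words written without spaces or hyphens and inputs of fewer than 100 letters; the helper names `number_word` and `letter_loop` are made up for illustration and do not come from the paper.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_word(n: int) -> str:
    """Spell out 0..99 in English, without spaces or hyphens."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only spells out 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (ONES[ones] if ones else "")


def letter_loop(word: str) -> list[int]:
    """Return the successive letter counts until a count maps to itself."""
    counts = []
    n = sum(ch.isalpha() for ch in word)
    while True:
        counts.append(n)
        nxt = len(number_word(n))
        if nxt == n:  # fixed point reached ("four" has four letters)
            return counts
        n = nxt


trace = letter_loop("linguistics")
```

For "linguistics" the counts run 11, 6, 3, 5, 4: "eleven" has six letters, "six" has three, "three" has five, "five" has four, and "four" maps to itself.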

[NLP-70] Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs

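[Quick read]: This paper addresses how to generate synthetic training data efficiently while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) is effective but requires substantial compute, while prompt-based methods depend on manually designed prompts and make poor use of private information. The paper proposes CTCL (Data Synthesis with ConTrollability and CLustering), a framework that pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. The generator is then DP finetuned on private data to capture fine-grained textual information, and the topic model extracts a DP histogram that represents distributional information. Finally, the generator samples according to the DP histogram to synthesize the desired number of examples. The approach avoids both extensive prompt engineering and finetuning billion-scale LLMs, overcoming the limitations of existing methods.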

Link: https://arxiv.org/abs/2503.12347
Authors: Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution, depend heavily on the manual prompts, and ineffectively use private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
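
The "DP histogram, then sample" step can be sketched with the standard Laplace mechanism. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say which noise mechanism CTCL uses, the topic counts below are hypothetical, and the sketch assumes each private record falls into exactly one topic bin so that adding or removing one record changes one count by 1.

```python
import random
from collections import Counter


def dp_histogram(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each bin, clip at zero, and normalize."""
    # The difference of two Exp(rate=epsilon) draws is Laplace with scale 1/epsilon.
    noisy = [c + rng.expovariate(epsilon) - rng.expovariate(epsilon) for c in counts]
    clipped = [max(0.0, v) for v in noisy]
    total = sum(clipped)
    if total == 0.0:  # degenerate case: fall back to a uniform histogram
        return [1.0 / len(counts)] * len(counts)
    return [v / total for v in clipped]


def sample_topics(hist, n, rng):
    """Draw n topic indices according to the noisy histogram."""
    return Counter(rng.choices(range(len(hist)), weights=hist, k=n))


rng = random.Random(0)
hist = dp_histogram([120, 45, 5], epsilon=1.0, rng=rng)  # hypothetical topic counts
plan = sample_topics(hist, n=100, rng=rng)  # how many examples to generate per topic
```

Clipping and normalizing are post-processing of the noisy counts, so they do not consume additional privacy budget; a conditional generator would then be asked for `plan[i]` examples on topic `i`.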

[NLP-71] General Table Question Answering via Answer-Formula Joint Generation

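[Quick read]: This paper addresses the lack of versatility of existing advanced table question answering (TableQA) methods when they face specific question types or table structures. These methods prompt large language models (LLMs) to generate answer text, SQL queries, Python code, or custom operations and perform well on complex reasoning, yet they adapt poorly to diverse scenarios. Spreadsheet formulas, a widely used and well-defined operation language for tabular data, have not been thoroughly explored for TableQA. The paper therefore uses formulas as the logical form for complex reasoning over tables with different structures.

The key contribution is FormulaQA, a large formula-annotated TableQA dataset, together with TabAF, a general table answering framework. Unlike existing methods, TabAF decodes answers and formulas with a single LLM backbone, showing strong versatility and generalization. TabAF based on Llama3.1-70B achieves new state-of-the-art performance on WikiTableQuestions, HiTab, and TabFact.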

Link: https://arxiv.org/abs/2503.12345
Authors: Zhongyuan Wang, Richong Zhang, Zhijie Nie
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress

Abstract: Advanced table question answering (TableQA) methods prompt large language models (LLMs) to generate answer text, SQL queries, Python code, or custom operations, which impressively improves performance on complex reasoning problems in the TableQA task. However, these methods lack the versatility to cope with specific question types or table structures. In contrast, the Spreadsheet Formula, the widely used and well-defined operation language for tabular data, has not been thoroughly explored to solve TableQA. In this paper, we make a first attempt to use Formula as the logical form for solving complex reasoning on tables with different structures. Specifically, we construct a large Formula-annotated TableQA dataset, FormulaQA, from existing datasets. In addition, we propose TabAF, a general table answering framework to solve multiple types of tasks over multiple types of tables simultaneously. Unlike existing methods, TabAF decodes answers and Formulas with a single LLM backbone, demonstrating great versatility and generalization. TabAF based on Llama3.1-70B achieves new state-of-the-art performance on WikiTableQuestions, HiTab, and TabFact.

[NLP-72] SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression NAACL2025

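[Quick read]: This paper addresses the failure of existing Singular Value Decomposition (SVD)-based compression methods for large language models (LLMs) to reduce truncation loss sufficiently, which makes the compressed models less competitive. It proposes SVD-LLM V2, which optimizes singular value truncation with two techniques. First, it uses the theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers, accommodating heterogeneity in weight redundancy. Second, it proposes loss-optimized weight truncation so that the truncated singular values yield a lower and more stable truncation loss in practice. With these improvements, the compressed models outperform state-of-the-art SVD-based LLM compression methods.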

Link: https://arxiv.org/abs/2503.12340
Authors: Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang
Affiliations: The Ohio State University
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025; Code available at this https URL

Abstract:Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, a SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 proposes to use theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers to accommodate weight redundancy heterogeneity. Second, SVD-LLM V2 proposes loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at this https URL
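
As background for the term "truncation loss", the standard linear-algebra fact behind SVD compression is worth stating. The abstract does not give the exact loss that SVD-LLM V2 optimizes, so the identity below is only the textbook case of truncating a weight matrix $W$ of rank $r$ to its top $k$ singular values (the Eckart-Young result), with $\sigma_1 \ge \dots \ge \sigma_r$ the singular values and $u_i$, $v_i$ the singular vectors.

```latex
W = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top},
\qquad
W_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top},
\qquad
\lVert W - W_k \rVert_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^{2}} .
```

In this plain setting, the loss depends only on the discarded singular values, which is why matrices with fast-decaying spectra tolerate a higher compression ratio than matrices with flat spectra.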

[NLP-73] CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

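[Quick read]: This paper addresses two key questions: (1) How well do current Vision-Language Models (VLMs) actually perform on image captioning, particularly compared to humans? (2) Can existing automated metrics reliably assess the quality of detailed captions? To answer them, the paper builds CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes, enabling arena-style evaluation of models. Using the human annotations from CapArena, it evaluates traditional and recent captioning metrics as well as VLM-as-a-Judge. The study finds that although some metrics (e.g., METEOR) agree reasonably well with humans at the caption level, their systematic biases lead to inconsistent model rankings, whereas VLM-as-a-Judge shows robust discernment at both the caption and model levels. Building on these insights, the paper releases CapArena-Auto, an accurate and efficient automated benchmark that achieves 94.3% correlation with human rankings at just $4 per test.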

Link: https://arxiv.org/abs/2503.12329
Authors: Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; The University of Hong Kong; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract: Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at this https URL.

[NLP-74] One Goal, Many Challenges: Robust Preference Optimization Amid Content-Aware and Multi-Source Noise

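[Quick read]: This paper addresses the limitations of existing preference alignment methods when human feedback is noisy. These methods usually assume unbiased feedback, but real-world feedback often carries several kinds of content-dependent noise. The paper proposes Content-Aware Noise-Resilient Preference Optimization (CNRPO), a framework that uses multi-objective optimization to separate true preferences from content-aware noise and leverages backdoor attack mechanisms to learn and control multiple noise sources efficiently within a single model. Theoretical analysis and experiments show that CNRPO significantly improves alignment with primary human preferences while controlling for secondary noise and biases.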

Link: https://arxiv.org/abs/2503.12301
Authors: Amirabbas Afzali, Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall
Affiliations: Aktus AI; Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have made significant strides in generating human-like responses, largely due to preference alignment techniques. However, these methods often assume unbiased human feedback, which is rarely the case in real-world scenarios. This paper introduces Content-Aware Noise-Resilient Preference Optimization (CNRPO), a novel framework that addresses multiple sources of content-dependent noise in preference learning. CNRPO employs a multi-objective optimization approach to separate true preferences from content-aware noises, effectively mitigating their impact. We leverage backdoor attack mechanisms to efficiently learn and control various noise sources within a single model. Theoretical analysis and extensive experiments on different synthetic noisy datasets demonstrate that CNRPO significantly improves alignment with primary human preferences while controlling for secondary noises and biases, such as response length and harmfulness.

[NLP-75] The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

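[Quick read]: This paper addresses the anglo-centric bias of datasets used to pretrain large language models and the shortage of French and related cultural material in existing datasets. It also prioritizes data rights by minimizing the use of copyrighted material, in support of a more open data ecosystem. The key contribution is the Lucie Training Dataset, a multilingual collection centered on French that draws not only on traditional web sources but also on French cultural heritage documents, and that favors open, non-copyrighted material. Building on this dataset, the paper presents the Lucie-7B foundation model and instruction fine-tuned models. Lucie-7B is trained on equal amounts of French and English data in an effort to better represent cultural aspects of French-speaking communities, showing that an open approach that prioritizes data rights can still deliver strong performance.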

Link: https://arxiv.org/abs/2503.12294
Authors: Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community
Affiliations: LINAGORA; LORIA; CEA List
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English – roughly 33% each – in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.

[NLP-76] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes

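[Quick read]: This paper addresses the poor performance of large language models (LLMs) on phenotype-driven gene prioritization for rare diseases, and in particular the challenge of predicting candidate genes or disease diagnoses from unstructured clinical notes. Prior studies typically prompt models with standardized terms such as Human Phenotype Ontology (HPO) terms, whereas real-world inputs are mostly unstructured clinical text, which makes LLMs hard to apply directly.

The key contribution is two methods that combine Retrieval-Augmented Generation (RAG) with Chain-of-Thought (CoT): RAG-driven CoT and CoT-driven RAG. RAG retrieves relevant information from knowledge sources such as HPO and OMIM, while CoT follows a five-question protocol that mimics expert reasoning and guides the model through the clinical note step by step. Both methods improve how LLMs handle unstructured data and are evaluated on several rare disease datasets. With a DeepSeek backbone, both reach a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes.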

Link: https://arxiv.org/abs/2503.12286
Authors: Da Wu, Zhanliang Wang, Quan Nguyen, Kai Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Comments: 31 pages, 3 figures

Abstract: Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Children's Hospital of Philadelphia. Results: We found that recent foundation models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with a DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has an advantage when processing lengthy and noisy notes.

[NLP-77] Interpretation Gaps in LLM-Assisted Comprehension of Privacy Documents

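[Quick read]: This paper examines the gaps in accuracy, completeness, clarity, and representation that can arise when a large language model (LLM) is used to obtain simplified interpretations of data practices from a complex privacy policy. It illustrates these gaps with examples and calls for continued research to realize the potential of LLMs in revolutionizing privacy management through personal assistants and automated compliance checking.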

Link: https://arxiv.org/abs/2503.12225
Authors: Rinku Dewri
Affiliations: Department of Computer Science, University of Denver, CO, 80208, USA
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments:

Abstract:This article explores the gaps that can manifest when using a large language model (LLM) to obtain simplified interpretations of data practices from a complex privacy policy. We exemplify these gaps to showcase issues in accuracy, completeness, clarity and representation, while advocating for continued research to realize an LLM’s true potential in revolutionizing privacy management through personal assistants and automated compliance checking.

[NLP-78] PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing

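[Quick read]: This paper addresses the inherent tension between the inference demands of large language models (LLMs) and the limited resources of edge devices, which constrains the development of edge intelligence. Many small language models try to distill LLM capabilities into smaller footprints, but they usually retain the architectural principles of their larger counterparts and still strain the storage and bandwidth of edge devices.

The paper proposes PLM (Peripheral Language Model), developed through a co-design process that jointly optimizes model architecture and edge system constraints. PLM uses a Multi-head Latent Attention mechanism and the squared ReLU activation function to encourage sparsity, which reduces peak memory footprint during inference. For training, the paper adopts a multi-phase strategy, investigates the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler, and incorporates Reinforcement Learning from Human Feedback (RLHF), which yields gains of 2% on general tasks, 9% on GSM8K, and 11% on coding tasks. Evaluation shows that PLM outperforms existing small language models trained on publicly available data while keeping the lowest number of activated parameters, and deployment on consumer-grade GPUs, mobile phones, and Raspberry Pis validates its suitability for edge applications.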

Link: https://arxiv.org/abs/2503.12167
Authors: Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Institute of Automation, Chinese Academy of Sciences; University College London; The Hong Kong University of Science and Technology; PLM Team
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% in general tasks, 9% in the GSM8K task, and 11% in coding tasks. In addition to its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while maintaining the lowest number of activated parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM’s suitability for peripheral applications. The PLM series models are publicly available at this https URL.
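
The squared ReLU activation mentioned in the abstract is simple to write down. The sketch below uses plain Python floats and hypothetical pre-activation values; it only illustrates the function itself and says nothing about how PLM's kernels exploit the resulting zeros.

```python
def squared_relu(x: float) -> float:
    """Squared ReLU: zero for non-positive inputs, x**2 otherwise."""
    return max(0.0, x) ** 2


pre_activations = [-1.5, -0.2, 0.0, 0.5, 3.0]  # hypothetical values
activations = [squared_relu(x) for x in pre_activations]

# Fraction of exactly-zero outputs, i.e. entries a sparse kernel could skip.
sparsity = activations.count(0.0) / len(activations)
```

Every non-positive input maps to an exact zero, and small positive inputs are pushed further toward zero by the squaring (0.5 becomes 0.25), while large inputs are amplified.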

[NLP-79] Improving LLM-based Document-level Machine Translation with Multi-Knowledge Fusion

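[Quick read]: This paper addresses the concern that relying only on the sequence of sentences as context may limit large language model (LLM) performance in document-level machine translation (DMT). Current approaches mainly exploit inter-sentence context by flattening the source document into one long sequence, but document-level sequences are far more complex than short sentence-level ones, which may limit what an LLM can do with this single source of knowledge.

The key idea is a multi-knowledge fusion strategy. The paper uses an LLM to obtain a summary of the source document and translations of its entities as additional knowledge, then generates two translations, each fused with one of these knowledge sources. Because different knowledge sources may help or hinder different sentences, a final fusion step refines and ranks the translations to select the best result. Across three LLMs, the method improves COMET scores over a baseline without extra knowledge by an average of 0.8, 0.6, and 0.4 points, respectively.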

Link: https://arxiv.org/abs/2503.12152
Authors: Bin Liu, Xinglin Lyu, Junhui Li, Daimeng Wei, Min Zhang, Shimin Tao, Hao Yang
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, China; Huawei Translation Services Center, Beijing, China
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Recent studies in prompting large language models (LLMs) for document-level machine translation (DMT) primarily focus on the inter-sentence context by flattening the source document into a long sequence. This approach relies solely on the sequence of sentences within the document. However, the complexity of document-level sequences is greater than that of shorter sentence-level sequences, which may limit LLM’s ability in DMT when only this single-source knowledge is used. In this paper, we propose an enhanced approach by incorporating multiple sources of knowledge, including both the document summarization and entity translation, to enhance the performance of LLM-based DMT. Given a source document, we first obtain its summarization and translation of entities via LLM as the additional knowledge. We then utilize LLMs to generate two translations of the source document by fusing these two single knowledge sources, respectively. Finally, recognizing that different sources of knowledge may aid or hinder the translation of different sentences, we refine and rank the translations by leveraging a multi-knowledge fusion strategy to ensure the best results. Experimental results in eight document-level translation tasks show that our approach achieves an average improvement of 0.8, 0.6, and 0.4 COMET scores over the baseline without extra knowledge for LLaMA3-8B-Instruct, Mistral-Nemo-Instruct, and GPT-4o-mini, respectively.

[NLP-80] Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models

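[Quick read]: This paper asks whether different large vision-language models (LVLMs) interpret multimodal sarcasm differently, and whether a single model can grasp sarcasm from multiple perspectives as humans do. It introduces an analytical framework that applies systematically designed prompts to existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs on 2,409 samples, the study examines interpretive variation within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. Classification-oriented prompts yield higher internal consistency, but models diverge markedly on interpretive reasoning, which highlights the subjectivity of sarcasm and challenges binary labeling paradigms. The paper advocates moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling for deeper insight into multimodal sarcasm comprehension.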

Link: https://arxiv.org/abs/2503.12149
Authors: Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
Affiliations: Anhui Polytechnic University (AHPU)
Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Comments:

Abstract:With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous “neutral” cases. Our findings reveal notable discrepancies – across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm’s subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: this https URL

[NLP-81] Enhanced Sentiment Analysis of Iranian Restaurant Reviews Utilizing Sentiment Intensity Analyzer & Fuzzy Logic

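[Quick Read]: This paper addresses the neutrality bias of conventional sentiment analysis when assessing sentiment polarity and intensity. Rule-based sentiment intensity tools such as VADER tend to favor neutral scores, which can misrepresent sentiment intensity. The key innovation is to bring in fuzzy logic and two refinements: square-root and fourth-root transformations that amplify positive and negative sentiment scores while leaving the neutral region unchanged. This yields three processing paths: unaltered VADER scores (Approach 1), square-root-transformed scores (Approach 2), and the finer fourth-root transformation (Approach 3). A fuzzy inference system with a comprehensive rule set then combines the refined scores into a single continuous sentiment value for each review. Experiments show that the refinements reduce neutrality bias and capture sentiment intensity better, improving overall sentiment analysis. The study also identifies limitations, such as occasional over-amplification and persistent neutrality in domain-specific cases, and proposes future work to address them.

Link: https://arxiv.org/abs/2503.12141
Authors: Shayan Rokhva, Babak Teimourpour, Romina Babaei
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: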


Abstract:This research presents an advanced sentiment analysis framework studied on Iranian restaurant reviews, combining fuzzy logic with conventional sentiment analysis techniques to assess both sentiment polarity and intensity. A dataset of 1266 reviews, alongside corresponding star ratings, was compiled and preprocessed for analysis. Initial sentiment analysis was conducted using the Sentiment Intensity Analyzer (VADER), a rule-based tool that assigns sentiment scores across positive, negative, and neutral categories. However, a noticeable bias toward neutrality often led to an inaccurate representation of sentiment intensity. To mitigate this issue, based on a fuzzy perspective, two refinement techniques were introduced, applying square-root and fourth-root transformations to amplify positive and negative sentiment scores while maintaining neutrality. This led to three distinct methodologies: Approach 1, utilizing unaltered VADER scores; Approach 2, modifying sentiment values using the square root; and Approach 3, applying the fourth root for further refinement. A Fuzzy Inference System incorporating comprehensive fuzzy rules was then developed to process these refined scores and generate a single, continuous sentiment value for each review based on each approach. Comparative analysis, including human supervision and alignment with customer star ratings, revealed that the refined approaches significantly improved sentiment analysis by reducing neutrality bias and better capturing sentiment intensity. Despite these advancements, minor over-amplification and persistent neutrality in domain-specific cases were identified, leading us to propose several future studies to tackle these occasional barriers. The study’s methodology and outcomes offer valuable insights for businesses seeking a more precise understanding of consumer sentiment, enhancing sentiment analysis across various industries.
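The root transformations described above can be illustrated with a minimal sketch. It assumes VADER-style positive, negative, and neutral scores in [0, 1] and reads "maintaining neutrality" as leaving the neutral score untouched; the function names and the sample review are hypothetical, and the fuzzy inference step is not shown.

```python
def amplify(score: float, root: int) -> float:
    """Raise a score in [0, 1] to the power 1/root.

    root=1 leaves the score unchanged (Approach 1), root=2 is the square
    root (Approach 2) and root=4 is the fourth root (Approach 3).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return score ** (1.0 / root)


def refine(scores: dict, root: int) -> dict:
    """Amplify only the positive and negative scores; keep neutral as-is.

    Assumption: the neutral score is passed through unchanged.
    """
    return {
        "pos": amplify(scores["pos"], root),
        "neg": amplify(scores["neg"], root),
        "neu": scores["neu"],
    }


# Hypothetical VADER-like output for a mostly neutral review.
review = {"pos": 0.16, "neg": 0.04, "neu": 0.80}
for root in (1, 2, 4):
    print(root, refine(review, root))
```

Because the scores lie between 0 and 1, taking a root increases them, and the fourth root increases them more than the square root, which is why small positive or negative signals become easier to distinguish from neutrality.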

[NLP-82] Hyperbolic Safety-Aware Vision-Language Models CVPR2025

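[Quick Read]: This paper addresses the retrieval of unsafe content from vision-language models such as CLIP. Current methods rely mainly on unlearning to erase the model's knowledge of unsafe concepts, which reduces unwanted outputs but also limits the model's ability to distinguish safe from unsafe content. The paper shifts from unlearning to an "awareness" paradigm: it exploits the inherent hierarchical properties of hyperbolic space to encode safe and unsafe content as an entailment hierarchy, placing them in different regions of the space. The key component is HySAC (Hyperbolic Safety-Aware CLIP), which uses entailment loss functions to model the hierarchical and asymmetric relations between safe and unsafe image-text pairs. This modeling is ineffective in standard vision-language models that rely on Euclidean embeddings, but in hyperbolic space it makes the model aware of unsafe content, so it can act both as a multimodal unsafe-content classifier and as a flexible retriever that either redirects unsafe queries toward safer alternatives or retains the original output. Experiments show that the approach improves safety recognition and yields a more adaptable and interpretable content moderation framework for vision-language models.

Link: https://arxiv.org/abs/2503.12127
Authors: Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia (Italy); University of Pisa (Italy); University of Amsterdam (Netherlands); IIT-CNR (National Research Council, Italy)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: CVPR 2025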


Abstract:Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model’s knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model’s capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at this https URL.

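[NLP-83] MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling

[Quick Read]: This paper addresses the limited use of process reward models (PRMs) in machine translation (MT), which stems mainly from the lack of systematic methodologies and evaluation benchmarks. The key solution is MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying PRMs in MT. Its novelty is a method based on approximate Monte Carlo Tree Search (MCTS) that automatically generates token-level preference pairs, avoiding the high cost of human annotation. The paper also establishes the first MT-specific reward model benchmark and, through a systematic comparison, shows that token-level supervision is effective. Experiments show that the proposed MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation. In short, automatically generated token-level preference pairs and a dedicated benchmark together advance the use and practical value of PRMs in MT research.

Link: https://arxiv.org/abs/2503.12123
Authors: Zhaopeng Feng, Jiahan Ren, Jiayuan Su, Jiamei Zheng, Zhihang Tang, Hongwei Wang, Zuozhu Liu
Affiliations: Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review. Project page: this https URL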


Abstract: Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike traditional vanilla preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. Then, we establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released at this https URL.

[NLP-84] RECSIP: REpeated Clustering of Scores Improving the Precision

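[Quick Read]: This paper addresses the insufficient reliability of large language models (LLMs) despite their marked progress in natural language processing (NLP). The problem stems mainly from the stochastic architecture of LLMs, which makes it hard for users to judge whether a response is reliable and can lead to serious harm in high-risk environments or costly failures in industrial settings. The paper proposes REpeated Clustering of Scores Improving the Precision (RECSIP), a framework that queries multiple models in parallel and then scores and clusters their responses to obtain a more reliable answer. On the MMLU-Pro benchmark, the reference implementation recsip improves overall performance by 5.8 percentage points over the best single model.

Link: https://arxiv.org/abs/2503.12108
Authors: André Schamschurko, Nenad Petrovic, Alois Christian Knoll
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Conference paper accepted for IntelliSys2025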


Abstract: The latest research on Large Language Models (LLMs) has demonstrated significant advancement in the field of Natural Language Processing (NLP). However, despite this progress, there is still a lack of reliability in these models. This is due to the stochastic architecture of LLMs, which presents a challenge for users attempting to ascertain the reliability of a model’s response. These responses may cause serious harm in high-risk environments or expensive failures in industrial contexts. Therefore, we introduce the framework REpeated Clustering of Scores Improving the Precision (RECSIP), which focuses on improving the precision of LLMs by asking multiple models in parallel and scoring and clustering their responses to ensure higher reliability of the response. The evaluation of our reference implementation recsip on the benchmark MMLU-Pro using the models GPT-4o, Claude and Gemini shows an overall increase of 5.8 percentage points compared to the best individual model.
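The general idea of asking several models in parallel and accepting an answer only when enough of them agree can be sketched as follows. This is an illustration under stated assumptions, not the recsip implementation: the models are hypothetical callables, "clustering" is reduced to grouping identical normalized answers, and the score is simply the share of models in the largest group.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for an LLM client: takes a question, returns an answer.
Model = Callable[[str], str]


def normalize(answer: str) -> str:
    """Make trivially different answers comparable (assumed normalization)."""
    return answer.strip().upper()


def ask_all(models: Sequence[Model], question: str) -> List[str]:
    """Query every model in parallel and return the normalized answers."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda m: normalize(m(question)), models))


def agreed_answer(models: Sequence[Model], question: str,
                  threshold: float = 0.6, max_rounds: int = 3) -> Optional[str]:
    """Repeat the query until the largest answer group reaches the threshold.

    Returns None when no group is large enough after max_rounds rounds.
    """
    for _ in range(max_rounds):
        answers = ask_all(models, question)
        label, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= threshold:
            return label
    return None


# Deterministic stub models for illustration only.
stubs = [lambda q: "B", lambda q: " b ", lambda q: "C"]
print(agreed_answer(stubs, "Which option is correct?"))
```

Returning `None` rather than a low-agreement answer reflects the precision-oriented goal: it is preferable to abstain than to report a response the models disagree on.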

[NLP-85] Large Language Models in Legislative Content Analysis: A Dataset from the Polish Parliament

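[Quick Read]: This paper examines how large language models (LLMs) can make legislative content analysis more effective within the Polish legal system, and explores their potential in natural language processing (NLP). It builds a dataset from the websites of official legislative authorities and designs three NLP tasks to evaluate how LLMs handle legislative text. The study highlights challenges such as understanding legal context, and shows that even commonly accessible data can be used effectively for legislative content analysis. The key idea is to pair domain-specific data with targeted NLP task design to bring out the capabilities of LLMs in automated legal text analysis.

Link: https://arxiv.org/abs/2503.12100
Authors: Arkadiusz Bryłkowski, Jakub Klikowski
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures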


Abstract: Large language models (LLMs) are among the best methods for processing natural language, partly due to their versatility. At the same time, domain-specific LLMs are more practical in real-life applications. This work introduces a novel natural language dataset created from data acquired from official legislative authorities’ websites. The study focuses on formulating three natural language processing (NLP) tasks to evaluate the effectiveness of LLMs on legislative content analysis within the context of the Polish legal system. Key findings highlight the potential of LLMs in automating and enhancing legislative content analysis while emphasizing specific challenges, such as understanding legal context. The research contributes to the advancement of NLP in the legal field, particularly in the Polish language. It has been demonstrated that even commonly accessible data can be practically utilized for legislative content analysis.

[NLP-86] Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models NAACL2025

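[Quick Read]: This paper addresses the lack of transparency around the training data of large language models (LLMs), which hampers external oversight, undermines the agency of data authors, and hinders research on issues such as data contamination and data selection. It proposes a new method that uses information-guided probes to identify training data known to proprietary LLMs such as GPT-4, without access to model weights or token probabilities. The key observation is that high-surprisal text passages are good search material for memorization probes: by evaluating how well a model reconstructs high-surprisal tokens, one can identify a large number of texts the LLM has memorized.

Link: https://arxiv.org/abs/2503.12072
Authors: Abhilasha Ravichander, Jillian Fisher, Taylor Sorensen, Ximing Lu, Yuchen Lin, Maria Antoniak, Niloofar Mireshghallah, Chandra Bhagavatula, Yejin Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: NAACL 2025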


Abstract:High-quality training data has proven crucial for developing performant large language models (LLMs). However, commercial LLM providers disclose few, if any, details about the data used for training. This lack of transparency creates multiple challenges: it limits external oversight and inspection of LLMs for issues such as copyright infringement, it undermines the agency of data authors, and it hinders scientific research on critical issues such as data contamination and data selection. How can we recover what training data is known to LLMs? In this work, we demonstrate a new method to identify training data known to proprietary LLMs like GPT-4 without requiring any access to model weights or token probabilities, by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes. By evaluating a model’s ability to successfully reconstruct high-surprisal tokens in text, we can identify a surprising number of texts memorized by LLMs.
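A rough sketch of a surprisal-guided probe is given below. It is an illustration only: surprisal is estimated with a smoothed unigram model over a small reference corpus (an assumption; the abstract does not say which reference model is used), the highest-surprisal tokens are masked, and a model's guesses for the masked positions are scored by exact match. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def unigram_surprisal(tokens: List[str], reference: Counter, total: int) -> List[float]:
    """Surprisal in bits, -log2 p(token), with add-one smoothing."""
    vocab = len(reference) + 1  # one extra slot for unseen tokens
    return [-math.log2((reference.get(t, 0) + 1) / (total + vocab)) for t in tokens]


def mask_high_surprisal(tokens: List[str], surprisals: List[float],
                        k: int) -> Tuple[List[str], Dict[int, str]]:
    """Mask the k most surprising tokens; return the cloze text and the targets."""
    top = sorted(range(len(tokens)), key=lambda i: surprisals[i], reverse=True)[:k]
    masked = ["[MASK]" if i in top else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(top)}
    return masked, targets


def reconstruction_rate(targets: Dict[int, str], guesses: Dict[int, str]) -> float:
    """Fraction of masked tokens the probed model reproduced exactly."""
    hits = sum(1 for i, tok in targets.items() if guesses.get(i) == tok)
    return hits / len(targets)


reference = Counter("the cat sat on the mat the dog sat".split())
tokens = "the quokka sat".split()
scores = unigram_surprisal(tokens, reference, sum(reference.values()))
masked, targets = mask_high_surprisal(tokens, scores, k=1)
print(masked, targets)
```

The intuition is that a rare token is hard to guess from context alone, so a model that fills it in correctly has more plausibly seen the passage during training.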

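[NLP-87] TLUE: A Tibetan Language Understanding Evaluation Benchmark

[Quick Read]: This paper addresses the underrepresentation of Tibetan, a low-resource language, in the evaluation of large language models (LLMs). Although Tibetan has more than seven million speakers, it has long been neglected in LLM development and assessment. To fill this gap, the paper presents TLUE (Tibetan Language Understanding Evaluation), the first large-scale benchmark for assessing LLMs' Tibetan capabilities. It consists of two parts: a multi-task understanding benchmark covering 5 domains and 67 subdomains, and a safety benchmark covering 7 subdomains. An evaluation of a range of state-of-the-art LLMs shows that most perform below the random baseline, which underlines the difficulty LLMs have with low-resource languages such as Tibetan. TLUE provides a foundation for further research on Tibetan language understanding and underscores the need for greater inclusivity in LLM development.

Link: https://arxiv.org/abs/2503.12051
Authors: Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu
Affiliations: University of Electronic Science and Technology of China; Tibet University; University of Texas Southwestern Medical Center
Categories: Computation and Language (cs.CL)
Comments: 6 figures, 21 pages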


Abstract:Large language models (LLMs) have made tremendous progress in recent years, but low-resource languages, such as Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present TLUE (A Tibetan Language Understanding Evaluation Benchmark), the first large-scale benchmark for assessing LLMs’ capabilities in Tibetan. TLUE comprises two major components: (1) a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and (2) a safety benchmark covering 7 subdomains. We evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most LLMs perform below the random baseline, highlighting the considerable challenges LLMs face in processing Tibetan, a low-resource language. TLUE provides an essential foundation for driving future research and progress in Tibetan language understanding and underscores the need for greater inclusivity in LLM development.

[NLP-88] Applications of Large Language Model Reasoning in Feature Generation

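[Quick Read]: This paper explores combining the reasoning techniques of large language models (LLMs) with feature generation for machine learning tasks, to overcome the inefficiency and complexity of traditional feature engineering, which requires manually specified search spaces. It analyzes four LLM reasoning approaches (Chain of Thought, Tree of Thoughts, Retrieval-Augmented Generation, and Thought Space Exploration) that can be used to identify effective feature generation rules automatically. Using applications in finance, healthcare, and text analytics, it shows how LLMs extract key information from clinical notes and radiology reports, and how they support text generation, summarization, and entity extraction for financial documents. It also discusses how to evaluate feature quality and downstream performance, with particular attention to the language-based feedback that OCTree's decision tree reasoning approach provides for iterative refinement. Current challenges include hallucination, computational efficiency, and domain adaptation; future directions include multimodal feature generation, self-improving systems, and neuro-symbolic approaches. The core idea is to use LLM reasoning to automate and enhance feature engineering.

Link: https://arxiv.org/abs/2503.11989
Authors: Dharani Chandra
Affiliations: Portland State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: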


Abstract: Large Language Models (LLMs) have revolutionized natural language processing through their state-of-the-art reasoning capabilities. This paper explores the convergence of LLM reasoning techniques and feature generation for machine learning tasks. We examine four key reasoning approaches: Chain of Thought, Tree of Thoughts, Retrieval-Augmented Generation, and Thought Space Exploration. Our analysis reveals how these approaches can be used to identify effective feature generation rules without having to manually specify search spaces. The paper categorizes LLM-based feature generation methods across various domains including finance, healthcare, and text analytics. In healthcare, LLMs can extract key information from clinical notes and radiology reports, enabling more efficient data utilization. In finance, LLMs facilitate text generation, summarization, and entity extraction from complex documents. We analyze evaluation methodologies for assessing feature quality and downstream performance, with particular attention to OCTree’s decision tree reasoning approach that provides language-based feedback for iterative improvements. Current challenges include hallucination, computational efficiency, and domain adaptation. As of March 2025, emerging approaches include inference-time compute scaling, reinforcement learning, and supervised fine-tuning with model distillation. Future directions point toward multimodal feature generation, self-improving systems, and neuro-symbolic approaches. This paper provides a detailed overview of an emerging field that promises to automate and enhance feature engineering through language model reasoning.

[NLP-89] No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language models

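[Quick Read]: This paper evaluates the bias of large language models (LLMs) across benchmark datasets and proposes approaches for detecting and mitigating it. Its core is a unified evaluation framework that uses representative LLMs to cover many forms of bias, from physical characteristics to socio-economic categories, together with five prompting approaches for detecting bias along multiple dimensions. Guided by three research questions and several evaluation metrics, the paper analyzes the bias of each model and finds that LLaMA3.1-8B is the least biased of those selected. It closes by summarizing open challenges and future research directions.

Link: https://arxiv.org/abs/2503.11985
Authors: Charaka Vinayak Kumar, Ashok Urlana, Gopichand Kanumolu, Bala Mallikarjunarao Garlapati, Pruthwik Mishra
Affiliations: IIIT Hyderabad; TCS Research, Hyderabad, India; SVNIT Surat, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure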


Abstract: Advancements in Large Language Models (LLMs) have increased the performance of different natural language understanding as well as generation tasks. Although LLMs have surpassed the state-of-the-art performance in various tasks, they often reflect different forms of bias present in the training data. In the light of this perceived limitation, we provide a unified evaluation of benchmarks using a set of representative LLMs that cover different forms of biases starting from physical characteristics to socio-economic categories. Moreover, we propose five prompting approaches to carry out the bias detection task across different aspects of bias. Further, we formulate three research questions to gain valuable insight into detecting biases in LLMs using different approaches and evaluation metrics across benchmarks. The results indicate that each of the selected LLMs suffers from one form of bias or another, with the LLaMA3.1-8B model being the least biased. Finally, we conclude the paper with the identification of key challenges and possible future directions.

[NLP-90] HInter: Exposing Hidden Intersectional Bias in Large Language Models

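[Quick Read]: This paper addresses intersectional bias in large language models (LLMs), that is, discrimination against individuals characterized by multiple attributes such as race and gender. Such bias is hard to uncover because it involves complex input combinations over several attributes.

The key solution is HInter, a testing technique that combines mutation analysis, dependency parsing, and metamorphic oracles to detect intersectional bias in LLMs automatically. HInter generates test inputs by systematically applying multiple mutations to sentences, validates them with a dependency invariant, and detects bias by comparing the model's responses to the original and mutated sentences. In the experiments, 14.61% of the inputs generated by HInter exposed intersectional bias, and the dependency invariant substantially reduced false positives. In addition, 16.62% of the intersectional bias errors were hidden, meaning the corresponding atomic cases did not trigger bias. The work stresses the importance of testing LLMs for intersectional bias.

Link: https://arxiv.org/abs/2503.11962
Authors: Badr Souani, Ezekiel Soremekun, Mike Papadakis, Setsuko Yokoyama, Sudipta Chattopadhyay, Yves Le Traon
Affiliations: SnT, University of Luxembourg; Singapore University of Technology and Design
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: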
点击查看摘要

Abstract: Large Language Models (LLMs) may exhibit discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g., race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLMs (GPT3.5, Llama2, BERT, etc.) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasizes the importance of testing LLMs for intersectional bias.
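The mutation-plus-metamorphic-oracle idea, including the notion of a hidden intersectional error, can be sketched as below. This is a simplified illustration, not HInter itself: the swap tables and the toy classifier are hypothetical, mutations are plain word swaps, and the dependency-invariant check on generated inputs is omitted.

```python
from itertools import product
from typing import Callable, List, Tuple

# Hypothetical stand-in for the model under test: sentence -> predicted label.
Classifier = Callable[[str], str]

# Hypothetical attribute swaps (original word, replacement word).
GENDER_SWAPS = [("man", "woman")]
RACE_SWAPS = [("white", "black"), ("white", "asian")]


def replace_word(sentence: str, old: str, new: str) -> str:
    """Swap whole words only, leaving the rest of the sentence intact."""
    return " ".join(new if w == old else w for w in sentence.split())


def intersectional_findings(sentence: str, model: Classifier) -> List[Tuple[str, bool]]:
    """Return (mutant, hidden) pairs for mutants whose prediction changes.

    Metamorphic oracle: changing only protected attributes should not change
    the prediction. A finding is "hidden" when the two-attribute mutant flips
    the prediction but neither single-attribute (atomic) mutant does.
    """
    base = model(sentence)
    words = sentence.split()
    findings = []
    for (g_old, g_new), (r_old, r_new) in product(GENDER_SWAPS, RACE_SWAPS):
        if g_old not in words or r_old not in words:
            continue
        atomic_gender = replace_word(sentence, g_old, g_new)
        atomic_race = replace_word(sentence, r_old, r_new)
        both = replace_word(atomic_gender, r_old, r_new)
        if model(both) != base:
            hidden = model(atomic_gender) == base and model(atomic_race) == base
            findings.append((both, hidden))
    return findings


def toy_model(sentence: str) -> str:
    """Toy classifier that misbehaves only for one attribute combination."""
    return "neg" if {"woman", "black"} <= set(sentence.split()) else "pos"


print(intersectional_findings("the white man was hired", toy_model))
```

The toy classifier shows why atomic tests are not enough: each single swap leaves the prediction unchanged, and only the combined swap exposes the error.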

[NLP-91] Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis

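[Quick Read]: This paper addresses the interpretability of large language models (LLMs) in sentiment analysis, in particular the need to understand the rationale behind predictions in high-stakes applications. It proposes a method based on SHAP (Shapley Additive Explanations) that decomposes an LLM into components such as the embedding layer, encoder, decoder, and attention layer to explain sentiment predictions layer by layer. This layer-wise view shows how different sentences affect each layer, and the method is evaluated on the Stanford Sentiment Treebank (SST-2) dataset. The experiments indicate that it clarifies sentiment-specific token attributions notably better than current whole-model explainability techniques, which improves the reliability and transparency of LLM-based sentiment analysis in critical applications.

Link: https://arxiv.org/abs/2503.11948
Authors: Thivya Thogesan, Anupiya Nugaliyadde, Kok Wai Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: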


Abstract: Interpretability remains a key difficulty in sentiment analysis with Large Language Models (LLMs), particularly in high-stakes applications where it is crucial to comprehend the rationale behind predictions. This research addresses it by introducing a technique that applies SHAP (Shapley Additive Explanations) by breaking down LLMs into components such as the embedding layer, encoder, decoder and attention layer to provide a layer-by-layer understanding of sentiment prediction. The approach offers a clearer overview of how the model interprets and categorises sentiment by breaking down LLMs into these parts. The method is evaluated using the Stanford Sentiment Treebank (SST-2) dataset, which shows how different sentences affect different layers. The effectiveness of layer-wise SHAP analysis in clarifying sentiment-specific token attributions is demonstrated by experimental evaluations, which provide a notable enhancement over current whole-model explainability techniques. These results highlight how the suggested approach could improve the reliability and transparency of LLM-based sentiment analysis in crucial applications.
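For readers unfamiliar with the quantity SHAP approximates, the sketch below computes exact Shapley values for token attributions by enumerating all orderings. It is only a toy illustration: the value function is a hypothetical two-word sentiment scorer rather than an LLM layer, it does not use the SHAP library, and the layer-wise decomposition described above is not reproduced.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings.

    Feasible only for a handful of (unique) tokens, since it enumerates n!
    orderings.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            extended = coalition | {p}
            totals[p] += value(extended) - value(coalition)
            coalition = extended
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_sentiment(tokens: FrozenSet[str]) -> float:
    """Hypothetical scorer: 'good' is positive unless 'not' is also present."""
    if "good" in tokens:
        return -1.0 if "not" in tokens else 1.0
    return 0.0


print(shapley_values(["not", "good"], toy_sentiment))
```

In this toy case the attributions sum to the score of the full input (-1.0), which is the additivity property that makes Shapley-based explanations attractive for token attribution.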

[NLP-92] REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives

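[Quick Read]: This paper addresses the fact that existing recommender system datasets focus mainly on sequential item prediction and neglect the evaluation of conversational capabilities. It introduces REGEN (Reviews Enhanced with GEnerative Narratives), which extends the Amazon Product Reviews dataset with two natural language features: user critiques and narratives. Critiques represent users' steering queries; narratives are rich text associated with each recommended item, including product endorsements, purchase explanations, and summaries of user preferences, and they take prior context into account. The paper also builds an end-to-end modeling benchmark for conversational recommendation, in which a model generates recommendations and corresponding narratives from the user history (items and critiques). For this joint task the authors introduce LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives), a framework that uses an LLM as the backbone for critiquing, retrieval, and generation. The results show that adding critiques improves recommendation quality by letting the recommender learn language understanding and combine it with recommendation signals, and that LLMs trained on the dataset generate both recommendations and contextual narratives with performance comparable to state-of-the-art recommenders and language models.

Link: https://arxiv.org/abs/2503.11924
Authors: Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi
Affiliations: Google Research (Mountain View, CA, USA); University of Waterloo (Waterloo, Ontario, Canada)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: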


Abstract:This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user “steering” queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset’s quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models. 

[NLP-93] LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama ALT

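[Quick Read]: This paper addresses gaps in the multilingual evaluation of large language models (LLMs), especially on low-resource languages (LRLs). Evaluation has traditionally relied on English datasets and overlooked performance in other languages. The paper calls for robust evaluation frameworks built on high-quality non-English data and evaluates the reasoning of eight state-of-the-art LLMs on Latvian and Giriama, using an MMLU subset carefully curated by native speakers; Giriama is benchmarked for the first time. This reliance on localized benchmarks and human evaluation underlines the importance of cultural context in LLM development.

Link: https://arxiv.org/abs/2503.11911
Authors: Naome A. Etori, Kevin Lu, Randu Karisa, Arturs Kanepajs
Affiliations: University of Minnesota-Twin Cities; Bellarmine College Preparatory; Masakhane
Categories: Computation and Language (cs.CL)
Comments: Accepted in NoDaLiDa/Baltic-HLT 2025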


Abstract: As large language models (LLMs) rapidly advance, evaluating their performance is critical. LLMs are trained on multilingual data, but their reasoning abilities are mainly evaluated using English datasets. Hence, robust evaluation frameworks are needed using high-quality non-English datasets, especially for low-resource languages (LRLs). This study evaluates eight state-of-the-art (SOTA) LLMs on Latvian and Giriama using a Massive Multitask Language Understanding (MMLU) subset curated with native speakers for linguistic and cultural relevance. Giriama is benchmarked for the first time. Our evaluation shows that OpenAI’s o1 model outperforms others across all languages, scoring 92.8% in English, 88.8% in Latvian, and 70.8% in Giriama on 0-shot tasks. Mistral-large (35.6%) and Llama-70B IT (41%) have weak performance on both Latvian and Giriama. Our results underscore the need for localized benchmarks and human evaluations in advancing cultural AI contextualization.

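[NLP-94] LLMs for Translation: Historical Low-Resourced Languages and Contemporary AI Models NAACL2025

[Quick Read]: This paper examines the challenges and limitations that the safety mechanisms of large language models (LLMs) create when translating sensitive, context-rich historical texts. Taking Google's Gemini as a case study, it analyzes the model's translation of the 18th-century Ottoman Turkish manuscript Prisoner of the Infidels: The Memoirs of Osman Agha of Timisoara into English. Gemini's safety mechanisms flagged between 14% and 23% of the manuscript as harmful, leaving those passages untranslated. The central point is that these safety settings, while effective at reducing potential harm, also limit the model's ability to provide complete and accurate translations. Through concrete historical examples, the paper highlights the inherent difficulty current LLM safety implementations have with such material, and notes that similar failures could occur in contemporary translation scenarios that demand accuracy and completeness, such as translating the testimony of war victims for legal proceedings or humanitarian documentation.

Link: https://arxiv.org/abs/2503.11898
Authors: Merve Tekgurler
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to LaTeCH-CLfL 2025, held in conjunction with NAACL 2025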


Abstract: Large Language Models (LLMs) have demonstrated remarkable adaptability in performing various tasks, including machine translation (MT), without explicit training. Models such as OpenAI’s GPT-4 and Google’s Gemini are frequently evaluated on translation benchmarks and utilized as translation tools due to their high performance. This paper examines Gemini’s performance in translating an 18th-century Ottoman Turkish manuscript, Prisoner of the Infidels: The Memoirs of Osman Agha of Timisoara, into English. The manuscript recounts the experiences of Osman Agha, an Ottoman subject who spent 11 years as a prisoner of war in Austria, and includes his accounts of warfare and violence. Our analysis reveals that Gemini’s safety mechanisms flagged between 14 and 23 percent of the manuscript as harmful, resulting in untranslated passages. These safety settings, while effective in mitigating potential harm, hinder the model’s ability to provide complete and accurate translations of historical texts. Through real historical examples, this study highlights the inherent challenges and limitations of current LLM safety implementations in the handling of sensitive and context-rich materials. These real-world instances underscore potential failures of LLMs in contemporary translation scenarios, where accurate and comprehensive translations are crucial, for example, translating the accounts of modern victims of war for legal proceedings or humanitarian documentation.

[NLP-95] Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing ACL’25

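[Quick Read]: This paper addresses the imperfections of model editing, which is used to update the knowledge of large language models (LLMs) because retraining or fine-tuning is costly. The imperfections take two forms: UnderEdit (the knowledge edit fails) and OverEdit (neighboring knowledge is contaminated). The paper proposes iterative model editing to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing, to reduce OverEdit. Experiments show that the methods reduce UnderEdit by up to 38 percentage points and OverEdit by up to 6 percentage points.

Link: https://arxiv.org/abs/2503.11895
Authors: Bhiman Kumar Baghel, Scott M. Jordan, Zheyuan Ryan Shi, Xiang Lorraine Li
Affiliations: Department of Computer Science, University of Pittsburgh, PA, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review @ ACL’25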


Abstract: Large Language Models (LLMs) are used in various downstream language tasks, making it crucial to keep their knowledge up-to-date, but both retraining and fine-tuning the model can be costly. Model editing offers an efficient and effective alternative through a single update to only a key subset of model parameters. While efficient, these methods are not perfect. Sometimes knowledge edits are unsuccessful, i.e., UnderEdit, or the edit contaminates neighboring knowledge that should remain unchanged, i.e., OverEdit. To address these limitations, we propose iterative model editing, based on our hypothesis that a single parameter update is often insufficient, to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to minimize OverEdit. Extensive experiments demonstrate that our methods effectively reduce UnderEdit up to 38 percentage points and OverEdit up to 6 percentage points across multiple model editing algorithms, LLMs, and benchmark datasets.

[NLP-96] GPT's Devastated and LLaMA's Content: Emotion Representation Alignment in LLMs for Keyword-based Generation

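[Quick Read]: This paper addresses emotion control in keyword-based sentence generation with large language models (LLMs) such as GPT-4 and LLaMA-3. It studies how well different emotion representations (words, Valence-Arousal-Dominance (VAD) dimensions in lexical and numeric form, and emojis) align with human expectations, using human evaluation to measure human-LLM alignment as well as the accuracy and realism of the generated sentences for each representation. The key finding is that, although numeric VAD dimensions are easy to compute, people agree more with generations conditioned on specific words (e.g., "angry") than on numeric scales. Converting numeric VAD values to lexical descriptions (e.g., "+4.0" becomes "High") markedly improves human-LLM agreement. The perceived emotional expressiveness therefore depends strongly on the LLM, the representation type, and the specific emotion.

Link: https://arxiv.org/abs/2503.11881
Authors: Shadab Choudhury, Asha Kumar, Lara J. Martin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: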


Abstract:In controlled text generation using large language models (LLMs), gaps arise between the language model’s interpretation and human expectations. We look at the problem of controlling emotions in keyword-based sentence generation for both GPT-4 and LLaMA-3. We selected four emotion representations: Words, Valence-Arousal-Dominance (VAD) dimensions expressed in both Lexical and Numeric forms, and Emojis. Our human evaluation looked at the Human-LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. While representations like VAD break emotions into easy-to-compute components, our findings show that people agree more with how LLMs generate when conditioned on English words (e.g., “angry”) rather than VAD scales. This difference is especially visible when comparing Numeric VAD to words. However, we found that converting the originally-numeric VAD scales to Lexical scales (e.g., +4.0 becomes “High”) dramatically improved agreement. Furthermore, the perception of how much a generated sentence conveys an emotion is highly dependent on the LLM, representation type, and which emotion it is.
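The numeric-to-lexical conversion mentioned above (for example, +4.0 becoming "High") might look like the following sketch. The value range of -5 to +5, the three equal-width bins, and the prompt wording are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List


def vad_to_lexical(value: float, low: float = -5.0, high: float = 5.0) -> str:
    """Map a numeric VAD value to 'Low', 'Medium' or 'High'.

    Assumption: the scale runs from `low` to `high` and is split into three
    equal-width bins.
    """
    if not low <= value <= high:
        raise ValueError(f"value must lie in [{low}, {high}]")
    third = (high - low) / 3.0
    if value < low + third:
        return "Low"
    if value < low + 2 * third:
        return "Medium"
    return "High"


def emotion_prompt(keywords: List[str], valence: float,
                   arousal: float, dominance: float) -> str:
    """Build a hypothetical generation prompt using lexical VAD levels."""
    return (
        f"Write one sentence using the keywords {', '.join(keywords)}. "
        f"Valence: {vad_to_lexical(valence)}, "
        f"Arousal: {vad_to_lexical(arousal)}, "
        f"Dominance: {vad_to_lexical(dominance)}."
    )


print(emotion_prompt(["train", "late"], valence=-4.0, arousal=4.0, dominance=0.0))
```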

[NLP-97] FedALT: Federated Fine-Tuning through Adaptive Local Training with Rest-of-the-World LoRA

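[Quick read]: This paper addresses the cross-client interference caused by model aggregation when fine-tuning large language models (LLMs) in federated settings. Under data heterogeneity in particular, existing FedAvg-based federated LoRA fine-tuning methods tend to produce harmful cross-client interference and weak personalization. The proposed solution is FedALT, a personalized federated LoRA fine-tuning algorithm that abandons FedAvg's practice of initializing local training from an aggregated model. Each client keeps training its own individual LoRA while incorporating shared knowledge through a separate Rest-of-the-World (RoTW) LoRA component. To balance local adaptation against global information, FedALT adds an adaptive mixer that follows the Mixture-of-Experts (MoE) principle and dynamically learns input-specific weights between the individual and RoTW LoRA components. Experiments on NLP benchmarks show that FedALT significantly outperforms existing personalized federated LoRA fine-tuning methods, achieving better local adaptation without sacrificing computational efficiency.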

Link: https://arxiv.org/abs/2503.11880
Authors: Jieming Bian,Lei Wang,Letian Zhang,Jie Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose FedALT, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-the-World (RoTW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoTW LoRA components using the Mixture-of-Experts (MoE) principle. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.
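
The mixing of an individual LoRA with a Rest-of-the-World LoRA can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions: the linear gate followed by a softmax, the function names, and the pure-Python matrix helpers are all hypothetical and only show the general Mixture-of-Experts pattern, not FedALT's actual mixer.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lora_delta(lora: Tuple[Matrix, Matrix], x: Vector) -> Vector:
    """Low-rank update B(Ax) for a LoRA pair (A, B)."""
    a, b = lora
    return matvec(b, matvec(a, x))


def mixed_forward(
    x: Vector,
    w_base: Matrix,
    individual_lora: Tuple[Matrix, Matrix],
    rotw_lora: Tuple[Matrix, Matrix],
    gate: Matrix,
) -> Vector:
    """Frozen base output plus an input-dependent blend of two LoRA updates."""
    base = matvec(w_base, x)
    local = lora_delta(individual_lora, x)
    shared = lora_delta(rotw_lora, x)
    # Assumed gate: a linear map to two logits, one per LoRA component.
    w_local, w_shared = softmax(matvec(gate, x))
    return [h + w_local * l + w_shared * s for h, l, s in zip(base, local, shared)]
```

Because the gate depends on `x`, different inputs can lean on the local or the shared component to different degrees, which is the behavior the abstract attributes to the adaptive mixer.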

[NLP-98] OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

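[Quick read]: This paper addresses two main drawbacks of existing LLM-based evaluation metrics for natural language generation (NLG): reliance on proprietary models and a lack of fine-grained, explanatory feedback. It proposes OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. The metric is available either as a two-stage ensemble of larger open-weight LLMs or as a small fine-tuned evaluation model, and it generalizes to unseen tasks, domains, and aspects. In meta-evaluation it achieves competitive correlation with human judgments and provides explanations more than twice as accurate as those of existing state-of-the-art models.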

Link: https://arxiv.org/abs/2503.11858
Authors: Ivan Kartáč,Mateusz Lango,Ondřej Dušek
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated great potential as evaluators of NLG systems, allowing for high-quality, reference-free, and multi-aspect assessments. However, existing LLM-based metrics suffer from two major drawbacks: reliance on proprietary models to generate training data or perform evaluations, and a lack of fine-grained, explanatory feedback. In this paper, we introduce OpeNLGauge, a fully open-source, reference-free NLG evaluation metric that provides accurate explanations based on error spans. OpeNLGauge is available as a two-stage ensemble of larger open-weight LLMs, or as a small fine-tuned evaluation model, with confirmed generalizability to unseen tasks, domains and aspects. Our extensive meta-evaluation shows that OpeNLGauge achieves competitive correlation with human judgments, outperforming state-of-the-art models on certain tasks while maintaining full reproducibility and providing explanations more than twice as accurate.

[NLP-99] A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection

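[Quick read]: This paper addresses the challenge that sarcasm poses for affective systems, in particular the difficulty conventional methods have with the contradiction between the literal and intended sentiment of sarcastic statements. The key to the solution is a method that combines transformer-based language models (LMs), prototype-based networks, and sentiment embeddings, which makes sarcasm detection interpretable without extra post-hoc explanation techniques. By introducing an incongruity loss built on sentiment prototypes, the method improves detection performance, surpasses the current state of the art on benchmark datasets, and strengthens the model's intrinsic interpretability.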

Link: https://arxiv.org/abs/2503.11838
Authors: Ximing Wen,Rezvaneh Rezapour
Affiliations: Drexel University
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 2 figures

Abstract:Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm’s inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their efficient ability to capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that our model outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model’s inherent interpretability by generating explanations through similar examples in the reference time. Furthermore, we demonstrate the effectiveness of incongruity loss in the ablation study, which we construct using sentiment prototypes.

[NLP-100] Transfer Learning for Automated Feedback Generation on Small Datasets

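[Quick read]: This paper addresses the challenge of providing timely and accurate feedback in the learning process, especially when relying on human markers. It trains an automated feedback generation system with a three-stage transfer learning pipeline, which achieves state-of-the-art results on a very small dataset with very long sequences, although the generated feedback does not sound human. The key to the solution is this three-stage transfer learning approach, which overcomes the difficulties posed by the small data volume and the long sequence lengths.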

Link: https://arxiv.org/abs/2503.11836
Authors: Oscar Morris
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Feedback is a very important part of the learning process. However, it is challenging to make this feedback both timely and accurate when relying on human markers. This is the challenge that Automated Feedback Generation attempts to address. In this paper, a technique to train such a system on a very small dataset with very long sequences is presented. Both of these attributes make this a very challenging task; however, by using a three-stage transfer learning pipeline, state-of-the-art results can be achieved with qualitatively accurate but unhuman-sounding results. The use of both Automated Essay Scoring and Automated Feedback Generation systems in the real world is also discussed.

[NLP-101] Bridging the LLM Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open LLMs for Automated Essay Scoring

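[Quick read]: This paper evaluates how open and open-source large language models (LLMs) perform on text assessment and generation tasks, specifically in automated essay scoring, compared with their closed counterparts. It does so through a rigorous comparative analysis of nine leading models spanning the closed, open, and open-source LLM ecosystems, covering predictive performance, fairness, and cost-effectiveness. For few-shot assessment tasks, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4, with no significant differences on age- or race-related fairness measures, and Llama 3 is up to 37 times more cost-efficient. For generation tasks, essays produced by top open LLMs are comparable to those of closed LLMs in semantic composition/embeddings and in machine-assessed scores. The study challenges the dominance of closed LLMs and highlights the democratizing potential of open LLMs, which can bridge accessibility divides while remaining competitive and fair.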

Link: https://arxiv.org/abs/2503.11827
Authors: Kezia Oketch,John P. Lalor,Yi Yang,Ahmed Abbasi
Affiliations: Department of IT, Analytics, and Operations, University of Notre Dame; Department of Information Systems, Business Statistics and Operations Management, Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs’ performance and wide adoption has sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of nine leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation tasks related to automated essay scoring. Our findings reveal that for few-shot learning-based assessment of human generated essays, open LLMs such as Llama 3 and Qwen2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness.

[NLP-102] Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques

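[Quick read]: This paper addresses the efficiency challenge that large language models (LLMs) face with long contexts, where the computational cost of attention grows quadratically with the number of tokens. It analyzes a range of Key-Value (KV) cache compression strategies, organizes them in a comprehensive taxonomy by underlying principle and implementation technique, and evaluates their impact on performance and inference latency. The results expose the effectiveness and trade-offs of these methods in long-context scenarios and offer guidance for more efficient LLM implementations.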

Link: https://arxiv.org/abs/2503.11816
Authors: Neusha Javidnia,Bita Darvish Rouhani,Farinaz Koushanfar
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Invited paper to IEEE Custom Integrated Circuits Conference (CICC) 2025

Abstract:Large language models (LLMs) have demonstrated exceptional capabilities in generating text, images, and video content. However, as context length grows, the computational cost of attention increases quadratically with the number of tokens, presenting significant efficiency challenges. This paper presents an analysis of various Key-Value (KV) cache compression strategies, offering a comprehensive taxonomy that categorizes these methods by their underlying principles and implementation techniques. Furthermore, we evaluate their impact on performance and inference latency, providing critical insights into their effectiveness. Our findings highlight the trade-offs involved in KV cache compression and its influence on handling long-context scenarios, paving the way for more efficient LLM implementations.
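
To make the memory pressure behind KV cache compression concrete, the sketch below estimates cache size for standard multi-head attention and applies one simple eviction policy. The policy shown (keep the first few and the most recent tokens) is only one illustrative family and is an assumption here; the abstract does not say which strategies the survey's taxonomy contains, and the function names are hypothetical.

```python
from typing import List, Tuple

# One cached token: (key vector, value vector).
KVEntry = Tuple[List[float], List[float]]


def kv_cache_bytes(
    layers: int, kv_heads: int, head_dim: int, seq_len: int, bytes_per_value: int = 2
) -> int:
    """Size of an uncompressed KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def evict_sliding_window(
    cache: List[KVEntry], keep_first: int, keep_recent: int
) -> List[KVEntry]:
    """Keep the first `keep_first` and last `keep_recent` entries, drop the rest."""
    if keep_first < 0 or keep_recent < 0:
        raise ValueError("keep counts must be non-negative")
    if len(cache) <= keep_first + keep_recent:
        return list(cache)
    return cache[:keep_first] + cache[len(cache) - keep_recent:]
```

For a hypothetical model with 32 layers, 32 KV heads, head dimension 128, and 16-bit values, a 4096-token context already needs 2 GiB of cache per sequence, which is why eviction and other compression schemes trade accuracy for memory and latency.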

[NLP-103] Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

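[Quick read]: This paper addresses the inefficiency and potential distraction that arise when vision-language models (VLMs) handle fine-grained visual reasoning, because image cropping techniques introduce a large number of visual tokens. To address the generalization challenges of image representation in existing VLMs, it proposes SEMCLIP, a lightweight and universal framework that integrates seamlessly with existing VLMs and strengthens their handling of fine-grained detail. The key is to use textual semantics to identify the important visual regions and to incorporate textual signals into visual encoding, which improves visual question answering (VQA) performance without retraining the VLM. The method yields an average gain of 3.3% across 7 benchmarks and 5.3% on the challenging detailed-understanding benchmark V*.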

Link: https://arxiv.org/abs/2503.11794
Authors: Bangzheng Li,Fei Wang,Wenxuan Zhou,Nan Xu,Ben Zhou,Sheng Zhang,Hoifung Poon,Muhao Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). This unified input paradigm enables VLMs to excel in vision-language tasks such as visual question answering (VQA). To improve fine-grained visual reasoning, recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. However, this approach significantly increases the number of visual tokens, leading to inefficiency and potential distractions for the LLM. To address the generalization challenges of image representation in VLMs, we propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details. Our method leverages textual semantics to identify key visual areas, improving VQA performance without requiring any retraining of the VLM. Additionally, it incorporates textual signals into the visual encoding process, enhancing both efficiency and effectiveness. The proposed method, SEMCLIP, strengthens the visual understanding of a 7B VLM, LLaVA-1.5 by 3.3% on average across 7 benchmarks, and particularly by 5.3% on the challenging detailed understanding benchmark V*.
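
Selecting only the sub-images relevant to a question, instead of feeding every crop to the model, can be sketched as a top-k ranking over embeddings. Cosine similarity as the relevance score, precomputed embeddings, and the function names are assumptions made for illustration; the abstract does not specify how SEMCLIP scores regions.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_crops(question_emb: Vector, crop_embs: List[Vector], k: int = 1) -> List[int]:
    """Return indices of the k crops most similar to the question, in image order."""
    ranked = sorted(
        range(len(crop_embs)),
        key=lambda i: cosine(question_emb, crop_embs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])
```

Only the selected crops would then be encoded into visual tokens, so the token count grows with `k` and not with the total number of crops.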

[NLP-104] reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs

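[Quick read]: This paper addresses the limited robustness of reward models in modern NLP, in particular their sharp performance drop under small changes to the input. It notes that although recent reward models score well on standard benchmarks, part of the gain may come from overfitting, which obscures their true capability. The authors build reWordBench, which systematically transforms reward model inputs in meaning- or ranking-preserving ways, and find that state-of-the-art reward models degrade substantially under these transformations. To improve robustness, they train reward models explicitly to assign similar scores to paraphrases, which also improves robustness to other kinds of transformations. The resulting robust reward model roughly halves the degradation on the Chat Hard subset of RewardBench, and when used for alignment it yields higher-quality outputs, winning in up to 59% of instances against a standardly trained reward model.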

Link: https://arxiv.org/abs/2503.11751
Authors: Zhaofeng Wu,Michihiro Yasunaga,Andrew Cohen,Yoon Kim,Asli Celikyilmaz,Marjan Ghazvininejad
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reward models have become a staple in modern NLP, serving as not only a scalable text evaluator, but also an indispensable component in many alignment recipes and inference-time algorithms. However, while recent reward models increase performance on standard benchmarks, this may partly be due to overfitting effects, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build reWordBench, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer from substantial performance degradation even with minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose to explicitly train them to assign similar scores to paraphrases, and find that this approach also improves robustness to other distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half for the Chat Hard subset in RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
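
Training a reward model to assign similar scores to paraphrases can be pictured as adding a consistency penalty to a standard pairwise preference loss. The squared-difference penalty, the Bradley-Terry form of the preference term, and the `weight` parameter are assumptions for illustration; the abstract does not give the paper's exact objective.

```python
import math


def preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected)).

    Computed as a numerically stable softplus of the negated margin.
    """
    margin = chosen_score - rejected_score
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def paraphrase_penalty(score: float, paraphrase_score: float) -> float:
    """Assumed regularizer: squared gap between a text and its paraphrase."""
    return (score - paraphrase_score) ** 2


def robust_rm_loss(
    chosen: float,
    rejected: float,
    chosen_paraphrase: float,
    rejected_paraphrase: float,
    weight: float = 1.0,
) -> float:
    """Preference loss plus a weighted paraphrase-consistency term."""
    consistency = paraphrase_penalty(chosen, chosen_paraphrase) + paraphrase_penalty(
        rejected, rejected_paraphrase
    )
    return preference_loss(chosen, rejected) + weight * consistency
```

When the reward model scores a response and its paraphrase identically, the extra term vanishes and the objective reduces to the ordinary preference loss.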

[NLP-105] CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

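[Quick read]: This paper addresses the failure of existing large language model (LLM) approaches to traffic signal control (TSC) to achieve network-wide optimization, which stems mainly from the lack of effective multi-agent coordination. It proposes the CoLLMLight framework, which builds a structured spatiotemporal graph that captures real-time traffic dynamics and the spatial relationships among neighboring intersections, so that the LLM can reason about complex traffic interactions. A complexity-aware reasoning mechanism dynamically adjusts reasoning depth to real-time traffic conditions, keeping computation efficient without sacrificing decision quality. A fine-tuning strategy based on iterative simulation-driven data collection and environmental feedback produces a lightweight LLM suited to cooperative TSC. Experiments show that CoLLMLight outperforms state-of-the-art methods across diverse traffic scenarios, demonstrating its effectiveness, scalability, and robustness.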

Link: https://arxiv.org/abs/2503.11739
Authors: Zirui Yuan,Siqi Lai,Hao Liu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under review, 14 pages

Abstract:Traffic Signal Control (TSC) plays a critical role in urban traffic management by optimizing traffic flow and mitigating congestion. While Large Language Models (LLMs) have recently emerged as promising tools for TSC due to their exceptional problem-solving and generalization capabilities, existing approaches fail to address the essential need for inter-agent coordination, limiting their effectiveness in achieving network-wide optimization. To bridge this gap, we propose CoLLMLight, a cooperative LLM agent framework for TSC. Specifically, we first construct a structured spatiotemporal graph to capture real-time traffic dynamics and spatial relationships among neighboring intersections, enabling the LLM to reason about complex traffic interactions. Moreover, we introduce a complexity-aware reasoning mechanism that dynamically adapts reasoning depth based on real-time traffic conditions, ensuring optimal computational efficiency without sacrificing decision quality. Besides, we propose a fine-tuning strategy that leverages iterative simulation-driven data collection and environmental feedback to build a lightweight LLM tailored for cooperative TSC. Extensive experiments on both synthetic and real-world datasets demonstrate that CoLLMLight outperforms state-of-the-art methods in diverse traffic scenarios, showcasing its effectiveness, scalability, and robustness.

[NLP-106] LLM Agents for Education: Advances and Applications

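[Quick read]: This paper systematically surveys recent research on large language model (LLM) agents in education, asking how such agents can be used effectively to improve teaching and learning. It divides LLM agents into two classes: Pedagogical Agents, which support teachers and students, and Domain-Specific Educational Agents, which serve specialized fields such as science education, language learning, and professional development. The survey examines the technology underlying these agents, including key datasets, benchmarks, and algorithmic frameworks, and discusses challenges such as privacy, bias and fairness, hallucination mitigation, and integration with existing educational ecosystems. It aims to give a comprehensive technological overview of LLM agents for education and to foster research and collaboration that increase their benefit to learners and educators.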

Link: https://arxiv.org/abs/2503.11733
Authors: Zhendong Chu,Shen Wang,Jian Xie,Tinghui Zhu,Yibo Yan,Jinheng Ye,Aoxiao Zhong,Xuming Hu,Jing Liang,Philip S. Yu,Qingsong Wen
Affiliations: Squirrel Ai Learning (USA); Fudan University (China); Hong Kong University of Science and Technology (Guangzhou) (China); Tsinghua University (China); University of Illinois Chicago (USA)
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 17 pages

Abstract:Large Language Model (LLM) agents have demonstrated remarkable capabilities in automating tasks and driving innovation across diverse educational applications. In this survey, we provide a systematic review of state-of-the-art research on LLM agents in education, categorizing them into two broad classes: (1) Pedagogical Agents, which focus on automating complex pedagogical tasks to support both teachers and students; and (2) Domain-Specific Educational Agents, which are tailored for specialized fields such as science education, language learning, and professional development. We comprehensively examine the technological advancements underlying these LLM agents, including key datasets, benchmarks, and algorithmic frameworks that drive their effectiveness. Furthermore, we discuss critical challenges such as privacy, bias and fairness concerns, hallucination mitigation, and integration with existing educational ecosystems. This survey aims to provide a comprehensive technological overview of LLM agents for education, fostering further research and collaboration to enhance their impact for the greater good of learners and educators alike.

[NLP-107] Toward a method for LLM-enabled Indoor Navigation

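[Quick read]: This paper addresses the particular challenges of indoor navigation, including complex layouts, the lack of GPS signals, and accessibility needs; existing solutions fall short on real-time adaptability and user-specific needs. It explores the potential of a large language model (LLM), namely ChatGPT, to generate natural, context-aware navigation instructions from indoor map images. Test cases designed and evaluated across different real-world environments measure how well LLMs interpret spatial layouts, handle user constraints, and plan efficient routes. The results show potential for personalized indoor navigation, with an average of 52% correct indications and a maximum of 62%. Performance does not appear to depend on the complexity of the layout or of the expected path; it depends on the number of points of interest and the amount of visual information, both of which reduce performance.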

Link: https://arxiv.org/abs/2503.11702
Authors: Alberto Coffrini,Mohammad Amin Zadenoori,Paolo Barsocchi,Francesco Furfari,Antonino Crivello,Alessio Ferrari
Affiliations: Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR); Department of Computer Science, University of Pisa; University College Dublin (UCD)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, 5 tables

Abstract:Indoor navigation presents unique challenges due to complex layouts, lack of GPS signals, and accessibility concerns. Existing solutions often struggle with real-time adaptability and user-specific needs. In this work, we explore the potential of a Large Language Model (LLM), i.e., ChatGPT, to generate natural, context-aware navigation instructions from indoor map images. We design and evaluate test cases across different real-world environments, analyzing the effectiveness of LLMs in interpreting spatial layouts, handling user constraints, and planning efficient routes. Our findings demonstrate the potential of LLMs for supporting personalized indoor navigation, with an average of 52% correct indications and a maximum of 62%. The results do not appear to depend on the complexity of the layout or the complexity of the expected path, but rather on the number of points of interest and the abundance of visual information, which negatively affect the performance.

[NLP-108] LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models

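[Quick read]: This paper addresses the fact that the Logit Lens technique could not previously be applied to modern large language models such as Qwen-2.5 and Llama-3.1. The key to the solution is a set of component-specific hooks that capture the outputs of the attention mechanisms and the multilayer perceptrons (MLPs), while staying fully compatible with the HuggingFace transformer library and keeping inference overhead low. By automating key analytical workflows, the toolkit supports both interactive exploration and batch processing, enabling large-scale layer-wise analysis.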

Link: https://arxiv.org/abs/2503.11667
Authors: Zhenyu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models. While Logit Lens has been a crucial method for understanding internal representations of language models, it was previously limited to earlier model architectures. Our work overcomes the limitations of existing implementations, enabling the technique to be applied to state-of-the-art architectures (such as Qwen-2.5 and Llama-3.1) while automating key analytical workflows. By developing component-specific hooks to capture both attention mechanisms and MLP outputs, our implementation achieves full compatibility with the HuggingFace transformer library while maintaining low inference overhead. The toolkit provides both interactive exploration and batch processing capabilities, supporting large-scale layer-wise analyses. Through open-sourcing our implementation, we aim to facilitate deeper investigations into the internal mechanisms of large-scale language models. The toolkit is openly available at this https URL.
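
The core of the Logit Lens technique is to project an intermediate hidden state through the model's unembedding matrix and read off the most likely token at each layer. The toy sketch below shows only that projection step on hand-made vectors; it does not use the LogitLens4LLMs toolkit or any HuggingFace API, the function name is hypothetical, and it omits the final normalization layer that real implementations usually apply before unembedding.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def logit_lens(
    hidden_by_layer: List[Vector], unembed: Matrix, vocab: List[str]
) -> List[Tuple[str, float]]:
    """For each layer, return the top token and its probability.

    `unembed` has one row per vocabulary entry, so row i scores vocab[i].
    """
    results = []
    for hidden in hidden_by_layer:
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(vocab)), key=probs.__getitem__)
        results.append((vocab[best], probs[best]))
    return results
```

Reading the per-layer results top to bottom shows at which depth the model's prediction settles, which is the kind of layer-wise analysis the abstract describes.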

[NLP-109] An LLM-Based Approach for Insight Generation in Data Analysis NAACL2025

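[Quick read]: This paper addresses a central challenge in data analysis: generating insightful and actionable information from multi-table databases. It proposes a new method that uses large language models (LLMs) to generate textual insights automatically. The key is the framework design: a Hypothesis Generator formulates domain-relevant questions, a Query Agent answers them by generating SQL queries, and a Summarization module turns the query results into readable textual insights. Insights are evaluated for correctness and subjective insightfulness with a combination of human judgment and automated metrics. Experiments show that the method produces more insightful results than other approaches while maintaining correctness.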

Link: https://arxiv.org/abs/2503.11664
Authors: Alberto Sánchez Pérez,Alaa Boukhary,Paolo Papotti,Luis Castejón Lozano,Adam Elwood
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted for publication at NAACL 2025

Abstract:Generating insightful and actionable information from databases is critical in data analysis. This paper introduces a novel approach using Large Language Models (LLMs) to automatically generate textual insights. Given a multi-table database as input, our method leverages LLMs to produce concise, text-based insights that reflect interesting patterns in the tables. Our framework includes a Hypothesis Generator to formulate domain-relevant questions, a Query Agent to answer such questions by generating SQL queries against a database, and a Summarization module to verbalize the insights. The insights are evaluated for both correctness and subjective insightfulness using a hybrid model of human judgment and automated metrics. Experimental results on public and enterprise databases demonstrate that our approach generates more insightful insights than other approaches while maintaining correctness.
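
The three-stage flow described in the abstract (Hypothesis Generator, Query Agent, Summarization module) can be sketched end to end with Python's built-in `sqlite3`. The LLM calls are replaced by hard-coded stubs, and the table, question, and function names are hypothetical; only the shape of the pipeline is taken from the abstract.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database to run the sketch against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("North", 120.0), ("South", 80.0), ("North", 30.0)],
    )
    return conn


def hypothesis_generator() -> List[str]:
    # Stub: an LLM would propose domain-relevant questions here.
    return ["Which region has the highest total sales?"]


def query_agent(question: str) -> str:
    # Stub: an LLM would translate the question into SQL here.
    canned = {
        "Which region has the highest total sales?": (
            "SELECT region, SUM(amount) AS total FROM sales "
            "GROUP BY region ORDER BY total DESC LIMIT 1"
        )
    }
    return canned[question]


def summarize(row: tuple) -> str:
    # Stub: an LLM would verbalize the query result here.
    region, total = row
    return f"{region} leads with total sales of {total:.1f}."


def generate_insights(conn: sqlite3.Connection) -> List[str]:
    """Run question -> SQL -> text for every generated hypothesis."""
    insights = []
    for question in hypothesis_generator():
        row = conn.execute(query_agent(question)).fetchone()
        if row is not None:
            insights.append(summarize(row))
    return insights
```

Keeping the three stages as separate functions makes it possible to check the SQL result for correctness independently of how the final sentence is worded.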

[NLP-110] Lorecast: Layout-Aware Performance and Power Forecasting from Natural Language

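[Quick read]: This paper addresses inaccurate performance and power forecasts for design options in chip design planning, together with the inefficiency of traditional methods. The key to the solution is Lorecast, a new methodology that takes natural language prompts as input and rapidly generates layout-aware performance and power estimates without developing hardware description language (HDL) code or running synthesis, which makes forecasting fast and user-friendly. Experiments show accuracy within a few percent of post-layout analysis.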

Link: https://arxiv.org/abs/2503.11662
Authors: Runzhi Wang,Prianka Sengupta,Yiran Chen,Jiang Hu
Affiliations: Dept. of Electrical and Computer Engineering, Texas A&M University; Dept. of Electrical and Computer Engineering, Duke University; Dept. of Computer Science and Engineering, Texas A&M University
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In chip design planning, obtaining reliable performance and power forecasts for various design options is of critical importance. Traditionally, this involves using system-level models, which often lack accuracy, or trial synthesis, which is both labor-intensive and time-consuming. We introduce a new methodology, called Lorecast, which accepts English prompts as input to rapidly generate layout-aware performance and power estimates. This approach bypasses the need for HDL code development or synthesis, making it both fast and user-friendly. Experimental results demonstrate that Lorecast achieves accuracy within a few percent of error compared to post-layout analysis.

[NLP-111] Automating Mathematical Proof Generation Using Large Language Model Agents and Knowledge Graphs

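[Quick read]: This paper addresses the challenges large language models (LLMs) face in automatically identifying mathematical concepts, understanding how they relate, and formalizing proofs within a rigorous framework. The key to the solution is a new framework that augments LLMs with knowledge graphs to construct and formalize mathematical proofs. It achieves a success rate of up to 34% on the MUSTARDSAUCE dataset and consistently outperforms baseline approaches by 2-11% across different models. The approach bridges the gap between natural language understanding and formal logic proof systems and improves on the results of the underlying foundation models.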

Link: https://arxiv.org/abs/2503.11657
Authors: Vincent Li,Yule Fu,Tim Knappe,Kevin Han,Kevin Zhu
Affiliations: Boston University; Duke University; Provadis; Algoverse AI Research; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models have demonstrated remarkable capabilities in natural language processing tasks, including mathematical problem-solving that requires multi-step logical reasoning. However, challenges persist in automating the identification of key mathematical concepts, understanding their interrelations, and formalizing proofs within a rigorous framework. We present a novel framework that leverages knowledge graphs to augment LLMs to construct and formalize mathematical proofs. Our results demonstrate significant performance improvements across multiple datasets when using knowledge graphs, achieving up to a 34% success rate on the MUSTARDSAUCE dataset with o1-mini and consistently outperforming baseline approaches by 2-11% across different models. We show how this approach bridges the gap between natural language understanding and formal logic proof systems and achieves elevated results for foundation models over the baseline.

[NLP-112] TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models

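[Quick read]: This paper addresses the persistence and evolution of sycophancy in multi-turn conversations with large language models, where a model excessively agrees with or flatters the user, often at the expense of factual accuracy. Prior work has focused mainly on single-turn interactions, leaving multi-step conversations largely unexplored. The paper introduces TRUTH DECAY, a benchmark for evaluating sycophancy when language models face iterative user feedback, challenges, and persuasion. Models are tested with prompts that elicit four types of sycophantic bias, and the paper proposes and tests sycophancy reduction strategies whose effectiveness is evaluated beyond single-step interactions.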

Link: https://arxiv.org/abs/2503.11656
Authors: Joshua Liu,Aarav Jain,Soham Takuri,Srihan Vege,Aslihan Akalin,Kevin Zhu,Sean O’Brien,Vasu Sharma
Affiliations: Algoverse AI Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Rapid improvements in large language models have unveiled a critical challenge in human-AI interaction: sycophancy. In this context, sycophancy refers to the tendency of models to excessively agree with or flatter users, often at the expense of factual accuracy. While previous studies have primarily analyzed this behavior in single-turn interactions, its persistence and evolution in multi-step conversations remain largely unexplored. We introduce TRUTH DECAY, a benchmark specifically designed to evaluate sycophancy in extended dialogues, where language models must navigate iterative user feedback, challenges, and persuasion. We prompt models to elicit four types of sycophantic biases. We then propose and test sycophancy reduction strategies, evaluating their effectiveness beyond single-step interactions.
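
One way to quantify how factual accuracy erodes over a multi-turn conversation is to track accuracy per turn and compare the first turn with the last. This is a hypothetical metric sketched for illustration; the abstract does not define the benchmark's actual scoring, and the function names are assumptions.

```python
from typing import List


def accuracy_by_turn(conversations: List[List[bool]]) -> List[float]:
    """Fraction of correct answers at each turn index.

    Each conversation is a list of booleans: was the model's answer
    still factually correct after that turn of user pushback?
    """
    turns = max(len(conv) for conv in conversations)
    accuracy = []
    for t in range(turns):
        answers = [conv[t] for conv in conversations if len(conv) > t]
        accuracy.append(sum(answers) / len(answers))
    return accuracy


def truth_decay(conversations: List[List[bool]]) -> float:
    """Accuracy drop from the first turn to the last turn."""
    accuracy = accuracy_by_turn(conversations)
    return accuracy[0] - accuracy[-1]
```

A positive value means the model gave up correct answers as the user kept pushing, which is the multi-turn effect the benchmark is designed to expose.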

[NLP-113] Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning

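[Quick read]: This paper addresses the underexplored trade-offs among performance, efficiency, and explainability of large language models (LLMs) in sentiment analysis. It comprehensively evaluates the open-source DeepSeek-R1 series of LLMs on sentiment analysis and compares them with OpenAI's GPT-4 and GPT-4-mini, systematically studying few-shot prompting up to 50-shot configurations to assess in-context learning. DeepSeek-R1 shows competitive accuracy on multi-class sentiment tasks and offers enhanced interpretability through its detailed reasoning process, balancing performance against explainability.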

Link: https://arxiv.org/abs/2503.11655
Authors: Donghao Huang,Zhaoxia Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, 4 tables, submitted to an IEEE journal

Abstract:Recent advancements in large language models (LLMs) have significantly enhanced sentiment analysis capabilities. However, the trade-offs between model performance, efficiency, and explainability of some latest models remain underexplored. This study presents the first comprehensive evaluation of the DeepSeek-R1 series of models, reasoning open-source LLMs, for sentiment analysis, comparing them against OpenAI’s GPT-4 and GPT-4-mini. We systematically analyze their performance under few-shot prompting conditions, scaling up to 50-shot configurations to assess in-context learning effectiveness. Our experiments reveal that DeepSeek-R1 demonstrates competitive accuracy, particularly in multi-class sentiment tasks, while offering enhanced interpretability through its detailed reasoning process. Additionally, we highlight the impact of increasing few-shot examples on model performance and discuss key trade-offs between explainability and computational efficiency.

Computer Vision

[CV-0] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation CVPR2025

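[Quick read]: This paper addresses the limited generalization of mobile manipulation across tasks and environments, and in particular the fact that existing vision-language-action (VLA) foundation models built for fixed-base manipulation cannot be applied directly to mobile robot platforms. It proposes MoManipVLA, an efficient policy adaptation framework that transfers pre-trained fixed-base VLA models to mobile manipulation. The method uses a pre-trained VLA model to generate end-effector waypoints with strong generalization, and designs motion planning objectives for the mobile base and the robot arm to ensure the trajectory is physically feasible. In a bi-level optimization framework, the upper level predicts waypoints for base movement to enlarge the manipulator policy space, and the lower level selects the optimal end-effector trajectory to complete the task. The robot base position can thus be adjusted in a zero-shot manner so that the waypoints predicted by the fixed-base VLA model become feasible. The approach raises the success rate of mobile manipulation and substantially lowers the training cost of real-world deployment.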

Link: https://arxiv.org/abs/2503.13446
Authors: Zhenyu Wu,Yuheng Zhou,Xiuwei Xu,Ziwei Wang,Haibin Yan
Affiliations: Beijing University of Posts and Telecommunications; Nanyang Technological University; Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project Page: this https URL

Abstract:Mobile manipulation is the fundamental challenge for robotics to assist humans with diverse tasks and environments in everyday life. However, conventional mobile manipulation approaches often struggle to generalize across different tasks and environments because of the lack of large-scale training. In contrast, recent advances in vision-language-action (VLA) models have shown impressive generalization capabilities, but these foundation models are developed for fixed-base manipulation tasks. Therefore, we propose an efficient policy adaptation framework named MoManipVLA to transfer pre-trained VLA models of fixed-base manipulation to mobile manipulation, so that high generalization ability across tasks and environments can be achieved in mobile manipulation policy. Specifically, we utilize pre-trained VLA models to generate waypoints of the end-effector with high generalization ability. We design motion planning objectives for the mobile base and the robot arm, which aim at maximizing the physical feasibility of the trajectory. Finally, we present an efficient bi-level objective optimization framework for trajectory generation, where the upper-level optimization predicts waypoints for base movement to enhance the manipulator policy space, and the lower-level optimization selects the optimal end-effector trajectory to complete the manipulation task. In this way, MoManipVLA can adjust the position of the robot base in a zero-shot manner, thus making the waypoints predicted from the fixed-base VLA models feasible. Extensive experimental results on OVMM and the real world demonstrate that MoManipVLA achieves a 4.2% higher success rate than the state-of-the-art mobile manipulation, and only requires 50 training cost for real world deployment due to the strong generalization ability in the pre-trained VLA models.

[CV-1] VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

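[Quick read]: This paper targets multimodal video reasoning, especially temporally grounded understanding, which remains underexplored. Although large language models have made significant progress in reasoning, effective methods for temporal localization and understanding in video are still lacking. The proposed solution is VideoMind, a video-language agent with two core innovations: (i) it identifies the capabilities essential to video temporal reasoning and builds a role-based agentic workflow consisting of a planner, a grounder, a verifier, and an answerer; (ii) it proposes a Chain-of-LoRA strategy that switches between roles seamlessly via lightweight LoRA adapters while avoiding the overhead of deploying multiple models, balancing efficiency and flexibility. Experiments show that VideoMind performs strongly on multiple public benchmarks, validating its effectiveness on video understanding tasks.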

Link: https://arxiv.org/abs/2503.13444
Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
Affiliations: The Hong Kong Polytechnic University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.
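The role-based workflow and adapter switching can be pictured with a small, self-contained sketch. It is only an illustration under assumptions: `SharedBackbone`, the lambda "adapters", and the fixed interval are hypothetical stand-ins, and no real LoRA weights or model calls are involved. The point is that one backbone object is reused and only the active role changes.

```python
class SharedBackbone:
    """One shared model; each role is a lightweight adapter swapped in on demand."""

    def __init__(self, adapters):
        self.adapters = adapters  # role name -> callable standing in for a LoRA adapter
        self.active = None
        self.switch_log = []

    def switch(self, role):
        if role not in self.adapters:
            raise KeyError(f"unknown role: {role}")
        self.active = role
        self.switch_log.append(role)

    def run(self, payload):
        return self.adapters[self.active](payload)


def answer_question(model, video_len, question):
    """Planner decides which roles to call; every role runs on the same backbone."""
    model.switch("planner")
    plan = model.run(question)
    state = {"question": question, "video_len": video_len}
    for role in plan:
        model.switch(role)
        state = model.run(state)
    return state


# Hypothetical adapters: each returns an updated copy of the shared state.
adapters = {
    "planner": lambda q: ["grounder", "verifier", "answerer"],
    "grounder": lambda s: {**s, "interval": (10.0, 20.0)},
    "verifier": lambda s: {
        **s,
        "verified": 0 <= s["interval"][0] < s["interval"][1] <= s["video_len"],
    },
    "answerer": lambda s: {
        **s,
        "answer": f"evidence in {s['interval']}" if s["verified"] else "unsure",
    },
}

model = SharedBackbone(adapters)
result = answer_question(model, video_len=60.0, question="When does the dog jump?")
print(result["answer"], model.switch_log)
```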

[CV-2] DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models CVPR2025

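[Quick read]: This paper addresses the Base-New Trade-off (BNT) problem that is pervasive in CLIP-based prompt tuning: continued fine-tuning on base (target) classes simultaneously degrades generalization to new (unseen) classes. Existing methods add constraints to regulate prompt tuning and balance BNT, but these constraints cannot fully avoid the mutual exclusivity between the optimization directions for base and new classes. The key solution is the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization of base and new tasks at the prompt level. Specifically, DPC clones a learnable parallel prompt and introduces a variable Weighting-Decoupling framework that independently controls the optimization directions of the dual prompts for base or new tasks, avoiding conflicts in generalization. It also proposes a Dynamic Hard Negative Optimizer, which uses the dual prompts to make the base-class optimization task more challenging. The paper further proves the feature channel invariance of the prompt vector during optimization, giving theoretical support for DPC's Weighting-Decoupling. Experiments show that DPC significantly improves base-class performance without introducing any external knowledge, while maintaining generalization to new classes.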

Link: https://arxiv.org/abs/2503.13443
Authors: Haoyang Li, Liang Wang, Chao Wang, Jing Jiang, Yan Peng, Guodong Long
Affiliations: Shanghai University; University of Technology Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 2025)

Abstract: The Base-New Trade-off (BNT) problem universally exists during the optimization of CLIP-based prompt tuning, where continuous fine-tuning on base (target) classes leads to a simultaneous decrease of generalization ability on new (unseen) classes. Existing approaches attempt to regulate the prompt tuning process to balance BNT by appending constraints. However, imposed on the same target prompt, these constraints fail to fully avert the mutual exclusivity between the optimization directions for base and new. As a novel solution to this challenge, we propose the plug-and-play Dual-Prompt Collaboration (DPC) framework, the first to decouple the optimization processes of base and new tasks at the prompt level. Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. Meanwhile, we propose a Dynamic Hard Negative Optimizer, utilizing dual prompts to construct a more challenging optimization task on base classes for enhancement. For interpretability, we prove the feature channel invariance of the prompt vector during the optimization process, providing theoretical support for the Weighting-Decoupling of DPC. Extensive experiments on multiple backbones demonstrate that DPC can significantly improve base performance without introducing any external knowledge beyond the base classes, while maintaining generalization to new classes. Code is available at: this https URL.
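One way to picture task-specific weighting of two prompts is a convex blend whose weight depends on the task. The sketch below is an assumption for illustration only: the abstract does not give the blending rule, and `mix_prompts`, `prompt_for`, `w_base`, and `w_new` are hypothetical names and values rather than DPC's actual formulation.

```python
def mix_prompts(backbone_prompt, parallel_prompt, weight):
    """Blend the tuned parallel prompt with the backbone prompt, element by element."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * p + (1.0 - weight) * b
        for b, p in zip(backbone_prompt, parallel_prompt)
    ]


def prompt_for(task, backbone_prompt, parallel_prompt, w_base=0.8, w_new=0.2):
    """Use a separate blending weight for base-class and new-class inference."""
    weight = w_base if task == "base" else w_new
    return mix_prompts(backbone_prompt, parallel_prompt, weight)


backbone = [0.0, 1.0]  # toy prompt vector kept from the original tuning run
parallel = [1.0, 0.0]  # toy cloned prompt, further tuned on base classes
base_prompt = prompt_for("base", backbone, parallel)
new_prompt = prompt_for("new", backbone, parallel)
print(base_prompt, new_prompt)
```

Under this assumed rule, base-class inference leans on the parallel prompt while new-class inference stays close to the backbone prompt, so tuning one does not pull the other in the same direction.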

[CV-3] Humanoid Policy ~ Human Policy

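[Quick read]: This paper addresses the fact that training manipulation policies from robot demonstrations alone is labor-intensive and hard to scale. To overcome this, it explores a more scalable data source, egocentric human demonstrations, as cross-embodiment training data. The key to the solution is narrowing the embodiment gap between humanoid robots and humans from both the data and the modeling side. First, the authors build an egocentric, task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. Second, they propose a human-humanoid behavior policy called Human Action Transformer (HAT), whose state-action space is unified across humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with a small amount of robot data, HAT models humanoid robots and humans as different embodiments without additional supervision. Experiments show that human data significantly improves HAT's generalization and robustness while greatly increasing data collection efficiency.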

Link: https://arxiv.org/abs/2503.13441
Authors: Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, Lars Paulsen, Ge Yang, Sha Yi, Guanya Shi, Xiaolong Wang
Affiliations: UC San Diego; CMU; University of Washington; MIT
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and data: this https URL

Abstract:Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: this https URL

[CV-4] MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

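[Quick read]: This paper addresses the quadratic complexity that Transformers face on long sequences, while also overcoming the difficulty RNN models have in capturing long-range dependencies and their weak performance on downstream understanding and complex reasoning tasks. The key contribution is a hybrid model, MaTVLM, which replaces a portion of the Transformer decoder layers in a pre-trained vision-language model (VLM) with Mamba-2 layers and uses the inherent relationship between attention and Mamba-2 to initialize the Mamba-2 layers and accelerate convergence. It then applies a single-stage distillation process, with the pre-trained VLM as teacher, to further improve convergence speed and performance. The study also examines the impact of differential distillation loss within the training framework. MaTVLM performs well on multiple benchmarks, achieving up to 3.6x faster inference and 27.5% lower GPU memory use while matching the teacher model and other models of comparable scale.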

Link: https://arxiv.org/abs/2503.13440
Authors: Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
Affiliations: School of EIC, Huazhong University of Science & Technology; Institute of Artificial Intelligence, Huazhong University of Science & Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and model are available at this http URL

Abstract:With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at this http URL.

[CV-5] Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

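[Quick read]: This paper addresses the fact that existing image-based 3D object reconstruction methods ignore the occlusions that are common in real scenes. It proposes Amodal3R, a conditional 3D generative model designed to reconstruct complete 3D objects from partial observations. The key to the solution is a mask-weighted multi-head cross-attention mechanism followed by an attention layer that explicitly uses occlusion priors, guiding the reconstruction to recover the geometry and appearance of occluded objects. Experiments show that Amodal3R, trained only on synthetic data, learns to recover complete 3D objects in real scenes with occlusions and substantially outperforms pipelines that first perform 2D amodal completion and then 3D reconstruction, setting a new benchmark for occlusion-aware 3D reconstruction.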

Link: https://arxiv.org/abs/2503.13439
Authors: Tianhao Wu, Chuanxia Zheng, Frank Guan, Andrea Vedaldi, Tat-Jen Cham
Affiliations: S-Lab, Nanyang Technological University; Visual Geometry Group, University of Oxford; Singapore Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a “foundation” 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
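A mask-weighted attention step can be sketched in a few lines of plain Python. This is a simplified single-head, single-query illustration under assumptions: the abstract does not specify how the mask enters the attention, so multiplying the softmax weights by a per-token `visibility` value and renormalising is a hypothetical choice, as are all names below.

```python
import math


def mask_weighted_attention(query, keys, values, visibility):
    """Scaled dot-product attention whose weights are scaled by a visibility mask.

    visibility[i] is in [0, 1]: 1 for a fully visible image token, 0 for an
    occluded one. Occluded tokens therefore contribute nothing to the output.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) * v for s, v in zip(scores, visibility)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all tokens are masked out")
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]    # two image tokens with identical keys
values = [[1.0, 2.0], [3.0, 4.0]]
attended = mask_weighted_attention(query, keys, values, visibility=[1.0, 0.0])
print(attended)
```

Without the mask both tokens would receive equal weight; with the second token marked as occluded, the output comes entirely from the visible token.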

[CV-6] Unified Autoregressive Visual Generation and Understanding with Continuous Tokens

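[Quick read]: This paper addresses the trade-off between visual generation and visual understanding. It proposes UniFluid, a unified autoregressive framework that uses continuous visual tokens for joint generation and understanding of images and text. The key to the solution is a single autoregressive architecture that processes multimodal image and text inputs and generates discrete tokens for text and continuous tokens for images. With a carefully tuned training recipe, the authors find that the two tasks can improve each other. In addition, choosing an appropriate loss balance weight, using stronger pre-trained large language models (LLMs), and applying random-order generation during training are important for achieving high-fidelity image generation within the unified framework. Built on the Gemma model series, UniFluid performs well on both image generation and understanding and transfers strongly to a range of downstream tasks.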

Link: https://arxiv.org/abs/2503.13436
Authors: Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, Michael Rubinstein, Michalis Raptis, Deqing Sun, Radu Soricut
Affiliations: Google DeepMind; MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Tech report

Abstract: We present UniFluid, a unified autoregressive framework for joint visual generation and understanding leveraging continuous visual tokens. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for image. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other. By selecting an appropriate loss balance weight, the unified model achieves results comparable to or exceeding those of single-task baselines on both tasks. Furthermore, we demonstrate that employing stronger pre-trained LLMs and random-order generation during training is important to achieve high-fidelity image generation within this unified framework. Built upon the Gemma model series, UniFluid exhibits competitive performance across both image generation and understanding, demonstrating strong transferability to various downstream tasks, including image editing for generation, as well as visual captioning and question answering for understanding.
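The "loss balance weight" mentioned above can be read as a scalar that trades off the two training objectives. The snippet below is a minimal sketch of that reading, assuming the weight multiplies the image-generation loss; which term is weighted, and the name `image_weight`, are assumptions rather than details given in the abstract.

```python
def unified_loss(text_loss, image_loss, image_weight=1.0):
    """Combine the understanding (text) loss and the generation (image) loss.

    image_weight is the hypothetical balance weight: larger values push
    training toward image generation, smaller values toward understanding.
    """
    if image_weight < 0:
        raise ValueError("image_weight must be non-negative")
    return text_loss + image_weight * image_loss


# Sweep a few candidate weights for fixed per-task losses (toy numbers).
sweep = {w: unified_loss(text_loss=2.0, image_loss=4.0, image_weight=w) for w in (0.0, 0.5, 1.0)}
print(sweep)
```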

[CV-7] WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

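[Quick read]: This paper addresses the limitations of existing 4D reconstruction methods on scenes with significant spatial movement of objects. Existing 4D reconstruction benchmarks mainly show actions confined to a limited scene (such as dancing in place) and do not adequately cover real scenes with wide-range spatial movement, which limits how well 4D reconstruction methods perform in complex scenes. In addition, current 4D reconstruction methods based on deformation fields struggle to estimate dynamics under wide-range spatial movement, further hindering high-quality 4D scene reconstruction.

To address these issues, the paper proposes a new 4D reconstruction benchmark, WideRange4D, which contains rich 4D scene data with large spatial variations and allows a more comprehensive evaluation of the generation capabilities of 4D generation methods. It also introduces a new 4D reconstruction method, Progress4D, which produces stable, high-quality results across a variety of complex 4D scene reconstruction tasks. Quantitative and qualitative comparisons on WideRange4D show that Progress4D outperforms existing state-of-the-art 4D reconstruction methods.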

Link: https://arxiv.org/abs/2503.13435
Authors: Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan
Affiliations: Peking University; University of Chinese Academy of Sciences; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project: this https URL

Abstract: With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing, and existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction with wide-range spatial movements. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods. Project: this https URL

[CV-8] BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

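[Quick read]: This paper addresses the limited precision and flexibility of existing diffusion-based methods for element-level visual manipulation. It proposes the BlobCtrl framework, which unifies element-level generation and editing through a probabilistic blob representation that decouples and represents spatial location, semantic content, and identity information, enabling precise element-level operations. The key elements of the solution are: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies that balance fidelity and diversity. Together these form the core of BlobCtrl.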

Link: https://arxiv.org/abs/2503.13434
Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project Webpage: this https URL

Abstract:Element-level visual manipulation is essential in digital content creation, but current diffusion-based methods lack the precision and flexibility of traditional tools. In this work, we introduce BlobCtrl, a framework that unifies element-level generation and editing using a probabilistic blob-based representation. By employing blobs as visual primitives, our approach effectively decouples and represents spatial location, semantic content, and identity information, enabling precise element-level manipulation. Our key contributions include: 1) a dual-branch diffusion architecture with hierarchical feature fusion for seamless foreground-background integration; 2) a self-supervised training paradigm with tailored data augmentation and score functions; and 3) controllable dropout strategies to balance fidelity and diversity. To support further research, we introduce BlobData for large-scale training and BlobBench for systematic evaluation. Experiments show that BlobCtrl excels in various element-level manipulation tasks while maintaining computational efficiency, offering a practical solution for precise and flexible visual content creation. Project page: this https URL

[CV-9] Less Biased Noise Scale Estimation for Threshold-Robust RANSAC

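[Quick read]: This paper addresses the problem that, when robustly estimating relative pose from image matches with RANSAC, the inlier threshold is hard to set by hand and hard to tune without ground-truth data. The key to the solution is a better estimate of the inlier noise scale. Specifically, the author revisits the SIMFIT method and corrects biases in its noise scale estimate, which arise from using the same data both to fit the model and to estimate the inlier noise, and from not taking the threshold itself into account. In addition, because the optimal threshold within a scene is approximately constant, the paper proposes a multi-pair extension of SIMFIT++ that filters the estimates to further improve results. The method is robust across a wide range of thresholds (Figure 1 of the paper).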

Link: https://arxiv.org/abs/2503.13433
Authors: Johan Edstedt
Affiliations: Linköping University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The gold standard for robustly estimating relative pose through image matching is RANSAC. While RANSAC is powerful, it requires setting the inlier threshold that determines whether the error of a correspondence under an estimated model is sufficiently small to be included in its consensus set. Setting this threshold is typically done by hand, and is difficult to tune without access to ground truth data. Thus, a method capable of automatically determining the optimal threshold would be desirable. In this paper we revisit inlier noise scale estimation, which is an attractive approach as the inlier noise scale is linear to the optimal threshold. We revisit the noise scale estimation method SIMFIT and find bias in the estimate of the noise scale. In particular, we fix underestimates from using the same data for fitting the model as estimating the inlier noise, and from not taking the threshold itself into account. Secondly, since the optimal threshold within a scene is approximately constant we propose a multi-pair extension of SIMFIT++, by filtering of estimates, which improves results. Our approach yields robust performance across a range of thresholds, shown in Figure 1.
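To make the roles of the inlier threshold, the consensus set, and the noise scale concrete, here is a toy example on 2D line fitting rather than relative pose. It is a sketch under stated assumptions and does not reproduce SIMFIT or SIMFIT++: the noise scale uses the textbook median-based estimator (1.4826 times the median absolute residual, which assumes roughly zero-centred Gaussian inlier noise), and the factor `3.0` that turns the scale into a threshold is a hypothetical constant.

```python
import random
import statistics


def ransac_line(points, threshold, iterations=200, seed=0):
    """Fit y = a*x + b with RANSAC; return ((a, b), consensus_set)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical pair cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Consensus set: points whose residual is within the inlier threshold.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


def median_noise_scale(model, inliers):
    """Robust noise scale from the median absolute residual of the inliers."""
    a, b = model
    residuals = [abs(y - (a * x + b)) for x, y in inliers]
    return 1.4826 * statistics.median(residuals)


# Eight noisy points near y = 2x + 1, plus two gross outliers.
inlier_pts = [(float(x), 2.0 * x + 1.0 + (0.1 if x % 2 == 0 else -0.1)) for x in range(8)]
outlier_pts = [(2.0, 30.0), (5.0, -20.0)]
points = inlier_pts + outlier_pts

model, inliers = ransac_line(points, threshold=0.5)
sigma = median_noise_scale(model, inliers)
suggested_threshold = 3.0 * sigma  # threshold taken as linear in the noise scale
print(len(inliers), sigma, suggested_threshold)
```

Note that this naive estimate reuses the same points for fitting and for measuring the noise, and only looks at residuals already below the threshold. These are the two sources of underestimation the abstract says the paper corrects.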

[CV-10] AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction

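[Quick read]: This paper addresses the problem of extracting infrastructure elements (such as lanes and crosswalks) from sensor data in real time for autonomous driving and representing them in vectorized form. Traditional methods use a learned bird's-eye view (BEV) encoder to fuse multi-view camera images into a joint latent BEV grid and predict an intermediate raster map from it, which provides dense spatial supervision but requires post-processing into the desired vectorized form. More recent work derives infrastructure elements directly from the latent BEV representation with vectorized map decoders, providing instance-level information. However, the vectorized map prediction performance of these methods still leaves room for improvement.

The paper proposes Augmentation Map Network (AugMapNet), whose key idea is a novel latent BEV grid augmentation technique that significantly improves the quality of the latent BEV representation. Compared with existing architectures, AugMapNet combines vector decoding and dense spatial supervision more effectively while remaining easy to integrate and generic. Experiments on the nuScenes and Argoverse2 datasets show improvements in vectorized map prediction of up to 13.3% over the StreamMapNet baseline at 60 m range, with larger gains at larger ranges. A detailed analysis of the latent BEV grid confirms that AugMapNet's latent space is more structured and shows that the value of the concept goes beyond performance gains.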

Link: https://arxiv.org/abs/2503.13430
Authors: Thomas Monninger, Md Zafar Anwar, Stanislaw Antol, Steffen Staab, Sihao Ding
Affiliations: Mercedes-Benz Research & Development North America; University of Stuttgart; University of Southampton
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Autonomous driving requires an understanding of the infrastructure elements, such as lanes and crosswalks. To navigate safely, this understanding must be derived from sensor data in real-time and needs to be represented in vectorized form. Learned Bird’s-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid. Traditionally, from this latent space, an intermediate raster map is predicted, providing dense spatial supervision but requiring post-processing into the desired vectorized form. More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information. Our approach, Augmentation Map Network (AugMapNet), proposes latent BEV grid augmentation, a novel technique that significantly enhances the latent BEV representation. AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining as straightforward to integrate and as generic as auxiliary supervision. Experiments on nuScenes and Argoverse2 datasets demonstrate significant improvements in vectorized map prediction performance up to 13.3% over the StreamMapNet baseline on 60m range and greater improvements on larger ranges. We confirm transferability by applying our method to another baseline and find similar improvements. A detailed analysis of the latent BEV grid confirms a more structured latent space of AugMapNet and shows the value of our novel concept beyond pure performance improvement. The code will be released soon.

[CV-11] Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes

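[Quick read]: This paper addresses the need for neural networks in high-stakes applications to be both robust and interpretable. Classifiers built on 3D volumetric object representations have become markedly more robust on out-of-distribution data, but their interpretability has not yet been studied in depth. The key solution is CAVE (Concept Aware Volumes for Explanations), a new approach that unifies interpretability and robustness in image classification. By extracting concepts from the volumetric representations of existing 3D-aware classifiers and extending those models, the authors design a classifier that is inherently interpretable and robust. Experiments show that the concepts CAVE discovers are used consistently across images while the classifier achieves superior robustness.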

Link: https://arxiv.org/abs/2503.13429
Authors: Nhi Pham, Bernt Schiele, Adam Kortylewski, Jonas Fischer
Affiliations: Max Planck Institute for Informatics, Germany; University of Freiburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rise of neural networks, especially in high-stakes applications, these networks need two properties (i) robustness and (ii) interpretability to ensure their safety. Recent advances in classifiers with 3D volumetric object representations have demonstrated a greatly enhanced robustness in out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design an inherently-interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. In an array of quantitative metrics for interpretability, we compare against different concept-based approaches across the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.

[CV-12] Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation

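[Quick read]: This paper addresses the pressing need for high-quality, large-scale articulated objects in tasks related to embodied AI. Existing methods for creating articulated objects are mainly data-driven or simulation-based, and are limited either by the scale and quality of the training data or by the fidelity and labor cost of the simulation. The proposed solution is Infinite Mobility, a new method that synthesizes high-fidelity articulated objects through procedural generation. Its key idea is to use procedural generation to produce high-quality articulated objects efficiently without relying on large amounts of annotated data or complex simulation, overcoming the limits of existing methods. A user study and quantitative evaluation show that the results surpass current state-of-the-art methods in physical properties and mesh quality and are comparable to human-annotated datasets. The paper also shows that the synthetic data can be used to train generative models, supporting further scaling.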

Link: https://arxiv.org/abs/2503.13424
Authors: Xinyu Lian, Zichao Yu, Ruiming Liang, Yitong Wang, Li Ray Luo, Kaixu Chen, Yuanzhen Zhou, Qihong Tang, Xudong Xu, Zhaoyang Lyu, Bo Dai, Jiangmiao Pang
Affiliations: Shanghai Artificial Intelligence Laboratory; South China University of Technology; University of Science and Technology of China; Tongji University; Fudan University; Harbin Institute of Technology, Shenzhen; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL. 10 pages, 12 figures

Abstract: Large-scale articulated objects with high quality are desperately needed for multiple tasks related to embodied AI. Most existing methods for creating articulated objects are either data-driven or simulation-based, which are limited by the scale and quality of the training data or the fidelity and heavy labour of the simulation. In this paper, we propose Infinite Mobility, a novel method for synthesizing high-fidelity articulated objects through procedural generation. User study and quantitative evaluation demonstrate that our method can produce results that surpass current state-of-the-art methods and are comparable to human-annotated datasets in both physics property and mesh quality. Furthermore, we show that our synthetic data can be used as training data for generative models, enabling next-step scaling up. Code is available at this https URL

[CV-13] Scale Efficient Training for Large Datasets CVPR2025

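[Quick read]: This paper addresses the loss of training efficiency as datasets grow, caused in particular by low-value samples: redundant samples, overly hard samples, and easy samples that contribute little. The key solution is Scale Efficient Training (SeTa), a dynamic sample pruning method that reduces training time losslessly. SeTa first prunes randomly to remove redundant samples, then clusters the remaining samples by learning difficulty as measured by loss, and finally applies a sliding window strategy that progressively removes overly hard clusters and inefficient easy clusters, yielding an efficient easy-to-hard training process. Experiments show that SeTa cuts training cost by up to 50% across a variety of large-scale datasets and tasks while maintaining or improving performance.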

Link: https://arxiv.org/abs/2503.13385
Authors: Qing Zhou, Junyu Gao, Qi Wang
Affiliations: Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by CVPR2025

Abstract: The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on various scale real datasets across various backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation) demonstrate the powerful effectiveness and universality of our approach. Code is available at this https URL.
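The three stages named in the abstract (random pruning, grouping by loss, sliding window) can be sketched as a per-epoch sample selector. This is a minimal illustration under assumptions, not the SeTa code: equal-size groups over loss-sorted samples stand in for the paper's clustering, and `keep_ratio`, `num_groups`, `window`, and the linear window schedule are hypothetical choices.

```python
import math
import random


def select_samples(losses, epoch, num_epochs, keep_ratio=0.8,
                   num_groups=5, window=3, seed=0):
    """Return the indices of the samples to train on in this epoch."""
    rng = random.Random(seed + epoch)
    n = len(losses)

    # Stage 1: random pruning to drop a share of (presumably redundant) samples.
    kept = rng.sample(range(n), max(1, int(n * keep_ratio)))

    # Stage 2: group the survivors by difficulty, using loss as the proxy.
    kept.sort(key=lambda i: losses[i])
    size = math.ceil(len(kept) / num_groups)
    groups = [kept[i:i + size] for i in range(0, len(kept), size)]

    # Stage 3: slide a window over the groups from easy to hard as training
    # progresses. Early epochs skip the hardest groups, late epochs skip the
    # easiest ones.
    max_start = max(0, len(groups) - window)
    start = round(max_start * epoch / max(1, num_epochs - 1))
    return [i for group in groups[start:start + window] for i in group]


losses = [i / 100 for i in range(100)]  # toy per-sample losses, index 0 is easiest
first = select_samples(losses, epoch=0, num_epochs=5, keep_ratio=1.0)
last = select_samples(losses, epoch=4, num_epochs=5, keep_ratio=1.0)
print(min(first), max(first), min(last), max(last))
```

With `keep_ratio=1.0` the random stage keeps everything, which makes the window easy to inspect: the first epoch uses the 60 easiest samples and the last epoch the 60 hardest.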

[CV-14] Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions

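[Quick read]: This paper addresses the challenge of providing detailed diagram descriptions for blind and low-vision (BLV) users, which stems from differences in needs and visual abilities between the annotator group and the end-user group. Existing studies show that descriptions written directly by sighted annotators are costly, bias-prone, and fall short of BLV users' standards. In response, the study has sighted individuals assess, rather than produce, diagram descriptions generated by vision-language models (VLMs) guided through multi-pass inference. This approach, based on latent supervision, improves the quality of the generated descriptions and proves useful to professional educators who are themselves BLV and teach visually impaired learners. The study also releases Sightation, a collection of datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training, and demonstrates its fine-tuning potential on various downstream tasks. The key idea is therefore to replace sighted generation with sighted assessment, improving description quality and meeting the actual needs of BLV users.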

Link: https://arxiv.org/abs/2503.13369
Authors: Wan Ju Kang, Eunki Kim, Na Min An, Sangryul Kim, Haemin Choi, Ki Hoon Kwak, James Thorne
Affiliations: KAIST AI; Sungkyunkwan University; Yonsei University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 37 pages, 10 figures, 21 tables

Abstract:Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess – rather than produce – diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and demonstrate their fine-tuning potential in various downstream tasks.

[CV-15] Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning

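[Quick read]: This paper addresses the gradual decline in attention to visual information that Multimodal Large Language Models (MLLMs) show during long-chain reasoning, which makes them over-rely on text and hurts multimodal reasoning. In an ablation on visual math tasks, the authors truncate the reasoning process midway and re-complete it with the image removed; accuracy on MathVista's test-hard subset drops by only about 2%, indicating that the model's own textual output dominates the subsequent reasoning. To counter this, the paper proposes Take-along Visual Conditioning (TVC), a strategy that shifts the image input to critical reasoning stages and compresses redundant visual tokens through dynamic pruning, so that the model keeps attending to visual information throughout reasoning. The method achieves state-of-the-art average performance across five mathematical reasoning benchmarks (+3.4% over the previous best), confirming the effectiveness of TVC for multimodal reasoning systems.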

Link: https://arxiv.org/abs/2503.13360
Authors: Hai-Long Sun, Zhun Sun, Houwen Peng, Han-Jia Ye
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: The project page is available at this https URL

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information, in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, causing text-over-relied outputs. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2% accuracy drop on MathVista’s test-hard subset, revealing the model’s textual outputs dominate the following reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs previous sota), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems.
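The two ingredients of TVC named in the abstract, re-supplying the image at chosen reasoning stages and pruning redundant visual tokens, can be sketched on plain lists. This is only an illustration under assumptions: the per-token `scores` (for example, attention received), the `reinject_at` step indices, and `keep_ratio` are hypothetical, and real systems operate on embeddings rather than string tokens.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring visual tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def build_context(text_steps, visual_tokens, scores, reinject_at, keep_ratio=0.5):
    """Interleave reasoning steps with pruned copies of the image tokens.

    The full image appears once at the start; before each step listed in
    reinject_at, a compressed copy is appended so later reasoning still
    sees visual evidence.
    """
    context = list(visual_tokens)
    for step_idx, step in enumerate(text_steps):
        if step_idx in reinject_at:
            context.extend(prune_visual_tokens(visual_tokens, scores, keep_ratio))
        context.append(step)
    return context


visual = ["v0", "v1", "v2", "v3"]
scores = [0.1, 0.9, 0.3, 0.8]  # hypothetical importance score per visual token
steps = ["s0", "s1", "s2"]
context = build_context(steps, visual, scores, reinject_at={2})
print(context)
```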

[CV-16] One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation

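[Quick read]: This paper addresses the high computational cost of diffusion-based super-resolution (SR) methods, which produce high-quality results, and the shortcomings of existing acceleration methods, which either fail to produce realistic perceptual details (e.g., SinSR) or may hallucinate non-existent structures (e.g., OSEDiff). The key contribution is RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. The core idea is to train the student network to produce images such that a new "fake" ResShift model trained on them coincides with the teacher model. In this way RSD achieves single-step restoration and outperforms the teacher by a large margin. Compared with SR methods based on pre-trained text-to-image models, RSD is competitive in perceptual quality, aligns better with the degraded input images, and needs fewer parameters and less GPU memory. Experiments show strong results on real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.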

Link: https://arxiv.org/abs/2503.13358
Authors: Daniil Selikhanovych, David Li, Aleksei Leonov, Nikita Gushchin, Sergei Kushneriuk, Alexander Filippov, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

点击查看摘要

Abstract:Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method is based on training the student network to produce such images that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift - SinSR - making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.

[CV-17] Parameter-free structure-texture image decomposition by unrolling

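Quick read: This paper addresses the structure-texture image decomposition problem. The key to its solution is LPR-NET, a neural network obtained by unrolling the Low Patch Rank (LPR) model. This approach learns the parameters automatically from data and is computationally much faster, while producing decompositions comparable to those of traditional iterative model-based methods. Moreover, although the network is trained on synthetic images, numerical experiments show that it generalizes well to natural images.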

Link: https://arxiv.org/abs/2503.13354
Authors: Laura Girometti,Jean-François Aujol,Antoine Guennec,Yann Traonmilin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA)
Comments: To be published in Conference Proceedings: Scale Space and Variational Method in Computer Vision, 2025

Abstract:In this work, we propose a parameter-free and efficient method to tackle the structure-texture image decomposition problem. In particular, we present a neural network LPR-NET based on the unrolling of the Low Patch Rank model. On the one hand, this allows us to automatically learn parameters from data, and on the other hand to be computationally faster while obtaining qualitatively similar results compared to traditional iterative model-based methods. Moreover, despite being trained on synthetic images, numerical experiments show the ability of our network to generalize well when applied to natural images.

[CV-18] TriDF: Triplane-Accelerated Density Fields for Few-Shot Remote Sensing Novel View Synthesis

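Quick read: This paper addresses the high computational cost and poor performance of few-shot novel view synthesis (NVS) in remote sensing scenes. Existing methods tend to overfit or are computationally inefficient when given limited input views, and they perform poorly on remote sensing scenes in particular. To address this, the paper proposes TriDF, an efficient hybrid 3D representation. Its key idea is to decouple color and volume density and model them independently, which lightens the load on the implicit radiance field and speeds up reconstruction. High-frequency color information is mapped onto a triplane representation, and the feature planes are optimized directly to accelerate convergence; image-based rendering with reference features from neighboring views compensates for the limited input. In addition, depth-guided optimization based on point clouds mitigates overfitting in few-shot NVS. Experiments show a 30x speedup over NeRF-based methods together with better peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) than advanced few-shot methods.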

Link: https://arxiv.org/abs/2503.13347
Authors: Jiaming Kang,Keyan Chen,Zhengxia Zou,Zhenwei Shi
Affiliations: Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing novel view synthesis (NVS) offers significant potential for 3D interpretation of remote sensing scenes, with important applications in urban planning and environmental monitoring. However, remote sensing scenes frequently lack sufficient multi-view images due to acquisition constraints. While existing NVS methods tend to overfit when processing limited input views, advanced few-shot NVS methods are computationally intensive and perform sub-optimally in remote sensing scenes. This paper presents TriDF, an efficient hybrid 3D representation for fast remote sensing NVS from as few as 3 input views. Our approach decouples color and volume density information, modeling them independently to reduce the computational burden on implicit radiance fields and accelerate reconstruction. We explore the potential of the triplane representation in few-shot NVS tasks by mapping high-frequency color information onto this compact structure, and the direct optimization of feature planes significantly speeds up convergence. Volume density is modeled as continuous density fields, incorporating reference features from neighboring views through image-based rendering to compensate for limited input data. Additionally, we introduce depth-guided optimization based on point clouds, which effectively mitigates the overfitting problem in few-shot NVS. Comprehensive experiments across multiple remote sensing scenes demonstrate that our hybrid representation achieves a 30x speed increase compared to NeRF-based methods, while simultaneously improving rendering quality metrics over advanced few-shot methods (7.4% increase in PSNR, 12.2% in SSIM, and 18.7% in LPIPS). The code is publicly available at this https URL

[CV-19] STEP: Simultaneous Tracking and Estimation of Pose for Animals and Humans

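Quick read: This paper addresses simultaneous pose tracking and estimation across diverse animal species and humans. Traditional discriminative models typically need predefined target states to determine the model weights. The paper removes this requirement with two modules, Gaussian Map Soft Prediction (GMSP) and the Offset Map Regression Adapter (OMRA), which eliminate the need for keypoint target states as input and streamline the pipeline. The key is to combine Transformer-based discriminative model prediction with these modules, so that after initialization in the first frame the method continuously tracks the target and outputs pose estimates for subsequent frames.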

Link: https://arxiv.org/abs/2503.13344
Authors: Shashikant Verma,Harish Katti,Soumyaratna Debnath,Yamuna Swamy,Shanmuganathan Raman
Affiliations: Indian Institute of Technology Gandhinagar (IITGN); National Institutes of Health (NIH)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce STEP, a novel framework utilizing Transformer-based discriminative model prediction for simultaneous tracking and estimation of pose across diverse animal species and humans. We are inspired by the fact that the human brain exploits spatiotemporal continuity and performs concurrent localization and pose estimation despite the specialization of brain areas for form and motion processing. Traditional discriminative models typically require predefined target states for determining model weights, a challenge we address through Gaussian Map Soft Prediction (GMSP) and Offset Map Regression Adapter (OMRA) Modules. These modules remove the necessity of keypoint target states as input, streamlining the process. Our method starts with a known target state initialized through a pre-trained detector or manual initialization in the initial frame of a given video sequence. It then seamlessly tracks the target and estimates keypoints of anatomical importance as output for subsequent frames. Unlike prevalent top-down pose estimation methods, our approach doesn’t rely on per-frame target detections due to its tracking capability. This facilitates a significant advancement in inference efficiency and potential applications. We train and validate our approach on datasets encompassing diverse species. Our experiments demonstrate superior results compared to existing methods, opening doors to various applications, including but not limited to action recognition and behavioral analysis.

[CV-20] Edit Transfer: Learning Image Editing via Vision In-Context Relations

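Quick read: This paper introduces a new task setting, Edit Transfer, in which a model learns a transformation from a single source-target example and applies it to a new query image. Text-based methods excel at semantic manipulation but struggle with precise geometric details (such as pose and viewpoint changes), while reference-based editing usually focuses on style or appearance and cannot handle non-rigid transformations. To overcome both limitations, Edit Transfer explicitly learns the editing transformation from the source-target pair. The key, inspired by in-context learning in large language models, is a visual relation in-context learning paradigm built on a DiT-based text-to-image model. The method arranges the edited example and the query image into a unified four-panel composite and applies lightweight LoRA fine-tuning to capture complex spatial transformations from few examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.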

Link: https://arxiv.org/abs/2503.13327
Authors: Lan Chen,Qi Mao,Yuchao Gu,Mike Zheng Shou
Affiliations: MIPG, Communication University of China; Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts, they often struggle with precise geometric details (e.g., poses and viewpoint changes). Reference-based editing, on the other hand, typically focuses on style or appearance and fails at non-rigid transformations. By explicitly learning the editing transformation from a source-target pair, Edit Transfer mitigates the limitations of both text-only and appearance-centric references. Drawing inspiration from in-context learning in large language models, we propose a visual relation in-context learning paradigm, building upon a DiT-based text-to-image model. We arrange the edited example and the query image into a unified four-panel composite, then apply lightweight LoRA fine-tuning to capture complex spatial transformations from minimal examples. Despite using only 42 training samples, Edit Transfer substantially outperforms state-of-the-art TIE and RIE methods on diverse non-rigid scenarios, demonstrating the effectiveness of few-shot visual relation learning.

[CV-21] MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis

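Quick read: This paper addresses step distillation of large-scale video diffusion models (VDMs) for portrait video synthesis, where insufficient training memory and training collapse are the main obstacles. It proposes Weak-to-Strong Video Distillation (W2SVD): LoRA is used to ease the memory limit, and a weak-to-strong (W2S) distribution matching strategy shifts the parameters of the real DiT slightly toward those of the fake DiT. This alleviates the problem of videos from the few-step generator deviating from the real data distribution and further improves the visual quality of the synthesized videos. The key is the combination of LoRA and W2S distribution matching, which improves both memory efficiency and training stability.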

Link: https://arxiv.org/abs/2503.13319
Authors: Shitong Shao,Hongwei Yi,Hanzhong Guo,Tian Ye,Daquan Zhou,Michael Lingelbach,Zhiqiang Xu,Zeke Xie
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fine-tuning open-source large-scale VDMs for the portrait video synthesis task can result in significant improvements across multiple dimensions, such as visual quality and natural facial motion dynamics. Despite their advancements, how to achieve step distillation and reduce the substantial computational overhead of large-scale VDMs remains unexplored. To fill this gap, this paper proposes Weak-to-Strong Video Distillation (W2SVD) to mitigate both the issue of insufficient training memory and the problem of training collapse observed in vanilla DMD during the training process. Specifically, we first leverage LoRA to fine-tune the fake diffusion transformer (DiT) to address the out-of-memory issue. Then, we employ the W2S distribution matching to adjust the real DiT’s parameter, subtly shifting it toward the fake DiT’s parameter. This adjustment is achieved by utilizing the weak weight of the low-rank branch, effectively alleviating the conundrum where the video synthesized by the few-step generator deviates from the real data distribution, leading to inaccuracies in the KL divergence approximation. Additionally, we minimize the distance between the fake data distribution and the ground truth distribution to further enhance the visual quality of the synthesized videos. As experimentally demonstrated on HunyuanVideo, W2SVD surpasses the standard Euler, LCM, DMD and even the 28-step standard sampling in FID/FVD and VBench in 1/4-step video synthesis. The project page is at this https URL.
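
As a rough illustration of the low-rank branch mentioned in the abstract, the sketch below applies the standard LoRA update `W + scale * (alpha / r) * B A` to a tiny weight matrix. Reading the "weak weight" as a scale below 1 applied to the same branch is an assumption made here for illustration; the function names and the `scale` parameter are hypothetical and not taken from W2SVD.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w: Matrix, b: Matrix, a: Matrix,
                alpha: float, scale: float = 1.0) -> Matrix:
    """Return W + scale * (alpha / r) * (B @ A), where r is the LoRA rank.

    scale = 1.0 gives the usual LoRA-adapted weight; a smaller scale is the
    assumed stand-in for a "weak" version of the same low-rank branch.
    """
    rank = len(a)
    delta = matmul(b, a)
    factor = scale * alpha / rank
    return [[w[i][j] + factor * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (d_out x d_in)
b = [[1.0], [2.0]]             # low-rank factor B (d_out x r)
a = [[0.5, 0.5]]               # low-rank factor A (r x d_in)

full_branch = lora_weight(w, b, a, alpha=1.0)             # full low-rank update
weak_branch = lora_weight(w, b, a, alpha=1.0, scale=0.0)  # branch switched off
```

Only `B` and `A` are trained in LoRA, which is why it reduces memory: the base matrix `w` stays frozen and can be shared between the adapted and unadapted models.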

[CV-22] UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation

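Quick read: This paper addresses the long-standing challenge of estimating the 3D pose of a hand and the object it may be holding from monocular images. Existing methods are specialized for either the bare-hand or the hand-object interaction scenario; none handles both flexibly, and performance degrades when a method is applied to the other scenario. The paper proposes UniHOPE, a unified approach to 3D hand-object pose estimation that adapts to both. Its key components are a grasp-aware feature fusion module that integrates hand and object features, and an object switcher that dynamically controls hand-object pose estimation according to the grasping status. To make hand pose estimation robust whether or not an object is present, the authors also generate realistic de-occluded image pairs to train the model on object-induced hand occlusions, and propose multi-level feature enhancement techniques for learning occlusion-invariant features. Experiments show that UniHOPE achieves state-of-the-art (SOTA) performance on three commonly used benchmarks.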

Link: https://arxiv.org/abs/2503.13303
Authors: Yinqiao Wang,Hao Xu,Pheng-Ann Heng,Chi-Wing Fu
Affiliations: Department of Computer Science and Engineering, Institute of Medical Intelligence and XR, The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures, 7 tables

Abstract:Estimating the 3D pose of hand and potential hand-held object from monocular images is a longstanding challenge. Yet, existing methods are specialized, focusing on either bare-hand or hand interacting with object. No method can flexibly handle both scenarios and their performance degrades when applied to the other scenario. In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation, flexibly adapting both scenarios. Technically, we design a grasp-aware feature fusion module to integrate hand-object features with an object switcher to dynamically control the hand-object pose estimation according to grasping status. Further, to uplift the robustness of hand pose estimation regardless of object presence, we generate realistic de-occluded image pairs to train the model to learn object-induced hand occlusions, and formulate multi-level feature enhancement techniques for learning occlusion-invariant features. Extensive experiments on three commonly-used benchmarks demonstrate UniHOPE’s SOTA performance in addressing hand-only and hand-object scenarios. Code will be released on this https URL.

[CV-23] Progressive Human Motion Generation Based on Text and Few Motion Frames

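Quick read: This paper addresses the difficulty existing text-to-motion (T2M) methods have in aligning the generated motion with the desired postures, since text alone cannot describe diverse postures precisely. For more controllable generation, the paper lets the user supply a few motion frames that specify exact postures as additional input. It therefore explores a new Text-Frame-to-Motion (TF2M) generation task, whose goal is to generate motion from text and very few given frames.

The key to the solution is the proposed Progressive Motion Generation (PMG) method. PMG starts from the frames with low uncertainty and generates frames with higher uncertainty over multiple stages. In each stage, a Text-Frame Guided Generator produces new frames conditioned on frame-aware semantics of the text, the given frames, and the frames generated in previous stages. To reduce the train-test gap caused by the multi-stage accumulation of incorrectly generated frames at test time, the paper also proposes a Pseudo-frame Replacement Strategy for training. Experiments show that PMG significantly outperforms existing T2M methods even with a single given frame, validating the approach.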

Link: https://arxiv.org/abs/2503.13300
Authors: Ling-An Zeng,Gaojie Wu,Ancong Wu,Jian-Fang Hu,Wei-Shi Zheng
Affiliations: School of Artificial Intelligence, Sun Yat-sen University; School of Computer Science and Engineering, Sun Yat-sen University; Guangdong Key Laboratory of Information Security Technology, Sun Yat-sen University; Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although existing text-to-motion (T2M) methods can produce realistic human motion from text description, it is still difficult to align the generated motion with the desired postures since using text alone is insufficient for precisely describing diverse postures. To achieve more controllable generation, an intuitive way is to allow the user to input a few motion frames describing precise desired postures. Thus, we explore a new Text-Frame-to-Motion (TF2M) generation task that aims to generate motions from text and very few given frames. Intuitively, the closer a frame is to a given frame, the lower the uncertainty of this frame is when conditioned on this given frame. Hence, we propose a novel Progressive Motion Generation (PMG) method to progressively generate a motion from the frames with low uncertainty to those with high uncertainty in multiple stages. During each stage, new frames are generated by a Text-Frame Guided Generator conditioned on frame-aware semantics of the text, given frames, and frames generated in previous stages. Additionally, to alleviate the train-test gap caused by multi-stage accumulation of incorrectly generated frames during testing, we propose a Pseudo-frame Replacement Strategy for training. Experimental results show that our PMG outperforms existing T2M generation methods by a large margin with even one given frame, validating the effectiveness of our PMG. Code will be released.
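
The abstract's intuition (frames closer to a given frame are less uncertain, so they are generated first) can be sketched as a simple scheduling function. The helper name `stage_schedule`, the use of temporal distance as the uncertainty proxy, and the even split into stages are assumptions for illustration; the paper's actual staging rule may differ.

```python
from typing import List


def stage_schedule(num_frames: int, given: List[int],
                   num_stages: int) -> List[List[int]]:
    """Group the frames still to be generated into stages, nearest-first.

    Uncertainty is approximated by the distance (in frames) to the closest
    given frame; `given` must be non-empty and `num_stages` at least 1.
    """
    given_set = set(given)
    remaining = [t for t in range(num_frames) if t not in given_set]
    if not remaining:
        return []
    # Stable sort: frames at equal distance keep their temporal order.
    remaining.sort(key=lambda t: min(abs(t - g) for g in given))
    size = -(-len(remaining) // num_stages)  # ceiling division
    return [remaining[i:i + size] for i in range(0, len(remaining), size)]


# One given frame in the middle of a 7-frame clip, generated in 3 stages.
schedule = stage_schedule(num_frames=7, given=[3], num_stages=3)
```

Each stage would then call the generator with the text, the given frames, and every frame produced by earlier stages as conditioning.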

[CV-24] Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

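Quick read: This paper addresses the synthesis of consistent, photorealistic 3D scenes, an open problem in computer vision. Current video diffusion models generate high-quality videos but cannot directly synthesize 3D representations, so the generated sequences lack 3D consistency. Training generative 3D models directly is also hard because large-scale 3D training data is scarce. The paper proposes Generative Gaussian Splatting (GGS), which integrates a 3D representation with a pre-trained latent video diffusion model. The key is that the model synthesizes a feature field parameterized by 3D Gaussian primitives; this field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. Experiments show that GGS significantly improves both the 3D consistency of the generated multi-view images and the quality of the generated 3D scenes; compared with a similar model without a 3D representation, it improves FID by about 20% on both RealEstate10K and ScanNet+.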

Link: https://arxiv.org/abs/2503.13272
Authors: Katja Schwarz,Norman Mueller,Peter Kontschieder
Affiliations: Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) – a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet+, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet+. Project page: this https URL

[CV-25] FlexWorld: Progressively Expanding 3D Scenes for Flexiable-View Synthesis

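Quick read: This paper addresses the challenge of generating flexible-view 3D scenes (including 360° rotation and zooming) from a single image, which is difficult mainly because 3D data is scarce. The proposed FlexWorld framework has two key components: (1) a strong video-to-video (V2V) diffusion model that generates high-quality novel-view images from incomplete inputs rendered from a coarse scene, and (2) a progressive expansion process that builds a complete 3D scene. By leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, the V2V model can generate novel views under large camera pose variations. On top of it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Experiments show that FlexWorld produces novel-view videos and flexible-view 3D scenes of higher visual quality than existing state-of-the-art methods across multiple popular metrics and datasets.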

Link: https://arxiv.org/abs/2503.13265
Authors: Luxi Chen,Zihan Zhou,Min Zhao,Yikai Wang,Ge Zhang,Wenhao Huang,Hao Sun,Ji-Rong Wen,Chongxuan Li
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating flexible-view 3D scenes, including 360° rotation and zooming, from single images is challenging due to a lack of 3D data. To this end, we introduce FlexWorld, a novel framework consisting of two key components: (1) a strong video-to-video (V2V) diffusion model to generate high-quality novel view images from incomplete input rendered from a coarse scene, and (2) a progressive expansion process to construct a complete 3D scene. In particular, leveraging an advanced pre-trained video model and accurate depth-estimated training pairs, our V2V model can generate novel views under large camera pose variations. Building upon it, FlexWorld progressively generates new 3D content and integrates it into the global scene through geometry-aware scene fusion. Extensive experiments demonstrate the effectiveness of FlexWorld in generating high-quality novel view videos and flexible-view 3D scenes from single images, achieving superior visual quality under multiple popular metrics and datasets compared to existing state-of-the-art methods. Qualitatively, we highlight that FlexWorld can generate high-fidelity scenes with flexible views like 360° rotations and zooming. Project page: this https URL.

[CV-26] Don't Judge Before You CLIP: A Unified Approach for Perceptual Tasks

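Quick read: This paper addresses the small datasets and poor generalization in visual perceptual tasks that result from the scarcity of subjective human annotations. Whereas prior work typically designs a specialized model for each perceptual task, this paper proposes a unified architectural framework that uses CLIP (Contrastive Language-Image Pretraining) as a prior, so that a small amount of adaptation suffices for a variety of perceptual tasks. Drawing on cognitive science findings, the authors argue that CLIP learned not only to align images with text but also, implicitly, human sentiments and preferences, because its training data contains human-written captions that carry emotional content as well as factual description. With lightweight fine-tuning and no task-specific architectural changes, the framework achieves state-of-the-art results on three tasks (image memorability prediction, no-reference image quality assessment, and visual emotion analysis) and generalizes well across datasets.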

Link: https://arxiv.org/abs/2503.13260
Authors: Amit Zalcher,Navve Wasserman,Roman Beliy,Oliver Heinimann,Michal Irani
Affiliations: Weizmann Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual perceptual tasks aim to predict human judgment of images (e.g., emotions invoked by images, image quality assessment). Unlike objective tasks such as object/scene recognition, perceptual tasks rely on subjective human assessments, making its data-labeling difficult. The scarcity of such human-annotated data results in small datasets leading to poor generalization. Typically, specialized models were designed for each perceptual task, tailored to its unique characteristics and its own training dataset. We propose a unified architectural framework for solving multiple different perceptual tasks leveraging CLIP as a prior. Our approach is based on recent cognitive findings which indicate that CLIP correlates well with human judgment. While CLIP was explicitly trained to align images and text, it implicitly also learned human inclinations. We attribute this to the inclusion of human-written image captions in CLIP’s training data, which contain not only factual image descriptions, but inevitably also human sentiments and emotions. This makes CLIP a particularly strong prior for perceptual tasks. Accordingly, we suggest that minimal adaptation of CLIP suffices for solving a variety of perceptual tasks. Our simple unified framework employs a lightweight adaptation to fine-tune CLIP to each task, without requiring any task-specific architectural changes. We evaluate our approach on three tasks: (i) Image Memorability Prediction, (ii) No-reference Image Quality Assessment, and (iii) Visual Emotion Analysis. Our model achieves state-of-the-art results on all three tasks, while demonstrating improved generalization across different datasets.

[CV-27] Sampling Innovation-Based Adaptive Compressive Sensing CVPR2025

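Quick read: This paper addresses a weakness of existing scene-aware adaptive compressive sensing (ACS) methods: when facing unknown scenes, they lack accurate judgment and robust feedback mechanisms for adaptive sampling allocation (ASA), which limits high-fidelity sensing. The key solution is a sampling innovation-based ACS (SIB-ACS) method. It introduces an innovation criterion that predicts how much the image reconstruction error will decrease for a given sampling increment, and uses it to identify regions that are hard to reconstruct and allocate more samples to them. The paper also designs a sampling innovation-guided adaptive sampling framework that refines the allocation iteratively through multi-stage feedback, and proposes the Principal Component Compressed Domain Network (PCCD-Net) for efficient and faithful image reconstruction. Experiments show that SIB-ACS significantly outperforms state-of-the-art methods in reconstruction fidelity and visual quality.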

Link: https://arxiv.org/abs/2503.13241
Authors: Zhifu Tian,Tao Hu,Chaoyang Niu,Di Wu,Shu Wang
Affiliations: Information Engineering University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: CVPR2025 accepted

Abstract:Scene-aware Adaptive Compressive Sensing (ACS) has attracted significant interest due to its promising capability for efficient and high-fidelity acquisition of scene images. ACS typically prescribes adaptive sampling allocation (ASA) based on previous samples in the absence of ground truth. However, when confronting unknown scenes, existing ACS methods often lack accurate judgment and robust feedback mechanisms for ASA, thus limiting the high-fidelity sensing of the scene. In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. An innovation criterion is proposed to judge ASA by predicting the decrease in image reconstruction error attributable to sampling increments, thereby directing more samples towards regions where the reconstruction error diminishes significantly. A sampling innovation-guided multi-stage adaptive sampling (AS) framework is proposed, which iteratively refines the ASA through a multi-stage feedback process. For image reconstruction, we propose a Principal Component Compressed Domain Network (PCCD-Net), which efficiently and faithfully reconstructs images under AS scenarios. Extensive experiments demonstrate that the proposed SIB-ACS method significantly outperforms the state-of-the-art methods in terms of image reconstruction fidelity and visual effects. Codes are available at this https URL.
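
One way to picture the allocation step described in the abstract is to split a measurement budget across image blocks in proportion to each block's predicted error decrease. The sketch below does this with largest-remainder rounding. The function name `allocate_samples`, the proportional rule, and the assumption that the predicted decreases are non-negative numbers supplied by some predictor are all illustrative; the paper's criterion and allocation rule may differ.

```python
from typing import List


def allocate_samples(innovation: List[float], budget: int) -> List[int]:
    """Split `budget` measurements across blocks proportionally to `innovation`.

    `innovation[i]` is the (assumed non-negative) predicted drop in
    reconstruction error if block i receives more samples.
    """
    total = sum(innovation)
    if total <= 0:
        # No block is predicted to benefit: fall back to an even split.
        base, extra = divmod(budget, len(innovation))
        return [base + (1 if i < extra else 0) for i in range(len(innovation))]
    shares = [budget * v / total for v in innovation]
    alloc = [int(s) for s in shares]  # floor, since shares are non-negative
    leftover = budget - sum(alloc)
    # Largest-remainder rounding so the allocation sums exactly to the budget.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


allocation = allocate_samples([3.0, 1.0, 0.0], budget=8)
```

In a multi-stage scheme, this function would be called once per stage, with the innovation values re-estimated from the reconstruction obtained so far.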

[CV-28] Gradient Extrapolation for Debiased Representation Learning

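Quick read: This paper addresses the tendency of classification models trained with Empirical Risk Minimization (ERM) to rely inadvertently on spurious correlations. When these unintended associations between non-target attributes and target labels are absent from the test data, generalization suffers. The paper proposes Gradient Extrapolation for Debiased Representation Learning (GERNE). Its key idea is to take two batches containing different amounts of spurious correlation and define the target gradient as a linear extrapolation of the two gradients computed from each batch's loss. If the extrapolated gradient is directed toward the gradient of the batch with less spurious correlation, it can steer training toward a debiased model. GERNE also serves as a general framework in which ERM, reweighting, and resampling appear as special cases. The authors derive theoretical upper and lower bounds on the extrapolation factor to ensure convergence; by tuning this factor, GERNE can be adapted to maximize Group-Balanced Accuracy (GBA) or Worst-Group Accuracy. On five vision benchmarks and one natural language processing (NLP) benchmark, the method is competitive with, and often better than, state-of-the-art baselines.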

Link: https://arxiv.org/abs/2503.13236
Authors: Ihab Asaad,Maha Shadaydeh,Joachim Denzler
Affiliations: Computer Vision Group, Friedrich Schiller University Jena
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch’s loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with fewer amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing with methods, such as ERM, reweighting, and resampling, being shown as special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision and one NLP benchmarks, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.
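
The linear extrapolation of two batch gradients can be written in a few lines. The sketch below assumes the form `g_lb + c * (g_lb - g_b)`, where `g_b` comes from the more biased batch, `g_lb` from the less biased one, and `c` is the extrapolation factor. This parameterization and the function names are assumptions made for illustration; the abstract does not give the exact formula or the bounds on the factor.

```python
from typing import List


def extrapolate_gradient(g_biased: List[float],
                         g_less_biased: List[float],
                         c: float) -> List[float]:
    """Linear extrapolation g_lb + c * (g_lb - g_b), element by element.

    c = -1 recovers the biased-batch gradient, c = 0 the less-biased one,
    and c > 0 pushes past it, away from the biased direction.
    """
    if len(g_biased) != len(g_less_biased):
        raise ValueError("gradients must have the same length")
    return [lb + c * (lb - b) for b, lb in zip(g_biased, g_less_biased)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """One plain gradient-descent update using the extrapolated gradient."""
    return [p - lr * g for p, g in zip(params, grad)]


g_b = [1.0, 0.0]    # gradient from the batch with more spurious correlation
g_lb = [0.0, 1.0]   # gradient from the batch with less spurious correlation
target = extrapolate_gradient(g_b, g_lb, c=1.0)
new_params = sgd_step([1.0, 1.0], target, lr=0.5)
```

Under this parameterization, the two endpoints `c = -1` and `c = 0` correspond to training on either batch alone, which gives a feel for how other sampling schemes can appear as special cases of one family.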

[CV-29] HoloGest: Decoupled Diffusion and Motion Priors for Generating Holisticly Expressive Co-speech Gestures

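Quick read: This paper addresses the generation of holistic co-speech gestures for virtual character animation, in particular the physically unnatural results that existing systems produce because of the weak correlation between audio and gestures, and the resulting damage to the user experience. The key contribution is HoloGest, a neural network framework based on decoupled diffusion and motion priors. It learns a robust prior with low audio dependency and high motion reliance from large-scale human motion datasets, which enables high-quality, expressive gesture generation. By combining implicit joint constraints with explicit geometric and conditional constraints, it significantly speeds up diffusion-based generation while maintaining high-fidelity motion. A shared embedding space additionally aligns gestures with the transcribed text, so that the generated gestures are semantically correct.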

Link: https://arxiv.org/abs/2503.13229
Authors: Yongkang Cheng,Shaoli Huang
Affiliations: Tencent AILab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 3DV 2025

Abstract:Animating virtual characters with holistic co-speech gestures is a challenging but critical task. Previous systems have primarily focused on the weak correlation between audio and gestures, leading to physically unnatural outcomes that degrade the user experience. To address this problem, we introduce HoloGest, a novel neural network framework based on decoupled diffusion and motion priors for the automatic generation of high-quality, expressive co-speech gestures. Our system leverages large-scale human motion datasets to learn a robust prior with low audio dependency and high motion reliance, enabling stable global motion and detailed finger movements. To improve the generation efficiency of diffusion-based models, we integrate implicit joint constraints with explicit geometric and conditional constraints, capturing complex motion distributions between large strides. This integration significantly enhances generation speed while maintaining high-quality motion. Furthermore, we design a shared embedding space for gesture-transcription text alignment, enabling the generation of semantically correct gesture actions. Extensive experiments and user feedback demonstrate the effectiveness and potential applications of our model, with our method achieving a level of realism close to the ground truth, providing an immersive user experience. Our code, model, and demo are available at this https URL.

[CV-30] Mind the Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch CVPR2025

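Quick read: This paper addresses the degradation of pseudo-label quality caused by data heterogeneity in Federated Semi-Supervised Learning (FSSL), which in turn hurts the performance and convergence of the global model. The paper shows that heterogeneity not only aggravates pseudo-label mismatches but also makes the predictive tendencies of local and global models diverge. To counter this, it proposes Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), whose key idea is to correct pseudo-labels flexibly based on confidence discrepancies. This mitigates the performance loss caused by incorrect pseudo-labels and strengthens the consensus between local and global models. Experiments show that SAGE outperforms existing FSSL methods in both performance and convergence.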

Link: https://arxiv.org/abs/2503.13227
Authors: Yijie Liu,Xinyi Shang,Yiqun Zhang,Yang Lu,Chen Gong,Jing-Hao Xue,Hanzi Wang
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Federated Semi-Supervised Learning (FSSL) aims to leverage unlabeled data across clients with limited labeled data to train a global model with strong generalization ability. Most FSSL methods rely on consistency regularization with pseudo-labels, converting predictions from local or global models into hard pseudo-labels as supervisory signals. However, we discover that the quality of pseudo-label is largely deteriorated by data heterogeneity, an intrinsic facet of federated learning. In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models’ predictive tendencies diverge as heterogeneity increases. Motivated by these findings, we propose a simple and effective method called Semi-supervised Aggregation for Globally-Enhanced Ensemble (SAGE), that can flexibly correct pseudo-labels based on confidence discrepancies. This strategy effectively mitigates performance degradation caused by incorrect pseudo-labels and enhances consensus between local and global models. Experimental results demonstrate that SAGE outperforms existing FSSL methods in both performance and convergence. Our code is available at this https URL

[CV-31] Dense Policy: Bidirectional Autoregressive Learning of Actions

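Quick read: This paper starts from the observation that mainstream visuomotor policies rely on generative models for holistic action prediction, while current autoregressive policies, which predict the next token or chunk, perform suboptimally. Its key solution is a bidirectionally expanded learning approach called Dense Policy. Using a lightweight encoder-only architecture, it unfolds an initial single frame into the target action sequence in a coarse-to-fine manner, with inference in logarithmic time. The aim is to unlock the potential of autoregressive policies for robotic manipulation and surpass existing holistic generative policies.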

Link: https://arxiv.org/abs/2503.13217
Authors: Yue Su,Xinyu Zhan,Hongjie Fang,Han Xue,Hao-Shu Fang,Yong-Lu Li,Cewu Lu,Lixin Yang
Affiliations: Shanghai Jiao Tong University; Xidian University; Shanghai Innovation Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Mainstream visuomotor policies predominantly rely on generative models for holistic action prediction, while current autoregressive policies, predicting the next token or chunk, have shown suboptimal results. This motivates a search for more effective learning methods to unleash the potential of autoregressive policies for robotic manipulation. This paper introduces a bidirectionally expanded learning approach, termed Dense Policy, to establish a new paradigm for autoregressive policies in action prediction. It employs a lightweight encoder-only architecture to iteratively unfold the action sequence from an initial single frame into the target sequence in a coarse-to-fine manner with logarithmic-time inference. Extensive experiments validate that our dense policy has superior autoregressive learning capabilities and can surpass existing holistic generative policies. Our policy, example data, and training code will be publicly available upon publication. Project page: https: //selenthis http URL.
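
The logarithmic-time claim follows from doubling the sequence length on each pass. The sketch below shows only that control flow: it starts from one action, duplicates every element to double the length, and hands the result to a `refine` callback standing in for the learned encoder. The function name, the duplication-based upsampling, and the scalar actions are simplifying assumptions, not the paper's architecture.

```python
from typing import Callable, List, Tuple


def coarse_to_fine(initial: float, horizon: int,
                   refine: Callable[[List[float]], List[float]]
                   ) -> Tuple[List[float], int]:
    """Unfold one action into `horizon` actions by repeated doubling.

    Returns the final sequence and the number of refinement passes,
    which grows like ceil(log2(horizon)).
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    seq = [initial]
    passes = 0
    while len(seq) < horizon:
        # Placeholder upsampling: repeat each action twice.
        upsampled = [a for a in seq for _ in range(2)]
        refined = refine(upsampled)
        if len(refined) != len(upsampled):
            raise ValueError("refine must preserve the sequence length")
        seq = refined[:horizon]
        passes += 1
    return seq, passes


# Identity refinement keeps the example deterministic.
actions, num_passes = coarse_to_fine(0.5, horizon=16, refine=lambda s: s)
```

A next-token autoregressive policy would need 16 sequential predictions for the same horizon, whereas this scheme needs 4 passes, each of which can process all positions in parallel.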

[CV-32] A General Adaptive Dual-level Weighting Mechanism for Remote Sensing Pansharpening CVPR

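[Quick read]: This paper addresses the failure of existing deep learning-based remote sensing pansharpening methods to fully exploit feature heterogeneity and redundancy, which limits their performance. The key solution is a general adaptive dual-level weighting mechanism (ADWM). Its core is Correlation-Aware Covariance Weighting (CACW), which models feature heterogeneity and redundancy, combined with two mechanisms: Intra-Feature Weighting (IFW) and Cross-Feature Weighting (CFW). IFW evaluates correlations among channels within a feature to reduce redundancy and enhance unique information; CFW adjusts the contribution of each layer based on inter-layer correlations to refine the final output. Experiments show that ADWM outperforms recent state-of-the-art methods.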

Link: https://arxiv.org/abs/2503.13214
Authors: Jie Huang,Haorui Chen,Jiaxuan Ren,Siran Peng,Liangjian Deng
Affiliations: University of Electronic Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Abstract:Currently, deep learning-based methods for remote sensing pansharpening have advanced rapidly. However, many existing methods struggle to fully leverage feature heterogeneity and redundancy, thereby limiting their effectiveness. We use the covariance matrix to model the feature heterogeneity and redundancy and propose Correlation-Aware Covariance Weighting (CACW) to adjust them. CACW captures these correlations through the covariance matrix, which is then processed by a nonlinear function to generate weights for adjustment. Building upon CACW, we introduce a general adaptive dual-level weighting mechanism (ADWM) to address these challenges from two key perspectives, enhancing a wide range of existing deep-learning methods. First, Intra-Feature Weighting (IFW) evaluates correlations among channels within each feature to reduce redundancy and enhance unique information. Second, Cross-Feature Weighting (CFW) adjusts contributions across layers based on inter-layer correlations, refining the final output. Extensive experiments demonstrate the superior performance of ADWM compared to recent state-of-the-art (SOTA) methods. Furthermore, we validate the effectiveness of our approach through generality experiments, redundancy visualization, comparison experiments, key variables and complexity analysis, and ablation studies. Our code is available at this https URL.
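
The covariance-then-nonlinearity idea behind CACW can be sketched in a few lines. This is an illustration only: the redundancy score (mean absolute covariance with the other channels) and the softmax over negative scores are assumptions chosen for the example, not the nonlinear function used in the paper, and `covariance` and `channel_weights` are hypothetical names.

```python
import math

def covariance(channels):
    """Sample covariance matrix between channels (each channel is a flat list)."""
    n = len(channels[0])
    means = [sum(c) / n for c in channels]
    centered = [[v - m for v in c] for c, m in zip(channels, means)]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def channel_weights(channels):
    """Assumed weighting: channels that co-vary less with the others get more weight."""
    cov = covariance(channels)
    k = len(cov)
    # Redundancy score: mean absolute covariance with the other channels.
    scores = [sum(abs(cov[i][j]) for j in range(k) if j != i) / (k - 1)
              for i in range(k)]
    # Assumed nonlinearity: softmax over negative redundancy scores.
    exps = [math.exp(-s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

features = [
    [1.0, 2.0, 3.0, 4.0],    # channel 0
    [2.0, 4.0, 6.0, 8.0],    # channel 1: a scaled copy of channel 0 (redundant)
    [1.0, -1.0, 1.0, -1.0],  # channel 2: weakly related to the others
]
print(channel_weights(features))
```

In this toy input, channel 2 receives the largest weight because it shares the least covariance with the other two channels.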

[CV-33] MedLoRD: A Medical Low-Resource Diffusion Model for High-Resolution 3D CT Image Synthesis

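[Quick read]: This paper addresses the scarcity of real data caused by patient privacy protection in medical imaging and the difficulty of synthesizing high-dimensional medical images in computational resource-constrained environments. The key solution is MedLoRD, a generative diffusion model designed for such environments. MedLoRD generates high-dimensional medical volumes at resolutions up to 512×512×256 on GPUs with only 24GB of VRAM. Evaluation across multiple modalities confirms the fidelity and clinical meaningfulness of the generated images and their close adherence to segmentation mask conditions, surpassing current state-of-the-art medical image synthesis methods in resource-constrained environments.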

Link: https://arxiv.org/abs/2503.13211
Authors: Marvin Seyfarth,Salman Ul Hassan Dar,Isabelle Ayx,Matthias Alexander Fink,Stefan O. Schoenberg,Hans-Ulrich Kauczor,Sandy Engelhardt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advancements in AI for medical imaging offer significant potential. However, their applications are constrained by the limited availability of data and the reluctance of medical centers to share it due to patient privacy concerns. Generative models present a promising solution by creating synthetic data as a substitute for real patient data. However, medical images are typically high-dimensional, and current state-of-the-art methods are often impractical for computational resource-constrained healthcare environments. These models rely on data sub-sampling, raising doubts about their feasibility and real-world applicability. Furthermore, many of these models are evaluated on quantitative metrics that alone can be misleading in assessing the image quality and clinical meaningfulness of the generated images. To address this, we introduce MedLoRD, a generative diffusion model designed for computational resource-constrained environments. MedLoRD is capable of generating high-dimensional medical volumes with resolutions up to 512×512×256, utilizing GPUs with only 24GB VRAM, which are commonly found in standard desktop workstations. MedLoRD is evaluated across multiple modalities, including Coronary Computed Tomography Angiography and Lung Computed Tomography datasets. Extensive evaluations through radiological evaluation, relative regional volume analysis, adherence to conditional masks, and downstream tasks show that MedLoRD generates high-fidelity images closely adhering to segmentation mask conditions, surpassing the capabilities of current state-of-the-art generative models for medical image synthesis in computational resource-constrained environments.

[CV-34] Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training

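[Quick read]: This paper addresses the high cost and time required to annotate large-scale data for panoptic segmentation of LiDAR point clouds. Conventional approaches rely on end-to-end deep learning models and extensive manual instance annotations, whereas the proposed method achieves competitive panoptic segmentation using only semantic labels, with no instance-level training or annotation. Used as a drop-in replacement for existing instance heads, it achieves performance comparable to or better than current state-of-the-art supervised methods on standard benchmarks such as SemanticKITTI and nuScenes. It runs in real time on a single-threaded CPU, is fully explainable, and requires no learning or parameter tuning.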

Link: https://arxiv.org/abs/2503.13203
Authors: Corentin Sautier,Gilles Puy,Alexandre Boulch,Renaud Marlet,Vincent Lepetit
Affiliations: LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS (France); Valeo.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state-of-the-art approaches typically rely on end-to-end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large-scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method achieves performance comparable to current state-of-the-art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop-in instance head replacement, while running in real-time on a single-threaded CPU and requiring no instance labels. Our method is fully explainable, and requires no learning or parameter tuning. Code is available at this https URL

[CV-35] 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors IROS

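[Quick read]: This paper addresses hierarchical panoptic segmentation of apple orchards on 3D data, which matters for crop yield estimation in agriculture. The key is a novel approach that simultaneously provides semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of plants (a single trunk with its fruits). This identifies individual plants, fruits, and trunks and captures the relationships among them, for example by accurately estimating the number of fruits associated with each tree in an orchard. To evaluate the approach, the paper also provides a dataset designed for this task, containing real data recorded with sensors ranging from a terrestrial laser scanner to RGB-D cameras mounted on different robotic platforms. Experiments show that the approach surpasses existing methods for 3D panoptic segmentation in the agricultural domain while also providing full hierarchical panoptic segmentation.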

Link: https://arxiv.org/abs/2503.13188
Authors: Matteo Sodano,Federico Magistri,Elias Marks,Fares Hosn,Aibek Zurbayev,Rodrigo Marcuzzi,Meher V. R. Malladi,Jens Behley,Cyrill Stachniss
Affiliations: Center for Robotics, University of Bonn, Germany; Department of Engineering Science, University of Oxford, UK; Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to IROS

Abstract:Crop yield estimation is a relevant problem in agriculture, because an accurate crop yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of plants (a single trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and to capture the relationships among them, such as precisely estimating the number of fruits associated with each tree in an orchard. Additionally, to efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to an RGB-D camera mounted on different robotic platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset has been made publicly available at this https URL. We will provide the open-source implementation of our approach and a public competition for hierarchical panoptic segmentation on the hidden test sets upon paper acceptance.

[CV-36] 3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

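[Quick read]: This paper addresses the limited ability of multimodal large language models (MLLMs) to handle 3D vision tasks, in particular understanding the 3D positions of objects in real scenes. Existing work focuses mainly on logical reasoning and 2D visual understanding, while the ability of MLLMs to operate in 3D remains at an early stage of exploration. The paper proposes a novel visual prompting method, 3DAxisPrompt, which uses 3D coordinate axes and masks generated by the Segment Anything Model (SAM) to give MLLMs explicit geometric priors and extend their strong 2D grounding and reasoning ability to real 3D scenes. The key is combining geometric information with the capabilities of the language model so that object positions in 3D space can be perceived effectively. The study also examines the potential and limits of different visual prompting formats and builds evaluation environments from four datasets (ScanRefer, ScanNet, FMB, and nuScenes), with extensive quantitative and qualitative experiments demonstrating the method's effectiveness. Even so, a single prompt engineering approach does not always give the best result on every 3D task, which suggests that more flexible prompting strategies are needed.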

Link: https://arxiv.org/abs/2503.13185
Authors: Dingning Liu,Cheng Wang,Peng Gao,Renrui Zhang,Xinzhu Ma,Yuan Meng,Zhihui Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and summarize our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScenes, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.

[CV-37] Triad: Empowering LMM-based Anomaly Detection with Vision Expert-guided Visual Tokenizer and Manufacturing Process

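[Quick read]: This paper addresses the poor generalization of existing large multimodal models (LMMs) in industrial anomaly detection (IAD). The paper attributes this to two causes. First, general-purpose LMMs lack cognition of defects in the visual modality and fail to focus sufficiently on anomalous regions. Second, existing methods identify defects mainly by learning defect patterns or comparing with normal samples and do not understand what causes the defects. Because defect generation is closely tied to the manufacturing process, the paper proposes a manufacturing-driven IAD paradigm, with an instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) that uses manufacturing information to improve detection. On this basis it presents Triad, a novel LMM-based method that combines an expert-guided region-of-interest tokenizer with the manufacturing process. Experiments show that Triad is competitive with current LMMs and that incorporating manufacturing processes further improves accuracy.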

Link: https://arxiv.org/abs/2503.13184
Authors: Yuanze Li,Shihao Yuan,Haolin Wang,Qizhang Li,Ming Liu,Chen Xu,Guangming Shi,Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Pazhou Lab Huangpu; Harbin Institute of Technology, Pazhou Lab Huangpu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current LMMs but also achieves further improved accuracy when equipped with manufacturing processes. Source code, training data, and pre-trained models will be publicly available at this https URL.

[CV-38] A super-resolution reconstruction method for lightweight building images based on an expanding feature modulation network

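[Quick read]: This paper addresses the limitations of existing lightweight image super-resolution networks in modeling long-range dependencies, particularly for building images with regular textures and long-range dependencies. It proposes a lightweight method based on a Dilated Contextual Feature Modulation Network (DCFMN). The key is an expansion separable modulation unit and a local feature enhancement module. The former uses multi-scale dilated convolutions to aggregate multi-scale features efficiently and a simple attention mechanism for adaptivity; the latter encodes local features and mixes channel information, with reparameterization ensuring no extra computational cost at inference. The method achieves accurate and efficient global feature modeling at low computational cost and significantly improves both reconstruction quality and lightweight efficiency for building image super-resolution.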

Link: https://arxiv.org/abs/2503.13179
Authors: Yi Zhang,Wenye Zhou,Ruonan Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes a lightweight method for building image super-resolution using a Dilated Contextual Feature Modulation Network (DCFMN). The process includes obtaining high-resolution images, down-sampling them to low-resolution, enhancing the low-resolution images, constructing and training a lightweight network model, and generating super-resolution outputs. To address challenges such as regular textures and long-range dependencies in building images, the DCFMN integrates an expansion separable modulation unit and a local feature enhancement module. The former employs multiple expansion convolutions equivalent to a large kernel to efficiently aggregate multi-scale features while leveraging a simple attention mechanism for adaptivity. The latter encodes local features, mixes channel information, and ensures no additional computational burden during inference through reparameterization. This approach effectively resolves the limitations of existing lightweight super-resolution networks in modeling long-range dependencies, achieving accurate and efficient global feature modeling without increasing computational costs, and significantly improving both reconstruction quality and lightweight efficiency for building image super-resolution models.
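
The claim that several dilated (expansion) convolutions can stand in for one large kernel follows from standard receptive-field arithmetic, sketched below. The kernel sizes and dilation rates are illustrative assumptions, not values reported for DCFMN, and the function names are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Span covered by one dilated convolution: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1

def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Three 3x3 convolutions with assumed dilations 1, 2, 3 cover a 13x13 window
# while using 3 * 9 = 27 weights per channel instead of 13 * 13 = 169.
layers = [(3, 1), (3, 2), (3, 3)]
print(stacked_receptive_field(layers))
print(sum(k * k for k, _ in layers), 13 * 13)
```

The same arithmetic explains the efficiency argument in the abstract: the spatial coverage grows with the dilation rates while the number of weights stays that of small kernels.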

[CV-39] DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

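[Quick read]: This paper addresses the reconstruction of clean, distractor-free 3D scenes from real-world captures, a challenge that is especially severe in highly dynamic and cluttered settings such as egocentric videos. It proposes DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction. The key is a decoupled dynamic-static Gaussian Splatting design: foreground Gaussians represent dynamic elements, background Gaussians represent static content, and a probabilistic mask coordinates their composition, enabling independent yet complementary optimization. This design lets DeGauss generalize robustly across a wide range of real scenes without complex heuristics or extensive supervision.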

Link: https://arxiv.org/abs/2503.13176
Authors: Rui Wang,Quentin Lohmeyer,Mirko Meboldt,Siyu Tang
Affiliations: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments.
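
One plausible reading of "a probabilistic mask to coordinate their composition" is a per-pixel convex blend of the foreground and background renders. The sketch below assumes exactly that and nothing more; `compose` is a hypothetical helper, images are small grayscale grids stored as nested lists, and the actual DeGauss formulation may differ.

```python
def compose(foreground, background, mask):
    """Blend two renders per pixel: mask * foreground + (1 - mask) * background.

    All three inputs are equally sized 2D lists; mask values lie in [0, 1]
    and give the assumed probability that a pixel belongs to dynamic content.
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]

fg = [[1.0, 1.0, 1.0]]    # render of the dynamic (foreground) Gaussians
bg = [[0.0, 0.0, 0.0]]    # render of the static (background) Gaussians
mask = [[1.0, 0.5, 0.0]]  # per-pixel probability of "dynamic"
print(compose(fg, bg, mask))
```

Under this assumption, a distractor-free result is obtained by keeping only the background render, since the mask routes dynamic pixels to the foreground branch.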

[CV-40] From Zero to Detail: Deconstructing Ultra-High-Definition Image Restoration from Progressive Spectral Perspective CVPR2025

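[Quick read]: This paper addresses the challenges that high resolution, complex content, and intricate details pose for ultra-high-definition (UHD) image restoration. It analyzes the restoration process from a progressive spectral perspective and decomposes the problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. The key is the proposed ERR framework, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). The ZFE uses global priors to learn a global mapping, the LFR reconstructs coarse-grained content to restore low-frequency information, and the HFR uses frequency-windowed Kolmogorov-Arnold networks (FW-KAN) to refine textures and details. Extensive ablation studies validate each component, and the method significantly outperforms existing UHD restoration methods across multiple tasks.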

Link: https://arxiv.org/abs/2503.13165
Authors: Chen Zhao,Zhizhou Chen,Yunzhe Xu,Enxuan Gu,Jian Li,Zili Yi,Qian Wang,Jian Yang,Ying Tai
Affiliations: Nanjing University; Dalian University of Technology; Tencent Youtu; China Mobile Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Ultra-high-definition (UHD) image restoration faces significant challenges due to its high resolution, complex content, and intricate details. To cope with these challenges, we analyze the restoration process in depth through a progressive spectral perspective, and deconstruct the complex UHD restoration problem into three progressive stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Building on this insight, we propose a novel framework, ERR, which comprises three collaborative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). Specifically, the ZFE integrates global priors to learn global mapping, while the LFR restores low-frequency information, emphasizing reconstruction of coarse-grained content. Finally, the HFR employs our designed frequency-windowed Kolmogorov-Arnold networks (FW-KAN) to refine textures and details, producing high-quality image restoration. Our approach significantly outperforms previous UHD methods across various tasks, with extensive ablation studies validating the effectiveness of each component. The code is available at this https URL.
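
The three spectral stages named in the ERR abstract (zero frequency, low frequency, high frequency) can be made concrete with a small 1D decomposition. This sketch only shows what the three bands are; it uses a naive discrete Fourier transform on a toy signal, the `cutoff` value is an arbitrary assumption, and none of it reflects how the ERR sub-networks are built.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split x into zero-frequency, low-frequency, and high-frequency parts."""
    n = len(x)
    spectrum = dft(x)

    def band(keep):
        # min(k, n - k) is the frequency index regardless of sign.
        return idft([spectrum[k] if keep(min(k, n - k)) else 0 for k in range(n)])

    zero = band(lambda f: f == 0)          # the mean (DC) level
    low = band(lambda f: 0 < f <= cutoff)  # coarse structure
    high = band(lambda f: f > cutoff)      # fine detail
    return zero, low, high

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
zero, low, high = split_bands(signal, cutoff=1)
print([round(z, 3) for z in zero])
```

The zero-frequency part is the constant mean of the signal, and the three parts add back up to the original, which is why they can be treated as successive restoration stages.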

[CV-41] Beyond RGB: Adaptive Parallel Processing for RAW Object Detection

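[Quick read]: This paper addresses the loss of critical information that conventional image signal processing (ISP) can cause when the goal is a computer vision task such as object detection. The key solution is a new module, the Raw Adaptation Module (RAM), which replaces the traditional ISP by applying multiple ISP functions in parallel, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM captures image features more comprehensively, and a dedicated fusion module dynamically integrates and optimizes these representations for the target task. This exploits the full potential of RAW sensor data, enables task-specific pre-processing, and significantly improves object detection under varying lighting conditions and dynamic ranges.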

Link: https://arxiv.org/abs/2503.13163
Authors: Shani Gamrian,Hila Barel,Feiran Li,Masakazu Yoshimura,Daisuke Iso
Affiliations: Sony AI; Sony Group Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.
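
The contrast between sequential and parallel ISP processing can be sketched with toy operations. Everything here is an assumption made for illustration: the three branch functions, the fixed fusion weights (the paper describes a learned fusion module), and the names `parallel_isp`, `gamma`, and `gain` are hypothetical.

```python
def identity(raw):
    """Pass the RAW values through unchanged."""
    return list(raw)

def gamma(raw, exponent=1 / 2.2):
    """Toy gamma correction on values in [0, 1]."""
    return [v ** exponent for v in raw]

def gain(raw, factor=2.0):
    """Toy digital gain, clipped to 1.0."""
    return [min(1.0, factor * v) for v in raw]

def parallel_isp(raw, branches, weights):
    """Apply every branch to the same RAW input, then fuse by weighted sum.

    A sequential pipeline would instead feed each output into the next stage,
    so information discarded by an early stage (for example by clipping)
    would be unavailable to later ones.
    """
    outputs = [branch(raw) for branch in branches]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(raw))]

raw_pixels = [0.0, 0.25, 1.0]
fused = parallel_isp(raw_pixels, [identity, gamma, gain], [0.4, 0.3, 0.3])
print([round(v, 3) for v in fused])
```

Because every branch sees the untouched RAW values, the fusion step can still draw on the unclipped signal even when one branch saturates.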

[CV-42] Language-guided Open-world Video Anomaly Detection

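[Quick read]: This paper addresses the assumption in video anomaly detection (VAD) that the definition of an anomaly is fixed, which does not hold in open-world scenarios. Conventional methods cannot adapt to changing requirements or context; for example, not wearing a mask is abnormal during a flu outbreak but normal otherwise. The paper proposes an open-world VAD paradigm with variable definitions, in which detection is guided at inference time by user-provided natural language. The key is building a robust mapping from a video and a textual definition to an anomaly score, for which it proposes the LaGoVAD model. The model adapts anomaly definitions dynamically through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Because existing datasets usually provide only preset labels without semantic descriptions, the paper also collects the PreVAD dataset, with 35,279 annotated videos carrying multi-level category labels and descriptions, to support training and generalization. Experiments show state-of-the-art zero-shot performance on seven datasets.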

Link: https://arxiv.org/abs/2503.13160
Authors: Zihao Liu,Xiaoyu Wu,Jianqin Wu,Xuxu Wang,Linlin Yang
Affiliations: Communication University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video anomaly detection models aim to detect anomalies that deviate from what is expected. In open-world scenarios, the expected events may change as requirements change. For example, not wearing a mask is considered abnormal during a flu outbreak but normal otherwise. However, existing methods assume that the definition of anomalies is invariable, and thus are not applicable to the open world. To address this, we propose a novel open-world VAD paradigm with variable definitions, allowing guided detection through user-provided natural language at inference time. This paradigm necessitates establishing a robust mapping from video and textual definition to anomaly score. Therefore, we propose LaGoVAD (Language-guided Open-world VAD), a model that dynamically adapts anomaly definitions through two regularization strategies: diversifying the relative durations of anomalies via dynamic video synthesis, and enhancing feature robustness through contrastive learning with negative mining. Training such adaptable models requires diverse anomaly definitions, but existing datasets typically provide given labels without semantic descriptions. To bridge this gap, we collect PreVAD (Pre-training Video Anomaly Dataset), the largest and most diverse video anomaly dataset to date, featuring 35,279 annotated videos with multi-level category labels and descriptions that explicitly define anomalies. Zero-shot experiments on seven datasets demonstrate SOTA performance. Data and code will be released.

[CV-43] DynSTG-Mamba: Dynamic Spatio-Temporal Graph Mamba with Cross-Graph Knowledge Distillation for Gait Disorders Recognition

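[Quick read]: This paper addresses the high memory demands of existing gait disorder recognition methods and their difficulty in capturing complex spatio-temporal dependencies, which limit their efficiency in clinical use. It proposes DynSTG-Mamba (Dynamic Spatio-Temporal Graph Mamba), a framework that combines DF-STGNN and STG-Mamba to improve motion sequence modeling. DF-STGNN introduces a dynamic spatio-temporal filter that adaptively adjusts the spatial connections between skeletal joints and the temporal interactions across movement phases; by accounting for the hierarchical structure and dynamics of skeletal gait data, it improves feature propagation through dynamic graph structures. STG-Mamba, an extension of Mamba adapted to skeletal motion data, ensures continuous propagation of states, which helps capture long-term dependencies while reducing computational complexity. To reduce parameters and computational cost while maintaining consistency, the paper also proposes Cross-Graph Relational Knowledge Distillation, which aligns relational information between a teacher model (large architecture) and a student model (small architecture) using shared memory, so that joint interactions and movement patterns are preserved in the motion sequences. On the KOA-NM, PD-WALK, and ATAXIA datasets, DynSTG-Mamba outperforms state-of-the-art methods in accuracy, F1-score, and recall, offering a lightweight yet accurate solution for automated gait analysis and movement disorder assessment.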

Link: https://arxiv.org/abs/2503.13156
Authors: Zakariae Zrimek,Youssef Mourchid,Mohammed El Hassouni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gait disorder recognition plays a crucial role in the early diagnosis and monitoring of movement disorders. Existing approaches, including spatio-temporal graph convolutional networks (ST-GCNs), often face high memory demands and struggle to capture complex spatio-temporal dependencies, limiting their efficiency in clinical applications. To address these challenges, we introduce DynSTG-Mamba (Dynamic Spatio-Temporal Graph Mamba), a novel framework that combines DF-STGNN and STG-Mamba to enhance motion sequence modeling. The DF-STGNN incorporates a dynamic spatio-temporal filter that adaptively adjusts spatial connections between skeletal joints and temporal interactions across different movement phases. This approach ensures better feature propagation through dynamic graph structures by considering the hierarchical nature and dynamics of skeletal gait data. Meanwhile, STG-Mamba, an extension of Mamba adapted for skeletal motion data, ensures a continuous propagation of states, facilitating the capture of long-term dependencies while reducing computational complexity. To reduce the number of model parameters and computational costs while maintaining consistency, we propose Cross-Graph Relational Knowledge Distillation, a novel knowledge transfer mechanism that aligns relational information between teacher (large architecture) and student models (small architecture) while using shared memory. This ensures that the interactions and movement patterns of the joints are accurately preserved in the motion sequences. We validate our DynSTG-Mamba on KOA-NM, PD-WALK, and ATAXIA datasets, where it outperforms state-of-the-art approaches in terms of Accuracy, F1-score, and Recall. Our results highlight the efficiency and robustness of our approach, offering a lightweight yet highly accurate solution for automated gait analysis and movement disorder assessment.

[CV-44] Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing CVPR2025

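[Quick read]: This paper addresses real-world image dehazing and proposes a novel Iterative Predictor-Critic Code Decoding framework (IPC-Dehaze). The method uses the high-quality codebook prior encapsulated in a pre-trained VQGAN, and in each iteration the high-quality codes obtained in the previous iteration guide the Code-Predictor's prediction in the next, which improves code prediction accuracy and stabilizes dehazing performance. The key contribution is the Code-Critic, which captures the interrelations among codes: it evaluates code correlations and resamples the set of codes with the highest mask scores, helping retain accurate codes and re-predict difficult ones. This overcomes the difficulty of deciding which codes to keep or replace at each iteration. Experiments show that the method outperforms existing techniques on real-world dehazing.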

Link: https://arxiv.org/abs/2503.13147
Authors: Jiayi Fu,Siyu Liu,Zikun Liu,Chun-Le Guo,Hyunhee Park,Ruiqi Wu,Guoqing Wang,Chongyi Li
Affiliations: VCIP, CS, Nankai University; NKIARI, Shenzhen Futian; Samsung R&D Institute China-Beijing; CIG, Samsung Electronics; Donghai Laboratory, Zhoushan, Zhejiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:We propose a novel Iterative Predictor-Critic Code Decoding framework for real-world image dehazing, abbreviated as IPC-Dehaze, which leverages the high-quality codebook prior encapsulated in a pre-trained VQGAN. Unlike previous codebook-based methods that rely on one-shot decoding, our method utilizes high-quality codes obtained in the previous iteration to guide the prediction of the Code-Predictor in the subsequent iteration, improving code prediction accuracy and ensuring stable dehazing performance. Our idea stems from the observations that 1) the degradation of hazy images varies with haze density and scene depth, and 2) clear regions provide crucial cues for restoring dense haze regions. However, it is non-trivial to progressively refine the obtained codes in subsequent iterations, owing to the difficulty in determining which codes should be retained or replaced at each iteration. Another key insight of our study is to propose Code-Critic to capture interrelations among codes. The Code-Critic is used to evaluate code correlations and then resample a set of codes with the highest mask scores, i.e., a higher score indicates that the code is more likely to be rejected, which helps retain more accurate codes and predict difficult ones. Extensive experiments demonstrate the superiority of our method over state-of-the-art methods in real-world dehazing.
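
The predict, criticize, and re-mask loop described in the IPC-Dehaze abstract can be sketched as follows. This is a toy under stated assumptions, not the paper's algorithm: the linear re-masking schedule, the `toy_predictor` (which only succeeds at odd positions once a neighbour is known), and the oracle `toy_critic` (which peeks at the hypothetical `TARGET` codes) are all invented for the example.

```python
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical ground-truth code indices

def toy_predictor(codes):
    """Stand-in predictor: odd positions are guessed correctly only with context."""
    n = len(codes)
    out = []
    for i in range(n):
        has_context = any(0 <= j < n and codes[j] is not None
                          for j in (i - 1, i + 1))
        out.append(TARGET[i] if (i % 2 == 0 or has_context) else 0)
    return out

def toy_critic(codes):
    """Stand-in critic: a higher score means the code is more likely rejected."""
    return [0.0 if c == t else 1.0 for c, t in zip(codes, TARGET)]

def iterative_decode(n, predictor, critic, num_iters):
    codes = [None] * n
    masked = set(range(n))  # every position starts unknown
    for it in range(num_iters):
        preds = predictor(codes)
        for i in masked:    # fill only the positions still masked
            codes[i] = preds[i]
        if it == num_iters - 1:
            break
        scores = critic(codes)
        # Assumed linear schedule: re-mask fewer positions each iteration.
        n_mask = n * (num_iters - 1 - it) // num_iters
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        masked = set(ranked[:n_mask])
        for i in masked:
            codes[i] = None
    return codes

print(iterative_decode(len(TARGET), toy_predictor, toy_critic, num_iters=3))
```

In the toy run, the first pass gets the odd positions wrong, the critic flags them, and the second pass recovers them from the retained neighbours, which mirrors the observation that clear regions provide cues for restoring dense haze regions.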

[CV-45] Enhancing zero-shot learning in medical imaging: integrating clip with advanced techniques for improved chest x-ray analysis

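[Quick read]: This paper addresses the limited performance of existing deep learning models in medical imaging, where data volumes are large and annotation is difficult, with a focus on class-imbalanced and unlabeled datasets. The key innovation is combining Contrastive Language-Image Pre-training (CLIP) with Momentum Contrast (MoCo) in a new model, MoCoCLIP. By using unsupervised or weakly supervised learning, the method improves the detection of pulmonary pathologies in chest X-rays (CXRs) without requiring large amounts of labeled data. Experiments show that MoCoCLIP's zero-shot performance exceeds that of the state-of-the-art CheXZero model on both the NIH ChestXray14 and CheXpert datasets, with notably better generalization to unseen data.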

Link: https://arxiv.org/abs/2503.13134
Authors: Prakhar Bhardwaj,Sheethal Bhat,Andreas Maier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Due to the large volume of medical imaging data, advanced AI methodologies are needed to assist radiologists in diagnosing thoracic diseases from chest X-rays (CXRs). Existing deep learning models often require large, labeled datasets, which are scarce in medical imaging due to the time-consuming and expert-driven annotation process. In this paper, we extend the existing approach to enhance zero-shot learning in medical imaging by integrating Contrastive Language-Image Pre-training (CLIP) with Momentum Contrast (MoCo), resulting in our proposed model, MoCoCLIP. Our method addresses challenges posed by class-imbalanced and unlabeled datasets, enabling improved detection of pulmonary pathologies. Experimental results on the NIH ChestXray14 dataset demonstrate that MoCoCLIP outperforms the state-of-the-art CheXZero model, achieving relative improvement of approximately 6.5%. Furthermore, on the CheXpert dataset, MoCoCLIP demonstrates superior zero-shot performance, achieving an average AUC of 0.750 compared to CheXZero with 0.746 AUC, highlighting its enhanced generalization capabilities on unseen data.
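
Two generic ingredients behind a CLIP plus MoCo combination can be sketched briefly: the momentum update of a key encoder and zero-shot labeling by cosine similarity between an image embedding and text-prompt embeddings. The parameter vectors, embeddings, and prompt names below are made-up toy values, and the sketch says nothing about how MoCoCLIP itself wires these pieces together.

```python
import math

def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: key <- m * key + (1 - m) * query, per parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(image_embedding, text_embeddings):
    """Pick the prompt whose embedding is most similar to the image embedding."""
    return max(text_embeddings,
               key=lambda name: cosine(image_embedding, text_embeddings[name]))

# Toy values only; real embeddings would come from the image and text encoders.
prompts = {"pneumonia": [1.0, 0.0], "no finding": [0.0, 1.0]}
print(zero_shot_label([0.9, 0.2], prompts))
print(momentum_update([1.0, 1.0], [0.0, 2.0], m=0.9))
```

The momentum update keeps the key encoder a slowly moving average of the query encoder, and the zero-shot step needs no labeled training images because the class is chosen by comparing against text prompts.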

[CV-46] Patient-specific radiomic feature selection with reconstructed healthy persona of knee MR images

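[Quick read]: This paper addresses the gap in clinical task performance between low-dimensional models built on classical radiomic features and end-to-end deep learning (DL) models, while preserving the interpretability of the former. The key is a method that learns to select personalized radiomic features for each patient from a pool of candidates, which substantially improves the performance of a standard logistic regression model. The paper also expands the feature pool by generating a patient-specific healthy "persona" using a denoising diffusion model trained on healthy subjects, yielding a pathology-free baseline feature set that supports new feature discovery and better condition classification. Across multiple clinical tasks the method matches or exceeds state-of-the-art DL approaches while offering greater interpretability.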

Link: https://arxiv.org/abs/2503.13131
Authors: Yaxi Chen,Simin Ni,Aleksandra Ivanova,Shaheer U. Saeed,Rikin Hargunani,Jie Huang,Chaozong Liu,Yipeng Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Classical radiomic features have been designed to describe image appearance and intensity patterns. These features are directly interpretable and readily understood by radiologists. Compared with end-to-end deep learning (DL) models, lower dimensional parametric models that use such radiomic features offer enhanced interpretability but lower comparative performance in clinical tasks. In this study, we propose an approach where a standard logistic regression model performance is substantially improved by learning to select radiomic features for individual patients, from a pool of candidate features. This approach has the potential to maintain the interpretability of such approaches while offering comparable performance to DL. We also propose to expand the feature pool by generating a patient-specific healthy persona via mask-inpainting using a denoising diffusion model trained on healthy subjects. Such a pathology-free baseline feature set allows further opportunity in novel feature discovery and improved condition classification. We demonstrate our method on multiple clinical tasks of classifying general abnormalities, anterior cruciate ligament tears, and meniscus tears. Experimental results demonstrate that our approach achieved comparable or even superior performance to state-of-the-art DL approaches while offering added interpretability by using radiomic features extracted from images and supplemented by generating healthy personas. Example clinical cases are discussed in depth to demonstrate the interpretability-enabled utilities such as human-explainable feature discovery and patient-specific location/view selection. These findings highlight the potential of combining subject-specific feature selection with generative models in augmenting radiomic analysis for more interpretable decision-making. The codes are available at: this https URL
zh

[CV-47] ChainHOI: Joint-based Kinematic Chain Modeling for Human-Object Interaction Generation CVPR2025

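[Quick read]: This paper addresses text-driven human-object interaction (HOI) generation. Existing methods usually model interactions implicitly through full-body poses and overlook explicit modeling at the joint level. The key to the solution is ChainHOI, which introduces a joint graph to explicitly capture potential interactions between joints and objects, and uses a Generative Spatiotemporal Graph Convolution Network to model interactions explicitly at the joint level. A Kinematics-based Interaction Module additionally models interactions explicitly at the kinetic chain level, so that the generated motions are more realistic and biomechanically coherent. The method significantly improves the realism and semantic consistency of generated HOIs.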

Link: https://arxiv.org/abs/2503.13130
Authors: Ling-An Zeng, Guohong Huang, Yi-Lin Wei, Shengbo Gu, Yu-Ming Tang, Jingke Meng, Wei-Shi Zheng
Affiliations: Sun Yat-sen University, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:We propose ChainHOI, a novel approach for text-driven human-object interaction (HOI) generation that explicitly models interactions at both the joint and kinetic chain levels. Unlike existing methods that implicitly model interactions using full-body poses as tokens, we argue that explicitly modeling joint-level interactions is more natural and effective for generating realistic HOIs, as it directly captures the geometric and semantic relationships between joints, rather than modeling interactions in the latent pose space. To this end, ChainHOI introduces a novel joint graph to capture potential interactions with objects, and a Generative Spatiotemporal Graph Convolution Network to explicitly model interactions at the joint level. Furthermore, we propose a Kinematics-based Interaction Module that explicitly models interactions at the kinetic chain level, ensuring more realistic and biomechanically coherent motions. Evaluations on two public datasets demonstrate that ChainHOI significantly outperforms previous methods, generating more realistic and semantically consistent HOIs. Code is available at this https URL.

[CV-48] Non-Destructive Detection of Sub-Micron Imperceptible Scratches On Laser Chips Based On Consistent Texture Entropy Recursive Optimization Semi-Supervised Network

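[Quick read]: This paper addresses the detection of sub-micron level scratches on the emitting surfaces of semiconductor laser chips. Such scratches are almost invisible to the naked eye, yet they significantly degrade device performance and lifespan, which hinders production efficiency and yield. Conventional methods struggle because labeled datasets are lacking and these subtle defects are hard to detect.

The key to the solution is TexRecNet, a consistent texture entropy recursive optimization semi-supervised network. The network uses a recursive optimization architecture that iteratively improves the detection accuracy of imperceptible scratch edges, with the output of the previous cycle guiding the subsequent input and the positional encoding. It also introduces image texture entropy, which expands the training set with a large amount of unlabeled data while keeping the training signal reliable. Finally, by analyzing the inconsistency of the output sequences produced during recursion, the paper proposes a semi-supervised training strategy with recursive consistency constraints, which uses the recursive outputs for non-destructive signal augmentation and continuously optimizes the loss function for efficient end-to-end training. Experiments show that, using a large amount of unsupervised data, the method achieves 75.6% accuracy and 74.8% recall in detecting imperceptible scratches, improvements of 8.5% and 33.6% respectively over a conventional Unet, which markedly strengthens quality control for laser chips.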

Link: https://arxiv.org/abs/2503.13125
Authors: Pan Liu
Affiliations: School of Automation, Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Laser chips, the core components of semiconductor lasers, are extensively utilized in various industries, showing great potential for future application. Smooth emitting surfaces are crucial in chip production, as even imperceptible scratches can significantly degrade performance and lifespan, thus impeding production efficiency and yield. Therefore, non-destructively detecting these imperceptible scratches on the emitting surfaces is essential for enhancing yield and reducing costs. These sub-micron level scratches, barely visible against the background, are extremely difficult to detect with conventional methods, a difficulty compounded by a lack of labeled datasets. To address this challenge, this paper introduces TexRecNet, a consistent texture entropy recursive optimization semi-supervised network. The network, based on a recursive optimization architecture, iteratively improves the detection accuracy of imperceptible scratch edges, using outputs from previous cycles to inform subsequent inputs and guide the network's positional encoding. It also introduces image texture entropy, utilizing a substantial amount of unlabeled data to expand the training set while maintaining training signal reliability. Finally, by analyzing the inconsistency of the network output sequences obtained during the recursive process, a semi-supervised training strategy with recursive consistency constraints is proposed, which uses outputs from the recursive process for non-destructive signal augmentation and consistently optimizes the loss function for efficient end-to-end training. Experimental results show that this method, utilizing a substantial amount of unsupervised data, achieves 75.6% accuracy and 74.8% recall in detecting imperceptible scratches, improvements of 8.5% and 33.6% respectively over a conventional Unet, enhancing quality control in laser chips.

[CV-49] 3D Human Interaction Generation: A Survey

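[Quick read]: This paper systematically surveys research progress in 3D human interaction generation and discusses the key challenge of generating natural human motion together with accurate interaction between humans and interactive entities. It gives a comprehensive introduction to the foundational technologies behind the field (3D model representation methods, motion capture technologies, and generative AI models), and reviews the approaches proposed for the three sub-tasks of human-scene interaction, human-object interaction, and human-human interaction, along with their public datasets and evaluation metrics. The authors hope this work will provide direction for future research and inspire further innovative exploration.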

Link: https://arxiv.org/abs/2503.13120
Authors: Siyuan Fan, Wenke Huang, Xiantao Cai, Bo Du
Affiliations: School of Computer Science, Wuhan University, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D human interaction generation has emerged as a key research area, focusing on producing dynamic and contextually relevant interactions between humans and various interactive entities. Recent rapid advancements in 3D model representation methods, motion capture technologies, and generative models have laid a solid foundation for the growing interest in this domain. Existing research in this field can be broadly categorized into three areas: human-scene interaction, human-object interaction, and human-human interaction. Despite the rapid advancements in this area, challenges remain due to the need for naturalness in human motion generation and the accurate interaction between humans and interactive entities. In this survey, we present a comprehensive literature review of human interaction generation, which, to the best of our knowledge, is the first of its kind. We begin by introducing the foundational technologies, including model representations, motion capture methods, and generative models. Subsequently, we introduce the approaches proposed for the three sub-tasks, along with their corresponding datasets and evaluation metrics. Finally, we discuss potential future research directions in this area and conclude the survey. Through this survey, we aim to offer a comprehensive overview of the current advancements in the field, highlight key challenges, and inspire future research works.

[CV-50] DTGBrepGen: A Novel B-rep Generative Model through Decoupling Topology and Geometry

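[Quick read]: This paper addresses the challenge of automatically creating valid, high-quality Boundary Representation (B-rep) models, a challenge that stems from the complex interdependence between model topology and geometry. Existing methods tend to focus on geometric representation and neglect topological constraints, which makes it hard to maintain structural validity and geometric accuracy. The key to the solution is DTGBrepGen, a new framework that decouples topology from geometry and handles both explicitly. It first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. It then uses Transformer-based diffusion models for sequential geometry generation, progressively producing vertex coordinates, edge geometries, and face geometries represented as B-splines. Experiments show that the method significantly outperforms existing methods in both topological validity and geometric accuracy.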

Link: https://arxiv.org/abs/2503.13110
Authors: Jing Li, Yihang Fu, Falai Chen
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Boundary representation (B-rep) of geometric models is a fundamental format in Computer-Aided Design (CAD). However, automatically generating valid and high-quality B-rep models remains challenging due to the complex interdependence between the topology and geometry of the models. Existing methods tend to prioritize geometric representation while giving insufficient attention to topological constraints, making it difficult to maintain structural validity and geometric accuracy. In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. Our approach first generates valid topological structures through a two-stage process that independently models edge-face and edge-vertex adjacency relationships. Subsequently, we employ Transformer-based diffusion models for sequential geometry generation, progressively generating vertex coordinates, followed by edge geometries and face geometries which are represented as B-splines. Extensive experiments on diverse CAD datasets show that DTGBrepGen significantly outperforms existing methods in both topological validity and geometric accuracy, achieving higher validity rates and producing more diverse and realistic B-reps. Our code is publicly available at this https URL.

[CV-51] Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

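[Quick read]: This paper addresses two gaps for Multimodal Large Language Models (MLLMs): the unclear mechanism by which they process visual information, and the lack of efficient inference methods. By uncovering the dominant flow of visual information in the model, the paper finds that in shallow layers strong interactions between image tokens and instruction tokens inject most visual information into the instruction tokens to form cross-modal semantic representations, whereas in deeper layers image tokens mainly interact with each other to refine semantic representations within the visual modality. Based on these insights, the paper proposes Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational cost by approximately 65% without sacrificing performance. The key to the solution is a dynamic pruning strategy, designed around how visual information flows at different depths, that removes unnecessary computation.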

Link: https://arxiv.org/abs/2503.13108
Authors: Hao Yin, Guangzong Si, Zilei Wang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual information remains unclear. In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within visual modality. Based on these insights, we propose Hierarchical Modality-Aware Pruning (HiMAP), a plug-and-play inference acceleration method that dynamically prunes image tokens at specific layers, reducing computational costs by approximately 65% without sacrificing performance. Our findings offer a new understanding of visual information processing in MLLMs and provide a state-of-the-art solution for efficient inference.
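
To make the idea of pruning image tokens at a given layer concrete, here is a minimal Python sketch. It is not the authors' HiMAP implementation: the abstract does not say how tokens are ranked, so the per-token `scores`, the `keep_ratio` parameter, and the function name `prune_image_tokens` are all assumptions made for illustration. The sketch keeps every non-image token and only the highest-scoring fraction of image tokens, preserving sequence order.

```python
def prune_image_tokens(is_image, scores, keep_ratio):
    """Return the indices of tokens kept after pruning (hypothetical helper).

    is_image   -- list of bools, True where the token is an image token
    scores     -- list of floats, an assumed importance score per token
    keep_ratio -- fraction of image tokens to keep, in (0, 1]
    """
    image_idx = [i for i, flag in enumerate(is_image) if flag]
    # Keep at least one image token whenever any are present.
    n_keep = max(1, round(len(image_idx) * keep_ratio)) if image_idx else 0
    # Rank image tokens by the assumed score, highest first.
    top = sorted(image_idx, key=lambda i: scores[i], reverse=True)[:n_keep]
    keep = set(top)
    # Non-image tokens always survive; original order is preserved.
    return [i for i, flag in enumerate(is_image) if not flag or i in keep]


kept = prune_image_tokens(
    is_image=[False, True, True, True, True, False],
    scores=[0.0, 0.1, 0.9, 0.5, 0.2, 0.0],
    keep_ratio=0.5,
)
print(kept)
```

Because later layers then process a shorter sequence, the saving grows with the number of image tokens dropped and the number of layers that follow the pruning point.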

[CV-52] ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models

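[Quick read]: This paper addresses the tendency of Multimodal Large Language Models (MLLMs) to produce object hallucinations in generated content. Conventional contrastive decoding strategies mitigate the problem by reducing over-reliance on language priors, but they have two main limitations: bluntly suppressing language priors can harm the coherence and accuracy of the output, and processing contrastive inputs adds computational load that significantly slows inference. To address these challenges, the paper proposes Visual Amplification Fusion (VAF), a plug-and-play technique whose key idea is to strengthen attention to visual signals in the model's middle layers, where modality fusion mainly occurs. This captures visual features more effectively and reduces the model's bias toward the language modality. Experiments show that VAF significantly reduces hallucinations across a range of MLLMs without affecting inference speed, while maintaining the coherence and accuracy of generated content.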

Link: https://arxiv.org/abs/2503.13107
Authors: Hao Yin, Guangzong Si, Zilei Wang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model’s middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model’s bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.

[CV-53] Multi-Platform Teach-and-Repeat Navigation by Visual Place Recognition Based on Deep-Learned Local Features

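[Quick read]: This paper addresses the challenge of stable visual localization and mapping for mobile robot navigation in uniform and changing environments. The key to the proposed solution is an appearance-based teach-and-repeat navigation method that achieves simplified localization and reactive robot motion control without a standard map. The core contributions are a new visual place recognition technique, a novel horizontal shift computation approach, and a multi-platform system design suited to different types of mobile robots. The paper also introduces a new public dataset for experimental testing of appearance-based navigation methods, validates the system in real-world experiments, and compares it with other state-of-the-art methods. The results show that the new system outperforms existing methods in several test scenarios, operates both indoors and outdoors, and is robust to day and night scene variations.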

Link: https://arxiv.org/abs/2503.13090
Authors: Václav Truhlařík, Tomáš Pivoňka, Michal Kasarda, Libor Přeučil
Affiliations: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures

Abstract:Uniform and variable environments remain a challenge for stable visual localization and mapping in mobile robot navigation. One approach suitable for such environments is appearance-based teach-and-repeat navigation, which relies on simplified localization and reactive robot motion control, all without the need for standard mapping. This work brings an innovative solution to such a system based on visual place recognition techniques. The major contributions are the use of a new visual place recognition technique, a novel horizontal shift computation approach, and a multi-platform system design for applications across various types of mobile robots. In addition, a new public dataset for experimental testing of appearance-based navigation methods is introduced. Moreover, the work provides real-world experimental testing and a performance comparison of the introduced navigation system against other state-of-the-art methods. The results confirm that the new system outperforms existing methods in several testing scenarios, is capable of operation indoors and outdoors, and exhibits robustness to day and night scene variations.

[CV-54] Gaussian On-the-Fly Splatting: A Progressive Framework for Robust Near Real-Time 3DGS Optimization

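[Quick read]: This paper addresses the reliance of existing 3D Gaussian Splatting (3DGS) methods on offline training after a full Structure-from-Motion (SfM) pipeline, with the aim of enabling near real-time 3DGS optimization. The key to the solution is On-the-Fly GS, a progressive framework that updates image poses and sparse points in real time during image capture and immediately integrates the newly optimized Gaussians into the 3DGS field. A local optimization strategy based on overlapping relationships, together with an adaptive learning rate schedule, stabilizes training across new and old images. An efficient global optimization scheme prevents overfitting to newly added images, so that training time drops significantly while overall quality is maintained.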

Link: https://arxiv.org/abs/2503.13086
Authors: Yiwei Xu, Yifei Yu, Wentian Gan, Tengfei Wang, Zongqian Zhan, Hao Cheng, Xin Wang
Affiliations: School of Geodesy and Geomatics, Wuhan University, China PR; Department of Earth Observation Science at ITC Faculty Geo-Information Science and Earth Observation, University of Twente, the Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) achieves high-fidelity rendering with fast real-time performance, but existing methods rely on offline training after full Structure-from-Motion (SfM) processing. In contrast, this work introduces On-the-Fly GS, a progressive framework enabling near real-time 3DGS optimization during image capture. As each image arrives, its pose and sparse points are updated via on-the-fly SfM, and newly optimized Gaussians are immediately integrated into the 3DGS field. We propose a progressive local optimization strategy to prioritize new images and their neighbors by their corresponding overlapping relationship, allowing the new image and its overlapping images to get more training. To further stabilize training across old and new images, an adaptive learning rate schedule balances the iterations and the learning rate. Moreover, to maintain overall quality of the 3DGS field, an efficient global optimization scheme prevents overfitting to the newly added images. Experiments on multiple benchmark datasets show that our On-the-Fly GS reduces training time significantly, optimizing each new image in seconds with minimal rendering loss, offering the first practical step toward rapid, progressive 3DGS reconstruction.

[CV-55] Free-form language-based robotic reasoning and grasping

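[Quick read]: This paper addresses the challenging task of robotic grasping from a cluttered bin based on human instructions, which requires understanding both the nuances of free-form language and the spatial relationships between objects. It explores whether large-scale pre-trained Vision-Language Models (VLMs) such as GPT-4o can be used for this task in a zero-shot setting, and analyzes their limitations. The key to the solution is FreeGrasp, a new method that uses the world knowledge of pre-trained VLMs to reason about human instructions and the spatial arrangement of objects. FreeGrasp detects all objects as keypoints and annotates these keypoints as marks on the image to support GPT-4o's zero-shot spatial reasoning, which determines whether the requested object can be grasped directly or whether other objects must be grasped and removed first. Because no existing dataset is designed for this task, the authors build a synthetic dataset, FreeGraspData, by extending MetaGraspNetV2 with human-annotated instructions and ground-truth grasping sequences. Experiments show state-of-the-art performance in grasp reasoning and execution.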

Link: https://arxiv.org/abs/2503.13082
Authors: Runyu Jiao, Alice Fasoli, Francesco Giuliari, Matteo Bortolon, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi
Affiliations: Fondazione Bruno Kessler; University of Trento; Istituto Italiano di Tecnologia
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

Abstract:Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o’s zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: this https URL.

[CV-56] Vision-based automatic fruit counting with UAV

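[Quick read]: This paper addresses automatic fruit counting with Unmanned Aerial Vehicles (UAVs) for smart agriculture. The key to the solution is a system built on a vision algorithm that processes streams from an RGB camera and a depth sensor and detects fruit using classical image operations. The system also supports planning and executing flight trajectories while minimizing flight time and distance covered. In simulation the solution obtained an average score of 87.27/100, and in the UAV Competition organized as part of the ICUAS 2024 conference it scored 84.83/100, placing 6th and advancing to the finals.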

Link: https://arxiv.org/abs/2503.13080
Authors: Hubert Szolc, Mateusz Wasala, Remigiusz Mietla, Kacper Iwicki, Tomasz Kryjak
Affiliations: AGH University of Science and Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted for the 29th Conference on Automation - Innovations and Future Perspectives Automation 2025, May 7 - 9, 2025, Warsaw, Poland

Abstract:The use of unmanned aerial vehicles (UAVs) for smart agriculture is becoming increasingly popular. This is evidenced by recent scientific works, as well as the various competitions organised on this topic. Therefore, in this work we present a system for automatic fruit counting using UAVs. To detect them, our solution uses a vision algorithm that processes streams from an RGB camera and a depth sensor using classical image operations. Our system also allows the planning and execution of flight trajectories, taking into account the minimisation of flight time and distance covered. We tested the proposed solution in simulation and obtained an average score of 87.27/100 points from a total of 500 missions. We also submitted it to the UAV Competition organised as part of the ICUAS 2024 conference, where we achieved an average score of 84.83/100 points, placing 6th in a field of 23 teams and advancing to the finals.

[CV-57] Rethinking Image Evaluation in Super-Resolution

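[Quick read]: This paper addresses an inconsistency in the evaluation of image Super-Resolution (SR): recent SR techniques keep improving perceptual quality, yet they can still fail in quantitative evaluations, which has led to growing distrust of existing SR metrics. The paper argues that part of the problem lies in the uneven quality of the reference ground truth (GT) images in existing SR datasets. It makes two main contributions. First, by systematically analyzing seven state-of-the-art SR models on three real-world SR datasets, it shows that low-quality GTs consistently affect performance across models, and that models behave quite differently once GT quality is controlled. Second, it proposes a new perceptual quality metric, the Relative Quality Index (RQI), which measures the relative quality discrepancy of image pairs and thereby mitigates the biased evaluations caused by unreliable GTs. The key is the introduction of RQI, which yields evaluations that are fairer and more consistent with human opinion.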

Link: https://arxiv.org/abs/2503.13074
Authors: Shaolin Su, Josep M. Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral
Affiliations: Computer Vision Center; Universitat Autonoma de Barcelona; INSAIT, Sofia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While recent image super-resolution (SR) techniques are continually improving the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency leads to a growing distrust in existing image metrics for SR evaluations. Though image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not inspect the role of GTs, as they are generally accepted as 'perfect' references. However, because the data were collected years ago and other types of distortion were not controlled, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, in this paper, we are interested in the following questions: Are GT images in existing SR datasets 100% trustworthy for model evaluations? How does GT quality affect this evaluation? And how can fair evaluations be made if imperfect GTs exist? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that SR performances can be consistently affected across models by low-quality GTs, and models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, Relative Quality Index (RQI), that measures the relative quality discrepancy of image pairs, thus mitigating the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.

[CV-58] DehazeMamba: SAR-guided Optical Remote Sensing Image Dehazing with Adaptive State Space Model

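[Quick read]: This paper addresses optical remote sensing image dehazing, which is hard for traditional single-image dehazing methods because of the large spatial scale and highly non-uniform haze distribution. It also identifies two key limitations of current Synthetic Aperture Radar (SAR)-guided dehazing methods: introducing SAR information often degrades the quality of haze-free regions, and unstable feature quality further aggravates cross-modal domain shift. To overcome these challenges, the paper proposes DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Its key innovations are a Haze Perception and Decoupling Module (HPDM), which dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM), which mitigates domain shift through a two-stage fusion process based on feature quality assessment. The paper also builds a large-scale benchmark dataset, MRSHaze, with 8,000 pairs of temporally synchronized, precisely geo-registered high-resolution SAR-optical images covering diverse haze conditions. Experiments show that DehazeMamba improves peak signal-to-noise ratio (PSNR) by 0.73 dB over existing state-of-the-art methods and brings substantial gains in downstream tasks such as semantic segmentation. The dataset is available via the provided link.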

Link: https://arxiv.org/abs/2503.13073
Authors: Zhicheng Zhao, Jinquan Yan, Chenglong Li, Xiao Wang, Jin Tang
Affiliations: Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation; Information Materials and Intelligent Sensing Laboratory of Anhui Province
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non-uniform haze distribution, which traditional single-image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze-free reference information for large-scale scenes, existing SAR-guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze-free regions, and the instability of feature quality further exacerbates cross-modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR-guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze-affected regions through optical-SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two-stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large-scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo-registered SAR-optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state-of-the-art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at this https URL.
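
For readers who want to interpret the reported 0.73 dB PSNR gain, the short Python sketch below applies the standard PSNR definition, PSNR = 10 * log10(MAX^2 / MSE). It is only an illustration of the metric, not code from the paper; the function names and the 8-bit peak value of 255 are assumptions.

```python
import math


def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.73 dB gain corresponds to multiplying the MSE by this factor.
ratio = mse_ratio_for_gain(0.73)
print(round(ratio, 3))
```

Under this definition, a 0.73 dB improvement means the mean squared error falls to roughly 0.845 of its previous value, i.e. about 15% lower, regardless of the peak value chosen.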

[CV-59] Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation

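[Quick read]: This paper addresses a central challenge in Artificial Intelligence-Generated Content (AIGC): aligning generated images with complicated text prompts and human preferences. With the rise of reward-enhanced diffusion distillation, the authors observe that as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation, while the diffusion losses act more like an overly expensive form of regularization. The paper therefore proposes R0, a new conditional generation approach that replaces conventional diffusion distillation losses with regularized reward maximization and treats image generation as an optimization problem in data space that searches for valid images with high compositional rewards. The key is the combination of an innovative generator parameterization with proper regularization techniques, which yields state-of-the-art few-step text-to-image generative models. The results challenge conventional views of diffusion post-training and conditional generation, and show that rewards play a dominant role under complex conditions.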

Link: https://arxiv.org/abs/2503.13070
Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, Jing Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Aligning generated images to complicated text prompts and human preferences is a central challenge in Artificial Intelligence-Generated Content (AIGC). With reward-enhanced diffusion distillation emerging as a promising approach that boosts controllability and fidelity of text-to-image models, we identify a fundamental paradigm shift: as conditions become more specific and reward signals stronger, the rewards themselves become the dominant force in generation. In contrast, the diffusion losses serve as an overly expensive form of regularization. To thoroughly validate our hypothesis, we introduce R0, a novel conditional generation approach via regularized reward maximization. Instead of relying on tricky diffusion distillation losses, R0 proposes a new perspective that treats image generations as an optimization problem in data space which aims to search for valid images that have high compositional rewards. By innovative designs of the generator parameterization and proper regularization techniques, we train state-of-the-art few-step text-to-image generative models with R0 at scales. Our results challenge the conventional wisdom of diffusion post-training and conditional generation by demonstrating that rewards play a dominant role in scenarios with complex conditions. We hope our findings can contribute to further research into human-centric and reward-centric generation paradigms across the broader field of AIGC. Code is available at this https URL.

[CV-60] Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation

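[Quick read]: This paper asks how to design an audio-visual model with general capability that handles diverse tasks in a unified way (temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding), rather than being specialized for a single task as existing models are. Simply training all tasks jointly leads to interference because of the heterogeneity of audio-visual data and the complex relationships among tasks. The key to the solution is explicit cooperation among tasks from both the data and the model perspectives. On the data side, the authors carefully refine existing datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationships among tasks. On the model side, they design an interaction-aware LoRA structure with multiple LoRA heads to support concrete cooperation during learning. The resulting method surpasses existing unified audio-visual models on multiple tasks and outperforms most specialized models on certain tasks. Visualizations also indicate that each LoRA head has some audio-visual understanding ability.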

Link: https://arxiv.org/abs/2503.13068
Authors: Henghui Du, Guangyao Li, Chang Zhou, Chunjie Zhang, Alan Zhao, Di Hu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Department of Computer Science and Technology, Tsinghua University; AI Technology Center, Online Video Business Unit, Tencent PCG
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, numerous tasks have been proposed to encourage models to develop specific capabilities in understanding audio-visual scenes, primarily categorized into temporal localization, spatial localization, spatio-temporal reasoning, and pixel-level understanding. In contrast, humans possess a unified understanding ability across diverse tasks. Therefore, designing an audio-visual model with general capability to unify these tasks is of great value. However, simply training all tasks jointly can lead to interference due to the heterogeneity of audio-visual data and the complex relationships among tasks. We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method that achieves explicit inter-task cooperation from both the data and the model perspectives. Specifically, considering that the labels of existing datasets are simple words, we carefully refine these datasets and construct an Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning process (AV-UIE), which clarifies the cooperative relationship among tasks. Subsequently, to facilitate concrete cooperation in the learning stage, an interaction-aware LoRA structure with multiple LoRA heads is designed to learn different aspects of audio-visual data interaction. By unifying explicit cooperation across the data and model aspects, our method not only surpasses existing unified audio-visual models on multiple tasks, but also outperforms most specialized models on certain tasks. Furthermore, we visualize the process of explicit cooperation and, surprisingly, find that each LoRA head has a certain audio-visual understanding ability. Code and dataset: this https URL

[CV-61] Federated Learning with Domain Shift Eraser CVPR2025

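[Quick read]: This paper addresses domain shift in Federated Learning (FL), which arises when clients' data come from different domains; it harms model performance and prevents the model from learning a consistent representation space. To tackle this, the paper proposes a new framework, Federated Domain Shift Eraser (FDSE). Its key idea is to decompose each layer of the neural network into a Domain-agnostic Feature Extractor (DFE) and a Domain-specific Skew Eraser (DSE), which alternately extract features and correct domain skew, turning the forward pass into an iterative deskewing process. A regularization term pulls the local statistics of the DSE outputs toward globally consistent ones. DSE modules are personalized through similarity-aware aggregation to erase each client's domain skew in a targeted way, while DFE modules are aggregated fairly to maximize consensus. Together these techniques improve the model's accuracy, efficiency, and generalizability.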

Link: https://arxiv.org/abs/2503.13063
Authors: Zheng Wang, Zihui Wang, Zheng Wang, Xiaoliang Fan, Cheng Wang
Affiliations: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University, China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, China; Shanghai Innovation Institution; School of Informatics, Xiamen University, China; Peng Cheng Laboratory, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Federated learning (FL) is emerging as a promising technique for collaborative learning without local data leaving their devices. However, clients' data originating from diverse domains may degrade model performance due to domain shifts, preventing the model from learning a consistent representation space. In this paper, we propose a novel FL framework, Federated Domain Shift Eraser (FDSE), to improve model performance by erasing each client's domain skew differently and enhancing their consensus. First, we formulate the model forward pass as an iterative deskewing process that alternately extracts and then deskews features. This is efficiently achieved by decomposing each original layer in the neural network into a Domain-agnostic Feature Extractor (DFE) and a Domain-specific Skew Eraser (DSE). Then, a regularization term is applied to ensure the effectiveness of feature deskewing by pulling the local statistics of the DSE's outputs close to the globally consistent ones. Finally, DFE modules are fairly aggregated and broadcast to all the clients to maximize their consensus, and DSE modules are personalized for each client via similarity-aware aggregation to erase their domain skew differently. Comprehensive experiments were conducted on three datasets to confirm the advantages of our method in terms of accuracy, efficiency, and generalizability.

[CV-62] Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari

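[Quick read]: This paper tackles the direct transliteration of handwritten documents in the medieval Indian Modi script into Devanagari script. Few experts can do this work, and a large number of documents remain untransliterated. Previous research mostly focused on recognizing individual characters and did not offer a complete end-to-end solution. The key contributions are the MoDeTrans dataset and the MoScNet framework. MoScNet is a Vision-Language Model (VLM) framework that uses Knowledge Distillation, in which a teacher model guides a student model to improve transliteration performance; the final student model outperforms the teacher while having 163 times fewer parameters. The work is the first to transliterate directly from handwritten Modi script to Devanagari script, and MoScNet is also competitive on the optical character recognition (OCR) task.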

Link: https://arxiv.org/abs/2503.13060
Authors: Harshal Kausadikar,Tanvi Kale,Onkar Susladkar,Sparsh Mittal
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission at a conference

Abstract:In medieval India, the Marathi language was written using the Modi script. The texts written in Modi script include extensive knowledge about medieval sciences, medicines, land records and authentic evidence about Indian history. Around 40 million documents are in poor condition and have not yet been transliterated. Furthermore, only a few experts in this domain can transliterate this script into English or Devanagari. Most of the past research predominantly focuses on individual character recognition. A system that can transliterate Modi script documents to Devanagari script is needed. We propose the MoDeTrans dataset, comprising 2,043 images of Modi script documents accompanied by their corresponding textual transliterations in Devanagari. We further introduce MoScNet (Modi Script Network), a novel Vision-Language Model (VLM) framework for transliterating Modi script images into Devanagari text. MoScNet leverages Knowledge Distillation, where a student model learns from a teacher model to enhance transliteration performance. The final student model of MoScNet has better performance than the teacher model while having 163× fewer parameters. Our work is the first to perform direct transliteration from the handwritten Modi script to the Devanagari script. MoScNet also shows competitive results on the optical character recognition (OCR) task.

[CV-63] Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

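[Quick read]: The paper asks whether visual recognition models show a human-like structured learning capacity, that is, whether their responses to images of different difficulty follow a hierarchy: a model that fails an easy case will likely fail a harder one, and a model that succeeds on a hard case will likely succeed on the easy one. It also explores a new, more efficient way to evaluate models.

The key steps are as follows. The authors first build a dataset with 100 categories, 10 attributes, and 3 difficulty levels, using generative models to control image difficulty. They find that most models do follow the pattern about 80-90% of the time. Based on this observation they propose an adaptive test, akin to the GRE, in which the model's performance on the current round determines the test images of the next round, so the model's overall performance can be estimated in fewer steps.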

Link: https://arxiv.org/abs/2503.13058
Authors: Zeyi Huang,Utkarsh Ojha,Yuyang Ji,Donghyun Lee,Yong Jae Lee
Affiliations: University of Wisconsin-Madison; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question (2 × 3) incorrectly, they would likely answer a more difficult one (2 × 3 × 4) incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models’ responses follow that pattern. Since real images aren’t labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to the GRE, in which the model’s performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.
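
The response pattern and the adaptive test described in this abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the function names, the staircase rule (move up one difficulty level after a correct answer, down one after a miss), and the toy model are all assumptions made for illustration.

```python
def consistency_rate(results):
    """Fraction of (easy_correct, hard_correct) pairs that follow the pattern.

    A pair violates the pattern only when the easy image is answered
    incorrectly while its harder counterpart is answered correctly.
    """
    consistent = sum(1 for easy, hard in results if not (hard and not easy))
    return consistent / len(results)


def adaptive_test(is_correct, levels=3, rounds=5):
    """Hypothetical staircase test over difficulty levels 0..levels-1."""
    level = 0
    history = []
    for _ in range(rounds):
        ok = is_correct(level)
        history.append((level, ok))
        if ok:
            level = min(level + 1, levels - 1)  # skip ahead to harder images
        else:
            level = max(level - 1, 0)           # fall back to easier images
    return history


# Toy model (an assumption): it solves levels 0 and 1 but fails level 2.
history = adaptive_test(lambda level: level < 2, levels=3, rounds=4)
rate = consistency_rate([(True, True), (True, False), (False, False), (False, True)])
```

In this toy run the test settles around the boundary between levels 1 and 2 after a few rounds, which is the intuition behind estimating overall performance in fewer steps.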

[CV-64] MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling

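[Quick read]: This paper addresses three gaps in existing species distribution models (SDMs): (1) no flexibility to choose relevant predictors at inference time without retraining; (2) limited robustness to missing predictor values, which can hurt accuracy; and (3) little explainability regarding each predictor's contribution. The authors propose MaskSDM, a deep learning-based SDM that uses a masked training strategy to make predictor selection flexible. The model can predict from arbitrary subsets of input variables, stays robust to missing data, and shows clearly how adding or removing a predictor affects performance. MaskSDM also uses Shapley values to assess predictor contributions precisely, improving on traditional approximations. On the sPlotOpen dataset, MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables, which supports its potential to broaden the applicability and adoption of SDMs.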

Link: https://arxiv.org/abs/2503.13057
Authors: Robin Zbinden,Nina van Tiel,Gencer Sumbul,Chiara Vanalli,Benjamin Kellenberger,Devis Tuia
Affiliations: Environmental Computational Science and Earth Observation Laboratory, École Polytechnique Fédérale de Lausanne; People and Nature Lab, University College London
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling by predicting species distributions based on environmental conditions. The selection of predictors is crucial, strongly impacting both model accuracy and how well the predictions reflect ecological patterns. To ensure meaningful insights, input variables must be carefully chosen to match the study objectives and the ecological requirements of the target species. However, existing SDMs, including both traditional and deep learning-based approaches, often lack key capabilities for variable selection: (i) flexibility to choose relevant predictors at inference without retraining; (ii) robustness to handle missing predictor values without compromising accuracy; and (iii) explainability to interpret and accurately quantify each predictor’s contribution. To overcome these limitations, we introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy. This approach allows the model to make predictions with arbitrary subsets of input variables while remaining robust to missing data. It also provides a clearer understanding of how adding or removing a given predictor affects model performance and predictions. Additionally, MaskSDM leverages Shapley values for precise predictor contribution assessments, improving upon traditional approximations. We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species. Our results show that MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables. These findings underscore MaskSDM’s potential to increase the applicability and adoption of SDMs, laying the groundwork for developing foundation models in SDMs that can be readily applied to diverse ecological applications.
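
Because a masked model can be evaluated on any subset of predictors, exact Shapley values become a matter of enumerating subsets. The sketch below shows that computation for a small number of predictors. The predictor names, the `GAIN` table, and `toy_score` are hypothetical stand-ins for "score of the model when only this subset is unmasked"; this is not the MaskSDM code.

```python
from itertools import combinations
from math import factorial


def shapley_values(predictors, value):
    """Exact Shapley value of each predictor for a set function `value`."""
    n = len(predictors)
    phi = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical per-predictor gains; a real study would query the masked model.
GAIN = {"temperature": 0.3, "precipitation": 0.2, "soil_ph": 0.1}


def toy_score(subset):
    score = sum(GAIN[p] for p in subset)
    if "temperature" in subset and "precipitation" in subset:
        score += 0.1  # interaction term shared between the two predictors
    return score


phi = shapley_values(list(GAIN), toy_score)
```

The enumeration costs on the order of 2^n model evaluations, so it is only practical for a modest number of predictors or predictor groups.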

[CV-65] Uncertainty-Aware Knowledge Distillation for Compact and Efficient 6DoF Pose Estimation

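[Quick read]: This paper addresses the trade-off between accuracy and compactness for lightweight models in real-time 6DoF (six degrees of freedom) object pose estimation. For keypoint-based 6DoF pose estimation, it proposes an uncertainty-aware end-to-end Knowledge Distillation (KD) framework. The key idea is that keypoints predicted by the teacher model carry varying levels of uncertainty, which can be exploited during distillation to improve the student's accuracy while keeping it compact. Concretely, knowledge transfer is adjusted according to the uncertainty of each teacher keypoint prediction so that student and teacher predictions are aligned; the same uncertainty-aware keypoint alignment is then used to transfer knowledge at key locations of the corresponding feature maps. Experiments on the LINEMOD benchmark show better results than existing approaches, and validation on the SPEED+ dataset further shows robustness across diverse 6DoF pose estimation scenarios.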

Link: https://arxiv.org/abs/2503.13053
Authors: Nassim Ali Ousalah,Anis Kacem,Enjie Ghorbel,Emmanuel Koumandakis,Djamila Aouada
Affiliations: CVI2, SnT, University of Luxembourg; Cristal Laboratory, ENSI, Manouba University; Infinite Orbits, Toulouse, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.

[CV-66] Permutation Learning with Only N Parameters: From SoftSort to Self-Organizing Gaussians

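[Quick read]: This paper addresses sorting and permutation learning for organizing high-dimensional data in optimization and machine learning, in particular the heavy parameter requirements that make existing methods expensive on large datasets. Gumbel-Sinkhorn needs N×N parameters to determine a full permutation matrix. Low-rank matrix factorization reduces memory to 2MN (with M ≪ N) but still struggles with very large problems. SoftSort gives a continuous relaxation of argsort that enables differentiable 1D sorting, but it has difficulty with multidimensional data and complex permutations.
The key contribution is a method that needs only N parameters, which sharply reduces storage cost. It builds on SoftSort and iteratively shuffles the N indices of the elements to be sorted through a separable learning process. This improves sorting quality for multidimensional data and complex optimization criteria while keeping the method efficient and scalable. It is particularly well suited to large-scale optimization tasks such as "Self-Organizing Gaussians", where efficient and scalable permutation learning is critical.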

Link: https://arxiv.org/abs/2503.13051
Authors: Kai Uwe Barthel,Florian Barthel,Peter Eisert
Affiliations: Visual Computing Group, HTW Berlin; Vision and Imaging Technologies, Fraunhofer HHI, Berlin, Germany
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:Sorting and permutation learning are key concepts in optimization and machine learning, especially when organizing high-dimensional data into meaningful spatial layouts. The Gumbel-Sinkhorn method, while effective, requires N*N parameters to determine a full permutation matrix, making it computationally expensive for large datasets. Low-rank matrix factorization approximations reduce memory requirements to 2MN (with M ≪ N), but they still struggle with very large problems. SoftSort, by providing a continuous relaxation of the argsort operator, allows differentiable 1D sorting, but it faces challenges with multidimensional data and complex permutations. In this paper, we present a novel method for learning permutations using only N parameters, which dramatically reduces storage costs. Our approach builds on SoftSort, but extends it by iteratively shuffling the N indices of the elements to be sorted through a separable learning process. This modification significantly improves sorting quality, especially for multidimensional data and complex optimization criteria, and outperforms pure SoftSort. Our method offers improved memory efficiency and scalability compared to existing approaches, while maintaining high-quality permutation learning. Its dramatically reduced memory requirements make it particularly well-suited for large-scale optimization tasks, such as “Self-Organizing Gaussians”, where efficient and scalable permutation learning is critical.
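
To make the "only N parameters" point concrete, here is a minimal sketch of the plain SoftSort relaxation the paper builds on: a vector of N scores is turned into an N×N row-stochastic matrix by a row-wise softmax over negative distances between the sorted scores and the original scores. It does not implement the paper's iterative index shuffling; the variable names and the temperature values are assumptions.

```python
import math


def softsort(scores, tau=0.1):
    """Row-stochastic N x N relaxation of the descending argsort of `scores`."""
    sorted_scores = sorted(scores, reverse=True)
    matrix = []
    for target in sorted_scores:
        # Row i compares the i-th largest score with every original score.
        logits = [-abs(target - s) / tau for s in scores]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


scores = [0.2, 0.9, 0.5]  # the N learnable parameters
soft_perm = softsort(scores, tau=0.01)
# As tau shrinks, each row concentrates on one index: a hard permutation.
hard_perm = [row.index(max(row)) for row in soft_perm]
```

The matrix itself is never stored as parameters; only `scores` is learned, which is where the storage saving over an N×N parameterization comes from.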

[CV-67] InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving

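[Quick read]: This paper addresses generating autonomous-driving planning results directly from raw sensor data, and in particular the challenge of staying adaptable and robust in complex scenes. Conventional approaches rely on perception of the global scene, whereas human drivers attend only to the regions that directly affect driving decisions. The authors propose InsightDrive, an end-to-end autonomous driving method that organizes perception through language-guided scene representation. The main components are: an instance-centric scene tokenizer that turns the surroundings into map- and object-aware instance tokens; scene attention language descriptions, generated by a vision-language model, that highlight the key regions and obstacles affecting the ego vehicle's motion; alignment of these descriptions with visual features through the vision-language model, which guides visual attention toward an effective scene representation; self-attention and cross-attention to model the ego-agents and ego-map relationships; and joint motion prediction and planning based on the resulting scene understanding. The method achieves state-of-the-art performance on the nuScenes benchmark.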

Link: https://arxiv.org/abs/2503.13047
Authors: Ruiqi Song,Xianda Guo,Hangbin Wu,Qinggong Wei,Long Chen
Affiliations: College of Surveying and Geo-informatics, Tongji University; Institute of Automation, Chinese Academy of Sciences; School of Computer Science, Wuhan University; The School of Artificial Intelligence, University of Chinese Academy of Sciences; Waytous; IAIR, Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Directly generating planning results from raw sensors has become increasingly prevalent due to its adaptability and robustness in complex scenarios. Scene representation, as a key module in the pipeline, has traditionally relied on conventional perception, which focuses on the global scene. However, in driving scenarios, human drivers typically focus only on regions that directly impact driving, which often coincide with those required for end-to-end autonomous driving. In this paper, a novel end-to-end autonomous driving method called InsightDrive is proposed, which organizes perception by language-guided scene representation. We introduce an instance-centric scene tokenizer that transforms the surrounding environment into map- and object-aware instance tokens. Scene attention language descriptions, which highlight key regions and obstacles affecting the ego vehicle’s movement, are generated by a vision-language model that leverages the cognitive reasoning capabilities of foundation models. We then align scene descriptions with visual features using the vision-language model, guiding visual attention through these descriptions to give an effective scene representation. Furthermore, we employ self-attention and cross-attention mechanisms to model the ego-agents and ego-map relationships to comprehensively build the topological relationships of the scene. Finally, based on scene understanding, we jointly perform motion prediction and planning. Extensive experiments on the widely used nuScenes benchmark demonstrate that the proposed InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving. The code is available at this https URL

[CV-68] All You Need to Know About Training Image Retrieval Models

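[Quick read]: This paper investigates which training-time factors matter most for image retrieval accuracy. Across tens of thousands of training runs, it studies the effect of the embedding model architecture, loss function, data sampler, mining function, learning rate, and batch size on retrieval accuracy, and distills best practices that hold across multiple datasets. The key is the systematic analysis and validation of how each factor affects retrieval performance, which gives guidance for tuning image retrieval pipelines. The authors make their code available.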

Link: https://arxiv.org/abs/2503.13045
Authors: Gabriele Berton,Kevin Musgrave,Carlo Masone
Affiliations: Polytechnic of Turin; Setta.dev; Polytechnic of Turin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image retrieval is the task of finding images in a database that are most similar to a given query image. The performance of an image retrieval pipeline depends on many training-time factors, including the embedding model architecture, loss function, data sampler, mining function, learning rate(s), and batch size. In this work, we run tens of thousands of training runs to understand the effect each of these factors has on retrieval accuracy. We also discover best practices that hold across multiple datasets. The code is available at this https URL
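
As a concrete picture of the task definition above (find the database images most similar to a query), the sketch below ranks database embeddings by cosine similarity and scores the result with Recall@k. The abstract does not say which accuracy metric the authors use, so Recall@k, the function names, and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(query_embs, query_labels, db_embs, db_labels, k=1):
    """Fraction of queries with at least one same-label item in the top k."""
    hits = 0
    for q, q_label in zip(query_embs, query_labels):
        ranked = sorted(range(len(db_embs)),
                        key=lambda i: cosine(q, db_embs[i]),
                        reverse=True)
        if any(db_labels[i] == q_label for i in ranked[:k]):
            hits += 1
    return hits / len(query_embs)


# Toy embeddings standing in for the output of a trained embedding model.
db_embs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
db_labels = ["cat", "dog", "dog"]
query_embs = [(0.9, 0.1), (0.6, 0.8)]
query_labels = ["cat", "cat"]

r1 = recall_at_k(query_embs, query_labels, db_embs, db_labels, k=1)
r3 = recall_at_k(query_embs, query_labels, db_embs, db_labels, k=3)
```

The second query is a deliberately poor embedding of a "cat" image, so it is only recovered once k covers the whole database.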

[CV-69] Beyond Role-Based Surgical Domain Modeling: Generalizable Re-Identification in the Operating Room

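[Quick read]: This paper addresses identifying and tracking individual surgical team members across procedures, in particular the poor generalization of models across clinical environments. Conventional surgical domain models focus on workflow optimization through role prediction and overlook the effect of team familiarity and individual differences on surgical outcomes. The authors propose a staff-centric modeling approach that captures each member's distinctive movement patterns and physical characteristics, enabling long-term tracking and analysis of personnel across multiple procedures. The key is a generalizable re-identification framework that encodes sequences of 3D point clouds to capture the shape and articulated motion patterns unique to each individual. The method reaches 86.19% accuracy on realistic clinical data and keeps 75.27% accuracy when transferring between environments, a 12% improvement over existing methods. Used to augment markerless personnel tracking, it improves accuracy by over 50%. A new workflow visualization technique additionally reveals insights into surgical team dynamics and space utilization.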

Link: https://arxiv.org/abs/2503.13028
Authors: Tony Danjun Wang,Lennart Bastian,Tobias Czempiel,Christian Heiliger,Nassir Navab
Affiliations: Technical University of Munich (TUM); University College London (UCL); Ludwig-Maximilians-Universität München (LMU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 14 figures, Submitted to Medical Image Analysis

Abstract:Surgical domain models improve workflow optimization through automated predictions of each staff member’s surgical role. However, mounting evidence indicates that team familiarity and individuality impact surgical outcomes. We present a novel staff-centric modeling approach that characterizes individual team members through their distinctive movement patterns and physical characteristics, enabling long-term tracking and analysis of surgical personnel across multiple procedures. To address the challenge of inter-clinic variability, we develop a generalizable re-identification framework that encodes sequences of 3D point clouds to capture shape and articulated motion patterns unique to each individual. Our method achieves 86.19% accuracy on realistic clinical data while maintaining 75.27% accuracy when transferring between different environments - a 12% improvement over existing methods. When used to augment markerless personnel tracking, our approach improves accuracy by over 50%. Through extensive validation across three datasets and the introduction of a novel workflow visualization technique, we demonstrate how our framework can reveal novel insights into surgical team dynamics and space utilization patterns, advancing methods to analyze surgical workflows and team coordination.

[CV-70] HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model

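[Quick read]: This paper addresses inadequate mask representation and complex architectures in existing image segmentation methods built on large multimodal models (LMMs), which limit the potential of LMMs for segmentation. The key contribution is the Hierarchical Mask Tokenizer (HiMTok), which represents a segmentation mask with up to 32 tokens and does not need the original image during mask de-tokenization. HiMTok gives a compact, coarse-to-fine mask representation that fits the next-token-prediction paradigm of language models, so LMMs acquire segmentation ability directly. The authors also design a 3-stage progressive training recipe with a hierarchical mask loss for effective coarse-to-fine learning, and enable bidirectional information flow that converts between bounding boxes and mask tokens to exploit multi-task training. The approach achieves state-of-the-art results on various segmentation tasks while enhancing visual grounding and maintaining overall visual understanding.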

Link: https://arxiv.org/abs/2503.13026
Authors: Tao Wang,Changxu Cheng,Lingfeng Wang,Senda Chen,Wuyue Zhao
Affiliations: Uni-Ubi; Zhejiang University; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: technical report

Abstract:The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks, while also enhancing visual grounding and maintaining overall visual understanding.

[CV-71] PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data

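[Quick read]: This paper addresses the poor performance of existing data augmentation methods in real-world scenes with diverse human appearances and complex poses. It proposes PoseSyn, a data synthesis framework that turns abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs to improve generalization. PoseSyn has two core components: the Error Extraction Module (EEM), which identifies challenging poses in the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around those poses. By generating realistic 3D training data with a human animation model aligned with the challenging poses and appearances, PoseSyn raises the accuracy of various 3D pose estimators by up to 14% on real-world benchmarks covering varied backgrounds, occlusions, challenging poses, and multi-view scenes. Extensive experiments further confirm that PoseSyn is a scalable, effective way to improve generalization without expensive 3D annotations, regardless of the pose estimator's model size or design.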

Link: https://arxiv.org/abs/2503.13025
Authors: ChangHee Yang,Hyeonseop Song,Seokhun Choi,Seungwoo Lee,Jaechul Kim,Hoseok Do
Affiliations: AI Lab, CTO Division, LG Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The first three authors contributed equally to this work

Abstract:Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. PoseSyn comprises two key components: Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model aligned with challenging poses and appearances, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator’s model size or design.

[CV-72] Real-Time Multi-Object Tracking using YOLOv8 and SORT on a SoC FPGA

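[Quick read]: This paper addresses running multi-object tracking (MOT) on low-power, real-time embedded platforms, with a focus on using hardware resources efficiently for detection and tracking. It proposes an FPGA-based embedded MOT system built on a quantized YOLOv8 detector and the SORT tracker. The core of the solution is a modified FINN framework that uses external memory for the YOLOv8 model parameters and supports the operations YOLOv8 requires. The system reaches 0.21 mAP for detection on COCO and 38.9 MOTA for tracking on MOT15.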

Link: https://arxiv.org/abs/2503.13023
Authors: Michal Danilowicz,Tomasz Kryjak
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for the 21st International Symposium on Applied Reconfigurable Computing ARC 2025, Sevilla, Spain, April 9-11, 2025

Abstract:Multi-object tracking (MOT) is one of the most important problems in computer vision and a key component of any vision-based perception system used in advanced autonomous mobile robotics. Therefore, its implementation on low-power and real-time embedded platforms is highly desirable. Modern MOT algorithms should be able to track objects of a given class (e.g. people or vehicles). In addition, the number of objects to be tracked is not known in advance, and they may appear and disappear at any time, as well as be obscured. For these reasons, the most popular and successful approaches have recently been based on the tracking-by-detection paradigm. Therefore, the presence of a high quality object detector is essential, which in practice accounts for the vast majority of the computational and memory complexity of the whole MOT system. In this paper, we propose an FPGA (Field-Programmable Gate Array) implementation of an embedded MOT system based on a quantized YOLOv8 detector and the SORT (Simple Online Realtime Tracker) tracker. We use a modified version of the FINN framework to utilize external memory for model parameters and to support operations required by YOLOv8. We discuss the evaluation of detection and tracking performance using the COCO and MOT15 datasets, where we achieve 0.21 mAP and 38.9 MOTA respectively. As the computational platform, we use an MPSoC system (Zynq UltraScale+ device from AMD/Xilinx) where the detector is deployed in reprogrammable logic and the tracking algorithm is implemented in the processor system.
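
The tracker side of such a system associates each frame's detections with existing tracks by bounding-box overlap. The sketch below shows the intersection-over-union (IoU) score and a simple association step. It is a simplification, not the authors' implementation: SORT uses a Kalman filter for motion prediction and the Hungarian algorithm for assignment, whereas this example swaps in a greedy matcher to stay short, and the threshold value and function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, iou_threshold=0.3):
    """Greedy stand-in for Hungarian matching: best-overlap pairs first."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, ti, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches


tracks = [(0, 0, 2, 2), (10, 10, 12, 12)]        # predicted track boxes
detections = [(10, 10, 12, 13), (0, 0, 2, 2)]    # detector output, new frame
matches = greedy_associate(tracks, detections)
```

Unmatched detections would start new tracks and unmatched tracks would eventually be dropped, which is how objects appearing and disappearing are handled.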

[CV-73] Efficient Motion-Aware Video MLLM CVPR2025

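[Quick read]: This paper addresses two problems in current video multimodal large language models (video MLLMs): uniform frame sampling makes data processing inefficient, and image-level encoders limit motion awareness. It proposes EMA, an Efficient Motion-Aware video MLLM. EMA's key idea is to take compressed video structures as input, with a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within each GOP unit of the compressed video stream to produce compact, informative visual tokens. By combining fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, EMA reduces redundancy and strengthens motion representation. The paper also introduces MotionBench, a benchmark covering four motion types, and reports that EMA improves performance, lowers inference cost, and scales to long-video understanding.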

Link: https://arxiv.org/abs/2503.13016
Authors: Zijia Zhao,Yuqi Huo,Tongtian Yue,Longteng Guo,Haoyu Lu,Bingning Wang,Weipeng Chen,Jing Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Most current video MLLMs rely on uniform frame sampling and image-level encoders, resulting in inefficient data processing and limited motion awareness. To address these challenges, we introduce EMA, an Efficient Motion-Aware video MLLM that utilizes compressed video structures as inputs. We propose a motion-aware GOP (Group of Pictures) encoder that fuses spatial and motion information within a GOP unit in the compressed video stream, generating compact, informative visual tokens. By integrating fewer but denser RGB frames with more but sparser motion vectors in this native slow-fast input architecture, our approach reduces redundancy and enhances motion representation. Additionally, we introduce MotionBench, a benchmark for evaluating motion understanding across four motion types: linear, curved, rotational, and contact-based. Experimental results show that EMA achieves state-of-the-art performance on both MotionBench and popular video question answering benchmarks, while reducing inference costs. Moreover, EMA demonstrates strong scalability, as evidenced by its competitive performance on long video understanding benchmarks.

[CV-74] Test-Time Domain Generalization via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation

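[Quick read]: This paper addresses the performance drop of pre-trained models under domain shift in real deployment, specifically test-time adaptation (TTA) for medical image segmentation. Existing TTA methods often fall short there, mainly because they overlook the prior knowledge inherent in medical images. The key contribution is to incorporate morphological information in a framework based on multi-graph matching. Learnable universe embeddings integrate morphological priors during multi-source training, and new unsupervised test-time paradigms handle domain adaptation. The approach guarantees cycle-consistency in multi-matching and lets the model capture the invariant priors of unseen data more effectively, which substantially mitigates domain shift. It outperforms state-of-the-art methods on two medical image segmentation benchmarks for both multi-source and single-source domain generalization.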

Link: https://arxiv.org/abs/2503.13012
Authors: Xingguo Lv,Xingbo Dong,Liwen Wang,Jiewen Yang,Lei Zhao,Bin Pu,Zhe Jin,Xuejun Li
Affiliations: Anhui Provincial International Joint Research Center for Advanced Technology in Medical Imaging, Anhui University; Hunan University; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although domain generalization (DG) has significantly addressed the performance degradation of pre-trained models caused by domain shifts, it often falls short in real-world deployment. Test-time adaptation (TTA), which adjusts a learned model using unlabeled test data, presents a promising solution. However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. Specifically, we introduce learnable universe embeddings that integrate morphological priors during multi-source training, along with novel unsupervised test-time paradigms for domain adaptation. This approach guarantees cycle-consistency in multi-matching while enabling the model to more effectively capture the invariant priors of unseen data, significantly mitigating the effects of domain shifts. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches on two medical image segmentation benchmarks for both multi-source and single-source domain generalization tasks. The source code is available at this https URL.
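
Why matching through a shared universe gives cycle-consistency can be seen in a toy example with hard assignments. Each graph maps its nodes to slots of a common universe, and every pairwise matching is derived from those slots, so composing the matchings i→j and j→k always agrees with the direct matching i→k. The paper learns soft universe embeddings; the hard slot lists and function names below are assumptions used only to illustrate the property.

```python
def pairwise_match(u_src, u_dst):
    """Match node a of the source graph to the destination node in the same slot.

    u_src[a] is the universe slot assigned to node a of the source graph.
    """
    slot_to_dst = {slot: b for b, slot in enumerate(u_dst)}
    return {a: slot_to_dst[slot]
            for a, slot in enumerate(u_src) if slot in slot_to_dst}


def compose(m_ij, m_jk):
    """Follow a matching i -> j and then j -> k."""
    return {a: m_jk[b] for a, b in m_ij.items() if b in m_jk}


# Three graphs with three nodes each, assigned to universe slots 0..2.
u1, u2, u3 = [2, 0, 1], [0, 1, 2], [1, 2, 0]

m12 = pairwise_match(u1, u2)
m23 = pairwise_match(u2, u3)
m13 = pairwise_match(u1, u3)
cycle_consistent = compose(m12, m23) == m13
```

Independent pairwise matchers offer no such guarantee, which is the usual motivation for routing all matchings through a universe.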

[CV-75] Knowledge Distillation: Enhancing Neural Network Compression with Integrated Gradients

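[Quick read]: This paper addresses efficient deployment of deep neural networks on resource-constrained devices, specifically compressing networks while preserving accuracy and interoperability. It proposes a machine learning framework that combines Knowledge Distillation (KD) with Integrated Gradients (IG), an attribution method, to optimize the compression of convolutional neural networks. The key is a data augmentation strategy in which IG maps precomputed from the teacher model are overlaid on the training images, guiding the compact student model toward critical feature representations. This uses the teacher's decision-making insights to help the student replicate complex patterns with fewer parameters. Experiments show improved accuracy and inference efficiency, and ablation studies reveal synergy between KD and IG.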

Link: https://arxiv.org/abs/2503.13008
Authors: David E. Hernandez,Jose Ramon Chang,Torbjörn E. M. Nordling
Affiliations: Nordling Lab
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 3 figures, conference

Abstract:Efficient deployment of deep neural networks on resource-constrained devices demands advanced compression techniques that preserve accuracy and interoperability. This paper proposes a machine learning framework that augments Knowledge Distillation (KD) with Integrated Gradients (IG), an attribution method, to optimise the compression of convolutional neural networks. We introduce a novel data augmentation strategy where IG maps, precomputed from a teacher model, are overlaid onto training images to guide a compact student model toward critical feature representations. This approach leverages the teacher’s decision-making insights, enhancing the student’s ability to replicate complex patterns with reduced parameters. Experiments on CIFAR-10 demonstrate the efficacy of our method: a student model, compressed 4.1-fold from the MobileNet-V2 teacher, achieves 92.5% classification accuracy, surpassing the baseline student’s 91.4% and traditional KD approaches, while reducing inference latency from 140 ms to 13 ms, a tenfold speedup. We perform hyperparameter optimisation for efficient learning. Comprehensive ablation studies dissect the contributions of KD and IG, revealing synergistic effects that boost both performance and model explainability. Our method’s emphasis on feature-level guidance via IG distinguishes it from conventional KD, offering a data-driven solution for mining transferable knowledge in neural architectures. This work contributes to machine learning by providing a scalable, interpretable compression technique, ideal for edge computing applications where efficiency and transparency are paramount.
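
Integrated Gradients attributes a prediction to each input feature by averaging gradients along a straight path from a baseline to the input and scaling by the input difference. The sketch below approximates that path integral with a midpoint Riemann sum on a toy function with a known gradient, then blends the attribution map into an "image". The blending rule in `overlay`, the `strength` parameter, and the toy function are assumptions; the abstract does not specify how the authors overlay IG maps.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann approximation of Integrated Gradients.

    grad_fn(point) must return the gradient of the model output at `point`.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def overlay(image, ig_map, strength=0.5):
    """Hypothetical overlay: dim pixels with low attribution magnitude."""
    peak = max(abs(v) for v in ig_map) or 1.0
    return [p * (1.0 - strength + strength * abs(v) / peak)
            for p, v in zip(image, ig_map)]


# Toy "teacher": F(x) = x0**2 + 3*x1, with gradient [2*x0, 3].
def toy_grad(point):
    return [2.0 * point[0], 3.0]


attributions = integrated_gradients(toy_grad, x=[2.0, 1.0], baseline=[0.0, 0.0])
augmented = overlay([1.0, 1.0], attributions, strength=0.5)
```

The attributions sum to F(x) minus F(baseline), the completeness property that makes IG maps a reasonable signal for which pixels the teacher relied on.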

[CV-76] TFDM: Time-Variant Frequency-Based Point Cloud Diffusion with Mamba

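[Quick read]: This paper addresses applying state space models (Mamba) to 3D point cloud generation, an area with limited prior work. The key contribution is a diffusion framework containing a dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block uses a space-filling curve to reorder the points into sequences suited to Mamba state-space modeling and operates in a latent space to reduce the computational overhead of processing 3D data directly. The TF-Encoder exploits the diffusion model's ability to refine fine details in the later recovery stages by prioritizing key points within the U-Net architecture, which improves detail quality in the final result. The method reports state-of-the-art performance on ShapeNet-v2 for certain metrics and categories while substantially reducing parameter count and inference time.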

Link: https://arxiv.org/abs/2503.13004
Authors: Jiaxu Liu,Li Li,Hubert P. H. Shum,Toby P. Breckon
Affiliations: Durham University; King’s College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models currently demonstrate impressive performance over various generative tasks. Recent work on image diffusion highlights the strong capabilities of Mamba (state space models) due to its efficient handling of long-range dependencies and sequential data modeling. Unfortunately, joint consideration of state space models with 3D point cloud generation remains limited. To harness the powerful capabilities of the Mamba model for 3D point cloud generation, we propose a novel diffusion framework containing dual latent Mamba block (DM-Block) and a time-variant frequency encoder (TF-Encoder). The DM-Block apply a space-filling curve to reorder points into sequences suitable for Mamba state-space modeling, while operating in a latent space to mitigate the computational overhead that arises from direct 3D data processing. Meanwhile, the TF-Encoder takes advantage of the ability of the diffusion model to refine fine details in later recovery stages by prioritizing key points within the U-Net architecture. This frequency-based mechanism ensures enhanced detail quality in the final stages of generation. Experimental results on the ShapeNet-v2 dataset demonstrate that our method achieves state-of-the-art performance (ShapeNet-v2: 0.14% on 1-NNA-Abs50 EMD and 57.90% on COV EMD) on certain metrics for specific categories while reducing computational parameters and inference time by up to 10 \times and 9 \times , respectively. Source code is available in Supplementary Materials and will be released upon accpetance.

[CV-77] Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization

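Quick read: This paper tackles the difficulties Vision-Language Models (VLMs) face in personalization when user-provided positive samples are scarce and retrieved negative samples are of low quality. Its core is the Concept-as-Tree (CaT) framework, which represents a concept as a tree and thereby generates positive and negative samples of varying difficulty and diversity to improve VLM personalization. Combined with a carefully designed data filtering strategy, CaT safeguards the quality of the generated data and forms a strong data pipeline that alleviates both the shortage of positive samples and the low quality of negative samples, significantly improving personalization performance on the MyVLM, Yo’LLaVA, and MC-LLaVA datasets.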

Link: https://arxiv.org/abs/2503.12999
Authors: Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Qizhe Zhang, Yulin Luo, Hao Liang, Wentao Zhang
Affiliations: Peking University; Intel Labs, China; CUHK MMLab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been an increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for fine-tuning. To reveal the relationship between sample and model performance, we systematically investigate the impact of positive and negative samples (easy and hard) and their diversity on VLM personalization tasks. Based on the detailed analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the data generation of positive and negative samples with varying difficulty and diversity for VLM personalization. With a well-designed data filtering strategy, our CaT framework can ensure the quality of generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline, alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the personalization capabilities of VLMs across the MyVLM, Yo’LLaVA, and MC-LLaVA datasets. To our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code is released at this https URL.

[CV-78] SparseAlign: A Fully Sparse Framework for Cooperative Object Detection

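Quick read: This paper addresses the high computational cost and poor scalability of existing cooperative perception methods on long-range detection. Earlier cooperative object detection work has succeeded on dense Bird’s Eye View (BEV) feature maps, but such methods are computationally demanding and hard to extend to long-range scenarios. The paper proposes SparseAlign, a fully sparse framework built on three designs: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head tailored to sparse features. Experiments show that, despite its sparsity, the framework outperforms the state of the art on the OPV2V and DairV2X datasets while requiring less communication bandwidth. On time-aligned cooperative object detection, SparseAlign also clearly outperforms the baselines on the OPV2Vt and DairV2Xt datasets.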

Link: https://arxiv.org/abs/2503.12982
Authors: Yunshuang Yuan, Yan Xia, Daniel Cremers, Monika Sester
Affiliations: Leibniz University Hannover; Technical University of Munich; Munich Center for Machine Learning (MCML)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cooperative perception can increase the view field and decrease the occlusion of an ego vehicle, hence improving the perception performance and safety of autonomous driving. Despite the success of previous works on cooperative object detection, they mostly operate on dense Bird’s Eye View (BEV) feature maps, which are computationally demanding and can hardly be extended to long-range detection problems. More efficient fully sparse frameworks are rarely explored. In this work, we design a fully sparse framework, SparseAlign, with three key features: an enhanced sparse 3D backbone, a query-based temporal context learning module, and a robust detection head specially tailored for sparse features. Extensive experimental results on both OPV2V and DairV2X datasets show that our framework, despite its sparsity, outperforms the state of the art with less communication bandwidth requirements. In addition, experiments on the OPV2Vt and DairV2Xt datasets for time-aligned cooperative object detection also show a significant performance gain compared to the baseline works.

[CV-79] Analyzing Swimming Performance Using Drone Captured Aerial Videos

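Quick read: This paper addresses the limitations of traditional swimmer tracking with above-water and underwater cameras, which require multiple cameras and suffer from water-splash obstructions. The proposed solution uses a moving unmanned aerial vehicle (UAV) with a high-resolution camera to capture aerial video, which computer vision algorithms then process to extract swimmer positions and movements. The approach needs only a single camera and offers comprehensive coverage; the reported maximum errors are 0.3 seconds for stroke duration and 0.35 m/s for velocity.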

Link: https://arxiv.org/abs/2503.12981
Authors: Thu Tran, Kenny Tsu Wei Choo, Shaohui Foong, Hitesh Bhardwaj, Shane Kyi Hla Win, Wei Jun Ang, Kenneth Goh, Rajesh Krishna Balan
Affiliations: Singapore University of Technology and Design; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 6 pages, published to ACM Dronet’24

Abstract:Monitoring swimmer performance is crucial for improving training and enhancing athletic techniques. Traditional methods for tracking swimmers, such as above-water and underwater cameras, face limitations due to the need for multiple cameras and obstructions from water splashes. This paper presents a novel approach for tracking swimmers using a moving UAV. The proposed system employs a UAV equipped with a high-resolution camera to capture aerial footage of the swimmers. The footage is then processed using computer vision algorithms to extract the swimmers’ positions and movements. This approach offers several advantages, including single camera use and comprehensive coverage. The system’s accuracy is evaluated with both training and in-competition videos. The results demonstrate the system’s ability to accurately track swimmers’ movements, limb angles, stroke duration and velocity, with a maximum error of 0.3 seconds for stroke duration and 0.35 m/s for velocity.
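
As a rough illustration of the two reported quantities, the sketch below derives velocity and stroke duration from tracked data. The abstract does not describe how the system computes them, so the inputs (timestamps in seconds, positions along the lane in metres, and stroke start times) and the function names are assumptions made only for this example.

```python
def mean_velocity(times_s, positions_m):
    """Average speed in m/s between the first and last tracked sample."""
    elapsed = times_s[-1] - times_s[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (positions_m[-1] - positions_m[0]) / elapsed


def stroke_durations(stroke_start_times_s):
    """Duration of each stroke cycle: the gap between consecutive stroke starts."""
    return [b - a for a, b in zip(stroke_start_times_s, stroke_start_times_s[1:])]
```

Errors such as the reported 0.3 seconds and 0.35 m/s would then be measured by comparing these derived values against manually annotated ground truth.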

[CV-80] Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning

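Quick read: This paper targets two problems: existing methods reason too little about implicit user intentions and rely heavily on explicit instructions, and they neglect inter-step route planning for multi-step robot moves. It proposes a new 3D task, activity reasoning and planning from implicit instructions, guided by fine-grained 3D object shapes and locations obtained from scene segmentation. The solution has two parts: ReasonPlan3D, a large-scale benchmark covering diverse 3D scenes with rich implicit instructions and detailed annotations, and a progressive plan generation framework that maintains contextual consistency across steps and uses a dynamically updated scene graph to capture critical objects and their spatial relations. Together these enable reasoning about activities from implicit human instructions, producing accurate stepwise task plans, and integrating route planning for multi-step moves.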

Link: https://arxiv.org/abs/2503.12974
Authors: Xueying Jiang, Wenhao Li, Xiaoqin Zhang, Ling Shao, Shijian Lu
Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; College of Computer Science and Technology, Zhejiang University of Technology, China; UCAS-Terminus AI Lab, University of Chinese Academy of Sciences, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:3D activity reasoning and planning has attracted increasing attention in human-robot interaction and embodied AI thanks to the recent advance in multimodal learning. However, most existing works share two constraints: 1) heavy reliance on explicit instructions with little reasoning on implicit user intention; 2) negligence of inter-step route planning on robot moves. To bridge the gaps, we propose 3D activity reasoning and planning, a novel 3D task that reasons the intended activities from implicit instructions and decomposes them into steps with inter-step routes and planning under the guidance of fine-grained 3D object shapes and locations from scene segmentation. We tackle the new 3D task from two perspectives. First, we construct ReasonPlan3D, a large-scale benchmark that covers diverse 3D scenes with rich implicit instructions and detailed annotations for multi-step task planning, inter-step route planning, and fine-grained segmentation. Second, we design a novel framework that introduces progressive plan generation with contextual consistency across multiple steps, as well as a scene graph that is updated dynamically for capturing critical objects and their spatial relations. Extensive experiments demonstrate the effectiveness of our benchmark and framework in reasoning activities from implicit human instructions, producing accurate stepwise task plans, and seamlessly integrating route planning for multi-step moves. The dataset and code will be released.

[CV-81] Prospects for Mitigating Spectral Variability in Tropical Species Classification Using Self-Supervised Learning

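Quick read: This paper addresses inconsistent results in tropical species identification caused by spectral variability. The key idea is to apply Self-Supervised Learning (SSL), specifically the Barlow-Twins approach, to repeated spectral acquisitions in order to extract stable features that are robust to abiotic variability and relevant for species identification. Experiments show that classification based on these features is 10 accuracy points more robust to spectral variability across dates than typical reflectance products.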

Link: https://arxiv.org/abs/2503.12973
Authors: Colin Prieur, Nassim Ait Ali Braham, Paul Tresson, Grégoire Vincent, Jocelyn Chanussot
Affiliations: AMAP, Univ Montpellier, CIRAD, CNRS, INRAE, IRD, Montpellier, France; Data Science in Earth Observation, Technical University of Munich, Munich, Germany; Remote Sensing Technology Institute, German Aerospace Center, Germany; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, published in the proceedings of the “2024 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)”

Abstract:Airborne hyperspectral imaging is a promising method for identifying tropical species, but spectral variability between acquisitions hinders consistent results. This paper proposes using Self-Supervised Learning (SSL) to encode spectral features that are robust to abiotic variability and relevant for species identification. By employing the state-of-the-art Barlow-Twins approach on repeated spectral acquisitions, we demonstrate the ability to develop stable features. For the classification of 40 tropical species, experiments show that these features can outperform typical reflectance products in terms of robustness to spectral variability by 10 points of accuracy across dates.
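
For readers unfamiliar with Barlow Twins, the sketch below shows the general form of its objective in plain Python: the cross-correlation matrix between two batch-standardized embeddings is pushed toward the identity. Treating two acquisition dates of the same pixels as the two "views" follows the abstract; the weight `lam`, the function names, and the pure-Python formulation are assumptions for illustration and are not taken from the paper.

```python
import math


def standardize(columns):
    """Zero-mean, unit-variance normalization of each feature over the batch."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        out.append([(v - mean) / (std + 1e-12) for v in col])
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: embeddings (batch x dim) of the same samples under two views."""
    n, d = len(z_a), len(z_a[0])
    # cols[i] holds feature i across the whole batch.
    cols_a = standardize([[row[i] for row in z_a] for i in range(d)])
    cols_b = standardize([[row[i] for row in z_b] for i in range(d)])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n
            # Diagonal terms reward invariance; off-diagonal terms penalize redundancy.
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss
```

The loss is close to zero when the two views yield identical, decorrelated features, which is the sense in which the learned features become stable across acquisition dates.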

[CV-82] Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

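Quick read: This paper addresses incomplete knowledge and hallucination artifacts in the multimodal reasoning of Large Language Models (LLMs), problems that textual Knowledge Graphs (KGs) only partially mitigate because they are confined to a single modality. Multimodal Knowledge Graphs (MMKGs) promise better cross-modal understanding, but building them is hindered by the semantic narrowness of manual text annotations and by inherent noise in visual-semantic entity linking. The paper proposes the Vision-align-to-Language integrated Knowledge Graph (VaLiK), which strengthens LLM reasoning through cross-modal information supplementation. The key is to cascade pre-trained Vision-Language Models (VLMs) that align image features with text and produce descriptions carrying image-specific information, together with a cross-modal similarity verification mechanism that quantifies semantic consistency and filters out noise introduced during alignment. Even without manually annotated image captions, the refined descriptions suffice to build the MMKG. Compared with conventional MMKG construction, the approach is substantially more storage-efficient while retaining direct entity-to-image links. Experiments show that LLMs augmented with VaLiK outperform previous state-of-the-art models on multimodal reasoning tasks.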

Link: https://arxiv.org/abs/2503.12972
Authors: Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Botian Shi, Ding Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Tongji University; East China Normal University; Stanford University; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures, 6 tables

Abstract:Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we developed a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKGs construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models. Our code is published at this https URL.
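
A minimal sketch of what a cross-modal similarity check could look like is given below. It assumes that image and description embeddings already live in a shared space (for example from a CLIP-style encoder) and that filtering is a simple cosine threshold; the abstract does not specify either detail, and `filter_descriptions` and `threshold` are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_descriptions(image_embedding, candidates, threshold=0.5):
    """Keep (text, embedding) candidates that are consistent with the image."""
    return [
        text
        for text, emb in candidates
        if cosine_similarity(image_embedding, emb) >= threshold
    ]


kept = filter_descriptions(
    [1.0, 0.0],
    [("a dog on grass", [0.9, 0.1]), ("unrelated text", [0.0, 1.0])],
)
```

Only the descriptions that pass the check would then be used as text for building the knowledge graph.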

[CV-83] Action tube generation by person query matching for spatio-temporal action detection

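Quick read: This paper addresses action tube generation in Spatio-Temporal Action Detection (STAD). It proposes generating action tubes directly from the original video, without post-processing steps such as IoU-based linking or clip splitting. The key component is the Query Matching Module (QMM), which uses metric learning to pull queries for the same person closer together across frames while separating them from queries for other people. Action classes are predicted from the query sequences obtained through QMM matching, which supports variable-length inputs from videos longer than a single clip. Experiments on JHMDB, UCF101-24, and AVA show that the method handles large changes in person position well while offering better computational efficiency and lower resource requirements.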

Link: https://arxiv.org/abs/2503.12969
Authors: Kazuki Omi, Jion Oshima, Toru Tamaki
Affiliations: Nagoya Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: extended version of VISAPP2025

Abstract:This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.
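
The linking idea can be sketched as follows, assuming the query embeddings have already been trained so that the same person maps to nearby vectors. The greedy nearest-neighbour rule, the cosine measure, and the `min_sim` threshold are assumptions for illustration; the abstract does not state the matching rule QMM actually uses. The input is assumed to contain at least one frame.

```python
import math


def cosine(u, v):
    """Cosine similarity between two query embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_queries(frames, min_sim=0.5):
    """frames[t] is a list of query embeddings; returns tubes of (frame, query) pairs."""
    tubes = [[(0, q)] for q in range(len(frames[0]))]
    for t in range(1, len(frames)):
        used = set()
        for tube in tubes:
            last_t, last_q = tube[-1]
            if last_t != t - 1:
                continue  # this tube ended in an earlier frame
            best_q, best_sim = None, min_sim
            for q, emb in enumerate(frames[t]):
                if q in used:
                    continue
                sim = cosine(frames[last_t][last_q], emb)
                if sim > best_sim:
                    best_q, best_sim = q, sim
            if best_q is not None:
                tube.append((t, best_q))
                used.add(best_q)
        # Queries that matched no existing tube start a new one.
        for q in range(len(frames[t])):
            if q not in used:
                tubes.append([(t, q)])
    return tubes
```

Because matching relies on embedding similarity rather than box overlap, a person who moves far between frames can still be linked, which is consistent with the abstract's claim about large position changes.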

[CV-84] OptiPMB: Enhancing 3D Multi-Object Tracking with Optimized Poisson Multi-Bernoulli Filtering

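Quick read: This paper addresses the accuracy of 3D Multi-Object Tracking (MOT) for autonomous driving, which robust perception, navigation, and planning in complex environments depend on. Conventional model-based methods are limited by heuristic data association and track management schemes, while Random Finite Set (RFS) methods are theoretically more rigorous but still leave room for improvement in practice. The paper proposes OptiPMB, an RFS-based 3D MOT method that uses an optimized Poisson Multi-Bernoulli (PMB) filter with three design contributions: 1) a measurement-driven hybrid adaptive birth model for better track initialization; 2) adaptive detection probability parameters that maintain tracks for occluded objects; and 3) optimized density pruning and track extraction modules that further improve overall tracking performance. Experiments on nuScenes and KITTI show that OptiPMB achieves higher tracking accuracy than state-of-the-art methods, setting a new benchmark for model-based 3D MOT.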

Link: https://arxiv.org/abs/2503.12968
Authors: Guanhua Ding, Yuxuan Xia, Runwei Guan, Qinchen Wu, Tao Huang, Weiping Ding, Jinping Sun, Guoqiang Mao
Affiliations: School of Electronic Information Engineering, Beihang University; Department of Automation, Shanghai Jiaotong University; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; College of Science and Engineering, James Cook University; School of Artificial Intelligence and Computer Science, Nantong University; Faculty of Data Science, City University of Macau; Research Laboratory of Smart Driving and Intelligent Transportation Systems, Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Accurate 3D multi-object tracking (MOT) is crucial for autonomous driving, as it enables robust perception, navigation, and planning in complex environments. While deep learning-based solutions have demonstrated impressive 3D MOT performance, model-based approaches remain appealing for their simplicity, interpretability, and data efficiency. Conventional model-based trackers typically rely on random vector-based Bayesian filters within the tracking-by-detection (TBD) framework but face limitations due to heuristic data association and track management schemes. In contrast, random finite set (RFS)-based Bayesian filtering handles object birth, survival, and death in a theoretically sound manner, facilitating interpretability and parameter tuning. In this paper, we present OptiPMB, a novel RFS-based 3D MOT method that employs an optimized Poisson multi-Bernoulli (PMB) filter while incorporating several key innovative designs within the TBD framework. Specifically, we propose a measurement-driven hybrid adaptive birth model for improved track initialization, employ adaptive detection probability parameters to effectively maintain tracks for occluded objects, and optimize density pruning and track extraction modules to further enhance overall tracking performance. Extensive evaluations on nuScenes and KITTI datasets show that OptiPMB achieves superior tracking accuracy compared with state-of-the-art methods, thereby establishing a new benchmark for model-based 3D MOT and offering valuable insights for future research on RFS-based trackers in autonomous driving.
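
To see why the detection probability matters for occluded objects, consider the standard Bernoulli existence update after a scan in which a track receives no detection. The formula below is the textbook form for a single Bernoulli component; how OptiPMB adapts the detection probability is not described in the abstract, so the numbers and the function name are illustrative assumptions only.

```python
def missed_detection_update(r, p_d):
    """Existence probability of a Bernoulli track after a missed detection.

    r   : prior existence probability of the track
    p_d : probability that the object would have been detected if present
    """
    return r * (1.0 - p_d) / (1.0 - r * p_d)


# With a high detection probability, one miss sharply lowers existence.
visible = missed_detection_update(0.9, 0.9)
# With a lowered detection probability (e.g. object judged occluded), it barely drops.
occluded = missed_detection_update(0.9, 0.2)
```

Lowering the detection probability for objects that are likely occluded therefore keeps their tracks alive through missed detections instead of pruning them.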

[CV-85] Training Video Foundation Models with NVIDIA NeMo

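Quick read: This paper addresses the challenge of training large-scale, high-quality Video Foundation Models (VFMs) that can generate high-quality video. Its solution is a scalable, open-source VFM training pipeline built on NVIDIA NeMo, covering accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. The paper also provides a comprehensive performance analysis that summarizes best practices for efficient training and inference.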

Link: https://arxiv.org/abs/2503.12964
Authors: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal
Affiliations: NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.

[CV-86] Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

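Quick read: This paper addresses two main challenges in audio-driven single-image talking portrait generation: (1) keypoint-based methods struggle to capture fine facial details and offer limited pose diversity; (2) image-based methods generate high-quality, detailed portraits but suffer from identity distortion and high computational cost. The paper proposes KDTalker, a framework that combines unsupervised implicit 3D keypoints with a spatiotemporal diffusion model. The unsupervised implicit 3D keypoints let KDTalker adapt to facial information density, so the diffusion process can flexibly model diverse head poses and capture fine facial details. A custom spatiotemporal attention mechanism ensures accurate lip synchronization while improving temporal consistency, generation quality, and computational efficiency.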

Link: https://arxiv.org/abs/2503.12963
Authors: Chaolong Yang, Kai Yao, Yuyao Yan, Chenru Jiang, Weiguang Zhao, Jie Sun, Guangliang Cheng, Yifei Zhang, Bin Dong, Kaizhu Huang
Affiliations: University of Liverpool; Ant Group; Xi’an Jiaotong-Liverpool University; Duke Kunshan University; Ricoh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio-driven single-image talking portrait generation plays a crucial role in virtual reality, digital human creation, and filmmaking. Existing approaches are generally categorized into keypoint-based and image-based methods. Keypoint-based methods effectively preserve character identity but struggle to capture fine facial details due to the fixed points limitation of the 3D Morphable Model. Moreover, traditional generative networks face challenges in establishing causality between audio and keypoints on limited datasets, resulting in low pose diversity. In contrast, image-based approaches produce high-quality portraits with diverse details using the diffusion network but incur identity distortion and expensive computational costs. In this work, we propose KDTalker, the first framework to combine unsupervised implicit 3D keypoint with a spatiotemporal diffusion model. Leveraging unsupervised implicit 3D keypoints, KDTalker adapts facial information densities, allowing the diffusion process to model diverse head poses and capture fine facial details flexibly. The custom-designed spatiotemporal attention mechanism ensures accurate lip synchronization, producing temporally consistent, high-quality animations while enhancing computational efficiency. Experimental results demonstrate that KDTalker achieves state-of-the-art performance regarding lip synchronization accuracy, head pose diversity, and execution efficiency. Our codes are available at this https URL.

[CV-87] HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

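Quick read: This paper asks how to evaluate and improve the ability of embodied agents to understand human behavior and its interaction with the scene. It proposes a new task, Human-In-Scene Question Answering (HIS-QA), which requires an agent to comprehend human states and behaviors, reason about the surrounding environment, and answer questions about the humans in the scene. To support the task, the paper builds HIS-Bench, a multimodal benchmark that systematically evaluates abilities ranging from basic perception to commonsense reasoning and planning. Existing vision-language models show significant limitations on HIS-Bench. The key solution is HIS-GPT, the first foundation model for HIS understanding, which integrates 3D scene context and human motion dynamics into a large language model and adds specialized mechanisms to capture human-scene interactions. Experiments show that HIS-GPT sets a new state of the art on HIS-QA tasks.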

Link: https://arxiv.org/abs/2503.12955
Authors: Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.

[CV-88] Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction

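Quick read: This paper addresses the poor video quality that existing Text-Video Prediction (TVP) methods produce because their outputs lack continuity. Earlier methods adapt models pre-trained on text-to-image tasks, which do not transfer well to video generation, and the commonly used Low-Rank Adaptation (LoRA) technique performs poorly when fine-tuning text-to-video (T2V) models. The paper proposes an adaptation-based strategy called Frame-wise Conditioning Adaptation (FCA). A sub-module converts the input text into frame-wise text embeddings that serve as an additional condition for generation, and FCA fine-tunes the T2V model with the initial frame(s) as a further condition. The paper also examines how best to inject these embeddings into the T2V model, validates its design choices through extensive ablation studies with quantitative and qualitative analysis, and reports a new state of the art on the TVP task.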

Link: https://arxiv.org/abs/2503.12953
Authors: Zheyuan Liu, Junyan Wang, Zicheng Duan, Cristian Rodriguez-Opazo, Anton van den Hengel
Affiliations: Australian Institute for Machine Learning, The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 15 figures

Abstract:Text-video prediction (TVP) is a downstream video generation task that requires a model to produce subsequent video frames given a series of initial video frames and text describing the required motion. In practice TVP methods focus on a particular category of videos depicting manipulations of objects carried out by human beings or robot arms. Previous methods adapt models pre-trained on text-to-image tasks, and thus tend to generate video that lacks the required continuity. A natural progression would be to leverage more recent pre-trained text-to-video (T2V) models. This approach is rendered more challenging by the fact that the most common fine-tuning technique, low-rank adaptation (LoRA), yields undesirable results. In this work, we propose an adaptation-based strategy we label Frame-wise Conditioning Adaptation (FCA). Within the module, we devise a sub-module that produces frame-wise text embeddings from the input text, which acts as an additional text condition to aid generation. We use FCA to fine-tune the T2V model, which incorporates the initial frame(s) as an extra condition. We compare and discuss the more effective strategy for injecting such embeddings into the T2V model. We conduct extensive ablation studies on our design choices with quantitative and qualitative performance analysis. Our approach establishes a new state-of-the-art for the task of TVP. The project page is at this https URL .

[CV-89] DivCon-NeRF: Generating Augmented Rays with Diversity and Consistency for Few-shot View Synthesis

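Quick read: This paper addresses the performance bottleneck of NeRF in few-shot settings: conventional NeRF needs many multiview images and is impractical when training data is sparse. It proposes DivCon-NeRF, which improves viewpoint diversity and ray consistency through surface-sphere augmentation and inner-sphere augmentation. Surface-sphere augmentation preserves the distance between the camera and the predicted surface point, so the model can compare the order of high-probability surface points and filter out inconsistent rays without exact depth. Inner-sphere augmentation randomizes angles and distances to produce diverse viewpoints. Together they reduce floaters and visual distortions, and DivCon-NeRF achieves state-of-the-art performance on the Blender, LLFF, and DTU datasets.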

Link: https://arxiv.org/abs/2503.12947
Authors: Ingyun Lee, Jae Won Jang, Seunghyeon Seo, Nojun Kwak
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:Neural Radiance Field (NeRF) has shown remarkable performance in novel view synthesis but requires many multiview images, making it impractical for few-shot scenarios. Ray augmentation was proposed to prevent overfitting for sparse training data by generating additional rays. However, existing methods, which generate augmented rays only near the original rays, produce severe floaters and appearance distortion due to limited viewpoints and inconsistent rays obstructed by nearby obstacles and complex surfaces. To address these problems, we propose DivCon-NeRF, which significantly enhances both diversity and consistency. It employs surface-sphere augmentation, which preserves the distance between the original camera and the predicted surface point. This allows the model to compare the order of high-probability surface points and filter out inconsistent rays easily without requiring the exact depth. By introducing inner-sphere augmentation, DivCon-NeRF randomizes angles and distances for diverse viewpoints, further increasing diversity. Consequently, our method significantly reduces floaters and visual distortions, achieving state-of-the-art performance on the Blender, LLFF, and DTU datasets. Our code will be publicly available.
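
The geometric idea behind surface-sphere augmentation can be sketched as follows: a new camera origin is drawn on the sphere centred at the predicted surface point whose radius equals the original camera-to-surface distance, and the augmented ray points back at that surface point. Sampling uniformly over the whole sphere is an assumption made here for simplicity, and `surface_sphere_ray` is a hypothetical name; the paper's actual sampling range and consistency filter are not given in the abstract. The camera and the surface point are assumed to be distinct.

```python
import math
import random


def surface_sphere_ray(origin, surface_point, rng=random):
    """Return (new_origin, unit_direction) for one augmented ray."""
    radius = math.dist(origin, surface_point)
    # Uniform direction on the unit sphere via normalized Gaussian samples.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-9:
            break
    new_origin = [s + radius * c / norm for s, c in zip(surface_point, v)]
    # The augmented ray looks from the new origin toward the same surface point.
    direction = [(s - o) / radius for s, o in zip(surface_point, new_origin)]
    return new_origin, direction
```

Because every augmented origin sits at the same distance from the surface point, the expected depth along the new ray is known in advance, which is what lets inconsistent rays be detected without exact depth supervision.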

[CV-90] GIFT: Generated Indoor video frames for Texture-less point tracking

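Quick read: This paper addresses the difficulty existing point tracking methods have with video regions that are texture-less or weakly textured. It introduces new metrics for evaluating the texture intensity of a 3D object, uses them to classify the 3D models in ShapeNet into three texture intensity levels, and builds GIFT, a challenging synthetic benchmark of 1800 indoor video sequences with rich annotations. Unlike existing datasets, GIFT anchors ground-truth points precisely on the classified target objects, so each video corresponds to a specific texture intensity level. A comprehensive evaluation on GIFT assesses current methods across texture intensity levels and analyzes how texture affects point tracking.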

Link: https://arxiv.org/abs/2503.12944
Authors: Jianzheng Huang, Xianyu Mo, Ziling Liu, Jinyu Yang, Feng Zheng
Affiliations: Southern University of Science and Technology; tapall.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point tracking is becoming a powerful solver for motion estimation and video editing. Compared to classical feature matching, point tracking methods have the key advantage of robustly tracking points under complex camera motion trajectories and over extended periods. However, despite certain improvements in methodologies, current point tracking methods still struggle to track any position in video frames, especially in areas that are texture-less or weakly textured. In this work, we first introduce metrics for evaluating the texture intensity of a 3D object. Using these metrics, we classify the 3D models in ShapeNet into three levels of texture intensity and create GIFT, a challenging synthetic benchmark comprising 1800 indoor video sequences with rich annotations. Unlike existing datasets that assign ground truth points arbitrarily, GIFT precisely anchors ground truth on classified target objects, ensuring that each video corresponds to a specific texture intensity level. Furthermore, we comprehensively evaluate current methods on GIFT to assess their performance across different texture intensity levels and analyze the impact of texture on point tracking.

[CV-91] L2HCount: Generalizing Crowd Counting from Low to High Crowd Density via Density Simulation

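Quick read: This paper addresses the difficulty of annotating high-density crowd scenes, where heads are small and occlusion is severe, and asks whether a model trained on low-density scenes can generalize to high-density ones. It proposes L2HCount, a low- to high-density generalization framework that learns high-density patterns from low-density scenes. A High-Density Simulation Module and a Ground-Truth Generation Module construct fake high-density images with corresponding ground-truth annotations; a Head Feature Enhancement Module extracts clear features; and a Dual-Density Memory Encoding Module uses two crowd memories to learn scene-specific patterns from low-density and simulated high-density scenes, respectively. Together these let the model generalize across density levels.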

Link: https://arxiv.org/abs/2503.12935
Authors: Guoliang Xu, Jianqin Yin, Ren Zhang, Yonghao Dang, Feng Zhou, Bo Yu
Affiliations: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications; School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Since COVID-19, crowd-counting tasks have gained wide applications. While supervised methods are reliable, annotation is more challenging in high-density scenes due to small head sizes and severe occlusion, whereas it’s simpler in low-density scenes. This raises a question: can we train the model in low-density scenes and generalize it to high-density scenes? We therefore propose a low- to high-density generalization framework (L2HCount) that learns the pattern related to high-density scenes from low-density ones, enabling it to generalize well to high-density scenes. Specifically, we first introduce a High-Density Simulation Module and a Ground-Truth Generation Module to construct fake high-density images and their corresponding ground-truth crowd annotations, respectively, using an image-shifting technique that effectively simulates high-density crowd patterns. However, the simulated images have two issues: image blurring and loss of low-density image characteristics. Second, we therefore propose a Head Feature Enhancement Module to extract clear features in the simulated high-density scene. Third, we propose a Dual-Density Memory Encoding Module that uses two crowd memories to learn scene-specific patterns from low- and simulated high-density scenes, respectively. Extensive experiments on four challenging datasets have shown the promising performance of L2HCount.
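
One way to picture the image-shifting idea is the toy sketch below, which overlays a shifted copy of a low-density image on itself and merges the original and shifted head annotations. Averaging the two copies, the single integer shift `(dx, dy)`, and the function name are assumptions for illustration only; the abstract does not describe the actual blending or how many shifts are combined.

```python
def simulate_high_density(image, points, dx, dy):
    """image: 2D list of grayscale values; points: list of (x, y) head positions."""
    h, w = len(image), len(image[0])
    blended = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # source pixel of the shifted copy
            if 0 <= sx < w and 0 <= sy < h:
                blended[y][x] = (image[y][x] + image[sy][sx]) / 2
    # Ground truth is the union of original and shifted head positions.
    shifted = [
        (x + dx, y + dy)
        for x, y in points
        if 0 <= x + dx < w and 0 <= y + dy < h
    ]
    return blended, list(points) + shifted
```

The averaging also shows where the blurring mentioned in the abstract comes from: each head appears at reduced contrast in two places, which motivates the separate feature enhancement step.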

[CV-92] AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

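Quick read: This paper addresses the poor consistency between generated views and input views in Novel View Synthesis (NVS), especially under large camera pose differences, which leads to low-quality 3D geometry and textures. The authors attribute the problem to existing methods treating all target views with equal priority, whereas they observe empirically that target views closer to the input views have higher fidelity. They propose AR-1-to-3, a next-view prediction paradigm that uses a diffusion model to first generate views close to the input views and then uses them as context to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next prediction, the paper designs a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments show that the method significantly improves consistency between generated and input views and yields high-fidelity 3D assets.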

Link: https://arxiv.org/abs/2503.12929
Authors: Xuying Zhang, Yupeng Zhou, Kai Wang, Yikai Wang, Zhen Li, Xiuli Shao, Daquan Zhou, Qibin Hou, Ming-Ming Cheng
Affiliations: Nankai University; Tsinghua University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Novel view synthesis (NVS) is a cornerstone for image-to-3d creation. However, existing works still struggle to maintain consistency between the generated views and the input views, especially when there is a significant camera pose difference, leading to poor-quality 3D geometries and textures. We attribute this issue to their treatment of all target views with equal priority according to our empirical observation that the target views closer to the input views exhibit higher fidelity. With this inspiration, we propose AR-1-to-3, a novel next-view prediction paradigm based on diffusion models that first generates views close to the input views, which are then utilized as contextual information to progressively synthesize farther views. To encode the generated view subsequences as local and global conditions for the next-view prediction, we accordingly develop a stacked local feature encoding strategy (Stacked-LE) and an LSTM-based global feature encoding strategy (LSTM-GE). Extensive experiments demonstrate that our method significantly improves the consistency between the generated views and the input views, producing high-fidelity 3D assets.
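
The near-to-far generation order can be sketched as a simple schedule: sort the target camera poses by angular distance from the input view, and give each target all previously generated views as context. Representing poses as (azimuth, elevation) pairs in degrees and ordering by great-circle angle are assumptions for this illustration; the abstract does not define the ordering metric, and the function names are hypothetical.

```python
import math


def angular_distance(a, b):
    """Great-circle angle in radians between two (azimuth, elevation) poses in degrees."""
    az1, el1 = math.radians(a[0]), math.radians(a[1])
    az2, el2 = math.radians(b[0]), math.radians(b[1])
    c = math.sin(el1) * math.sin(el2) + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2)
    return math.acos(max(-1.0, min(1.0, c)))


def next_view_order(input_pose, target_poses):
    """Targets sorted from closest to farthest from the input view."""
    return sorted(target_poses, key=lambda p: angular_distance(input_pose, p))


def context_schedule(input_pose, target_poses):
    """Pair each target with the views already available when it is generated."""
    context = [input_pose]
    schedule = []
    for pose in next_view_order(input_pose, target_poses):
        schedule.append((pose, list(context)))
        context.append(pose)
    return schedule
```

In the paper's setting the context views would be encoded by Stacked-LE and LSTM-GE before conditioning the diffusion model; the schedule above only shows which views are available at each step.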

[CV-93] MMLNB: Multi-Modal Learning for Neuroblastoma Subtyping Classification Assisted with Textual Description Generation

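[Quick Read]: This paper addresses pathological subtype classification of neuroblastoma (NB), where large histological variability makes traditional diagnosis subjective, time-consuming, and inconsistent. The proposed solution is MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. Its core is a two-stage strategy: first, a vision-language model (VLM) is fine-tuned to strengthen pathology-aware text generation; second, a dual-branch architecture extracts visual and textual features separately, and a Progressive Robust Multi-Modal Fusion (PRMF) module fuses them. Experiments show that MMLNB is more accurate than single-modal models, and ablation studies confirm the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. The work provides a scalable AI-driven framework for digital pathology that improves the reliability and interpretability of NB subtype classification.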

Link: https://arxiv.org/abs/2503.12927
Authors: Huangwei Chen, Zhu Zhu, Zhenyu Yan, Yifei Chen, Mingyang Ding, Chenlei Li, Feiwei Qin
Affiliations: Hangzhou Dianzi University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 7 figures

Abstract:Neuroblastoma (NB), a leading cause of childhood cancer mortality, exhibits significant histopathological variability, necessitating precise subtyping for accurate prognosis and treatment. Traditional diagnostic methods rely on subjective evaluations that are time-consuming and inconsistent. To address these challenges, we introduce MMLNB, a multi-modal learning (MML) model that integrates pathological images with generated textual descriptions to improve classification accuracy and interpretability. The approach follows a two-stage process. First, we fine-tune a Vision-Language Model (VLM) to enhance pathology-aware text generation. Second, the fine-tuned VLM generates textual descriptions, using a dual-branch architecture to independently extract visual and textual features. These features are fused via Progressive Robust Multi-Modal Fusion (PRMF) Block for stable training. Experimental results show that the MMLNB model is more accurate than the single modal model. Ablation studies demonstrate the importance of multi-modal fusion, fine-tuning, and the PRMF mechanism. This research creates a scalable AI-driven framework for digital pathology, enhancing reliability and interpretability in NB subtyping classification. Our source code is available at this https URL.

[CV-94] Efficient Multimodal 3D Object Detector via Instance-Level Contrastive Distillation

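[Quick Read]: This paper addresses the challenges that modality heterogeneity (such as unbalanced convergence and modal misalignment) between geometry-aware LiDAR point clouds and semantically rich RGB images poses for multimodal 3D object detection, together with the difficulty of capturing long-range dependencies when detection-oriented features are large. The key contributions are the Instance-level Contrastive Distillation (ICD) framework and the Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency; CLFM provides an efficient and scalable fusion strategy that strengthens cross-modal global interactions within large multimodal BEV features. On the KITTI and nuScenes datasets, the method outperforms existing state-of-the-art methods while being more efficient.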

Link: https://arxiv.org/abs/2503.12914
Authors: Zhuoqun Su, Huimin Lu, Shuaifeng Jiao, Junhao Xiao, Yaonan Wang, Xieyuanli Chen
Affiliations: College of Intelligence Science and Technology, National University of Defense Technology; National Key Laboratory of Equipment State Sensing and Smart Support, National University of Defense Technology; College of Electrical and Information Engineering, Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: this https URL.
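
To make the idea of aligning instance-level image features with LiDAR features concrete, here is a minimal sketch of a contrastive objective. The abstract does not give the loss, so the InfoNCE form, the cosine similarity, the temperature of 0.1, and the names `cosine` and `instance_contrastive_loss` are assumptions, not the paper's ICD formulation. Each image feature is pulled toward the LiDAR feature of the same object and pushed away from the others.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def instance_contrastive_loss(
    image_feats: List[Sequence[float]],
    lidar_feats: List[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss where image_feats[i] should match lidar_feats[i]."""
    assert image_feats and len(image_feats) == len(lidar_feats)
    total = 0.0
    for i, img in enumerate(image_feats):
        logits = [cosine(img, lid) / temperature for lid in lidar_feats]
        # Numerically stable log-sum-exp over all LiDAR instances.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / len(image_feats)
```

The loss is close to zero when every image feature is most similar to its own LiDAR counterpart and grows when instances are confused with one another.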

[CV-95] Pose as a Modality: A Psychology-Inspired Network for Personality Recognition with a New Multimodal Dataset AAAI2025

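[Quick Read]: This paper addresses the unsatisfactory performance of existing computational models that predict the Big Five personality traits from multimodal data. Psychological research has shown a strong correlation between body pose and personality traits, yet previous computational models have largely ignored pose data. To fill this gap, the paper presents a new multimodal dataset that includes full-body pose data and designs a Psychology-Inspired Network (PINet). PINet has three core modules: Multimodal Feature Awareness (MFA), Multimodal Feature Interaction (MFI), and the Psychology-Informed Modality Correlation Loss (PIMC Loss). The MFA module captures comprehensive personality-related visual features, the MFI module efficiently fuses multimodal features, and the PIMC Loss, grounded in psychological theory, guides the model to emphasize different modalities for different personality dimensions. Experiments show that PINet outperforms several baseline models and that the three modules contribute almost equally to overall performance. Adding pose data significantly improves performance, with the pose modality ranking mid-level in importance among the five modalities. These findings fill the gap left by personality datasets that lack full-body pose data, offer a new way to improve personality prediction accuracy, and highlight the value of integrating psychological insights into AI frameworks.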

Link: https://arxiv.org/abs/2503.12912
Authors: Bin Tang, Keqi Pan, Miao Zheng, Ning Zhou, Jialu Sui, Dandan Zhu, Cheng-Long Deng, Shu-Guang Kuai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures, AAAI 2025 Oral

Abstract:In recent years, predicting Big Five personality traits from multimodal data has received significant attention in artificial intelligence (AI). However, existing computational models often fail to achieve satisfactory performance. Psychological research has shown a strong correlation between pose and personality traits, yet previous research has largely ignored pose data in computational models. To address this gap, we develop a novel multimodal dataset that incorporates full-body pose data. The dataset includes video recordings of 287 participants completing a virtual interview with 36 questions, along with self-reported Big Five personality scores as labels. To effectively utilize this multimodal data, we introduce the Psychology-Inspired Network (PINet), which consists of three key modules: Multimodal Feature Awareness (MFA), Multimodal Feature Interaction (MFI), and Psychology-Informed Modality Correlation Loss (PIMC Loss). The MFA module leverages the Vision Mamba Block to capture comprehensive visual features related to personality, while the MFI module efficiently fuses the multimodal features. The PIMC Loss, grounded in psychological theory, guides the model to emphasize different modalities for different personality dimensions. Experimental results show that the PINet outperforms several state-of-the-art baseline models. Furthermore, the three modules of PINet contribute almost equally to the model’s overall performance. Incorporating pose data significantly enhances the model’s performance, with the pose modality ranking mid-level in importance among the five modalities. These findings address the existing gap in personality-related datasets that lack full-body pose data and provide a new approach for improving the accuracy of personality prediction models, highlighting the importance of integrating psychological insights into AI frameworks.

[CV-96] MFP-CLIP: Exploring the Efficacy of Multi-Form Prompts for Zero-Shot Industrial Anomaly Detection

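[Quick Read]: This paper addresses insufficient representations in zero-shot anomaly detection (ZSAD), which make the boundaries of small and complex defects hard to capture. Existing methods usually rely on a single hand-designed prompt, which adapts poorly to diverse objects and anomaly types. The paper proposes MFP-CLIP, a CLIP-based framework built on multi-form prompts. Its key components are image-to-text prompting (I2TP), self prompting (SP), and a multi-patch feature aggregation (MPFA) module, which strengthen the perception of multi-scale and complex anomalies, plus a mask prompting (MP) module that precisely localizes defect regions. Experiments show strong results on two industrial anomaly detection benchmarks, MVTecAD and VisA.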

Link: https://arxiv.org/abs/2503.12910
Authors: Jingyi Yuan, Pengyu Jie, Junyin Zhang, Ziao Li, Chenqiang Gao
Affiliations: School of Intelligent Systems Engineering, Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, zero-shot anomaly detection (ZSAD) has emerged as a pivotal paradigm for identifying defects in unseen categories without requiring target samples in the training phase. However, existing ZSAD methods struggle with the boundaries of small and complex defects due to insufficient representations. Most of them use a single manually designed prompt, which fails to work for diverse objects and anomalies. In this paper, we propose MFP-CLIP, a novel prompt-based CLIP framework which explores the efficacy of multi-form prompts for zero-shot industrial anomaly detection. We employ an image-to-text prompting (I2TP) mechanism to better represent the object in the image. MFP-CLIP enhances perception of multi-scale and complex anomalies through self prompting (SP) and a multi-patch feature aggregation (MPFA) module. To precisely localize defects, we introduce the mask prompting (MP) module to guide the model to focus on potential anomaly regions. Extensive experiments are conducted on two widely used industrial anomaly detection benchmarks, MVTecAD and VisA, demonstrating MFP-CLIP’s superiority in ZSAD.

[CV-97] UCF-Crime-DVS: A Novel Event-Based Dataset for Video Anomaly Detection with Spiking Neural Networks AAAI2025

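[Quick Read]: This paper addresses how to exploit the rich dynamic information in event data from dynamic vision sensors (DVS) to improve anomaly recognition in video anomaly detection. Its key contributions are UCF-Crime-DVS, the first DVS-based video anomaly detection benchmark dataset, and a Multi-Scale Spiking Fusion Network (MSF) built on spiking neural networks (SNNs). With these, the paper explores the potential of DVS event data for weakly supervised video anomaly detection, verifies the effectiveness and superior performance of the proposed framework, and establishes a new baseline for SNN-based weakly supervised video anomaly detection.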

Link: https://arxiv.org/abs/2503.12905
Authors: Yuanbin Qian, Shuhan Ye, Chong Wang, Xiaojie Cai, Jiangbo Qian, Jiafei Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted by AAAI 2025

Abstract:Video anomaly detection plays a significant role in intelligent surveillance systems. To enhance a model’s anomaly recognition ability, previous works have typically used RGB, optical flow, and text features. Recently, dynamic vision sensors (DVS) have emerged as a promising technology that captures visual information as discrete events with very high dynamic range and temporal resolution. Compared to conventional cameras, they reduce data redundancy and improve the capture of moving objects. To introduce this rich dynamic information into the surveillance field, we created the first DVS video anomaly detection benchmark, namely UCF-Crime-DVS. To fully utilize this new data modality, a multi-scale spiking fusion network (MSF) is designed based on spiking neural networks (SNNs). This work explores the potential application of dynamic information from event data in video anomaly detection. Our experiments demonstrate the effectiveness of our framework on UCF-Crime-DVS and its superior performance compared to other models, establishing a new baseline for SNN-based weakly supervised video anomaly detection.

[CV-98] UncTrack: Reliable Visual Object Tracking with Uncertainty-Aware Prototype Memory Network

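[Quick Read]: This paper addresses the fact that existing Transformer-based trackers ignore localization uncertainty when predicting the target state, which limits their reliability in challenging scenarios. The key to the solution is UncTrack, an uncertainty-aware Transformer tracker that predicts target localization uncertainty and integrates it during inference for more accurate state estimation. Specifically, UncTrack uses a Transformer encoder for feature interaction between the template and search images, and an uncertainty-aware localization decoder (ULD) coarsely predicts the corner-based localization and its uncertainty. The uncertainty is then passed to a prototype memory network (PMN), which mines historical information to judge whether the current prediction is reliable. High-confidence samples are fed back to update the memory bank, making the model more robust to appearance changes. Experiments show that the method outperforms other state-of-the-art methods.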

Link: https://arxiv.org/abs/2503.12888
Authors: Siyuan Yao, Yang Guo, Yanyang Yan, Wenqi Ren, Xiaochun Cao
Affiliations: School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications; State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 11 figures, references added

Abstract:Transformer-based trackers have achieved promising success and become the dominant tracking paradigm due to their accuracy and efficiency. Despite the substantial progress, most of the existing approaches tackle object tracking as a deterministic coordinate regression problem, while the target localization uncertainty has been greatly overlooked, which hampers trackers’ ability to maintain reliable target state prediction in challenging scenarios. To address this issue, we propose UncTrack, a novel uncertainty-aware transformer tracker that predicts the target localization uncertainty and incorporates this uncertainty information for accurate target state inference. Specifically, UncTrack utilizes a transformer encoder to perform feature interaction between template and search images. The output features are passed into an uncertainty-aware localization decoder (ULD) to coarsely predict the corner-based localization and the corresponding localization uncertainty. Then the localization uncertainty is sent into a prototype memory network (PMN) to excavate valuable historical information to identify whether the target state prediction is reliable or not. To enhance the template representation, the samples with high confidence are fed back into the prototype memory bank for memory updating, making the tracker more robust to challenging appearance variations. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at this https URL.
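
The feedback loop in which only high-confidence samples update the memory bank can be sketched as follows. This is a deliberately simplified illustration: the paper's PMN is a learned network, whereas the fixed threshold `max_uncertainty`, the scalar `uncertainty` field, and the class names `Prediction` and `PrototypeMemory` are hypothetical stand-ins.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners


@dataclass
class Prediction:
    box: Box
    uncertainty: float  # hypothetical scalar; lower means more reliable


class PrototypeMemory:
    """Fixed-size bank that only accepts low-uncertainty predictions."""

    def __init__(self, capacity: int = 5, max_uncertainty: float = 0.3) -> None:
        self.bank: Deque[Prediction] = deque(maxlen=capacity)
        self.max_uncertainty = max_uncertainty

    def is_reliable(self, pred: Prediction) -> bool:
        return pred.uncertainty <= self.max_uncertainty

    def update(self, pred: Prediction) -> bool:
        """Store the prediction only if it is reliable; report whether it was stored."""
        if self.is_reliable(pred):
            self.bank.append(pred)  # the oldest entry is evicted when full
            return True
        return False
```

Gating the update this way keeps occluded or drifting frames out of the template pool, which is the intuition behind feeding back only confident samples.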

[CV-99] RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars

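[Quick Read]: This paper addresses high-quality head avatar reconstruction at speeds sufficient for on-the-fly reconstruction. The proposed method, RGBAvatar, uses a multilayer perceptron (MLP) to map tracked 3D morphable model (3DMM) parameters to reduced Gaussian blendshape weights, yielding a compact set of blendshape bases. This avoids relying on the fixed base weights of a conventional 3DMM, captures the essential facial details of a specific individual more effectively, and improves both reconstruction quality and efficiency. The paper also introduces a new color initialization estimation method and a batch-parallel Gaussian rasterization process, and combines them with a local-global sampling strategy to reach quality comparable to offline settings, with a training throughput of about 630 images per second.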

Link: https://arxiv.org/abs/2503.12886
Authors: Linzhou Li, Yumeng Li, Yanlin Weng, Youyi Zheng, Kun Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. Unlike prior approaches that utilize linear bases from 3D morphable models (3DMM) to model Gaussian blendshapes, our method maps tracked 3DMM parameters into reduced blendshape weights with an MLP, leading to a compact set of blendshape bases. The learned compact base composition effectively captures essential facial details for specific individuals, and does not rely on the fixed base composition weights of 3DMM, leading to enhanced reconstruction quality and higher efficiency. To further expedite the reconstruction process, we develop a novel color initialization estimation method and a batch-parallel Gaussian rasterization process, achieving state-of-the-art quality with training throughput of about 630 images per second. Moreover, we propose a local-global sampling strategy that enables direct on-the-fly reconstruction, immediately reconstructing the model as video streams in real time while achieving quality comparable to offline settings. Our source code is available at this https URL.

[CV-100] DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

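[Quick Read]: This paper addresses the difficulty that existing image-conditioned generation methods (such as depth- or Canny-edge-conditioned approaches) have in precisely controlling the content of multiple instances (or regions), in particular attribute leakage, which limits user control. The proposed solution is DreamRenderer, a training-free approach built on the FLUX model. It rests on two innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which use replicated image tokens as bridge tokens so that T5 text embeddings bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only in vital layers, with soft binding in the remaining layers. This keeps control precise while preserving image quality. Experiments show that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and improves layout-to-image models such as GLIGEN and 3DIS by up to 26.8%.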

Link: https://arxiv.org/abs/2503.12885
Authors: Dewei Zhou, Mingwei Li, Zongxin Yang, Yi Yang
Affiliations: RELER; CCAI; Zhejiang University; DBMI; HMS; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Image-conditioned generation methods, such as depth- and canny-conditioned approaches, have demonstrated remarkable abilities for precise image synthesis. However, existing models still struggle to accurately control the content of multiple instances (or regions). Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. DreamRenderer enables users to control the content of each instance via bounding boxes or masks, while ensuring overall visual harmony. We propose two key innovations: 1) Bridge Image Tokens for Hard Text Attribute Binding, which uses replicated image tokens as bridge tokens to ensure that T5 text embeddings, pre-trained solely on text data, bind the correct visual attributes for each instance during Joint Attention; 2) Hard Image Attribute Binding applied only to vital layers. Through our analysis of FLUX, we identify the critical layers responsible for instance attribute rendering and apply Hard Image Attribute Binding only in these layers, using soft binding in the others. This approach ensures precise control while preserving image quality. Evaluations on the COCO-POS and COCO-MIG benchmarks demonstrate that DreamRenderer improves the Image Success Ratio by 17.7% over FLUX and enhances the performance of layout-to-image models like GLIGEN and 3DIS by up to 26.8%. Project Page: this https URL.

[CV-101] An interpretable approach to automating the assessment of biofouling in video footage

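[Quick Read]: This paper addresses the risk that biofouling spreads invasive marine species and diseases, and proposes an efficient and effective automated assessment method. International vessels must provide evidence of their biofouling management practices, and verifying that these measures work requires collecting and analyzing large amounts of underwater imagery, which is slow and laborious when done manually. The key to the solution is the interpretable Component Features (ComFe) approach built on a DINOv2 Vision Transformer (ViT) foundation model. Compared with previous non-interpretable convolutional neural network (CNN) methods, ComFe performs better with significantly fewer parameters, and it is more transparent because it identifies which image regions contribute to a classification and which training images lead to that conclusion. All code, data, and model weights are publicly released.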

Link: https://arxiv.org/abs/2503.12875
Authors: Evelyn J. Mannix, Bartholomew A. Woodham
Affiliations: Centre of Excellence for Biosecurity Risk Analysis, The University of Melbourne, Melbourne, Australia; Melbourne Centre for Data Science, The University of Melbourne, Melbourne, Australia; Biosecurity Animal Division, Department of Agriculture, Fisheries and Forestry, Canberra, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Biofouling – communities of organisms that grow on hard surfaces immersed in water – provides a pathway for the spread of invasive marine species and diseases. To address this risk, international vessels are increasingly being obligated to provide evidence of their biofouling management practices. Verification that these activities are effective requires underwater inspections, using divers or underwater remotely operated vehicles (ROVs), and the collection and analysis of large amounts of imagery and footage. Automated assessment using computer vision techniques can significantly streamline this process, and this work shows how this challenge can be addressed efficiently and effectively using the interpretable Component Features (ComFe) approach with a DINOv2 Vision Transformer (ViT) foundation model. ComFe is able to obtain improved performance in comparison to previous non-interpretable Convolutional Neural Network (CNN) methods, with significantly fewer weights and greater transparency – through identifying which regions of the image contribute to the classification, and which images in the training data lead to that conclusion. All code, data and model weights are publicly released.

[CV-102] Evolution-based Region Adversarial Prompt Learning for Robustness Enhancement in Vision-Language Models

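[Quick Read]: This paper addresses the limited adversarial robustness of large pre-trained vision-language models (VLMs) such as CLIP. Previous work has explored robust text prompts through adversarial training, but these methods mainly rely on perturbations along a single gradient direction (e.g., PGD) to generate adversarial examples (AEs). The resulting AEs lack diversity, which limits the gain in adversarial robustness.
The key to the solution is an evolution-based region adversarial prompt tuning method, ER-APT. It combines gradient methods with genetic evolution, using selection, mutation, and crossover to generate more diverse and challenging AEs. A dynamic loss weighting method is also introduced to adjust prompt learning efficiency and balance accuracy and robustness. Experiments on multiple benchmark datasets show that ER-APT outperforms existing state-of-the-art adversarial prompt tuning methods.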

Link: https://arxiv.org/abs/2503.12874
Authors: Xiaojun Jia, Sensen Gao, Simeng Qin, Ke Ma, Xinfeng Li, Yihao Huang, Wei Dong, Yang Liu, Xiaochun Cao
Affiliations: Nanyang Technological University; Northeastern University (Shenyang, China); Mohamed Bin Zayed University of Artificial Intelligence; University of Chinese Academy of Sciences; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive generalization but remain highly vulnerable to adversarial examples (AEs). Previous work has explored robust text prompts through adversarial training, achieving some improvement in both robustness and generalization. However, these methods primarily rely on single-gradient-direction perturbations (e.g., PGD) to generate AEs, which lack diversity, resulting in limited improvement in adversarial robustness. To address these limitations, we propose an evolution-based region adversarial prompt tuning method called ER-APT, which combines gradient methods with genetic evolution to generate more diverse and challenging AEs. In each training iteration, we first generate AEs using traditional gradient-based methods. Subsequently, a genetic evolution mechanism incorporating selection, mutation, and crossover is applied to optimize the AEs, ensuring broader and more aggressive perturbations. The final evolved AEs are used for prompt tuning, achieving region-based adversarial optimization instead of conventional single-point adversarial prompt tuning. We also propose a dynamic loss weighting method to adjust prompt learning efficiency for accuracy and robustness. Experimental evaluations on various benchmark datasets demonstrate the superiority of our proposed method, outperforming state-of-the-art APT methods. The code is released at this https URL.
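
The selection, crossover, and mutation steps named in the abstract can be sketched on plain lists of numbers. This is a toy illustration under stated assumptions, not the authors' code: perturbations are flat float lists, `fitness` is a hypothetical callback standing in for the model loss on the perturbed input, the initial population would come from a gradient-based attack such as PGD in the paper, and the uniform crossover, Gaussian mutation, and keep-the-top-half selection are choices made here for brevity.

```python
import random
from typing import Callable, List

Perturbation = List[float]


def clip(delta: Perturbation, eps: float) -> Perturbation:
    """Project a perturbation back into the L-infinity ball of radius eps."""
    return [max(-eps, min(eps, d)) for d in delta]


def evolve(
    population: List[Perturbation],
    fitness: Callable[[Perturbation], float],
    eps: float,
    generations: int = 10,
    mutation_scale: float = 0.01,
    seed: int = 0,
) -> List[Perturbation]:
    """Evolve perturbations and return them sorted from fittest to least fit."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        # Selection: keep the fitter half, so the current best always survives.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children: List[Perturbation] = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            # Crossover: take each coordinate from one of the two parents.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Mutation: add small Gaussian noise, then re-project into the ball.
            child = [c + rng.gauss(0.0, mutation_scale) for c in child]
            children.append(clip(child, eps))
        population = parents + children
    return sorted(population, key=fitness, reverse=True)
```

Because the surviving parents are carried over unchanged, the best fitness never decreases across generations, while the children spread out over a region around the gradient-based starting points instead of staying on a single point.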

[CV-103] UniReg: Foundation Model for Controllable Medical Image Registration

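[Quick Read]: This paper addresses the limited generalizability of learning-based medical image registration across clinical scenarios: different registration tasks (such as inter-/intra-subject registration or organ-specific alignment) each require a separate network, which raises development cost. To overcome this, the paper proposes UniReg, the first interactive foundation model for medical image registration. Its key innovation is a unified framework for diverse registration scenarios, realized through conditional deformation field estimation within a single registration model. By explicitly encoding anatomical structure priors, registration type constraints (inter-/intra-subject), and instance-specific features, the model generates deformation fields suited to each scenario, combining the precision of task-specific learning methods with the generalization of traditional optimization methods.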

Link: https://arxiv.org/abs/2503.12868
Authors: Zi Li, Jianpeng Zhang, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Zeli Chen, Xianghua Ye, Le Lu, Dakai Jin
Affiliations: DAMO Academy, Alibaba Group; College of Computer Science and Technology, Zhejiang University; The First Affiliated Hospital of College of Medicine, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning-based medical image registration has achieved performance parity with conventional methods while demonstrating a substantial advantage in computational efficiency. However, learning-based registration approaches lack generalizability across diverse clinical scenarios, requiring the laborious development of multiple isolated networks for specific registration tasks, e.g., inter-/intra-subject registration or organ-specific alignment. To overcome this limitation, we propose UniReg, the first interactive foundation model for medical image registration, which combines the precision advantages of task-specific learning methods with the generalization of traditional optimization methods. Our key innovation is a unified framework for diverse registration scenarios, achieved through a conditional deformation field estimation within a unified registration model. This is realized through a dynamic learning paradigm that explicitly encodes: (1) anatomical structure priors, (2) registration type constraints (inter/intra-subject), and (3) instance-specific features, enabling the generation of scenario-optimal deformation fields. Through comprehensive experiments encompassing 90 anatomical structures at different body regions, our UniReg model demonstrates comparable performance with contemporary state-of-the-art methodologies while achieving ~50% reduction in required training iterations relative to the conventional learning-based paradigm. This optimization contributes to a significant reduction in computational resources, such as training time. Code and model will be available.

[CV-104] SCAP: Transductive Test-Time Adaptation via Supportive Clique-based Attribute Prompting CVPR2025

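[Quick Read]: This paper addresses the performance drop of vision-language models (VLMs) under domain shifts, particularly when the test data distribution changes. Existing test-time adaptation (TTA) methods mainly adapt to individual samples and overlook cross-sample correlations within a batch. Recent ViT-based TTA methods introduce batch-level adaptation, but they integrate the text modality inadequately and remain suboptimal. The paper therefore proposes a transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP). SCAP builds supportive cliques in an unsupervised manner and learns a fine-grained attribute prompt for each clique, combining visual and textual information to strengthen adaptation. It also includes a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Experiments on multiple benchmarks show that SCAP outperforms existing state-of-the-art methods and substantially improves VLM generalization under domain shifts.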

Link: https://arxiv.org/abs/2503.12866
Authors: Chenyu Zhang, Kunlun Xu, Zichen Liu, Yuxin Peng, Jiahuan Zhou
Affiliations: Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Vision-language models (VLMs) encounter considerable challenges when adapting to domain shifts stemming from changes in data distribution. Test-time adaptation (TTA) has emerged as a promising approach to enhance VLM performance under such conditions. In practice, test data often arrives in batches, leading to increasing interest in the transductive TTA setting. However, existing TTA methods primarily focus on individual test samples, overlooking crucial cross-sample correlations within a batch. While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. SCAP first forms supportive cliques of test samples in an unsupervised manner based on visual similarity and learns an attribute prompt for each clique, capturing shared attributes critical for adaptation. For each test sample, SCAP aggregates attribute prompts from its associated cliques, providing enriched contextual information. To ensure adaptability over time, we incorporate a retention module that dynamically updates attribute prompts and their associated attributes as new data arrives. Comprehensive experiments across multiple benchmarks demonstrate that SCAP outperforms existing state-of-the-art methods, significantly advancing VLM generalization under domain shifts. Our code is available at this https URL.

[CV-105] CAT-3DGS Pro: A New Benchmark for Efficient 3DGS Compression

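[Quick Read]: This paper addresses rate-distortion-optimized compression of 3D Gaussian Splatting (3DGS) representations for transmission and storage, while overcoming the long training time and slow decoding of existing methods such as CAT-3DGS. The key elements are: (1) a PCA-guided vector-matrix hyperprior that replaces the triplane hyperprior and reduces redundant parameters; (2) an alternate optimization strategy (A-RDO) that gives a more balanced rate-distortion trade-off and faster encoding; (3) a refined version of the sampling rate optimization in CAT-3DGS that further improves rate-distortion performance. Together these yield a 46.6% BD-rate reduction and a 3x training speedup on BungeeNeRF, and a 5x decoding speedup on the Amsterdam scene, compared with CAT-3DGS.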

Link: https://arxiv.org/abs/2503.12862
Authors: Yu-Ting Zhan, He-bi Yang, Cheng-Yuan Ho, Jui-Chiu Chiang, Wen-Hsiao Peng
Affiliations: National Yang Ming Chiao Tung University, Taiwan; National Chung Cheng University, Taiwan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has shown immense potential for novel view synthesis. However, achieving rate-distortion-optimized compression of 3DGS representations for transmission and/or storage applications remains a challenge. CAT-3DGS introduces a context-adaptive triplane hyperprior for end-to-end optimized compression, delivering state-of-the-art coding performance. Despite this, it requires prolonged training and decoding time. To address these limitations, we propose CAT-3DGS Pro, an enhanced version of CAT-3DGS that improves both compression performance and computational efficiency. First, we introduce a PCA-guided vector-matrix hyperprior, which replaces the triplane-based hyperprior to reduce redundant parameters. To achieve a more balanced rate-distortion trade-off and faster encoding, we propose an alternate optimization strategy (A-RDO). Additionally, we refine the sampling rate optimization method in CAT-3DGS, leading to significant improvements in rate-distortion performance. These enhancements result in a 46.6% BD-rate reduction and 3x speedup in training time on BungeeNeRF, while achieving 5x acceleration in decoding speed for the Amsterdam scene compared to CAT-3DGS.

[CV-106] Adaptive Transformer Attention and Multi-Scale Fusion for Spine 3D Segmentation

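[Quick Read]: This paper addresses accuracy and robustness in 3D semantic segmentation of spine medical images. To handle the complex anatomy in spinal images, it introduces a multi-scale fusion mechanism that strengthens feature extraction by using information at different scales, improving recognition of the target region, and an adaptive attention mechanism that lets the model dynamically adjust its focus on key regions, improving boundary segmentation. Experiments show that the model outperforms 3D CNN, 3D U-Net, and 3D U-Net + Transformer baselines on mIoU, mDice, and mAcc, and visual analysis indicates that it restores the true anatomical structure more faithfully.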

Link: https://arxiv.org/abs/2503.12853
Authors: Yanlin Xiang, Qingyuan He, Ting Xu, Ran Hao, Jiacheng Hu, Hanchao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This study proposes a 3D semantic segmentation method for the spine based on the improved SwinUNETR to improve segmentation accuracy and robustness. Aiming at the complex anatomical structure of spinal images, this paper introduces a multi-scale fusion mechanism to enhance the feature extraction capability by using information of different scales, thereby improving the recognition accuracy of the model for the target area. In addition, the introduction of the adaptive attention mechanism enables the model to dynamically adjust the attention to the key area, thereby optimizing the boundary segmentation effect. The experimental results show that compared with 3D CNN, 3D U-Net, and 3D U-Net + Transformer, the model of this study has achieved significant improvements in mIoU, mDice, and mAcc indicators, and has better segmentation performance. The ablation experiment further verifies the effectiveness of the proposed improved method, proving that multi-scale fusion and adaptive attention mechanism have a positive effect on the segmentation task. Through the visualization analysis of the inference results, the model can better restore the real anatomical structure of the spinal image. Future research can further optimize the Transformer structure and expand the data scale to improve the generalization ability of the model. This study provides an efficient solution for the task of medical image segmentation, which is of great significance to intelligent medical image analysis.
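
For readers unfamiliar with the reported metrics, the sketch below computes per-class IoU and Dice from flattened voxel labels and averages them into mIoU and mDice. The abstract does not state how the paper averages over classes or volumes, so the conventions here (skipping classes absent from both prediction and ground truth, and the names `per_class_iou_dice` and `mean`) are assumptions.

```python
from typing import Dict, List, Sequence


def per_class_iou_dice(
    pred: Sequence[int], target: Sequence[int], num_classes: int
) -> Dict[str, List[float]]:
    """Per-class IoU and Dice from flattened integer label maps."""
    ious: List[float] = []
    dices: List[float] = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        if tp + fp + fn == 0:
            continue  # class absent from both maps; leave it out of the mean
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return {"iou": ious, "dice": dices}


def mean(values: List[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    scores = per_class_iou_dice([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
    print("mIoU:", mean(scores["iou"]), "mDice:", mean(scores["dice"]))
```

Dice is never lower than IoU for the same class, since Dice = 2·IoU / (1 + IoU), which is why the two metrics usually move together in segmentation tables.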

[CV-107] ACT360: An Efficient 360-Degree Action Detection and Summarization Framework for Mission-Critical Training and Debriefing

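[Quick Read]: This paper addresses the limitations of traditional post-training analysis in high-stakes, mission-critical settings such as disaster response, military exercises, and industrial safety. Traditional analysis relies on manual review of 2D video, which is inefficient and lacks comprehensive situational awareness. The paper proposes ACT360, a system that combines 360-degree video with machine learning for automated action detection and structured debriefing. Key contributions include: (1) 360YOWO, a model that uses spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortion; (2) quantization and model pruning that shrink the model by 74% while keeping accuracy high (mAP drops by only 1.5%, from 0.865 to 0.850) and improving inference speed; (3) 360AIE, a web-based interface that uses large language models (LLMs) for automatic action detection, retrieval, and textual summarization, improving post-incident analysis efficiency. Together these form a general debriefing framework for any training environment that needs lightweight action detection and structured analysis.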

Link: https://arxiv.org/abs/2503.12852
Authors: Aditi Tiwari, Klara Nahrstedt
Affiliations: University of Illinois Urbana-Champaign (UIUC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, 8 figures

Abstract:Effective training and debriefing are critical in high-stakes, mission-critical environments such as disaster response, military simulations, and industrial safety, where precision and minimizing errors are paramount. The traditional post-training analysis relies on manually reviewing 2D videos, a time-consuming process that lacks comprehensive situational awareness. To address these limitations, we introduce ACT360, a system that leverages 360-degree videos and machine learning for automated action detection and structured debriefing. ACT360 integrates 360YOWO, an enhanced You Only Watch Once (YOWO) model with spatial attention and equirectangular-aware convolution (EAC) to mitigate panoramic video distortions. To enable deployment in resource-constrained environments, we apply quantization and model pruning, reducing the model size by 74% while maintaining robust accuracy (mAP drop of only 1.5%, from 0.865 to 0.850) and improving inference speed. We validate our approach on a publicly available dataset of 55 labeled 360-degree videos covering seven key operational actions, recorded across various real-world training sessions and environmental conditions. Additionally, ACT360 integrates 360AIE (Action Insight Explorer), a web-based interface for automatic action detection, retrieval, and textual summarization using large language models (LLMs), significantly enhancing post-incident analysis efficiency. ACT360 serves as a generalized framework for mission-critical debriefing, incorporating EAC, spatial attention, summarization, and model optimization. These innovations apply to any training environment requiring lightweight action detection and structured post-exercise analysis.
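
The "1.5%" accuracy cost quoted in the abstract matches the absolute difference between the two reported mAP values. The short check below uses only the numbers stated above; the 100 MB baseline size is a made-up placeholder, since the abstract does not give the original model size.

```python
def absolute_drop(before: float, after: float) -> float:
    """Difference in mAP, in absolute terms."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Difference in mAP as a fraction of the starting value."""
    return (before - after) / before


def compressed_size(original_mb: float, reduction: float) -> float:
    """Size left after removing `reduction` (a fraction) of the model."""
    return original_mb * (1.0 - reduction)


# Values quoted in the abstract.
map_before, map_after = 0.865, 0.850
abs_drop = absolute_drop(map_before, map_after)   # 0.015, i.e. 1.5 points
rel_drop = relative_drop(map_before, map_after)   # roughly 1.7% of 0.865

# Hypothetical 100 MB model, used only to illustrate a 74% reduction.
remaining_mb = compressed_size(100.0, 0.74)

print(f"absolute drop: {abs_drop:.3f}")
print(f"relative drop: {rel_drop:.2%}")
print(f"remaining size: {remaining_mb:.1f} MB")
```

Read this way, the reported drop is 1.5 mAP points, which is slightly larger when expressed relative to the 0.865 baseline.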

[CV-108] Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment CVPR2025

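[Quick read]: This paper addresses accurate localization of audible objects from audio-visual cues. Prior methods handle ambiguous audio-visual correspondences poorly (for example, objects that look alike but sound different, or objects whose sounding status changes frequently), which can lead to over- or under-segmentation. To overcome these limitations, the paper proposes a new framework with two core components: an Audio-Guided Modality Alignment (AMA) module and an Uncertainty Estimation (UE) module. AMA restricts audio-visual interaction to multiple groups and consolidates group features into compact representations according to their responsiveness to audio cues, steering the model toward audio-relevant regions; it also uses contrastive learning to separate sounding regions from silent ones. UE combines spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sounding status, and lowers confidence in those regions to reduce prediction errors. Experiments show that the method clearly outperforms existing state-of-the-art methods in complex scenes.

Link: https://arxiv.org/abs/2503.12847
Authors: Chen Liu,Peike Li,Liying Yang,Dadong Wang,Lincheng Li,Xin Yu
Affiliations: The University of Queensland; NetEase Fuxi AI Lab; Matrix Verse AI; CSIRO Data61; Macau University of Science and Technology
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025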

Abstract:Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects’ sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model’s attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.

[CV-109] GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

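[Quick read]: This paper addresses the major mobility challenges faced by blind and low-vision (BLV) people, in particular the fall risk caused by insufficient spatial understanding. The key to the solution is the GuideDog dataset, which eases the limited scale of existing datasets. BLV-aware annotation traditionally requires domain expertise and intensive labor; GuideDog instead uses a collaborative human-AI framework that shifts the annotation burden from generation to verification while following established accessibility standards, markedly improving annotation efficiency and keeping quality high. The paper also develops the GuideDogQA subset for evaluating fine-grained visual perception, advancing assistive technologies based on multimodal large language models (MLLMs) and supporting egocentric scene understanding for robotics and augmented reality.

Link: https://arxiv.org/abs/2503.12844
Authors: Junhyeok Kim,Jaewoo Park,Junhee Park,Sangeyl Lee,Jiwan Chung,Jisung Kim,Ji Hoon Joung,Youngjae Yu
Affiliations: Yonsei University; SK Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: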

Abstract:Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian’s viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.

[CV-110] Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

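[Quick read]: This paper addresses the limited scalability of existing self-supervised learning methods on geospatial raster (imagery) data, in particular the rigid model architectures and computational inefficiency that appear as the number of channels and modalities grows. It proposes the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT), with three core innovations: i) the LESS attention block, which approximates high-dimensional spatial-spectral attention with a Kronecker product; ii) a continuous positional-channel embedding layer that preserves the spatial and spectral continuity and physical characteristics of each patch; iii) a perception field mask that restricts attention to neighboring patches to exploit local spatial dependencies. The paper also builds a new benchmark, GFM-Bench, to evaluate LESS ViT, and shows through pretraining that the method surpasses current state-of-the-art multi-modal geospatial foundation models with less computation and fewer parameters. These innovations improve the model's flexibility and scalability and offer a direction for future geospatial analysis tasks that involve many modalities and channels.

Link: https://arxiv.org/abs/2503.12843
Authors: Haozhe Si,Yuxuan Wan,Minh Do,Deepak Vasisht,Han Zhao,Hendrik F. Hamann
Affiliations: University of Illinois Urbana-Champaign; IBM Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: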

Abstract:Geospatial raster (imagery) data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches for such geospatial data. However, they fall short of scalable model architectures, leading to inflexibility and computational inefficiencies when faced with an increasing number of channels and modalities. To address these limitations, we introduce Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block that approximates high-dimensional spatial-spectral attention through Kronecker’s product of the low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer that preserves both spatial and spectral continuity and physical characteristics of each patch; and iii) the Perception Field Mask that exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct a benchmark, GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method surpasses current state-of-the-art multi-modal geospatial foundation models, achieving superior performance with less computation and fewer parameters. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
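
The Kronecker-product idea in the abstract can be illustrated with a small identity check. This is a minimal sketch, not the LESS ViT implementation: it assumes scalar token features, row-major flattening of an N x C (patch x channel) grid, and hand-picked attention scores. All function names are hypothetical. It shows that applying the large (NC x NC) matrix `kron(A_s, A_c)` gives the same result as mixing along the spectral and spatial axes separately.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax, so every row sums to 1 like an attention map."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def kron(a, b):
    """Kronecker product of two dense matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][k // cols_b] * b[i % rows_b][k % cols_b]
         for k in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def factored_apply(a_s, a_c, x):
    """Compute A_s @ X @ A_c^T without building the (N*C) x (N*C) matrix."""
    n, c = len(x), len(x[0])
    # Spectral mixing within each patch.
    tmp = [[sum(a_c[j][l] * x[k][l] for l in range(c)) for j in range(c)]
           for k in range(n)]
    # Spatial mixing across patches.
    return [[sum(a_s[i][k] * tmp[k][j] for k in range(n)) for j in range(c)]
            for i in range(n)]


# Toy example: 3 patches, 2 spectral channels, arbitrary scores.
a_s = softmax_rows([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
a_c = softmax_rows([[0.5, -0.5], [0.0, 1.0]])
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

full = kron(a_s, a_c)                      # 6 x 6 joint attention
flat = [v for row in x for v in row]       # row-major flattening
y_full = matvec(full, flat)
y_fact = [v for row in factored_apply(a_s, a_c, x) for v in row]

print(max(abs(p - q) for p, q in zip(y_full, y_fact)))
```

The joint matrix stores (N*C)^2 entries while the two factors store N^2 + C^2, which is where the savings come from as the number of channels grows.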

[CV-111] Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics CVPR2025

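[Quick read]: This paper addresses sound-guided object segmentation, focusing on two challenges: (1) feature confusion caused by overlapping audio signals, and (2) difficult audio-visual matching because the same object can produce varied sounds. It proposes Dynamic Derivation and Elimination (DDESeg), a new audio-visual segmentation framework. To ease feature confusion, it reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source; to ease matching, it introduces a discriminative feature learning module that increases the semantic distinctiveness of the generated audio representations. A dynamic elimination module then filters out non-matching elements, ensuring precise interaction and alignment between target regions and the relevant audio semantics. Experiments show strong performance on audio-visual segmentation (AVS) datasets.

Link: https://arxiv.org/abs/2503.12840
Authors: Chen Liu,Liying Yang,Peike Li,Dadong Wang,Lincheng Li,Xin Yu
Affiliations: The University of Queensland; NetEase Fuxi AI Lab; Matrix Verse AI; CSIRO Data61; Macau University of Science and Technology
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Accepted by CVPR2025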

Abstract:Sound-guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio-visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio-visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio-visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance on AVS datasets.

[CV-112] DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model

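[Quick read]: This paper addresses the insufficient exploration of multi-layer structure in existing diffusion-based text-driven image generation, in particular inconsistent inter-layer interactions such as occlusion relationships, spatial layout, and shadows. It proposes the DreamLayer framework, whose key idea is to explicitly model the relationship between transparent foreground layers and the background layer so that multi-layer images are generated coherently. DreamLayer has three core components: Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. To support research on multi-layer generation, the authors build a high-quality, diverse multi-layer dataset; extensive experiments and user studies confirm that DreamLayer generates more coherent and better-aligned layers, with broad applicability to latent-space editing and image decomposition.

Link: https://arxiv.org/abs/2503.12838
Authors: Junjia Huang,Pengxiang Yan,Jinhang Cai,Jiyang Liu,Zhao Wang,Yitong Wang,Xinglong Wu,Guanbin Li
Affiliations: Sun Yat-sen University; ByteDance Intelligent Creation; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission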

Abstract:Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.

[CV-113] CompMarkGS: Robust Watermarking for Compression 3D Gaussian Splatting

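[Quick read]: This paper addresses the loss of watermarks in 3D Gaussian Splatting (3DGS) models after quantization-based compression. The key to the solution is a new watermark embedding method that adds a quantization distortion layer during training to simulate compression, keeping the watermark robust after quantization while maintaining high rendering quality. The method also designs a learnable watermark embedding feature that embeds the watermark into the anchor feature, and a frequency-aware anchor growing mechanism that improves image quality in high-frequency regions. Experiments confirm that the method preserves the watermark and maintains superior image quality at high compression ratios, offering a viable way to protect 3DGS models.

Link: https://arxiv.org/abs/2503.12836
Authors: Sumin In,Youngdong Jang,Utae Jeong,MinHyuk Jang,Hyeongcheol Park,Eunbyung Park,Sangpil Kim
Affiliations: Korea University; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 17 figures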

Abstract:3D Gaussian Splatting (3DGS) enables rapid differentiable rendering for 3D reconstruction and novel view synthesis, leading to its widespread commercial use. Consequently, copyright protection via watermarking has become critical. However, because 3DGS relies on millions of Gaussians, which require gigabytes of storage, efficient transfer and storage require compression. Existing 3DGS watermarking methods are vulnerable to quantization-based compression, often resulting in the loss of the embedded watermark. To address this challenge, we propose a novel watermarking method that ensures watermark robustness after model compression while maintaining high rendering quality. In detail, we incorporate a quantization distortion layer that simulates compression during training, preserving the watermark under quantization-based compression. Also, we propose a learnable watermark embedding feature that embeds the watermark into the anchor feature, ensuring structural consistency and seamless integration into the 3D scene. Furthermore, we present a frequency-aware anchor growing mechanism to enhance image quality in high-frequency regions by effectively identifying Gaussians within these regions. Experimental results confirm that our method preserves the watermark and maintains superior image quality under high compression, validating it as a promising approach for a secure 3DGS model.
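
To see why quantization can erase an embedded signal, and what "simulating compression during training" might look like, here is a toy sketch. It is not the CompMarkGS layer: the sign-based bit decoding, the uniform quantizer, and the additive-noise surrogate (a common stand-in for rounding in learned compression) are all assumptions chosen for illustration.

```python
import random


def quantize(value: float, step: float) -> float:
    """Uniform quantization: snap a value to the nearest multiple of `step`."""
    return round(value / step) * step


def noisy_quantization_proxy(value: float, step: float, rng: random.Random) -> float:
    """Training-time surrogate (assumed): add noise with the same range as
    the rounding error, so the model sees distortion similar to quantization."""
    return value + rng.uniform(-step / 2, step / 2)


def decode_bit(feature: float) -> int:
    """Hypothetical watermark decoder: one bit carried by the feature's sign."""
    return 1 if feature > 0 else 0


step = 0.25
weak, strong = 0.1, 0.2   # two features that both encode bit 1 before compression

print(decode_bit(weak), decode_bit(quantize(weak, step)))      # bit is lost
print(decode_bit(strong), decode_bit(quantize(strong, step)))  # bit survives

rng = random.Random(0)
samples = [noisy_quantization_proxy(strong, step, rng) for _ in range(100)]
print(max(abs(s - strong) for s in samples) <= step / 2)
```

A feature whose magnitude is below half the quantization step collapses to zero, so a watermark that is trained while exposed to this kind of distortion has a reason to use larger margins.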

[CV-114] PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior

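[Quick read]: This paper addresses two core challenges in conditional 3D shape generation: minimizing information loss and maximizing fidelity to the user's intent. Existing methods focus on two isolated conditioning signals (user sketches and text descriptions), neither of which offers flexible shape control. The proposed solution, PASTA, uses text embeddings from a vision-language model to enrich the semantic representation of sketches; the text acts as a prior that specifies the object's part components and compensates for visual cues missing from ambiguous sketches. PASTA also introduces ISG-Net, which uses two graph convolutional networks: IndivGCN to process fine-grained details, and PartGCN to aggregate those details into parts and refine object structure. Experiments show that PASTA excels at part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.

Link: https://arxiv.org/abs/2503.12834
Authors: Seunggwan Lee,Hwanhee Jung,Byoungsoo Koh,Qixing Huang,Sangho Yoon,Sangpil Kim
Affiliations: Korea University; KOCCA; The University of Texas at Austin; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 18 figures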

Abstract:A fundamental challenge in conditional 3D shape generation is to minimize the information loss and maximize the intention of user input. Existing approaches have predominantly focused on two types of isolated conditional signals, i.e., user sketches and text descriptions, each of which does not offer flexible control of the generated shape. In this paper, we introduce PASTA, the flexible approach that seamlessly integrates a user sketch and a text description for 3D shape generation. The key idea is to use text embeddings from a vision-language model to enrich the semantic representation of sketches. Specifically, these text-derived priors specify the part components of the object, compensating for missing visual cues from ambiguous sketches. In addition, we introduce ISG-Net which employs two types of graph convolutional networks: IndivGCN, which processes fine-grained details, and PartGCN, which aggregates these details into parts and refines the structure of objects. Extensive experiments demonstrate that PASTA outperforms existing methods in part-level editing and achieves state-of-the-art results in sketch-to-3D shape generation.

[CV-115] GSBAK^K: top-K Geometric Score-based Black-box Attack ICLR2025

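[Quick read]: This paper addresses the shortcomings of existing score-based adversarial attacks, which mainly craft top-1 adversarial examples and often have unsatisfactory attack success rates and query efficiency under small perturbation budgets; it also fills a gap in the study of the adversarial robustness of multi-label classifiers. The key contribution is a surrogate-free score-based attack, the geometric score-based black-box attack (GSBAK^K), which crafts top-K adversarial examples in both untargeted and targeted settings, with the goal of changing the target classifier's top-K predictions. The method introduces gradient-based techniques to find a good initial boundary point, then uses novel gradient estimation techniques to iteratively refine the sample on the decision boundary, exploiting the boundary's geometry efficiently. GSBAK^K also applies to attacks on top-K multi-label classifiers. Experiments on the ImageNet and PASCAL VOC datasets validate its effectiveness.

Link: https://arxiv.org/abs/2503.12827
Authors: Md Farhamdur Reza,Richeng Jin,Tianfu Wu,Huaiyu Dai
Affiliations: NC State University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This article has been accepted for publication at ICLR 2025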

Abstract:Existing score-based adversarial attacks mainly focus on crafting top-1 adversarial examples against classifiers with single-label classification. Their attack success rate and query efficiency are often less than satisfactory, particularly under small perturbation requirements; moreover, the vulnerability of classifiers with multi-label learning is yet to be studied. In this paper, we propose a comprehensive surrogate-free score-based attack, named geometric score-based black-box attack (GSBAK^K), to craft adversarial examples in an aggressive top-K setting for both untargeted and targeted attacks, where the goal is to change the top-K predictions of the target classifier. We introduce novel gradient-based methods to find a good initial boundary point to attack. Our iterative method employs novel gradient estimation techniques, particularly effective in top-K setting, on the decision boundary to effectively exploit the geometry of the decision boundary. Additionally, GSBAK^K can be used to attack against classifiers with top-K multi-label learning. Extensive experimental results on ImageNet and PASCAL VOC datasets validate the effectiveness of GSBAK^K in crafting top-K adversarial examples.
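
The abstract does not spell out the success criteria for a top-K attack, so the sketch below encodes one plausible reading, stated as an assumption: an untargeted attack succeeds when none of the true labels remain in the top-K predictions, and a targeted attack succeeds when the top-K set equals the chosen target labels. The function names are hypothetical and the score vector stands in for a black-box classifier's output.

```python
def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def untargeted_success(adv_scores, true_labels, k):
    """Assumed criterion: every true label has been pushed out of the top-k."""
    return not (set(top_k(adv_scores, k)) & set(true_labels))


def targeted_success(adv_scores, target_labels, k):
    """Assumed criterion: the top-k set is exactly the target label set."""
    return set(top_k(adv_scores, k)) == set(target_labels)


# Scores returned by a (hypothetical) classifier for an adversarial input.
adv_scores = [0.10, 0.50, 0.30, 0.05, 0.05]

print(top_k(adv_scores, 2))                       # [1, 2]
print(untargeted_success(adv_scores, [0], 2))     # True: label 0 left the top-2
print(untargeted_success(adv_scores, [2], 2))     # False: label 2 is still there
print(targeted_success(adv_scores, [2, 1], 2))    # True: top-2 set is {1, 2}
```

Under this reading, larger K makes the untargeted attack harder, because the true label must fall further down the ranking.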

[CV-116] From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration CVPR2025

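[Quick read]: This paper addresses the long-tail (LT) problem in the training data of Large Vision-Language Models (LVLMs), that is, severely imbalanced data distributions. Existing work focuses on traditional vision-language models (such as CLIP or ViT) and specific tasks (such as recognition and classification), while LVLMs (for example LLaVA) on more general tasks (such as visual question answering and visual reasoning) remain under-explored. An in-depth analysis identifies two core causes of the long-tail problem: overrepresentation of head concepts and underrepresentation of tail concepts. The paper therefore proposes an Adaptive Data Refinement Framework (ADR) with two stages: Data Rebalancing (DR) and Data Synthesis (DS). The DR stage adaptively rebalances redundant data based on entity distributions; the DS stage uses Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement the underrepresented portions. Experiments show a 4.36% relative improvement in the average performance of LLaVA 1.5 without increasing the training data volume, effectively mitigating the long-tail problem in the training data.

Link: https://arxiv.org/abs/2503.12821
Authors: Mingyang Song,Xiaoye Qu,Jiawei Zhou,Yu Cheng
Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Stony Brook University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025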

Abstract:Large Vision-Language Models (LVLMs) have achieved significant progress in combining visual comprehension with language generation. Despite this success, the training data of LVLMs still suffers from Long-Tail (LT) problems, where the data distribution is highly imbalanced. Previous works have mainly focused on traditional VLM architectures, i.e., CLIP or ViT, and specific tasks such as recognition and classification. Nevertheless, the exploration of LVLM (e.g. LLaVA) and more general tasks (e.g. Visual Question Answering and Visual Reasoning) remains under-explored. In this paper, we first conduct an in-depth analysis of the LT issues in LVLMs and identify two core causes: the overrepresentation of head concepts and the underrepresentation of tail concepts. Based on the above observation, we propose an Adaptive Data Refinement Framework (ADR), which consists of two stages: Data Rebalancing (DR) and Data Synthesis (DS). In the DR stage, we adaptively rebalance the redundant data based on entity distributions, while in the DS stage, we leverage Denoising Diffusion Probabilistic Models (DDPMs) and scarce images to supplement underrepresented portions. Through comprehensive evaluations across eleven benchmarks, our proposed ADR effectively mitigates the long-tail problem in the training data, improving the average performance of LLaVA 1.5 relatively by 4.36%, without increasing the training data volume.
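
The abstract says only that the DR stage rebalances redundant data "based on entity distributions". The sketch below is one simple way such a rule could look, and it is an assumption rather than the ADR procedure: count how many samples mention each entity, then keep a sample with probability `threshold / count` of its rarest entity, capped at 1. The names `entity_counts`, `keep_probability`, and `threshold` are hypothetical.

```python
from collections import Counter


def entity_counts(samples):
    """Number of samples that mention each entity (each sample counted once)."""
    counts = Counter()
    for entities in samples:
        counts.update(set(entities))
    return counts


def keep_probability(entities, counts, threshold):
    """Assumed rule: a sample is judged by its rarest entity.

    Samples made up only of head entities are down-sampled; any sample
    containing a tail entity (count <= threshold) is always kept.
    Assumes every sample lists at least one entity.
    """
    rarest = min(counts[e] for e in set(entities))
    return min(1.0, threshold / rarest)


# Toy corpus: "dog" is a head concept, "okapi" a tail concept.
samples = [["dog"], ["dog"], ["dog"], ["dog"], ["dog", "okapi"], ["okapi"]]
counts = entity_counts(samples)

for entities in (["dog"], ["dog", "okapi"], ["okapi"]):
    print(entities, keep_probability(entities, counts, threshold=2))
```

A rule of this shape reduces head-concept redundancy without touching tail samples, which is consistent with the stated goal of not increasing the training data volume.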

[CV-117] Hydra-MDP++: Advancing End-to-End Driving via Expert-Guided Hydra-Distillation

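[Quick read]: This paper addresses the failure of traditional NAVSIM-derived teachers to capture unsafe driving behaviors related to traffic light compliance, lane keeping, and extended comfort. The key to the solution is a new framework, Hydra-MDP++, which uses teacher-student knowledge distillation with a multi-head decoder that learns from human demonstrations and rule-based experts. The framework uses a lightweight ResNet-34 network, introduces expanded evaluation metrics, and processes raw images directly without relying on privileged perception signals. Integrating these components and scaling to a V2-99 image encoder yields a 91.0% drive score on NAVSIM, handling diverse driving scenarios while remaining computationally efficient.

Link: https://arxiv.org/abs/2503.12820
Authors: Kailin Li,Zhenxin Li,Shiyi Lan,Yuan Xie,Zhizhong Zhang,Jiayi Liu,Zuxuan Wu,Zhiding Yu,Jose M.Alvarez
Affiliations: East China Normal University; Fudan University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: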

Abstract:Hydra-MDP++ introduces a novel teacher-student knowledge distillation framework with a multi-head decoder that learns from human demonstrations and rule-based experts. Using a lightweight ResNet-34 network without complex components, the framework incorporates expanded evaluation metrics, including traffic light compliance (TL), lane-keeping ability (LK), and extended comfort (EC) to address unsafe behaviors not captured by traditional NAVSIM-derived teachers. Like other end-to-end autonomous driving approaches, Hydra-MDP++ processes raw images directly without relying on privileged perception signals. Hydra-MDP++ achieves state-of-the-art performance by integrating these components with a 91.0% drive score on NAVSIM through scaling to a V2-99 image encoder, demonstrating its effectiveness in handling diverse driving scenarios while maintaining computational efficiency.

[CV-118] AV-Surf: Surface-Enhanced Geometry-Aware Novel-View Acoustic Synthesis

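[Quick read]: This paper addresses accurate modeling of sound propagation in complex real-world environments for Novel View Acoustic Synthesis (NVAS). Prior methods rely mainly on visual perception to estimate spatial acoustics and overlook the potential of combining surface normals and structural details from 3D representations in acoustic modeling. The key is a surface-enhanced, geometry-aware approach that introduces geometric priors (images, depth maps, surface normals, and point clouds) and a dual cross-attention transformer that integrates geometric constraints into frequency queries to understand the emitter's surroundings. In addition, a ConvNeXt-based spectral feature processing network, the Spectral Refinement Network (SRN), is designed to synthesize realistic binaural audio. Experiments show that the method outperforms existing methods on the RWAVS and SoundSpace datasets.

Link: https://arxiv.org/abs/2503.12806
Authors: Hadam Baek,Hannie Shin,Jiyoung Seo,Chanwoo Kim,Saerom Kim,Hyeongbok Kim,Sangpil Kim
Affiliations: Korea University; Testworks
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: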

Abstract:Accurately modeling sound propagation with complex real-world environments is essential for Novel View Acoustic Synthesis (NVAS). While previous studies have leveraged visual perception to estimate spatial acoustics, the combined use of surface normal and structural details from 3D representations in acoustic modeling has been underexplored. Given their direct impact on sound wave reflections and propagation, surface normals should be jointly modeled with structural details to achieve accurate spatial acoustics. In this paper, we propose a surface-enhanced geometry-aware approach for NVAS to improve spatial acoustic modeling. To achieve this, we exploit geometric priors, such as image, depth map, surface normals, and point clouds obtained using a 3D Gaussian Splatting (3DGS) based framework. We introduce a dual cross-attention-based transformer integrating geometrical constraints into frequency query to understand the surroundings of the emitter. Additionally, we design a ConvNeXt-based spectral features processing network called Spectral Refinement Network (SRN) to synthesize realistic binaural audio. Experimental results on the RWAVS and SoundSpace datasets highlight the necessity of our approach, as it surpasses existing methods in novel view acoustic synthesis.

[CV-119] Pairwise Similarity Regularization for Semi-supervised Graph Medical Image Segmentation

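[Quick read]: This paper addresses the reduced utilization of labeled information in semi-supervised medical image segmentation caused by the distributional shift between labeled and unlabeled data. It proposes a graph network feature alignment method based on Pairwise Similarity Regularization (PaSR). The key is to align the graph structure of images across domains by keeping the pairwise structural similarity of feature graphs consistent between the target and source domains, which reduces distribution shift in medical images; graph clustering information is also aligned to improve the accuracy of the teacher network's pseudo-labels and thus the model's overall semi-supervised efficiency. Experiments show that the method surpasses advanced existing methods on several medical image segmentation benchmarks, with an average improvement of more than 10.66% on the ACDC dataset.

Link: https://arxiv.org/abs/2503.12800
Authors: Jialu Zhou,Dianxi Shi,Shaowu Yang,Chunping Qiu,Luoxi Jing,Mengzhu Wang
Affiliations: National University of Defense Technology; Academy of Military Sciences; Intelligent Game and Decision Lab; Peking University; Hebei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: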

Abstract:By fully leveraging the value of unlabeled data, semi-supervised medical image segmentation algorithms significantly reduce the constraint imposed by limited labeled data and achieve a significant improvement in accuracy. However, the distributional shift between labeled and unlabeled data weakens the utilization of information from the labeled data. To alleviate the problem, we propose a graph network feature alignment method based on pairwise similarity regularization (PaSR) for semi-supervised medical image segmentation. PaSR aligns the graph structure of images in different domains by maintaining consistency in the pairwise structural similarity of feature graphs between the target domain and the source domain, reducing distribution shift issues in medical images. Meanwhile, it further improves the accuracy of pseudo-labels in the teacher network by aligning graph clustering information, enhancing the semi-supervised efficiency of the model. The method was evaluated on three medical image segmentation benchmark datasets, with results showing improvements over advanced methods in various metrics. On the ACDC dataset, it achieved an average improvement of more than 10.66%.
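
The idea of keeping pairwise similarity consistent across domains can be sketched as a small loss function. This is an illustrative assumption, not the PaSR formulation: it uses cosine similarity between node features, assumes both graphs have the same number of nodes and no zero vectors, and penalizes the mean squared difference between the two similarity matrices. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(features):
    """Pairwise cosine similarities between all node features of one graph."""
    return [[cosine(u, v) for v in features] for u in features]


def pairwise_similarity_loss(source_features, target_features):
    """Mean squared gap between the source and target similarity structure."""
    s = similarity_matrix(source_features)
    t = similarity_matrix(target_features)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


source = [[1.0, 0.0], [0.0, 1.0]]     # two dissimilar nodes
rescaled = [[2.0, 0.0], [0.0, 2.0]]   # same structure, different scale
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # nodes became indistinguishable

print(pairwise_similarity_loss(source, rescaled))   # structure preserved
print(pairwise_similarity_loss(source, collapsed))  # structure changed
```

Because the loss compares relationships between nodes rather than the features themselves, it tolerates domain-specific shifts such as rescaling while still penalizing a change in structure.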

[CV-120] Grounded Chain-of-Thought for Multimodal Large Language Models

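[Quick read]: This paper addresses the tendency of existing multimodal large language models (MLLMs) to produce visual hallucinations, which seriously hinders their trustworthy use in real-world scenarios. It studies the problem from the perspective of visual-spatial reasoning and proposes a new learning task, Grounded Chain-of-Thought (GCoT), whose key idea is to help MLLMs recognize and ground the relevant visual cues step by step, so that the correct answer is predicted with grounding coordinates as its intuitive basis. To support the task, the paper designs and builds the MM-GCoT dataset, with 24,022 GCoT examples for 5,033 images. It also introduces a comprehensive consistency evaluation system covering answer accuracy, grounding accuracy, and answer-grounding consistency. Experiments on 12 advanced MLLMs find that most models perform poorly on the consistency evaluation, indicating clear visual hallucination that is not directly related to parameter size or general multimodal performance. Finally, the paper shows that the dataset substantially improves the GCoT capability of existing MLLMs and reduces inconsistent answers, and that this capability generalizes to existing multimodal tasks such as open-world QA and REC.

Link: https://arxiv.org/abs/2503.12799
Authors: Qiong Wu,Xiangcong Yang,Yiyi Zhou,Chenxin Fang,Baiyang Song,Xiaoshuai Sun,Rongrong Ji
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: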

Abstract:Despite great progress, existing multimodal large language models (MLLMs) are prone to visual hallucination, greatly impeding their trustworthy applications. In this paper, we study this problem from the perspective of visual-spatial reasoning, and propose a new learning task for MLLMs, termed Grounded Chain-of-Thought (GCoT). Different from recent visual CoT studies, which focus more on visual knowledge reasoning, GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images. Besides, a comprehensive consistency evaluation system is also introduced, including the metrics of answer accuracy, grounding accuracy and answer-grounding consistency. We further design and conduct a set of experiments on 12 advanced MLLMs, and reveal some notable findings: i. most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to the parameter size and general multimodal performance, i.e., a larger and stronger MLLM is not less affected by this issue. Lastly, we also demonstrate that the proposed dataset can help existing MLLMs cultivate their GCoT capability and reduce inconsistent answering significantly. Moreover, their GCoT can also be generalized to existing multimodal tasks, such as open-world QA and REC.
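
The three metrics named in the abstract can be sketched as follows. The exact definitions are not given here, so this version makes explicit assumptions: a grounding counts as correct when its box overlaps the reference box with IoU of at least 0.5, and an example counts as consistent when the answer and the grounding are either both correct or both wrong. The field names and the threshold are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(examples, iou_threshold=0.5):
    """Answer accuracy, grounding accuracy, and their agreement rate."""
    n = len(examples)
    answer_ok = [ex["pred_answer"] == ex["gold_answer"] for ex in examples]
    ground_ok = [iou(ex["pred_box"], ex["gold_box"]) >= iou_threshold
                 for ex in examples]
    return {
        "answer_accuracy": sum(answer_ok) / n,
        "grounding_accuracy": sum(ground_ok) / n,
        # Assumed definition: answer and grounding agree (both right or both wrong).
        "consistency": sum(a == g for a, g in zip(answer_ok, ground_ok)) / n,
    }


gold = (0, 0, 10, 10)
examples = [
    {"pred_answer": "cup", "gold_answer": "cup", "pred_box": (0, 0, 10, 10), "gold_box": gold},
    {"pred_answer": "red", "gold_answer": "red", "pred_box": (20, 20, 30, 30), "gold_box": gold},
    {"pred_answer": "left", "gold_answer": "right", "pred_box": (50, 50, 60, 60), "gold_box": gold},
    {"pred_answer": "dog", "gold_answer": "dog", "pred_box": (0, 0, 10, 8), "gold_box": gold},
]
metrics = evaluate(examples)
print(metrics)
```

The second example is the case the paper is concerned with: a correct answer paired with a grounding that points at the wrong region, which suggests the answer was not actually based on the image evidence.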

[CV-121] Improving Generalization of Universal Adversarial Perturbation via Dynamic Maximin Optimization AAAI2025

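[Quick read]: This paper addresses the shortcomings of existing methods for generating universal adversarial perturbations (UAPs) with cross-sample universality and cross-model transferability. Specifically, traditional methods optimize UAPs against static snapshots of model parameters and do not fully exploit the potential of deep neural networks (DNNs) to generate more effective perturbations. The key contribution is a dynamic maximin strategy (DM-UAP), which uses an iterative max-min-min optimization framework together with a curriculum learning algorithm, dynamically adjusting model-data pairs to explore the joint space of model parameters and data thoroughly. The approach markedly improves both the cross-sample universality and the cross-model transferability of UAPs; using only 500 samples, it outperforms the current state-of-the-art method with an average fooling ratio increase of 12.108%.

Link: https://arxiv.org/abs/2503.12793
Authors: Yechao Zhang,Yingzhe Xu,Junyu Shi,Leo Yu Zhang,Shengshan Hu,Minghui Li,Yanjun Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2025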

Abstract:Deep neural networks (DNNs) are susceptible to universal adversarial perturbations (UAPs). These perturbations are meticulously designed to fool the target model universally across all sample classes. Unlike instance-specific adversarial examples (AEs), generating UAPs is more complex because they must be generalized across a wide range of data samples and models. Our research reveals that existing universal attack methods, which optimize UAPs using DNNs with static model parameter snapshots, do not fully leverage the potential of DNNs to generate more effective UAPs. Rather than optimizing UAPs against static DNN models with a fixed training set, we suggest using dynamic model-data pairs to generate UAPs. In particular, we introduce a dynamic maximin optimization strategy, aiming to optimize the UAP across a variety of optimal model-data pairs. We term this approach DM-UAP. DM-UAP utilizes an iterative max-min-min optimization framework that refines the model-data pairs, coupled with a curriculum UAP learning algorithm to examine the combined space of model parameters and data thoroughly. Comprehensive experiments on the ImageNet dataset demonstrate that the proposed DM-UAP markedly enhances both cross-sample universality and cross-model transferability of UAPs. Using only 500 samples for UAP generation, DM-UAP outperforms the state-of-the-art approach with an average increase in fooling ratio of 12.108%.
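
The fooling ratio used to report the result above is commonly defined as the fraction of inputs whose predicted label changes once the single shared perturbation is added. The sketch below computes it for a toy linear classifier; the classifier, the samples, and the perturbation are made up for illustration and are unrelated to the DM-UAP experiments.

```python
def predict(weights, x):
    """Label with the highest linear score (toy stand-in for a DNN)."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=lambda c: scores[c])


def fooling_ratio(weights, samples, perturbation):
    """Share of samples whose prediction changes under one shared perturbation."""
    fooled = 0
    for x in samples:
        x_adv = [xi + di for xi, di in zip(x, perturbation)]
        if predict(weights, x_adv) != predict(weights, x):
            fooled += 1
    return fooled / len(samples)


# Two-class toy model: class 0 scores the first feature, class 1 the second.
weights = [[1.0, 0.0], [0.0, 1.0]]
samples = [[1.0, 0.0], [0.6, 0.5], [0.0, 1.0], [0.2, 0.9]]
uap = [-0.3, 0.3]   # one perturbation applied to every sample

ratio = fooling_ratio(weights, samples, uap)
print(ratio)
```

Only the sample near the decision boundary flips here, which is why a universal perturbation has to be optimized over many samples (and, for transferability, many models) to reach a high ratio.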

[CV-122] Privacy-Preserving Biometric Verification with Handwritten Random Digit String

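[Quick read]: This paper addresses the privacy risk in handwriting verification, where handwritten biometrics such as signatures contain sensitive personal information. It proposes using a Random Digit String (RDS) for privacy-preserving handwriting verification: users authenticate by writing an arbitrary digit sequence, so no personal content is exposed. The key to the solution is the Pattern Attentive VErification Network (PAVENet) together with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, refining the handwriting style representation and overcoming the modeling difficulty posed by the unconstrained, variable content of RDS. Comprehensive evaluations demonstrate the feasibility and superiority of the method for privacy-preserving biometric verification, and reveal a forgery phenomenon that deviates from prior findings and helps counter malicious impostor attacks.

Link: https://arxiv.org/abs/2503.12786
Authors: Peirong Zhang,Yuliang Liu,Songxuan Lai,Hongliang Li,Lianwen Jin
Affiliations: School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; EI Product Department, Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: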

Abstract:Handwriting verification has stood as a steadfast identity authentication method for decades. However, this technique risks potential privacy breaches due to the inclusion of personal information in handwritten biometrics such as signatures. To address this concern, we propose using the Random Digit String (RDS) for privacy-preserving handwriting verification. This approach allows users to authenticate themselves by writing an arbitrary digit sequence, effectively ensuring privacy protection. To evaluate the effectiveness of RDS, we construct a new HRDS4BV dataset composed of online naturally handwritten RDS. Unlike conventional handwriting, RDS encompasses unconstrained and variable content, posing significant challenges for modeling consistent personal writing style. To surmount this, we propose the Pattern Attentive VErification Network (PAVENet), along with a Discriminative Pattern Mining (DPM) module. DPM adaptively enhances the recognition of consistent and discriminative writing patterns, thus refining handwriting style representation. Through comprehensive evaluations, we scrutinize the applicability of online RDS verification and showcase a pronounced outperformance of our model over existing methods. Furthermore, we discover a noteworthy forgery phenomenon that deviates from prior findings and discuss its positive impact in countering malicious impostor attacks. Substantially, our work underscores the feasibility of privacy-preserving biometric verification and propels the prospects of its broader acceptance and application.

[CV-123] Mixed-granularity Implicit Representation for Continuous Hyperspectral Compressive Reconstruction

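[Quick read]: This paper addresses the application bottleneck caused by the long acquisition times of hyperspectral images (HSIs) with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system accelerates acquisition through compression, but reconstruction from the compressed data is constrained by fixed spatial and spectral resolutions. The key solution is a continuous HSI reconstruction method based on implicit neural representation, the Mixed Granularity Implicit Representation (MGIR) framework. It comprises a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction, a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, and a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, MGIR can reconstruct at any desired spatial-spectral resolution, making the CASSI system considerably more flexible and adaptable. Experiments show that the model produces reconstructions at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios.

Link: https://arxiv.org/abs/2503.12783
Authors: Jianan Li,Huan Chen,Wangcai Zhao,Rui Chen,Tingfa Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by TNNLS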

Abstract:Hyperspectral Images (HSIs) are crucial across numerous fields but are hindered by the long acquisition times associated with traditional spectrometers. The Coded Aperture Snapshot Spectral Imaging (CASSI) system mitigates this issue through a compression technique that accelerates the acquisition process. However, reconstructing HSIs from compressed data presents challenges due to fixed spatial and spectral resolution constraints. This study introduces a novel method using implicit neural representation for continuous hyperspectral image reconstruction. We propose the Mixed Granularity Implicit Representation (MGIR) framework, which includes a Hierarchical Spectral-Spatial Implicit Encoder for efficient multi-scale implicit feature extraction. This is complemented by a Mixed-Granularity Local Feature Aggregator that adaptively integrates local features across scales, combined with a decoder that merges coordinate information for precise reconstruction. By leveraging implicit neural representations, the MGIR framework enables reconstruction at any desired spatial-spectral resolution, significantly enhancing the flexibility and adaptability of the CASSI system. Extensive experimental evaluations confirm that our model produces reconstructed images at arbitrary resolutions and matches state-of-the-art methods across varying spectral-spatial compression ratios. The code will be released at this https URL.

[CV-124] SAM2 for Image and Video Segmentation: A Comprehensive Survey

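[Quick read]: This survey addresses the limited cross-domain adaptability and generalization of existing image and video segmentation models. It systematically analyzes the application of SAM2 (an improved version of the Segment Anything Model) to image and video segmentation, particularly its performance in complex scenarios, and examines its adaptability and limitations in specific domains. By evaluating SAM2 in specialized areas such as medical imaging and the challenges of cross-domain use, the paper identifies technical challenges and proposes future development directions, providing a useful reference for optimizing and applying SAM2 in real-world scenarios.

Link: https://arxiv.org/abs/2503.12781
Authors: Zhang Jiaxing,Tang Hao
Affiliations: School of Software engineering, Sichuan University; School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, 7 Tables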

Abstract:Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2, an improved version of SAM (Segment Anything Model), has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2’s adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2’s applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.

[CV-125] LangDA: Building Context-Awareness via Language for Domain Adaptive Semantic Segmentation

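[Quick read]: This paper addresses two key challenges in unsupervised domain adaptation for semantic segmentation (DASS): vision-only methods are susceptible to noisy pseudo-labels biased toward the source domain, and language-based methods do not fully capture the intricate spatial relationships between objects. The proposed method, LangDA, learns contextual relationships between objects from scene descriptions generated by a vision-language model (VLM), and aligns whole-image features with these context-aware text representations to learn generalized representations. LangDA sets a new state of the art on three DASS benchmarks, outperforming existing methods by 2.6%, 1.4%, and 3.9%.

Link: https://arxiv.org/abs/2503.12780
Authors: Chang Liu,Bavesh Balaji,Saad Hossain,C Thomas,Kwei-Herng Lai,Raviteja Vemulapalli,Alexander Wong,Sirisha Rambhatla
Affiliations: University of Waterloo; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Comments: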

Abstract:Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label-rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision-only approaches using masking or multi-resolution crops, and (2) language-based approaches that use generic class-wise prompts informed by target domain (e.g. “a snowy photo of a class”). However, the former is susceptible to noisy pseudo-labels that are biased to the source domain. The latter does not fully capture the intricate spatial relationships of objects – key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM-generated scene descriptions (e.g. “a pedestrian is on the sidewalk, and the street is lined with buildings.”). Second, LangDA aligns the entire image features with text representation of this context-aware scene caption and learns generalized representations via text. With this, LangDA sets the new state-of-the-art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4% and 3.9%.

[CV-126] TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image ICRA2025

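[Quick read]: This paper addresses the challenges of manipulating transparent objects, whose reflection and refraction properties make accurate 3D shape estimation difficult. It proposes TransDiff, a single-view RGB-D depth completion framework that uses Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop settings. The key is to condition depth map generation on features extracted from RGB images (semantic segmentation, edge maps, and normal maps) and to learn an iterative denoising process that transforms a random depth distribution into a depth map, combined with a new training method that better aligns the noisy depth with RGB image features so that depth estimates are refined step by step. An improved inference process further accelerates denoising. Experiments show that the method significantly outperforms baselines on both synthetic and real-world benchmarks with acceptable inference time.

Link: https://arxiv.org/abs/2503.12779
Authors: Haoxiao Wang,Kaichen Zhou,Binrui Gu,Zhiyuan Feng,Weijie Wang,Peilin Sun,Yicheng Xiao,Jianhua Zhang,Hao Dong
Affiliations: CFCS, School of CS, Peking University and National Key Laboratory for Multimedia Information Processing; Tsinghua University; Zhejiang University; Southeast University; Tianjin University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025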

Abstract:Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop settings. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilize an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found on this https URL
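As background for the "random depth distribution" that the denoiser starts from, here is a minimal sketch of the generic DDPM forward (noising) process applied to a few depth values. It is the standard DDPM formulation, not TransDiff's conditioned model. The linear beta schedule, its endpoints, the step count, and the sample depth values are assumptions chosen for illustration.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars


def q_sample(depth0, t, bars, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    keep = math.sqrt(bars[t])
    noise = math.sqrt(1.0 - bars[t])
    return [keep * d + noise * rng.gauss(0.0, 1.0) for d in depth0]


rng = random.Random(0)
bars = alpha_bars()
depth0 = [0.2, 0.5, 0.8]                      # hypothetical normalized depths
slightly_noisy = q_sample(depth0, 10, bars, rng)
almost_pure_noise = q_sample(depth0, 999, bars, rng)
```

A trained denoiser runs this process in reverse. In TransDiff, per the abstract, that reverse process is conditioned on RGB-derived features and on initially refined depth, which this sketch omits.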

[CV-127] Adaptive Deep Learning for Multiclass Breast Cancer Classification via Misprediction Risk Analysis

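[Quick read]: This paper addresses misclassification in multiclass breast cancer diagnosis, specifically the frequent mispredictions of existing methods on multiclass classification of HE-stained histopathology images. The key contribution is a novel adaptive learning approach with two parts. First, a misprediction risk analysis framework quantifies and ranks the likelihood that an image is mislabeled by a classifier, using an interpretable risk model that needs only a small number of labeled samples for training. Second, an adaptive learning strategy fine-tunes classifiers according to the characteristics of the given dataset to minimize misprediction risk, allowing the classifier to adapt to the target workload. Experiments show that the risk analysis framework identifies mispredictions more accurately than existing methods and that the adaptive learning approach significantly improves state-of-the-art deep neural network classifiers.

Link: https://arxiv.org/abs/2503.12778
Authors: Gul Sheeraz,Qun Chen,Liu Feiyu,Zhou Fengjin MD
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: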

Abstract:Breast cancer remains one of the leading causes of cancer-related deaths worldwide. Early detection is crucial for improving patient outcomes, yet the diagnostic process is often complex and prone to inconsistencies among pathologists. Computer-aided diagnostic approaches have significantly enhanced breast cancer detection, particularly in binary classification (benign vs. malignant). However, these methods face challenges in multiclass classification, leading to frequent mispredictions. In this work, we propose a novel adaptive learning approach for multiclass breast cancer classification using HE-stained histopathology images. First, we introduce a misprediction risk analysis framework that quantifies and ranks the likelihood of an image being mislabeled by a classifier. This framework leverages an interpretable risk model that requires only a small number of labeled samples for training. Next, we present an adaptive learning strategy that fine-tunes classifiers based on the specific characteristics of a given dataset. This approach minimizes misprediction risk, allowing the classifier to adapt effectively to the target workload. We evaluate our proposed solutions on real benchmark datasets, demonstrating that our risk analysis framework more accurately identifies mispredictions compared to existing methods. Furthermore, our adaptive learning approach significantly improves the performance of state-of-the-art deep neural network classifiers.

[CV-128] NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models

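[Quick read]: This paper addresses the limitations of multi-modal large language models (MLLMs) in driving scene understanding, in particular their difficulty with complex driving scenes that involve multi-view information. The key contributions are NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding, and NuPlanQA-1M, a large-scale dataset of 1M real-world visual question-answering pairs. The paper also presents BEV-LLM, which integrates Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Compared with existing MLLMs, BEV-LLM shows clear advantages in driving-scene-specific perception and ego-centric spatial reasoning, outperforming other models in six of the nine subtasks. The results indicate that BEV integration improves multi-view MLLMs, while also identifying key areas that need further refinement for effective adaptation to driving scenes.

Link: https://arxiv.org/abs/2503.12772
Authors: Sung-Yeon Park,Can Cui,Yunsheng Ma,Ahmadreza Moradipari,Rohit Gupta,Kyungtae Han,Ziran Wang
Affiliations: Purdue University; Toyota InfoTech Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: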

Abstract:Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird’s-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we publicly release NuPlanQA at this https URL.

[CV-129] ViSpeak: Visual Instruction Feedback in Streaming Videos

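[Quick read]: This paper addresses the challenges of streaming video understanding, which stem from its time-sensitive, omni-modal, and interactive characteristics. Existing Large Multi-modal Models (LMMs) focus mainly on offline video understanding, whereas streaming scenarios require models to perceive visual content in real time and extract instructions from it. The paper proposes a new task, Visual Instruction Feedback, in which models must not only understand visual information but also learn to extract instructions from visual data, improving user-agent interaction. For example, when a user waves at the agent, the agent should recognize the gesture and start a conversation with a welcome message.

The key to the solution is the definition of seven subtasks highly relevant to the visual modality, supported by the ViSpeak-Instruct training set and the ViSpeak-Bench evaluation set. The paper also proposes ViSpeak, a SOTA streaming video understanding LMM with GPT-4o-level performance. After fine-tuning on ViSpeak-Instruct, ViSpeak acquires basic visual instruction feedback ability and serves as a solid baseline for future research.

Link: https://arxiv.org/abs/2503.12769
Authors: Shenghao Fu,Qize Yang,Yuan-Ming Li,Yi-Xing Peng,Kun-Yu Lin,Xihan Wei,Jian-Fang Hu,Xiaohua Xie,Wei-Shi Zheng
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Tongyi Lab, Alibaba Group; Peng Cheng Laboratory, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China; Pazhou Laboratory (Huangpu), China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: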

Abstract:Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.

[CV-130] Dynamic-Dark SLAM: RGB-Thermal Cooperative Robot Vision Strategy for Multi-Person Tracking in Both Well-Lit and Low-Light Scenes

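[Quick read]: This paper addresses the data scarcity and individual-identification difficulties that have held back multi-person tracking (MPT) with thermal cameras. The key to the solution is a cooperative MPT system that uses co-located RGB and thermal cameras and trains the RGB and thermal trackers with pseudo-annotations (bounding boxes + person IDs). The study also finds that switching between trackers with a binary brightness classifier is better suited to information integration than fusing the trackers. The work is a first step toward "Dynamic-Dark SLAM", which aims at effective recognition, understanding, and reconstruction of individuals, occluding objects, and traversable areas in dynamic environments, both bright and dark.

Link: https://arxiv.org/abs/2503.12768
Authors: Tatsuro Sakai,Kanji Tanaka,Jonathan Tay Yu Liang,Muhammad Adil Luqman,Daiki Iwata
Affiliations: University of Fukui
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, technical report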

Abstract:In robot vision, thermal cameras have significant potential for recognizing humans even in complete darkness. However, their application to multi-person tracking (MPT) has lagged due to data scarcity and difficulties in individual identification. In this study, we propose a cooperative MPT system that utilizes co-located RGB and thermal cameras, using pseudo-annotations (bounding boxes + person IDs) to train RGB and T trackers. Evaluation experiments demonstrate that the T tracker achieves remarkable performance in both bright and dark scenes. Furthermore, results suggest that a tracker-switching approach using a binary brightness classifier is more suitable than a tracker-fusion approach for information integration. This study marks a crucial first step toward "Dynamic-Dark SLAM," enabling effective recognition, understanding, and reconstruction of individuals, occluding objects, and traversable areas in dynamic environments, both bright and dark.
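The tracker-switching idea can be sketched as a per-frame decision: classify the frame as bright or dark, then hand it to exactly one tracker. The abstract does not say how the binary brightness classifier is built, so the mean-luma threshold below is a hypothetical stand-in, and the names `mean_luminance`, `select_tracker`, and the threshold value are assumptions.

```python
def mean_luminance(rgb_pixels):
    """Average luma of a frame given as (r, g, b) tuples in 0..255 (Rec. 601 weights)."""
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in rgb_pixels)
    return total / len(rgb_pixels)


def select_tracker(rgb_pixels, threshold=60.0):
    """Binary switch: use the RGB tracker in bright frames, the thermal one otherwise.

    The threshold is an illustrative guess, not a value from the paper.
    """
    return "rgb" if mean_luminance(rgb_pixels) >= threshold else "thermal"
```

Switching commits each frame to a single tracker's output. A fusion approach would instead combine both trackers' outputs on every frame, which the abstract reports as the less suitable option.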

[CV-131] Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion

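[Quick read]: This paper addresses the computational bottlenecks and information loss that arise in ultra-high-definition (UHD) image restoration because of the extremely high resolution. Existing methods based on Variational Autoencoders (VAE) improve efficiency by moving restoration from pixel space to latent space, but degraded components are inherently coupled with background elements, so the information lost during compression and gained during compensation is hard to control; restored images therefore often show detail loss and incomplete degradation removal. To address this, the paper proposes a Controlled Differential Disentangled VAE. The key is to use Hierarchical Contrastive Disentanglement Learning and an Orthogonal Gated Projection Module to guide the VAE to actively discard easily recoverable background information while encoding the harder-to-recover degraded information into the latent space. In addition, a Complex Invertible Multiscale Fusion Network processes background features to keep them consistent, and a latent space restoration network transforms the degraded latent features, yielding more accurate restoration. Extensive experiments show that the method alleviates information loss in VAE models while remaining computationally efficient, achieving state-of-the-art results on six UHD restoration tasks with only 1M parameters.

Link: https://arxiv.org/abs/2503.12764
Authors: Yidi Liu,Dong Li,Yuxin Ma,Jie Huang,Wenlong Zhang,Xueyang Fu,Zheng-jun Zha
Affiliations: University of Science and Technology of China; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: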

Abstract:Ultra-high-definition (UHD) image restoration often faces computational bottlenecks and information loss due to its extremely high resolution. Existing studies based on Variational Autoencoders (VAE) improve efficiency by transferring the image restoration process from pixel space to latent space. However, because degraded components are inherently coupled with background elements in degraded images, both information loss during compression and information gain during compensation remain uncontrollable. As a result, restored images often exhibit detail loss and incomplete degradation removal. To address this issue, we propose a Controlled Differential Disentangled VAE, which utilizes Hierarchical Contrastive Disentanglement Learning and an Orthogonal Gated Projection Module to guide the VAE to actively discard easily recoverable background information while encoding more difficult-to-recover degraded information into the latent space. Additionally, we design a Complex Invertible Multiscale Fusion Network to handle background features, ensuring their consistency, and utilize a latent space restoration network to transform the degraded latent features, leading to more accurate restoration results. Extensive experimental results demonstrate that our method effectively alleviates the information loss problem in VAE models while ensuring computational efficiency, significantly improving the quality of UHD image restoration, and achieves state-of-the-art results in six UHD restoration tasks with only 1M parameters.

[CV-132] A Survey on Human Interaction Motion Generation

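[Quick read]: This survey addresses the modeling and generation of human interaction behaviors (human-human, human-object, and human-scene interactions) in digital systems. Although advances in deep generative models and new datasets have accelerated progress, accurately capturing intricate human dynamics and their interactions with external entities remains a significant challenge. The paper presents what it describes as the first comprehensive overview of the literature on human interaction motion generation: it establishes foundational concepts, systematically reviews existing solutions and datasets across the three interaction tasks, covers evaluation metrics, and discusses open research directions and future opportunities.

Link: https://arxiv.org/abs/2503.12763
Authors: Kewei Sui,Anindita Ghosh,Inwoo Hwang,Jian Wang,Chuan Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The repository listing relevant papers is accessible at: this https URL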

Abstract:Humans inhabit a world defined by interactions – with other humans, objects, and environments. These interactive movements not only convey our relationships with our surroundings but also demonstrate how we perceive and communicate with the real world. Therefore, replicating these interaction behaviors in digital systems has emerged as an important topic for applications in robotics, virtual reality, and animation. While recent advances in deep generative models and new datasets have accelerated progress in this field, significant challenges remain in modeling the intricate human dynamics and their interactions with entities in the external world. In this survey, we present, for the first time, a comprehensive overview of the literature in human interaction motion generation. We begin by establishing foundational concepts essential for understanding the research background. We then systematically review existing solutions and datasets across three primary interaction tasks – human-human, human-object, and human-scene interactions – followed by evaluation metrics. Finally, we discuss open research directions and future opportunities.

[CV-133] VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

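[Quick read]: This paper addresses the automatic generation of angiography from non-angiographic inputs, in order to reduce the health risks of the extra radiation exposure associated with contrast agents. The key is VasTSD, a 3D vascular tree-state space diffusion model. Its novelty lies in a state space serialization approach that dynamically constructs vascular tree topologies, integrated with a diffusion-based generative model to ensure anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is used to construct vascular state space representations, enabling consistent modeling across multiple imaging modalities. Experiments show that VasTSD achieves better vessel continuity than prior methods when synthesizing angiography across multiple modalities and anatomical regions.

Link: https://arxiv.org/abs/2503.12758
Authors: Zhifeng Wang,Renjiao Yi,Xin Wen,Chenyang Zhu,Kai Xu
Affiliations: National University of Defense Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: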

Abstract:Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure, which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiography for multiple modalities and anatomical regions.

[CV-134] R3-Avatar: Record and Retrieve Temporal Codebook for Reconstructing Photorealistic Human Avatars

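[Quick read]: This paper addresses the inability of existing video-based 3D human avatar reconstruction methods to achieve both high-quality rendering and animatability. Existing methods either focus solely on rendering and lack animation support, or learn a pose-appearance mapping for animation, which degrades under limited training poses or complex clothing. The key is a "record-retrieve-reconstruct" (R3) strategy: a temporal codebook records temporal appearance variations to ensure high-fidelity rendering from novel views, while novel poses retrieve the corresponding timestamps by matching the most similar training poses to augment appearance. This mitigates the loss of visual quality in scenarios with limited training poses and complex clothing.

Link: https://arxiv.org/abs/2503.12751
Authors: Yifan Zhan,Wangze Xu,Qingtian Zhu,Muyao Niu,Mingze Ma,Yifei Liu,Zhihang Zhong,Xiao Sun,Yinqiang Zheng
Affiliations: Shanghai Artificial Intelligence Laboratory; The University of Tokyo; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: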

Abstract:We present R3-Avatar, incorporating a temporal codebook, to overcome the inability of human avatars to be both animatable and of high-fidelity rendering quality. Existing video-based reconstruction of 3D human avatars either focuses solely on rendering, lacking animation support, or learns a pose-appearance mapping for animating, which degrades under limited training poses or complex clothing. In this paper, we adopt a “record-retrieve-reconstruct” strategy that ensures high-quality rendering from novel views while mitigating degradation in novel poses. Specifically, disambiguating timestamps record temporal appearance variations in a codebook, ensuring high-fidelity novel-view rendering, while novel poses retrieve corresponding timestamps by matching the most similar training poses for augmented appearance. Our R3-Avatar outperforms cutting-edge video-based human avatar reconstruction, particularly in overcoming visual quality degradation in extreme scenarios with limited training human poses and complex clothing.
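The "retrieve" step, in which a novel pose looks up the timestamp of the most similar training pose, can be sketched as a nearest-neighbor search. The abstract does not state the pose representation or the similarity measure, so the flat pose vectors and squared Euclidean distance below are assumptions, as are the function names.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length pose vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def retrieve_timestamp(query_pose, recorded):
    """Return the timestamp of the recorded training pose closest to the query.

    recorded: list of (timestamp, pose_vector) pairs seen during training.
    """
    timestamp, _ = min(recorded, key=lambda item: squared_distance(query_pose, item[1]))
    return timestamp
```

The retrieved timestamp would then index the temporal codebook to fetch the appearance recorded at that moment. That lookup is outside this sketch.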

[CV-135] ProtoDepth: Unsupervised Continual Depth Completion with Prototypes CVPR2025

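[Quick read]: This paper addresses catastrophic forgetting in continual learning of unsupervised depth completion: when trained on new non-stationary data distributions, depth completion models lose previously learned information. The proposed prototype-based method, ProtoDepth, learns prototype sets that adapt the latent features of a frozen pretrained model to new domains without modifying the original weights. To handle the case where the domain identity is unknown at test time, the paper further learns domain descriptors that let the model select the appropriate prototype set for inference. ProtoDepth reduces forgetting relative to baselines by 52.2% on indoor and 53.2% on outdoor benchmark sequences, achieving the state of the art.

Link: https://arxiv.org/abs/2503.12745
Authors: Patrick Rim,Hyoungseob Park,S. Gangopadhyay,Ziyao Zeng,Younjoon Chung,Alex Wong
Affiliations: Yale Vision Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025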

Abstract:We present ProtoDepth, a novel prototype-based approach for continual learning of unsupervised depth completion, the multimodal 3D reconstruction task of predicting dense depth maps from RGB images and sparse point clouds. The unsupervised learning paradigm is well-suited for continual learning, as ground truth is not needed. However, when training on new non-stationary distributions, depth completion models will catastrophically forget previously learned information. We address forgetting by learning prototype sets that adapt the latent features of a frozen pretrained model to new domains. Since the original weights are not modified, ProtoDepth does not forget when test-time domain identity is known. To extend ProtoDepth to the challenging setting where the test-time domain identity is withheld, we propose to learn domain descriptors that enable the model to select the appropriate prototype set for inference. We evaluate ProtoDepth on benchmark dataset sequences, where we reduce forgetting compared to baselines by 52.2% for indoor and 53.2% for outdoor to achieve the state of the art.
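When the test-time domain identity is withheld, the model has to pick a prototype set on its own. A minimal sketch of one way to do that is to compare a feature summary of the input against one descriptor per domain and choose the best match. The abstract does not specify how descriptors are compared, so cosine similarity, the dictionary layout, and the names below are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_prototype_set(feature, descriptors):
    """Pick the domain whose descriptor best matches the input feature.

    descriptors: dict mapping a domain name to its (hypothetical) descriptor vector.
    """
    return max(descriptors, key=lambda name: cosine(feature, descriptors[name]))
```

Because the pretrained weights stay frozen, adding a domain in this picture only adds one more descriptor and one more prototype set. Existing entries are left untouched, which is the property the abstract credits for avoiding forgetting.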

[CV-136] Stereo Event-based 6-DOF Pose Tracking for Uncooperative Spacecraft

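[Quick read]: This paper addresses pose tracking of uncooperative spacecraft, an essential technology for space exploration and on-orbit servicing that remains an open problem. It proposes a line-based pose tracking method that uses a stereo event camera to overcome limitations of conventional cameras such as motion blur and extreme illumination. The key steps are to first estimate a wireframe model of the uncooperative spacecraft using the spatio-temporal consistency of stereo event streams, then establish correspondences between events and the projected lines of the spacecraft, and finally formulate pose tracking as a continuous optimization over 6-DOF motion parameters that minimizes event-line distances. The paper also constructs a stereo event-based uncooperative spacecraft motion dataset with both simulated and real events, on which experiments show improved effectiveness and accuracy over competing methods.

Link: https://arxiv.org/abs/2503.12732
Authors: Zibin Liu,Banglei Guan,Yang Shang,Yifei Bian,Pengju Sun,Qifeng Yu
Affiliations: College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Geoscience and Remote Sensing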

Abstract:Pose tracking of uncooperative spacecraft is an essential technology for space exploration and on-orbit servicing, which remains an open problem. Event cameras possess numerous advantages, such as high dynamic range, high temporal resolution, and low power consumption. These attributes hold the promise of overcoming challenges encountered by conventional cameras, including motion blur and extreme illumination, among others. To address the standard on-orbit observation missions, we propose a line-based pose tracking method for uncooperative spacecraft utilizing a stereo event camera. To begin with, we estimate the wireframe model of uncooperative spacecraft, leveraging the spatio-temporal consistency of stereo event streams for line-based reconstruction. Then, we develop an effective strategy to establish correspondences between events and projected lines of uncooperative spacecraft. Using these correspondences, we formulate the pose tracking as a continuous optimization process over 6-DOF motion parameters, achieved by minimizing event-line distances. Moreover, we construct a stereo event-based uncooperative spacecraft motion dataset, encompassing both simulated and real events. The proposed method is quantitatively evaluated through experiments conducted on our self-collected dataset, demonstrating an improvement in terms of effectiveness and accuracy over competing methods. The code will be open-sourced at this https URL.
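The quantity being minimized, the event-line distance, can be sketched in the image plane as the perpendicular distance from an event's pixel location to a projected wireframe line. The nearest-line assignment below is a simple stand-in for the paper's correspondence strategy, which the abstract does not detail. The function names and the squared-sum cost are assumptions.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from 2D point p to the infinite line through a and b."""
    ax, ay = a
    bx, by = b
    px, py = p
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / math.hypot(bx - ax, by - ay)


def event_line_cost(events, lines):
    """Sum of squared distances from each event to its nearest projected line.

    events: list of (x, y) pixel locations.
    lines:  list of ((x1, y1), (x2, y2)) projected line endpoints for one candidate pose.
    """
    return sum(min(point_line_distance(e, a, b) for a, b in lines) ** 2 for e in events)
```

A tracker in this style would re-project the wireframe under each candidate 6-DOF pose and prefer the pose with the lowest cost. The projection and the optimizer are omitted here.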

[CV-137] Navigating Heat Exposure: Simulation of Route Planning Based on Visual Language Model Agents

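[Quick read]: This paper addresses the failure of existing methods, such as agent-based modeling (ABM) and empirical measurements, to account for individual physiological variations and environmental perception mechanisms under thermal stress, which leaves a lack of human-centred, heat-adaptive pedestrian routing suggestions. The key solution is a new framework called PPPM (Persona-Perception-Planning-Memory), driven by a Vision Language Model, which integrates street view imagery and urban network topology to simulate heat-adaptive pedestrian routing. Through structured prompt engineering on the Gemini-2.0 model, eight distinct heat-sensitive personas are created to model mobility behavior during heat exposure, with empirical validation through a questionnaire survey. Results show that the framework captures inter-persona variations, agrees closely with observed route preferences, and highlights differences in the factors driving agents' decisions. The method is also highly cost-effective, at 0.006 USD and 47.81 s per simulated route. This AIGC methodology provides actionable insights for climate-resilient urban planning through high-resolution simulation of thermal-responsive mobility patterns.

Link: https://arxiv.org/abs/2503.12731
Authors: Haoran Ma,Kaihan Zhang,Jiannan Cai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures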

Abstract:Heat exposure significantly influences pedestrian routing behaviors. Existing methods such as agent-based modeling (ABM) and empirical measurements fail to account for individual physiological variations and environmental perception mechanisms under thermal stress. This results in a lack of human-centred, heat-adaptive routing suggestions. To address these limitations, we propose a novel Vision Language Model (VLM)-driven Persona-Perception-Planning-Memory (PPPM) framework that integrates street view imagery and urban network topology to simulate heat-adaptive pedestrian routing. Through structured prompt engineering on the Gemini-2.0 model, eight distinct heat-sensitive personas were created to model mobility behaviors during heat exposure, with empirical validation through a questionnaire survey. Results demonstrate that simulation outputs effectively capture inter-persona variations, achieving highly significant congruence with observed route preferences and highlighting differences in the factors driving agents' decisions. Our framework is highly cost-effective, with simulations costing 0.006 USD and taking 47.81 s per route. This Artificial Intelligence-Generated Content (AIGC) methodology advances urban climate adaptation research by enabling high-resolution simulation of thermal-responsive mobility patterns, providing actionable insights for climate-resilient urban planning.

[CV-138] GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching

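[Quick read]: This paper tackles the challenge of generating high-quality stereo images, in particular the difficulty existing methods have in delivering visual quality and geometric accuracy at the same time. It proposes GenStereo, a diffusion-based approach with two key innovations: (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, which yields more precise stereo alignment than previous methods; and (2) an adaptive fusion mechanism that combines the diffusion-generated image with the warped image, improving both realism and disparity consistency. Trained on 11 diverse stereo datasets, GenStereo shows strong generalization and achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching. The method produces high-quality stereo images without complex hardware setups, which makes it valuable for real-world applications and unsupervised learning.

Link: https://arxiv.org/abs/2503.12720
Authors: Feng Qiao, Zhexiao Xiong, Eric Xing, Nathan Jacobs
Affiliations: Washington University in St. Louis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is available at this https URL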

Abstract:Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. Project page is available at this https URL
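
The warped input image mentioned in the abstract can be illustrated with standard rectified-stereo geometry, where a pixel at column x in the left view appears at column x - d in the right view for disparity d. The following is a minimal Python sketch under that assumption, not GenStereo's actual pipeline: the function name `warp_row_left_to_right`, the integer disparities and the single-scanline setup are all hypothetical simplifications.

```python
def warp_row_left_to_right(row, disparity, fill=None):
    """Forward-warp one scanline of a rectified left image into the right view.

    row:       pixel values of the left scanline
    disparity: non-negative integer disparity for each left pixel
    fill:      value used where no left pixel lands (a disocclusion hole)
    """
    width = len(row)
    out = [fill] * width
    best = [-1] * width  # largest disparity written so far at each target column
    for x, (value, d) in enumerate(zip(row, disparity)):
        xr = x - d  # rectified stereo: the right-view column is shifted left by d
        # When several pixels land on one column, keep the nearest surface,
        # i.e. the one with the largest disparity.
        if 0 <= xr < width and d > best[xr]:
            out[xr] = value
            best[xr] = d
    return out


warped = warp_row_left_to_right(["a", "b", "c", "d"], [0, 0, 2, 1])
print(warped)  # ['c', 'b', 'd', None]
```

Columns left as `None` are disoccluded regions with no source pixel, which is where a generative model would have to synthesize content.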

[CV-139] SatDepth: A Novel Dataset for Satellite Image Matching

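[Quick read]: This paper addresses the limits of existing deep-learning-based image matching methods when applied to satellite imagery: they were designed mainly for ground-based images and have not been explored for satellite images. Its key contribution is the SatDepth dataset, which provides dense ground-truth correspondences for training image matching frameworks meant specifically for satellite images, together with a novel image rotation augmentation strategy that handles the variability caused by different viewing angles and multiple revisits. The core of the solution is this rotation augmentation procedure, which allows corresponding pixels to be found even under large rotational differences between images and raises model precision by up to 40%.

Link: https://arxiv.org/abs/2503.12706
Authors: Rahul Deshmukh, Avinash Kak
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: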

Abstract:Recent advances in deep-learning based methods for image matching have demonstrated their superiority over traditional algorithms, enabling correspondence estimation in challenging scenes with significant differences in viewing angles, illumination and weather conditions. However, the existing datasets, learning frameworks, and evaluation metrics for the deep-learning based methods are limited to ground-based images recorded with pinhole cameras and have not been explored for satellite images. In this paper, we present "SatDepth", a novel dataset that provides dense ground-truth correspondences for training image matching frameworks meant specifically for satellite images. Satellites capture images from various viewing angles and tracks through multiple revisits over a region. To manage this variability, we propose a dataset balancing strategy through a novel image rotation augmentation procedure. This procedure allows for the discovery of corresponding pixels even in the presence of large rotational differences between the images. We benchmark four existing image matching frameworks using our dataset and carry out an ablation study that confirms that the models trained with our dataset with rotation augmentation outperform (up to 40% increase in precision) the models trained with other datasets, especially when there exist large rotational differences between the images.

[CV-140] AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration

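[Quick read]: This paper addresses the calibration of a camera, whatever its model, from a single in-the-wild image. Existing methods are typically tailored to specific camera models or rely on extrinsic cues visible in the image, such as the direction of gravity; the proposed method needs neither. The key is to frame calibration as the regression of the ray corresponding to each pixel and to show that this intermediate representation allows model-agnostic, closed-form recovery of the intrinsics for a wide range of camera models, including pinhole, Brown-Conrady and Kannala-Brandt. The method also handles cropped or stretched images. Experiments show that AnyCalib outperforms alternative methods, including 3D foundation models, despite being trained on far less data.

Link: https://arxiv.org/abs/2503.12701
Authors: Javier Tirado-Garín, Javier Civera
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: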

Abstract:We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited – cropped and stretched – images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at this https URL.
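
To make the closed-form claim concrete for the simplest case, the sketch below recovers pinhole intrinsics from per-pixel rays by ordinary least squares. It assumes a skew-free pinhole model, where a ray (x, y, z) through pixel (u, v) satisfies u = fx * x/z + cx and v = fy * y/z + cy. The helper names `fit_line` and `pinhole_from_rays` are hypothetical, and the paper's own derivation for distortion models such as Brown-Conrady or Kannala-Brandt is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pinhole_from_rays(pixels, rays):
    """Recover (fx, fy, cx, cy) from pixels [(u, v)] and rays [(x, y, z)], z > 0."""
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    # Normalised image coordinates; the length of each ray cancels out.
    xn = [x / z for x, _, z in rays]
    yn = [y / z for _, y, z in rays]
    fx, cx = fit_line(xn, us)  # u = fx * (x / z) + cx
    fy, cy = fit_line(yn, vs)  # v = fy * (y / z) + cy
    return fx, fy, cx, cy


# Synthetic check: generate rays from known intrinsics, then recover them.
true_intrinsics = (500.0, 480.0, 320.0, 240.0)  # fx, fy, cx, cy (made-up values)
t_fx, t_fy, t_cx, t_cy = true_intrinsics
pixels = [(float(u), float(v)) for u in range(0, 640, 80) for v in range(0, 480, 60)]
rays = [((u - t_cx) / t_fx, (v - t_cy) / t_fy, 1.0) for u, v in pixels]
estimate = pinhole_from_rays(pixels, rays)
```

With noise-free rays the estimate matches the true intrinsics up to floating-point error; with noisy predicted rays the same fit would act as an averaging step.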

[CV-141] MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization

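[Quick read]: This paper targets two key problems in video identity customization: (1) identity degradation over extended video length, and (2) reduced dynamics during training. Both stem mainly from the reliance of existing methods on traditional self-reconstruction training with static images. To address them, the paper proposes a new framework named MagicID. Its key idea is to replace traditional self-reconstruction with pairwise preference video data that carries explicit identity and dynamic rewards, which directly promotes identity-consistent and dynamically rich videos that match user preferences. It also introduces a hybrid sampling strategy that first prioritizes identity preservation by using static videos derived from reference images, then improves the dynamic quality of generated videos with a Frontier-based sampling method. By optimizing the model to align with the reward differences between customized preference pairs, MagicID improves both identity consistency and natural dynamics and surpasses existing methods across various metrics.

Link: https://arxiv.org/abs/2503.12689
Authors: Hengjia Li, Lifan Jiang, Xi Xiao, Tianyang Wang, Hongwei Yi, Boxi Wu, Deng Cai
Affiliations: Zhejiang University; University of Alabama at Birmingham; Hedra AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: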

Abstract:Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit significant dynamics based on users’ reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, primarily due to their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce MagicID, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, we propose constructing pairwise preference video data with explicit identity and dynamic rewards for preference learning, instead of sticking to the traditional self-reconstruction. To address the constraints of customized preference data, we introduce a hybrid sampling strategy. This approach first prioritizes identity preservation by leveraging static videos derived from reference images, then enhances dynamic motion quality in the generated videos using a Frontier-based sampling method. By utilizing these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.

[CV-142] Dynamic Angle Selection in X-Ray CT: A Reinforcement Learning Approach to Optimal Stopping

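[Quick read]: This paper addresses the need for rapid in-line inspection in industrial X-ray computed tomography (CT), focusing on sparse-angle CT, which reduces the number of projections required and so accelerates processing and conserves resources. Existing methods usually seek a balance between reconstruction quality and scanning time and rely on fixed scan durations, without adapting the number of projection angles to actual needs. The key innovation is to bring optimal stopping into the sequential Optimal Experimental Design (OED) framework: the paper proposes a new method for computing the policy gradient within the Actor-Critic framework, enabling adaptive policies for both angle selection and scan termination. The study also examines the gap between simulation and real-world application for learning-based methods and shows that a model trained on synthetic data performs reliably on real data. The approach makes CT operation more flexible and extends the applicability of sparse-angle CT in industrial settings.

Link: https://arxiv.org/abs/2503.12688
Authors: Tianyuan Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: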

Abstract:In industrial X-ray Computed Tomography (CT), the need for rapid in-line inspection is critical. Sparse-angle tomography plays a significant role in this by reducing the required number of projections, thereby accelerating processing and conserving resources. Most existing methods aim to balance reconstruction quality and scanning time, typically relying on fixed scan durations. Adaptive adjustment of the number of angles is essential; for instance, more angles may be required for objects with complex geometries or noisier projections. The concept of optimal stopping, which dynamically adjusts this balance according to varying industrial needs, remains underutilized. Building on our previous work, we integrate optimal stopping into sequential Optimal Experimental Design (OED). We propose a novel method for computing the policy gradient within the Actor-Critic framework, enabling the development of adaptive policies for informative angle selection and scan termination. Additionally, we investigated the gap between simulation and real-world applications in the context of the developed learning-based method. Our trained model, developed using synthetic data, demonstrates reliable performance when applied to real-world data. This approach enhances the flexibility of CT operations and expands the applicability of sparse-angle tomography in industrial settings.

[CV-143] Domain Generalization for Improved Human Activity Recognition in Office Space Videos Using Adaptive Pre-processing

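[Quick read]: This paper addresses cross-domain video activity recognition, where training and test data come from different domains and conventional methods struggle to adapt to unseen domains. Its central goal is to improve performance on unseen domains through a domain generalization approach. The key is three pre-processing techniques, applicable to any video encoder, that make models more robust to environmental variation. The paper validates them with video classification models such as MViT, reporting significant gains in accuracy, precision, recall and F1 score and demonstrating adaptability in real-world scenarios. The method lays a foundation for more reliable video activity recognition across heterogeneous data domains.

Link: https://arxiv.org/abs/2503.12678
Authors: Partho Ghosh, Raisa Bentay Hossain, Mohammad Zunaed, Taufiq Hasan
Affiliations: mHealth lab, Department of Biomedical Engineering, Bangladesh University of Engineering and Technology (BUET); Center for Bioengineering Innovation and Design (CBID), Department of Biomedical Engineering, Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: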

Abstract:Automatic video activity recognition is crucial across numerous domains like surveillance, healthcare, and robotics. However, recognizing human activities from video data becomes challenging when training and test data stem from diverse domains. Domain generalization, adapting to unforeseen domains, is thus essential. This paper focuses on office activity recognition amidst environmental variability. We propose three pre-processing techniques applicable to any video encoder, enhancing robustness against environmental variations. Our study showcases the efficacy of MViT, a leading state-of-the-art video classification model, and other video encoders combined with our techniques, outperforming state-of-the-art domain adaptation methods. Our approach significantly boosts accuracy, precision, recall and F1 score on unseen domains, emphasizing its adaptability in real-world scenarios with diverse video data sources. This method lays a foundation for more reliable video activity recognition systems across heterogeneous data domains.

[CV-144] UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

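[Quick read]: This paper addresses the fact that existing Text-to-Image (T2I) diffusion models need multiple parameter sets and task-specific architectures to handle diverse tasks. To support many image generation tasks with a single model, it proposes UniVG, a generalist diffusion model that treats multi-modal inputs as unified conditions and supports downstream applications including T2I generation, inpainting, instruction-based editing, identity-preserving generation, layout-guided generation, depth estimation and referring segmentation. The key lies in a systematic study of data mixing and multi-task training, which shows that T2I generation and other tasks (such as instruction-based editing) can coexist without performance trade-offs, and that auxiliary tasks (such as depth estimation and referring segmentation) further improve image editing. The resulting model outperforms some task-specific models on their respective benchmarks.

Link: https://arxiv.org/abs/2503.12652
Authors: Tsu-Jui Fu, Yusu Qian, Chen Chen, Wenze Hu, Zhe Gan, Yinfei Yang
Affiliations: Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: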

Abstract:Text-to-Image (T2I) diffusion models have shown impressive results in generating visually compelling images following user prompts. Building on this, various methods further fine-tune the pre-trained T2I model for specific tasks. However, this requires separate model architectures, training designs, and multiple parameter sets to handle different tasks. In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. UniVG treats multi-modal inputs as unified conditions to enable various downstream applications, ranging from T2I generation, inpainting, instruction-based editing, identity-preserving generation, and layout-guided generation, to depth estimation and referring segmentation. Through comprehensive empirical studies on data mixing and multi-task training, we provide detailed insights into the training processes and decisions that inform our final designs. For example, we show that T2I generation and other tasks, such as instruction-based editing, can coexist without performance trade-offs, while auxiliary tasks like depth estimation and referring segmentation enhance image editing. Notably, our model can even outperform some task-specific models on their respective benchmarks, marking a significant step towards a unified image generation model.

[CV-145] MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

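[Quick read]: This paper addresses dynamic emotion recognition through dimensional modeling of affect, that is, modeling emotional valence and arousal. It proposes a new architecture named MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network). Its key component is a bi-directional cross-modal attention mechanism with six distinct attention pathways, which enables comprehensive interaction among the visual, audio and textual modalities. MAVEN uses modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments and transcripts, and applies a cross-modal enhancement strategy in which each modality's representation is refined through weighted attention from the other modalities, followed by self-attention refinement through modality-specific encoders. Rather than predicting valence and arousal values directly, MAVEN predicts emotions in polar coordinate form, in line with the psychological circumplex model of emotion. Evaluation on the Aff-Wild2 dataset, measured with the Concordance Correlation Coefficient (CCC), indicates that the method outperforms the existing state of the art (SOTA).

Link: https://arxiv.org/abs/2503.12623
Authors: Vrushank Ahire, Kunal Shah, Mudasir Nazir Khan, Nikhil Pakhale, Lownish Rai Sookha, M. A. Ganaie, Abhinav Dhall
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Ropar, Punjab, India; Faculty of Information Technology, Monash University, Melbourne, Australia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: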

Abstract:This paper introduces MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model uniquely integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism with six distinct attention pathways, enabling comprehensive interactions between all modality pairs. Our proposed approach employs modality-specific encoders to extract rich feature representations from synchronized video frames, audio segments, and transcripts. The architecture’s novelty lies in its cross-modal enhancement strategy, where each modality representation is refined through weighted attention from other modalities, followed by self-attention refinement through modality-specific encoders. Rather than directly predicting valence-arousal values, MAVEN predicts emotions in a polar coordinate form, aligning with psychological models of the emotion circumplex. Experimental evaluation on the Aff-Wild2 dataset demonstrates the effectiveness of our approach, with performance measured using Concordance Correlation Coefficient (CCC). The multi-stage architecture demonstrates superior ability to capture the complex, nuanced nature of emotional expressions in conversational videos, advancing the state-of-the-art (SOTA) in continuous emotion recognition in-the-wild. Code can be found at: this https URL.
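
Two quantities in this abstract have standard definitions that are easy to state in code: the Concordance Correlation Coefficient used for evaluation, and the polar form of a valence-arousal pair. The Python sketch below uses the textbook CCC formula and a plain Cartesian-to-polar conversion. The function names are hypothetical, and the exact polar parametrization MAVEN uses is not given in the abstract, so that part should be read as an assumption.

```python
import math


def concordance_cc(pred, target):
    """CCC = 2*cov / (var_pred + var_target + (mean_pred - mean_target)^2)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Population (divide-by-n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom


def to_polar(valence, arousal):
    """Map a (valence, arousal) point to (intensity, angle in radians)."""
    return math.hypot(valence, arousal), math.atan2(arousal, valence)


def from_polar(intensity, angle):
    """Inverse of to_polar: recover (valence, arousal)."""
    return intensity * math.cos(angle), intensity * math.sin(angle)
```

CCC penalizes both poor correlation and systematic offset: predictions [1, 2, 3] against targets [2, 3, 4] are perfectly correlated but score 4/7 because of the constant shift.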

[CV-146] Scaling Semantic Categories: Investigating the Impact on Vision Transformer Labeling Performance CVPR

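[Quick read]: This paper explores how scaling semantic categories affects the image classification performance of vision transformers (ViTs). It hypothesizes that as the number of ground-truth and artificially introduced semantically equivalent categories grows, ViT labeling accuracy improves until a theoretical maximum or limit is reached. To test this, a variety of image datasets were processed through a custom Python function that evaluates model accuracy, with adjustments for format differences between datasets. New redundant categories were introduced exponentially, and accuracy trends were tracked until they plateaued, decreased, or fluctuated inconsistently. The key contribution is that systematic scaling of semantic categories reveals the strengths and limits of category labeling strategies for ViTs and suggests possible directions for optimizing them.

Link: https://arxiv.org/abs/2503.12617
Authors: Anthony Lamelas, Harrison Muchnic
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 7 figures, submitted to CVPR (feedback pending)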

Abstract:This study explores the impact of scaling semantic categories on the image classification performance of vision transformers (ViTs). In this specific case, the CLIP server provided by Jina AI is used for experimentation. The research hypothesizes that as the number of ground truth and artificially introduced semantically equivalent categories increases, the labeling accuracy of ViTs improves until a theoretical maximum or limit is reached. A wide variety of image datasets were chosen to test this hypothesis. These datasets were processed through a custom function in Python designed to evaluate the model’s accuracy, with adjustments being made to account for format differences between datasets. By exponentially introducing new redundant categories, the experiment assessed accuracy trends until they plateaued, decreased, or fluctuated inconsistently. The findings show that while semantic scaling initially increases model performance, the benefits diminish or reverse after surpassing a critical threshold, providing insight into the limitations and possible optimization of category labeling strategies for ViTs.

[CV-147] LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization

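[Quick read]: This paper addresses two main problems: how to use text-to-image latent diffusion models (LDMs) in a Plug-and-Play (PnP), zero-shot manner to solve inverse problems in imaging, and the high computational cost of existing text-to-image PnP approaches. It proposes a novel PnP inference paradigm designed for embedding generative models within stochastic inverse solvers, with particular attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. On this basis it proposes LATINO (LAtent consisTency INverse sOlver), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. The key is a conditioning mechanism that avoids automatic differentiation and reaches state-of-the-art quality in as few as 8 neural function evaluations. The paper further embeds LATINO in an empirical Bayesian framework that automatically calibrates the text prompt by marginal maximum likelihood estimation. Experiments show that this prompt self-calibration greatly improves estimation, allowing LATINO with prompt optimization to set new SOTAs in image reconstruction quality and computational efficiency.

Link: https://arxiv.org/abs/2503.12615
Authors: Alessio Spagnoletti, Jean Prost, Andrés Almansa, Nicolas Papadakis, Marcelo Pereyra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages, 20 figures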

Abstract:Text-to-image latent diffusion models (LDMs) have recently emerged as powerful generative models with great potential for solving inverse problems in imaging. However, leveraging such models in a Plug-and-Play (PnP), zero-shot manner remains challenging because it requires identifying a suitable text prompt for the unknown image of interest. Also, existing text-to-image PnP approaches are highly computationally expensive. We herein address these challenges by proposing a novel PnP inference paradigm specifically designed for embedding generative models within stochastic inverse solvers, with special attention to Latent Consistency Models (LCMs), which distill LDMs into fast generators. We leverage our framework to propose LAtent consisTency INverse sOlver (LATINO), the first zero-shot PnP framework to solve inverse problems with priors encoded by LCMs. Our conditioning mechanism avoids automatic differentiation and reaches SOTA quality in as little as 8 neural function evaluations. As a result, LATINO delivers remarkably accurate solutions and is significantly more memory and computationally efficient than previous approaches. We then embed LATINO within an empirical Bayesian framework that automatically calibrates the text prompt from the observed measurements by marginal maximum likelihood estimation. Extensive experiments show that prompt self-calibration greatly improves estimation, allowing LATINO with PRompt Optimization to define new SOTAs in image reconstruction quality and computational efficiency.

[CV-148] VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility

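[Quick read]: This paper addresses the challenges that visibility constraints pose for grasping in severely occluded environments. When the target cannot be grasped directly, the system must plan a Next-Best-View (NBV) to make a successful grasp possible. The key is VISO-Grasp, a new vision-language-informed system that uses Foundation Models (FMs) for spatial reasoning and active view planning, building and continually updating an instance-centric representation of spatial relationships to improve grasp success under challenging occlusions. A multi-view uncertainty-driven grasp fusion mechanism further refines grasp confidence and directional uncertainty, supporting robust and stable grasp execution. According to the authors, this is the first unified framework to integrate FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusion or complete invisibility.

Link: https://arxiv.org/abs/2503.12609
Authors: Yitian Shi, Di Wen, Guanqi Chen, Edgar Welte, Sheng Liu, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review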

Abstract:We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of 87.5% in target-oriented grasping with the fewest grasp attempts outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints.

[CV-149] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

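[Quick read]: This paper addresses the lack of a systematic survey of Multimodal Chain-of-Thought (MCoT) reasoning, aiming to fill the gap in comprehensive summaries of methods, challenges and future directions. Its key contribution is the first systematic survey of MCoT reasoning: it clarifies the relevant foundational concepts and definitions, builds a comprehensive taxonomy, and analyzes current methods in depth from multiple perspectives across application scenarios. It also discusses existing challenges and points to future research directions, with the aim of advancing multimodal AGI.

Link: https://arxiv.org/abs/2503.12605
Authors: Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei
Affiliations: NUS (National University of Singapore); CUHK (The Chinese University of Hong Kong); UCSB (University of California, Santa Barbara); NTU (Nanyang Technological University); UR (University of Rochester)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Survey resource at this https URL; 12 figures, 4 tables, 44 pages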

Abstract:By extending the advantage of chain-of-thought (CoT) reasoning in human-like step-by-step processes to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in the integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of image, video, speech, audio, 3D, and structured data across different modalities, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further focus to ensure consistent thriving in this field, where, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.

[CV-150] Point Cloud Based Scene Segmentation: A Survey

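[Quick read]: This paper surveys point cloud semantic segmentation methods for autonomous driving, with the goal of providing more precise and detailed information about the vehicle's surroundings. Unlike 3D object detection, which only predicts bounding boxes for objects, point cloud semantic segmentation assigns a label to every point and so yields a richer, denser description of the environment, which is essential for autonomous driving tasks such as navigation and lane changes. The paper categorizes current state-of-the-art methods into projection-based, 3D-based and hybrid approaches, discusses the most important datasets for the task, and emphasizes the role of synthetic data when real-world data is limited. It also presents the results of the different methods and compares them in terms of segmentation accuracy and efficiency.

Link: https://arxiv.org/abs/2503.12595
Authors: Dan Halperin, Niklas Eisl
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: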

Abstract:Autonomous driving is a safety-critical application, and it is therefore a top priority that the accompanying assistance systems are able to provide precise information about the surrounding environment of the vehicle. Tasks such as 3D Object Detection deliver an insufficiently detailed understanding of the surrounding scene because they only predict a bounding box for foreground objects. In contrast, 3D Semantic Segmentation provides richer and denser information about the environment by assigning a label to each individual point, which is of paramount importance for autonomous driving tasks, such as navigation or lane changes. To inspire future research, in this review paper, we provide a comprehensive overview of the current state-of-the-art methods in the field of Point Cloud Semantic Segmentation for autonomous driving. We categorize the approaches into projection-based, 3D-based and hybrid methods. Moreover, we discuss the most important and commonly used datasets for this task and also emphasize the importance of synthetic data to support research when real-world data is limited. We further present the results of the different methods and compare them with respect to their segmentation accuracy and efficiency.

[CV-151] Personalize Anything for Free with Diffusion Transformer

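[Quick read]: This paper addresses identity preservation, applicability and compatibility with diffusion transformers (DiTs) in personalized image generation. Recent training-free methods are more computationally efficient than training-based ones but still struggle on these fronts. The key insight is an untapped capability of DiTs: simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. Building on this observation, the authors propose Personalize Anything, a training-free framework for personalized image generation in DiT that relies on two components: (1) timestep-adaptive token replacement, which enforces subject consistency through early-stage injection and adds flexibility through late-stage regularization; and (2) patch perturbation strategies that increase structural diversity. The method supports layout-guided generation, multi-subject personalization and mask-controlled editing, and shows state-of-the-art performance in identity preservation and versatility.

Link: https://arxiv.org/abs/2503.12590
Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng
Affiliations: Tsinghua University; Beihang University; Renmin University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL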

Abstract:Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.

[CV-152] Progressive Limb-Aware Virtual Try-On ACM-MM2022

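[Quick read]: This paper targets two main problems in existing image-based virtual try-on methods. First, they do not use clothing attributes to refine the geometry and texture of the transferred clothing, which leaves clothing appearances incomplete and blurred. Second, they usually mask the limb textures of the input image to build a clothing-agnostic person representation, which leads to inaccurate predictions for human limb regions (such as exposed arm skin), especially when switching between long-sleeved and short-sleeved garments. To address these problems, the paper proposes PL-VTON, a progressive virtual try-on framework. Its key components are a Multi-attribute Clothing Warping (MCW) module, which uses a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements; a Human Parsing Estimator (HPE), which semantically segments the person to provide structural constraints on the body and so reduces texture bleeding between clothing and limb regions; and a Limb-aware Texture Fusion (LTF) module, which fuses clothing and body textures under the guidance of explicit limb-aware features to estimate high-quality limb details. Experiments show that the method outperforms state-of-the-art virtual try-on methods both qualitatively and quantitatively.

Link: https://arxiv.org/abs/2503.12588
Authors: Xiaoyu Han, Shengping Zhang, Qinglin Liu, Zonglin Li, Chenyang Wang
Affiliations: Harbin Institute of Technology, Weihai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2022. The code is available at this https URL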

Abstract:Existing image-based virtual try-on methods directly transfer specific clothing to a human image without utilizing clothing attributes to refine the transferred clothing geometry and textures, which causes incomplete and blurred clothing appearances. In addition, these methods usually mask the limb textures of the input for the clothing-agnostic person representation, which results in inaccurate predictions for human limb regions (i.e., the exposed arm skin), especially when transforming between long-sleeved and short-sleeved garments. To address these problems, we present a progressive virtual try-on framework, named PL-VTON, which performs pixel-level clothing warping based on multiple attributes of clothing and embeds explicit limb-aware features to generate photo-realistic try-on results. Specifically, we design a Multi-attribute Clothing Warping (MCW) module that adopts a two-stage alignment strategy based on multiple attributes to progressively estimate pixel-level clothing displacements. A Human Parsing Estimator (HPE) is then introduced to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and limb regions. Finally, we propose a Limb-aware Texture Fusion (LTF) module to estimate high-quality details in limb regions by fusing textures of the clothing and the human body with the guidance of explicit limb-aware features. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art virtual try-on methods both qualitatively and quantitatively. The code is available at this https URL.

[CV-153] BalancedDPO: Adaptive Multi-Metric Alignment

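Quick read: This paper addresses the persistent challenge of aligning text-to-image (T2I) diffusion models with diverse user preferences. Current methods typically optimize a single metric or rely on narrowly curated datasets, which leads to overfitting and limited generalization across key visual quality metrics. The paper proposes BalancedDPO, an extension of Direct Preference Optimization (DPO) that aligns the model with several metrics at once: human preference, CLIP score, and aesthetic quality. The key novelty is that it aggregates consensus labels from the different metrics in the preference distribution space, rather than mixing rewards as existing approaches do, which gives robust and scalable multi-metric alignment while keeping the simplicity of the standard DPO pipeline.

Link: https://arxiv.org/abs/2503.12575
Authors: Dipesh Tamboli, Souradip Chakraborty, Aditya Malusare, Biplab Banerjee, Amrit Singh Bedi, Vaneet Aggarwal
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: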

Abstract:Text-to-image (T2I) diffusion models have made remarkable advancements, yet aligning them with diverse preferences remains a persistent challenge. Current methods often optimize single metrics or depend on narrowly curated datasets, leading to overfitting and limited generalization across key visual quality metrics. We present BalancedDPO, a novel extension of Direct Preference Optimization (DPO) that addresses these limitations by simultaneously aligning T2I diffusion models with multiple metrics, including human preference, CLIP score, and aesthetic quality. Our key novelty lies in aggregating consensus labels from diverse metrics in the preference distribution space as compared to existing reward mixing approaches, enabling robust and scalable multi-metric alignment while maintaining the simplicity of the standard DPO pipeline that we refer to as BalancedDPO. Our evaluations on the Pick-a-Pic, PartiPrompt and HPD datasets show that BalancedDPO achieves state-of-the-art results, outperforming existing approaches across all major metrics. BalancedDPO improves the average win rates by 15%, 7.1%, and 10.3% on Pick-a-pic, PartiPrompt and HPD, respectively, from the DiffusionDPO.
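
The abstract contrasts aggregating consensus labels in preference space with mixing raw rewards. The sketch below illustrates that contrast under a loud assumption: consensus is taken here as a simple majority vote over per-metric preferences, which the abstract does not confirm. The function name and the metric names are hypothetical.

```python
def consensus_preference(scores_a, scores_b):
    """Return "a" or "b" by majority vote over per-metric preferences.

    scores_a / scores_b map a metric name to that metric's score for
    image A / image B (higher is better). A tie within a metric casts
    no vote. Majority voting is an assumption made for illustration.
    """
    votes_a = sum(1 for m in scores_a if scores_a[m] > scores_b[m])
    votes_b = sum(1 for m in scores_a if scores_a[m] < scores_b[m])
    if votes_a == votes_b:
        return None  # no consensus; a caller might drop this pair
    return "a" if votes_a > votes_b else "b"


# Hypothetical scores on very different scales.
image_a = {"human_pref": 0.62, "clip": 0.31, "aesthetic": 5.1}
image_b = {"human_pref": 0.55, "clip": 0.33, "aesthetic": 4.8}
print(consensus_preference(image_a, image_b))  # "a": two of three metrics prefer A
```

Because each metric contributes only a preference, the differing score scales never interact, whereas a weighted sum of raw scores would need per-metric normalization.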

[CV-154] Deblur Gaussian Splatting SLAM

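Quick read: This paper tackles recovering sharp 3D reconstructions from motion-blurred input. Its solution is Deblur-SLAM, a robust RGB SLAM pipeline that combines the strengths of frame-to-frame and frame-to-model approaches to model sub-frame camera trajectories, enabling high-fidelity reconstruction in motion-blurred settings. The pipeline models the physical image formation process of motion blur and minimizes the error between each observed blurry image and a rendered blurry image obtained by averaging sharp virtual sub-frame images. It also uses online loop closure and global bundle adjustment to obtain a dense and precise global trajectory, and pairs a monocular depth estimator with online deformation of Gaussians for precise mapping and better deblurring. Experiments show state-of-the-art sharp map estimation and sub-frame trajectory recovery on both synthetic and real-world motion-blurred data.

Link: https://arxiv.org/abs/2503.12572
Authors: Francesco Girlanda, Denys Rozumnyi, Marc Pollefeys, Martin R. Oswald
Affiliations: ETH Zürich; Microsoft; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: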

Abstract:We present Deblur-SLAM, a robust RGB SLAM pipeline designed to recover sharp reconstructions from motion-blurred inputs. The proposed method bridges the strengths of both frame-to-frame and frame-to-model approaches to model sub-frame camera trajectories that lead to high-fidelity reconstructions in motion-blurred settings. Moreover, our pipeline incorporates techniques such as online loop closure and global bundle adjustment to achieve a dense and precise global trajectory. We model the physical image formation process of motion-blurred images and minimize the error between the observed blurry images and rendered blurry images obtained by averaging sharp virtual sub-frame images. Additionally, by utilizing a monocular depth estimator alongside the online deformation of Gaussians, we ensure precise mapping and enhanced image deblurring. The proposed SLAM pipeline integrates all these components to improve the results. We achieve state-of-the-art results for sharp map estimation and sub-frame trajectory recovery both on synthetic and real-world blurry input data.
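
The blur formation model described in the abstract (a blurry frame as the average of sharp virtual sub-frames, compared against the observed frame) can be sketched in a few lines. This is a toy illustration on nested lists of grayscale values, not the paper's renderer; the function names and the choice of an L1 error are assumptions.

```python
def synthesize_blur(subframes):
    """Average sharp sub-frame images (lists of pixel rows) into one blurry image."""
    n = len(subframes)
    height, width = len(subframes[0]), len(subframes[0][0])
    return [[sum(f[y][x] for f in subframes) / n for x in range(width)]
            for y in range(height)]


def l1_error(observed, rendered):
    """Mean absolute pixel difference between two same-sized images."""
    diffs = [abs(o - r)
             for row_o, row_r in zip(observed, rendered)
             for o, r in zip(row_o, row_r)]
    return sum(diffs) / len(diffs)


# A bright pixel that moves one position during the exposure.
sharp_subframes = [[[1.0, 0.0]], [[0.0, 1.0]]]
blurry = synthesize_blur(sharp_subframes)
print(blurry)                            # [[0.5, 0.5]]
print(l1_error([[0.5, 0.5]], blurry))    # 0.0
```

In an optimization loop, the sub-frame images would be rendered from the map at poses along the sub-frame trajectory, and the error would be minimized with respect to both.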

[CV-155] GAN-Based Single-Stage Defense for Traffic Sign Classification Under Adversarial Patch Attack

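Quick read: This paper addresses the vulnerability of autonomous vehicle (AV) perception modules to adversarial attacks, in particular the adversarial patch attack (APA), which can cause traffic signs to be misclassified and lead to serious accidents. It proposes a single-stage defense based on a Generative Adversarial Network (GAN) for traffic sign classification. The defense needs no prior knowledge of the patch design, works across different classes of traffic signs, and is model-agnostic, so it can be applied to any traffic sign classifier. Compared with a classifier that has no defense, it improves classification accuracy under APA by up to 80.8% and raises overall accuracy across all traffic signs considered by 58%.

Link: https://arxiv.org/abs/2503.12567
Authors: Abyad Enan, Mashrur Chowdhury
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE Transactions on Intelligent Transportation Systems (T-ITS) for possible publication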

Abstract:Computer Vision plays a critical role in ensuring the safe navigation of autonomous vehicles (AVs). An AV perception module is responsible for capturing and interpreting the surrounding environment to facilitate safe navigation. This module enables AVs to recognize traffic signs, traffic lights, and various road users. However, the perception module is vulnerable to adversarial attacks, which can compromise their accuracy and reliability. One such attack is the adversarial patch attack (APA), a physical attack in which an adversary strategically places a specially crafted sticker on an object to deceive object classifiers. In APA, an adversarial patch is positioned on a target object, leading the classifier to misidentify it. Such an APA can cause AVs to misclassify traffic signs, leading to catastrophic incidents. To enhance the security of an AV perception system against APAs, this study develops a Generative Adversarial Network (GAN)-based single-stage defense strategy for traffic sign classification. This approach is tailored to defend against APAs on different classes of traffic signs without prior knowledge of a patch’s design. This study found this approach to be effective against patches of varying sizes. Our experimental analysis demonstrates that the defense strategy presented in this paper improves the classifier’s accuracy under APA conditions by up to 80.8% and enhances overall classification accuracy for all the traffic signs considered in this study by 58%, compared to a classifier without any defense mechanism. Our defense strategy is model-agnostic, making it applicable to any traffic sign classifier, regardless of the underlying classification model.

[CV-156] History-Aware Transformation of ReID Features for Multiple Object Tracking

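Quick read: This paper argues that existing multiple object tracking (MOT) methods rely too heavily on simple appearance-similarity computation for target matching, which overlooks the unique characteristics of MOT. The authors contend that finding a feature representation space suited to the sample distribution of each video sequence improves tracking. They propose a history-aware transformation: historical trajectory features serve as conditions, and a tailored Fisher Linear Discriminant (FLD) finds a spatial projection matrix that maximizes the separation between different trajectories. Experiments show that this training-free projection significantly boosts feature-only trackers, reaching competitive or even superior performance relative to state-of-the-art methods on several benchmarks, with impressive zero-shot transfer.

Link: https://arxiv.org/abs/2503.12562
Authors: Ruopeng Gao, Yuyao Wang, Chunxu Liu, Limin Wang
Affiliations: Nanjing University; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech report. Without bells and whistles, achieving 80.8 HOTA on SportsMOT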

Abstract:The aim of multiple object tracking (MOT) is to detect all objects in a video and bind them into multiple trajectories. Generally, this process is carried out in two steps: detecting objects and associating them across frames based on various cues and metrics. Many studies and applications adopt object appearance, also known as re-identification (ReID) features, for target matching through straightforward similarity calculation. However, we argue that this practice is overly naive and thus overlooks the unique characteristics of MOT tasks. Unlike regular re-identification tasks that strive to distinguish all potential targets in a general representation, multi-object tracking typically immerses itself in differentiating similar targets within the same video sequence. Therefore, we believe that seeking a more suitable feature representation space based on the different sample distributions of each sequence will enhance tracking performance. In this paper, we propose using history-aware transformations on ReID features to achieve more discriminative appearance representations. Specifically, we treat historical trajectory features as conditions and employ a tailored Fisher Linear Discriminant (FLD) to find a spatial projection matrix that maximizes the differentiation between different trajectories. Our extensive experiments reveal that this training-free projection can significantly boost feature-only trackers to achieve competitive, even superior tracking performance compared to state-of-the-art methods while also demonstrating impressive zero-shot transfer capabilities. This demonstrates the effectiveness of our proposal and further encourages future investigation into the importance and customization of ReID models in multiple object tracking. The code will be released at this https URL.
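
To make the Fisher Linear Discriminant idea concrete, here is a minimal sketch of the textbook two-class case, w = S_w^{-1}(mean_a - mean_b), restricted to 2-D features so it needs no linear algebra library. It is not the paper's tailored multi-trajectory FLD; the function name and the toy "trajectory features" are assumptions.

```python
def fisher_direction(class_a, class_b):
    """Two-class Fisher direction w = S_w^{-1} (mean_a - mean_b) for 2-D features."""
    def mean(points):
        n = len(points)
        return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

    def scatter(points, mu):
        sxx = sum((p[0] - mu[0]) ** 2 for p in points)
        sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
        syy = sum((p[1] - mu[1]) ** 2 for p in points)
        return sxx, sxy, syy

    mu_a, mu_b = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, mu_a)
    bxx, bxy, byy = scatter(class_b, mu_b)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # within-class scatter S_w
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter is singular")
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Closed-form inverse of the symmetric 2x2 matrix S_w applied to (dx, dy).
    return [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]


# Two toy trajectories: separated along x, noisy along y.
track_a = [(0.0, -2.0), (0.0, 2.0), (1.0, 0.0), (-1.0, 0.0)]
track_b = [(3.0, -2.0), (3.0, 2.0), (4.0, 0.0), (2.0, 0.0)]
print(fisher_direction(track_a, track_b))  # direction lies along the x axis
```

Projecting features onto this direction discards the noisy y component, which is the sense in which a sequence-specific projection can be more discriminative than raw similarity.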

[CV-157] Niagara: Normal-Integrated Geometric Affine Field for Scene Reconstruction from a Single View

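Quick read: This paper addresses the difficulty of capturing fine geometric details and keeping structural consistency in single-view 3D scene reconstruction, particularly for high-fidelity outdoor scenes. It proposes Niagara, a framework the authors describe as the first to faithfully reconstruct challenging outdoor scenes from a single input image. Niagara takes monocular depth and normal estimates as input, which substantially improves detail capture and mitigates geometric detail loss and deformation. It adds a geometric affine field (GAF) and 3D self-attention as geometric constraints, combining the structural properties of explicit geometry with the adaptability of implicit feature fields to balance efficient rendering against high-fidelity reconstruction. Finally, a specialized encoder-decoder architecture uses a depth-based 3D Gaussian decoder to predict 3D Gaussian parameters for novel view synthesis. Niagara surpasses prior state-of-the-art approaches such as Flash3D in both single-view and dual-view settings, especially in the geometric accuracy and visual fidelity of outdoor scenes.

Link: https://arxiv.org/abs/2503.12553
Authors: Xianzu Wu, Zhenxin Ai, Harry Yang, Ser-Nam Lim, Jun Liu, Huan Wang
Affiliations: Westlake University; Jiangxi University of Science and Technology; The Hong Kong University of Science and Technology; University of Central Florida; Lancaster University; Everlyn AI
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: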

Abstract:Recent advances in single-view 3D scene reconstruction have highlighted the challenges in capturing fine geometric details and ensuring structural consistency, particularly in high-fidelity outdoor scene modeling. This paper presents Niagara, a new single-view 3D scene reconstruction framework that can faithfully reconstruct challenging outdoor scenes from a single input image for the first time. Our approach integrates monocular depth and normal estimation as input, which substantially improves its ability to capture fine details, mitigating common issues like geometric detail loss and deformation. Additionally, we introduce a geometric affine field (GAF) and 3D self-attention as geometry-constraint, which combines the structural properties of explicit geometry with the adaptability of implicit feature fields, striking a balance between efficient rendering and high-fidelity reconstruction. Our framework finally proposes a specialized encoder-decoder architecture, where a depth-based 3D Gaussian decoder is proposed to predict 3D Gaussian parameters, which can be used for novel view synthesis. Extensive results and analyses suggest that our Niagara surpasses prior SoTA approaches such as Flash3D in both single-view and dual-view settings, significantly enhancing the geometric accuracy and visual fidelity, especially in outdoor scenes.

[CV-158] MTGS: Multi-Traversal Gaussian Splatting

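Quick read: This paper addresses challenges inherent in multi-traversal data for scene reconstruction, namely appearance variations and dynamic objects, which often lead to suboptimal reconstruction quality. It proposes Multi-Traversal Gaussian Splatting (MTGS), which reconstructs high-quality driving scenes by modeling a shared static geometry while handling dynamic elements and appearance variations separately. Concretely, it uses a multi-traversal dynamic scene graph with a shared static node and traversal-specific dynamic nodes, complemented by color correction nodes with learnable spherical harmonics coefficient residuals. This enables high-fidelity novel view synthesis and flexible navigation to arbitrary viewpoints. Compared with single-traversal baselines, MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3%.

Link: https://arxiv.org/abs/2503.12552
Authors: Tianyu Li, Yihang Qiu, Zhenhua Wu, Carl Lindström, Peng Su, Matthias Nießner, Hongyang Li
Affiliations: Shanghai Innovation Institute; OpenDriveLab and MMLab, The University of Hong Kong; Technical University of Munich; Chalmers University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: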

Abstract:Multi-traversal data, commonly collected through daily commutes or by self-driving fleets, provides multiple viewpoints for scene reconstruction within a road block. This data offers significant potential for high-quality novel view synthesis, which is crucial for applications such as autonomous vehicle simulators. However, inherent challenges in multi-traversal data often result in suboptimal reconstruction quality, including variations in appearance and the presence of dynamic objects. To address these issues, we propose Multi-Traversal Gaussian Splatting (MTGS), a novel approach that reconstructs high-quality driving scenes from arbitrarily collected multi-traversal data by modeling a shared static geometry while separately handling dynamic elements and appearance variations. Our method employs a multi-traversal dynamic scene graph with a shared static node and traversal-specific dynamic nodes, complemented by color correction nodes with learnable spherical harmonics coefficient residuals. This approach enables high-fidelity novel view synthesis and provides flexibility to navigate any viewpoint. We conduct extensive experiments on a large-scale driving dataset, nuPlan, with multi-traversal data. Our results demonstrate that MTGS improves LPIPS by 23.5% and geometry accuracy by 46.3% compared to single-traversal baselines. The code and data would be available to the public.

[CV-159] Grasping Partially Occluded Objects Using Autoencoder-Based Point Cloud Inpainting ECML KDD2022

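Quick read: This paper addresses robotic grasping of known or unknown objects in industrial production systems when partial occlusion of the target leaves too little information to compute grasping points. Occlusion is common in practice: supporting structures in the camera's field of view, sensor imprecision, or parts occluding each other. The paper presents an algorithm that reconstructs the missing information. The key is an inpainting method that recovers the occluded parts of the object, making robust object-matching approaches for grasping point calculation usable in real industrial settings. Integrated into an existing grasping system in a real-world industrial application, the solution drastically reduces the number of objects discarded because of occlusion.

Link: https://arxiv.org/abs/2503.12549
Authors: Alexander Koebler, Ralf Gross, Florian Buettner, Ingo Thon
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ECML PKDD 2022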

Abstract:Flexible industrial production systems will play a central role in the future of manufacturing due to higher product individualization and customization. A key component in such systems is the robotic grasping of known or unknown objects in random positions. Real-world applications often come with challenges that might not be considered in grasping solutions tested in simulation or lab settings. Partial occlusion of the target object is the most prominent. Examples of occlusion can be supporting structures in the camera’s field of view, sensor imprecision, or parts occluding each other due to the production process. In all these cases, the resulting lack of information leads to shortcomings in calculating grasping points. In this paper, we present an algorithm to reconstruct the missing information. Our inpainting solution facilitates the real-world utilization of robust object matching approaches for grasping point calculation. We demonstrate the benefit of our solution by enabling an existing grasping system embedded in a real-world industrial application to handle occlusions in the input. With our solution, we drastically decrease the number of objects discarded by the process.

[CV-160] PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models

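Quick read: This paper addresses the privacy and security concerns raised by training Multimodal Large Language Models (MLLMs) on vast amounts of internet data, from which models may inadvertently learn sensitive personal information. Machine unlearning (MU) can remove specific knowledge from a trained model without retraining from scratch, but current evaluations of MU for MLLMs are incomplete and the underlying problem is often poorly defined, which hinders the development of more secure and trustworthy systems. The paper introduces PEBench, a benchmark with a dataset of personal entities and corresponding general event scenes, designed to comprehensively assess MU for MLLMs and to provide a standardized, robust framework for research on secure, privacy-preserving multimodal models. The authors benchmark 6 MU methods, revealing their strengths and limitations and highlighting key challenges and opportunities for MU in MLLMs.

Link: https://arxiv.org/abs/2503.12545
Authors: Zhaopan Xu, Pengfei Zhou, Weidong Tang, Jiaxin Ai, Wangbo Zhao, Xiaojiang Peng, Kai Wang, Yang You, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Affiliations: HIT (Harbin Institute of Technology); Shanghai AI Laboratory; Shanghai Innovation Institute; NUS (National University of Singapore); SZTU (Shenzhen Technology University); XDU (Xidian University)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: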

Abstract:In recent years, Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning. However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security. To address these issues, machine unlearning (MU) has emerged as a promising solution, enabling the removal of specific knowledge from an already trained model without requiring retraining from scratch. Although MU for MLLMs has gained attention, current evaluations of its efficacy remain incomplete, and the underlying problem is often poorly defined, which hinders the development of strategies for creating more secure and trustworthy systems. To bridge this gap, we introduce a benchmark, named PEBench, which includes a dataset of personal entities and corresponding general event scenes, designed to comprehensively assess the performance of MU for MLLMs. Through PEBench, we aim to provide a standardized and robust framework to advance research in secure and privacy-preserving multimodal models. We benchmarked 6 MU methods, revealing their strengths and limitations, and shedding light on key challenges and opportunities for MU in MLLMs.

[CV-161] ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

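Quick read: This paper asks whether Multimodal Large Language Models (MLLMs) can comprehend the 4D spatio-temporal world as humans do, with comparable spatio-temporal reasoning from an egocentric viewpoint. It builds Ego-ST Bench, a new benchmark of over 5,000 question-answer pairs, and proposes ST-R1, a video reasoning model that incorporates reverse thinking, to systematically evaluate spatial, temporal, and integrated spatio-temporal reasoning and to equip MLLMs with human-like reasoning capabilities.

The key elements of the solution are: first, the Ego-ST Bench benchmark, which covers four categories; second, the ST-R1 Video model, which introduces reverse thinking into its reinforcement learning process and significantly improves performance; and finally, the combination of long chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO), which yields notable improvements with limited high-quality data.

Link: https://arxiv.org/abs/2503.12542
Authors: Peiran Wu, Yunze Liu, Chonghan Liu, Miao Liu, Junxiao Shen
Affiliations: University of Bristol; X-Intelligence Labs; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: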

Abstract:Humans excel at spatio-temporal reasoning, effortlessly interpreting dynamic visual events from an egocentric viewpoint. However, whether multimodal large language models (MLLMs) can similarly comprehend the 4D world remains uncertain. This paper explores multimodal spatio-temporal reasoning from an egocentric perspective, aiming to equip MLLMs with human-like reasoning capabilities. To support this objective, we introduce Ego-ST Bench, a novel benchmark containing over 5,000 question-answer pairs across four categories, systematically evaluating spatial, temporal, and integrated spatio-temporal reasoning. Additionally, we propose the ST-R1 Video model, a video-based reasoning model that incorporates reverse thinking into its reinforcement learning process, significantly enhancing performance. We combine long-chain-of-thought (long-CoT) supervised fine-tuning with Group Relative Policy Optimization (GRPO) reinforcement learning, achieving notable improvements with limited high-quality data. Ego-ST Bench and ST-R1 provide valuable insights and resources for advancing video-based spatio-temporal reasoning research.
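
The abstract names Group Relative Policy Optimization (GRPO) without detail. As commonly described, GRPO scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value baseline. A minimal sketch of that group-relative advantage follows; the function name, the `eps` stabilizer, and the example rewards are assumptions, and the paper's exact reward design is not stated in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct answers get advantages near +1, wrong ones near -1.
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.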

[CV-162] BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis

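Quick read: This paper addresses the neglect of challenging regions in 3D semantic segmentation. State-of-the-art methods mainly raise overall performance on general metrics (such as mIoU, mAcc, and oAcc) and leave hard-to-segment regions largely unexplored. The paper revisits the task by dividing segmentation errors into four categories and designing an evaluation metric for each. Building on this, it proposes BFANet, a network that incorporates detailed analysis of semantic boundary features. First, a boundary-semantic module decouples point cloud features into semantic and boundary features and fuses their query queue to enhance the semantic features with attention. Second, a more concise and faster boundary pseudo-label algorithm runs 3.9 times faster than the state of the art, is compatible with data augmentation, and allows efficient computation during training. Experiments on benchmark data confirm the superiority of BFANet and the value of the four proposed metrics.

Link: https://arxiv.org/abs/2503.12539
Authors: Weiguang Zhao, Rui Zhang, Qiufeng Wang, Guangliang Cheng, Kaizhu Huang
Affiliations: University of Liverpool; Xi’an Jiaotong-Liverpool University; Duke Kunshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: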

Abstract:3D semantic segmentation plays a fundamental and crucial role to understand 3D scenes. While contemporary state-of-the-art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g. mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary-semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo-label calculation algorithm, which is 3.9 times faster than the state-of-the-art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. Code is available at this https URL.

[CV-163] Debiasing Diffusion Model: Enhancing Fairness through Latent Representation Learning in Stable Diffusion Model

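Quick read: This paper addresses the biases that diffusion models inherit from their training data, which produce disproportionate group representations in generated images and exacerbate societal inequities. Existing debiasing methods typically rely on predefined sensitive attributes, classifiers trained on such attributes, or large language models, but predefined attributes cannot adequately capture complex, continuous variation among groups. The paper proposes the Debiasing Diffusion Model (DDM), which uses an indicator to learn latent representations during training and promotes fairness through balanced representations without requiring predefined sensitive attributes as conditions. The approach is effective in scenarios already covered by conventional techniques while removing the dependence on predefined attributes.

Link: https://arxiv.org/abs/2503.12536
Authors: Lin-Chun Huang, Ching Chieh Tsao, Fang-Yi Su, Jung-Hsien Chiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: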

Abstract:Image generative models, particularly diffusion-based models, have surged in popularity due to their remarkable ability to synthesize highly realistic images. However, since these models are data-driven, they inherit biases from the training datasets, frequently leading to disproportionate group representations that exacerbate societal inequities. Traditionally, efforts to debiase these models have relied on predefined sensitive attributes, classifiers trained on such attributes, or large language models to steer outputs toward fairness. However, these approaches face notable drawbacks: predefined attributes do not adequately capture complex and continuous variations among groups. To address these issues, we introduce the Debiasing Diffusion Model (DDM), which leverages an indicator to learn latent representations during training, promoting fairness through balanced representations without requiring predefined sensitive attributes. This approach not only demonstrates its effectiveness in scenarios previously addressed by conventional techniques but also enhances fairness without relying on predefined sensitive attributes as conditions. In this paper, we discuss the limitations of prior bias mitigation techniques in diffusion-based models, elaborate on the architecture of the DDM, and validate the effectiveness of our approach through experiments.

[CV-164] SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs CVPR2025

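Quick read: This paper addresses the poor performance of 3D Gaussian Splatting-based indoor open-world free-view synthesis under sparse inputs, which stems mainly from the sparse distribution of Gaussian points and insufficient view supervision. It proposes SPC-GS, built on Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) regularization. SGI provides a dense, scene-layout-based Gaussian distribution by using view-changed images from a video generation model together with view-constrained densification of Gaussian points. SPC mitigates limited view supervision with semantic-prompt-based consistency constraints built on SAM2: semantics available from the training views serve as instructive prompts for optimizing visually overlapping regions in novel views under 2D and 3D consistency constraints. On the Replica and ScanNet benchmarks, SPC-GS achieves a 3.06 dB PSNR gain in reconstruction quality and a 7.3% mIoU improvement in open-world semantic segmentation.

Link: https://arxiv.org/abs/2503.12535
Authors: Guibiao Liao, Qing Li, Zhenyu Bao, Guoping Qiu, Kanglin Liu
Affiliations: School of Electronic and Computer Engineering, Peking University; Pengcheng Laboratory; University of Nottingham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025. The project page is available at this https URL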

Abstract:3D Gaussian Splatting-based indoor open-world free-view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC-GS, leveraging Scene-layout-based Gaussian Initialization (SGI) and Semantic-Prompt Consistency (SPC) Regularization for open-world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene-layout-based Gaussian distribution by utilizing view-changed images generated from the video generation model and view-constraint Gaussian points densification. Additionally, SPC mitigates limited view supervision by employing semantic-prompt-based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC-GS across Replica and ScanNet benchmarks. Notably, our SPC-GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open-world semantic segmentation.
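
For readers less familiar with the reconstruction metric, PSNR is a logarithmic function of mean squared error, so a gain in dB corresponds to a multiplicative reduction in MSE. The sketch below is a generic PSNR implementation on flat pixel sequences (the function name and the `max_value=1.0` convention are assumptions); it does not reproduce the paper's evaluation protocol.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [1.0, 0.0, 0.5, 0.5]
coarse = [0.8, 0.2, 0.5, 0.5]   # larger per-pixel errors
finer = [0.9, 0.1, 0.5, 0.5]    # errors halved, so MSE drops by 4x
print(psnr(reference, finer) - psnr(reference, coarse))  # about 6.02 dB
```

By the same relation, a difference of roughly 3 dB between two renderings of the same image corresponds to about half the mean squared error.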

[CV-165] STEVE: A Step Verification Pipeline for Computer-use Agent Training

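Quick read: This paper addresses data-driven training of AI agents that autonomously operate graphical user interfaces (GUIs). Behavior cloning requires a large volume of high-quality trajectories, which is hard to obtain in practice. The paper proposes STEVE, a step verification pipeline for computer-use agent training. STEVE first builds a large instruction set and collects trajectory data with some suboptimal agents. GPT-4o then verifies the correctness of each step from the screens before and after the action, assigning each step a binary label. Finally, Kahneman and Tversky Optimization (KTO) optimizes the agent from the binary stepwise labels. Experiments show that the agent outperforms supervised fine-tuning by using both positive and negative actions within a trajectory, and that STEVE makes it possible to train a 7B vision-language model as a computer-use agent efficiently and at reduced cost, with leading performance in the challenging live desktop environment WinAgentArena.

Link: https://arxiv.org/abs/2503.12532
Authors: Fanbin Lu, Zhisheng Zhong, Ziqin Wei, Shu Liu, Chi-Wing Fu, Jiaya Jia
Affiliations: CUHK (The Chinese University of Hong Kong); SmartMore; HKUST (The Hong Kong University of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: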

Abstract:Developing AI agents to autonomously manipulate graphical user interfaces is a long challenging task. Recent advances in data scaling law inspire us to train computer-use agents with a scaled instruction set, yet using behavior cloning to train agents still requires immense high-quality trajectories. To meet the scalability need, we designed STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with some suboptimal agents. GPT-4o is used to verify the correctness of each step in the trajectories based on the screens before and after the action execution, assigning each step with a binary label. Last, we adopt the Kahneman and Tversky Optimization to optimize the agent from the binary stepwise labels. Extensive experiments manifest that our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory. Also, STEVE enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with great efficiency at a reduced cost. Code and data: this https URL.

[CV-166] Towards Suturing World Models: Learning Predictive Models for Robotic Surgical Tasks

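Quick read: This paper develops specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions, using supervised learning on annotated laparoscopic surgery footage. Two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, are fine-tuned with both Low-Rank Adaptation (LoRA) and full-model fine-tuning to generate high-fidelity surgical action sequences, with surgical actions categorized into sub-stitch classes covering ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. The work lays a foundation for data-driven surgical world models that simulate the biomechanical interactions and procedural dynamics of suturing with high temporal fidelity, in support of improved training simulators, skill assessment tools, and autonomous surgical systems. The models can also distinguish ideal from non-ideal technique execution, a basis for surgical training and evaluation systems.

Link: https://arxiv.org/abs/2503.12531
Authors: Mehmet Kerem Turkcan, Mattia Ballo, Filippo Filicori, Zoran Kostic
Affiliations: Columbia University; Northwell Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: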

Abstract:We introduce specialized diffusion-based generative models that capture the spatiotemporal dynamics of fine-grained robotic surgical sub-stitch actions through supervised learning on annotated laparoscopic surgery footage. The proposed models form a foundation for data-driven world models capable of simulating the biomechanical interactions and procedural dynamics of surgical suturing with high temporal fidelity. Annotating a dataset of ~2K clips extracted from simulation videos, we categorize surgical actions into fine-grained sub-stitch classes including ideal and non-ideal executions of needle positioning, targeting, driving, and withdrawal. We fine-tune two state-of-the-art video diffusion models, LTX-Video and HunyuanVideo, to generate high-fidelity surgical action sequences at ≥ 768x512 resolution and ≥ 49 frames. For training our models, we explore both Low-Rank Adaptation (LoRA) and full-model fine-tuning approaches. Our experimental results demonstrate that these world models can effectively capture the dynamics of suturing, potentially enabling improved training simulators, surgical skill assessment tools, and autonomous surgical systems. The models also display the capability to differentiate between ideal and non-ideal technique execution, providing a foundation for building surgical training and evaluation systems. We release our models for testing and as a foundation for future research. Project Page: this https URL
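
The abstract contrasts LoRA with full-model fine-tuning. As a reminder of what LoRA changes, the sketch below computes the effective weight W + (alpha / r) * B A for a single layer, using plain nested lists. It illustrates the general technique only; the helper names and toy matrices are assumptions, and nothing here reflects the authors' training configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_weight(w, lora_b, lora_a, alpha):
    """Effective weight W + (alpha / r) * B @ A, where r is the adapter rank.

    w is d_out x d_in (frozen), lora_b is d_out x r, lora_a is r x d_in.
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


frozen = [[1.0, 0.0], [0.0, 1.0]]
b_matrix = [[1.0], [2.0]]          # d_out x r with r = 1
a_matrix = [[3.0, 4.0]]            # r x d_in
print(lora_weight(frozen, b_matrix, a_matrix, alpha=1.0))  # [[4.0, 4.0], [6.0, 9.0]]
```

Only B and A are trained, so the trainable parameter count per layer is r * (d_in + d_out) rather than d_in * d_out, which is the usual motivation for trying LoRA alongside full fine-tuning.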

[CV-167] A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry

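Quick read: This paper addresses the effect of low-cost Inertial Measurement Unit (IMU) bias on Visual-Inertial Odometry (VIO). When visual tracking encounters errors, the bias optimized by conventional methods can deviate significantly from the true value, harming system stability and localization precision. The paper proposes a plug-and-play framework built around the Inertial Prior Network (IPNet), which estimates the mean IMU bias directly from raw IMU data. This removes the dependence on historical estimates found in traditional recursive prediction and prevents error propagation. An iterative method computes the mean bias used for network training, addressing the lack of bias labels in many visual-inertial datasets. Evaluated on two public datasets and one self-collected dataset, the method significantly improves localization precision and robustness, with ATE-RMSE improving by 46% on average.

Link: https://arxiv.org/abs/2503.12527
Authors: Yang Yi, Kunqing Wang, Jinpu Zhang, Zhen Tan, Xiangke Wang, Hui Shen, Dewen Hu
Affiliations: College of Intelligence Science and Technology, National University of Defense Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: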

Abstract:The bias of low-cost Inertial Measurement Units (IMU) is a critical factor affecting the performance of Visual-Inertial Odometry (VIO). In particular, when visual tracking encounters errors, the optimized bias results may deviate significantly from the true values, adversely impacting the system’s stability and localization precision. In this paper, we propose a novel plug-and-play framework featuring the Inertial Prior Network (IPNet), which is designed to accurately estimate IMU bias. Recognizing the substantial impact of initial bias errors in low-cost inertial devices on system performance, our network directly leverages raw IMU data to estimate the mean bias, eliminating the dependency on historical estimates in traditional recursive predictions and effectively preventing error propagation. Furthermore, we introduce an iterative approach to calculate the mean value of the bias for network training, addressing the lack of bias labels in many visual-inertial datasets. The framework is evaluated on two public datasets and one self-collected dataset. Extensive experiments demonstrate that our method significantly enhances both localization precision and robustness, with the ATE-RMSE metric improving on average by 46%. The source code and video will be available at this https URL.
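
The headline number is a relative improvement in ATE-RMSE. A minimal sketch of the metric is shown below, assuming the estimated and ground-truth trajectories are already time-associated and aligned to a common frame (the alignment step, normally part of ATE evaluation, is omitted). The function names and sample values are hypothetical.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of per-pose translation errors between two aligned trajectories."""
    squared = [sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
               for p_est, p_gt in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(ate_rmse(est, gt))                   # sqrt(12.5), about 3.54
print(relative_improvement(0.100, 0.054))  # about 0.46, i.e. a 46% reduction
```

The second call uses made-up error values purely to show how a figure like 46% would be computed from a baseline and an improved ATE-RMSE.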
zh
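
The ATE-RMSE figure quoted for IPNet is the root-mean-square of the per-pose position error between an estimated trajectory and the ground truth. A minimal Python sketch, assuming the two trajectories are already time-synchronised and expressed in the same frame (the alignment step that a full ATE evaluation normally performs is omitted); the function names are illustrative and are not taken from the paper:

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two trajectories.

    Assumes both arguments are lists of (x, y, z) tuples that are
    already time-synchronised and expressed in the same frame.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance for each pair of corresponding poses.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pose, gt_pose))
        for est_pose, gt_pose in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_error, new_error):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline_error - new_error) / baseline_error
```

Under this definition, a reported 46% average improvement would correspond to `relative_improvement` returning 0.46 when averaged over the evaluated sequences.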

[CV-168] EditID: Training-Free Editable ID Customization for Text-to-Image Generation

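[Quick Read]: This paper addresses the limited editability of existing customized-identity (ID) text-to-image models, which tend to prioritize ID consistency over editability and make it hard to change facial orientation, character attributes, and similar features through prompts. It proposes EditID, a training-free method based on the DiT architecture. The key is to deconstruct the customized-ID text-to-image model into an image generation branch and a character feature branch, and to decouple the latter into three modules: feature extraction, feature fusion, and feature integration. By combining mapping features with shift features and controlling the strength of ID feature integration, EditID semantically compresses local features across network depths and forms an editable feature space. It thereby generates high-quality, editable images while preserving ID consistency, and performs strongly on the IBench evaluation framework. According to the authors, EditID is the first approach to propose customized-ID editability on the DiT architecture, and it supports long prompts and high-quality image generation.

Link: https://arxiv.org/abs/2503.12526
Authors: Guandong Li, Zhaobin Chu
Affiliations: iFlyTek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: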

Abstract:We propose EditID, a training-free approach based on the DiT architecture, which achieves highly editable customized IDs for text-to-image generation. Existing text-to-image models for customized IDs typically focus more on ID consistency while neglecting editability. It is challenging to alter facial orientation, character attributes, and other features through prompts. EditID addresses this by deconstructing the text-to-image model for customized IDs into an image generation branch and a character feature branch. The character feature branch is further decoupled into three modules: feature extraction, feature fusion, and feature integration. By introducing a combination of mapping features and shift features, along with controlling the intensity of ID feature integration, EditID achieves semantic compression of local features across network depths, forming an editable feature space. This enables the generation of high-quality images with editable IDs while maintaining ID consistency. EditID achieves excellent results on IBench, an editability evaluation framework for customized-ID text-to-image generation, which quantitatively demonstrates its superior performance. EditID is the first text-to-image solution to propose customizable ID editability on the DiT architecture, meeting the demands of long prompts and high-quality image generation.

[CV-169] Multi Activity Sequence Alignment via Implicit Clustering

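[Quick Read]: This paper addresses two limitations of existing self-supervised temporal sequence alignment methods: they require a separate model to be trained for each activity, and they can only align sequences of the same activity. It proposes a new sequence alignment framework based on implicit clustering. The key is to perform implicit clip-level clustering while aligning frames across sequences, combined with a proposed dual augmentation technique, so that the learned representations are both more generalizable and more discriminative. Experiments show that the method outperforms the state of the art and generalizes across multiple activities and different modalities.

Link: https://arxiv.org/abs/2503.12519
Authors: Taein Kwon, Zador Pataki, Mahdi Rad, Marc Pollefeys
Affiliations: ETH Zürich; Microsoft MR & AI Lab, Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures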

Abstract:Self-supervised temporal sequence alignment can provide rich and effective representations for a wide range of applications. However, existing methods for achieving optimal performance are mostly limited to aligning sequences of the same activity only and require separate models to be trained for each activity. We propose a novel framework that overcomes these limitations using sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. This coupled with our proposed dual augmentation technique enhances the network’s ability to learn generalizable and discriminative representations. Our experiments show that our proposed method outperforms state-of-the-art results and highlight the generalization capability of our framework with multi activity and different modalities on three diverse datasets, H2O, PennAction, and IKEA ASM. We will release our code upon acceptance.

[CV-170] AI-Powered Automated Model Construction for Patient-Specific CFD Simulations of Aortic Flows

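[Quick Read]: This paper addresses the labor-intensive, error-prone, and time-consuming process of building patient-specific vascular models from medical images, with the aim of making such models more practical for clinical use. The core of the solution is a deep-learning framework that automates the creation of simulation-ready vascular models from medical images by combining a module for accurate voxel-level vessel segmentation with a module that performs anatomically consistent, unsupervised surface refinement. The key innovation is unifying voxel segmentation and surface deformation in a single coherent pipeline, which overcomes major limitations of existing methods and improves geometric accuracy and computational efficiency.

Link: https://arxiv.org/abs/2503.12515
Authors: Pan Du, Delin An, Chaoli Wang, Jian-Xun Wang
Affiliations: Department of Aerospace and Mechanical Engineering, University of Notre Dame, Notre Dame, IN; Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN; Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, NY
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 42 pages, 8 figures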

Abstract:Image-based modeling is essential for understanding cardiovascular hemodynamics and advancing the diagnosis and treatment of cardiovascular diseases. Constructing patient-specific vascular models remains labor-intensive, error-prone, and time-consuming, limiting their clinical applications. This study introduces a deep-learning framework that automates the creation of simulation-ready vascular models from medical images. The framework integrates a segmentation module for accurate voxel-based vessel delineation with a surface deformation module that performs anatomically consistent and unsupervised surface refinements guided by medical image data. By unifying voxel segmentation and surface deformation into a single cohesive pipeline, the framework addresses key limitations of existing methods, enhancing geometric accuracy and computational efficiency. Evaluated on publicly available datasets, the proposed approach demonstrates state-of-the-art performance in segmentation and mesh quality while significantly reducing manual effort and processing time. This work advances the scalability and reliability of image-based computational modeling, facilitating broader applications in clinical and research settings.

[CV-171] Segment Any-Quality Images with Generative Latent Space Enhancement CVPR2025

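[Quick Read]: This paper addresses the significant performance drop of Segment Anything Models (SAMs) on severely degraded, low-quality images, which limits their effectiveness in real-world scenarios. It proposes GleSAM, whose key idea is to use Generative Latent Space Enhancement to make SAM more robust on low-quality images and thus generalize across image qualities. Specifically, GleSAM adapts latent diffusion to SAM-based segmentation frameworks and runs the generative diffusion process in SAM's latent space to reconstruct high-quality representations, which improves segmentation. Two further techniques improve compatibility between the pre-trained diffusion model and the segmentation framework. The method applies to pre-trained SAM and SAM2 with only a small number of additional learnable parameters and can be optimized efficiently. The authors also construct the LQSeg dataset, which covers more degradation types and levels, for training and evaluation. Extensive experiments show that GleSAM significantly improves segmentation robustness under complex degradations while retaining generalization to clear images, and that it also performs well on unseen degradations.

Link: https://arxiv.org/abs/2503.12507
Authors: Guangqian Guo, Yoong Guo, Xuehui Yu, Wenbo Li, Yaoxing Wang, Shan Gao
Affiliations: Northwestern Polytechnical University; Huawei; Tencent; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025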

Abstract:Despite their success, Segment Anything Models (SAMs) experience significant performance drops on severely degraded, low-quality images, limiting their effectiveness in real-world scenarios. To address this, we propose GleSAM, which utilizes Generative Latent space Enhancement to boost robustness on low-quality images, thus enabling generalization across various image qualities. Specifically, we adapt the concept of latent diffusion to SAM-based segmentation frameworks and perform the generative diffusion process in the latent space of SAM to reconstruct high-quality representation, thereby improving segmentation. Additionally, we introduce two techniques to improve compatibility between the pre-trained diffusion model and the segmentation framework. Our method can be applied to pre-trained SAM and SAM2 with only minimal additional learnable parameters, allowing for efficient optimization. We also construct the LQSeg dataset with a greater diversity of degradation types and levels for training and evaluating the model. Extensive experiments demonstrate that GleSAM significantly improves segmentation robustness on complex degradations while maintaining generalization to clear images. Furthermore, GleSAM also performs well on unseen degradations, underscoring the versatility of our approach and dataset.

[CV-172] MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

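[Quick Read]: This paper addresses the fact that existing benchmarks for process-level reward models (PRMs) are limited to text tasks and focus on error detection, leaving out other important scenarios such as reasoning search. To fill this gap, it proposes MPBench, a comprehensive multi-task, multimodal benchmark for systematically evaluating PRMs across diverse scenarios. The key is its three evaluation paradigms: (1) Step Correctness, which checks the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal steps during inference. Together these paradigms allow MPBench to evaluate PRMs comprehensively and to inform the development of multimodal PRMs.

Link: https://arxiv.org/abs/2503.12505
Authors: Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
Affiliations: HIT (Harbin Institute of Technology); Shanghai AI Laboratory; NUS (National University of Singapore); Shanghai Innovation Institute; SZTU; pjlab.org.cn
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: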

Abstract:Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks of PRMs are text-based and focus on error detection, neglecting other scenarios like reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench makes comprehensive evaluations and provides insights into the development of multimodal PRMs.
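
The Answer Aggregation paradigm can be pictured as best-of-N selection: a PRM scores every step of each candidate solution, the step scores are reduced to a single solution score, and the answer with the most accumulated score wins. A hedged Python sketch; the `min` reduction, the summation across candidates, and the function names are assumptions for illustration and are not MPBench's actual protocol:

```python
from collections import defaultdict


def solution_score(step_scores):
    """Reduce per-step PRM scores to one value.

    Using min treats a solution as only as good as its weakest step
    (an assumed choice; product or mean are common alternatives).
    """
    return min(step_scores)


def aggregate_answers(candidates):
    """Return the final answer with the highest summed solution score.

    candidates: list of (final_answer, [step_score, ...]) pairs, where
    the step scores are assumed to come from some PRM.
    """
    totals = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += solution_score(step_scores)
    return max(totals, key=totals.get)
```

Summing over candidates that share a final answer combines majority voting with PRM confidence, so a single high-scoring outlier does not automatically beat several consistent solutions.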

[CV-173] Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?

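[Quick Read]: This paper addresses the “Sampling Dilemma” that makes efficient long-video processing hard: low-density sampling may miss critical information, while high-density sampling introduces redundancy. It proposes the LSDBench benchmark, which evaluates large vision-language models (LVLMs) on long-video tasks using questions with high Necessary Sampling Density (NSD), where NSD is the minimum sampling density needed to answer a given question accurately. It also proposes a Reasoning-Driven Hierarchical Sampling (RHS) framework that combines global localization of question-relevant cues with local dense sampling for precise inference. In addition, a lightweight Semantic-Guided Frame Selector prioritizes informative frames, allowing RHS to match or exceed existing methods with far fewer sampled frames. Together, LSDBench and RHS target the particular challenges of high-NSD long-video tasks and provide a new reference point for evaluating and improving LVLMs.

Link: https://arxiv.org/abs/2503.12496
Authors: Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, Jiaya Jia
Affiliations: CUHK (The Chinese University of Hong Kong); HKUST (The Hong Kong University of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: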

Abstract:The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the “Sampling Dilemma”: low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain.
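
The idea of pairing global localization with local dense sampling can be illustrated with a two-pass timestamp sampler: a sparse pass over the whole video finds the most relevant region, and a dense pass samples only inside it. A minimal Python sketch; the `relevance` callable is a hypothetical stand-in for whatever cue localization the model performs, and none of the names come from the paper:

```python
def uniform_timestamps(start, end, count):
    """Return `count` evenly spaced timestamps in [start, end]."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return [(start + end) / 2.0]
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]


def hierarchical_sample(duration, relevance, coarse_count, dense_count, window):
    """Coarse global pass to localise, then a dense local pass.

    relevance: callable mapping a timestamp (seconds) to a score;
    a hypothetical placeholder for question-conditioned localisation.
    window: length in seconds of the densely sampled region.
    """
    # Pass 1: sparse sampling over the full video.
    coarse = uniform_timestamps(0.0, duration, coarse_count)
    centre = max(coarse, key=relevance)
    # Pass 2: dense sampling inside a window around the best coarse frame,
    # clipped to the video boundaries.
    lo = max(0.0, centre - window / 2.0)
    hi = min(duration, centre + window / 2.0)
    return uniform_timestamps(lo, hi, dense_count)
```

For a 100-second video, 5 coarse and 5 dense frames give 2.5-second spacing inside the window, whereas uniform sampling at that spacing over the whole video would need 41 frames.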

[CV-174] BS-Mamba for Black-Soil Area Detection On the Qinghai-Tibetan Plateau

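[Quick Read]: This paper addresses the accurate assessment of extremely degraded grassland on the Qinghai-Tibetan Plateau (QTP), known as black-soil areas, in order to guide effective ecological restoration. Overgrazing, climate change, and rodent damage have reduced vegetation cover and degraded soil quality in these areas, posing a major environmental challenge. The key contribution is BS-Mamba, a new neural network model designed to detect black-soil areas in UAV remote sensing imagery; it outperforms state-of-the-art models on two independent test datasets and achieves higher recognition accuracy. By providing an efficient way to assess the extent of black-soil areas on the QTP, the work contributes to grassland restoration.

Link: https://arxiv.org/abs/2503.12495
Authors: Xuan Ma, Zewen Lv, Chengcai Ma, Tao Zhang, Yuelan Xin, Kun Zhan
Affiliations: School of Information Science and Engineering, Lanzhou University; State Key Laboratory of HIGAE, College of Ecology, Lanzhou University; School of Physics and Electronic Information Engineering, Qinghai Normal University; Key Laboratory of AI and Information Processing, Hechi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Journal of Applied Remote Sensing, 2025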

Abstract:Extremely degraded grassland on the Qinghai-Tibetan Plateau (QTP) presents a significant environmental challenge due to overgrazing, climate change, and rodent activity, which degrade vegetation cover and soil quality. This extremely degraded grassland, commonly referred to as black-soil area, requires accurate assessment to guide effective restoration efforts. In this paper, we present a newly created QTP black-soil dataset, annotated under expert guidance. We introduce a novel neural network model, BS-Mamba, specifically designed for black-soil area detection using UAV remote sensing imagery. The BS-Mamba model identifies black-soil area more accurately than state-of-the-art models across two independent test datasets. This research contributes to grassland restoration by providing an efficient method for assessing the extent of black-soil area on the QTP.

[CV-175] Learning Contour-Guided 3D Face Reconstruction with Occlusions

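[Quick Read]: This paper addresses two weaknesses of existing deep-learning-based 3D face reconstruction methods: poor handling of occluded scenes and difficulty capturing intricate geometric details. Inspired by generative adversarial networks (GANs) and bump mapping, it proposes a new method. The key is a mid-level shape refinement that strengthens the robustness of the base structure; examples show that the method generates plausible details for occluded facial regions, yielding more complete and realistic 3D face reconstructions. Experiments further indicate that the method is more adaptable than, and superior to, conventional manual occlusion removal.

Link: https://arxiv.org/abs/2503.12494
Authors: Dapeng Zhao
Affiliations: Zhejiang Lab, Hangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: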

Abstract:Recently, deep learning-based 3D face reconstruction methods have demonstrated promising advancements in terms of quality and efficiency. Nevertheless, these techniques face challenges in effectively handling occluded scenes and fail to capture intricate geometric facial details. Inspired by the principles of GANs and bump mapping, we have successfully addressed these issues. Our approach aims to deliver comprehensive 3D facial reconstructions, even in the presence of occlusions. While maintaining the overall shape’s robustness, we introduce a mid-level shape refinement to the fundamental structure. Furthermore, we illustrate how our method adeptly extends to generate plausible details for obscured facial regions. We offer numerous examples that showcase the effectiveness of our framework in producing realistic results, where traditional methods often struggle. To substantiate the superior adaptability of our approach, we have conducted extensive experiments in the context of general 3D face reconstruction tasks, serving as concrete evidence of its regulatory prowess compared to manual occlusion removal methods.

[CV-176] Geometry-Aware Face Reconstruction Under Occluded Scenes

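[Quick Read]: This paper addresses two weaknesses of existing deep-learning-based 3D face reconstruction methods: poor handling of occluded scenes and difficulty capturing intricate geometric details. Inspired by GANs (generative adversarial networks) and bump mapping, it proposes a new method that introduces a mid-level shape refinement to strengthen the robustness of the base structure and extends it to generate realistic details for occluded regions. Experiments show that the method produces markedly more realistic results than traditional methods, and its performance on general 3D face reconstruction tasks demonstrates better adaptability than manual occlusion removal.

Link: https://arxiv.org/abs/2503.12492
Authors: Dapeng Zhao
Affiliations: Zhejiang Lab, Hangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: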

[CV-177] GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing

[Quick Read]: This paper addresses the limitations of existing Vision-Language Models (VLMs) on remote sensing (RS) tasks, particularly tasks that involve complex instructions or multiple conditions and tasks that require pixel-level operations such as segmentation and change detection. Current models do well on Referring Expression Comprehension (REC) but fall short in these more complex settings. The paper presents a comprehensive hierarchical taxonomy of tasks and introduces the Remote Sensing Vision-Language Task Set (RSVLTS), comprising Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), Described Object Tasks (DOT), and Visual Question Answering (VQA). The key is a unified set-of-points data representation, combined with a condition parser and a self-augmentation strategy based on cyclic referring. These are integrated into the GeoRSMLLM model so that it can handle the broad range of tasks in RSVLTS, offering a more general solution for vision-language tasks in geoscience and remote sensing.

Link: https://arxiv.org/abs/2503.12490
Authors: Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Bin Chen, Zian Guan, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin
Affiliations: College of Computer Science and Technology, Zhejiang University; Binjiang Research Institute of Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel in Referring Expression Comprehension (REC), struggle with tasks involving complex instructions (e.g., those with multiple conditions) or pixel-level operations like segmentation and change detection. In this white paper, we provide a comprehensive hierarchical summary of vision-language tasks in RS, categorized by the varying levels of cognitive capability required. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) of increasing difficulty, alongside Visual Question Answering (VQA). Moreover, we propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition parser and a self-augmentation strategy based on cyclic referring. These features are integrated into the GeoRSMLLM model, and this enhanced model is designed to handle a broad range of tasks in RSVLTS, paving the way for a more generalized solution for vision-language tasks in geoscience and remote sensing.

[CV-178] Cross-Modal Consistency Learning for Sign Language Recognition

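[Quick Read]: This paper addresses a dilemma in pre-training for Isolated Sign Language Recognition (ISLR): methods that use compact pose data remove background interference but carry fewer semantic cues than raw RGB video, while learning representations directly from RGB video is difficult because of visual features unrelated to signing. It proposes a cross-modal consistency learning framework, CCL-SLR. Built on self-supervised pre-training, the framework exploits cross-modal consistency between the RGB and pose modalities. First, single-modal and cross-modal contrastive learning discriminates instances and gradually aligns the RGB and pose feature spaces to extract consistent sign representations. Second, Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) improve cross-modal consistency from the perspectives of data augmentation and sample similarity, respectively. Experiments show that CCL-SLR achieves significant gains on four ISLR benchmark datasets.

Link: https://arxiv.org/abs/2503.12485
Authors: Kepeng Wu, Zecheng Li, Weichao Zhao, Hezhen Hu, Wengang Zhou, Houqiang Li
Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: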

Abstract:Pre-training has been proven to be effective in boosting the performance of Isolated Sign Language Recognition (ISLR). Existing pre-training methods focus solely on compact pose data, which eliminates background perturbation but inevitably suffers from insufficient semantic cues compared to raw RGB videos. Nevertheless, direct representation learning only from RGB videos remains challenging due to the presence of sign-independent visual features. To address this dilemma, we propose a Cross-modal Consistency Learning framework (CCL-SLR), which leverages the cross-modal consistency from both RGB and pose modalities based on self-supervised pre-training. First, CCL-SLR employs contrastive learning for instance discrimination within and across modalities. Through the single-modal and cross-modal contrastive learning, CCL-SLR gradually aligns the feature spaces of RGB and pose modalities, thereby extracting consistent sign representations. Second, we further introduce Motion-Preserving Masking (MPM) and Semantic Positive Mining (SPM) techniques to improve cross-modal consistency from the perspective of data augmentation and sample similarity, respectively. Extensive experiments on four ISLR benchmarks show that CCL-SLR achieves impressive performance, demonstrating its effectiveness. The code will be released to the public.
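
Cross-modal contrastive learning of this kind is commonly written as an InfoNCE objective in which the RGB and pose embeddings of the same clip form the positive pair and all other pairings in the batch act as negatives. A minimal Python sketch of that generic loss in one direction (RGB to pose); the temperature value and the function names are assumptions, and this is not claimed to be the exact loss used by CCL-SLR:

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_modal_info_nce(rgb_embeddings, pose_embeddings, temperature=0.1):
    """Mean InfoNCE loss with the i-th RGB and i-th pose embedding as positives."""
    rgb = [normalise(v) for v in rgb_embeddings]
    pose = [normalise(v) for v in pose_embeddings]
    losses = []
    for i, anchor in enumerate(rgb):
        logits = [dot(anchor, p) / temperature for p in pose]
        # Numerically stable log-sum-exp over all pose candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability assigned to the matching pose embedding.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss approaches zero when each RGB embedding is closest to its own pose embedding and grows when the pairing is wrong, which is the pressure that pulls the two feature spaces into alignment.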

[CV-179] Diffusion-based Synthetic Data Generation for Visible-Infrared Person Re-Identification AAAI2025

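[Quick Read]: This paper addresses the severe difficulty of obtaining data for Visible-Infrared (VI) person re-identification (Re-ID): collecting and annotating real data is costly and time-consuming and must comply with privacy regulations. It proposes a new framework, Diffusion-based VI-ReID data Expansion (DiVE). By decoupling identity from modality, DiVE automatically generates large numbers of identity-preserving RGB-infrared image pairs to improve VI-ReID models. At its core, the identity representation is extracted from a set of samples sharing the same ID, and modality characteristics are learned by fine-tuning Stable Diffusion (SD) on modality-specific data, extending text-driven image synthesis to identity-preserving multimodal image synthesis. This substantially reduces data collection and annotation costs while improving model performance. In experiments, a VI-ReID model trained with DiVE-generated data gains about 9% mAP on the LLCM dataset.

Link: https://arxiv.org/abs/2503.12472
Authors: Wenbo Dai, Lijing Lu, Zhihang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025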

Abstract:The performance of models is intricately linked to the abundance of training data. In Visible-Infrared person Re-IDentification (VI-ReID) tasks, collecting and annotating large-scale images of each individual under various cameras and modalities is tedious, time-consuming, costly and must comply with data protection laws, posing a severe challenge in meeting dataset requirements. Current research investigates the generation of synthetic data as an efficient and privacy-ensuring alternative to collecting real data in the field. However, a specific data synthesis technique tailored for VI-ReID models has yet to be explored. In this paper, we present a novel data generation framework, dubbed Diffusion-based VI-ReID data Expansion (DiVE), that automatically obtains massive identity-preserving RGB-IR paired images by decoupling identity and modality to improve the performance of VI-ReID models. Specifically, identity representation is acquired from a set of samples sharing the same ID, whereas the modality of images is learned by fine-tuning the Stable Diffusion (SD) on modality-specific data. DiVE extends the text-driven image synthesis to identity-preserving RGB-IR multimodal image synthesis. This approach significantly reduces data collection and annotation costs by directly incorporating synthetic data into ReID model training. Experiments have demonstrated that VI-ReID models trained on synthetic data produced by DiVE consistently exhibit notable enhancements. In particular, the state-of-the-art method, CAJ, trained with synthetic images, achieves an improvement of about 9% in mAP over the baseline on the LLCM dataset. Code: this https URL

[CV-180] DPF-Net: Physical Imaging Model Embedded Data-Driven Underwater Image Enhancement

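[Quick Read]: This paper addresses the significant degradation of underwater images caused by the complex interplay of light absorption and scattering. The key is a two-stage underwater image enhancement network, the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which combines the reliability of physical imaging models with the generality and efficiency of data-driven methods. First, a physical parameter estimation module is trained on synthetic datasets to ensure the estimated parameters are trustworthy, rather than merely learning a fitted relationship between raw and reference images as is common in prior work. This module is then trained jointly with an enhancement network, and the estimated physical parameters are integrated into the data-driven model in the embedding space. To keep degradation consistent throughout restoration, the authors propose a physics-based degradation consistency loss, and they introduce a weak reference loss term that reduces the model's dependence on the quality of any single reference image. Experiments show that DPF-Net performs strongly on multiple test sets and reaches state-of-the-art results.

Link: https://arxiv.org/abs/2503.12470
Authors: Han Mei, Kunqian Li, Shuaixin Liu, Chengzhi Ma, Qianli Jiang
Affiliations: College of Engineering, Ocean University of China; Institute for Advanced Ocean Study, Ocean University of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: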

Abstract:Due to the complex interplay of light absorption and scattering in the underwater environment, underwater images experience significant degradation. This research presents a two-stage underwater image enhancement network called the Data-Driven and Physical Parameters Fusion Network (DPF-Net), which harnesses the robustness of physical imaging models alongside the generality and efficiency of data-driven methods. We first train a physical parameter estimate module using synthetic datasets to guarantee the trustworthiness of the physical parameters, rather than solely learning the fitting relationship between raw and reference images by the application of the imaging equation, as is common in prior studies. This module is subsequently trained in conjunction with an enhancement network, where the estimated physical parameters are integrated into a data-driven model within the embedding space. To maintain the uniformity of the restoration process amid underwater imaging degradation, we propose a physics-based degradation consistency loss. Additionally, we suggest an innovative weak reference loss term utilizing the entire dataset, which alleviates our model’s reliance on the quality of individual reference images. Our proposed DPF-Net demonstrates superior performance compared to other benchmark methods across multiple test sets, achieving state-of-the-art results. The source code and pre-trained models are available on the project home page: this https URL.
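
The abstract refers to an imaging equation without stating it. The Python sketch below assumes the widely used simplified underwater formation model I = J*t + B*(1 - t) with transmission t = exp(-beta*d). This model, the per-channel scalar treatment, and all function names are assumptions for illustration, and the consistency check is only one plausible reading of a degradation consistency loss, not DPF-Net's definition:

```python
import math


def transmission(beta, depth):
    """Fraction of scene light that survives attenuation over `depth`."""
    return math.exp(-beta * depth)


def degrade(clean, background_light, t):
    """Assumed formation model for one channel value: I = J*t + B*(1 - t)."""
    return clean * t + background_light * (1.0 - t)


def restore(observed, background_light, t, t_min=0.1):
    """Invert the assumed model for J.

    The transmission is clamped from below so that noise is not
    amplified without bound as t approaches zero.
    """
    t = max(t, t_min)
    return (observed - background_light * (1.0 - t)) / t


def degradation_consistency(observed, restored, background_light, t):
    """Gap between the input and the re-degraded restoration.

    It is zero when the estimated parameters and the restored value
    together explain the observed value exactly.
    """
    return abs(degrade(restored, background_light, t) - observed)
```

Under these assumptions, physical parameters that are estimated well let the restored image be pushed back through the formation model to reproduce the input, which is the kind of constraint a physics-based consistency term can enforce.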

[CV-181] Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition ICLR2025

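[Quick Read]: This paper addresses the limited accuracy and generalization of Diffusion Policy (DP) in multimodal settings. Current DPs are usually based on a single visual modality (such as RGB or point cloud), which constrains performance. Training a general DP that handles heterogeneous multimodal data could improve performance but incurs high computational and data costs. The paper proposes a policy composition approach: take multiple pre-trained DPs, each based on an individual visual modality, and combine their distributional scores to build a more expressive Modality-Composable Diffusion Policy (MCDP) without any additional training. The key is to integrate information from different modalities effectively into a stronger cross-modal representation, improving adaptability and performance.

Link: https://arxiv.org/abs/2503.12466
Authors: Jiahang Cao, Qiang Zhang, Hanzhong Guo, Jiaxu Wang, Hao Cheng, Renjing Xu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Beijing Innovation Center of Humanoid Robotics; The University of Hong Kong
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025 Generative Models for Robot Learning Workshop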

Abstract:Diffusion Policy (DP) has attracted significant attention as an effective method for policy representation due to its capacity to model multi-distribution dynamics. However, current DPs are often based on a single visual modality (e.g., RGB or point cloud), limiting their accuracy and generalization potential. Although training a generalized DP capable of handling heterogeneous multimodal data would enhance performance, it entails substantial computational and data-related costs. To address these challenges, we propose a novel policy composition method: by leveraging multiple pre-trained DPs based on individual visual modalities, we can combine their distributional scores to form a more expressive Modality-Composable Diffusion Policy (MCDP), without the need for additional training. Through extensive empirical experiments on the RoboTwin dataset, we demonstrate the potential of MCDP to improve both adaptability and performance. This exploration aims to provide valuable insights into the flexible composition of existing DPs, facilitating the development of generalizable cross-modality, cross-domain, and even cross-embodiment policies. Our code is open-sourced at this https URL.
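
Combining distributional scores at inference time can be sketched as a weighted sum of the score (or noise-prediction) vectors that each pre-trained policy produces for the same noisy action at the same denoising step. A minimal Python sketch; the weight normalisation and the function name are assumptions for illustration, not the paper's exact composition rule:

```python
def compose_scores(scores, weights):
    """Weighted sum of per-modality score vectors.

    scores: list of equal-length lists, one per pre-trained policy
            (for example one from an RGB policy, one from a point-cloud policy).
    weights: one non-negative weight per policy; normalised here to sum to 1.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("need one weight per score vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(scores[0])
    if any(len(s) != dim for s in scores):
        raise ValueError("all score vectors must have the same length")
    return [
        sum(w / total * s[k] for w, s in zip(weights, scores))
        for k in range(dim)
    ]
```

In such a scheme the composed vector would replace the single-policy prediction inside each denoising step, so the sampler itself stays unchanged and no retraining is involved.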

[CV-182] Learning Privacy from Visual Entities

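[Quick Read]: This paper addresses the challenging task of predicting from image features whether an image is private or public; the main difficulties are subjective interpretation and content diversity. The key proposal is a simpler architecture that combines transfer learning with a convolutional neural network (CNN) and optimizes only 732 parameters to relate scene types to privacy, while matching the performance of methods based on graph neural networks. By contrast, end-to-end training of graph-based methods can mask the contribution of individual components to classification performance. The study also shows that the high-dimensional feature vector extracted by CNNs for each visual entity is unnecessary and only adds model complexity, and that the graph component has little effect on overall performance, which is driven instead by fine-tuning the CNN to optimize image features for the privacy nodes.

Link: https://arxiv.org/abs/2503.12464
Authors: Alessio Xompero(1), Andrea Cavallaro(2 and 3) ((1) Queen Mary University of London, (2) Idiap Research Institute, (3) École Polytechnique Fédérale de Lausanne)
Affiliations: Queen Mary University of London; Idiap Research Institute; École Polytechnique Fédérale de Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages (13 for the main article, 8 for bibliography, acks, appendixes), 9 figures, 12 tables. Article accepted and to appear in the Proceedings on Privacy Enhancing Technologies, 2025 (3): this https URL . To be presented at the Privacy Enhancing Technologies Symposium 2025. Artifact (source code) under review: this https URL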

Abstract:Subjective interpretation and content diversity make predicting whether an image is private or public a challenging task. Graph neural networks combined with convolutional neural networks (CNNs), which consist of 14,000 to 500 million parameters, generate features for visual entities (e.g., scene and object types) and identify the entities that contribute to the decision. In this paper, we show that using a simpler combination of transfer learning and a CNN to relate privacy with scene types optimises only 732 parameters while achieving comparable performance to that of graph-based methods. In contrast, end-to-end training of graph-based methods can mask the contribution of individual components to the classification performance. Furthermore, we show that a high-dimensional feature vector, extracted with CNNs for each visual entity, is unnecessary and complicates the model. The graph component also has negligible impact on performance, which is driven by fine-tuning the CNN to optimise image features for privacy nodes.

[CV-183] MambaIC: State Space Models for High-Performance Learned Image Compression CVPR2025

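[Quick Read]: This paper addresses the computational inefficiency and inadequate redundancy modeling of existing image compression methods in real-time transmission scenarios. The key is to exploit state space models (SSMs), improving compression efficiency and performance from several angles through better context modeling and window-based local attention. Specifically, it proposes an enhanced image compression method named MambaIC, which refines context modeling by adaptively refining the representation of hidden states, and introduces window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby improving efficiency. Experiments confirm the method's effectiveness and efficiency for high-resolution image compression.

Link: https://arxiv.org/abs/2503.12461
Authors: Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, Yan Wang
Affiliations: Institute for AI Industry Research (AIR), Tsinghua University; School of Computer Science, Peking University; University of Science and Technology Beijing; UCAS-Terminus AI Lab, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to CVPR 2025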

Abstract:A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Despite rapid progress in image compression, computational inefficiency and poor redundancy modeling still pose significant bottlenecks, limiting practical applications. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives. In this paper, we integrate the advantages of SSMs for better efficiency-performance trade-off and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. Specifically, we explore context modeling to adaptively refine the representation of hidden states. Additionally, we introduce window-based local attention into channel-spatial entropy modeling to reduce potential spatial redundancy during compression, thereby increasing efficiency. Comprehensive qualitative and quantitative results validate the effectiveness and efficiency of our approach, particularly for high-resolution image compression. Code is released at this https URL.

[CV-184] Exploring Contextual Attribute Density in Referring Expression Counting CVPR25

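[Quick Read]: This paper addresses the challenge of fine-grained attribute understanding in Referring Expression Counting (REC), in particular the difficulty existing methods have in aligning attribute information with the correct visual patterns. It argues that the limitations of current REC methods stem from insufficient exploration of contextual attribute density (CAD). The key innovation is a U-shaped CAD estimator that models CAD through interaction between the referring expression and multi-scale visual features extracted by GroundingDINO, with additional density supervision to encode CAD effectively. CAD-refined queries are then decoded with a novel attention mechanism, which yields a 30% reduction in counting error and a 10% improvement in localization accuracy. The core contribution is to reveal the importance of CAD for REC and to provide an effective way to model it.

Link: https://arxiv.org/abs/2503.12460
Authors: Zhicheng Wang, Zhiyu Pan, Zhan Peng, Jian Cheng, Liwen Xiao, Wei Jiang, Zhiguo Cao
Affiliations: School of AIA, Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR25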

Abstract:Referring expression counting (REC) algorithms aim for more flexible and interactive counting across varied fine-grained text expressions. However, the requirement for fine-grained attribute understanding poses challenges for prior work, which struggles to accurately align attribute information with the correct visual patterns. Given the proven importance of "visual density", it is presumed that the limitations of current REC approaches stem from an under-exploration of "contextual attribute density" (CAD). In the scope of REC, we define CAD as the measure of the information intensity of a given fine-grained attribute in visual regions. To model CAD, we propose a U-shape CAD estimator in which the referring expression and multi-scale visual features from GroundingDINO can interact with each other. With additional density supervision, we can effectively encode CAD, which is subsequently decoded via a novel attention procedure with CAD-refined queries. Integrating all these contributions, our framework significantly outperforms state-of-the-art REC methods, achieving a 30% error reduction in counting metrics and a 10% improvement in localization accuracy. The surprising results shed light on the significance of contextual attribute density for REC. Code will be at this http URL.

[CV-185] Shape Bias and Robustness Evaluation via Cue Decomposition for Image Classification and Segmentation

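[Quick read]: This paper addresses limitations of existing methods for evaluating the shape and texture biases of deep neural networks (DNNs), which typically apply only to image classification and rely on style transfer. It proposes a new evaluation procedure built on cue decomposition: first, two AI-free data pre-processing methods extract shape and texture cues, respectively; second, a novel cue-decomposition shape bias evaluation metric uses the decomposed data. For application purposes, it also proposes a corresponding cue-decomposition robustness metric for estimating the robustness of DNNs to image corruptions. Experiments show that the new robustness metric estimates DNN robustness better than existing methods, and the results reveal for the first time the biases of semantic segmentation DNNs on the Cityscapes and ADE20k datasets.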

Link: https://arxiv.org/abs/2503.12453
Authors: Edgar Heinert, Thomas Gottwald, Annika Mütze, Matthias Rottmann
Affiliations: University of Wuppertal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Previous works studied how deep neural networks (DNNs) perceive image content in terms of their biases towards different image cues, such as texture and shape. Previous methods to measure shape and texture biases are typically style-transfer-based and limited to DNNs for image classification. In this work, we provide a new evaluation procedure consisting of 1) a cue-decomposition method that comprises two AI-free data pre-processing methods extracting shape and texture cues, respectively, and 2) a novel cue-decomposition shape bias evaluation metric that leverages the cue-decomposition data. For application purposes we introduce a corresponding cue-decomposition robustness metric that allows for the estimation of the robustness of a DNN w.r.t. image corruptions. In our numerical experiments, our findings for biases in image classification DNNs align with those of previous evaluation metrics. However, our cue-decomposition robustness metric shows superior results in terms of estimating the robustness of DNNs. Furthermore, our results for DNNs on the semantic segmentation datasets Cityscapes and ADE20k for the first time shed light into the biases of semantic segmentation DNNs.

[CV-186] ISLR101: an Iranian Word-Level Sign Language Recognition Dataset

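[Quick read]: This paper addresses the research and development challenges caused by the scarcity of sign language resources in sign language recognition. Sign language involves complex multichannel information (such as hand shapes and movements), and existing data are insufficient to support the research. To fill this gap, the paper introduces ISLR101, the first publicly available Iranian Sign Language dataset for isolated sign language recognition. It contains 4,614 videos recorded by 10 signers (deaf individuals, sign language interpreters, and second-language learners), covers 101 distinct signs, and provides skeleton pose information extracted with OpenPose. The paper establishes two baseline frameworks, one based on visual appearance and one on skeletons, trains and evaluates them on ISLR101, and reports 97.01% and 94.02% test accuracy, respectively. The train, validation, and test splits are published to enable fair comparison.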

Link: https://arxiv.org/abs/2503.12451
Authors: Hossein Ranjbar, Alireza Taheri
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Sign language recognition involves modeling complex multichannel information, such as hand shapes and movements while relying on sufficient sign language-specific data. However, sign languages are often under-resourced, posing a significant challenge for research and development in this field. To address this gap, we introduce ISLR101, the first publicly available Iranian Sign Language dataset for isolated sign language recognition. This comprehensive dataset includes 4,614 videos covering 101 distinct signs, recorded by 10 different signers (3 deaf individuals, 2 sign language interpreters, and 5 L2 learners) against varied backgrounds, with a resolution of 800x600 pixels and a frame rate of 25 frames per second. It also includes skeleton pose information extracted using OpenPose. We establish both a visual appearance-based and a skeleton-based framework as baseline models, thoroughly training and evaluating them on ISLR101. These models achieve 97.01% and 94.02% accuracy on the test set, respectively. Additionally, we publish the train, validation, and test splits to facilitate fair comparisons.

[CV-187] LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching

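[Quick read]: This paper tackles the conflict between KV caching and the bidirectional self-attention that Masked Autoregressive (MAR) models rely on for image generation, a conflict that creates unexpected computational bottlenecks and undermines the expected efficiency gains. The key is to exploit two kinds of redundancy. Token Redundancy: many tokens have very similar representations in adjacent decoding steps, so they can be cached in earlier steps and reused later. Condition Redundancy: the difference between conditional and unconditional outputs in classifier-free guidance takes very similar values in adjacent steps. Building on these, the paper proposes LazyMAR, which designs two caching mechanisms, one per redundancy. LazyMAR is training-free and plugs directly into all MAR models. Experiments show a 2.83x speedup with almost no loss in generation quality.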

Link: https://arxiv.org/abs/2503.12450
Authors: Feihong Yan, Qingyan Wei, Jiayi Tang, Jiajun Li, Yulin Wang, Xuming Hu, Huiqi Li, Linfeng Zhang
Affiliations: Beijing Institute of Technology; Central South University; China University of Mining and Technology; University of Electronic Science and Technology of China; Tsinghua University; Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Masked Autoregressive (MAR) models have emerged as a promising approach in image generation, expected to surpass traditional autoregressive models in computational efficiency by leveraging the capability of parallel decoding. However, their dependence on bidirectional self-attention inherently conflicts with conventional KV caching mechanisms, creating unexpected computational bottlenecks that undermine their expected efficiency. To address this problem, this paper studies the caching mechanism for MAR by leveraging two types of redundancy: Token Redundancy indicates that a large portion of tokens have very similar representations in the adjacent decoding steps, which allows us to first cache them in previous steps and then reuse them in the later steps. Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. LazyMAR is training-free and plug-and-play for all MAR models. Experimental results demonstrate that our method achieves 2.83 times acceleration with almost no drop in generation quality. Our codes will be released in this https URL.
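
The Token Redundancy idea can be illustrated with a small sketch. This is not LazyMAR's implementation: the cosine-similarity test, the threshold `tau`, the per-token `expensive_fn` stand-in for a transformer block, and the function names are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def step_with_cache(tokens, cache, expensive_fn, tau=0.99):
    """Recompute only tokens whose input changed noticeably since it was cached.

    cache maps token index -> (cached input, cached output).
    tau is a hypothetical similarity threshold, not a value from the paper.
    Returns the outputs and the number of tokens actually recomputed.
    """
    outputs, recomputed = [], 0
    for i, tok in enumerate(tokens):
        prev = cache.get(i)
        if prev is not None and cosine(tok, prev[0]) >= tau:
            outputs.append(prev[1])  # reuse the cached output
        else:
            out = expensive_fn(tok)  # stand-in for the costly computation
            cache[i] = (tok, out)
            outputs.append(out)
            recomputed += 1
    return outputs, recomputed
```

In a real bidirectional model a token's output also depends on the other tokens, so any reuse criterion would have to account for that; the sketch only shows the cache-then-reuse pattern across adjacent steps.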

[CV-188] Causality Model for Semantic Understanding on Videos

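[Quick read]: This paper addresses the difficulty deep neural networks (DNNs) have in learning the underlying causal mechanisms in video understanding because of data imbalance, and in particular the significant performance drop under distribution shifts such as long-tail imbalance and perturbation imbalance. The key is causal modeling, which uncovers the true causal patterns behind observed correlations to improve robustness and generalization, and is used to advance semantic video understanding tasks such as video relation detection and video question answering.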

Link: https://arxiv.org/abs/2503.12447
Authors: Li Yicong
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: PhD Thesis

Abstract:After a decade of prosperity, the development of video understanding has reached a critical juncture, where the sole reliance on massive data and complex architectures is no longer a one-size-fits-all solution to all situations. The presence of ubiquitous data imbalance hampers DNNs from effectively learning the underlying causal mechanisms, leading to significant performance drops when encountering distribution shifts, such as long-tail imbalances and perturbed imbalances. This realization has prompted researchers to seek alternative methodologies to capture causal patterns in video data. To tackle these challenges and increase the robustness of DNNs, causal modeling emerged as a principle to discover the true causal patterns behind the observed correlations. This thesis focuses on the domain of semantic video understanding and explores the potential of causal modeling to advance two fundamental tasks: Video Relation Detection (VidVRD) and Video Question Answering (VideoQA).

[CV-189] BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

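[Quick read]: This paper addresses the large amount of training data that encoder-free multimodal large language models (MLLMs) need in order to capture visual knowledge without a vision encoder. It proposes BREEN, a new architecture that achieves marked data efficiency by introducing a learnable query and an image expert. The learnable query is placed between the image tokens and the text tokens and is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between the visual and textual modalities. The image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the language model's textual capabilities. Trained on only 13 million text-image pairs, about one percent of the data required by existing methods, BREEN performs comparably to prior state-of-the-art encoder-free models.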

Link: https://arxiv.org/abs/2503.12446
Authors: Tianle Li, Yongming Rao, Winston Hu, Yu Cheng
Affiliations: The Chinese University of Hong Kong; Tencent Hunyuan Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and an image expert to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM's textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training, about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.

[CV-190] Consistent-Point: Consistent Pseudo-Points for Semi-Supervised Crowd Counting and Localization

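[Quick read]: This paper addresses the insufficient consistency of pseudo-points in semi-supervised crowd counting and localization. Existing methods depend on extensive, laborious manual annotation, and their performance is limited in the semi-supervised setting. The paper proposes Consistent-Point, a point-localization-based semi-supervised method. Its key is to resolve two inconsistencies of pseudo-points: it aggregates the positions of neighboring auxiliary proposal-points to improve position consistency, and it proposes an instance-wise uncertainty calibration to improve class consistency. More consistent pseudo-points provide more stable supervision for training, yielding state-of-the-art localization performance and strong counting results on several widely used datasets.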

Link: https://arxiv.org/abs/2503.12441
Authors: Yuda Zou, Zelong Liu, Yuliang Gu, Bo Du, Yongchao Xu
Affiliations: School of Computer Science, Wuhan University, Wuhan, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Crowd counting and localization are important in applications such as public security and traffic management. Existing methods have achieved impressive results thanks to extensive laborious annotations. This paper proposes a novel point-localization-based semi-supervised crowd counting and localization method termed Consistent-Point. We identify and address two inconsistencies of pseudo-points, which have not been adequately explored. To enhance their position consistency, we aggregate the positions of neighboring auxiliary proposal-points. Additionally, an instance-wise uncertainty calibration is proposed to improve the class consistency of pseudo-points. By generating more consistent pseudo-points, Consistent-Point provides more stable supervision to the training process, yielding improved results. Extensive experiments across five widely used datasets and three different labeled ratio settings demonstrate that our method achieves state-of-the-art performance in crowd localization while also attaining impressive crowd counting results. The code will be available.
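
As a rough illustration of aggregating neighboring proposal-point positions, the sketch below takes a confidence-weighted mean of the proposals that fall within a radius of a pseudo-point. The abstract does not specify the aggregation rule, so the weighting scheme, the `radius` parameter, and the function name are assumptions.

```python
def aggregate_position(anchor, proposals, radius=4.0):
    """Confidence-weighted mean of the proposals within `radius` of `anchor`.

    anchor: (x, y) position of a pseudo-point.
    proposals: list of (x, y, confidence) auxiliary proposal-points.
    Falls back to the anchor itself when no proposal is close enough.
    """
    ax, ay = anchor
    sx = sy = sw = 0.0
    for x, y, w in proposals:
        if (x - ax) ** 2 + (y - ay) ** 2 <= radius ** 2:
            sx += w * x
            sy += w * y
            sw += w
    if sw == 0.0:
        return anchor
    return (sx / sw, sy / sw)
```

Averaging over several nearby proposals smooths out the jitter of any single prediction, which is one plausible reading of "position consistency".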

[CV-191] EgoEvGesture: Gesture Recognition Based on Egocentric Event Camera

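[Quick read]: This paper addresses the performance degradation of RGB-based gesture recognition in dynamic scenes caused by motion blur and illumination changes. It also targets the difficulty of processing the asynchronous event streams of event cameras and the added complexity, from a first-person view, of gesture events being mixed with events caused by head movement. It proposes a network architecture designed specifically for event data. The key components are: (1) a lightweight CNN with asymmetric depthwise convolutions that reduces parameters while preserving spatiotemporal features; (2) a plug-and-play state-space model used as a context block to decouple head-movement noise from gesture dynamics; and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along the bins and temporal dimensions to fuse sparse events efficiently. Experiments show 62.7% accuracy in heterogeneous testing with only 7M parameters, along with strong cross-dataset generalization.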

Link: https://arxiv.org/abs/2503.12419
Authors: Luming Wang, Hao Shi, Xiaoting Yin, Kailun Yang, Kaiwei Wang
Affiliations: State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University; National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV); Optics (physics.optics)
Comments: The dataset and models are made publicly available at this https URL

Abstract:Egocentric gesture recognition is a pivotal technology for enhancing natural human-computer interaction, yet traditional RGB-based solutions suffer from motion blur and illumination variations in dynamic scenarios. While event cameras show distinct advantages in handling high dynamic range with ultra-low power consumption, existing RGB-based architectures face inherent limitations in processing asynchronous event streams due to their synchronous frame-based nature. Moreover, from an egocentric perspective, event cameras record data that include events generated by both head movements and hand gestures, thereby increasing the complexity of gesture recognition. To address this, we propose a novel network architecture specifically designed for event data processing, incorporating (1) a lightweight CNN with asymmetric depthwise convolutions to reduce parameters while preserving spatiotemporal features, (2) a plug-and-play state-space model as context block that decouples head movement noise from gesture dynamics, and (3) a parameter-free Bins-Temporal Shift Module (BSTM) that shifts features along bins and temporal dimensions to fuse sparse events efficiently. We further build the EgoEvGesture dataset, the first large-scale dataset for egocentric gesture recognition using event cameras. Experimental results demonstrate that our method achieves 62.7% accuracy in heterogeneous testing with only 7M parameters, 3.1% higher than state-of-the-art approaches. Notable misclassifications in freestyle motions stem from high inter-personal variability and unseen test patterns differing from training data. Moreover, our approach achieved a remarkable accuracy of 96.97% on DVS128 Gesture, demonstrating strong cross-dataset generalization capability. The dataset and models are made publicly available at this https URL.
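
A parameter-free shift of this kind can be sketched in a few lines. The version below shifts along a single axis only (the paper's module shifts along both the bins and the temporal dimensions); the `fold` fraction, zero padding, and function name are assumptions for illustration, not details from the paper.

```python
def shift_along_axis(frames, fold=4):
    """Shift a fraction of channels by one step along the sequence axis.

    frames: list of T feature vectors, each with C channels.
    The first C // fold channels take their value from the previous step,
    the next C // fold channels from the following step, the rest stay put.
    Out-of-range positions are zero-filled. There are no learnable parameters.
    """
    t_len = len(frames)
    c_len = len(frames[0])
    k = c_len // fold
    out = [[0.0] * c_len for _ in range(t_len)]
    for t in range(t_len):
        for c in range(c_len):
            if c < k:
                src = t - 1  # pull from the previous step
            elif c < 2 * k:
                src = t + 1  # pull from the next step
            else:
                src = t      # unshifted channels
            if 0 <= src < t_len:
                out[t][c] = frames[src][c]
    return out
```

After the shift, each position carries some channels from its neighbors, so a later per-position layer can mix information across steps without any extra weights.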

[CV-192] A Causality-Inspired Model for Intima-Media Thickening Assessment in Ultrasound Videos

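[Quick read]: This paper addresses spurious correlations caused by image style variation during carotid ultrasound screening. These interfere with anatomical content cues related to intima-media thickening (such as lumen anatomy) and reduce diagnostic accuracy. The paper proposes a causality-inspired method for frame-wise assessment of carotid intima-media thickening, built from three modules. A Spurious Correlation Elimination (SCE) module removes non-causal style effects by enforcing prediction invariance under style perturbations. A Causal Equivalence Consolidation (CEC) module strengthens causal content correlation through adversarial optimization during content randomization. A Causal Transition Augmentation (CTA) module integrates an auxiliary pathway with text prompts and connects it through contrastive learning to ensure a smooth causal flow. On an in-house carotid ultrasound video dataset the method reaches 86.93% accuracy.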

Link: https://arxiv.org/abs/2503.12418
Authors: Shuo Gao, Jingyang Zhang, Jun Xue, Meng Yang, Yang Chen, Guangquan Zhou
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, conference

Abstract:Carotid atherosclerosis represents a significant health risk, with its early diagnosis primarily dependent on ultrasound-based assessments of carotid intima-media thickening. However, during carotid ultrasound screening, significant view variations cause style shifts, impairing content cues related to thickening, such as lumen anatomy, which introduces spurious correlations that hinder assessment. Therefore, we propose a novel causal-inspired method for assessing carotid intima-media thickening in frame-wise ultrasound videos, which focuses on two aspects: eliminating spurious correlations caused by style and enhancing causal content correlations. Specifically, we introduce a novel Spurious Correlation Elimination (SCE) module to remove non-causal style effects by enforcing prediction invariance with style perturbations. Simultaneously, we propose a Causal Equivalence Consolidation (CEC) module to strengthen causal content correlation through adversarial optimization during content randomization. In addition, we design a Causal Transition Augmentation (CTA) module to ensure smooth causal flow by integrating an auxiliary pathway with text prompts and connecting it through contrastive learning. The experimental results on our in-house carotid ultrasound video dataset achieved an accuracy of 86.93%, demonstrating the superior performance of the proposed method. Code is available at this https URL.

[CV-193] SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation

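[Quick read]: This paper addresses the strong dependence of supervised remote sensing image segmentation models on the quality of label data. Manually annotated labels suffer from high time cost, subjective interference, distorted boundaries, and loss of detail. The paper proposes an Edge-enhanced Labeling Network, SAM2-ELNet, which combines a labeling module with an edge attention mechanism to mitigate label detail loss, fragmentation, and inaccurate boundaries. Because manually annotated remote sensing data are scarce and limit the feature extraction of traditional neural networks, the method uses the Hiera backbone of the pretrained Segment Anything Model 2 (SAM2) as the encoder and fine-tunes it on downstream tasks to obtain high-quality, efficient feature extraction from small samples. Experiments show that models trained with the enhanced labels perform better and reach a lower final loss, indicating closer alignment with the real data distribution. The study also explores extending the model into an efficient automatic annotation framework for large-scale remote sensing image interpretation and intelligent recognition.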

Link: https://arxiv.org/abs/2503.12404
Authors: Jianhao Yang, Wenshuo Yu, Yuanchao Lv, Jiance Sun, Bokang Sun, Mingyang Liu
Affiliations: College of Instrumentation and Electrical Engineering, Jilin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing image segmentation is crucial for environmental monitoring, disaster assessment, and resource management, directly affecting the accuracy and efficiency of surface information extraction. The performance of existing supervised models in remote sensing image segmentation tasks highly depends on the quality of label data. However, current label data mainly relies on manual annotation, which comes with high time costs and is subject to subjective interference, resulting in distortion of label boundaries and often a loss of detail. To solve the above problems, our work proposes an Edge-enhanced Labeling Network, called SAM2-ELNet, which incorporates a labeling module and an edge attention mechanism. This model effectively addresses issues such as label detail loss, fragmentation, and inaccurate boundaries. Due to the scarcity of manually annotated remote sensing data, the feature extraction capabilities of traditional neural networks are limited. Our method uses the Hiera backbone of the pre-trained self-supervised large model segment anything model 2 (SAM2) as the encoder, achieves high-quality and efficient feature extraction even with small samples by fine-tuning on downstream tasks. This study compared the training effects of original and enhanced labels on the manually annotated Deep-SAR Oil Spill (SOS) dataset. Results showed that the model trained with enhanced labels performed better and had a lower final loss, indicating closer alignment with the real data distribution. Our work also explores the potential of extending the model into an efficient automatic annotation framework through generalization experiments, facilitating large-scale remote sensing image interpretation and intelligent recognition.

[CV-194] MExD: An Expert-Infused Diffusion Model for Whole-Slide Image Classification CVPR2025

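[Quick read]: This paper addresses two challenges in Whole Slide Image (WSI) classification: noise introduced by the vast image size and numerous non-informative regions, and data imbalance that hampers feature aggregation. It proposes MExD, an Expert-Infused Diffusion Model that combines a Mixture-of-Experts (MoE) mechanism with a diffusion model. MExD balances the patch feature distribution with a novel MoE-based aggregator that selectively emphasizes relevant features, filtering noise, mitigating data imbalance, and extracting essential features. These features are then integrated through a diffusion-based generative process that directly outputs the class distribution of the WSI. Going beyond conventional discriminative approaches, MExD is the first generative strategy for WSI classification and captures fine-grained details for robust, precise results. It achieves state-of-the-art performance on three widely used benchmarks: Camelyon16, TCGA-NSCLC, and BRACS.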

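Link: https://arxiv.org/abs/2503.12401
Authors: Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Yang Zhao, Hong Cheng, Huazhu Fu
Affiliations: UESTC (University of Electronic Science and Technology of China); AIQ; SICAU (Sichuan Agricultural University); SWJTU (Southwest Jiaotong University); IHPC (Institute of High Performance Computing), A*STAR (Agency for Science, Technology and Research, Singapore)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2025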

Abstract:Whole Slide Image (WSI) classification poses unique challenges due to the vast image size and numerous non-informative regions, which introduce noise and cause data imbalance during feature aggregation. To address these issues, we propose MExD, an Expert-Infused Diffusion Model that combines the strengths of a Mixture-of-Experts (MoE) mechanism with a diffusion model for enhanced classification. MExD balances patch feature distribution through a novel MoE-based aggregator that selectively emphasizes relevant information, effectively filtering noise, addressing data imbalance, and extracting essential features. These features are then integrated via a diffusion-based generative process to directly yield the class distribution for the WSI. Moving beyond conventional discriminative approaches, MExD represents the first generative strategy in WSI classification, capturing fine-grained details for robust and precise results. Our MExD is validated on three widely used benchmarks (Camelyon16, TCGA-NSCLC, and BRACS), consistently achieving state-of-the-art performance in both binary and multi-class tasks.

[CV-195] Pathology Image Restoration via Mixture of Prompts

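[Quick read]: This paper addresses the restoration of high-quality pathology images from single-focal-plane scans in digital pathology. Traditional scanners obtain all-in-focus images by scanning and merging multiple focal planes, which is slow and struggles with complex tissue defocus. Existing single-focal-plane restoration methods fall short because of the intricate defocus patterns and domain-specific semantic complexity of pathology images. The key solution is a two-stage restoration framework that cascades a transformer, to preserve image fidelity, and a diffusion model, to improve perceptual quality. The paper further designs a novel mixture of prompts: alongside an initial prompt that models defocus in microscopic imaging, it adds two prompts that describe high-level image semantics from a pathology foundation model and fine-grained tissue structure boundaries. The prompt mixture is shown to restore high-quality pathology images from single-focal-plane scans, indicating strong potential for clinical use. The code will be made publicly available.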

Link: https://arxiv.org/abs/2503.12399
Authors: Jiangdong Cai, Yan Chen, Zhenrong Shen, Haotian Jiang, Honglin Xiong, Kai Xuan, Lichi Zhang, Qian Wang
Affiliations: School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University; School of Biomedical Engineering, Shanghai Jiao Tong University; Shanghai Clinical Research and Trial Center; Nanjing University of Information Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:In digital pathology, acquiring all-in-focus images is essential to high-quality imaging and highly efficient clinical workflows. Traditional scanners achieve this by scanning at multiple focal planes of varying depths and then merging them, which is relatively slow and often struggles with complex tissue defocus. Recent image restoration techniques provide a means to restore high-quality pathology images from scans of single focal planes. However, existing image restoration methods are inadequate, due to intricate defocus patterns in pathology images and their domain-specific semantic complexities. In this work, we devise a two-stage restoration solution cascading a transformer and a diffusion model, to benefit from their powers in preserving image fidelity and perceptual quality, respectively. We particularly propose a novel mixture of prompts for the two-stage solution. Given an initial prompt that models defocus in microscopic imaging, we design two prompts that describe the high-level image semantics from a pathology foundation model and the fine-grained tissue structures via edge extraction. We demonstrate that, by feeding the prompt mixture to our method, we can restore high-quality pathology images from single-focal-plane scans, implying high potential of the mixture of prompts for clinical usage. Code will be publicly available at this https URL.

[CV-196] Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset CVPR2024

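[Quick read]: This paper addresses the limitations of existing fine-grained visual categorization (FGVC) datasets for car model recognition. The Stanford-Car dataset has only 196 categories and only includes vehicle models produced before 2013, so it does not reflect the growing complexity and diversity of car appearances brought by the rapid development of the automotive industry. The paper introduces Car-1000, a large-scale dataset covering 1000 distinct car models from 165 automakers. The key is a more comprehensive, up-to-date dataset that meets the needs of the modern automotive industry, with several state-of-the-art FGVC methods reproduced on Car-1000 to establish a new benchmark.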

Link: https://arxiv.org/abs/2503.12385
Authors: Yutao Hu, Sen Li, Jincheng Yan, Wenqi Shao, Xiaoyan Luo
Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; School of Astronautics, Beihang University, Beijing, China; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to The Eleventh Workshop on Fine-Grained Visual Categorization in CVPR 2024

Abstract:Fine-grained visual categorization (FGVC) is a challenging but significant task in computer vision, which aims to recognize different sub-categories of birds, cars, airplanes, etc. Among them, recognizing models of different cars has significant application value in autonomous driving, traffic surveillance and scene understanding, which has received considerable attention in the past few years. However, Stanford-Car, the most widely used fine-grained dataset for car recognition, only has 196 different categories and only includes vehicle models produced earlier than 2013. Due to the rapid advancements in the automotive industry during recent years, the appearances of various car models have become increasingly intricate and sophisticated. Consequently, the previous Stanford-Car dataset fails to capture this evolving landscape and cannot satisfy the requirements of automotive industry. To address these challenges, in our paper, we introduce Car-1000, a large-scale dataset designed specifically for fine-grained visual categorization of diverse car models. Car-1000 encompasses vehicles from 165 different automakers, spanning a wide range of 1000 distinct car models. Additionally, we have reproduced several state-of-the-art FGVC methods on the Car-1000 dataset, establishing a new benchmark for research in this field. We hope that our work will offer a fresh perspective for future FGVC researchers. Our dataset is available at this https URL.

[CV-197] VRsketch2Gaussian: 3D VR Sketch Guided 3D Object Generation with Gaussian Splatting

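[Quick read]: This paper addresses multi-modal 3D object generation guided by virtual reality (VR) sketches. It proposes the VRSketch2Gaussian framework, which uses a 3D Gaussian Splatting representation to generate high-quality 3D models from VR sketches. The key innovations are: 1) sketch-CLIP feature alignment, a two-stage strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings; 2) fine-grained multi-modal conditioning, which uses explicit VR sketches for geometric conditioning and text descriptions for appearance control, supported by a generalizable VR sketch encoder; 3) efficient, high-fidelity 3D-native generation, which enables fast, texture-rich 3D object synthesis. Together these address multi-modal alignment and efficient generation in VR-sketch-based generation.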

Link: https://arxiv.org/abs/2503.12383
Authors: Songen Gu, Haoxuan Song, Binjie Liu, Qian Yu, Sanyi Zhang, Haiyong Jiang, Jin Huang, Feng Tian
Affiliations: Institute of Software, Chinese Academy of Sciences (CAS); University of Chinese Academy of Sciences (UCAS); Communication University of China; Beihang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose VRSketch2Gaussian, the first VR sketch-guided, multi-modal, native 3D object generation framework that incorporates a 3D Gaussian Splatting representation. As part of our work, we introduce VRSS, the first large-scale paired dataset containing VR sketches, text, images, and 3DGS, bridging the gap in multi-modal VR sketch-based generation. Our approach features the following key innovations: 1) Sketch-CLIP feature alignment. We propose a two-stage alignment strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings, facilitating both VR sketch-based retrieval and generation tasks. 2) Fine-grained multi-modal conditioning. We disentangle the 3D generation process by using explicit VR sketches for geometric conditioning and text descriptions for appearance control. To facilitate this, we propose a generalizable VR sketch encoder that effectively aligns different modalities. 3) Efficient and high-fidelity 3D native generation. Our method leverages a 3D-native generation approach that enables fast and texture-rich 3D object synthesis. Experiments conducted on our VRSS dataset demonstrate that our method achieves high-quality, multi-modal VR sketch-based 3D generation. We believe our VRSS dataset and VRsketch2Gaussian method will be beneficial for the 3D generation community.

[CV-198] RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds

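[Quick read]: This paper addresses real-time compression with learning-based neural models for LiDAR point cloud compression (LPCC), an indispensable criterion for many industrial applications that remains very challenging. It proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, built on a lightweight model. RENO skips octree construction and builds directly on a multiscale sparse tensor representation. It devises sparse occupancy codes that exploit cross-scale correlation to derive voxel occupancy in a one-shot manner, greatly reducing processing time. As a result, RENO reaches real-time coding speed on a desktop platform (10 fps at 14-bit depth) while saving 12.25% and 48.34% of the bit rate compared with G-PCCv23 and Draco, respectively, at similar quality. The model is only 1 MB in size, which makes it suitable for practical applications.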

Link: https://arxiv.org/abs/2503.12382
Authors: Kang You, Tong Chen, Dandan Ding, M. Salman Asif, Zhan Ma
Affiliations: Nanjing University; Hangzhou Normal University; University of California Riverside
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Despite the substantial advancements demonstrated by learning-based neural models in the LiDAR Point Cloud Compression (LPCC) task, realizing real-time compression - an indispensable criterion for numerous industrial applications - remains a formidable challenge. This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. RENO skips the octree construction and directly builds upon the multiscale sparse tensor representation. Instead of the multi-stage inferring, RENO devises sparse occupancy codes, which exploit cross-scale correlation and derive voxels’ occupancy in a one-shot manner, greatly saving processing time. Experimental results demonstrate that the proposed RENO achieves real-time coding speed, 10 fps at 14-bit depth on a desktop platform (e.g., one RTX 3090 GPU) for both encoding and decoding processes, while providing 12.25% and 48.34% bit-rate savings compared to G-PCCv23 and Draco, respectively, at a similar quality. RENO model size is merely 1MB, making it attractive for practical applications. The source code is available at this https URL.
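
To make "occupancy codes" across scales concrete, the sketch below derives, for each coarse-scale voxel, an 8-bit code that records which of its eight fine-scale children are occupied, and then inverts the mapping. This is the generic parent/child occupancy encoding, not RENO's codec: the bit ordering and function names are assumptions, and no entropy model is included.

```python
def occupancy_codes(voxels):
    """Map fine-scale occupied voxels to {parent voxel: 8-bit occupancy code}.

    voxels: iterable of non-negative integer (x, y, z) coordinates.
    Bit index = 4 * (x & 1) + 2 * (y & 1) + (z & 1); this ordering is an
    assumption chosen for illustration.
    """
    codes = {}
    for x, y, z in voxels:
        parent = (x >> 1, y >> 1, z >> 1)  # coordinate at the coarser scale
        bit = ((x & 1) << 2) | ((y & 1) << 1) | (z & 1)
        codes[parent] = codes.get(parent, 0) | (1 << bit)
    return codes

def expand(codes):
    """Inverse mapping: recover the fine-scale voxels from the parent codes."""
    out = set()
    for (px, py, pz), code in codes.items():
        for bit in range(8):
            if code & (1 << bit):
                out.add((2 * px + (bit >> 2),
                         2 * py + ((bit >> 1) & 1),
                         2 * pz + (bit & 1)))
    return out
```

Because only occupied parents appear in the dictionary, the representation stays sparse, and a codec only needs to predict and entropy-code one symbol per occupied coarse voxel at each scale.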

[CV-199] Deepfake Detection with Optimized Hybrid Model: EAR Biometric Descriptor via Improved RCNN

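[Quick read]: This paper addresses the continuous identification and prevention of deepfake content, which, as AI advances, is increasingly hard to tell apart from artificially altered images. Its key is a robust detection method based on ear biometric descriptors: subtle ear movements and shape changes are analyzed to generate ear descriptors, which are combined with an enhanced Region-Based Convolutional Neural Network (RCNN) in a novel hybrid deepfake detection model. At its core, a combined Deep Belief Network (DBN) and Bidirectional Gated Recurrent Unit (Bi-GRU) model performs detection from the ear descriptors, an improved score-level fusion determines the final output, and the Self-Upgraded Jellyfish Optimization method (SU-JFO) tunes the weights of the detection models. Experiments show that the method outperforms traditional models on several metrics, including accuracy, specificity, and precision.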

Link: https://arxiv.org/abs/2503.12381
Authors: Ruchika Sharma, Rudresh Dwivedi
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Submitted to journal

Abstract:Deepfake is a widely used technology employed in recent years to create pernicious content such as fake news, movies, and rumors by altering and substituting facial information from various sources. Given the ongoing evolution of deepfakes, investigation of continuous identification and prevention is crucial. Due to recent technological advancements in AI (Artificial Intelligence), distinguishing deepfakes and artificially altered images has become challenging. This approach introduces the robust detection of subtle ear movements and shape changes to generate ear descriptors. Further, we also propose a novel optimized hybrid deepfake detection model that considers the ear biometric descriptors via enhanced RCNN (Region-Based Convolutional Neural Network). Initially, the input video is converted into frames and preprocessed through resizing, normalization, grayscale conversion, and filtering processes, followed by face detection using the Viola-Jones technique. Next, a hybrid model comprising DBN (Deep Belief Network) and Bi-GRU (Bidirectional Gated Recurrent Unit) is utilized for deepfake detection based on ear descriptors. The output from the detection phase is determined through improved score-level fusion. To enhance the performance, the weights of both detection models are optimally tuned using the SU-JFO (Self-Upgraded Jellyfish Optimization method). Experimentation is conducted based on four scenarios: compression, noise, rotation, pose, and illumination on three different datasets. The performance results affirm that our proposed method outperforms traditional models such as CNN (Convolutional Neural Network), SqueezeNet, LeNet, LinkNet, LSTM (Long Short-Term Memory), DFP (Deepfake Predictor) [1], and ResNext+CNN+LSTM [2] in terms of various performance metrics, viz. accuracy, specificity, and precision.
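
Score-level fusion in its plainest form is a weighted combination of the two detectors' scores followed by a threshold. The abstract does not describe the "improved" fusion rule, so the convex combination below, the weight `w`, the `threshold`, and the function name are assumptions for illustration only.

```python
def fuse_scores(score_dbn, score_bigru, w=0.5, threshold=0.5):
    """Combine two detector scores in [0, 1] into one decision.

    w weights the first detector and (1 - w) the second; both w and
    threshold are hypothetical values, not taken from the paper.
    Returns (fused score, True if classified as fake).
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    fused = w * score_dbn + (1.0 - w) * score_bigru
    return fused, fused >= threshold
```

In a setup like the one described, an optimizer would search over quantities such as `w` rather than fixing them by hand.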

[CV-200] L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model

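[Quick read]: This paper aims to reduce the high compute and memory cost of Semantic Scene Completion (SSC) in autonomous-driving perception while keeping accuracy high. Prior work relies on computationally demanding, memory-intensive 3D operations, which burden the platform during both training and testing. The paper proposes L2COcc, a lightweight camera-centric SSC framework. Its key components are an efficient voxel transformer (EVT) and cross-modal knowledge distillation modules: feature similarity distillation (FSD), TPV distillation (TPVD), and prediction alignment distillation (PAD). Together these substantially cut the computational burden while maintaining accuracy. On the SemanticKITTI and SSCBench-KITTI-360 benchmarks the method is more accurate than existing vision-based SSC methods, and it reduces both memory consumption and inference time by over 23% compared with the current state-of-the-art method.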

Link: https://arxiv.org/abs/2503.12369
Authors: Ruoyu Wang,Yukai Ma,Yi Yao,Sheng Tao,Haoang Li,Zongzhi Zhu,Yong Liu,Xingxing Zuo
Affiliations: Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China; School of Computation, Information and Technology, Technical University of Munich, Munich, Germany; Zhejiang Guoli Security Technology Co., Ltd., Ningbo, China; State Key Laboratory of Industrial Control Technology (Yong Liu)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduces computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods regarding accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, exhibiting a reduction in both memory consumption and inference time by over 23% compared to the current state-of-the-art method. Code is available at our project page: this https URL.
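
To make the idea of feature similarity distillation concrete, here is a minimal sketch. It assumes the loss is the mean of (1 - cosine similarity) between paired student (camera) and teacher (LiDAR) feature vectors; the abstract does not give L2COcc's actual FSD formulation, so the function names and the loss form are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_similarity_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Hypothetical stand-in for a distillation loss: it is 0 when every
    student feature points in the same direction as its teacher feature.
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("need equally sized, non-empty feature lists")
    losses = [1.0 - cosine_similarity(s, t)
              for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)
```

A direction-only loss like this ignores feature magnitude, which is one common reason to prefer it when student and teacher features live at different scales.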

[CV-201] HyperKAN: Hypergraph Representation Learning with Kolmogorov-Arnold Networks ICASSP2025

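[Quick read]: This paper addresses imbalanced information aggregation in traditional hypergraph neural networks (HNNs) when modeling high-order relationships. Because these methods depend on hypergraph topology, high-degree vertices tend to aggregate redundant features while low-degree vertices struggle to capture sufficient structural features. The paper proposes HyperKAN, a framework that uses Kolmogorov-Arnold Networks (KANs) to capture complex nonlinear relationships and adjusts structural features based on similarity to produce refined vertex representations, mitigating the imbalance. Experiments show that HyperKAN improves on state-of-the-art HNN methods by nearly 9% on the Senate dataset.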

Link: https://arxiv.org/abs/2503.12365
Authors: Xiangfei Fang,Boying Wang,Chengying Huan,Shaonan Ma,Heng Zhang,Chen Zhao
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; North University of China; Nanjing University; Qiyuan Lab
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
Comments: Accepted by ICASSP2025

Abstract:Hypergraph representation learning has garnered increasing attention across various domains due to its capability to model high-order relationships. Traditional methods often rely on hypergraph neural networks (HNNs) employing message passing mechanisms to aggregate vertex and hyperedge features. However, these methods are constrained by their dependence on hypergraph topology, leading to the challenge of imbalanced information aggregation, where high-degree vertices tend to aggregate redundant features, while low-degree vertices often struggle to capture sufficient structural features. To overcome the above challenges, we introduce HyperKAN, a novel framework for hypergraph representation learning that transcends the limitations of message-passing techniques. HyperKAN begins by encoding features for each vertex and then leverages Kolmogorov-Arnold Networks (KANs) to capture complex nonlinear relationships. By adjusting structural features based on similarity, our approach generates refined vertex representations that effectively address the challenge of imbalanced information aggregation. Experiments conducted on real-world datasets demonstrate that HyperKAN significantly outperforms state-of-the-art HNN methods, achieving nearly a 9% performance improvement on the Senate dataset.

[CV-202] Localized Concept Erasure for Text-to-Image Diffusion Models Using Training-Free Gated Low-Rank Adaptation CVPR2025

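[Quick read]: This paper addresses harmful content generated by text-to-image diffusion models, specifically the challenge of erasing a target concept while preserving the remaining concepts and overall image fidelity. Prior methods often sacrifice the fidelity of other image regions in order to erase a target concept that appears only locally, which degrades generation quality. The paper therefore introduces a framework called localized concept erasure, which removes only the specific region containing the target concept rather than altering the whole image.

The key to the solution is a training-free method, Gated Low-rank adaptation for Concept Erasure (GLoCE), which injects a lightweight module into the diffusion model. The module consists of low-rank matrices and a simple gate, both determined from only a few generation steps without training. By applying GLoCE directly to image embeddings and designing the gate to activate only for target concepts, the method can selectively remove the region of the target concept even when target and remaining concepts coexist in an image, leaving the rest intact. Experiments show that GLoCE improves image fidelity to text prompts after erasing localized target concepts, outperforms prior methods by a large margin in efficacy, specificity, and robustness, and extends to mass concept erasure.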

Link: https://arxiv.org/abs/2503.12356
Authors: Byung Hyun Lee,Sungjin Lim,Se Young Chun
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2025

Abstract:Fine-tuning based concept erasing has demonstrated promising results in preventing generation of harmful contents from text-to-image diffusion models by removing target concepts while preserving remaining concepts. To maintain the generation capability of diffusion models after concept erasure, it is necessary to remove only the image region containing the target concept when it locally appears in an image, leaving other regions intact. However, prior arts often compromise fidelity of the other image regions in order to erase the localized target concept appearing in a specific area, thereby reducing the overall performance of image generation. To address these limitations, we first introduce a framework called localized concept erasure, which allows for the deletion of only the specific area containing the target concept in the image while preserving the other regions. As a solution for the localized concept erasure, we propose a training-free approach, dubbed Gated Low-rank adaptation for Concept Erasure (GLoCE), that injects a lightweight module into the diffusion model. GLoCE consists of low-rank matrices and a simple gate, determined only by several generation steps for concepts without training. By directly applying GLoCE to image embeddings and designing the gate to activate only for target concepts, GLoCE can selectively remove only the region of the target concepts, even when target and remaining concepts coexist within an image. Extensive experiments demonstrated GLoCE not only improves the image fidelity to text prompts after erasing the localized target concepts, but also outperforms prior arts in efficacy, specificity, and robustness by large margin and can be extended to mass concept erasure.

[CV-203] Atlas: Multi-Scale Attention Improves Long Context Image Modeling

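[Quick read]: This paper tackles the long-standing machine learning challenge of efficiently modeling massive images. It introduces Multi-Scale Attention (MSA), which rests on two key ideas: (i) multi-scale representations and (ii) bi-directional cross-scale communication. MSA builds O(log N) scales that represent the image with progressively coarser features and uses cross-attention to propagate information across scales. On top of MSA the paper proposes Atlas, a new neural network architecture. On a high-resolution variant of ImageNet 100, Atlas significantly improves the compute-performance tradeoff of long-context image modeling: at 1024px resolution, Atlas-B reaches 91.04% accuracy, comparable to ConvNext-B (91.92%), while being 4.3x faster. Atlas is also 2.95x faster and 7.38% better than FasterViT, and 2.25x faster and 4.96% better than LongViT.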

Link: https://arxiv.org/abs/2503.12355
Authors: Kumar Krishna Agrawal,Long Lian,Longchao Liu,Natalia Harguindeguy,Boyi Li,Alexander Bick,Maggie Chung,Trevor Darrell,Adam Yala
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Efficiently modeling massive images is a long-standing challenge in machine learning. To this end, we introduce Multi-Scale Attention (MSA). MSA relies on two key ideas, (i) multi-scale representations (ii) bi-directional cross-scale communication. MSA creates O(log N) scales to represent the image across progressively coarser features and leverages cross-attention to propagate information across scales. We then introduce Atlas, a novel neural network architecture based on MSA. We demonstrate that Atlas significantly improves the compute-performance tradeoff of long-context image modeling in a high-resolution variant of ImageNet 100. At 1024px resolution, Atlas-B achieves 91.04% accuracy, comparable to ConvNext-B (91.92%) while being 4.3x faster. Atlas is 2.95x faster and 7.38% better than FasterViT, 2.25x faster and 4.96% better than LongViT. In comparisons against MambaVision-S, we find Atlas-S achieves 5%, 16% and 32% higher accuracy at 1024px, 2048px and 4096px respectively, while obtaining similar runtimes. Code for reproducing our experiments and pretrained models is available at this https URL.
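
The claim that MSA needs only O(log N) scales follows from shrinking the token count by a constant factor at each level. A minimal sketch, assuming a fixed downsampling factor (the factor of 4 and the name `scale_token_counts` are illustrative, not taken from the paper):

```python
def scale_token_counts(num_tokens, downsample=4, min_tokens=1):
    """List the token count at each scale, from finest to coarsest.

    Each scale keeps 1/downsample of the previous scale's tokens, so the
    number of scales grows logarithmically with num_tokens.
    """
    if downsample < 2:
        raise ValueError("downsample must be at least 2")
    if num_tokens < min_tokens or min_tokens < 1:
        raise ValueError("need num_tokens >= min_tokens >= 1")
    counts = [num_tokens]
    while counts[-1] > min_tokens:
        counts.append(max(min_tokens, counts[-1] // downsample))
    return counts
```

For example, `scale_token_counts(4096)` yields seven scales, which is log base 4 of 4096 plus one for the finest level.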

[CV-204] ResLPR: A LiDAR Data Restoration Network and Benchmark for Robust Place Recognition Against Weather Corruptions

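[Quick read]: This paper addresses the performance drop of LiDAR-based place recognition (LPR) under adverse weather. State-of-the-art LPR methods perform well in clean weather but still struggle with the weather-induced corruption (snow, fog, rain) commonly encountered in real driving. The paper proposes ResLPRNet, a LiDAR data restoration network whose key idea is to restore corrupted LiDAR scans with a wavelet transform-based network, which substantially improves LPR performance in adverse weather. ResLPRNet is efficient and lightweight, and it can be plugged into pretrained LPR models without substantial additional computational cost. To evaluate the approach, the paper also introduces ResLPR, a new benchmark covering a wide range of LiDAR distortions induced by severe snow, fog, and rain. Experiments show notable gains when the restoration method is combined with multiple LPR approaches in challenging weather.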

Link: https://arxiv.org/abs/2503.12350
Authors: Wenqing Kuang(1),Xiongwei Zhao(2),Yehui Shen(1),Congcong Wen(3),Huimin Lu(1),Zongtan Zhou(1),Xieyuanli Chen(1) ((1) National University of Defense Technology, (2) Harbin Institute of Technology, (3) New York University Abu Dhabi)
Affiliations: College of Intelligence Science and Technology, and National Key Laboratory of Equipment State Sensing and Smart Support, National University of Defense Technology, China; Harbin Institute of Technology, China; New York University Abu Dhabi, UAE; University of Science and Technology of China, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: this https URL.

[CV-205] ProbDiffFlow: An Efficient Learning-Free Framework for Probabilistic Single-Image Optical Flow Estimation

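[Quick read]: This paper studies single-frame optical flow estimation. Traditional methods require consecutive frames, and existing single-frame approaches have two major limitations: they rely on labeled training data, which makes them task-specific, and they produce deterministic predictions that fail to capture motion uncertainty. The paper proposes ProbDiffFlow, a training-free framework built on an estimation-by-synthesis paradigm: it first generates diverse plausible future frames with a diffusion-based model, then estimates motion from these synthesized samples with a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design removes the need for task-specific training while capturing multiple plausible motions.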

Link: https://arxiv.org/abs/2503.12348
Authors: Mo Zhou,Jianwei Wang,Xuanmeng Zhang,Dylan Campbell,Kai Wang,Long Yuan,Wenjie Zhang,Xuemin Lin
Affiliations: The University of New South Wales, Sydney, Australia; The Australian National University, Canberra, Australia; Antai College of Economics & Management, Shanghai Jiao Tong University, Shanghai, China; Nanjing University of Science and Technology, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper studies optical flow estimation, a critical task in motion analysis with applications in autonomous navigation, action recognition, and film production. Traditional optical flow methods require consecutive frames, which are often unavailable due to limitations in data acquisition or real-world scene disruptions. Thus, single-frame optical flow estimation is emerging in the literature. However, existing single-frame approaches suffer from two major limitations: (1) they rely on labeled training data, making them task-specific, and (2) they produce deterministic predictions, failing to capture motion uncertainty. To overcome these challenges, we propose ProbDiffFlow, a training-free framework that estimates optical flow distributions from a single image. Instead of directly predicting motion, ProbDiffFlow follows an estimation-by-synthesis paradigm: it first generates diverse plausible future frames using a diffusion-based model, then estimates motion from these synthesized samples using a pre-trained optical flow model, and finally aggregates the results into a probabilistic flow distribution. This design eliminates the need for task-specific training while capturing multiple plausible motions. Experiments on both synthetic and real-world datasets demonstrate that ProbDiffFlow achieves superior accuracy, diversity, and efficiency, outperforming existing single-image and two-frame baselines.
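
The three-stage estimation-by-synthesis loop described above (sample future frames, estimate flow per sample, aggregate) can be sketched as follows. This assumes the distribution is summarized by a per-pixel mean and variance; the callables `sample_future_frame` and `estimate_flow` are hypothetical stand-ins for a diffusion model and a pre-trained flow network, and the paper's actual aggregation may differ.

```python
def aggregate_flows(flows):
    """Per-pixel mean and variance over sampled flow fields.

    Each flow field is a list of (dx, dy) tuples, one per pixel.
    """
    n = len(flows)
    if n == 0:
        raise ValueError("need at least one flow sample")
    means, variances = [], []
    for p in range(len(flows[0])):
        dxs = [f[p][0] for f in flows]
        dys = [f[p][1] for f in flows]
        mx, my = sum(dxs) / n, sum(dys) / n
        vx = sum((d - mx) ** 2 for d in dxs) / n
        vy = sum((d - my) ** 2 for d in dys) / n
        means.append((mx, my))
        variances.append((vx, vy))
    return means, variances


def estimate_flow_distribution(image, sample_future_frame, estimate_flow,
                               num_samples=8):
    """Sample futures, estimate flow for each, then aggregate."""
    flows = [estimate_flow(image, sample_future_frame(image))
             for _ in range(num_samples)]
    return aggregate_flows(flows)
```

A high variance at a pixel indicates that the sampled futures disagree about its motion, which is the uncertainty a deterministic predictor cannot express.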

[CV-206] TopoGaussian: Inferring Internal Topology Structures from Visual Clues

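[Quick read]: This paper addresses inferring the interior structure of an opaque object from easily accessible photos and videos. Traditional mesh-based approaches require a tedious, error-prone mesh filling and fixing process and typically output only a rough boundary surface. The key contribution is TopoGaussian, a holistic particle-based pipeline that combines Gaussian Splatting with a novel, versatile particle-based differentiable simulator. The simulator handles constitutive models, actuators, and collisions simultaneously without involving a mesh. Using gradients from this simulator, the pipeline offers a flexible choice of topology representation for optimization, including particles, neural implicit surfaces, and quadratic surfaces. Compared with the existing mesh-based method, it is 5.26x faster on average with improved shape quality, which highlights its potential in 3D vision, soft robotics, and manufacturing.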

Link: https://arxiv.org/abs/2503.12343
Authors: Xiaoyu Xiong,Changyu Hu,Chunru Lin,Pingchuan Ma,Chuang Gan,Tao Du
Affiliations: Tsinghua University; University of Massachusetts Amherst; Massachusetts Institute of Technology; Shanghai Qi Zhi Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present TopoGaussian, a holistic, particle-based pipeline for inferring the interior structure of an opaque object from easily accessible photos and videos as input. Traditional mesh-based approaches require a tedious and error-prone mesh filling and fixing process and typically output a rough boundary surface. Our pipeline combines Gaussian Splatting with a novel, versatile particle-based differentiable simulator that simultaneously accommodates constitutive model, actuator, and collision, without interference with mesh. Based on the gradients from this simulator, we provide a flexible choice of topology representation for optimization, including particle, neural implicit surface, and quadratic surface. The resultant pipeline takes easily accessible photos and videos as input and outputs the topology that matches the physical characteristics of the input. We demonstrate the efficacy of our pipeline on a synthetic dataset and four real-world tasks with 3D-printed prototypes. Compared with the existing mesh-based method, our pipeline is 5.26x faster on average with improved shape quality. These results highlight the potential of our pipeline in 3D vision, soft robotics, and manufacturing applications.

[CV-207] GS-3I: Gaussian Splatting for Surface Reconstruction from Illumination-Inconsistent Images IROS2025

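[Quick read]: This paper addresses robust and accurate geometric surface reconstruction under inconsistent illumination. Methods based on 3D Gaussian Splatting (3DGS) offer strong geometric quality and computational efficiency, but they still struggle in illumination-inconsistent scenes. Two problems stand out: 3D Gaussian optimization bias caused by underexposed regions, and geometric constraint mismatches caused by lighting differences across multi-view images.

The key to the solution is a method called GS-3I. First, a tone mapping correction framework based on a convolutional neural network (CNN) mitigates the 3D Gaussian optimization bias caused by underexposed regions in single-view images. Second, to handle the geometric constraint mismatches caused by inconsistent lighting across views, a normal compensation mechanism integrates reference normals extracted from single-view images with normals computed from multi-view observations, which constrains geometric inconsistencies. Together these allow GS-3I to reconstruct surfaces robustly and accurately under complex illumination.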

Link: https://arxiv.org/abs/2503.12335
Authors: Tengfei Wang,Yongmao Hou,Zhaoning Zhang,Yiwei Xu,Zongqian Zhan,Xin Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been submitted to IROS 2025

Abstract:Accurate geometric surface reconstruction, providing essential environmental information for navigation and manipulation tasks, is critical for enabling robotic self-exploration and interaction. Recently, 3D Gaussian Splatting (3DGS) has gained significant attention in the field of surface reconstruction due to its impressive geometric quality and computational efficiency. While recent relevant advancements in novel view synthesis under inconsistent illumination using 3DGS have shown promise, the challenge of robust surface reconstruction under such conditions is still being explored. To address this challenge, we propose a method called GS-3I. Specifically, to mitigate 3D Gaussian optimization bias caused by underexposed regions in single-view images, based on Convolutional Neural Network (CNN), a tone mapping correction framework is introduced. Furthermore, inconsistent lighting across multi-view images, resulting from variations in camera settings and complex scene illumination, often leads to geometric constraint mismatches and deviations in the reconstructed surface. To overcome this, we propose a normal compensation mechanism that integrates reference normals extracted from single-view image with normals computed from multi-view observations to effectively constrain geometric inconsistencies. Extensive experimental evaluations demonstrate that GS-3I can achieve robust and accurate surface reconstruction across complex illumination scenarios, highlighting its effectiveness and versatility in this critical challenge. this https URL

[CV-208] VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining

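[Quick read]: This paper addresses the overfitting that limits the scalability of Mamba-based video understanding models. It proposes VideoMAP, a hybrid Mamba-Transformer framework whose key elements are a novel frame-wise masked autoregressive pre-training strategy and a 4:1 Mamba-to-Transformer ratio that balances computational cost and model capacity. This design delivers significant gains when scaling to larger models, shows strong sample efficiency, and outperforms existing models on multiple datasets. VideoMAP also shows potential as a visual encoder for multimodal large language models, reducing memory usage and enabling longer video sequences.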

Link: https://arxiv.org/abs/2503.12332
Authors: Yunze Liu,Peiran Wu,Cheng Liang,Junxiao Shen,Limin Wang,Li Yi
Affiliations: IIIS, Tsinghua University; Shanghai Qi Zhi Institute; Shanghai Artificial Intelligence Laboratory; University of Bristol; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent Mamba-based architectures for video understanding demonstrate promising computational efficiency and competitive performance, yet struggle with overfitting issues that hinder their scalability. To overcome this challenge, we introduce VideoMAP, a Hybrid Mamba-Transformer framework featuring a novel pre-training approach. VideoMAP uses a 4:1 Mamba-to-Transformer ratio, effectively balancing computational cost and model capacity. This architecture, combined with our proposed frame-wise masked autoregressive pre-training strategy, delivers significant performance gains when scaling to larger models. Additionally, VideoMAP exhibits impressive sample efficiency, significantly outperforming existing methods with less training data. Experiments show that VideoMAP outperforms existing models across various datasets, including Kinetics-400, Something-Something V2, Breakfast, and COIN. Furthermore, we demonstrate the potential of VideoMAP as a visual encoder for multimodal large language models, highlighting its ability to reduce memory usage and enable the processing of longer video sequences. The code is open-source at this https URL
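
A 4:1 Mamba-to-Transformer ratio can be read as a layer schedule. The sketch below assumes one Transformer block follows every four Mamba blocks; the abstract states only the ratio, so the interleaving order and the name `hybrid_layer_pattern` are assumptions.

```python
def hybrid_layer_pattern(num_layers, mamba_per_transformer=4):
    """Return a list of block types with a fixed Mamba:Transformer ratio.

    Assumed layout: every (mamba_per_transformer + 1)-th block is a
    Transformer block and all others are Mamba blocks.
    """
    if num_layers < 0 or mamba_per_transformer < 1:
        raise ValueError("invalid layer configuration")
    period = mamba_per_transformer + 1
    return ["transformer" if (i + 1) % period == 0 else "mamba"
            for i in range(num_layers)]
```

With `num_layers=10` this gives eight Mamba blocks and two Transformer blocks, placed at positions 5 and 10.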

[CV-209] Leveraging Vision Capabilities of Multimodal LLMs for Automated Data Extraction from Plots

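[Quick read]: This paper addresses automated data extraction from plots in research papers, a task so complex that it has mostly been done manually. The key contribution is PlotExtract, a chain-of-thought sequence of zero-shot engineered prompts that uses pretrained multimodal large language models (MLLMs), with proper instructions and engineered workflows, to extract data accurately without fine-tuning. For two-axis plots identified as extractable, PlotExtract finds points with over 90% precision and around 90% recall, with errors in x and y position of around 5% or lower. The results indicate that multimodal LLMs are a viable path for high-throughput data extraction from plots and can replace manual extraction in many circumstances.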

Link: https://arxiv.org/abs/2503.12326
Authors: Maciej P. Polak,Dane Morgan
Affiliations: University of Wisconsin-Madison
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures

Abstract:Automated data extraction from research texts has been steadily improving, with the emergence of large language models (LLMs) accelerating progress even further. Extracting data from plots in research papers, however, has been such a complex task that it has predominantly been confined to manual data extraction. We show that current multimodal large language models, with proper instructions and engineered workflows, are capable of accurately extracting data from plots. This capability is inherent to the pretrained models and can be achieved with a chain-of-thought sequence of zero-shot engineered prompts we call PlotExtract, without the need to fine-tune. We demonstrate PlotExtract here and assess its performance on synthetic and published plots. We consider only plots with two axes in this analysis. For plots identified as extractable, PlotExtract finds points with over 90% precision (and around 90% recall) and errors in x and y position of around 5% or lower. These results prove that multimodal LLMs are a viable path for high-throughput data extraction for plots and in many circumstances can replace the current manual methods of data extraction.
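
Precision and recall for extracted plot points depend on how a "correct" point is defined. The sketch below assumes a point counts as correct when both its x and y errors, relative to the axis range, fall within a tolerance, with greedy one-to-one matching; this matching rule, the 5% default, and the name `precision_recall` are assumptions, not the paper's evaluation protocol.

```python
def precision_recall(extracted, truth, x_range, y_range, tol=0.05):
    """Greedy one-to-one matching of extracted points to ground truth.

    A pair matches when |dx| / x_range and |dy| / y_range are both <= tol.
    Returns (precision, recall).
    """
    if x_range <= 0 or y_range <= 0:
        raise ValueError("axis ranges must be positive")
    unmatched = list(truth)
    true_positives = 0
    for ex, ey in extracted:
        for candidate in unmatched:
            tx, ty = candidate
            if (abs(ex - tx) / x_range <= tol
                    and abs(ey - ty) / y_range <= tol):
                unmatched.remove(candidate)  # each truth point matches once
                true_positives += 1
                break
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

Normalizing by the axis range rather than by the point's own value avoids blow-ups for points near zero.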

[CV-210] Swift4D: Adaptive divide-and-conquer Gaussian Splatting for compact and efficient reconstruction of dynamic scene ICLR2025

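[Quick read]: This paper addresses novel view synthesis for dynamic scenes. Although many methods exist, some using advanced representations such as 3D Gaussian Splatting, they still struggle to recover high-quality results and often consume too much storage and training time. The paper proposes Swift4D, a divide-and-conquer 3D Gaussian Splatting method that handles static and dynamic primitives separately to balance rendering quality and efficiency. It is motivated by the observation that most of a scene consists of static primitives that need no additional dynamic properties. The key is to model dynamic transformations only for the dynamic primitives, which benefits both efficiency and quality. A learnable decomposition strategy first separates the primitives, using an additional parameter to classify each as static or dynamic. For dynamic primitives, a compact multi-resolution 4D hash mapper transforms them from canonical space into deformation space at each timestamp, and the static and dynamic primitives are then mixed to produce the final output. This approach enables efficient training and reduces storage redundancy. The method achieves state-of-the-art rendering quality while training 20x faster than previous state-of-the-art methods on real-world datasets, with a minimum storage requirement of only 30MB.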

Link: https://arxiv.org/abs/2503.12307
Authors: Jiahao Wu,Rui Peng,Zhiyan Wang,Lu Xiao,Luyang Tang,Jinbo Yan,Kaiqiang Xiong,Ronggang Wang
Affiliations: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; Shenzhen Graduate School, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICLR 2025

Abstract:Novel view synthesis has long been a practical but challenging task. Although numerous methods have been introduced to solve this problem, some combining advanced representations like 3D Gaussian Splatting, they still struggle to recover high-quality results and often consume too much storage memory and training time. In this paper we propose Swift4D, a divide-and-conquer 3D Gaussian Splatting method that can handle static and dynamic primitives separately, achieving a good trade-off between rendering quality and efficiency, motivated by the fact that most of the scene consists of static primitives that do not require additional dynamic properties. Concretely, we focus on modeling dynamic transformations only for the dynamic primitives, which benefits both efficiency and quality. We first employ a learnable decomposition strategy to separate the primitives, which relies on an additional parameter to classify primitives as static or dynamic. For the dynamic primitives, we employ a compact multi-resolution 4D Hash mapper to transform these primitives from canonical space into deformation space at each timestamp, and then mix the static and dynamic primitives to produce the final output. This divide-and-conquer method facilitates efficient training and reduces storage redundancy. Our method achieves state-of-the-art rendering quality while being 20X faster in training than previous SOTA methods, with a minimum storage requirement of only 30MB on real-world datasets. Code is available at this https URL.
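
The divide-and-conquer step (classify each primitive with one extra learned parameter, then deform only the dynamic ones) can be sketched as below. This assumes the extra parameter is a logit passed through a sigmoid and thresholded; `split_primitives`, `positions_at_time`, and the `deform` callable (standing in for the 4D hash mapper) are hypothetical.

```python
import math


def split_primitives(dynamic_logits, threshold=0.5):
    """Split primitive indices into (static, dynamic) by a per-primitive logit."""
    static_ids, dynamic_ids = [], []
    for idx, logit in enumerate(dynamic_logits):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        if prob > threshold:
            dynamic_ids.append(idx)
        else:
            static_ids.append(idx)
    return static_ids, dynamic_ids


def positions_at_time(positions, dynamic_logits, deform, t):
    """Deform only dynamic primitives; static ones keep canonical positions."""
    _, dynamic_ids = split_primitives(dynamic_logits)
    dynamic = set(dynamic_ids)
    return [deform(p, t) if i in dynamic else p
            for i, p in enumerate(positions)]
```

Because static primitives skip the deformation call entirely, the per-timestamp cost scales with the number of dynamic primitives rather than with the whole scene.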

[CV-211] Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs

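[Quick read]: This paper addresses the difficulties that multimodal large language models (MLLMs) have with fine-grained perception and complex reasoning. Because collecting chain-of-thought (CoT) reasoning data is extremely costly, prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions, which does little for the model's systematic cognition. Using advanced MLLMs to generate captions improves scalability, but the outputs often lack comprehensiveness and accuracy.

The key to the solution is Self-Improving Cognition (SIcog), a self-learning framework that enhances the systematic cognitive capabilities of next-generation foundation MLLMs through multimodal pre-training on self-generated data. SIcog introduces chain-of-description, which improves systematic perception through step-by-step visual understanding, and adopts a structured CoT reasoning technique to integrate in-depth multimodal reasoning. The framework first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are curated through self-consistency and used in multimodal pre-training to refine the model. Experiments show that with only 213K self-generated pre-training samples, SIcog yields significantly improved cognition across diverse benchmarks and outperforms prevalent pre-training approaches.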

Link: https://arxiv.org/abs/2503.12303
Authors: Xiaoying Zhang,Da Peng,Yipeng Zhang,Zonghao Guo,Chengyue Wu,Chi Chen,Wei Ke,Helen Meng,Maosong Sun
Affiliations: The Chinese University of Hong Kong; Xi’an Jiaotong University; Tsinghua University; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages

Abstract:Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions due to the extremely high cost of collecting chain-of-thought (CoT) reasoning data for improving reasoning. While leveraging advanced MLLMs for caption generation enhances scalability, the outputs often lack comprehensiveness and accuracy. In this paper, we introduce Self-Improving Cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs by enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose chain-of-description, an approach that improves an MLLM’s systematic perception by enabling step-by-step visual understanding, ensuring greater comprehensiveness and accuracy. Additionally, we adopt a structured CoT reasoning technique to enable MLLMs to integrate in-depth multimodal reasoning. To construct a next-generation foundation MLLM with self-improved cognition, SIcog first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are further curated through self-consistency. This curated data is ultimately used to refine the MLLM during multimodal pre-training, facilitating next-generation foundation MLLM construction. Extensive experiments on both low- and high-resolution MLLMs across diverse benchmarks demonstrate that, with merely 213K self-generated pre-training samples, SIcog produces next-generation foundation MLLMs with significantly improved cognition, achieving benchmark-leading performance compared to prevalent pre-training approaches.
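
Curation "through self-consistency" is commonly implemented as majority voting over several sampled outputs. The sketch below assumes that reading: a sample is kept only if one answer wins a strict majority. The abstract does not specify SIcog's actual criterion, so `self_consistent_answer` and the 0.5 agreement threshold are assumptions.

```python
from collections import Counter


def self_consistent_answer(answers, min_agreement=0.5):
    """Return the majority answer, or None if agreement is too low.

    answers: final answers from several sampled reasoning chains for the
    same input. The sample is discarded (None) unless the top answer's
    share of the votes is strictly greater than min_agreement.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_agreement else None
```

Samples that return `None` would be dropped from the self-generated pre-training set, which trades data volume for label reliability.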

[CV-212] REdiSplats: Ray Tracing for Editable Gaussian Splatting

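[Quick read]: This paper addresses the limitations of classical Gaussian Splatting (GS) when it comes to varying light conditions (such as shadows and reflections), manual adjustments, and integration with a physics engine. A few recent approaches add ray tracing or mesh primitives to GS to ease some of these issues, but none solves all of them at once. The paper proposes REdiSplats, whose key idea is to combine ray tracing with a mesh-based representation of flat 3D Gaussians: the scene is modeled with flat Gaussian distributions parameterized by a mesh, and the Gaussians are edited by adjusting the mesh vertices. This design supports fast ray tracing as well as modeling of light conditions, manual adjustments, and physical simulation. The models can also be rendered with 3D tools such as Blender or Nvdiffrast, which allows integration with existing mesh-based 3D graphics techniques.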

Link: https://arxiv.org/abs/2503.12284
Authors: Krzysztof Byrski,Grzegorz Wilczyński,Weronika Smolak-Dyżewska,Piotr Borycki,Dawid Baran,Sławomir Tadeja,Przemysław Spurek
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gaussian Splatting (GS) has become one of the most important neural rendering algorithms. GS represents 3D scenes using Gaussian components with trainable color and opacity. This representation achieves high-quality renderings with fast inference. Regrettably, it is challenging to integrate such a solution with varying light conditions, including shadows and light reflections, manual adjustments, and a physical engine. Recently, a few approaches have appeared that incorporate ray-tracing or mesh primitives into GS to address some of these caveats. However, no such solution can simultaneously solve all the existing limitations of the classical GS. Consequently, we introduce REdiSplats, which employs ray tracing and a mesh-based representation of flat 3D Gaussians. In practice, we model the scene using flat Gaussian distributions parameterized by the mesh. We can leverage fast ray tracing and control Gaussian modification by adjusting the mesh vertices. Moreover, REdiSplats allows modeling of light conditions, manual adjustments, and physical simulation. Furthermore, we can render our models using 3D tools such as Blender or Nvdiffrast, which opens the possibility of integrating them with all existing 3D graphics techniques dedicated to mesh representations.

[CV-213] Exploration of VLMs for Driver Monitoring Systems Applications

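[Quick read]: This paper addresses the lack of research on applying large Vision-Language Models (VLMs) to Driver Monitoring Systems (DMS). Although these technologies have been explored across industries, including automotive, the scientific literature on their use in DMS remains sparse. The paper evaluates VLMs on the Driver Monitoring Dataset and discusses their advantages and challenges in real-world scenarios. The central idea is to replace the traditional workflow of data capture and algorithm training with well-designed prompts for advanced VLMs.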

Link: https://arxiv.org/abs/2503.12281
Authors: Paola Natalia Cañas,Marcos Nieto,Oihana Otaegui,Igor Rodríguez
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted in 16th ITS European Congress, Seville, Spain, 19-21 May 2025

Abstract:In recent years, we have witnessed significant progress in emerging deep learning models, particularly Large Language Models (LLMs) and Vision-Language Models (VLMs). These models have demonstrated promising results, indicating a new era of Artificial Intelligence (AI) that surpasses previous methodologies. Their extensive knowledge and zero-shot capabilities suggest a paradigm shift in developing deep learning solutions, moving from data capturing and algorithm training to just writing appropriate prompts. While the application of these technologies has been explored across various industries, including automotive, there is a notable gap in the scientific literature regarding their use in Driver Monitoring Systems (DMS). This paper presents our initial approach to implementing VLMs in this domain, utilising the Driver Monitoring Dataset to evaluate their performance and discussing their advantages and challenges when implemented in real-world scenarios.

[CV-214] Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection

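[Quick read]: This paper addresses the limited options for inference-time scaling of text-to-image diffusion models, which currently rely mainly on best-of-N sampling: several images are generated per prompt and a selection model picks the best one. As an alternative to naive best-of-N sampling, the paper equips text-to-image Diffusion Transformers with in-context reflection capabilities. The key is Reflect-DiT, which lets a Diffusion Transformer refine its generations using in-context examples of previously generated images together with textual feedback describing the needed improvements, so that each new generation targets specific shortcomings. With SANA-1.0-1.6B as the base model, Reflect-DiT improves the GenEval score by 0.19 and sets a new state-of-the-art score of 0.81 with only 20 samples per prompt, surpassing the previous best of 0.80, which was obtained with a much larger model (SANA-1.5-4.8B) and 2048 samples under best-of-N.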

Link: https://arxiv.org/abs/2503.12271
Authors: Shufan Li,Konstantinos Kallidromitis,Akash Gokul,Arsh Koneru,Yusuke Kato,Kazuki Kozuka,Aditya Grover
Affiliations: UCLA; Panasonic AI Research; Salesforce AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures

Abstract:The predominant approach to advancing text-to-image generation has been training-time scaling, where larger models are trained on more data using greater computational resources. While effective, this approach is computationally expensive, leading to growing interest in inference-time scaling to improve performance. Currently, inference-time scaling for text-to-image diffusion models is largely limited to best-of-N sampling, where multiple images are generated per prompt and a selection model chooses the best output. Inspired by the recent success of reasoning models like DeepSeek-R1 in the language domain, we introduce an alternative to naive best-of-N sampling by equipping text-to-image Diffusion Transformers with in-context reflection capabilities. We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. Instead of passively relying on random sampling and hoping for a better result in a future generation, Reflect-DiT explicitly tailors its generations to address specific aspects requiring enhancement. Experimental results demonstrate that Reflect-DiT improves performance on the GenEval benchmark (+0.19) using SANA-1.0-1.6B as a base model. Additionally, it achieves a new state-of-the-art score of 0.81 on GenEval while generating only 20 samples per prompt, surpassing the previous best score of 0.80, which was obtained using a significantly larger model (SANA-1.5-4.8B) with 2048 samples under the best-of-N approach.
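
The difference from best-of-N is that each new sample is conditioned on earlier attempts and their feedback. A minimal sketch of such a loop, assuming hypothetical callables `generate(prompt, context)` and `judge(prompt, image)` in place of the diffusion model and the feedback model (neither signature comes from the paper):

```python
def reflect_and_refine(prompt, generate, judge, max_samples=20,
                       target=1.0, context_size=3):
    """Iteratively generate, collect feedback, and condition on it.

    generate(prompt, context) -> image, where context is a list of
        (previous_image, feedback_text) pairs.
    judge(prompt, image) -> (score, feedback_text).
    Returns the best (image, score) seen within the sample budget.
    """
    context = []
    best_image, best_score = None, float("-inf")
    for _ in range(max_samples):
        # Only the most recent attempts are kept in context.
        image = generate(prompt, context[-context_size:])
        score, feedback = judge(prompt, image)
        if score > best_score:
            best_image, best_score = image, score
        if score >= target:
            break  # good enough, stop spending samples
        context.append((image, feedback))
    return best_image, best_score
```

Setting `context_size` small keeps the conditioning input bounded; best-of-N corresponds to the special case where `generate` ignores `context` altogether.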

[CV-215] Cracking the PUMA Challenge in 24 Hours with CellViT and nnU-Net

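[Quick read]: This paper tackles tissue segmentation and nuclei detection in histopathology images of advanced melanoma, an important step in biomarker extraction and discovery in pathology. The work is a submission to the PUMA challenge, which aims to improve tissue segmentation and nuclei detection in melanoma histopathology. The key to the solution is a deployable pipeline built within a 24-hour development window from out-of-the-box frameworks: CellViT++ for nuclei detection and nnU-Net for tissue segmentation. Tissue segmentation reaches a Dice score of 0.750, well above the baseline of 0.629, while nuclei detection is comparable to the baseline. The code is publicly available.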

Link: https://arxiv.org/abs/2503.12269
Authors: Negar Shahamiri, Moritz Rempe, Lukas Heine, Jens Kleesiek, Fabian Hörst
Affiliations: University Hospital Essen (uk-essen.de)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Automatic tissue segmentation and nuclei detection is an important task in pathology, aiding in biomarker extraction and discovery. The panoptic segmentation of nuclei and tissue in advanced melanoma (PUMA) challenge aims to improve tissue segmentation and nuclei detection in melanoma histopathology. Unlike many challenge submissions focusing on extensive model tuning, our approach emphasizes delivering a deployable solution within a 24-hour development timeframe, using out-of-the-box frameworks. The pipeline combines two models, namely CellViT++ for nuclei detection and nnU-Net for tissue segmentation. Our results demonstrate a significant improvement in tissue segmentation, achieving a Dice score of 0.750, surpassing the baseline score of 0.629. For nuclei detection, we obtained results comparable to the baseline in both challenge tracks. The code is publicly available at this https URL.
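For readers unfamiliar with the metric behind the 0.750 and 0.629 figures, the Dice score of two binary masks is twice their overlap divided by the sum of their sizes. The sketch below is a minimal illustration on toy masks, not the challenge's evaluation code; treating two empty masks as a perfect match is an assumed convention.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted pixels, 3 reference pixels, 2 of them shared.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3)
```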

[CV-216] An Efficient Deep Learning-Based Approach to Automating Invoice Document Validation

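[Quick read]: This paper addresses the need for multi-criteria invoice validation created by the rapid growth of financial transactions in large organizations. Manual processing is slow and error-prone, while existing automated solutions cannot handle a variety of constraints, such as documents that are partially handwritten or photographed with a mobile phone. The paper proposes to validate machine-written invoices automatically using document layout analysis and object detection based on deep learning (DL). The key to the solution is a new dataset of manually annotated real-world invoices together with a multi-criteria validation process, on which the relevant DL models are fine-tuned and benchmarked. Experiments confirm that the approach validates invoices quickly and accurately.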

Link: https://arxiv.org/abs/2503.12267
Authors: Aziz Amari, Mariem Makni, Wissal Fnaich, Akram Lahmar, Fedi Koubaa, Oumayma Charrad, Mohamed Ali Zormati, Rabaa Youssef Douss
Affiliations: National Institute of Applied Sciences and Technology (INSAT); University of Carthage
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:In large organizations, the number of financial transactions can grow rapidly, driving the need for fast and accurate multi-criteria invoice validation. Manual processing remains error-prone and time-consuming, while current automated solutions are limited by their inability to support a variety of constraints, such as documents that are partially handwritten or photographed with a mobile phone. In this paper, we propose to automate the validation of machine written invoices using document layout analysis and object detection techniques based on recent deep learning (DL) models. We introduce a novel dataset consisting of manually annotated real-world invoices and a multi-criteria validation process. We fine-tune and benchmark the most relevant DL models on our dataset. Experimental results show the effectiveness of the proposed pipeline and selected DL models in terms of achieving fast and accurate validation of invoices.

[CV-217] Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition

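[Quick read]: This paper addresses poor audio-visual feature representations in multimodal emotion recognition, which degrade system performance when the two modalities are only weakly complementary. The key to the solution is a flexible audio-visual fusion model based on a gated attention mechanism. A gate added to every iteration of the recursive joint cross-attention model controls the flow of information between the input features and the attended features according to the strength of their complementary relationship: cross-attended features are selected when the modalities are strongly complementary, and non-attended features otherwise. A further stage gating mechanism controls the flow of information across the gated outputs of each iteration, adding flexibility and improving performance even when the modalities are weakly complementary. Evaluated on the challenging Affwild2 dataset, the method significantly outperforms state-of-the-art fusion approaches.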

Link: https://arxiv.org/abs/2503.12261
Authors: R. Gnana Praveen, Jahangir Alam
Affiliations: Computer Research Institute of Montreal (CRIM)
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submission to valence arousal track of 8th ABAW competition. arXiv admin note: substantial text overlap with arXiv:2403.13659


Abstract:Multimodal emotion recognition has recently drawn a lot of interest in affective computing as it has immense potential to outperform isolated unimodal approaches. Audio and visual modalities are two predominant contact-free channels in videos, which are often expected to carry a complementary relationship with each other. However, audio and visual channels may not always be complementary with each other, resulting in poor audio-visual feature representations, thereby degrading the performance of the system. In this paper, we propose a flexible audio-visual fusion model that can adapt to weak complementary relationships using a gated attention mechanism. Specifically, we extend the recursive joint cross-attention model by introducing a gating mechanism in every iteration to control the flow of information between the input features and the attended features depending on the strength of their complementary relationship. For instance, if the modalities exhibit strong complementary relationships, the gating mechanism chooses cross-attended features, otherwise non-attended features. To further improve the performance of the system, we introduce a stage gating mechanism, which is used to control the flow of information across the gated outputs of each iteration. Therefore, the proposed model improves the performance of the system even when the audio and visual modalities do not have a strong complementary relationship with each other by adding more flexibility to the recursive joint cross-attention mechanism. The proposed model has been evaluated on the challenging Affwild2 dataset and significantly outperforms the state-of-the-art fusion approaches.
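The gating idea, choosing between cross-attended and non-attended features, can be illustrated with a soft two-way gate. This is a minimal sketch under assumptions, not the authors' architecture: in a real model the gate logits would come from a learned layer, whereas here they are supplied by hand, and the low softmax temperature is an assumed value that makes the gate behave almost like a hard switch.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(unattended, attended, gate_logits, temperature=0.1):
    """Blend non-attended and cross-attended features with a two-way soft gate."""
    w_unattended, w_attended = softmax(gate_logits, temperature)
    return [w_unattended * u + w_attended * a for u, a in zip(unattended, attended)]


unattended = [1.0, 2.0]  # hypothetical input features of one modality
attended = [3.0, 4.0]    # hypothetical cross-attended features

# Gate favours the attended branch (strong complementary relationship).
strong = gated_fusion(unattended, attended, gate_logits=[0.0, 1.0])
# Gate favours the original features (weak complementary relationship).
weak = gated_fusion(unattended, attended, gate_logits=[1.0, 0.0])
print(strong, weak)
```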

[CV-218] Enhancing Facial Expression Recognition through Dual-Direction Attention Mixed Feature Networks and CLIP: Application to 8th ABAW Challenge

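[Quick read]: This paper addresses three independent challenges in facial expression analysis: valence-arousal (VA) estimation, emotion recognition, and facial action unit detection. The key is the Dual-Direction Attention Mixed Feature Network (DDAMFN), which is used for all three tasks and surpasses the proposed baselines. The authors additionally explore CLIP for emotion recognition and share insights into the architectural choices behind the method's strong performance.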

Link: https://arxiv.org/abs/2503.12260
Authors: Josep Cabacas-Maso, Elena Ortega-Beltrán, Ismael Benito-Altamirano, Carles Ventura
Affiliations: eHealth Center, Faculty of Computer Science, Multimedia and Telecommunication, Universitat Oberta de Catalunya; MIND/IN2UB, Department of Electronic and Biomedical Engineering, Universitat de Barcelona
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:We present our contribution to the 8th ABAW challenge at CVPR 2025, where we tackle valence-arousal estimation, emotion recognition, and facial action unit detection as three independent challenges. Our approach leverages the well-known Dual-Direction Attention Mixed Feature Network (DDAMFN) for all three tasks, achieving results that surpass the proposed baselines. Additionally, we explore the use of CLIP for the emotion recognition challenge as an additional experiment. We provide insights into the architectural choices that contribute to the strong performance of our methods.

[CV-219] Minuscule Cell Detection in AS-OCT Images with Progressive Field-of-View Focusing

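[Quick read]: This paper addresses the difficulty of detecting anterior chamber inflammatory cells in high-resolution Anterior Segment Optical Coherence Tomography (AS-OCT) images. Detection has traditionally been manual, and automated computer vision methods face two challenges: each cell is a minuscule object covering less than 0.005% of the image, and OCT imaging introduces pixel-level noise that can be mistaken for cells, causing false positives. The paper proposes a minuscule cell detection framework built on a progressive field-of-view focusing strategy, which narrows the detection scope from the whole image to a target region likely to contain cells, and then to minuscule regions that may contain individual cells. The framework has two modules: a Field-of-Focus module that segments the target region with a vision foundation model, and a Fine-grained Object Detection module that applies a Minuscule Region Proposal followed by a Spatial Attention Network to separate cells from noise within the segmented region. Experiments show that the framework outperforms state-of-the-art cell detection methods and improves efficacy for clinical use.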

Link: https://arxiv.org/abs/2503.12249
Authors: Boyu Chen, Ameenat L. Solebo, Daqian Shi, Jinge Wu, Paul Taylor
Affiliations: Institute of Health Informatics, University College London; Great Ormond Street Institute of Child Health, University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Anterior Segment Optical Coherence Tomography (AS-OCT) is an emerging imaging technique with great potential for diagnosing anterior uveitis, a vision-threatening ocular inflammatory condition. A hallmark of this condition is the presence of inflammatory cells in the eye’s anterior chamber, and detecting these cells using AS-OCT images has attracted research interest. While recent efforts aim to replace manual cell detection with automated computer vision approaches, detecting extremely small (minuscule) objects in high-resolution images, such as AS-OCT, poses substantial challenges: (1) each cell appears as a minuscule particle, representing less than 0.005% of the image, making the detection difficult, and (2) OCT imaging introduces pixel-level noise that can be mistaken for cells, leading to false positive detections. To overcome these challenges, we propose a minuscule cell detection framework through a progressive field-of-view focusing strategy. This strategy systematically refines the detection scope from the whole image to a target region where cells are likely to be present, and further to minuscule regions potentially containing individual cells. Our framework consists of two modules. First, a Field-of-Focus module uses a vision foundation model to segment the target region. Subsequently, a Fine-grained Object Detection module introduces a specialized Minuscule Region Proposal followed by a Spatial Attention Network to distinguish individual cells from noise within the segmented region. Experimental results demonstrate that our framework outperforms state-of-the-art methods for cell detection, providing enhanced efficacy for clinical applications. Our code is publicly available at: this https URL.

[CV-220] RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance CVPR2025

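[Quick read]: This paper addresses the gap between methods that replay general dynamic scenes and methods that animate human avatars, neither of which can re-perform general dynamic scenes. It proposes RePerformer, a Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. The key steps are to hierarchically disentangle the dynamic scene into motion Gaussians and appearance Gaussians associated in canonical space; to encode the appearance Gaussians efficiently into 2D position and attribute maps with a Morton-based parameterization; and to use 2D CNNs to map position maps to attribute maps, which are assembled back into appearance Gaussians for high-fidelity rendering. For re-performance, a semantic-aware alignment module and deformation transfer on the motion Gaussians enable photo-real rendering under novel motions.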

Link: https://arxiv.org/abs/2503.12242
Authors: Yuheng Jiang, Zhehao Shen, Chengcheng Guo, Yu Hong, Zhuo Su, Yingliang Zhang, Marc Habermann, Lan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025. Project Page: this https URL


Abstract:Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods focus either on replaying general dynamic scenes or animating human avatars, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for playback-then-reperformance paradigm in human-centric volumetric videos.
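A Morton (Z-order) code interleaves the bits of quantized coordinates so that points close together in 3D tend to stay close in the sorted order, which is what makes a 2D map layout friendly to CNNs. The sketch below shows only this general idea. How RePerformer quantizes and arranges its Gaussians is not stated in the abstract, so the integer coordinates, the `bits` parameter, and the row-major layout in `to_position_map` are all assumptions.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative integers (x lowest)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def to_position_map(points, width, bits=10):
    """Sort quantized 3D points by Morton code and lay them out row-major in a grid."""
    ordered = sorted(points, key=lambda p: morton3d(p[0], p[1], p[2], bits))
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]


# Hypothetical quantized Gaussian centres.
points = [(3, 3, 3), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
for row in to_position_map(points, width=2):
    print(row)
```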

[CV-221] From Laboratory to Real World: A New Benchmark Towards Privacy-Preserved Visible-Infrared Person Re-Identification CVPR2025

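[Quick read]: This paper addresses visible-infrared person re-identification (VI-ReID), which matches pedestrian images captured under varying lighting conditions. In real-world surveillance, data is distributed across multiple devices and entities, which raises privacy and ownership concerns and makes existing centralized training impractical. The paper proposes the L2RW benchmark, whose key idea is to integrate decentralized training into VI-ReID to address privacy concerns in scenarios with limited data-sharing regulation. It defines protocols and corresponding algorithms for different privacy sensitivity levels, so that training takes place under one of two conditions: (1) data from each camera remains completely isolated, or (2) different data entities can selectively share data. This simulates strict privacy constraints that are closer to real-world conditions. Extensive experiments with server-side federated algorithms show that decentralized VI-ReID training is feasible and that, on unseen domains (i.e., new data entities), L2RW trained with isolated data (privacy-preserved) performs comparably to state-of-the-art methods trained with shared data (privacy-unrestricted).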

Link: https://arxiv.org/abs/2503.12232
Authors: Yan Jiang, Hao Yu, Xu Cheng, Haoyu Chen, Zhaodong Sun, Guoying Zhao
Affiliations: School of Computer Science, Nanjing University of Information Science and Technology; Center for Machine Vision and Signal Analysis, University of Oulu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025


Abstract:Aiming to match pedestrian images captured under varying lighting conditions, visible-infrared person re-identification (VI-ReID) has drawn intensive research attention and achieved promising results. However, in real-world surveillance contexts, data is distributed across multiple devices/entities, raising privacy and ownership concerns that make existing centralized training impractical for VI-ReID. To tackle these challenges, we propose L2RW, a benchmark that brings VI-ReID closer to real-world applications. The rationale of L2RW is that integrating decentralized training into VI-ReID can address privacy concerns in scenarios with limited data-sharing regulation. Specifically, we design protocols and corresponding algorithms for different privacy sensitivity levels. In our new benchmark, we ensure the model training is done in the conditions that: 1) data from each camera remains completely isolated, or 2) different data entities (e.g., data controllers of a certain region) can selectively share the data. In this way, we simulate scenarios with strict privacy constraints which is closer to real-world conditions. Intensive experiments with various server-side federated algorithms are conducted, showing the feasibility of decentralized VI-ReID training. Notably, when evaluated in unseen domains (i.e., new data entities), our L2RW, trained with isolated data (privacy-preserved), achieves performance comparable to SOTAs trained with shared data (privacy-unrestricted). We hope this work offers a novel research entry for deploying VI-ReID that fits real-world scenarios and can benefit the community.
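To make "server-side federated algorithms" concrete, the sketch below shows federated averaging (FedAvg), the most common such aggregation rule. The abstract does not say which algorithms L2RW benchmarks, so FedAvg is an assumed example, and the flat weight lists and per-camera sample counts are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Server-side FedAvg: average client parameters weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical camera clients. Only parameters leave each client, never images.
camera_a = [1.0, 2.0]  # trained locally on 1 sample
camera_b = [3.0, 4.0]  # trained locally on 3 samples
global_weights = federated_average([camera_a, camera_b], client_sizes=[1, 3])
print(global_weights)
```

The privacy argument rests on what is exchanged: each client sends model parameters to the server, while its raw data stays local.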

[CV-222] LIAM: Multimodal Transformer for Language Instructions Images Actions and Semantic Maps

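[Quick read]: This paper addresses the limited flexibility of domestic service robots across diverse household tasks. Rather than implementing each task individually, the robot is given a task description along with appropriate environment information. The key contribution is LIAM, an end-to-end model that predicts action transcripts from language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, and two pre-training tasks are designed to fine-tune its weights and pre-align the latent spaces. Evaluation on the ALFRED dataset demonstrates the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.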

Link: https://arxiv.org/abs/2503.12230
Authors: Yihao Wang, Raphael Memmesheimer, Sven Behnke
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:


Abstract:The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.

[CV-223] Shadow Art Kanji: Inverse Rendering Application

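[Quick read]: This paper addresses the difficulty of balancing artistic beauty with machine-generated imagery. Its specific goal is to create 3D models that, when illuminated, cast shadows resembling Kanji characters. The key to the solution is to combine artistic expression with computational techniques, giving an accurate and efficient way to visualize these Japanese characters through shadows.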

Link: https://arxiv.org/abs/2503.12229
Authors: William Louis Rothman, Yasuyuki Matsushita
Affiliations: University of California, Berkeley; Osaka University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 10 figures, 8 references


Abstract:Finding a balance between artistic beauty and machine-generated imagery is always a difficult task. This project seeks to create 3D models that, when illuminated, cast shadows resembling Kanji characters. It aims to combine artistic expression with computational techniques, providing an accurate and efficient approach to visualizing these Japanese characters through shadows.

[CV-224] Adaptive Label Correction for Robust Medical Image Segmentation with Noisy Labels

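[Quick read]: This paper addresses the fact that deep learning for medical image segmentation depends on large volumes of high-quality labeled data, which limits its applicability, and that training directly on noisy labels degrades model performance. It proposes a Mean Teacher-based Adaptive Label Correction (ALC) self-ensemble framework. The key to the solution is to use the Mean Teacher architecture for consistent learning under noise perturbations, combined with an adaptive label refinement mechanism that dynamically captures and weights the differences across multiple perturbed versions to improve the quality of noisy labels. A sample-level uncertainty-based label selection algorithm prioritizes high-confidence samples for network updates, and consistency learning aligns the predictions of the student and teacher networks to further improve robustness.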

Link: https://arxiv.org/abs/2503.12218
Authors: Chengxuan Qian, Kai Han, Siqi Ma, Chongwen Lyu, Zhenlong Yuan, Jun Chen, Zhe Liu
Affiliations: School of Computer Science and Communication Engineering, Jiangsu University; Lenovo Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Deep learning has shown remarkable success in medical image analysis, but its reliance on large volumes of high-quality labeled data limits its applicability. While noisy labeled data are easier to obtain, directly incorporating them into training can degrade model performance. To address this challenge, we propose a Mean Teacher-based Adaptive Label Correction (ALC) self-ensemble framework for robust medical image segmentation with noisy labels. The framework leverages the Mean Teacher architecture to ensure consistent learning under noise perturbations. It includes an adaptive label refinement mechanism that dynamically captures and weights differences across multiple disturbance versions to enhance the quality of noisy labels. Additionally, a sample-level uncertainty-based label selection algorithm is introduced to prioritize high-confidence samples for network updates, mitigating the impact of noisy annotations. Consistency learning is integrated to align the predictions of the student and teacher networks, further enhancing model robustness. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed framework, showing significant improvements in segmentation performance. By fully exploiting the strengths of the Mean Teacher structure, the ALC framework effectively processes noisy labels, adapts to challenging scenarios, and achieves competitive results compared to state-of-the-art methods.
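The two Mean Teacher ingredients named in the abstract, a teacher that tracks the student and a consistency term between their predictions, can be sketched as follows. This is a generic illustration rather than the ALC framework itself: the exponential moving average (EMA) decay value, the flat weight lists, and the choice of mean squared error for the consistency loss are assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Hypothetical weights and per-pixel foreground probabilities.
teacher = ema_update(teacher=[1.0, 1.0], student=[0.0, 2.0], decay=0.9)
loss = consistency_loss(student_probs=[0.2, 0.8], teacher_probs=[0.4, 0.6])
print(teacher, loss)
```

Because the teacher averages the student over many steps, its predictions change slowly, which makes them a steadier reference than any single noisy update.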

[CV-225] Gun Detection Using Combined Human Pose and Weapon Appearance

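[Quick read]: This paper addresses the drawbacks of traditional gun detection in public spaces, which relies on manual inspection and continuous human monitoring and is therefore labor-intensive and prone to high false positive and false negative rates. It proposes an approach that integrates human pose estimation with weapon appearance recognition using deep learning. The key is to analyze posture and weapon presence jointly to improve detection accuracy in real-world, dynamic environments. To train the model, the authors curated a diverse dataset of images from open-source repositories such as IMFDB and Monash Guns, supplemented with AI-generated images and images collected manually from the web, to support robust generalization and realistic evaluation under various surveillance conditions. The aim is to improve the precision and reliability of firearm detection systems, contributing to public safety and threat mitigation in high-risk areas.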

Link: https://arxiv.org/abs/2503.12215
Authors: Amulya Reddy Maligireddy, Manohar Reddy Uppula, Nidhi Rastogi, Yaswanth Reddy Parla
Affiliations: Rochester Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:The increasing frequency of firearm-related incidents has necessitated advancements in security and surveillance systems, particularly in firearm detection within public spaces. Traditional gun detection methods rely on manual inspections and continuous human monitoring of CCTV footage, which are labor-intensive and prone to high false positive and negative rates. To address these limitations, we propose a novel approach that integrates human pose estimation with weapon appearance recognition using deep learning techniques. Unlike prior studies that focus on either body pose estimation or firearm detection in isolation, our method jointly analyzes posture and weapon presence to enhance detection accuracy in real-world, dynamic environments. To train our model, we curated a diverse dataset comprising images from open-source repositories such as IMFDB and Monash Guns, supplemented with AI-generated and manually collected images from web sources. This dataset ensures robust generalization and realistic performance evaluation under various surveillance conditions. Our research aims to improve the precision and reliability of firearm detection systems, contributing to enhanced public safety and threat mitigation in high-risk areas.

[CV-226] STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation WACV2025

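[Quick read]: This paper addresses layout-to-image (L2I) synthesis, which generates controlled, complex scenes from coarse information such as bounding boxes. The proposed solution is STAY Diffusion, a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in a scene. At its core, the model learns a global condition for each layout and a self-supervised semantic map for weight modulation through a novel Edge-Aware Normalization (EA Norm), and introduces Styled-Mask Attention (SM Attention) to cross-condition the global condition and image features so as to capture relationships between objects. These measures give consistent guidance throughout the model, enabling more accurate and controllable image generation. Experiments show that STAY Diffusion surpasses previous state-of-the-art methods in generation diversity, accuracy, and controllability.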

Link: https://arxiv.org/abs/2503.12213
Authors: Ruyu Wang, Xuefeng Hou, Sabrina Schmedding, Marco F. Huber
Affiliations: Bosch Center for Artificial Intelligence, Germany; Institute of Industrial Manufacturing and Management IFF, University of Stuttgart, Germany; Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by WACV2025


Abstract:In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. Such a task is exciting to many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we propose STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout, and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image feature for capturing the objects’ relationships. These measures provide consistent guidance through the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that our STAY Diffusion produces high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.

[CV-227] TLAC: Two-stage LMM Augmented CLIP for Zero-Shot Classification

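[Quick read]: This paper addresses the reliance of current CLIP-based image classification methods on fine-tuning techniques, such as prompt learning and adapter-based tuning, to adapt to new datasets and domains, which costs substantial time and compute for each new dataset. It proposes simple yet effective training-free approaches: Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC). The key is to use a pre-trained Large Multimodal Model (LMM) such as Gemini: the LMM is prompted to identify the object in an image, and the CLIP text encoder then selects the dataset class with the highest semantic similarity to the LMM-predicted object. This allows adaptation to diverse datasets and domains without additional training.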

Link: https://arxiv.org/abs/2503.12206
Authors: Ans Munir, Faisal Z. Qureshi, Muhammad Haris Khan, Mohsen Ali
Affiliations: Information Technology University, Lahore, Pakistan; University of Ontario Institute of Technology, Oshawa, Canada; Mohamed Bin Zayed University of AI, Abu Dhabi, UAE; MZASS AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Contrastive Language-Image Pretraining (CLIP) has shown impressive zero-shot performance on image classification. However, state-of-the-art methods often rely on fine-tuning techniques like prompt learning and adapter-based tuning to optimize CLIP’s performance. The necessity for fine-tuning significantly limits CLIP’s adaptability to novel datasets and domains. This requirement mandates substantial time and computational resources for each new dataset. To overcome this limitation, we introduce simple yet effective training-free approaches, Single-stage LMM Augmented CLIP (SLAC) and Two-stage LMM Augmented CLIP (TLAC), that leverage powerful Large Multimodal Models (LMMs), such as Gemini, for image classification. The proposed methods leverage the capabilities of pre-trained LMMs, allowing for seamless adaptation to diverse datasets and domains without the need for additional training. Our approaches involve prompting the LMM to identify objects within an image. Subsequently, the CLIP text encoder determines the image class by identifying the dataset class with the highest semantic similarity to the LMM-predicted object. We evaluated our models on 11 base-to-novel datasets and they achieved superior accuracy on 9 of these, including benchmarks like ImageNet, SUN397 and Caltech101, while maintaining a strictly training-free paradigm. Our overall accuracy of 83.44% surpasses the previous state-of-the-art few-shot methods by a margin of 6.75%. Our method achieved 83.6% average accuracy across 13 datasets, a 9.7% improvement over the previous 73.9% state-of-the-art for training-free approaches. Our method improves domain generalization, with a 3.6% gain on ImageNetV2, 16.96% on ImageNet-S, and 12.59% on ImageNet-R, over prior few-shot methods.
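The second step described in the abstract, mapping the LMM's free-form answer onto a dataset class by text similarity, can be sketched with cosine similarity over text embeddings. No real LMM or CLIP call is made here: the three-dimensional vectors in `TOY_TEXT_EMBEDDINGS` and the embedding for the word "puppy" are hypothetical stand-ins for CLIP text-encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_class(predicted_embedding, class_embeddings):
    """Return the dataset class whose text embedding best matches the prediction."""
    return max(
        class_embeddings,
        key=lambda name: cosine(predicted_embedding, class_embeddings[name]),
    )


# Hypothetical embeddings of the dataset's class names.
TOY_TEXT_EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}

# Suppose the LMM answered "puppy", which is not a class name in the dataset.
puppy_embedding = [0.8, 0.2, 0.1]
print(closest_class(puppy_embedding, TOY_TEXT_EMBEDDINGS))
```

The point of this step is that the LMM's wording need not match the label set exactly, since the nearest class in embedding space is chosen.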

[CV-228] S2IL: Structurally Stable Incremental Learning

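[Quick read]: This paper addresses Catastrophic Forgetting (CF) in Class Incremental Learning (CIL), and the limitation that existing Feature Distillation (FD) methods enforce strict alignment of feature magnitudes and directions, which restricts the model's ability to adapt to new knowledge. The key to the solution is Structurally Stable Incremental Learning (S2IL), an FD method that focuses on preserving the overall spatial patterns of features. This yields representations that are flexible (plasticity) yet stable (stability), retaining old knowledge while absorbing new knowledge. Experiments show that S2IL achieves strong incremental accuracy on the benchmark datasets CIFAR-100, ImageNet-100, and ImageNet-1K, and outperforms other FD methods by a significant margin when the number of incremental tasks is large.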

Link: https://arxiv.org/abs/2503.12193
Authors: S Balasubramanian, Yedu Krishna P, Talasu Sai Sriram, M Sai Subramaniam, Manepalli Pranav Phanindra Sai, Darshan Gera
Affiliations: Sri Sathya Sai Institute of Higher Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:Feature Distillation (FD) strategies are proven to be effective in mitigating Catastrophic Forgetting (CF) seen in Class Incremental Learning (CIL). However, current FD approaches enforce strict alignment of feature magnitudes and directions across incremental steps, limiting the model’s ability to adapt to new knowledge. In this paper we propose Structurally Stable Incremental Learning (S2IL), an FD method for CIL that mitigates CF by focusing on preserving the overall spatial patterns of features which promote flexible (plasticity) yet stable representations that preserve old knowledge (stability). We also demonstrate that our proposed method S2IL achieves strong incremental accuracy and outperforms other FD methods on SOTA benchmark datasets CIFAR-100, ImageNet-100 and ImageNet-1K. Notably, S2IL outperforms other methods by a significant margin in scenarios that have a large number of incremental tasks.

[CV-229] Breaking the Box: Enhancing Remote Sensing Image Segmentation with Freehand Sketches

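[Quick read]: This paper addresses zero-shot interactive segmentation of remote sensing imagery, aiming to improve segmentation accuracy and robustness through intuitive sketch prompts from the user. The solution has three key parts: a novel sketch-based prompting method that goes beyond traditional point or box prompts; LTL-Sensing, the first dataset pairing human sketches with remote sensing imagery, which sets a benchmark for future research; and LTL-Net, a model with a multi-input prompting transport module tailored to freehand sketches. Together these significantly improve segmentation performance and make human-AI collaboration more intuitive.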

Link: https://arxiv.org/abs/2503.12191
Authors: Ying Zang, Yuncan Gao, Jiangi Zhang, Yuangi Hu, Runlong Cao, Lanyun Zhu, Qi Zhu, Deyi Ji, Renjun Xu, Tianrun Chen
Affiliations: Huzhou University; Singapore University of Technology and Design; University of Science and Technology of China; Zhejiang University; KOKONI 3D, Moxin (Huzhou) Tech. Co., LTD.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This work advances zero-shot interactive segmentation for remote sensing imagery through three key contributions. First, we propose a novel sketch-based prompting method, enabling users to intuitively outline objects, surpassing traditional point or box prompts. Second, we introduce LTL-Sensing, the first dataset pairing human sketches with remote sensing imagery, setting a benchmark for future research. Third, we present LTL-Net, a model featuring a multi-input prompting transport module tailored for freehand sketches. Extensive experiments show our approach significantly improves segmentation accuracy and robustness over state-of-the-art methods like SAM, fostering more intuitive human-AI collaboration in remote sensing analysis and enhancing its applications.

[CV-230] Bench2FreeAD: A Benchmark for Vision-based End-to-end Navigation in Unstructured Robotic Environments

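[Quick read]: This paper addresses the fact that most end-to-end (E2E) autonomous driving algorithms are designed for standard vehicles in structured traffic scenarios and leave robot navigation in unstructured scenarios, such as auxiliary roads, campus roads, and indoor settings, largely unexplored. The key contribution is the FreeWorld Dataset, an E2E robot navigation dataset for unstructured road environments that combines real-world robot data with synthetic data generated in the Isaac Sim simulator. The authors then fine-tune an efficient E2E autonomous driving model, VAD, on this data to validate the performance and adaptability of such models in these environments. The results show that fine-tuning on the dataset significantly improves the navigation potential of E2E autonomous driving models in unstructured robotic environments. The paper thus presents the first dataset targeting E2E robot navigation in unstructured scenarios, along with a benchmark based on vision-based E2E autonomous driving algorithms, to support navigation technology for logistics and service robots.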

Link: https://arxiv.org/abs/2503.12180
Authors: Yuhang Peng, Sidong Wang, Jihaoyu Yang, Shilong Li, Han Wang, Jiangtao Gong
Affiliations: Institute for AI Industry Research, Tsinghua University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 9 figures


Abstract:Most current end-to-end (E2E) autonomous driving algorithms are built on standard vehicles in structured transportation scenarios, lacking exploration of robot navigation for unstructured scenarios such as auxiliary roads, campus roads, and indoor settings. This paper investigates E2E robot navigation in unstructured road environments. First, we introduce two data collection pipelines - one for real-world robot data and another for synthetic data generated using the Isaac Sim simulator, which together produce an unstructured robotics navigation dataset – FreeWorld Dataset. Second, we fine-tuned an efficient E2E autonomous driving model – VAD – using our datasets to validate the performance and adaptability of E2E autonomous driving models in these environments. Results demonstrate that fine-tuning through our datasets significantly enhances the navigation potential of E2E autonomous driving models in unstructured robotic environments. Thus, this paper presents the first dataset targeting E2E robot navigation tasks in unstructured scenarios, and provides a benchmark based on vision-based E2E autonomous driving algorithms to facilitate the development of E2E navigation technology for logistics and service robots. The project is available on Github.

[CV-231] DPCS: Path Tracing-Based Differentiable Projector-Camera Systems

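[Quick read]: This paper addresses two problems in neural network-based simulation of projector-camera systems (ProCams). First, scene parameters such as surface material, gamma, and white balance are implicitly encapsulated in the network parameters, which makes the models hard to interpret and hard to apply to novel scenes. Second, indirect illumination is learned implicitly through image-to-image translation, which performs poorly on complex projection effects such as soft shadows and interreflection. The key to the solution is a path tracing-based differentiable projector-camera system (DPCS). DPCS uses differentiable physically-based rendering (PBR) and explicitly integrates multi-bounce path tracing, so that scene parameters are explicitly decoupled and can be learned from far fewer samples. The physically-based approach supports high-quality downstream tasks such as ProCams relighting and projector compensation, and allows novel scene simulation using the learned scene parameters.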

Link: https://arxiv.org/abs/2503.12174
Authors: Jijiang Li, Qingyue Deng, Haibin Ling, Bingyao Huang
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 16 figures


Abstract:Projector-camera systems (ProCams) simulation aims to model the physical project-and-capture process and associated scene parameters of a ProCams, and is crucial for spatial augmented reality (SAR) applications such as ProCams relighting and projector compensation. Recent advances use an end-to-end neural network to learn the project-and-capture process. However, these neural network-based methods often implicitly encapsulate scene parameters, such as surface material, gamma, and white balance in the network parameters, and are less interpretable and hard for novel scene simulation. Moreover, neural networks usually learn the indirect illumination implicitly in an image-to-image translation way which leads to poor performance in simulating complex projection effects such as soft-shadow and interreflection. In this paper, we introduce a novel path tracing-based differentiable projector-camera systems (DPCS), offering a differentiable ProCams simulation method that explicitly integrates multi-bounce path tracing. Our DPCS models the physical project-and-capture process using differentiable physically-based rendering (PBR), enabling the scene parameters to be explicitly decoupled and learned using much fewer samples. Moreover, our physically-based method not only enables high-quality downstream ProCams tasks, such as ProCams relighting and projector compensation, but also allows novel scene simulation using the learned scene parameters. In experiments, DPCS demonstrates clear advantages over previous approaches in ProCams simulation, offering better interpretability, more efficient handling of complex interreflection and shadow, and requiring fewer training samples.

[CV-232] LAPIG: Language Guided Projector Image Generation with Surface Adaptation and Stylization

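[Quick read]: This paper addresses the color saturation and artifact problems that arise when generating images with a projector, caused by the projector's brightness limits and the texture of the projection surface. In dark and bright regions in particular, viewers may still see clear surface-texture-related artifacts even with state-of-the-art projector compensation. To address this, the paper proposes projection surface adaptation (PSA), which generates compensable surface stylization. The key is, first, to train two networks that simulate the projector compensation and project-and-capture processes, so that a satisfactory projector image can be found without real project-and-capture and gradient descent can be used for fast convergence; and second, to design content and saturation losses that guide projector image generation so that the generated image shows no clearly perceivable artifacts when projected. The result is a visually pleasing surface style morphing effect.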

Link: https://arxiv.org/abs/2503.12173
Authors: Yuchen Deng, Haibin Ling, Bingyao Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 12 pages, 9 figures

Abstract:We propose LAPIG, a language guided projector image generation method with surface adaptation and stylization. LAPIG consists of a projector-camera system and a target textured projection surface. LAPIG takes the user text prompt as input and aims to transform the surface style using the projector. LAPIG’s key challenge is that due to the projector’s physical brightness limitation and the surface texture, the viewer’s perceived projection may suffer from color saturation and artifacts in both dark and bright regions, such that even with the state-of-the-art projector compensation techniques, the viewer may see clear surface texture-related artifacts. Therefore, how to generate a projector image that follows the user’s instruction while also displaying minimal surface artifacts is an open problem. To address this issue, we propose projection surface adaptation (PSA) that can generate compensable surface stylization. We first train two networks to simulate the projector compensation and project-and-capture processes; this allows us to find a satisfactory projector image without real project-and-capture and utilize gradient descent for fast convergence. Then, we design content and saturation losses to guide the projector image generation, such that the generated image shows no clearly perceivable artifacts when projected. Finally, the generated image is projected for visually pleasing surface style morphing effects. The source code and video are available on the project page: this https URL.

[CV-233] SEAL: Semantic Aware Image Watermarking

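[Quick read]: This paper addresses robust watermarking of images produced by generative models, so that natural content can be distinguished from AI-generated content while the watermark leaves image quality intact and withstands removal and forgery attacks. Existing methods embed the watermark through the initial noise, but they either distort the distribution of generated images or rely on searching a long dictionary of keys for detection, which limits flexibility.

The key to the solution is a novel content-aware watermarking method that encodes semantic information about the generated image directly into the watermark, giving a distortion-free watermark. The method needs no pre-stored database of keys; instead, it derives the key pattern from the image's semantic embedding using locality-sensitive hashing (LSH). Conditioning watermark detection on the original image content further improves robustness to forgery attacks. The paper validates clear gains against two specific attack strategies (extracting the initial noise to generate a new image, and inserting an unrelated object), suggesting that the method can preserve image integrity while mitigating risks posed by image-generative models.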

Link: https://arxiv.org/abs/2503.12172
Authors: Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen
Affiliations: New York University
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method’s increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models. 
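
The abstract says the key pattern is inferred from the image's semantic embedding with locality-sensitive hashing. A minimal sketch of one common LSH family (random hyperplanes) is given below; the function names, the 16-bit key length, and the use of a shared seed are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def lsh_key_bits(embedding, n_bits=16, seed=0):
    """Random-hyperplane LSH: one bit per hyperplane, set when the
    embedding lies on the hyperplane's positive side."""
    # Assumption: embedder and detector share the seed, hence the hyperplanes.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, embedding.shape[0]))
    return (planes @ embedding > 0).astype(np.uint8)


def hamming(a, b):
    """Number of differing bits between two key patterns."""
    return int(np.count_nonzero(a != b))


# Stand-in vectors; a real system would use an image encoder's embedding.
emb = np.random.default_rng(1).standard_normal(64)
near = emb + 0.01 * np.random.default_rng(2).standard_normal(64)
print(hamming(lsh_key_bits(emb), lsh_key_bits(near)))
```

Because nearby embeddings fall on the same side of most hyperplanes, a detector can recompute the bits from the image it receives instead of looking them up in a key database.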

[CV-234] DiffAD: A Unified Diffusion Modeling Approach for Autonomous Driving

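[Quick read]: This paper targets the difficult task coordination and high system complexity of end-to-end autonomous driving (E2E-AD) systems and proposes a new method called DiffAD. Existing E2E-AD systems typically adopt a traditional multi-task framework that handles perception, prediction, and planning through separate task-specific heads; although they are trained in a fully differentiable manner, task coordination remains difficult and system complexity remains high. DiffAD redefines autonomous driving as a conditional image generation task: by rasterizing heterogeneous targets onto a unified bird's-eye view (BEV) and modeling their latent distribution, it unifies the various driving objectives and jointly optimizes all driving tasks in a single framework, significantly reducing system complexity and improving task coordination. The key is to unify task representation and optimization with a diffusion probabilistic model, while the reverse process iteratively refines the generated BEV image, yielding more robust and realistic driving behavior.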

Link: https://arxiv.org/abs/2503.12170
Authors: Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, Chang Huang
Affiliations: Carizon; Beihang University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:End-to-end autonomous driving (E2E-AD) has rapidly emerged as a promising approach toward achieving full autonomy. However, existing E2E-AD systems typically adopt a traditional multi-task framework, addressing perception, prediction, and planning tasks through separate task-specific heads. Despite being trained in a fully differentiable manner, they still encounter issues with task coordination, and the system complexity remains high. In this work, we introduce DiffAD, a novel diffusion probabilistic model that redefines autonomous driving as a conditional image generation task. By rasterizing heterogeneous targets onto a unified bird’s-eye view (BEV) and modeling their latent distribution, DiffAD unifies various driving objectives and jointly optimizes all driving tasks in a single framework, significantly reducing system complexity and harmonizing task coordination. The reverse process iteratively refines the generated BEV image, resulting in more robust and realistic driving behaviors. Closed-loop evaluations in Carla demonstrate the superiority of the proposed method, achieving a new state-of-the-art Success Rate and Driving Score. The code will be made publicly available.
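
To make "rasterizing heterogeneous targets onto a unified BEV" concrete, here is a minimal sketch that draws three kinds of targets into separate channels of one ego-centred grid. The channel layout, the 64-cell grid, the 32 m extent, and the function names are assumptions for illustration; the abstract does not specify DiffAD's actual encoding.

```python
import numpy as np


def rasterize_points(points_xy, grid_size=64, extent=32.0):
    """Mark the BEV cells hit by (x, y) points in metres, ego at the grid centre."""
    bev = np.zeros((grid_size, grid_size), dtype=np.float32)
    scale = grid_size / (2.0 * extent)           # cells per metre
    cols = np.floor((points_xy[:, 0] + extent) * scale).astype(int)
    rows = np.floor((extent - points_xy[:, 1]) * scale).astype(int)  # +y is up
    keep = (cols >= 0) & (cols < grid_size) & (rows >= 0) & (rows < grid_size)
    bev[rows[keep], cols[keep]] = 1.0
    return bev


def rasterize_scene(agents_xy, lanes_xy, plan_xy, **kw):
    """Stack agents, lane points, and planned waypoints as image channels."""
    return np.stack([rasterize_points(p, **kw) for p in (agents_xy, lanes_xy, plan_xy)])


agents = np.array([[5.0, 10.0], [-3.0, 20.0]])
lanes = np.array([[0.0, float(y)] for y in range(-30, 31, 2)])
plan = np.array([[0.0, float(y)] for y in range(0, 20, 4)])
print(rasterize_scene(agents, lanes, plan).shape)
```

Once every target lives in the same image, a single generative model can, in principle, be trained on the stacked channels instead of one head per task.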

[CV-235] Learning Extremely High Density Crowds as Active Matters CVPR2025

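[Quick read]: This paper addresses the long-standing problem of video-based high-density crowd analysis and prediction, in particular the lack of high-quality data and the complexity of crowd dynamics. The key novelty is a new physics prior for modeling crowd dynamics. The authors treat a high-density crowd as a "crowd material", a continuum of active particles subject to stochastic forces, and combine the physics model with neural networks to build a neural stochastic differential equation system that mimics complex crowd dynamics. This continuous-time physics model not only improves analysis and forecasting of extremely high-density crowds but also offers stronger interpretability, in contrast to most deep learning methods, which are discrete-time black boxes.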

Link: https://arxiv.org/abs/2503.12168
Authors: Feixiang He, Jiangbei Yue, Jialin Zhu, Armin Seyfried, Dan Casas, Julien Pettré, He Wang
Affiliations: AI Centre, University College London; University College London; University of Leeds; Forschungszentrum Jülich; Universidad Rey Juan Carlos; INRIA Rennes; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Video-based high-density crowd analysis and prediction has been a long-standing topic in computer vision. It is notoriously difficult due to, but not limited to, the lack of high-quality data and complex crowd dynamics. Consequently, it has been relatively understudied. In this paper, we propose a new approach that aims to learn from in-the-wild videos, often with low quality where it is difficult to track individuals or count heads. The key novelty is a new physics prior to model crowd dynamics. We model high-density crowds as active matter, a continuum with active particles subject to stochastic forces, named ‘crowd material’. Our physics model is combined with neural networks, resulting in a neural stochastic differential equation system which can mimic the complex crowd dynamics. Due to the lack of similar research, we adapt a range of existing methods which are close to ours for comparison. Through exhaustive evaluation, we show our model outperforms existing methods in analyzing and forecasting extremely high-density crowds. Furthermore, since our model is a continuous-time physics model, it can be used for simulation and analysis, providing strong interpretability. This is categorically different from most deep learning methods, which are discrete-time models and black boxes.
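
A stochastic differential equation of the kind the abstract mentions is usually simulated with an Euler-Maruyama scheme. The sketch below shows that generic integrator only; the `drift` callable stands in for the paper's learned, physics-informed term, and the constant noise scale `sigma` is an assumption.

```python
import numpy as np


def euler_maruyama(x0, drift, sigma, dt, n_steps, seed=0):
    """Integrate dx = drift(x) dt + sigma dW and return the whole path."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Deterministic drift step plus a Brownian increment of variance dt.
        x = x + drift(x) * dt + sigma * np.sqrt(dt) * noise
        path.append(x.copy())
    return np.stack(path)


# Toy drift pulling the state toward zero; a neural SDE would call a network here.
path = euler_maruyama(x0=[1.0, -2.0], drift=lambda x: -x, sigma=0.1, dt=0.05, n_steps=100)
print(path.shape)
```

Because time enters only through `dt`, the same model can be stepped at any resolution, which is the practical meaning of "continuous-time" in the abstract.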

[CV-236] VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction CVPR2025

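[Quick read]: This paper addresses the challenge of achieving high-fidelity virtual try-on (VTON) that also supports any-view rendering. The key to the solution is to extend conventional 2D VTON to 3D and to obtain high-quality 3D try-on results by enforcing consistency across multiple views. Concretely, the paper introduces a new method, VTON 360, whose core contributions are: (1) a pseudo-3D pose representation built from normal maps derived from the SMPL-X 3D human model; (2) a multi-view spatial attention mechanism that models correlations between features from different viewing angles; and (3) a multi-view CLIP embedding that augments the garment CLIP features used in 2D VTON with camera information. Together these techniques keep results visually consistent and realistic across viewpoints. Experiments show that the method performs well on large-scale real datasets.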

Link: https://arxiv.org/abs/2503.12165
Authors: Zijian He, Yuwei Ning, Yipeng Qin, Wangrun Wang, Sibei Yang, Liang Lin, Guanbin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach. Project page: this https URL.

[CV-237] Point-Cache: Test-time Dynamic and Hierarchical Cache for Robust and Generalizable Point Cloud Analysis CVPR2025

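[Quick read]: This paper addresses how point cloud recognition models can handle distribution shifts at test time. Prior methods rely on training data, which is often unavailable during online inference, and are limited to a fixed set of point cloud classes predefined during training. This paper instead explores a more practical and more challenging scenario: adapting the model using only online test data so that it recognizes both previously seen classes and novel, unseen classes at test time. The key to the solution is Point-Cache, a hierarchical cache model that captures essential clues from online test samples, focusing on both the global structure of point clouds and their local details. Point-Cache serves as a rich 3D knowledge base that is dynamically managed to prioritize high-quality samples, and it is designed as a plug-and-play module that can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. The method is training-free, runs with efficiency comparable to zero-shot inference, and shows substantial gains on 8 challenging benchmarks and 4 representative large 3D models.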

Link: https://arxiv.org/abs/2503.12150
Authors: Hongyu Sun, Qiuhong Ke, Ming Cheng, Yongcai Wang, Deying Li, Chenhui Gou, Jianfei Cai
Affiliations: Institution1; Institution2
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025; 24 pages, 14 figures, 18 tables

Abstract:This paper proposes a general solution to enable point cloud recognition models to handle distribution shifts at test time. Unlike prior methods, which rely heavily on training data (often inaccessible during online inference) and are limited to recognizing a fixed set of point cloud classes predefined during training, we explore a more practical and challenging scenario: adapting the model solely based on online test data to recognize both previously seen classes and novel, unseen classes at test time. To this end, we develop Point-Cache, a hierarchical cache model that captures essential clues of online test samples, particularly focusing on the global structure of point clouds and their local-part details. Point-Cache, which serves as a rich 3D knowledge base, is dynamically managed to prioritize the inclusion of high-quality samples. Designed as a plug-and-play module, our method can be flexibly integrated into large multimodal 3D models to support open-vocabulary point cloud recognition. Notably, our solution operates with efficiency comparable to zero-shot inference, as it is entirely training-free. Point-Cache demonstrates substantial gains across 8 challenging benchmarks and 4 representative large 3D models, highlighting its effectiveness. Code is available at this https URL.
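
The idea of a dynamically managed, training-free cache can be sketched as follows. This is a single-level toy under stated assumptions: prediction entropy is used as the "quality" score, each class keeps a fixed number of slots, and cached features vote by cosine similarity. The class name `EntropyCache` and all parameters are hypothetical, and the paper's global/local hierarchy is not modelled.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy of a probability vector (lower means more confident)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())


class EntropyCache:
    def __init__(self, per_class=2):
        self.per_class = per_class
        self.slots = {}  # class id -> list of (entropy, unit-norm feature)

    def update(self, feature, probs):
        """Store a test sample under its predicted class; evict the least confident."""
        cls = int(np.argmax(probs))
        items = self.slots.setdefault(cls, [])
        items.append((entropy(probs), feature / np.linalg.norm(feature)))
        items.sort(key=lambda t: t[0])
        del items[self.per_class:]

    def logits(self, feature, n_classes):
        """Sum of cosine similarities to the cached features of each class."""
        f = feature / np.linalg.norm(feature)
        out = np.zeros(n_classes)
        for cls, items in self.slots.items():
            out[cls] = sum(float(f @ k) for _, k in items)
        return out


cache = EntropyCache(per_class=2)
cache.update(np.array([1.0, 0.1]), np.array([0.9, 0.1]))
cache.update(np.array([0.1, 1.0]), np.array([0.2, 0.8]))
print(cache.logits(np.array([1.0, 0.0]), n_classes=2))
```

In a plug-and-play setting, such cache logits would typically be added to the zero-shot logits of the host model; how Point-Cache combines them is not stated in the abstract.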

[CV-238] DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap

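[Quick read]: This paper addresses a weakness of existing cross-modal understanding and generation methods such as CLAP and CAVP: when aligning text, video, and audio embeddings with a single contrastive loss, they overlook bidirectional interactions and the noise inherent in each modality, which can seriously affect the quality and effectiveness of cross-modal integration. The key solution is DiffGAP, a new approach that adds a lightweight generative module within the contrastive space. Specifically, DiffGAP uses a bidirectional diffusion process: a denoising process on text and video embeddings conditioned on audio embeddings, and vice versa, which bridges the cross-modal gap and enables finer-grained, more robust cross-modal interaction. Experiments show that DiffGAP significantly improves video/text-audio generation and retrieval on the VGGSound and AudioCaps datasets, confirming its effectiveness in strengthening cross-modal understanding and generation.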

Link: https://arxiv.org/abs/2503.12131
Authors: Shentong Mo, Zehua Chen, Fan Bao, Jun Zhu
Affiliations: Department of CST, Tsinghua University; Department of Machine Learning, Carnegie Mellon University; Department of Machine Learning, MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); Shengshu AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Recent works in cross-modal understanding and generation, notably through models like CLAP (Contrastive Language-Audio Pretraining) and CAVP (Contrastive Audio-Visual Pretraining), have significantly enhanced the alignment of text, video, and audio embeddings via a single contrastive loss. However, these methods often overlook the bidirectional interactions and inherent noises present in each modality, which can crucially impact the quality and efficacy of cross-modal integration. To address this limitation, we introduce DiffGAP, a novel approach incorporating a lightweight generative module within the contrastive space. Specifically, our DiffGAP employs a bidirectional diffusion process tailored to bridge the cross-modal gap more effectively. This involves a denoising process on text and video embeddings conditioned on audio embeddings and vice versa, thus facilitating a more nuanced and robust cross-modal interaction. Our experimental results on VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks, confirming its effectiveness in enhancing cross-modal understanding and generation capabilities.
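
For readers unfamiliar with diffusion on embeddings, the sketch below shows the standard DDPM forward noising step applied to a vector, together with the inversion a denoiser enables once it predicts the noise. It assumes a DDPM-style parameterization, which the abstract does not confirm; the schedule and function names are illustrative, and the audio-conditioned denoiser itself is omitted.

```python
import numpy as np


def forward_noise(x0, t, betas, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps, alpha_bar


def recover_x0(x_t, eps_hat, alpha_bar):
    """Invert the forward step given a noise estimate eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)


rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)      # assumed linear schedule
text_emb = rng.standard_normal(8)         # stand-in for a text embedding
x_t, eps, abar = forward_noise(text_emb, 50, betas, rng)
# With the true noise the embedding is recovered exactly; a trained denoiser
# would instead predict eps_hat from (x_t, t, audio embedding).
print(np.allclose(recover_x0(x_t, eps, abar), text_emb))
```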

[CV-239] Z-Magic: Zero-shot Multiple Attributes Guided Image Creator CVPR2025

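[Quick read]: This paper addresses the lack of contextual coherence between different attributes in multi-attribute personalized content creation. Although existing methods show promising empirical results, they do not adequately account for contextual consistency across attributes. To address this, the paper reformulates multi-attribute creation from the perspective of conditional probability theory, with particular attention to the challenging zero-shot setting. The key is to model the dependencies between attributes explicitly: subsequent attributes are made to follow the multivariable conditional distribution introduced by the creation of earlier attributes, which markedly improves the coherence of generated images across diverse attribute combinations. The paper also connects multi-attribute customization to multi-task learning, which mitigates the high computing cost of multi-attribute synthesis. Extensive experiments show that the proposed Z-Magic model outperforms existing methods in zero-shot image generation, with broad implications for AI-driven design and creative applications.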

Link: https://arxiv.org/abs/2503.12124
Authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:The customization of multiple attributes has gained popularity with the rising demand for personalized content creation. Despite promising empirical results, the contextual coherence between different attributes has been largely overlooked. In this paper, we argue that subsequent attributes should follow the multivariable conditional distribution introduced by former attribute creation. In light of this, we reformulate multi-attribute creation from a conditional probability theory perspective and tackle the challenging zero-shot setting. By explicitly modeling the dependencies between attributes, we further enhance the coherence of generated images across diverse attribute combinations. Furthermore, we identify connections between multi-attribute customization and multi-task learning, effectively addressing the high computing cost encountered in multi-attribute synthesis. Extensive experiments demonstrate that Z-Magic outperforms existing models in zero-shot image generation, with broad implications for AI-driven design and creative applications.
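
The conditional-probability view in the abstract can be written out with the ordinary chain rule. The notation below is an assumption made for illustration (attributes $a_1,\dots,a_n$ applied in order to a source image $x$); the paper's own symbols may differ.

```latex
% Treating attributes independently ignores their mutual coherence:
\[
  p(a_1,\dots,a_n \mid x) \;\approx\; \prod_{i=1}^{n} p(a_i \mid x)
\]
% Chain-rule factorization: each attribute is conditioned on the earlier ones.
\[
  p(a_1,\dots,a_n \mid x) \;=\; \prod_{i=1}^{n} p\bigl(a_i \mid a_1,\dots,a_{i-1},\, x\bigr)
\]
```

The second form is exact for any ordering, so the modelling question is how to approximate each conditional factor rather than whether the factorization holds.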

[CV-240] A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI

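[Quick read]: This paper addresses the generation of real-time/cine magnetic resonance imaging (RT-/cine-MRI) visualizations of the vocal tract from speech signals, in support of clinical assessment and the development of personalized treatment and rehabilitation strategies. The key is an audio-to-video generation framework with two core steps: first, RT-/cine-MRI sequences and speech samples are preprocessed for temporal alignment so that visual and audio data are synchronized; second, a modified stable diffusion model that integrates structural and temporal blocks captures the movement characteristics and temporal dynamics in the synchronized data. The approach can generate MRI sequences from new speech inputs, improves the conversion of speech into visual data, and demonstrates adaptability and generalization on healthy controls and tongue cancer patients; human evaluation confirms that the visualizations are realistic and accurate.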

Link: https://arxiv.org/abs/2503.12102
Authors: Paula Andrea Pérez-Toro, Tomás Arias-Vergara, Fangxu Xing, Xiaofeng Liu, Maureen Stone, Jiachen Zhuo, Juan Rafael Orozco-Arroyave, Elmar Nöth, Jana Hutter, Jerry L. Prince, Andreas Maier, Jonghye Woo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding the relationship between vocal tract motion during speech and the resulting acoustic signal is crucial for aided clinical assessment and developing personalized treatment and rehabilitation strategies. Toward this goal, we introduce an audio-to-video generation framework for creating Real Time/cine-Magnetic Resonance Imaging (RT-/cine-MRI) visuals of the vocal tract from speech signals. Our framework first preprocesses RT-/cine-MRI sequences and speech samples to achieve temporal alignment, ensuring synchronization between visual and audio data. We then employ a modified stable diffusion model, integrating structural and temporal blocks, to effectively capture movement characteristics and temporal dynamics in the synchronized data. This process enables the generation of MRI sequences from new speech inputs, improving the conversion of audio into visual data. We evaluated our framework on healthy controls and tongue cancer patients by analyzing and comparing the vocal tract movements in synthesized videos. Our framework demonstrated adaptability to new speech inputs and effective generalization. In addition, positive human evaluations confirmed its effectiveness, with realistic and accurate visualizations, suggesting its potential for outpatient therapy and personalized simulation of vocal tract visualizations.

[CV-241] O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models CVPR2025

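[Quick read]: This paper addresses the poor calibration of test-time prompt tuning (TPT) methods for vision-language models (VLMs). Existing methods rely on dispersing textual features to achieve calibration, but the results are suboptimal. The paper shows that imposing orthogonality constraints on the textual features corresponding to the learnable prompts improves calibration more effectively. The key to the solution is a new method, O-TPT, which enforces orthogonalization of the textual features to obtain better dispersion, thereby significantly reducing the overall average calibration error and surpassing zero-shot calibration performance on fine-grained classification tasks.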

Link: https://arxiv.org/abs/2503.12096
Authors: Ashshak Sharifdeen, Muhammad Akhtar Munir, Sanoojan Baliah, Salman Khan, Muhammad Haris Khan
Affiliations: Mohamed bin Zayed University of AI; University of Colombo; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract:Test-time prompt tuning for vision-language models (VLMs) is getting attention because of its ability to learn with unlabeled data without fine-tuning. Although test-time prompt tuning methods for VLMs can boost accuracy, the resulting models tend to demonstrate poor calibration, which casts doubts on the reliability and trustworthiness of these models. Notably, more attention needs to be devoted to calibrating the test-time prompt tuning in vision-language models. To this end, we propose a new approach, called O-TPT that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. Towards introducing orthogonality constraints, we make the following contributions. First, we uncover new insights behind the suboptimal calibration performance of existing methods relying on textual feature dispersion. Second, we show that imposing a simple orthogonalization of textual features is a more effective approach towards obtaining textual dispersion. We conduct extensive experiments on various datasets with different backbones and baselines. The results indicate that our method consistently outperforms the prior state of the art in significantly reducing the overall average calibration error. Also, our method surpasses the zero-shot calibration performance on fine-grained classification tasks.
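
One simple way to express an orthogonality constraint on a set of class text features is to penalize the distance between their Gram matrix and the identity. The sketch below uses that Frobenius-norm form as an assumption; the abstract does not give O-TPT's exact loss, and `orthogonality_penalty` is a hypothetical name.

```python
import numpy as np


def orthogonality_penalty(text_features):
    """Squared Frobenius distance between the Gram matrix of the
    L2-normalised features (one row per class) and the identity."""
    f = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    gram = f @ f.T
    return float(np.sum((gram - np.eye(len(f))) ** 2))


rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 32))   # stand-in for 5 class text embeddings
print(orthogonality_penalty(feats))
```

The penalty is zero exactly when the normalised class features are mutually orthogonal, and it grows as features for different classes align, which is the dispersion behaviour the abstract associates with better calibration.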

[CV-242] Towards Vision Zero: The Accid3nD Dataset

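[Quick read]: This paper addresses the understanding and detection of traffic accidents, which are unavoidable and sporadic outcomes of real traffic networks. Although much work has gone into improving the safety of transportation networks, no public dataset has offered 3D annotations of real-world accidents recorded from roadside sensors. The paper therefore presents the Accid3nD dataset, which contains real highway accidents recorded under different weather and lighting conditions by roadside cameras and LiDARs, with rich annotations including 2D bounding boxes, instance masks, and 3D bounding boxes with track IDs. For accident detection, the paper proposes a model that combines a rule-based approach with a learning-based one; the key is to exploit the complementary strengths of rules and machine learning for more robust detection. Experiments confirm the effectiveness and robustness of the proposed method.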

Link: https://arxiv.org/abs/2503.12095
Authors: Walter Zimmer, Ross Greer, Daniel Lehmberg, Marc Pavel, Holger Caesar, Xingcheng Zhou, Ahmed Ghita, Mohan Trivedi, Rui Song, Hu Cao, Akshay Gopalkrishnan, Alois C. Knoll
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as unavoidable and sporadic outcomes of traffic networks. No public dataset contains 3D annotations of real-world accidents recorded from roadside sensors. We present the Accid3nD dataset, a collection of real-world highway accidents in different weather and lighting conditions. It contains vehicle crashes at high-speed driving with 2,634,233 labeled 2D bounding boxes, instance masks, and 3D bounding boxes with track IDs. In total, the dataset contains 111,945 labeled frames recorded from four roadside cameras and LiDARs at 25 Hz. The dataset contains six object classes and is provided in the OpenLABEL format. We propose an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our website: this https URL.

[CV-243] E-SAM: Training-Free Segment Every Entity Model

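[Quick read]: This paper addresses the limited scalability and adaptability of existing entity segmentation (ES) methods, which depend on large annotated datasets or incur high training costs. It also targets the over-segmentation and under-segmentation of the Segment Anything Model (SAM) in its Automatic Mask Generation (AMG) mode, and proposes E-SAM, a new training-free framework with strong entity segmentation capability.

The key to the solution is a design with three modules. First, Multi-level Mask Generation (MMG) hierarchically processes SAM's AMG outputs to produce reliable object-level masks while preserving fine details at other levels. Second, Entity-level Mask Refinement (EMR) refines these object-level masks into accurate entity-level masks: it separates overlapping masks to resolve the redundancy in SAM's outputs and merges similar masks by evaluating entity-level consistency. Finally, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks and fusing them with the EMR outputs. The three modules are integrated seamlessly and achieve the best entity segmentation performance without additional training overhead. Experiments show that E-SAM improves on prior methods by +30.1 on benchmark metrics.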

Link: https://arxiv.org/abs/2503.12094
Authors: Weiming Zhang, Dingwen Xiao, Lei Chen, Lin Wang
Affiliations: HKUST; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM’s AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM’s outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized to achieve the best ES without additional training overhead. Extensive experiments demonstrate that E-SAM achieves state-of-the-art performance compared to prior ES methods, demonstrating a significant improvement by +30.1 on benchmark metrics.
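
The "merge similar masks" step of EMR can be illustrated with a plain IoU-based merge. This is a stand-in: the paper merges by "entity-level consistency", whose definition the abstract does not give, so the IoU criterion, the 0.8 threshold, and the function names are assumptions.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0


def merge_similar(masks, thr=0.8):
    """Greedily merge masks whose IoU with an already kept mask reaches thr."""
    merged = []
    for m in sorted(masks, key=lambda x: -int(x.sum())):   # largest first
        for i, kept in enumerate(merged):
            if mask_iou(m, kept) >= thr:
                merged[i] = np.logical_or(kept, m)
                break
        else:
            merged.append(m.copy())
    return merged


a = np.zeros((4, 4), dtype=bool); a[:2, :] = True          # 8 pixels
b = a.copy(); b[2, 0] = True                               # 9 pixels, IoU with a = 8/9
c = np.zeros((4, 4), dtype=bool); c[3, :] = True           # disjoint from a and b
print(len(merge_similar([a, b, c])))
```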

[CV-244] SFMNet: Sparse Focal Modulation for 3D Object Detection

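[Quick read]: This paper addresses the difficulty that conventional sparse convolutions have in modeling long-range dependencies for 3D object detection, while avoiding the high computational cost that transformers incur on sparse 3D scenes because of their quadratic complexity. The key to the solution is a novel Sparse Focal Modulation (SFM) module, which uses a hierarchical sparse convolution design to integrate short- and long-range context with linear complexity. This yields SFMNet, an efficient, high-performing 3D sparse detector that is particularly well suited to large-scale LiDAR scenes.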

Link: https://arxiv.org/abs/2503.12093
Authors: Oren Shrout, Ayellet Tal
Affiliations: Technion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose SFMNet, a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies. While traditional sparse convolution techniques efficiently capture local structures, they struggle with modeling long-range relationships. However, capturing long-range dependencies is fundamental for 3D object detection. In contrast, transformers are designed to capture these long-range dependencies through attention mechanisms, but they come with high computational costs due to their quadratic query-key-value interactions. Furthermore, directly applying attention to non-empty voxels is inefficient due to the sparse nature of 3D scenes. Our SFMNet is built on a novel Sparse Focal Modulation (SFM) module, which integrates short- and long-range contexts with linear complexity by leveraging a new hierarchical sparse convolution design. This approach enables SFMNet to achieve high detection performance with improved efficiency, making it well-suited for large-scale LiDAR scenes. We show that our detector achieves state-of-the-art performance on autonomous driving datasets.

[CV-245] Temporally Consistent Mitral Annulus Measurements from Sparse Annotations in Echocardiographic Videos

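[Quick read]: This paper addresses temporally consistent localization of mitral annulus landmarks in echocardiography videos using sparse annotations. The key to the solution is a self-supervised loss term that enforces temporal consistency between neighboring frames, which smooths landmark positions and improves measurement accuracy over time. Realistic field-of-view augmentations are also incorporated to improve the recognition of missing anatomical landmarks.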

Link: https://arxiv.org/abs/2503.12087
Authors: Gino E. Jansen, Mark J. Schuuring, Berto J. Bouma, Ivana Išgum
Affiliations: Department of Biomedical Engineering & Physics, Amsterdam University Medical Center, Netherlands; Informatics Institute, University of Amsterdam, Netherlands; Amsterdam Cardiovascular Sciences, Netherlands; Department of Biomedical Signals and Systems, University of Twente, Netherlands; Department of Cardiology, Amsterdam University Medical Center, Netherlands; Department of Radiology & Nuclear Medicine, Amsterdam University Medical Center, Netherlands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work presents a novel approach to achieving temporally consistent mitral annulus landmark localization in echocardiography videos using sparse annotations. Our method introduces a self-supervised loss term that enforces temporal consistency between neighboring frames, which smooths the position of landmarks and enhances measurement accuracy over time. Additionally, we incorporate realistic field-of-view augmentations to improve the recognition of missing anatomical landmarks. We evaluate our approach on both a public and private dataset, and demonstrate significant improvements in Mitral Annular Plane Systolic Excursion (MAPSE) calculations and overall landmark tracking stability. The method achieves a mean absolute MAPSE error of 1.81 ± 0.14 mm, an annulus size error of 2.46 ± 0.31 mm, and a landmark localization error of 2.48 ± 0.07 mm. Finally, it achieves a 0.99 ROC-AUC for recognition of missing landmarks.
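
A temporal consistency term between neighboring frames needs no labels, which is why it suits sparsely annotated videos. The sketch below uses the simplest plausible form, the mean squared frame-to-frame displacement of each predicted landmark; this form and the function name are assumptions, since the abstract does not state the exact loss.

```python
import numpy as np


def temporal_consistency_loss(landmarks):
    """landmarks: array of shape (T, K, 2) with K predicted (x, y) points
    in each of T consecutive frames. Penalises jumps between neighbours."""
    diffs = np.diff(landmarks, axis=0)                 # (T-1, K, 2)
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))


rng = np.random.default_rng(0)
smooth = np.cumsum(0.1 * rng.standard_normal((20, 2, 2)), axis=0)
jittery = smooth + rng.standard_normal((20, 2, 2))
print(temporal_consistency_loss(smooth) < temporal_consistency_loss(jittery))
```

A pure smoothness penalty also discourages genuine annulus motion, so in practice it would be weighted against the supervised loss on the annotated frames.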

[CV-246] FA-BARF: Frequency Adapted Bundle-Adjusting Neural Radiance Fields

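[Quick read]: This paper addresses the reliance of Neural Radiance Fields (NeRF) on a hand-crafted frequency annealing strategy when reconstructing real scenes with imperfect camera poses, a strategy that slows down joint optimization. The key solution is the Frequency Adapted Bundle Adjusting Radiance Field (FA-BARF), which replaces the conventional temporal low-pass filter with a frequency-adapted spatial low-pass filter. The paper establishes a theoretical framework that relates NeRF's positional encoding to camera registration, shows that the frequency-adapted filter mitigates the frequency fluctuation caused by the temporal filter, and shows that camera poses can be optimized effectively through the overlap of radial uncertainty among different views. Experiments show that FA-BARF accelerates joint optimization under small perturbations in object-centric scenes and recovers real-world scenes with unknown camera poses, widening the possibilities for applying NeRF to real-time dense 3D mapping and reconstruction.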

Link: https://arxiv.org/abs/2503.12086
Authors: Rui Qian, Chenyangguang Zhang, Yan Di, Guangyao Zhai, Ruida Zhang, Jiayu Guo, Benjamin Busam, Jian Pu
Affiliations: Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University; Department of Automation, Tsinghua University; Chair for Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Radiance Fields (NeRF) have exhibited highly effective performance for photorealistic novel view synthesis recently. However, a key limitation is the reliance on a hand-crafted frequency annealing strategy to recover 3D scenes with imperfect camera poses. The strategy exploits a temporal low-pass filter to guarantee convergence while decelerating the joint optimization of implicit scene reconstruction and camera registration. In this work, we introduce the Frequency Adapted Bundle Adjusting Radiance Field (FA-BARF), replacing the temporal low-pass filter with a frequency-adapted spatial low-pass filter to address the decelerating problem. We establish a theoretical framework to interpret the relationship between position encoding of NeRF and camera registration and show that our frequency-adapted filter can mitigate frequency fluctuation caused by the temporal filter. Furthermore, we show that applying a spatial low-pass filter in NeRF can optimize camera poses productively through radial uncertainty overlaps among various views. Extensive experiments show that FA-BARF can accelerate the joint optimization process under small perturbations in object-centric scenes and recover real-world scenes with unknown camera poses. This implies wider possibilities for NeRF applied in dense 3D mapping and reconstruction under real-time requirements. The code will be released upon paper acceptance.
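
For context, the hand-crafted annealing that the abstract calls a temporal low-pass filter appears to be the coarse-to-fine positional-encoding schedule of the original BARF method: frequency band k is switched on smoothly as a progress parameter alpha sweeps from 0 to the number of bands. The sketch below reproduces that baseline schedule only; FA-BARF's frequency-adapted spatial filter is not shown because the abstract gives no formula for it.

```python
import numpy as np


def barf_weights(alpha, n_freqs):
    """Per-band weights of the BARF coarse-to-fine schedule:
    0 before a band opens, a cosine ramp while it opens, 1 afterwards."""
    k = np.arange(n_freqs)
    t = np.clip(alpha - k, 0.0, 1.0)
    return (1.0 - np.cos(np.pi * t)) / 2.0


def encode(x, alpha, n_freqs):
    """Weighted sinusoidal encoding of a scalar coordinate x."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi
    w = barf_weights(alpha, n_freqs)
    return np.concatenate([w * np.sin(freqs * x), w * np.cos(freqs * x)])


for alpha in (0.0, 1.5, 4.0):
    print(alpha, np.round(barf_weights(alpha, 4), 3))
```

Because alpha is tied to the training iteration, high-frequency bands stay suppressed early on regardless of how well the poses are already registered, which is the slowdown the abstract attributes to this strategy.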

[CV-247] V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents CVPR2025

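[Quick read]: This paper addresses the inability of existing video stylization methods to handle videos with complex transitions and to render any video from an open-ended style description in a user query. To fill this gap, it proposes V-Stylist, a generic multi-agent system built on a collaboration-and-reflection paradigm of multimodal large language models. The key is a systematic workflow with three roles: (1) the Video Parser decomposes the input video into shots through a video-to-shot prompting paradigm and generates text prompts describing the key content of each shot, which handles complex transitions effectively; (2) the Style Parser identifies the style in the user query and progressively searches a style tree for the matching style model through a tree-of-thought search paradigm, which pins down vague style preferences in open queries; (3) the Style Artist uses the matched model to stylize every shot and adaptively adjusts detail control to the style requirement through a multi-round self-reflection paradigm. The design mimics how human professionals approach effective, automatic video stylization. The paper also constructs a new benchmark, the Text-driven Video Stylization Benchmark (TVSBench), to assess stylization of complex videos under open user queries. Experiments show that V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics.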

Link: https://arxiv.org/abs/2503.12077
Authors: Zhengrong Yue,Shaobin Zhuang,Kunchang Li,Yanbo Ding,Yali Wang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025

Abstract:Despite the recent advancement in video stylization, most existing methods struggle to render any video with complex transitions, based on an open style description of user query. To fill this gap, we introduce a generic multi-agent system for video stylization, V-Stylist, by a novel collaboration and reflection paradigm of multi-modal large language models. Specifically, our V-Stylist is a systematic workflow with three key roles: (1) Video Parser decomposes the input video into a number of shots and generates their text prompts of key shot content. Via a concise video-to-shot prompting paradigm, it allows our V-Stylist to effectively handle videos with complex transitions. (2) Style Parser identifies the style in the user query and progressively searches the matched style model from a style tree. Via a robust tree-of-thought searching paradigm, it allows our V-Stylist to precisely specify vague style preference in the open user query. (3) Style Artist leverages the matched model to render all the video shots into the required style. Via a novel multi-round self-reflection paradigm, it allows our V-Stylist to adaptively adjust detail control, according to the style requirement. With such a distinct design of mimicking human professionals, our V-Stylist achieves a major breakthrough over the primary challenges for effective and automatic video stylization. Moreover, we further construct a new benchmark, Text-driven Video Stylization Benchmark (TVSBench), which fills the gap to assess stylization of complex videos on open user queries. Extensive experiments show that V-Stylist achieves the state-of-the-art, e.g., V-Stylist surpasses FRESCO and ControlVideo by 6.05% and 4.51% respectively in overall average metrics, marking a significant advance in video stylization.

[CV-248] Robust Dataset Distillation by Matching Adversarial Trajectories

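[Quick read]: This paper targets a gap in existing dataset distillation methods: they ignore model robustness, so models trained on distilled data are vulnerable to adversarial attacks. To address this, it introduces the task of "robust dataset distillation", a new paradigm that embeds adversarial robustness into the synthetic data during distillation. The key to the approach is Matching Adversarial Trajectories (MAT), which integrates adversarial training into trajectory-based dataset distillation: adversarial samples are introduced during trajectory generation to obtain robust training trajectories, which then guide the distillation process. Experiments show that models trained on the distilled data achieve enhanced adversarial robustness even under natural training, while keeping accuracy comparable to existing methods.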

Link: https://arxiv.org/abs/2503.12069
Authors: Wei Lai,Tianyu Ding,ren dongdong,Lei Wang,Jing Huo,Yang Gao,Wenbin Li
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Microsoft Corporation, USA; University of Wollongong, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dataset distillation synthesizes compact datasets that enable models to achieve performance comparable to training on the original large-scale datasets. However, existing distillation methods overlook the robustness of the model, resulting in models that are vulnerable to adversarial attacks when trained on distilled data. To address this limitation, we introduce the task of "robust dataset distillation", a novel paradigm that embeds adversarial robustness into the synthetic datasets during the distillation process. We propose Matching Adversarial Trajectories (MAT), a method that integrates adversarial training into trajectory-based dataset distillation. MAT incorporates adversarial samples during trajectory generation to obtain robust training trajectories, which are then used to guide the distillation process. As experimentally demonstrated, even through natural training on our distilled dataset, models can achieve enhanced adversarial robustness while maintaining competitive accuracy compared to existing distillation methods. Our work highlights robust dataset distillation as a new and important research direction and provides a strong baseline for future research to bridge the gap between efficient training and adversarial robustness.

[CV-249] Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation

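[Quick read]: This paper addresses weakly supervised histopathological image segmentation from image-level labels. Traditional methods based on Class Activation Maps (CAMs) highlight only the most discriminative regions and therefore produce incomplete segmentation masks, while methods that bring in textual information struggle with the inter-class homogeneity and intra-class heterogeneity of histopathological images. To tackle these problems, the paper proposes a prototype-based image prompting framework. The key is to build an image bank from the training set by clustering, extracting multiple prototype features per class to capture intra-class heterogeneity, and to design a contrastive matching loss between input features and class-specific prototypes, which addresses inter-class homogeneity and guides the model toward more accurate CAMs. Experiments show that the method outperforms existing weakly supervised segmentation methods on four datasets (LUAD-HistoSeg, BCSS-WSSS, GCSS, and BCSS), setting new benchmarks for histopathological image segmentation.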

Link: https://arxiv.org/abs/2503.12068
Authors: Qingchen Tang,Lei Fan,Maurice Pagnucco,Yang Song
Affiliations: University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weakly supervised image segmentation with image-level labels has drawn attention due to the high cost of pixel-level annotations. Traditional methods using Class Activation Maps (CAMs) often highlight only the most discriminative regions, leading to incomplete masks. Recent approaches that introduce textual information struggle with histopathological images due to inter-class homogeneity and intra-class heterogeneity. In this paper, we propose a prototype-based image prompting framework for histopathological image segmentation. It constructs an image bank from the training set using clustering, extracting multiple prototype features per class to capture intra-class heterogeneity. By designing a matching loss between input features and class-specific prototypes using contrastive learning, our method addresses inter-class homogeneity and guides the model to generate more accurate CAMs. Experiments on four datasets (LUAD-HistoSeg, BCSS-WSSS, GCSS, and BCSS) show that our method outperforms existing weakly supervised segmentation approaches, setting new benchmarks in histopathological image segmentation.

[CV-250] A Comprehensive Survey on Knowledge Distillation

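[Quick read]: This paper addresses the high runtime and memory consumption that Deep Neural Networks (DNNs) incur when deployed on edge devices, especially for large-scale foundation models such as Vision-Language Models (VLMs) and Large Language Models (LLMs). The central technique is Knowledge Distillation (KD), which uses a teacher-student architecture to transfer knowledge from a large, cumbersome teacher model to a lightweight student model, enabling efficient, low-resource deployment. The paper presents a comprehensive survey that systematically analyzes KD from several angles: distillation sources, schemes, algorithms, modalities, applications, and comparisons among existing methods, with particular attention to recent directions such as diffusion models, 3D inputs, foundation models, Transformers, and LLMs. It also discusses open challenges in KD and possible future research directions.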

Link: https://arxiv.org/abs/2503.12067
Authors: Amir M. Mansourian,Rozhan Ahmadi,Masoud Ghafouri,Amir Mohammad Babaei,Elaheh Badali Golezani,Zeynab Yasamani Ghamchi,Vida Ramezanian,Alireza Taherian,Kimia Dinashi,Amirali Miri,Shohreh Kasaei
Affiliations: Image Processing Lab, Sharif University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 47 pages, 10 figures, 13 tables

Abstract:Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundational models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed. Github page of the project: this https URL

[CV-251] DLA-Count: Dynamic Label Assignment Network for Dense Cell Distribution Counting

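[Quick read]: This paper tackles cell counting, a fundamental yet challenging task in medical and biological research, where the main difficulties are diverse cell morphology, dense distributions, and varying image quality. It proposes a new method, DLA-Count, with three key innovations: (1) K-adjacent Hungarian Matching (KHM), which markedly improves cell matching in dense regions; (2) Multi-scale Deformable Gaussian Convolution (MDGC), which adapts to different cell morphologies; and (3) a Gaussian-enhanced Feature Decoder (GFD) for efficient multi-scale feature fusion. Experiments show improvements in Mean Absolute Error of up to 46.7% on the ADI dataset and 42.5% on the MBM dataset over existing methods. The code has been released publicly.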

Link: https://arxiv.org/abs/2503.12063
Authors: Yuqing Yan,Yirui Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cell counting remains a fundamental yet challenging task in medical and biological research due to the diverse morphology of cells, their dense distribution, and variations in image quality. We present DLA-Count, a breakthrough approach to cell counting that introduces three key innovations: (1) K-adjacent Hungarian Matching (KHM), which dramatically improves cell matching in dense regions, (2) Multi-scale Deformable Gaussian Convolution (MDGC), which adapts to varying cell morphologies, and (3) Gaussian-enhanced Feature Decoder (GFD) for efficient multi-scale feature fusion. Our extensive experiments on four challenging cell counting datasets (ADI, MBM, VGG, and DCC) demonstrate that our method outperforms previous methods across diverse datasets, with improvements in Mean Absolute Error of up to 46.7% on ADI and 42.5% on MBM datasets. Our code is available at this https URL.
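
For readers unfamiliar with the metric, here is a minimal sketch of how Mean Absolute Error over per-image counts and a relative improvement percentage could be computed. It assumes the reported percentages are relative reductions in MAE, which the abstract does not state explicitly; the function names and all numbers in the usage note are hypothetical.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true per-image counts."""
    if len(predicted_counts) != len(true_counts):
        raise ValueError("count lists must have the same length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to a baseline (assumed definition)."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0
```

With made-up values, a baseline MAE of 10.0 and a new MAE of 5.33 give `relative_improvement(10.0, 5.33)` of about 46.7, which is the kind of figure the abstract quotes.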

[CV-252] EHNet: An Efficient Hybrid Network for Crowd Counting and Localization

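[Quick read]: This paper addresses multi-scale crowd distributions in crowd counting, that is, how to handle crowd densities at different scales within a single image. It proposes the Efficient Hybrid Network (EHNet). The key is to reformulate crowd counting as a point regression framework, introduce the Spatial-Position Attention Module (SPAM) to capture comprehensive spatial context and long-range dependencies, and develop the Adaptive Feature Aggregation Module (AFAM) to fuse multi-scale feature representations effectively. Building on these, the Multi-Scale Attentive Decoder (MSAD) is introduced. The result is competitive performance with reduced computational overhead on benchmark datasets including ShanghaiTech, UCF-CC-50, and UCF-QNRF.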

Link: https://arxiv.org/abs/2503.12061
Authors: Yuqing Yan,Yirui Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, crowd counting and localization have become crucial techniques in computer vision, with applications spanning various domains. The presence of multi-scale crowd distributions within a single image remains a fundamental challenge in crowd counting tasks. To address these challenges, we introduce the Efficient Hybrid Network (EHNet), a novel framework for efficient crowd counting and localization. By reformulating crowd counting into a point regression framework, EHNet leverages the Spatial-Position Attention Module (SPAM) to capture comprehensive spatial contexts and long-range dependencies. Additionally, we develop an Adaptive Feature Aggregation Module (AFAM) to effectively fuse and harmonize multi-scale feature representations. Building upon these, we introduce the Multi-Scale Attentive Decoder (MSAD). Experimental results on four benchmark datasets demonstrate that EHNet achieves competitive performance with reduced computational overhead, outperforming existing methods on ShanghaiTech Part_A, ShanghaiTech Part_B, UCF-CC-50, and UCF-QNRF. Our code is in this https URL.

[CV-253] Tailor: An Integrated Text-Driven CG-Ready Human and Garment Generation System

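[Quick read]: This paper asks how generative AI can provide an accessible, integrated text-to-avatar pipeline that efficiently produces high-quality, customizable clothed 3D human avatars. Existing methods fall short in generating simulation-ready garments and high-fidelity detail, so they do not meet practical needs. The key to the proposed solution is a three-stage system named Tailor: first, a large language model converts the text description into parameterized body shapes and semantically matched garment templates; second, topology-preserving deformation with novel geometric losses fits the garments precisely to the body geometry; finally, an enhanced texture diffusion module with a symmetric local attention mechanism ensures view consistency and photorealistic detail. Together these improve the fidelity, usability, and diversity of the generated results.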

Link: https://arxiv.org/abs/2503.12052
Authors: Zhiyao Sun,Yu-Hui Wen,Matthieu Lin,Ho-Jui Fang,Sheng Ye,Tian Lv,Yong-Jin Liu
Affiliations: Tsinghua University; Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:Creating detailed 3D human avatars with garments typically requires specialized expertise and labor-intensive processes. Although recent advances in generative AI have enabled text-to-3D human/clothing generation, current methods fall short in offering accessible, integrated pipelines for producing ready-to-use clothed avatars. To solve this, we introduce Tailor, an integrated text-to-avatar system that generates high-fidelity, customizable 3D humans with simulation-ready garments. Our system includes a three-stage pipeline. We first employ a large language model to interpret textual descriptions into parameterized body shapes and semantically matched garment templates. Next, we develop topology-preserving deformation with novel geometric losses to adapt garments precisely to body geometries. Furthermore, an enhanced texture diffusion module with a symmetric local attention mechanism ensures both view consistency and photorealistic details. Quantitative and qualitative evaluations demonstrate that Tailor outperforms existing SoTA methods in terms of fidelity, usability, and diversity. Code will be available for academic use.

[CV-254] TACO: Taming Diffusion for in-the-wild Video Amodal Completion

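[Quick read]: This paper addresses Video Amodal Completion (VAC): given a visual prompt specifying the object of interest, generate the complete object consistently throughout the video. Existing models still struggle to complete partially observable objects while keeping cross-frame consistency in unstructured, real-world videos. The key is a conditional diffusion model, TACO, which repurposes the rich, consistent manifolds learned by pre-trained video diffusion models for VAC. To let TACO generalize effectively and robustly to challenging real-world scenes, the authors build a large-scale synthetic dataset with multiple difficulty levels and design a progressive fine-tuning paradigm that starts with simple recovery tasks and gradually moves to more complex ones.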

Link: https://arxiv.org/abs/2503.12049
Authors: Ruijie Lu,Yixin Chen,Yu Liu,Jiaxiang Tang,Junfeng Ni,Diwen Wan,Gang Zeng,Siyuan Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Humans can infer complete shapes and appearances of objects from limited visual cues, relying on extensive prior knowledge of the physical world. However, completing partially observable objects while ensuring consistency across video frames remains challenging for existing models, especially for unstructured, in-the-wild videos. This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. To enable its effective and robust generalization to challenging in-the-wild scenarios, we curate a large-scale synthetic dataset with multiple difficulty levels by systematically imposing occlusions onto un-occluded videos. Building on this, we devise a progressive fine-tuning paradigm that starts with simpler recovery tasks and gradually advances to more complex ones. We demonstrate TACO’s versatility on a wide range of in-the-wild videos from Internet, as well as on diverse, unseen datasets commonly used in autonomous driving, robotic manipulation, and scene understanding. Moreover, we show that TACO can be effectively applied to various downstream tasks like object reconstruction and pose estimation, highlighting its potential to facilitate physical world understanding and reasoning. Our project page is available at this https URL.

[CV-255] PSGait: Multimodal Gait Recognition using Parsing Skeleton

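[Quick read]: This paper aims at accurate gait recognition in real-world scenes. Conventional methods usually rely on silhouette or skeleton representations, which work better in controlled laboratory settings than in the wild, mainly because their information entropy for representing gait is limited. To address this, the paper proposes a new gait representation, the Parsing Skeleton, which uses a skeleton-guided human parsing method to capture fine-grained body dynamics and thus carries much higher information entropy about the shapes and dynamics of fine-grained body parts during walking. The key solution combines parsing skeletons with the silhouette modality in a new framework, PSGait, which fuses the two sources to strengthen individual discrimination. Benchmarks on a range of datasets show that PSGait outperforms existing state-of-the-art multimodal methods and, as a plug-and-play method, improves Rank-1 accuracy by up to 10.9% across various gait recognition models. These results indicate that parsing skeletons are effective and versatile for gait recognition in the wild.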

Link: https://arxiv.org/abs/2503.12047
Authors: Hangrui Xu,Chuanrui Zhang,Zhengxian Wu,Peng Jiao,Haoqian Wang
Affiliations: Hefei University of Technology, China; Tsinghua University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gait recognition has emerged as a robust biometric modality due to its non-intrusive nature and resilience to occlusion. Conventional gait recognition methods typically rely on silhouettes or skeletons. Despite their success in gait recognition for controlled laboratory environments, they usually fail in real-world scenarios due to their limited information entropy for gait representations. To achieve accurate gait recognition in the wild, we propose a novel gait representation, named Parsing Skeleton. This representation innovatively introduces the skeleton-guided human parsing method to capture fine-grained body dynamics, so they have much higher information entropy to encode the shapes and dynamics of fine-grained human parts during walking. Moreover, to effectively explore the capability of the parsing skeleton representation, we propose a novel parsing skeleton-based gait recognition framework, named PSGait, which takes parsing skeletons and silhouettes as input. By fusing these two modalities, the resulting image sequences are fed into gait recognition models for enhanced individual differentiation. We conduct comprehensive benchmarks on various datasets to evaluate our model. PSGait outperforms existing state-of-the-art multimodal methods. Furthermore, as a plug-and-play method, PSGait leads to a maximum improvement of 10.9% in Rank-1 accuracy across various gait recognition models. These results demonstrate the effectiveness and versatility of parsing skeletons for gait recognition in the wild, establishing PSGait as a new state-of-the-art approach for multimodal gait recognition.

[CV-256] Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing CVPR2025

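[Quick read]: This paper addresses the degraded acoustic modeling in movie dubbing caused by the limited scale of datasets and background noise in the audio. To generate high-quality dubbing with precise prosody alignment, it proposes an acoustic-prosody disentangled two-stage method. The key is to first improve robustness through prosody-enhanced acoustic pre-training, then freeze the pre-trained acoustic system and design a disentangled framework that models prosodic text features and dubbing style while preserving acoustic quality. An in-domain emotion analysis module is added to reduce the effect of visual domain shift across movies on emotion-prosody alignment. Experiments show the method compares favorably with state-of-the-art models on two primary benchmarks.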

Link: https://arxiv.org/abs/2503.12042
Authors: Zhedong Zhang,Liang Li,Chenggang Yan,Chunshan Liu,Anton van den Hengel,Yuankai Qi
Affiliations: Hangzhou Dianzi University; Institute of Computing Technology, Chinese Academy of Sciences; University of Adelaide; Macquarie University; VIPL group, ICT, CAS
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Accepted by CVPR2025

Abstract:Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker’s voice demonstrated in a short reference audio clip. This task demands the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinder the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose a prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The demos are available at this https URL.

[CV-257] MOS: Modeling Object-Scene Associations in Generalized Category Discovery

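[Quick read]: This paper addresses the neglect or misuse of scene information in Generalized Category Discovery (GCD). Prior methods typically treat scene information as noise or ignore it, whereas this paper argues that it is in fact a strong prior for inferring novel classes. The key is identifying and resolving the Ambiguity Challenge in GCD: novel objects in base scenes may be misclassified into base categories, and base objects in novel scenes may be mistaken for novel categories. To use scene information effectively, the authors propose the Modeling Object-Scene Associations (MOS) framework, which adds an MLP-based scene-awareness module and improves average accuracy on challenging fine-grained datasets by 4% over existing methods.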

Link: https://arxiv.org/abs/2503.12035
Authors: Zhengyuan Peng,Jinpeng Ma,Zhimin Sun,Ran Yi,Haichuan Song,Xin Tan,Lizhuang Ma
Affiliations: Shanghai Jiao Tong University; East China Normal University; Chongqing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalized Category Discovery (GCD) is a classification task that aims to classify both base and novel classes in unlabeled images, using knowledge from a labeled dataset. In GCD, previous research overlooks scene information or treats it as noise, reducing its impact during model training. However, in this paper, we argue that scene information should be viewed as a strong prior for inferring novel classes. We attribute the misinterpretation of scene information to a key factor: the Ambiguity Challenge inherent in GCD. Specifically, novel objects in base scenes might be wrongly classified into base categories, while base objects in novel scenes might be mistakenly recognized as novel categories. Once the ambiguity challenge is addressed, scene information can reach its full potential, significantly enhancing the performance of GCD models. To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. It achieves an exceptional average accuracy improvement of 4% on the challenging fine-grained datasets compared to state-of-the-art methods, emphasizing its superior performance in fine-grained GCD. The code is publicly available at this https URL.

[CV-258] Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder

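[Quick read]: This paper addresses real-time recognition of human manipulation actions for safe and effective human-robot interaction and collaboration. Some existing methods run in real time but do not scale well to long-duration manipulations, that is, they lack temporal scalability. To meet this challenge, the authors propose a new Factorized Graph Sequence Encoder network that builds on generalizable scene graph representations and a factorized encoder architecture, so that it runs in real time and also scales well in the temporal dimension. They also introduce a Hand Pooling operation for more focused extraction of graph-level embeddings. The key lies in the factorized encoder structure and the Hand Pooling operation, which improve the F1-macro score by 14.3% on the KIT Bimanual Action (Bimacs) dataset and by 5.6% on the Collaborative Action (CoAx) dataset.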

Link: https://arxiv.org/abs/2503.12034
Authors: Enes Erdogan,Eren Erdal Aksoy,Sanem Sariel
Affiliations: Artificial Intelligence and Robotics Lab, Faculty of Computer and Informatics Engineering, Istanbul Technical University; School of Information Technology, Halmstad University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 7 tables

Abstract:Recognition of human manipulation actions in real-time is essential for safe and effective human-robot interaction and collaboration. The challenge lies in developing a model that is both lightweight enough for real-time execution and capable of generalization. While some existing methods in the literature can run in real-time, they struggle with temporal scalability, i.e., they fail to adapt to long-duration manipulations effectively. To address this, leveraging the generalizable scene graph representations, we propose a new Factorized Graph Sequence Encoder network that not only runs in real-time but also scales effectively in the temporal dimension, thanks to its factorized encoder architecture. Additionally, we introduce Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving a 14.3% and 5.6% improvement in F1-macro score on the KIT Bimanual Action (Bimacs) Dataset and Collaborative Action (CoAx) Dataset, respectively. Moreover, we conduct an extensive ablation study to validate our network design choices. Finally, we compare our model with its architecturally similar RGB-based model on the Bimacs dataset and show the limitations of this model in contrast to ours on such an object-centric manipulation dataset.
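
The F1-macro score used in the reported comparison can be sketched as follows. This is the standard unweighted per-class average, written in plain Python; the function name and the action labels in the usage note are hypothetical, and the paper's own evaluation code may differ in details such as how absent classes are handled.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels `["hold", "hold", "cut", "idle"]` and predictions `["hold", "cut", "cut", "idle"]`, the per-class F1 values are 2/3, 2/3, and 1, so the macro average is 7/9. Because every class counts equally, rare actions influence the score as much as frequent ones.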

[CV-259] Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

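[Quick read]: This paper addresses the gap between open-loop training and closed-loop deployment in end-to-end autonomous driving. Current methods predict trajectories in an open-loop setting, react slowly to other dynamic agents in closed-loop environments, and may produce kinematically infeasible plans because of the gap between training and actual driving. The paper proposes Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network. The key is a control decoder that focuses on short-term actions for faster reactions, together with a trajectory refinement module that enforces kinematic constraints in closed-loop environments, bridging open-loop training and closed-loop driving. The unified approach reaches a Driving Score of 65.89 and a Success Rate of 48.20% on the Bench2Drive dataset, well ahead of prior work.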

Link: https://arxiv.org/abs/2503.12030
Authors: Zhenxin Li,Shihao Wang,Shiyi Lan,Zhiding Yu,Zuxuan Wu,Jose M. Alvarez
Affiliations: Fudan University; The Hong Kong Polytechnic University; NVIDIA
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment, which struggle with quick reactions to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving. Code will be available at this https URL.

[CV-260] Challenges in Plane Symmetry: From Theory to Perception

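[Quick read]: This paper studies the symmetry of planar ornaments, analyzing one challenging ornament from both the theoretical and the perceptual point of view. The key finding, from a perceptual experiment, is that the symmetries participants perceive do not match the theoretical symmetries predicted by group theory, revealing a gap between the theoretical model and human visual perception.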

Link: https://arxiv.org/abs/2503.12028
Authors: F. Çengel,V. Adanova,S. Tari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Planar ornaments are created by repeating a base unit using a combination of four primitive geometric operations: translation, rotation, reflection, and glide reflection. According to group theory, different combinations of these four geometric operations lead to different symmetry groups. In this work, we select a single challenging ornament and analyze it from both the theoretical and the perceptual point of view. We present the perceptual experiment results, where one can see that the symmetries the participants perceived in the ornament do not match what the theory dictates.
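
The four primitive operations named in the abstract can be written as simple maps on 2D points. The sketch below is only illustrative: the function names, the choice of the origin as rotation center, and the x-axis as mirror line are assumptions, not anything taken from the paper.

```python
import math


def translate(p, dx, dy):
    """Shift a point by the vector (dx, dy)."""
    return (p[0] + dx, p[1] + dy)


def rotate(p, angle):
    """Rotate a point about the origin by `angle` radians, counter-clockwise."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def reflect_x(p):
    """Mirror a point across the x-axis."""
    return (p[0], -p[1])


def glide_reflect_x(p, d):
    """Mirror across the x-axis, then slide by d along that same axis."""
    return translate(reflect_x(p), d, 0.0)
```

Applying `glide_reflect_x` twice with the same `d` gives a pure translation by `2 * d`, which is one way a glide reflection differs from an ordinary reflection (applying `reflect_x` twice returns the original point).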

[CV-261] Leveraging Motion Information for Better Self-Supervised Video Correspondence Learning

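[Quick read]: This paper addresses accurate pixel matching in self-supervised video correspondence learning, in particular the challenge of reliable pixel association without supervision. Existing methods still struggle to extract exact pixel correspondences and are prone to false matches, which limits their effectiveness in self-supervised tasks. The paper proposes an efficient self-supervised video correspondence learning framework (MER). The key is a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in video, plus a flexible sampling strategy, the Multi-Cluster Sampler, for extracting inter-pixel correspondence information so that the model attends more to pixel changes of important moving objects. Experiments show the algorithm surpasses existing state-of-the-art methods on tasks such as video object segmentation and video keypoint tracking.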

Link: https://arxiv.org/abs/2503.12026
Authors: Zihan Zhou,Changrui Dai,Aibo Song,Xiaolin Fang
Affiliations: Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self-supervised settings. To this end, we explore an efficient self-supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter-pixel correspondence information (Multi-Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. Through experiments, our algorithm outperforms the state-of-the-art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.

[CV-262] SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

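[Quick read]: This paper addresses the difficulty existing 3D/4D scene generation methods have with physical alignment across video generation and scene reconstruction, in particular that optimizing each stage independently makes it hard to handle subtle misalignments arising from the other stage. It proposes SteerX, a zero-shot inference-time steering method. The key is to unify scene reconstruction into the generation process and to introduce two geometric reward functions based on pose-free feed-forward scene reconstruction models, tilting the data distribution toward better geometric alignment.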

Link: https://arxiv.org/abs/2503.12024
Authors: Byeongjun Park,Hyojun Go,Hyelin Nam,Byung-Hoon Kim,Hyungjin Chung,Changick Kim
Affiliations: KAIST; EverEx; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.

[CV-263] Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art

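[Quick read]: This paper addresses the tendency of Text-to-Image (T2I) diffusion models (DMs) to produce unappealing outputs despite their high-fidelity generation ability. Existing methods usually assume that visual aesthetics is universal, whereas this work proposes a new task, aesthetics alignment, which aligns the generated output with user-specified aesthetic preferences for personalized aesthetic expression. To this end, it uses the Principles of Art (PoA) to codify visual aesthetics and builds CompArt, a large-scale compositional art dataset based on WikiArt with PoA analysis annotated by a multimodal large language model. The key to the solution is to leverage the expressive power of large language models and train a lightweight, transferable adapter, so that T2I DMs can offer ten compositional controls conditioned on user-specified PoA; an evaluation framework is also designed to validate the approach.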

Link: https://arxiv.org/abs/2503.12018
Authors: Zhe Jin,Tat-Seng Chua
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-Image (T2I) diffusion models (DM) have garnered widespread adoption due to their capability in generating high-fidelity outputs and accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization and we propose the novel task of aesthetics alignment which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset building on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferrable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.

[CV-264] QDM: Quadtree-Based Region-Adaptive Sparse Diffusion Models for Efficient Image Super-Resolution

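[Quick read]: This paper targets the redundant computation incurred by deep learning-based super-resolution (SR) methods, which perform pixel-wise computation uniformly across the whole image, including homogeneous regions. The key is the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework built on a quadtree structure. Guided by a quadtree derived from the low-quality input, QDM identifies the key regions (represented by leaf nodes) that need fine detail enhancement and reduces computation in homogeneous regions. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with less computational redundancy. Experiments show that QDM is effective for high-resolution SR across diverse image types, particularly medical imaging (e.g., CT scans), and that it outperforms or matches state-of-the-art SR methods on standard benchmarks while significantly reducing computational cost.

Link: https://arxiv.org/abs/2503.12015
Authors: Donglin Yang,Paul Vicol,Xiaojuan Qi,Renjie Liao,Xiaofan Zhang
Affiliations: The University of Hong Kong; Google DeepMind; The University of Hong Kong; The University of British Columbia; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: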

Abstract:Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions, represented by leaf nodes, where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM’s effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at this https URL.
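
To make the quadtree idea concrete, here is a minimal sketch of how a detail mask could be derived from a low-quality image. It is not the authors' implementation: the variance-based split criterion, the threshold, the rule that only the smallest leaves count as detail-rich, and the names `quadtree_leaves` and `detail_mask` are all assumptions for illustration.

```python
from statistics import pvariance


def quadtree_leaves(img, x, y, size, var_thresh, min_size):
    """Return (x, y, size) leaf blocks covering a square region of img.

    Assumed criterion: a block is split into four children while its pixel
    variance exceeds var_thresh and it is larger than min_size.
    The side length is assumed to be a power of two.
    """
    pixels = [img[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size <= min_size or pvariance(pixels) <= var_thresh:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, var_thresh, min_size)
    return leaves


def detail_mask(img, var_thresh=1.0, min_size=1):
    """Mark pixels that fall in the smallest leaves (treated as detail-rich) with 1."""
    n = len(img)
    mask = [[0] * n for _ in range(n)]
    for x, y, size in quadtree_leaves(img, 0, 0, n, var_thresh, min_size):
        if size == min_size:
            for r in range(y, y + size):
                for c in range(x, x + size):
                    mask[r][c] = 1
    return mask


# Toy 4x4 image: only the top-left quadrant has texture.
image = [
    [0, 9, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
mask = detail_mask(image)
```

In this toy case the flat quadrants stay as single large leaves, so a mask-guided model would only need to spend fine-grained computation on the textured quadrant.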

[CV-265] Learning Dual-Domain Multi-Scale Representations for Single Image Deraining

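[Quick read]: This paper observes that existing deraining methods typically rely on single-input, single-output, single-scale architectures, which overlook the joint multi-scale information between external and internal features, and that single-domain representations are too restrictive to handle the complexity of real-world rain. To address this, it proposes the Dual-Domain Multi-Scale Representation Network (DMSR). The key is to exploit joint multi-scale representations from the external and internal domains in parallel while combining the strengths of the spatial and frequency domains to capture more comprehensive properties. DMSR has two main components: the Multi-Scale Progressive Spatial Refinement Module (MPSRM) and the Frequency Domain Scale Mixer (FDSM). MPSRM uses a hierarchical modulation and fusion strategy to let multi-scale expert information interact and couple within the internal domain; FDSM extracts multi-scale local information in the spatial domain while modeling global dependencies in the frequency domain. Extensive experiments show state-of-the-art performance on six benchmark datasets.

Link: https://arxiv.org/abs/2503.12014
Authors: Shun Zou,Yi Zou,Mingya Zhang,Shipeng Luo,Guangwei Gao,Guojun Qi
Affiliations: Nanjing Agricultural University; Xiangtan University; Nanjing University; Northeast Forestry University; Nanjing University of Posts and Telecommunications; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, code: this https URL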

Abstract:Existing image deraining methods typically rely on single-input, single-output, and single-scale architectures, which overlook the joint multi-scale information between external and internal features. Furthermore, single-domain representations are often too restrictive, limiting their ability to handle the complexities of real-world rain scenarios. To address these challenges, we propose a novel Dual-Domain Multi-Scale Representation Network (DMSR). The key idea is to exploit joint multi-scale representations from both external and internal domains in parallel while leveraging the strengths of both spatial and frequency domains to capture more comprehensive properties. Specifically, our method consists of two main components: the Multi-Scale Progressive Spatial Refinement Module (MPSRM) and the Frequency Domain Scale Mixer (FDSM). The MPSRM enables the interaction and coupling of multi-scale expert information within the internal domain using a hierarchical modulation and fusion strategy. The FDSM extracts multi-scale local information in the spatial domain, while also modeling global dependencies in the frequency domain. Extensive experiments show that our model achieves state-of-the-art performance across six benchmark datasets.

[CV-266] UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection CVPR2025

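[Quick read]: This paper addresses two problems that Transformer-based LiDAR 3D detection frameworks face when capturing global dependencies in point clouds: (1) the spatial structure of 3D voxels is destroyed when they are serialized into a 1D sequence; (2) because of the large number of voxels and the quadratic complexity of Transformers, sequences must be grouped, which limits the receptive field. The proposed UniMamba integrates 3D convolution and State Space Models (SSM) in a concise multi-head manner to aggregate "local and global" spatial context efficiently and simultaneously. The key is a UniMamba block consisting of spatial locality modeling, complementary Z-order serialization, and a local-global sequential aggregator, stacked into an encoder-decoder architecture for multi-scale spatial learning.

Link: https://arxiv.org/abs/2503.12009
Authors: Xin Jin,Haisheng Su,Kai Liu,Cong Ma,Wei Wu,Fei Hui,Junchi Yan
Affiliations: Chang’an University; Shanghai Jiao Tong University; SenseAuto Research; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2025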

Abstract:Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing the global dependencies from point cloud spaces, which serialize the 3D voxels into the flattened 1D sequence for iterative self-attention. However, the spatial structure of 3D voxels will be inevitably destroyed during the serialization process. Besides, due to the considerable number of 3D voxels and quadratic complexity of Transformers, multiple sequences are grouped before feeding to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSM) achieved in the field of 2D vision tasks, in this paper, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSM in a concise multi-head manner, aiming to perform “local and global” spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization and local-global sequential aggregator. The spatial locality modeling module integrates 3D submanifold convolution to capture the dynamic spatial position embedding before serialization. Then the efficient Z-order curve is adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts the channel grouping strategy to efficiently encode both “local and global” spatial inter-dependencies using multi-head SSM. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate multi-scale spatial learning hierarchically. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo and Argoverse 2. Particularly, our UniMamba achieves 70.2 mAP on the nuScenes dataset.
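
As an illustration of Z-order serialization, the following sketch computes a standard Morton code for integer voxel coordinates and sorts voxels by it. Reading "horizontally and vertically" as swapping the priority of the x and y axes is an assumption, and `morton3d` and `serialize` are hypothetical names rather than the paper's code.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(voxels, swap_xy=False):
    """Sort (x, y, z) voxel coordinates along a Z-order curve.

    swap_xy=True exchanges the roles of x and y, giving a second,
    complementary ordering (an assumed reading of the paper's wording).
    """
    def key(v):
        x, y, z = v
        return morton3d(y, x, z) if swap_xy else morton3d(x, y, z)

    return sorted(voxels, key=key)


voxels = [(1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 0, 0)]
order_a = serialize(voxels)                # x varies fastest
order_b = serialize(voxels, swap_xy=True)  # y varies fastest
```

Voxels that are close in 3D tend to stay close in the resulting 1D sequence, and using two complementary orderings gives each voxel different sequence neighbours.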

[CV-267] ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object CVPR2025

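[Quick read]: This paper tackles high-quality interactive segmentation of large-scale remote sensing video data, where small object sizes, ambiguous features, and limited generalization are the main obstacles. The proposed ROS-SAM rests on three innovations: 1) LoRA-based fine-tuning for efficient domain adaptation that preserves SAM’s generalization ability; 2) enhancement of deep network layers to make extracted features more discriminative and reduce misclassifications; 3) integration of global context with local boundary details in the mask decoder to produce high-quality segmentation masks. In addition, the data pipeline is designed so that the model learns to handle objects at varying scales during training and focuses on high-quality predictions during inference. Experiments show that the redesigned data pipeline boosts the IoU by 6% and ROS-SAM increases it by 13%, confirming its effectiveness for fine-grained remote sensing segmentation.

Link: https://arxiv.org/abs/2503.12006
Authors: Zhe Shan,Yang Liu,Lei Zhou,Cheng Yan,Heng Wang,Xia Xie
Affiliations: Hainan University; Zhejiang University; Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025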

Abstract:The availability of large-scale remote sensing video data underscores the importance of high-quality interactive segmentation. However, challenges such as small object sizes, ambiguous features, and limited generalization make it difficult for current methods to achieve this goal. In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. The ROS-SAM is built upon three key innovations: 1) LoRA-based fine-tuning, which enables efficient domain adaptation while maintaining SAM’s generalization ability, 2) Enhancement of deep network layers to improve the discriminability of extracted features, thereby reducing misclassifications, and 3) Integration of global context with local boundary details in the mask decoder to generate high-quality segmentation masks. Additionally, we design the data pipeline to ensure the model learns to better handle objects at varying scales during training while focusing on high-quality predictions during inference. Experiments on remote sensing video datasets show that the redesigned data pipeline boosts the IoU by 6%, while ROS-SAM increases the IoU by 13%. Finally, when evaluated on existing remote sensing object tracking datasets, ROS-SAM demonstrates impressive zero-shot capabilities, generating masks that closely resemble manual annotations. These results confirm ROS-SAM as a powerful tool for fine-grained segmentation in remote sensing applications. Code is available at this https URL.
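
The LoRA-based fine-tuning mentioned above can be sketched generically as a frozen weight matrix plus a trainable low-rank update. This is a plain-Python illustration of the general LoRA formulation, not ROS-SAM's code; the class name `LoRALinear`, the initialization scale, and the `alpha / rank` scaling are the usual conventions, assumed here.

```python
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


class LoRALinear:
    """Computes y = x W^T + (alpha / rank) * x A^T B^T.

    W (out x in) stays frozen; only A (rank x in) and B (out x rank)
    would be trained. B starts at zero so the layer initially equals
    the pretrained one.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matmul(x, transpose(self.weight))
        update = matmul(matmul(x, transpose(self.a)), transpose(self.b))
        return [[p + self.scale * q for p, q in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
before = layer.forward([[1.0, 1.0]])  # identical to the frozen layer at init
```

Because only the two small matrices change, the pretrained weights are left intact, which is the property the abstract relies on to preserve SAM's generalization.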

[CV-268] 3D Gaussian Splatting against Moving Objects for High-Fidelity Street Scene Reconstruction

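[Quick read]: This paper addresses high-fidelity reconstruction of dynamic street scenes for applications such as autonomous driving, augmented reality, and virtual reality. Traditional methods based on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time constraints, while multi-view stereo and neural radiance fields improve 3D reconstruction but remain limited in computational efficiency and in handling scene dynamics. The paper proposes a 3D Gaussian point distribution method whose key is an adaptive transparency mechanism that eliminates moving objects while preserving high-fidelity static scene details, together with iterative refinement of the Gaussian point distribution to improve geometric accuracy and texture representation. Directional encoding is combined with spatial position optimization to reduce storage redundancy and improve rendering efficiency. Experiments report high reconstruction quality, improved rendering performance, and adaptability in large-scale dynamic environments.

Link: https://arxiv.org/abs/2503.12001
Authors: Peizhen Zheng,Longfei Wei,Dongjing Jiang,Jianfei Zhang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: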

Abstract:The accurate reconstruction of dynamic street scenes is critical for applications in autonomous driving, augmented reality, and virtual reality. Traditional methods relying on dense point clouds and triangular meshes struggle with moving objects, occlusions, and real-time processing constraints, limiting their effectiveness in complex urban environments. While multi-view stereo and neural radiance fields have advanced 3D reconstruction, they face challenges in computational efficiency and handling scene dynamics. This paper proposes a novel 3D Gaussian point distribution method for dynamic street scene reconstruction. Our approach introduces an adaptive transparency mechanism that eliminates moving objects while preserving high-fidelity static scene details. Additionally, iterative refinement of Gaussian point distribution enhances geometric accuracy and texture representation. We integrate directional encoding with spatial position optimization to optimize storage and rendering efficiency, reducing redundancy while maintaining scene integrity. Experimental results demonstrate that our method achieves high reconstruction quality, improved rendering performance, and adaptability in large-scale dynamic environments. These contributions establish a robust framework for real-time, high-precision 3D reconstruction, advancing the practicality of dynamic scene modeling across multiple applications. The source code for this work is available to the public at this https URL

[CV-269] Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation

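[Quick read]: This paper studies the manipulation of deformable objects such as cloth, where complex dynamics, near-infinite degrees of freedom, and frequent self-occlusions complicate state estimation and dynamics modeling. Prior work has struggled with robust state estimation, and dynamics models based on Graph Neural Networks (GNNs) are limited by their locality. The key is a diffusion-based generative approach to both perception and dynamics modeling. State estimation is formulated as reconstructing the full cloth state from sparse RGB-D observations, conditioned on a canonical cloth mesh; dynamics modeling is formulated as predicting future states from the current state and robot actions. With a transformer-based diffusion model, the method achieves high-fidelity state reconstruction and reduces long-horizon dynamics prediction error by an order of magnitude compared with GNN-based approaches. Integrated with model-predictive control (MPC), the framework executes cloth folding on a real robot, demonstrating the potential of generative models under partial observability and complex dynamics.

Link: https://arxiv.org/abs/2503.11999
Authors: Tongxuan Tian,Haoyang Li,Bo Ai,Xiaodi Yuan,Zhiao Huang,Hao Su
Affiliations: University of California San Diego; Hillbot; University of California San Diego
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: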

Abstract:Manipulating deformable objects like cloth is challenging due to their complex dynamics, near-infinite degrees of freedom, and frequent self-occlusions, which complicate state estimation and dynamics modeling. Prior work has struggled with robust cloth state estimation, while dynamics models, primarily based on Graph Neural Networks (GNNs), are limited by their locality. Inspired by recent advances in generative models, we hypothesize that these expressive models can effectively capture intricate cloth configurations and deformation patterns from data. Building on this insight, we propose a diffusion-based generative approach for both perception and dynamics modeling. Specifically, we formulate state estimation as reconstructing the full cloth state from sparse RGB-D observations conditioned on a canonical cloth mesh and dynamics modeling as predicting future states given the current state and robot actions. Leveraging a transformer-based diffusion model, our method achieves high-fidelity state reconstruction while reducing long-horizon dynamics prediction errors by an order of magnitude compared to GNN-based approaches. Integrated with model-predictive control (MPC), our framework successfully executes cloth folding on a real robotic system, demonstrating the potential of generative models for manipulation tasks with partial observability and complex dynamics.
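
The way a learned dynamics model plugs into model-predictive control can be sketched with a simple random-shooting planner. The abstract does not say which MPC variant is used, so random shooting is an assumption, and the scalar toy dynamics below merely stands in for the learned diffusion dynamics model; `mpc_random_shooting` is a hypothetical name.

```python
import random


def mpc_random_shooting(state, dynamics, cost, horizon, n_samples, rng):
    """Return the first action of the lowest-cost sampled action sequence.

    dynamics(state, action) -> next state   (placeholder for a learned model)
    cost(state) -> float                    (lower is better)
    """
    best_cost, best_action = float("inf"), None
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = dynamics(s, a)  # roll the model forward
            total += cost(s)    # accumulate cost along the predicted trajectory
        if total < best_cost:
            best_cost, best_action = total, actions[0]
    return best_action


# Toy stand-ins: a 1-D state that should reach the goal value 5.0.
def toy_dynamics(s, a):
    return s + a


def toy_cost(s):
    return (s - 5.0) ** 2


action = mpc_random_shooting(0.0, toy_dynamics, toy_cost,
                             horizon=3, n_samples=200, rng=random.Random(0))
```

In a receding-horizon loop, only this first action is executed, the state is re-estimated from new observations, and planning is repeated, which is why long-horizon prediction accuracy of the dynamics model matters.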

[CV-270] Fraesormer: Learning Adaptive Sparse Transformer for Efficient Food Recognition

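[Quick read]: This paper addresses two key challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation caused by interactions with irrelevant tokens in Transformers; (2) static feature recognition and single-scale representation, which ignore the unstructured, non-fixed nature of food images and the need for multi-scale features. It proposes Fraesormer, an adaptive and efficient sparse Transformer architecture with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and the Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores and filter out low query-key matches that hinder feature aggregation, and introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN uses a gating mechanism to obtain multi-scale feature representations and enrich contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2503.11995
Authors: Shun Zou,Yi Zou,Mingya Zhang,Shipeng Luo,Zhihao Chen,Guangwei Gao
Affiliations: Nanjing Agricultural University; Xiangtan University; Nanjing University; Northeast Forestry University; Beijing Information Science and Technology University; Nanjing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures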

Abstract:In recent years, Transformers have made significant progress in food recognition. However, most existing approaches still face two critical challenges in lightweight food recognition: (1) the quadratic complexity and redundant feature representation from interactions with irrelevant tokens; (2) static feature recognition and single-scale representation, which overlook the unstructured, non-fixed nature of food images and the need for multi-scale features. To address these, we propose an adaptive and efficient sparse Transformer architecture (Fraesormer) with two core designs: Adaptive Top-k Sparse Partial Attention (ATK-SPA) and Hierarchical Scale-Sensitive Feature Gating Network (HSSFGN). ATK-SPA uses a learnable Gated Dynamic Top-K Operator (GDTKO) to retain critical attention scores, filtering low query-key matches that hinder feature aggregation. It also introduces a partial channel mechanism to reduce redundancy and promote expert information flow, enabling local-global collaborative modeling. HSSFGN employs a gating mechanism to achieve multi-scale feature representation, enhancing contextual semantic information. Extensive experiments show that Fraesormer outperforms state-of-the-art methods. Code is available at this https URL.
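
The core of top-k sparse attention can be sketched as follows: for each query, only the k largest query-key scores are kept and the softmax is taken over those alone. This sketch uses a fixed `top_k` in place of the learnable gated operator (GDTKO) and omits the partial channel mechanism, so it illustrates the general technique rather than ATK-SPA itself.

```python
import math


def topk_sparse_attention(q, k, v, top_k):
    """Scaled dot-product attention that keeps only the top_k scores per query.

    q: list of query vectors, k: list of key vectors, v: list of value vectors.
    Scores outside the top_k are dropped before the softmax.
    """
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_k]
        m = max(scores[j] for j in keep)  # subtract the max for numerical stability
        w = {j: math.exp(scores[j] - m) for j in keep}
        z = sum(w.values())
        out.append([sum(w[j] / z * v[j][c] for j in keep) for c in range(len(v[0]))])
    return out


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
sparse = topk_sparse_attention(queries, keys, values, top_k=1)
dense = topk_sparse_attention(queries, keys, values, top_k=3)
```

With `top_k=1` the query attends only to its best-matching key, whereas `top_k=3` reduces to ordinary dense attention over all three keys.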

[CV-271] DecompDreamer: Advancing Structured 3D Asset Generation with Multi-Object Decomposition and Gaussian Splatting

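[Quick read]: This paper addresses the difficulty that existing text-to-3D generation techniques have with compositional prompts describing multiple objects and their spatial relationships; such methods often fail to capture fine-grained inter-object interactions. The proposed DecompDreamer is a Gaussian splatting-based training routine that uses Vision-Language Models (VLMs) to decompose a scene into structured components and their relationships, and applies a progressive optimization strategy that first prioritizes joint relationship modeling and then gradually shifts toward targeted object refinement. The approach generates intricate, high-quality 3D compositions with superior object disentanglement, offering more control and flexibility in 3D generation.

Link: https://arxiv.org/abs/2503.11981
Authors: Utkarsh Nath,Rajeev Goel,Rahul Khurana,Kyle Min,Mark Ollila,Pavan Turaga,Varun Jampani,Tejaswi Gowda
Affiliations: Arizona State University; Intel Labs; Stability AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: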

Abstract:Text-to-3D generation has seen dramatic advances in recent years by leveraging Text-to-Image models. However, most existing techniques struggle with compositional prompts, which describe multiple objects and their spatial relationships. They often fail to capture fine-grained inter-object interactions. We introduce DecompDreamer, a Gaussian splatting-based training routine designed to generate high-quality 3D compositions from such complex prompts. DecompDreamer leverages Vision-Language Models (VLMs) to decompose scenes into structured components and their relationships. We propose a progressive optimization strategy that first prioritizes joint relationship modeling before gradually shifting toward targeted object refinement. Our qualitative and quantitative evaluations against state-of-the-art text-to-3D models demonstrate that DecompDreamer effectively generates intricate 3D compositions with superior object disentanglement, offering enhanced control and flexibility in 3D generation. Project page: this https URL

[CV-272] DynaGSLAM: Real-Time Gaussian-Splatting SLAM for Online Rendering Tracking Motion Predictions of Moving Objects in Dynamic Scenes

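[Quick read]: This paper addresses Gaussian Splatting (GS) based Simultaneous Localization and Mapping (SLAM) in dynamic scenes. GS-SLAM systems built on a static-scene assumption fail when moving objects violate the static assumption of bundle adjustment, and failed updates of moving Gaussians contaminate the whole map. The proposed DynaGSLAM is a real-time GS-SLAM system that achieves high-quality online GS rendering, tracking, and motion prediction of moving objects in dynamic scenes while jointly estimating accurate ego motion. By modeling dynamic objects rather than only detecting and removing them, it avoids the limitation of "anti-dynamic" GS-SLAM methods, in which only the static background benefits from GS, and it improves performance while remaining efficient in speed and memory.

Link: https://arxiv.org/abs/2503.11979
Authors: Runfa Blark Li,Mahdi Shaghaghi,Keito Suzuki,Xinshuang Liu,Varun Moparthi,Bang Du,Walker Curtis,Martin Renschler,Ki Myung Brian Lee,Nikolay Atanasov,Truong Nguyen
Affiliations: UC San Diego; Qualcomm XR Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: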

Abstract:Simultaneous Localization and Mapping (SLAM) is one of the most important environment-perception and navigation algorithms for computer vision, robotics, and autonomous cars/drones. Hence, high-quality and fast mapping becomes a fundamental problem. With the advent of 3D Gaussian Splatting (3DGS) as an explicit representation with excellent rendering quality and speed, state-of-the-art (SOTA) works introduce GS to SLAM. Compared to classical pointcloud-SLAM, GS-SLAM generates photometric information by learning from input camera views and synthesizes unseen views with high-quality textures. However, these GS-SLAM methods fail when moving objects occupy the scene and violate the static assumption of bundle adjustment. The failed updates of moving GS affect the static GS and contaminate the full map over long frames. Although some efforts have been made by concurrent works to consider moving objects for GS-SLAM, they simply detect and remove the moving regions from GS rendering ("anti" dynamic GS-SLAM), where only the static background could benefit from GS. To this end, we propose the first real-time GS-SLAM, "DynaGSLAM", that achieves high-quality online GS rendering, tracking, and motion prediction of moving objects in dynamic scenes while jointly estimating accurate ego motion. Our DynaGSLAM outperforms SOTA static and "anti" dynamic GS-SLAM on three dynamic real datasets, while keeping speed and memory efficiency in practice.

[CV-273] Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars

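[Quick read]: This paper addresses limitations of existing personalized avatar platforms: restricted expressivity due to predefined assets, tedious customization, and inefficient rendering. It introduces Snapmoji, an avatar generation system whose core is Gaussian Domain Adaptation (GDA). GDA is pre-trained on large-scale Gaussian models using 3D data from sources such as Objaverse and fine-tuned with 2D style transfer tasks, which gives it a rich 3D prior. This lets Snapmoji quickly turn a selfie into an animatable, dual-stylized avatar while preserving the user’s identity and the integrity of the primary style, with support for dynamic facial expression transfer. Snapmoji converts a selfie to an avatar in 0.9 seconds and supports real-time interaction on mobile devices at 30 to 40 frames per second; the authors report that it outperforms existing methods in versatility and speed.

Link: https://arxiv.org/abs/2503.11978
Authors: Eric M. Chen,Di Liu,Sizhuo Ma,Michael Vasilkovsky,Bing Zhou,Qiang Gao,Wenzhou Wang,Jiahao Luo,Dimitris N. Metaxas,Vincent Sitzmann,Jian Wang
Affiliations: Snap Inc; MIT; Rutgers University; University of California, Santa Cruz
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: N/A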

Abstract:The increasing popularity of personalized avatar systems, such as Snapchat Bitmojis and Apple Memojis, highlights the growing demand for digital self-representation. Despite their widespread use, existing avatar platforms face significant limitations, including restricted expressivity due to predefined assets, tedious customization processes, or inefficient rendering requirements. Addressing these shortcomings, we introduce Snapmoji, an avatar generation system that instantly creates animatable, dual-stylized avatars from a selfie. We propose Gaussian Domain Adaptation (GDA), which is pre-trained on large-scale Gaussian models using 3D data from sources such as Objaverse and fine-tuned with 2D style transfer tasks, endowing it with a rich 3D prior. This enables Snapmoji to transform a selfie into a primary stylized avatar, like the Bitmoji style, and apply a secondary style, such as Plastic Toy or Alien, all while preserving the user’s identity and the primary style’s integrity. Our system is capable of producing 3D Gaussian avatars that support dynamic animation, including accurate facial expression transfer. Designed for efficiency, Snapmoji achieves selfie-to-avatar conversion in just 0.9 seconds and supports real-time interactions on mobile devices at 30 to 40 frames per second. Extensive testing confirms that Snapmoji outperforms existing methods in versatility and speed, making it a convenient tool for automatic avatar creation in various styles.

[CV-274] Evaluation of Intra-operative Patient-specific Methods for Point Cloud Completion for Minimally Invasive Liver Interventions

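[Quick read]: This paper addresses registration between the pre-operative model and the intra-operative point cloud in image-guided liver surgery, which is difficult because the intra-operative liver point cloud has limited coverage and is prone to holes and noise. To mitigate this, the paper evaluates six state-of-the-art point cloud completion methods to identify the best one for liver surgery. It focuses on patient-specific liver point cloud completion under three cases: canonical pose, non-canonical pose, and canonical pose with noise. The transformer-based method AdaPoinTr outperforms the others under the canonical pose and generates a complete liver point cloud; however, all methods degrade substantially under non-canonical poses or noise, which exposes their limitations. The paper therefore argues that a robust point cloud completion method is needed for image-guided liver surgery.

Link: https://arxiv.org/abs/2503.11969
Authors: Nakul Poudel,Zixin Yang,Kelly Merrell,Richard Simon,Cristian A. Linte
Affiliations: Center for Imaging Science, Rochester Institute of Technology; Biomedical Engineering, Rochester Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: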

Abstract:The registration between the pre-operative model and the intra-operative surface is crucial in image-guided liver surgery, as it facilitates the effective use of pre-operative information during the procedure. However, the intra-operative surface, usually represented as a point cloud, often has limited coverage, especially in laparoscopic surgery, and is prone to holes and noise, posing significant challenges for registration methods. Point cloud completion methods have the potential to alleviate these issues. Thus, we explore six state-of-the-art point cloud completion methods to identify the optimal completion method for liver surgery applications. We focus on a patient-specific approach for liver point cloud completion from a partial liver surface under three cases: canonical pose, non-canonical pose, and canonical pose with noise. The transformer-based method, AdaPoinTr, outperforms all other methods to generate a complete point cloud from the given partial liver point cloud under the canonical pose. On the other hand, our findings reveal substantial performance degradation of these methods under non-canonical poses and noisy settings, highlighting the limitations of these methods, which suggests the need for a robust point completion method for its application in image-guided liver surgery.

[CV-275] CHOrD: Generation of Collision-Free House-Scale and Organized Digital Twins for 3D Indoor Scenes with Controllable Floor Plans and Optimal Layouts

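[Quick read]: This paper addresses scalable synthesis of 3D indoor scenes, specifically the generation of house-scale, collision-free, hierarchically structured indoor digital twins. Existing methods synthesize the scene layout directly as a scene graph or object list and struggle to avoid collision artifacts. The proposed framework, CHOrD, introduces a 2D image-based intermediate layout representation in which collision artifacts are captured as out-of-distribution (OOD) scenarios during generation, which allows them to be prevented effectively. CHOrD also supports multi-modal control over complex floor plans and generates coherent, house-wide layouts that are robust to geometric and semantic variations in room structure. The paper further contributes a new dataset with broader coverage of household items and room configurations and significantly improved data quality. CHOrD achieves state-of-the-art performance on 3D-FRONT and on the new dataset, producing photorealistic, spatially coherent indoor scenes adaptable to arbitrary floor plan variations.

Link: https://arxiv.org/abs/2503.11958
Authors: Chong Su,Yingbin Fu,Zheyuan Hu,Jing Yang,Param Hanji,Shaojun Wang,Xuan Zhao,Cengiz Öztireli,Fangcheng Zhong
Affiliations: KE Holdings Inc.; Department of Computer Science and Technology, University of Cambridge
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Chong Su and Yingbin Fu contributed equally to this work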

Abstract:We introduce CHOrD, a novel framework for scalable synthesis of 3D indoor scenes, designed to create house-scale, collision-free, and hierarchically structured indoor digital twins. In contrast to existing methods that directly synthesize the scene layout as a scene graph or object list, CHOrD incorporates a 2D image-based intermediate layout representation, enabling effective prevention of collision artifacts by successfully capturing them as out-of-distribution (OOD) scenarios during generation. Furthermore, unlike existing methods, CHOrD is capable of generating scene layouts that adhere to complex floor plans with multi-modal controls, enabling the creation of coherent, house-wide layouts robust to both geometric and semantic variations in room structures. Additionally, we propose a novel dataset with expanded coverage of household items and room configurations, as well as significantly improved data quality. CHOrD demonstrates state-of-the-art performance on both the 3D-FRONT and our proposed datasets, delivering photorealistic, spatially coherent indoor scene synthesis adaptable to arbitrary floor plan variations.

[CV-276] SPOC: Spatially-Progressing Object State Change Segmentation in Video

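[Quick read]: This paper addresses the lack of fine spatial and temporal detail in how object state changes in video are localized. Existing methods only temporally localize when the object is in its initial state (e.g., the unchopped avocado) versus when it has completed a state change (e.g., the chopped avocado); they provide no detail about the progress of the action or where it occurs. The paper introduces the spatially-progressing object state change segmentation task, whose goal is to segment, at the pixel level, the regions of an object that are actionable and those that are already transformed. The key components are a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a new benchmark, WhereToChange, built on in-the-wild Internet videos. Experiments validate both the difficulty of the task and the promise of the model for localizing exactly where and how fast objects change, and show its usefulness for tracking activity progress for robotic agents.

Link: https://arxiv.org/abs/2503.11953
Authors: Priyanka Mandikal,Tushar Nagarajan,Alex Stoken,Zihui Xue,Kristen Grauman
Affiliations: The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: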

Abstract:Object state changes in video reveal critical information about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., the unchopped avocado) versus when it has completed a state change (e.g., the chopped avocado), which limits applicability for any task requiring detailed information about the progress of the actions and its spatial localization. We propose to deepen the problem by introducing the spatially-progressing object state change segmentation task. The goal is to segment at the pixel-level those regions of an object that are actionable and those that are transformed. We introduce the first model to address this task, designing a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a novel WhereToChange benchmark built on in-the-wild Internet videos. Experiments on two datasets validate both the challenge of the new task as well as the promise of our model for localizing exactly where and how fast objects are changing in video. We further demonstrate useful implications for tracking activity progress to benefit robotic agents. Project page: this https URL

[CV-277] Your Text Encoder Can Be An Object-Level Watermarking Controller

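[Quick read]: This paper addresses copyright protection for AI-generated images with a new invisible watermarking method for text-to-image (T2I) Latent Diffusion Models (LDMs). The key is to fine-tune only the text token embeddings $W_*$, which allows the watermark to be embedded in selected objects or parts of the image rather than the whole image, giving more flexibility than traditional full-image watermarking. Because text encoders are compatible across different LDMs, the method supports plug-and-play integration, and because the watermark is introduced early in the encoding stage, it is more robust to adversarial perturbations in later stages of the pipeline. The method achieves 99% bit accuracy (48 bits) with a $10^5\times$ reduction in model parameters.

Link: https://arxiv.org/abs/2503.11945
Authors: Naresh Kumar Devulapally,Mingzhen Huang,Vishal Asnani,Shruti Agarwal,Siwei Lyu,Vishnu Suresh Lokhande
Affiliations: University at Buffalo, SUNY; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: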

Abstract:Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of T2I Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $W_*$, we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional full-image watermarking. Our method leverages the text encoder’s compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves 99% bit accuracy (48 bits) with a $10^5\times$ reduction in model parameters, enabling efficient watermarking.
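
For readers unfamiliar with the metric, bit accuracy is the fraction of watermark bits that a decoder recovers correctly. The sketch below shows the usual definition on a 48-bit message; the function name `bit_accuracy` and the example bit strings are illustrative assumptions, not data from the paper.

```python
def bit_accuracy(embedded, decoded):
    """Fraction of positions where the decoded bit equals the embedded bit."""
    if len(embedded) != len(decoded):
        raise ValueError("bit strings must have the same length")
    matches = sum(int(a == b) for a, b in zip(embedded, decoded))
    return matches / len(embedded)


# Hypothetical 48-bit watermark and a decoding with a single flipped bit.
message = [1, 0] * 24
recovered = list(message)
recovered[5] ^= 1
accuracy = bit_accuracy(message, recovered)
```

On a 48-bit message a single wrong bit already lowers the accuracy to 47/48, so an average of 99% means that most images decode with no bit errors.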

[CV-278] Integrating Product Coefficients for Improved 3D LiDAR Data Classification ICDM

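[Quick read]: This paper aims to improve classification accuracy for 3D point cloud LiDAR data. LiDAR is an optical remote sensing technique that estimates the three-dimensional coordinates of a given terrain. The key is to introduce product coefficients, theoretical quantities derived from measure theory, as additional features for classification. A comparative study uses the product coefficients alongside principal component analysis (PCA) as feature inputs. Results show that adding product coefficients to the feature set significantly improves classification accuracy within this framework.

Link: https://arxiv.org/abs/2503.11943
Authors: Patricia Medina
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA); Machine Learning (stat.ML)
Comments: 13 pages, 5 figures, to be published in Data Management, Analytics, and Innovation, Proceedings of ICDMAI 2025 by Springer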

Abstract:In this paper, we address the enhancement of classification accuracy for 3D point cloud Lidar data, an optical remote sensing technique that estimates the three-dimensional coordinates of a given terrain. Our approach introduces product coefficients, theoretical quantities derived from measure theory, as additional features in the classification process. We define and present the formulation of these product coefficients and conduct a comparative study, using them alongside principal component analysis (PCA) as feature inputs. Results demonstrate that incorporating product coefficients into the feature set significantly improves classification accuracy within this new framework.

[CV-279] Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

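[Quick read]: This paper addresses precise control of continuous attributes in a new domain (e.g., numeric values such as eye openness or car width), especially several attributes at once, in pretrained diffusion models using text-only guidance. The proposed Attribute (Att) Adapter is a plug-and-play module that uses a decoupled cross-attention module to harmonize the multiple domain attributes with text conditioning, and adds a Conditional Variational Autoencoder (CVAE) to mitigate overfitting and match the diversity of the visual world. The method requires no paired synthetic data and scales easily to multiple attributes within a single model.

Link: https://arxiv.org/abs/2503.11937
Authors: Wonwoong Cho,Yan-Ying Chen,Matthew Klenk,David I. Inouye,Yanxia Zhang
Affiliations: Purdue University; Toyota Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: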

Abstract:Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce Conditional Variational Autoencoder (CVAE) to the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.

[CV-280] Design of an Expression Recognition Solution Employing the Global Channel-Spatial Attention Mechanism

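Quick read: The paper addresses low accuracy in video facial expression recognition caused by subtle expression changes and multi-scale features. It proposes global channel-spatial attention and median-enhanced spatial-channel attention to strengthen feature processing for speech and images, respectively. A key-frame alignment technique between the speech and facial expression modalities computes weights for the two modalities, which are fed into a multi-scale dilated feature fusion layer, improving the recognition rate. The method achieved strong results on the expression recognition task of the 6th ABAW competition, which supports its effectiveness and competitiveness.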

Link: https://arxiv.org/abs/2503.11935
Authors: Jun Yu,Yang Zheng,Lei Wang,Yongqi Wang,Shengfan Xu
Affiliations: University of Science and Technology of China; Macau University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Facial expression recognition is a challenging classification task with broad application prospects in the field of human-computer interaction. This paper aims to introduce the methods of our upcoming 8th Affective Behavior Analysis in the Wild (ABAW) competition to be held at CVPR2025. To address issues such as low recognition accuracy caused by subtle expression changes and multi-scales in facial expression recognition in videos, we propose global channel-spatial attention and median-enhanced spatial-channel attention to strengthen feature processing for speech and images respectively. Secondly, to fully utilize the complementarity between the speech and facial expression modalities, a speech-and-facial-expression key-frame alignment technique is adopted to calculate the weights of speech and facial expressions. These weights are input into the feature fusion layer for multi-scale dilated fusion, which effectively improves the recognition rate of facial expression recognition. In the facial expression recognition task of the 6th ABAW competition, our method achieved excellent results on the official validation set, which fully demonstrates the effectiveness and competitiveness of the proposed method.

[CV-281] SPRINT: Script-agnostic Structure Recognition in Tables ICDAR2024

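Quick read: The paper addresses table structure recognition (TSR) across languages, particularly in non-English documents. Current state-of-the-art methods are mainly optimized for TSR in English documents, and building large labeled datasets and training these models from scratch for other languages is costly and time-consuming. The paper frames TSR as language-agnostic cell arrangement prediction and introduces SPRINT (Script-agnostic Structure Recognition in Tables), a framework that does not depend on a particular script. The key idea is to predict table structure as recently proposed Optimized Table Structure Language (OTSL) sequences. Coupled with a pretrained table grid estimator, SPRINT improves tree-edit-distance-based structure similarity scores, including on non-English documents. SPRINT matches state-of-the-art models on standard datasets with lower latency, and on non-English documents it is more accurate, with an average absolute gain of 11.12%. The paper also provides an algorithm that converts valid OTSL predictions into the widely used HTML table representation, and releases its code together with MUSTARD, a multilingual scanned and scene table structure recognition dataset.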

Link: https://arxiv.org/abs/2503.11932
Authors: Dhruv Kudale,Badri Vishal Kasuba,Venkatapathy Subramanian,Parag Chaudhuri,Ganesh Ramakrishnan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at ICDAR 2024

Abstract:Table Structure Recognition (TSR) is vital for various downstream tasks like information retrieval, table reconstruction, and document understanding. While most state-of-the-art (SOTA) research predominantly focuses on TSR in English documents, the need for similar capabilities in other languages is evident, considering the global diversity of data. Moreover, creating substantial labeled data in non-English languages and training these SOTA models from scratch is costly and time-consuming. We propose TSR as a language-agnostic cell arrangement prediction and introduce SPRINT, Script-agnostic Structure Recognition in Tables. SPRINT uses recently introduced Optimized Table Structure Language (OTSL) sequences to predict table structures. We show that when coupled with a pre-trained table grid estimator, SPRINT can improve the overall tree edit distance-based similarity structure scores of tables even for non-English documents. We experimentally evaluate our performance across benchmark TSR datasets including PubTabNet, FinTabNet, and PubTables-1M. Our findings reveal that SPRINT not only matches SOTA models in performance on standard datasets but also demonstrates lower latency. Additionally, SPRINT excels in accurately identifying table structures in non-English documents, surpassing current leading models by showing an absolute average increase of 11.12%. We also present an algorithm for converting valid OTSL predictions into a widely used HTML-based table representation. To encourage further research, we release our code and Multilingual Scanned and Scene Table Structure Recognition Dataset, MUSTARD labeled with OTSL sequences for 1428 tables in thirteen languages encompassing several scripts at this https URL
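
The OTSL-to-HTML conversion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm; it assumes a hypothetical token vocabulary (`fcel` and `ecel` for cells that start a new table cell, `lcel` for a cell merged with its left neighbour, `ucel` for a cell merged with the one above, `xcel` for the interior of a 2D span, `nl` for end of row) and emits empty `<td>` elements only.

```python
from typing import List


def otsl_to_html(tokens: List[str]) -> str:
    """Convert a rectangular OTSL-style token sequence into an HTML table skeleton.

    Token names are assumptions made for this sketch, not taken from the paper.
    """
    # Split the flat sequence into rows at every "nl" token.
    rows, current = [], []
    for tok in tokens:
        if tok == "nl":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        rows.append(current)
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("a valid sequence must form a non-empty rectangular grid")

    width = len(rows[0])
    parts = ["<table>"]
    for r, row in enumerate(rows):
        parts.append("<tr>")
        for c, tok in enumerate(row):
            if tok not in ("fcel", "ecel"):
                continue  # merged positions are covered by an earlier cell
            # Count left-looking cells to the right to get the column span.
            colspan = 1
            while c + colspan < width and row[c + colspan] == "lcel":
                colspan += 1
            # Count up-looking cells below to get the row span.
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "ucel":
                rowspan += 1
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            parts.append(f"<td{attrs}></td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# A header cell spanning two columns, followed by a row of two ordinary cells.
html = otsl_to_html(["fcel", "lcel", "nl", "fcel", "fcel", "nl"])
```

Because every row has the same number of tokens, spans can be recovered by scanning right and down from each starting cell, which is what keeps the conversion simple.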

[CV-282] Generating a Biometrically Unique and Realistic Iris Database

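Quick read: The paper addresses the difficulty of obtaining real iris image databases because of ethical concerns, with the aim of reducing privacy and security concerns and supporting biometrics research. The key solution is to train a diffusion model within an open-source diffusion framework to generate a database of realistic, biometrically unidentifiable colored iris images. The main contribution is showing that diffusion networks can, with relative ease, produce biometrically unique iris textures and a diverse distribution of realistic iris pigmentation, which suggests new directions for iris database generation and presentation attack security research.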

Link: https://arxiv.org/abs/2503.11930
Authors: Jingxuan Zhang,Robert J. Hart,Ziqian Bi,Shiaofen Fang,Susan Walsh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: for associated iris database, see this https URL

Abstract:The use of the iris as a biometric identifier has increased dramatically over the last 30 years, prompting privacy and security concerns about the use of iris images in research. It can be difficult to acquire iris image databases due to ethical concerns, and this can be a barrier for those performing biometrics research. In this paper, we describe and show how to create a database of realistic, biometrically unidentifiable colored iris images by training a diffusion model within an open-source diffusion framework. Not only were we able to verify that our model is capable of creating iris textures that are biometrically unique from the training data, but we were also able to verify that our model output creates a full distribution of realistic iris pigmentations. We highlight that the ability of diffusion networks to meet these criteria with relative ease warrants additional research into their use for iris database generation and presentation attack security.

[CV-283] k-fold Subsampling based Sequential Backward Feature Elimination

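Quick read: The paper addresses the trade-off between feature selection and classification efficiency in pedestrian detection. The key is a new wrapper-based feature selection algorithm that combines the benefits of filter and wrapper methods. Using k-fold subsampling and sequential backward elimination, it selects an optimal feature vector that represents pedestrian shapes in images well. A standard linear support vector machine (SVM) serves as the classifier, and the algorithm is validated on the public INRIA and ETH datasets. Compared with existing state-of-the-art techniques, the method speeds up SVM detection by over 50% with up to 2% better detection accuracy, and improves detection accuracy by about 9% over the deformable part model approach.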

Link: https://arxiv.org/abs/2503.11919
Authors: Jeonghwan Park,Kang Li,Huiyu Zhou
Affiliations: School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages

Abstract:We present a new wrapper feature selection algorithm for human detection. This algorithm is a hybrid feature selection approach combining the benefits of filter and wrapper methods. It allows the selection of an optimal feature vector that well represents the shapes of the subjects in the images. In detail, the proposed feature selection algorithm adopts the k-fold subsampling and sequential backward elimination approach, while the standard linear support vector machine (SVM) is used as the classifier for human detection. We apply the proposed algorithm to the publicly accessible INRIA and ETH pedestrian full image datasets with the PASCAL VOC evaluation criteria. Compared to other state-of-the-art algorithms, our feature-selection-based approach can improve the detection speed of the SVM classifier by over 50% with up to 2% better detection accuracy. Our algorithm also outperforms the equivalent systems introduced in the deformable part model approach with around 9% improvement in the detection accuracy.
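
As a rough illustration of sequential backward elimination scored by k-fold evaluation, here is a generic sketch rather than the paper's algorithm. It assumes contiguous folds, a nearest-centroid classifier standing in for the linear SVM, and hypothetical helper names (`kfold_score`, `backward_eliminate`); the toy data has one informative feature and two uninformative ones.

```python
from typing import List, Sequence

Row = Sequence[float]


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y, feats) -> float:
    """Accuracy of a nearest-centroid classifier (a stand-in, NOT a linear SVM)."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    correct = 0
    for x, y in zip(test_x, test_y):
        def dist(label):
            return sum((x[f] - c) ** 2 for f, c in zip(feats, centroids[label]))
        correct += int(min(centroids, key=dist) == y)
    return correct / len(test_x)


def kfold_score(xs: List[Row], ys: List[int], feats: List[int], k: int) -> float:
    """Mean held-out accuracy over k contiguous folds using only `feats`."""
    n, fold = len(xs), len(xs) // k
    scores = []
    for i in range(k):
        lo = i * fold
        hi = n if i == k - 1 else (i + 1) * fold
        train = [j for j in range(n) if j < lo or j >= hi]
        test = list(range(lo, hi))
        scores.append(nearest_centroid_accuracy(
            [xs[j] for j in train], [ys[j] for j in train],
            [xs[j] for j in test], [ys[j] for j in test], feats))
    return sum(scores) / k


def backward_eliminate(xs: List[Row], ys: List[int], n_keep: int, k: int = 3) -> List[int]:
    """Repeatedly drop the feature whose removal gives the best k-fold score."""
    feats = list(range(len(xs[0])))
    while len(feats) > n_keep:
        best_score, drop = -1.0, None
        for f in feats:
            score = kfold_score(xs, ys, [g for g in feats if g != f], k)
            if score > best_score:
                best_score, drop = score, f
        feats.remove(drop)
    return feats


# Toy data: feature 0 separates the classes; features 1 and 2 carry no label information.
xs, ys = [], []
for j in range(6):
    noise = [float(j % 3), float((2 * j) % 5)]
    xs.append([0.0] + noise)
    ys.append(0)
    xs.append([10.0] + noise)
    ys.append(1)

selected = backward_eliminate(xs, ys, n_keep=1, k=3)
```

On this toy set the procedure keeps feature 0, since dropping it is the only removal that lowers the cross-validated accuracy.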

[CV-284] A Survey on SAR ship classification using Deep Learning

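Quick read: This survey examines the challenges of applying deep learning (DL) to Synthetic Aperture Radar (SAR) ship classification and identifies the main ways to improve model performance: integrating handcrafted features, using public datasets, data augmentation, fine-tuning, explainability techniques, and interdisciplinary collaboration. It establishes a first-of-its-kind taxonomy based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning, and uses it to analyze the methodologies in SAR ship classification and the effect of different techniques. It also discusses future research directions, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. Addressing these challenges should help researchers build more accurate and efficient ship classification systems and improve maritime surveillance and related applications.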

Link: https://arxiv.org/abs/2503.11906
Authors: Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Emanuele Salerno
Affiliations: University of Pisa; Institute of Information Science and Technologies, National Research Council of Italy; National Biodiversity Future Center - NBFC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Submitted to JSTARS journal

Abstract:Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.

[CV-285] Upcycling Text-to-Image Diffusion Models for Multi-Task Capabilities

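Quick read: Existing text-to-image diffusion models need resource-intensive fine-tuning or extra parameters to support multiple tasks, which limits efficient on-device deployment. The paper proposes Multi-Task Upcycling (MTU), which replaces the standard Feed-Forward Network (FFN) layers in the diffusion model with smaller FFN experts and combines them with a dynamic routing mechanism, extending the model to multiple tasks while avoiding parameter inflation. MTU supports several image-to-image generation tasks, including image editing, super-resolution, and inpainting, with performance and computational cost comparable to single-task fine-tuned models.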

Link: https://arxiv.org/abs/2503.11905
Authors: Ruchika Chavhan,Abhinav Mehrotra,Malcolm Chadwick,Alberto Gil Ramos,Luca Morreale,Mehdi Noroozi,Sourav Bhattacharya
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: Preprint

Abstract:Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adapt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate the new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility, by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.
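
The idea of replacing one FFN with several smaller experts combined by a router can be sketched as follows. This is only an illustration under stated assumptions: the router here is a linear layer on the input followed by a softmax, experts are two-layer ReLU networks without biases, and all names (`Expert`, `routed_ffn`) are hypothetical. The abstract does not specify how MTU's routing is actually conditioned.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Expert:
    """A small two-layer ReLU feed-forward network (bias-free for brevity)."""

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        self.w_in, self.w_out = w_in, w_out

    def __call__(self, x: Vector) -> Vector:
        hidden = [max(0.0, h) for h in linear(self.w_in, x)]
        return linear(self.w_out, hidden)


def routed_ffn(x: Vector, experts: List[Expert], router: Matrix) -> Tuple[Vector, Vector]:
    """Return the gate-weighted sum of expert outputs and the gates themselves."""
    gates = softmax(linear(router, x))  # assumed router: linear layer + softmax
    outputs = [expert(x) for expert in experts]
    mixed = [sum(g * out[i] for g, out in zip(gates, outputs))
             for i in range(len(outputs[0]))]
    return mixed, gates


experts = [
    Expert(w_in=[[1.0, 0.0]], w_out=[[1.0], [0.0]]),
    Expert(w_in=[[0.0, -1.0]], w_out=[[0.0], [1.0]]),
]
# A zero router gives uniform gates, so the output is the mean of the experts.
y, gates = routed_ffn([1.0, -2.0], experts, router=[[0.0, 0.0], [0.0, 0.0]])
```

Each expert has a hidden width of 1 here, so the two experts together hold no more hidden units than a single width-2 FFN, which mirrors the abstract's point about avoiding parameter inflation.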

[CV-286] UStyle: Waterbody Style Transfer of Underwater Scenes by Depth-Guided Feature Synthesis

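Quick read: Waterbody style transfer for underwater images is largely unexplored, and traditional image style transfer methods struggle to preserve object and scene geometry in high-scattering media such as water. Wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts make learning style transfer from unpaired data harder still. The paper proposes UStyle, a data-driven learning framework that transfers waterbody styles across underwater images without reference images or scene information. Its key contribution is a depth-aware whitening and coloring transform (DA-WCT) that integrates physics-based waterbody synthesis to keep stylization perceptually consistent while preserving scene structure. Carefully designed loss functions guide the model to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content consistency in the VGG and CLIP feature spaces. Together these allow UStyle to surpass methods that rely solely on an end-to-end reconstruction loss and provide a robust framework for no-reference underwater image style transfer.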

Link: https://arxiv.org/abs/2503.11893
Authors: Md Abu Bakr Siddique,Junliang Liu,Piyush Singh,Md Jahidul Islam
Affiliations: RoboPI Laboratory, Department of ECE, University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes:

Abstract:The concept of waterbody style transfer remains largely unexplored in the underwater imaging and vision literature. Traditional image style transfer (STx) methods primarily focus on artistic and photorealistic blending, often failing to preserve object and scene geometry in images captured in high-scattering mediums such as underwater. The wavelength-dependent nonlinear attenuation and depth-dependent backscattering artifacts further complicate learning underwater image STx from unpaired data. This paper introduces UStyle, the first data-driven learning framework for transferring waterbody styles across underwater images without requiring prior reference images or scene information. We propose a novel depth-aware whitening and coloring transform (DA-WCT) mechanism that integrates physics-based waterbody synthesis to ensure perceptually consistent stylization while preserving scene structure. To enhance style transfer quality, we incorporate carefully designed loss functions that guide UStyle to maintain colorfulness, lightness, structural integrity, and frequency-domain characteristics, as well as high-level content in VGG and CLIP (contrastive language-image pretraining) feature spaces. By addressing domain-specific challenges, UStyle provides a robust framework for no-reference underwater image STx, surpassing state-of-the-art (SOTA) methods that rely solely on end-to-end reconstruction loss. Furthermore, we introduce the UF7D dataset, a curated collection of high-resolution underwater images spanning seven distinct waterbody styles, establishing a benchmark to support future research in underwater image STx. The UStyle inference pipeline and UF7D dataset are released at: this https URL.

[CV-287] DecAlign: Hierarchical Cross-Modal Alignment for Decoupled Multimodal Representation Learning

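Quick read: The intrinsic heterogeneity of different modalities makes effective cross-modal collaboration and integration difficult in multimodal representation learning. The paper proposes DecAlign, a hierarchical cross-modal alignment framework that decouples multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For heterogeneity, DecAlign uses a prototype-guided optimal transport alignment strategy that combines Gaussian mixture modeling with multi-marginal transport plans, reducing distribution discrepancies while preserving modality-unique characteristics. For homogeneity, it enforces semantic consistency by matching latent distributions across modalities with Maximum Mean Discrepancy regularization. A multimodal transformer further strengthens high-level semantic feature fusion and reduces cross-modal inconsistencies. Experiments show that DecAlign consistently outperforms existing state-of-the-art methods across five metrics, improving cross-modal alignment and semantic consistency while preserving modality-unique features.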

Link: https://arxiv.org/abs/2503.11892
Authors: Chengxuan Qian,Shuo Xing,Shawn Li,Yue Zhao,Zhengzhong Tu
Affiliations: Texas A&M University; University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project website: this https URL

Abstract:Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges to achieve effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging Gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by aligning latent distribution matching with Maximum Mean Discrepancy regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, thereby further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in achieving superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning scenarios. Our project page is at this https URL and the code is available at this https URL.
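
For readers unfamiliar with Maximum Mean Discrepancy, the following minimal sketch computes the standard biased estimate of squared MMD with a Gaussian kernel between two sets of feature vectors. It illustrates the general quantity only; the kernel choice, the bandwidth `sigma`, and the function names are assumptions, and how DecAlign applies the regularizer is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, sigma: float) -> float:
    """RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mean_kernel(xs: List[Vector], ys: List[Vector], sigma: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: List[Vector], ys: List[Vector], sigma: float = 1.0) -> float:
    """Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


modality_a = [[0.0], [1.0]]
modality_b = [[5.0], [6.0]]
same = mmd_squared(modality_a, modality_a)      # identical samples
shifted = mmd_squared(modality_a, modality_b)   # clearly separated samples
```

The estimate is zero for identical sample sets and grows as the two distributions move apart, which is why minimizing it pulls latent distributions of different modalities together.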

[CV-288] Towards a Unified Copernicus Foundation Model for Earth Vision

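Quick read: Existing Earth observation (EO) foundation models are limited to fixed spectral sensors, focus only on the Earth's surface, and overlook valuable metadata beyond imagery. The paper proposes three key components: 1) Copernicus-Pretrain, a massive pretraining dataset of 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth's surface to its atmosphere; 2) Copernicus-FM, a unified foundation model that handles any spectral or non-spectral sensor modality through extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Together these improve the scalability, versatility, and multimodal adaptability of EO foundation models and open new opportunities to connect EO, weather, and climate research.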

Link: https://arxiv.org/abs/2503.11849
Authors: Yi Wang,Zhitong Xiong,Chenying Liu,Adam J. Stewart,Thomas Dujardin,Nikolaos Ioannis Bountos,Angelos Zavras,Franziska Gerken,Ioannis Papoutsis,Laura Leal-Taixé,Xiao Xiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning; National Technical University of Athens & National Observatory of Athens; Harokopio University of Athens; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 31 pages, 32 figures

Abstract:Advances in Earth observation (EO) foundation models have unlocked the potential of big satellite data to learn generic representations from space, benefiting a wide range of downstream applications crucial to our planet. However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth’s surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth’s surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. Our dataset, model, and benchmark greatly improve the scalability, versatility, and multimodal adaptability of EO foundation models, while also creating new opportunities to connect EO, weather, and climate research. Codes, datasets and models are available at this https URL.

[CV-289] How Can Time Series Analysis Benefit From Multiple Modalities? A Survey and Outlook

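Quick read: This survey explores the emerging research direction of Multiple Modalities for Time Series Analysis (MM4TSA) and systematically summarizes its potential value and future opportunities. It examines how other modalities (such as text, images, audio, and tables) can improve time series analysis (TSA) from three perspectives: (1) reusing existing foundation models for efficient TSA; (2) multimodal extension for enhanced TSA; and (3) cross-modality interaction for advanced TSA. The core of the analysis is the complementarity and connections between modalities, together with the gaps in current research and future directions: choosing which modalities to reuse, combining heterogeneous modalities, and generalizing to unseen tasks.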

Link: https://arxiv.org/abs/2503.11835
Authors: Haoxin Liu,Harshavardhan Kamarthi,Zhiyuan Zhao,Shangqing Xu,Shiyu Wang,Qingsong Wen,Tom Hartvigsen,Fei Wang,B. Aditya Prakash
Affiliations: Georgia Institute of Technology; Bytedance Inc.; Squirrel AI; The University of Virginia; Cornell University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Time series analysis (TSA) is a longstanding research topic in the data mining community and has wide real-world significance. Compared to “richer” modalities such as language and vision, which have recently experienced explosive development and are densely connected, the time-series modality remains relatively underexplored and isolated. We notice that many recent TSA works have formed a new research field, i.e., Multiple Modalities for TSA (MM4TSA). In general, these MM4TSA works follow a common motivation: how TSA can benefit from multiple modalities. This survey is the first to offer a comprehensive review and a detailed outlook for this emerging field. Specifically, we systematically discuss three benefits: (1) reusing foundation models of other modalities for efficient TSA, (2) multimodal extension for enhanced TSA, and (3) cross-modality interaction for advanced TSA. We further group the works by the introduced modality type, including text, images, audio, tables, and others, within each perspective. Finally, we identify the gaps with future opportunities, including the reused modalities selections, heterogeneous modality combinations, and unseen tasks generalizations, corresponding to the three benefits. We release an up-to-date GitHub repository that includes key papers and resources.

[CV-290] Mitigating Bad Ground Truth in Supervised Machine Learning based Crop Classification: A Multi-Level Framework with Sentinel-2 Images

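Quick read: Machine-learning-based crop classification in agricultural management depends on Ground Truth (GT) data that often contains errors such as crop mislabeling and incorrect land identification. The paper proposes a multi-level GT cleaning framework validated with multi-temporal Sentinel-2 data. The framework generates embeddings for farmland, clusters similar crop profiles, and identifies outliers that indicate GT errors; False Colour Composite (FCC) checks combined with distance-based metrics are used to scale and automate the verification. Training on cleaned GT data markedly improves classification performance; for example, a Random Forest model gains up to 70 percentage points in F1 score. The approach advances crop classification methodology and has potential applications in loan underwriting and agricultural decision-making.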

Link: https://arxiv.org/abs/2503.11807
Authors: Sanayya A,Amoolya Shetty,Abhijeet Sharma,Venkatesh Ravichandran,Masthan Wali Gosuvarapalli,Sarthak Jain,Priyamvada Nanjundiah,Ujjal Kr Dutta,Divya Sharma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted In IEEE India Geoscience and Remote Sensing Symposium (InGARSS) 2024

Abstract:In agricultural management, precise Ground Truth (GT) data is crucial for accurate Machine Learning (ML) based crop classification. Yet, issues like crop mislabeling and incorrect land identification are common. We propose a multi-level GT cleaning framework while utilizing multi-temporal Sentinel-2 data to address these issues. Specifically, this framework generates embeddings for farmland, clusters similar crop profiles, and identifies outliers indicating GT errors. We validated clusters with False Colour Composite (FCC) checks and used distance-based metrics to scale and automate this verification process. The importance of cleaning the GT data became apparent when the models were trained on the clean and unclean data. For instance, when we trained a Random Forest model with the clean GT data, we achieved an F1 score up to 70 absolute percentage points higher. This approach advances crop classification methodologies, with potential for applications towards improving loan underwriting and agricultural decision-making.
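
A distance-based check of the kind the abstract describes can be sketched very simply. The rule below is a hypothetical stand-in, not the paper's framework: it assumes farmland embeddings are already available, uses per-label centroids in place of a clustering step, and flags any sample that lies closer to another label's centroid than to its own. The crop names and coordinates are made-up toy values.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flag_suspect_labels(embeddings: List[Vector], labels: List[str]) -> List[int]:
    """Return indices whose nearest label centroid differs from their GT label."""
    groups: Dict[str, List[Vector]] = {}
    for emb, lab in zip(embeddings, labels):
        groups.setdefault(lab, []).append(emb)
    centroids = {lab: centroid(pts) for lab, pts in groups.items()}

    suspects = []
    for i, (emb, lab) in enumerate(zip(embeddings, labels)):
        nearest = min(centroids, key=lambda name: sq_dist(emb, centroids[name]))
        if nearest != lab:
            suspects.append(i)
    return suspects


# Toy embeddings: the fourth "wheat" field sits inside the "rice" cluster.
embeddings = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0],
              [10.0, 10.0], [10.0, 9.0], [9.0, 10.0]]
labels = ["wheat", "wheat", "wheat", "wheat", "rice", "rice", "rice"]
suspects = flag_suspect_labels(embeddings, labels)
```

Flagged indices would then go to a manual or FCC-based review step rather than being relabeled automatically, since the centroids themselves are computed from possibly noisy labels.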

[CV-291] Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

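Quick read: The paper proposes a human-in-the-loop approach that uses human feedback from an egocentric standpoint to estimate 3D scene layout. It studies the approach through a local correction task in which users identify local errors and prompt the model to correct them automatically. Building on SceneScript, a state-of-the-art framework that estimates 3D scene layout with structured language, the authors frame the problem as "infilling", a task studied in natural language processing. They train a multi-task version of SceneScript that maintains global prediction performance while significantly improving local correction ability, and integrate it into a human-in-the-loop system in which users iteratively refine layout estimates through a low-friction "one-click fix" workflow. The key is combining infilling with multi-task learning, which enables more accurate modeling of complex layouts and lets the final refined layout diverge from the training distribution.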

Link: https://arxiv.org/abs/2503.11806
Authors: Christopher Xie,Armen Avetisyan,Henry Howard-Jenkins,Yawar Siddiqui,Julian Straub,Richard Newcombe,Vasileios Balntas,Jakob Engel
Affiliations: Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as “infilling”, a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction “one-click fix” workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.

[CV-292] StyleMorpheus: A Style-Based 3D-Aware Morphable Face Model

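Quick read: Training existing parametric controllable 3D-aware face models depends on large-scale, lab-collected datasets. The paper proposes "StyleMorpheus", the first style-based neural 3D Morphable Face Model (3DMM) trained on in-the-wild images. The key is its style-based design: an auto-encoder structure learns a disentangled representation, and shape- and appearance-related style codes improve disentanglement in the sub-modules of the network. The decoder is then fine-tuned with style-based generative adversarial learning to reach photorealistic 3D rendering quality. This design keeps the disentangled controllability of a 3DMM without needing accurately reconstructed explicit 3D shapes, achieves state-of-the-art 3D-aware face reconstruction, and renders in real time, making it suitable for applications such as virtual reality.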

Link: https://arxiv.org/abs/2503.11792
Authors: Peizhi Yan,Rabab K. Ward,Dan Wang,Qiang Tang,Shan Du
Affiliations: Department of Electrical and Computer Engineering, The University of British Columbia; Huawei Technologies Canada; Department of Computer Science, Mathematics, Physics and Statistics, The University of British Columbia (Okanagan)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 13 pages, work was completed in 2023

Abstract:For 3D face modeling, the recently developed 3D-aware neural rendering methods are able to render photorealistic face images with arbitrary viewing directions. The training of the parametric controllable 3D-aware face models, however, still relies on a large-scale dataset that is lab-collected. To address this issue, this paper introduces “StyleMorpheus”, the first style-based neural 3D Morphable Face Model (3DMM) that is trained on in-the-wild images. It inherits 3DMM’s disentangled controllability (over face identity, expression, and appearance) but without the need for accurately reconstructed explicit 3D shapes. StyleMorpheus employs an auto-encoder structure. The encoder aims at learning a representative disentangled parametric code space and the decoder improves the disentanglement using shape and appearance-related style codes in the different sub-modules of the network. Furthermore, we fine-tune the decoder through style-based generative adversarial learning to achieve photorealistic 3D rendering quality. The proposed style-based design enables StyleMorpheus to achieve state-of-the-art 3D-aware face reconstruction results, while also allowing disentangled control of the reconstructed face. Our model achieves real-time rendering speed, allowing its use in virtual reality applications. We also demonstrate the capability of the proposed style-based design in face editing applications such as style mixing and color editing. Project homepage: this https URL.

[CV-293] ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement

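Quick read: Clinical magnetic resonance (MR) images are often acquired as stacks of 2D slices, which shortens scan times, improves signal-to-noise ratio, and provides contrasts unique to 2D sequences. However, automated algorithms designed for 3D analysis perform poorly on such 2D-acquired scans, especially with thick slices and gaps between slices. The paper proposes ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-supervised super-resolution (SR) method that handles slice profile shape estimation, slice gaps, domain shift, and non-integer or arbitrary upsampling factors together. ECLARE estimates the slice profile from the 2D-acquired multi-slice MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing.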

Link: https://arxiv.org/abs/2503.11787
Authors: Samuel W. Remedios,Shuwen Wei,Shuo Han,Jinwei Zhang,Aaron Carass,Kurt G. Schilling,Dzung L. Pham,Jerry L. Prince,Blake E. Dewey
Affiliations: Johns Hopkins University; Vanderbilt University Medical Center; Uniformed Services University; Johns Hopkins School of Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes:

Abstract:In clinical imaging, magnetic resonance (MR) image volumes are often acquired as stacks of 2D slices, permitting decreased scan times, improved signal-to-noise ratio, and image contrasts unique to 2D MR pulse sequences. While this is sufficient for clinical evaluation, automated algorithms designed for 3D analysis perform sub-optimally on 2D-acquired scans, especially those with thick slices and gaps between slices. Super-resolution (SR) methods aim to address this problem, but previous methods do not address all of the following: slice profile shape estimation, slice gap, domain shift, and non-integer / arbitrary upsampling factors. In this paper, we propose ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-SR method that addresses each of these factors. ECLARE estimates the slice profile from the 2D-acquired multi-slice MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing. We compared ECLARE to cubic B-spline interpolation, SMORE, and other contemporary SR methods. We used realistic and representative simulations so that quantitative performance against a ground truth could be computed, and ECLARE outperformed all other methods in both signal recovery and downstream tasks. On real data for which there is no ground truth, ECLARE demonstrated qualitative superiority over other methods as well. Importantly, as ECLARE does not use external training data it cannot suffer from domain shift between training and testing. Our code is open-source and available at this https URL.

[CV-294] Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks

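Quick read: The paper addresses color matching across devices: mapping colors from a source color distribution to a target color distribution, accurately and effectively, in both supervised and unsupervised settings. The proposed framework, cmKAN, builds on the spline capabilities of Kolmogorov-Arnold Networks (KANs) and uses a hypernetwork that generates spatially varying weight maps to control the KAN's nonlinear splines, enabling accurate color matching. The work also introduces a large-scale dataset of paired images for evaluating methods on several color matching tasks, including raw-to-raw mapping between devices and mapping within sRGB space. The method outperforms existing approaches by 37.3% on average across supervised and unsupervised cases while remaining lightweight.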

Link: https://arxiv.org/abs/2503.11781
Authors: Artem Nikonorov,Georgy Perevozchikov,Andrei Korepanov,Nancy Mehta,Mahmoud Afifi,Egor Ershov,Radu Timofte
Affiliations: Samara National Research University; Computer Vision Lab, CAIDAS & IFI, University of Würzburg; York University; Institute for Information Transmission Problems RAS; Moscow Institute of Physics and Technologies; Artificial Intelligence Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We present cmKAN, a versatile framework for color matching. Given an input image with colors from a source color distribution, our method effectively and accurately maps these colors to match a target color distribution in both supervised and unsupervised settings. Our framework leverages the spline capabilities of Kolmogorov-Arnold Networks (KANs) to model the color matching between source and target distributions. Specifically, we developed a hypernetwork that generates spatially varying weight maps to control the nonlinear splines of a KAN, enabling accurate color matching. As part of this work, we introduce a first large-scale dataset of paired images captured by two distinct cameras and evaluate the efficacy of our and existing methods in matching colors. We evaluated our approach across various color-matching tasks, including: (1) raw-to-raw mapping, where the source color distribution is in one camera’s raw color space and the target in another camera’s raw space; (2) raw-to-sRGB mapping, where the source color distribution is in a camera’s raw space and the target is in the display sRGB space, emulating the color rendering of a camera ISP; and (3) sRGB-to-sRGB mapping, where the goal is to transfer colors from a source sRGB space (e.g., produced by a source camera ISP) to a target sRGB space (e.g., from a different camera ISP). The results show that our method outperforms existing approaches by 37.3% on average for supervised and unsupervised cases while remaining lightweight compared to other methods. The codes, dataset, and pre-trained models are available at: this https URL

[CV-295] Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

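[Quick read]: This paper addresses the drop in feature extraction capability caused by insufficient mono-modality learning in multi-modal object detection (MMOD), and the resulting Fusion Degradation phenomenon, which hinders performance gains in multi-modal detectors. To tackle this, the paper introduces linear probing evaluation and revisits MMOD from the perspective of mono-modality learning. The key to the solution is a new framework, M²D-LIF, built around the Mono-Modality Distillation (M²D) method and the Local Illumination-aware Fusion (LIF) module. M²D promotes sufficient learning of mono-modality features during multi-modal joint training, while LIF explores a lightweight yet effective way to fuse features. Experiments show that the framework mitigates Fusion Degradation and outperforms prior state-of-the-art detectors on three multi-modal datasets.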

Link: https://arxiv.org/abs/2503.11780
Authors: Tianyi Zhao,Boyang Liu,Yanglei Gao,Yiming Sun,Maoxun Yuan,Xingxing Wei
Affiliations: Beihang University; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures

Abstract:Multi-Modal Object Detection (MMOD), owing to its stronger adaptability to complex environments, has been widely applied in various applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, this research neglects the problem of insufficient mono-modality learning, that is, the decreased feature extraction capability of each modality under multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon, Fusion Degradation, which hinders the performance improvement of MMOD models. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. We then construct a novel framework called M²D-LIF, which consists of the Mono-Modality Distillation (M²D) method and the Local Illumination-aware Fusion (LIF) module. The M²D-LIF framework facilitates the sufficient learning of mono-modality features during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M²D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.

[CV-296] Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization

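[Quick read]: This paper addresses a limitation of existing adversarial jailbreak attacks: not every adversarial optimization step yields a positive outcome, and indiscriminately accepting the result of every step can lower the overall attack success rate. To address this, the paper proposes HKVE (Hierarchical Key-Value Equalization), a jailbreaking framework whose key idea is to selectively accept gradient optimization results based on the distribution of attention scores across layers, so that each optimization step contributes positively to the attack. Experiments show that HKVE reaches attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA, and 81.00% on Qwen-VL, substantially outperforming existing methods. The approach also reduces the number of iterations required, lowering computational cost.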

Link: https://arxiv.org/abs/2503.11750
Authors: Shuyang Hao,Yiwei Wang,Bryan Hooi,Jun Liu,Muhao Chen,Zi Huang,Yujun Cai
Affiliations: Southeast University; University of California, Merced; National University of Singapore; Lancaster University; University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE’s significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43%, 21.01% and 26.43% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs. Warning: This paper contains potentially harmful example data.

[CV-297] Safe Vision-Language Models via Unsafe Weights Manipulation

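[Quick read]: This paper addresses a limitation in how the safety of vision-language models (VLMs) is evaluated: existing evaluations focus on how safe a model is on unsafe inputs and overlook potential shortcomings on safe ones. By introducing SafeGround, a new set of metrics, the paper shows that training-based methods can make a model behave worse on safe inputs. To address this, it proposes Unsafe Weights Manipulation (UWM), whose key idea is to use a calibration set of labeled safe and unsafe instances, compare activations between safe and unsafe content to identify the parameters most important for processing unsafe content, and negate their values to improve safety while preserving the model's knowledge. Experiments show that UWM achieves the best trade-off between safety and knowledge preservation, improving VLMs on unsafe queries while also performing strongly on safe inputs.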

Link: https://arxiv.org/abs/2503.11742
Authors: Moreno D’Incà,Elia Peruzzo,Xingqian Xu,Humphrey Shi,Nicu Sebe,Massimiliano Mancini
Affiliations: University of Trento; SHI Labs @ Georgia Tech & UIUC; Picsart AI Research (PAIR)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Vision-language models (VLMs) often inherit the biases and unsafe associations present within their large-scale training dataset. While recent approaches mitigate unsafe behaviors, their evaluation focuses on how safe the model is on unsafe inputs, ignoring potential shortcomings on safe ones. In this paper, we first revise safety evaluation by introducing SafeGround, a new set of metrics that evaluate safety at different levels of granularity. With this metric, we uncover a surprising issue of training-based methods: they make the model less safe on safe inputs. From this finding, we take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter. Their values are then manipulated via negation. Experiments show that UWM achieves the best tradeoff between safety and knowledge preservation, consistently improving VLMs on unsafe queries while outperforming even training-based state-of-the-art methods on safe ones.
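The "score parameters by activation differences, then negate" idea can be sketched on a single linear layer. This is only an illustration under stated assumptions: the scoring rule (weight magnitude times the activation gap between unsafe and safe calibration inputs), the `fraction` parameter, and the function name are hypothetical and are not taken from the paper.

```python
import numpy as np

def negate_unsafe_weights(weight, safe_inputs, unsafe_inputs, fraction=0.01):
    """Return a copy of `weight` (out_dim x in_dim) with the top-scoring entries negated.

    Assumed score (for illustration only): |weight| times how much more strongly
    the corresponding input feature is activated by unsafe than by safe samples.
    """
    safe_mean = np.abs(safe_inputs).mean(axis=0)       # (in_dim,)
    unsafe_mean = np.abs(unsafe_inputs).mean(axis=0)   # (in_dim,)
    gap = np.clip(unsafe_mean - safe_mean, 0.0, None)  # keep only "more unsafe" features
    scores = np.abs(weight) * gap[None, :]             # (out_dim, in_dim)

    k = max(1, int(round(fraction * weight.size)))
    top = np.argpartition(scores.ravel(), -k)[-k:]     # flat indices of the k largest scores

    edited = weight.copy()
    flat = edited.reshape(-1)                          # view onto the contiguous copy
    flat[top] *= -1.0                                  # manipulation via negation
    return edited, top
```

No training step is involved here: only a forward pass over the calibration set is needed to collect the input activations.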

[CV-298] Industrial-Grade Sensor Simulation via Gaussian Splatting: A Modular Framework for Scalable Editing and Full-Stack Validation

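[Quick read]: This paper addresses the applicability and efficiency challenges that Neural Radiance Fields (NeRF) based sensor simulation faces in industrial workflows. The key solution is a Gaussian Splatting (GS) based system that refactors three crucial components to exploit GS's explicit scene representation and real-time rendering: (1) a 2D neural Gaussian representation for physics-compliant scene and sensor modeling; (2) a scene editing pipeline that uses a library of Gaussian primitives for data augmentation; and (3) a controllable diffusion model for scene expansion and harmonization. These changes reduce frame-wise simulation latency, improve geometric and photometric consistency, and enable interpretable, explicit scene editing and expansion. The paper also demonstrates the system in full-stack testing of end-to-end autonomous driving algorithms.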

Link: https://arxiv.org/abs/2503.11731
Authors: Xianming Zeng,Sicong Du,Qifeng Chen,Lizhe Liu,Haoyu Shu,Jiaxuan Gao,Jiarun Liu,Jiulong Xu,Jianyun Xu,Mingxia Chen,Yiru Zhao,Peng Chen,Yapeng Xue,Chunming Zhao,Sheng Yang,Qiang Li
Affiliations: Unmanned Vehicle Department of CaiNiao Inc., Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Sensor simulation is pivotal for scalable validation of autonomous driving systems, yet existing Neural Radiance Fields (NeRF) based methods face applicability and efficiency challenges in industrial workflows. This paper introduces a Gaussian Splatting (GS) based system to address these challenges: We first break down sensor simulator components and analyze the possible advantages of GS over NeRF. Then in practice, we refactor three crucial components through GS, to leverage its explicit scene representation and real-time rendering: (1) choosing the 2D neural Gaussian representation for physics-compliant scene and sensor modeling, (2) proposing a scene editing pipeline to leverage Gaussian primitives library for data augmentation, and (3) coupling a controllable diffusion model for scene expansion and harmonization. We implement this framework on a proprietary autonomous driving dataset supporting cameras and LiDAR sensors. We demonstrate through ablation studies that our approach reduces frame-wise simulation latency, achieves better geometric and photometric consistency, and enables interpretable explicit scene editing and expansion. Furthermore, we showcase how integrating such a GS-based sensor simulator with traffic and dynamic simulators enables full-stack testing of end-to-end autonomy algorithms. Our work provides both algorithmic insights and practical validation, establishing GS as a cornerstone for industrial-grade sensor simulation.

[CV-299] FloPE: Flower Pose Estimation for Precision Pollination IROS2025

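[Quick read]: This paper addresses the challenges of flower pose estimation in robotic pollination systems, which have been proposed to supplement natural pollination as natural pollinator populations decline. The challenges include the natural variability of flowers, the presence of flower clusters, and the high accuracy required because flowers are fragile. The key contribution is Flower Pose Estimation (FloPE), a real-time framework that uses 3D Gaussian Splatting to generate photorealistic synthetic datasets with precise pose annotations and applies knowledge distillation to transfer knowledge from a high-capacity teacher model to a lightweight student model for efficient inference. Evaluated on both single-arm and multi-arm robotic platforms, the approach achieves a position error of 0.6 cm and an angular error of 19.14 degrees at low computational cost.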

Link: https://arxiv.org/abs/2503.11692
Authors: Rashik Shrestha,Madhav Rijal,Trevor Smith,Yu Gu
Affiliations: West Virginia University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: IROS2025 under review

Abstract:This study presents Flower Pose Estimation (FloPE), a real-time flower pose estimation framework for computationally constrained robotic pollination systems. Robotic pollination has been proposed to supplement natural pollination to ensure global food security, given the decreased population of natural pollinators. However, flower pose estimation for pollination is challenging because of natural variability, flower clusters, and the high accuracy demanded by the flowers’ fragility during pollination. This method leverages 3D Gaussian Splatting to generate photorealistic synthetic datasets with precise pose annotations, enabling effective knowledge distillation from a high-capacity teacher model to a lightweight student model for efficient inference. The approach was evaluated on both single and multi-arm robotic platforms, achieving a mean pose estimation error of 0.6 cm and 19.14 degrees at a low computational cost. Our experiments validate the effectiveness of FloPE, achieving up to 78.75% pollination success rate and outperforming prior robotic pollination techniques.
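To make the two reported error figures concrete, the sketch below computes a position error and an angular error for one predicted pose. It assumes the flower's orientation is represented by a direction vector (for example the flower's normal); the abstract does not specify the representation or the exact metric definition, so `pose_errors` and its arguments are hypothetical.

```python
import numpy as np

def pose_errors(pred_pos, true_pos, pred_dir, true_dir):
    """Return (position error in the input units, angular error in degrees)."""
    pred_pos = np.asarray(pred_pos, dtype=float)
    true_pos = np.asarray(true_pos, dtype=float)
    position_error = float(np.linalg.norm(pred_pos - true_pos))

    a = np.asarray(pred_dir, dtype=float)
    b = np.asarray(true_dir, dtype=float)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    angle_deg = float(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))
    return position_error, angle_deg
```

Averaging these two quantities over a test set would give mean errors of the kind quoted in the abstract.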

[CV-300] CORDIC Is All You Need

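[Quick read]: This paper addresses the need for adaptable hardware accelerators for efficient AI inference, focusing on energy efficiency and performance in high-throughput, large-scale workloads. Its key contribution is a pipelined architecture built on a Reconfigurable Processing Engine (RPE), whose core is a CORDIC block that performs both linear MAC computations and nonlinear iterative activation functions such as tanh, sigmoid, and softmax. With a 40% pruning rate, the design improves throughput by up to 4.64× and reduces power and area by 5.02× and 4.06× in 28 nm CMOS, with only minor accuracy loss. The FPGA implementation achieves up to 2.5× resource savings and a 3× power reduction compared with prior work. The Systolic CORDIC engine (SYCore) combines an output-stationary dataflow with the CAESAR control engine to support diverse AI workloads such as Transformers, RNNs/LSTMs, and DNNs, for applications including image detection, large language models (LLMs), and speech recognition, offering an energy-efficient and flexible option for edge AI accelerators.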

Link: https://arxiv.org/abs/2503.11685
Authors: Omkar Kokane,Adam Teman,Anushka Jha,Guru Prasath SL,Gopal Raut,Mukul Lokhande,S. V. Jaya Chand,Tanushree Dewangan,Santosh Kumar Vishvakarma
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Artificial intelligence necessitates adaptable hardware accelerators for efficient high-throughput million operations. We present a pipelined architecture with a CORDIC block for linear MAC computations and nonlinear iterative Activation Functions (AF) such as tanh, sigmoid, and softmax. This approach focuses on a Reconfigurable Processing Engine (RPE) based systolic array, with a 40% pruning rate, throughput enhanced by up to 4.64×, and power and area reduced by 5.02× and 4.06× at CMOS 28 nm, with minor accuracy loss. The FPGA implementation achieves up to 2.5× resource savings and a 3× power reduction compared to prior works. The Systolic CORDIC engine for Reconfigurability and Enhanced throughput (SYCore) deploys an output stationary dataflow with the CAESAR control engine for diverse AI workloads such as Transformers, RNNs/LSTMs, and DNNs for applications like image detection, LLMs, and speech recognition. The energy-efficient and flexible approach extends the enhanced approach for edge AI accelerators supporting emerging workloads.
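For readers unfamiliar with how a CORDIC block can perform a multiply-accumulate, here is a minimal floating-point sketch of textbook linear-mode CORDIC, which approximates `acc + x * z` using only additions and power-of-two scalings. It is not the paper's RPE design: the function names, the iteration count, and the use of floats instead of fixed-point hardware arithmetic are assumptions made for illustration.

```python
def cordic_mac(acc, x, z, iterations=24):
    """Approximate acc + x * z with linear-mode CORDIC.

    Starting the shifts at 2**0 limits the valid multiplier range to roughly |z| <= 2.
    """
    y = acc
    for i in range(iterations):
        step = 2.0 ** -i      # in hardware this is a right shift by i bits
        if z >= 0:
            y += x * step     # add the shifted multiplicand
            z -= step         # drive the residual multiplier toward zero
        else:
            y -= x * step
            z += step
    return y


def cordic_dot(xs, ws, iterations=24):
    """Dot product built by chaining CORDIC multiply-accumulate steps."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = cordic_mac(acc, x, w, iterations)
    return acc
```

Fewer iterations trade accuracy for latency: after `n` iterations the residual multiplier is bounded by about `2**-(n-1)`, so the error per MAC is at most roughly `|x| * 2**-(n-1)`.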

[CV-301] Simulation of prosthetic vision with PRIMA system and enhancement of face representation

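[Quick read]: This paper addresses the difficulty that patients with geographic atrophy who are implanted with the PRIMA photovoltaic subretinal prosthesis have in perceiving faces. These patients achieve an average acuity matching the 100 µm pixel size and can read and write, yet face perception remains a notable challenge. The paper proposes a new non-pixelated algorithm that simulates prosthetic vision as PRIMA patients experience it, compares its predictions with clinical perceptual outcomes, and introduces computer vision and machine learning methods to improve face representation.

The key to the solution is a simulation algorithm that integrates a grayscale filter, a spatial resolution filter, and a contrast filter, reflecting the limited sampling density of the retinal implant and the reduced contrast sensitivity of prosthetic vision. In addition, applying a machine learning facial landmarking model and contrast-adjusting tone curves to the face image before it is projected onto the implant recovers facial features that are otherwise lost, improving face representation. The results show that the simulation matches the maximum letter acuity observed in clinical studies and that the inverse contrast filter helps preserve contrast in prosthetic vision. Overall, machine learning methods and contrast adjustment mitigate some of the spatial and contrast limitations of prosthetic vision and improve face representation.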

Link: https://arxiv.org/abs/2503.11677
Authors: Jungyeon Park,Anna Kochnev Goldstein,Yueming Zhou,Nathan Jensen,Daniel Palanker
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Objective. Patients implanted with the PRIMA photovoltaic subretinal prosthesis in geographic atrophy report form vision with the average acuity matching the 100 µm pixel size. Although this remarkable outcome enables them to read and write, they report difficulty with perceiving faces. This paper provides a novel, non-pixelated algorithm for simulating prosthetic vision the way it is experienced by PRIMA patients, compares the algorithm’s predictions to clinical perceptual outcomes, and offers computer vision and machine learning (ML) methods to improve face representation. Approach. Our simulation algorithm integrates a grayscale filter, spatial resolution filter, and contrast filter. This accounts for the limited sampling density of the retinal implant, as well as the reduced contrast sensitivity of prosthetic vision. Patterns of Landolt C and faces created using this simulation algorithm are compared to reports from actual PRIMA users. To recover the facial features lost in prosthetic vision, we apply an ML facial landmarking model as well as contrast-adjusting tone curves to the face image prior to its projection onto the implant. Main results. Simulated prosthetic vision matches the maximum letter acuity observed in clinical studies as well as patients’ subjective descriptions. Application of the inverse contrast filter helps preserve the contrast in prosthetic vision. Identifying the facial features using an ML facial landmarking model and accentuating them further improves face representation. Significance. Spatial and contrast constraints of prosthetic vision limit resolvable features and degrade natural images. ML-based methods and contrast adjustments mitigate some limitations and improve face representation. Even though higher spatial resolution can be expected with implants having smaller pixels, contrast enhancement still remains essential for face recognition.
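The three-stage simulation (grayscale, spatial resolution, contrast) can be sketched as a simple image pipeline. This is a rough illustration rather than the authors' algorithm: the Gaussian low-pass as the spatial resolution filter, the linear contrast compression around the mean, and the parameters `sigma_px` and `contrast` are all assumptions, and the input is assumed to be an RGB array with values in [0, 1].

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def simulate_prosthetic_view(rgb, sigma_px=4.0, contrast=0.5):
    """Grayscale -> low-pass blur -> contrast compression, for an H x W x 3 image in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # grayscale filter (luma weights)

    kernel = gaussian_kernel(sigma_px)             # spatial resolution filter (assumed Gaussian)
    pad = len(kernel) // 2
    padded = np.pad(gray, pad, mode="edge")
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, blurred)

    mean = blurred.mean()                          # contrast filter: shrink deviations from the mean
    return np.clip(mean + contrast * (blurred - mean), 0.0, 1.0)
```

A blur is used instead of block averaging so that the output stays non-pixelated, in the spirit of the description above.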

[CV-302] U2AD: Uncertainty-based Unsupervised Anomaly Detection Framework for Detecting T2 Hyperintensity in MRI Spinal Cord

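[Quick read]: This paper addresses two problems in detecting T2 hyperintensities in spinal cord MRI: the reliance on manual evaluation, and the performance degradation of existing unsupervised anomaly detection (UAD) methods under domain shift. It proposes an Uncertainty-based Unsupervised Anomaly Detection framework (U2AD). The key idea is to follow a "mask-and-reconstruction" paradigm built on a Vision Transformer architecture, training and testing within the same clinical dataset, and to introduce an uncertainty-guided masking strategy that balances the conflicting tasks of normal-region reconstruction and anomaly detection. Specifically, Monte-Carlo sampling is used to estimate reconstruction uncertainty maps, and reconstruction training is optimized under the guidance of both epistemic and aleatoric uncertainty, reducing overall reconstruction variance while highlighting anomalous regions. Experiments show that U2AD outperforms existing supervised and unsupervised methods on patient-level identification and segment-level localization, setting a new benchmark for incorporating uncertainty guidance into UAD and demonstrating clinical value in handling domain shift and task conflict.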

Link: https://arxiv.org/abs/2503.13400
Authors: Qi Zhang,Xiuyuan Chen,Ziyi He,Kun Wang,Lianming Wu,Hongxing Shen,Jianqi Sun
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:T2 hyperintensities in spinal cord MR images are crucial biomarkers for conditions such as degenerative cervical myelopathy. However, current clinical diagnoses primarily rely on manual evaluation. Deep learning methods have shown promise in lesion detection, but most supervised approaches are heavily dependent on large, annotated datasets. Unsupervised anomaly detection (UAD) offers a compelling alternative by eliminating the need for abnormal data annotations. However, existing UAD methods rely on curated normal datasets and their performance frequently deteriorates when applied to clinical datasets due to domain shifts. We propose an Uncertainty-based Unsupervised Anomaly Detection framework, termed U2AD, to address these limitations. Unlike traditional methods, U2AD is designed to be trained and tested within the same clinical dataset, following a “mask-and-reconstruction” paradigm built on a Vision Transformer-based architecture. We introduce an uncertainty-guided masking strategy to resolve task conflicts between normal reconstruction and anomaly detection to achieve an optimal balance. Specifically, we employ a Monte-Carlo sampling technique to estimate reconstruction uncertainty mappings during training. By iteratively optimizing reconstruction training under the guidance of both epistemic and aleatoric uncertainty, U2AD reduces overall reconstruction variance while emphasizing anomalous regions. Experimental results demonstrate that U2AD outperforms existing supervised and unsupervised methods in patient-level identification and segment-level localization tasks. This framework establishes a new benchmark for incorporating uncertainty guidance into UAD, highlighting its clinical utility in addressing domain shifts and task conflicts in medical image anomaly detection. Our code is available: this https URL
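The Monte-Carlo step can be illustrated with a small sketch: reconstruct the same image several times under different random masks, then use the per-pixel spread as an uncertainty map and the deviation from the mean reconstruction as an anomaly score. Everything here is an assumption for illustration: `toy_reconstruct` is a stand-in for a trained Vision Transformer, and the single variance map does not separate epistemic from aleatoric uncertainty as the paper does.

```python
import numpy as np

def mc_reconstruction_uncertainty(image, reconstruct, n_samples=16, mask_ratio=0.5, seed=0):
    """Return (mean reconstruction, per-pixel variance, per-pixel absolute error)."""
    rng = np.random.default_rng(seed)
    recons = []
    for _ in range(n_samples):
        mask = rng.random(image.shape) < mask_ratio   # True = hidden from the model
        recons.append(reconstruct(image, mask))
    recons = np.stack(recons)
    mean = recons.mean(axis=0)
    variance = recons.var(axis=0)     # spread across samples, used as an uncertainty proxy
    error = np.abs(image - mean)      # large where the image is hard to reconstruct
    return mean, variance, error

def toy_reconstruct(image, mask):
    """Hypothetical stand-in model: fill masked pixels with the mean of the visible ones."""
    out = image.copy()
    visible = image[~mask]
    out[mask] = visible.mean() if visible.size else 0.0
    return out
```

In an uncertainty-guided masking scheme, maps like `variance` could then steer which regions are masked in the next training round; how U2AD does this exactly is not described in the abstract.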

[CV-303] LEAVS: An LLM-based Labeler for Abdominal CT Supervision

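[Quick read]: This paper addresses the extraction of structured labels from abdominal CT radiology reports, which is challenging because of the abdomen's complex anatomy and wide range of pathologies. Existing work focuses mainly on the chest, with little on the abdomen. The paper proposes LEAVS (Large language model Extractor for Abdominal Vision Supervision), a labeler that annotates the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs. The key to the solution is a specialized chain-of-thought prompting strategy that uses sentence extraction and multiple-choice questions within a tree-based decision system, run on a locally hosted large language model. The method reaches an average F1 score of 0.89, significantly outperforming competing labelers and human annotators, and its urgency labels are comparable to human annotations. The labels are also shown to be valuable for training a vision model that classifies multiple organs.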

Link: https://arxiv.org/abs/2503.13330
Authors: Ricardo Bigolin Lanfredi,Yan Zhuang,Mark Finkelstein,Praveen Thoppey Srinivasan Balamuralikrishna,Luke Krembs,Brandon Khoury,Arthi Reddy,Pritam Mukherjee,Neil M. Rofsky,Ronald M. Summers
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Extracting structured labels from radiology reports has been employed to create vision models to simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports, owing to the more complex anatomy and wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally-run LLM using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that extraction of urgency labels achieved performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.
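A tree-based decision system of this kind can be pictured as a short chain of questions where later questions are asked only if earlier answers warrant them. The sketch below is a hypothetical illustration, not the LEAVS prompts: `ask(prompt, choices)` stands for any callable wrapping a locally run LLM that returns one of `choices`, the keyword-based sentence filter replaces whatever extraction step the authors use, and the answer options are invented.

```python
def label_organ(report, organ, finding, ask):
    """Walk a two-level decision tree for one (organ, finding) pair.

    `ask` is a hypothetical callable: ask(prompt, choices) -> one element of `choices`.
    """
    # Step 1: sentence extraction (here a plain keyword filter, as a stand-in).
    sentences = [s.strip() for s in report.split(".") if organ.lower() in s.lower()]
    if not sentences:
        return {"presence": "not mentioned", "urgency": None}
    context = ". ".join(sentences)

    # Step 2: multiple-choice question about presence.
    presence = ask(
        f"Report excerpt: {context}\nIs {finding} present in the {organ}?",
        ["present", "absent", "uncertain"],
    )
    if presence != "present":
        return {"presence": presence, "urgency": None}

    # Step 3: only reached when the finding is present.
    urgency = ask(
        f"Report excerpt: {context}\nHow urgent is the {finding} in the {organ}?",
        ["low", "medium", "high"],
    )
    return {"presence": presence, "urgency": urgency}
```

Restricting each call to a fixed set of choices keeps the model's output easy to parse, and the early exits avoid asking follow-up questions that cannot apply.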

[CV-304] Integrating AI for Human-Centric Breast Cancer Diagnostics: A Multi-Scale and Multi-View Swin Transformer Framework

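[Quick read]: This paper addresses challenges facing computer-aided diagnosis (CAD) systems for breast cancer, in particular their reliance on detailed tumor annotations and their susceptibility to missing views, especially at test time. To address these issues, the paper proposes a hybrid multi-scale, multi-view framework based on the Swin Transformer (MSMV-Swin). The key is to use the Segment Anything Model (SAM) to isolate the breast region and reduce background noise, and to use a multi-scale design that captures both the local features of the tumor and the spatial characteristics of the surrounding tissue, improving diagnostic robustness and accuracy. Fusing contextual and localized information keeps the outputs aligned with how radiologists interpret mammograms, fostering trust in human-AI interaction. The framework also includes a hybrid fusion structure that remains robust when only a single view is available, a common situation in clinical practice.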

Link: https://arxiv.org/abs/2503.13309
Authors: Farnoush Bayatmakou,Reza Taleei,Milad Amir Toutounchian,Arash Mohammadi
Affiliations: Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Canada; Thomas Jefferson University Hospital, Philadelphia, Pennsylvania, USA; College of Computing & Informatics, Drexel University, Philadelphia, Pennsylvania, USA
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite advancements in Computer-Aided Diagnosis (CAD) systems, breast cancer remains one of the leading causes of cancer-related deaths among women worldwide. Recent breakthroughs in Artificial Intelligence (AI) have shown significant promise in development of advanced Deep Learning (DL) architectures for breast cancer diagnosis through mammography. In this context, the paper focuses on the integration of AI within a Human-Centric workflow to enhance breast cancer diagnostics. Key challenges are, however, largely overlooked such as reliance on detailed tumor annotations and susceptibility to missing views, particularly during test time. To address these issues, we propose a hybrid, multi-scale and multi-view Swin Transformer-based framework (MSMV-Swin) that enhances diagnostic robustness and accuracy. The proposed MSMV-Swin framework is designed to work as a decision-support tool, helping radiologists analyze multi-view mammograms more effectively. More specifically, the MSMV-Swin framework leverages the Segment Anything Model (SAM) to isolate the breast lobe, reducing background noise and enabling comprehensive feature extraction. The multi-scale nature of the proposed MSMV-Swin framework accounts for tumor-specific regions as well as the spatial characteristics of tissues surrounding the tumor, capturing both localized and contextual information. The integration of contextual and localized data ensures that MSMV-Swin’s outputs align with the way radiologists interpret mammograms, fostering better human-AI interaction and trust. A hybrid fusion structure is then designed to ensure robustness against missing views, a common occurrence in clinical practice when only a single mammogram view is available.

[CV-305] Artificial Intelligence-Driven Prognostic Classification of COVID-19 Using Chest X-rays: A Deep Learning Approach

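[Quick read]: This paper addresses the limited scalability and efficiency of existing methods for COVID-19 prognosis classification from chest X-rays, and presents a high-accuracy deep learning model that classifies COVID-19 severity (Mild, Moderate, and Severe) quickly and accurately. The key to the solution is the use of the Microsoft Azure Custom Vision platform to train and validate a model based on Convolutional Neural Networks (CNNs) on a dataset of 1,103 chest X-ray images from confirmed COVID-19 patients. Performance is assessed with accuracy, specificity, sensitivity, and F1-score; the model reaches an average accuracy of 97%, with per-class accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe). The work illustrates the potential of integrating deep learning into routine clinical workflows.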

Link: https://arxiv.org/abs/2503.13277
Authors: Alfred Simbun,Suresh Kumar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 6 figures, 10 tables

Abstract:Background: The COVID-19 pandemic has overwhelmed healthcare systems, emphasizing the need for AI-driven tools to assist in rapid and accurate patient prognosis. Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency. Objective: This study presents a high-accuracy deep learning model for classifying COVID-19 severity (Mild, Moderate, and Severe) using Chest X-ray images, developed on Microsoft Azure Custom Vision. Methods: Using a dataset of 1,103 confirmed COVID-19 X-ray images from AIforCOVID, we trained and validated a deep learning model leveraging Convolutional Neural Networks (CNNs). The model was evaluated on an unseen dataset to measure accuracy, precision, and recall. Results: Our model achieved an average accuracy of 97%, with specificity of 99%, sensitivity of 87%, and an F1-score of 93.11%. When classifying COVID-19 severity, the model achieved accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe). These results demonstrate the model’s potential for real-world clinical applications, aiding in faster decision-making and improved resource allocation. Conclusion: AI-driven prognosis classification using deep learning can significantly enhance COVID-19 patient management, enabling early intervention and efficient triaging. Our study provides a scalable, high-accuracy AI framework for integrating deep learning into routine clinical workflows. Future work should focus on expanding datasets, external validation, and regulatory compliance to facilitate clinical adoption.
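Since the abstract reports accuracy, specificity, sensitivity, and F1-score together, it may help to see how these relate for a three-class problem when each class is scored one-vs-rest from a confusion matrix. The function below is a generic sketch; the layout (rows are true classes, columns are predictions) is an assumption, and no numbers from the paper are reproduced here.

```python
import numpy as np

def one_vs_rest_metrics(confusion, cls):
    """Metrics for class index `cls` from a confusion matrix (rows = true, columns = predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = confusion[cls, cls]
    fn = confusion[cls].sum() - tp       # true class `cls`, predicted as something else
    fp = confusion[:, cls].sum() - tp    # other classes predicted as `cls`
    tn = confusion.sum() - tp - fn - fp

    sensitivity = tp / (tp + fn)         # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / confusion.sum()
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
```

Because true negatives dominate in a one-vs-rest setting, accuracy and specificity can be high even when sensitivity for a given severity class is noticeably lower.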

[CV-306] How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark

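[Quick read]: This paper addresses the lack of a comprehensive evaluation of histopathology vision-language foundation models (VLMs) under a unified benchmark setting. Most existing histopathology benchmarks are either unimodal or limited in the diversity of clinical tasks, organs, and acquisition instruments, and are only partially public because of patient data privacy. To fill this gap, the paper introduces HistoVL, a fully open-source, comprehensive benchmark with images from up to 11 acquisition tools, paired with captions crafted by combining class names with diverse pathology descriptions. HistoVL covers 26 organs, 31 cancer types, and a wide variety of tissue from 14 heterogeneous patient cohorts, totaling more than 5 million patches from over 41,000 whole-slide images (WSIs). A systematic evaluation on HistoVL simulates the diverse tasks experts perform in real clinical scenarios. The analysis shows that most existing histopathology VLMs are highly sensitive to textual changes, with balanced accuracy dropping by up to 25% on tasks such as metastasis detection, and that they show low robustness to adversarial attacks and improper calibration (high ECE values and low prediction confidence), all of which can affect clinical use. The key contribution is a more comprehensive and open benchmark that better reflects a wide range of clinical scenarios and exposes these weaknesses in existing models.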

Link: https://arxiv.org/abs/2503.12990
Authors: Roba Al Majzoub,Hashmat Malik,Muzammal Naseer,Zaigham Zaheer,Tariq Mahmood,Salman Khan,Fahad Khan
Affiliations: Department of Computer Vision, Mohamed Bin Zayed University of Artificial Intelligence; Khalifa University; Shaukat Khanum Cancer Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, histopathology vision-language foundation models (VLMs) have gained popularity due to their enhanced performance and generalizability across different downstream tasks. However, most existing histopathology benchmarks are either unimodal or limited in terms of diversity of clinical tasks, organs, and acquisition instruments, as well as their partial availability to the public due to patient data privacy. As a consequence, there is a lack of comprehensive evaluation of existing histopathology VLMs on a unified benchmark setting that better reflects a wide range of clinical scenarios. To address this gap, we introduce HistoVL, a fully open-source comprehensive benchmark comprising images acquired using up to 11 various acquisition tools that are paired with specifically crafted captions by incorporating class names and diverse pathology descriptions. Our Histo-VL includes 26 organs, 31 cancer types, and a wide variety of tissue obtained from 14 heterogeneous patient cohorts, totaling more than 5 million patches obtained from over 41K WSIs viewed under various magnification levels. We systematically evaluate existing histopathology VLMs on Histo-VL to simulate diverse tasks performed by experts in real-world clinical scenarios. Our analysis reveals interesting findings, including large sensitivity of most existing histopathology VLMs to textual changes with a drop in balanced accuracy of up to 25% in tasks such as Metastasis detection, low robustness to adversarial attacks, as well as improper calibration of models evident through high ECE values and low model prediction confidence, all of which can affect their clinical implementation.
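The calibration finding is stated in terms of ECE (expected calibration error). As a reminder of what that number measures, here is a minimal sketch of the standard equal-width binned estimator; the bin count and the function name are choices made for this example, and the benchmark's exact evaluation protocol may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > low) & (confidences <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # in_bin.mean() is the fraction of samples in this bin
    return ece
```

A perfectly calibrated model has an ECE of 0; a model that reports 90% confidence but is right only 75% of the time contributes a gap of 0.15 for that bin.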

[CV-307] Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference

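[Quick read]: This paper targets the high transmission latency and significant computation latency in large multimodal model (LMM) inference, caused by the limited uplink bandwidth of edge devices and the large number of visual tokens, with the aim of improving delay-sensitive tasks and user experience. The key solution is a Task-Oriented Feature Compression (TOFC) method that operates within a device-edge co-inference framework. TOFC reduces the number of visual features with density peaks clustering based on K nearest neighbors, lowering both data transmission and computational complexity, and then encodes and decodes the merged features with a learnable entropy model with a hyperprior to further reduce transmission overhead. Multiple entropy models are selected adaptively so that the probability distribution of the features is estimated more accurately, which further improves compression efficiency. Experiments show that, compared with traditional image compression methods, TOFC reduces data transmission overhead by up to 60% and system latency by 50% while maintaining the same task performance.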

Link: https://arxiv.org/abs/2503.12926
Authors: Cheng Yuan,Zhening Liu,Jiashu Lv,Jiawei Shao,Yufei Jiang,Jun Zhang,Xuelong Li
Affiliations: Institute of Artificial Intelligence (TeleAI) of China Telecom, China; Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong; School of Software and Microelectronics, Peking University, China; School of Electronic and Information Engineering, Harbin Institute of Technology, Shenzhen, China
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 60% reduction in data transmission overhead and 50% reduction in system latency while maintaining identical task performance, compared with traditional image compression methods.
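The token-merging step can be sketched with a small NumPy version of density peaks clustering with K-nearest-neighbor densities. This is an illustrative approximation, not the TOFC implementation: the density formula, the averaging used to merge tokens, and the names `dpc_knn_merge`, `n_clusters`, and `k` are assumptions.

```python
import numpy as np

def dpc_knn_merge(tokens, n_clusters, k=5):
    """Cluster tokens (N x D) with DPC-KNN and return (merged tokens, cluster labels)."""
    tokens = np.asarray(tokens, dtype=float)
    n = len(tokens)
    dist = np.linalg.norm(tokens[:, None, :] - tokens[None, :, :], axis=-1)

    # Local density from the k nearest neighbours (column 0 is the token itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # Distance to the nearest token of higher density; the densest token gets its max distance.
    delta = np.empty(n)
    for i in range(n):
        higher = density > density[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    # Cluster centres combine high density with a large distance to denser tokens.
    centers = np.argsort(-(density * delta))[:n_clusters]
    labels = dist[:, centers].argmin(axis=1)

    # Merge by averaging; assumes the chosen centres are distinct points.
    merged = np.stack([tokens[labels == c].mean(axis=0) for c in range(n_clusters)])
    return merged, labels
```

Only the `n_clusters` merged tokens would then need to be entropy-coded and transmitted, which is where the reduction in uplink traffic and downstream computation comes from.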

[CV-308] A Continual Learning-driven Model for Accurate and Generalizable Segmentation of Clinically Comprehensive and Fine-grained Whole-body Anatomies in CT

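[Quick read]: This paper addresses the lack of a large, fully annotated dataset for using computed tomography (CT) in the quantitative management of chronic diseases and oncology in precision medicine. Because manually annotating every anatomy is extremely costly, requires specialized clinical expertise, and takes a long time, no CT dataset with all anatomies delineated exists for training. To overcome this, the paper proposes CL-Net, a continual learning-driven CT model. Its key property is that it can segment complete anatomies from dozens of partially labeled datasets, dynamically expanding to new anatomies without forgetting previously learned organ knowledge. Existing methods cannot dynamically segment new anatomies without catastrophic forgetting, and they run into optimization difficulty or infeasibility when segmenting hundreds of anatomies across all body regions. CL-Net consists of a universal encoder and multiple optimized and pruned decoders, and is built from 13,952 CT scans drawn from 20 public and 16 private high-quality, partially labeled CT datasets covering various vendors, contrast phases, and pathologies. Experiments show that CL-Net segments 235 fine-grained whole-body anatomies with high accuracy, consistently outperforms an ensemble of 36 specialist nnUNets while using only 5% of their combined model size, and clearly surpasses recent leading Segment Anything-style medical image foundation models. The model provides a solid foundation for many downstream tasks in oncology and chronic disease.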

Link: https://arxiv.org/abs/2503.12698
Authors: Dazhou Guo,Zhanghexuan Ji,Yanzhou Su,Dandan Zheng,Heng Guo,Puyang Wang,Ke Yan,Yirui Wang,Qinji Yu,Zi Li,Minfeng Xu,Jianfeng Zhang,Haoshen Li,Jia Ge,Tsung-Ying Ho,Bing-Shen Huang,Tashan Ai,Kuaile Zhao,Na Shen,Qifeng Wang,Yun Bian,Tingyu Wu,Peng Du,Hua Zhang,Feng-Ming Kong,Alan L. Yuille,Cher Heng Tan,Chunyan Miao,Perry J. Pickhardt,Senxiang Yan,Ronald M. Summers,Le Lu,Dakai Jin,Xianghua Ye
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

点击查看摘要

Abstract:Precision medicine in the quantitative management of chronic diseases and oncology would be greatly improved if the Computed Tomography (CT) scan of any patient could be segmented, parsed and analyzed in a precise and detailed way. However, there is no such fully annotated CT dataset with all anatomies delineated for training because of the exceptionally high manual cost, the need for specialized clinical expertise, and the time required to finish the task. To this end, we proposed a novel continual learning-driven CT model that can segment complete anatomies presented using dozens of previously partially labeled datasets, dynamically expanding its capacity to segment new ones without compromising previously learned organ knowledge. Existing multi-dataset approaches are not able to dynamically segment new anatomies without catastrophic forgetting and would encounter optimization difficulty or infeasibility when segmenting hundreds of anatomies across the whole range of body regions. Our single unified CT segmentation model, CL-Net, can highly accurately segment a clinically comprehensive set of 235 fine-grained whole-body anatomies. Composed of a universal encoder, multiple optimized and pruned decoders, CL-Net is developed using 13,952 CT scans from 20 public and 16 private high-quality partially labeled CT datasets of various vendors, different contrast phases, and pathologies. Extensive evaluation demonstrates that CL-Net consistently outperforms the upper limit of an ensemble of 36 specialist nnUNets trained per dataset with the complexity of 5% model size and significantly surpasses the segmentation accuracy of recent leading Segment Anything-style medical image foundation models by large margins. Our continual learning-driven CL-Net model would lay a solid foundation to facilitate many downstream tasks of oncology and chronic diseases using the most widely adopted CT imaging.

[CV-309] COVID 19 Diagnosis Analysis using Transfer Learning

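[Quick read]: This paper addresses the shortage of diagnostic capacity caused by the rapid global spread of COVID-19, the disease caused by the novel coronavirus (SARS-CoV-2). With confirmed cases rising daily and hospitals having limited testing equipment, an automatic diagnosis system is needed to help identify infected patients. The key solution is an approach based on pre-trained convolutional neural network models (VGG16, VGG19, and ResNet50) that analyzes chest X-ray and computed tomography (CT) images to perform binary classification (COVID-19 vs. normal) of coronavirus pneumonia patients. Experiments show that ResNet50 performs best on 6259 images, with 97.77% accuracy, 100% sensitivity, 93.33% specificity, and a 98.00% F1 score, indicating that the approach has high practical clinical value.

Link: https://arxiv.org/abs/2503.12642
Authors: Anjali Dharmik
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: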

Abstract:COVID-19 is a rapidly spreading disease caused by a coronavirus. The infection was first discovered in December 2019 in Wuhan, China, and spread rapidly across the planet within a few months. The virus can cause severe symptoms and even death, especially in the elderly and in people with medical conditions. The virus causes acute respiratory infections in humans. The first case was diagnosed in China in 2019 and the pandemic started in 2020. Because the number of COVID-19 cases is increasing daily, only a limited number of test kits are available in hospitals. So, to stop COVID-19 from spreading among people, an automatic diagnosis system must be implemented. In this study, three pre-trained models based on convolutional neural networks (VGG16, VGG19, ResNet50) are proposed for detecting coronavirus pneumonia patients from X-rays and computed tomography (CT). Using cross-validation, we implemented binary classification with two classes (COVID-19, Normal (healthy)). Based on the results obtained, the pre-trained ResNet50 model provides the best classification performance (97.77% accuracy, 100% sensitivity, 93.33% specificity, 98.00% F1-score) among the three models used, over 6259 images.
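
For readers less familiar with the four metrics reported above, the following sketch shows how they are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. COVID-19) cases classified correctly/incorrectly.
    tn/fp: negative (e.g. normal) cases classified correctly/incorrectly.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # also called recall or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    for name, value in binary_metrics(tp=45, fp=5, tn=40, fn=10).items():
        print(f"{name}: {value:.4f}")
```

A sensitivity of 100% means no positive case was missed, while specificity below 100% means some healthy cases were flagged as positive.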

[CV-310] Goal-Oriented Source Coding using LDPC Codes for Compressed-Domain Image Classification

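[Quick read]: This paper addresses how, in goal-oriented communications, to learn directly from entropy-coded data without decoding it. Conventional entropy-coding methods (such as Huffman and arithmetic coding) compress efficiently but disrupt the data structure, which makes them ill-suited as a basis for direct learning. The key solution is to explore low-density parity-check (LDPC) codes, a method originally designed for channel coding. By exploiting the structured nature of LDPC codes, deep learning models can perform tasks such as classification more effectively. Experiments show that, compared with conventional entropy coding, LDPC codes not only perform better on classification tasks but also allow smaller learning models. The paper also analyzes why LDPC codes preserve data structure better than conventional methods and examines how key code parameters affect classification performance. These findings suggest that LDPC-based entropy coding strikes an optimal balance between learning efficiency and model complexity, enabling direct learning without prior decoding.

Link: https://arxiv.org/abs/2503.11954
Authors: Ahcen Aliouat,Elsa Dupraz
Affiliations: IMT Atlantique, CNRS UMR 6285, Lab-STICC
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 11 pages, 13 figures, Submitted to IEEE Transactions on Communications (Under Review)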

Abstract:In the emerging field of goal-oriented communications, the focus has shifted from reconstructing data to directly performing specific learning tasks, such as classification, segmentation, or pattern recognition, on the received coded data. In the commonly studied scenario of classification from compressed images, a key objective is to enable learning directly on entropy-coded data, thereby bypassing the computationally intensive step of data reconstruction. Conventional entropy-coding methods, such as Huffman and Arithmetic coding, are effective for compression but disrupt the data structure, making them less suitable for direct learning without decoding. This paper investigates the use of low-density parity-check (LDPC) codes – originally designed for channel coding – as an alternative entropy-coding approach. It is hypothesized that the structured nature of LDPC codes can be leveraged more effectively by deep learning models for tasks like classification. At the receiver side, gated recurrent unit (GRU) models are trained to perform image classification directly on LDPC-coded data. Experiments on datasets like MNIST, Fashion-MNIST, and CIFAR show that LDPC codes outperform Huffman and Arithmetic coding in classification tasks, while requiring significantly smaller learning models. Furthermore, the paper analyzes why LDPC codes preserve data structure more effectively than traditional entropy-coding techniques and explores the impact of key code parameters on classification performance. These results suggest that LDPC-based entropy coding offers an optimal balance between learning efficiency and model complexity, eliminating the need for prior decoding.
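
One common way to use a parity-check matrix for compression is syndrome coding, where a block of n source bits is replaced by the m < n parity checks it satisfies. The sketch below illustrates that idea only; the abstract does not state the exact scheme used in the paper, and the tiny matrix `H_ROWS` is a hypothetical example, far smaller and denser than a real LDPC matrix.

```python
def syndrome(parity_rows, bits):
    """Compress `bits` to one parity bit per row of a sparse parity-check matrix.

    parity_rows[i] lists the bit positions that take part in check i,
    i.e. the positions of the ones in row i of the matrix H.
    """
    return [sum(bits[j] for j in row) % 2 for row in parity_rows]


# Hypothetical 3 x 6 parity-check matrix, stored row by row as index lists.
H_ROWS = [
    [0, 1, 3],
    [1, 2, 4],
    [0, 2, 5],
]

if __name__ == "__main__":
    source = [1, 0, 1, 1, 0, 0]   # six source bits, e.g. from an image bit-plane
    flipped = [1, 0, 1, 1, 1, 0]  # same block with bit 4 flipped
    print(syndrome(H_ROWS, source))
    print(syndrome(H_ROWS, flipped))
```

Because each check touches only a few source bits, flipping one input bit changes only the checks that include it. That locality is one plausible reading of why such codewords keep more of the source structure than Huffman or arithmetic codewords, where a single change can shift every later bit.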

[CV-311] DCAT: Dual Cross-Attention Fusion for Disease Classification in Radiological Images with Uncertainty Estimation

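[Quick read]: This paper addresses key challenges in feature integration and model interpretability in medical image analysis, in particular the tendency of existing deep learning models to make overconfident predictions under uncertainty. It proposes a novel dual cross-attention fusion model that introduces a bidirectional cross-attention mechanism, combines multi-scale feature maps from EfficientNetB4 and ResNet34, and refines and dynamically fuses the features with channel and spatial attention. The key is that bidirectional cross-attention captures dependencies between features, while the refined channel and spatial attention highlights the discriminative patterns needed for classification, which significantly improves diagnostic accuracy and transparency. Experiments show strong performance across several medical imaging datasets (AUC from 98.69% to 100% and AUPR from 96.36% to 100%), and visualizing entropy values and high-uncertainty samples further improves interpretability.

Link: https://arxiv.org/abs/2503.11851
Authors: Jutika Borah,Hidam Kumarjit Singh
Affiliations: Gauhati University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures, 5 tables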

Abstract:Accurate and reliable image classification is crucial in radiology, where diagnostic decisions significantly impact patient outcomes. Conventional deep learning models tend to produce overconfident predictions despite underlying uncertainties, potentially leading to misdiagnoses. Attention mechanisms have emerged as powerful tools in deep learning, enabling models to focus on relevant parts of the input data. Combined with feature fusion, they can be effective in addressing uncertainty challenges. Cross-attention has become increasingly important in medical image analysis for capturing dependencies across features and modalities. This paper proposes a novel dual cross-attention fusion model for medical image analysis by addressing key challenges in feature integration and interpretability. Our approach introduces a bidirectional cross-attention mechanism with refined channel and spatial attention that dynamically fuses feature maps from EfficientNetB4 and ResNet34, leveraging multi-network contextual dependencies. The features refined through channel and spatial attention highlight discriminative patterns crucial for accurate classification. The proposed model achieved AUC of 99.75%, 100%, 99.93% and 98.69% and AUPR of 99.81%, 100%, 99.97%, and 96.36% on Covid-19, Tuberculosis, Pneumonia Chest X-ray images and Retinal OCT images respectively. The entropy values and several high-uncertainty samples provide an interpretable visualization of the model, enhancing transparency. By combining multi-scale feature extraction, bidirectional attention and uncertainty estimation, our proposed model strongly impacts medical image analysis.
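
The entropy-based uncertainty signal mentioned above can be illustrated with a short sketch. It assumes the uncertainty score is the Shannon entropy of the softmax output, which is a common choice but is not spelled out in the abstract. The helper names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector.

    Zero for a one-hot prediction, log(num_classes) for a uniform one.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    confident = softmax([8.0, 0.5, 0.2])   # hypothetical logits, one clear winner
    uncertain = softmax([1.1, 1.0, 0.9])   # hypothetical logits, near tie
    print(predictive_entropy(confident))
    print(predictive_entropy(uncertain))
```

Samples whose entropy is close to the maximum are the ones a system of this kind would surface for manual review.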

[CV-312] From Pixels to Histopathology: A Graph-Based Framework for Interpretable Whole Slide Image Analysis

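[Quick read]: This paper addresses a problem in the histopathological classification of whole-slide images (WSIs): dividing slides into artificial patches prevents a network from learning the full image context, disregards natural tissue structure, and hurts interpretability. The key solution is a new graph-based framework that overcomes these limitations by constructing an efficient WSI graph representation. Tissue nodes are defined along biological boundaries rather than arbitrary patches, giving feature representations that are more biologically meaningful and easier to interpret. An adaptive graph coarsening process guided by learned embeddings then progressively merges regions while preserving discriminative local features and enabling efficient global information exchange. Finally, a graph attention network solves the diagnostic task. Experiments show strong performance on challenging tasks such as cancer stage classification and survival prediction, and predictive factors are identified with Integrated Gradients.

Link: https://arxiv.org/abs/2503.11846
Authors: Alexander Weers,Alexander H. Berger,Laurin Lux,Peter Schüffler,Daniel Rueckert,Johannes C. Paetzold
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 2 figures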

Abstract:The histopathological classification of whole-slide images (WSIs) is a fundamental task in digital pathology; yet it requires extensive time and expertise from specialists. While deep learning methods show promising results, they typically process WSIs by dividing them into artificial patches, which inherently prevents a network from learning from the entire image context, disregards natural tissue structures and compromises interpretability. Our method overcomes this limitation through a novel graph-based framework that constructs WSI graph representations. The WSI-graph efficiently captures essential histopathological information in a compact form. We build tissue representations (nodes) that follow biological boundaries rather than arbitrary patches all while providing interpretable features for explainability. Through adaptive graph coarsening guided by learned embeddings, we progressively merge regions while maintaining discriminative local features and enabling efficient global information exchange. In our method’s final step, we solve the diagnostic task through a graph attention network. We empirically demonstrate strong performance on multiple challenging tasks such as cancer stage classification and survival prediction, while also identifying predictive factors using Integrated Gradients. Our implementation is publicly available at this https URL

Artificial Intelligence

[AI-0] Deep Belief Markov Models for POMDP Inference

Link: https://arxiv.org/abs/2503.13438
Authors: Giacomo Arcieri,Konstantinos G. Papakonstantinou,Daniel Straub,Eleni Chatzi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces a novel deep learning-based architecture, termed the Deep Belief Markov Model (DBMM), which provides efficient, model-formulation agnostic inference in Partially Observable Markov Decision Process (POMDP) problems. The POMDP framework allows for modeling and solving sequential decision-making problems under observation uncertainty. In complex, high-dimensional, partially observable environments, existing methods for inference based on exact computations (e.g., via Bayes’ theorem) or sampling algorithms do not scale well. Furthermore, ground truth states may not be available for learning the exact transition dynamics. DBMMs extend deep Markov models into the partially observable decision-making framework and allow efficient belief inference entirely based on available observation data via variational inference methods. By leveraging the potency of neural networks, DBMMs can infer and simulate non-linear relationships in the system dynamics and naturally scale to problems with high dimensionality and discrete or continuous variables. In addition, neural network parameters can be dynamically updated efficiently based on data availability. DBMMs can thus be used to infer a belief variable, thus enabling the derivation of POMDP solutions over the belief space. We evaluate the efficacy of the proposed methodology by evaluating the capability of model-formulation agnostic inference of DBMMs in benchmark problems that include discrete and continuous variables.
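
For context, the exact Bayes-theorem belief update that the abstract says does not scale can be written in a few lines for a small discrete POMDP. The sketch below shows that classical baseline, not the DBMM itself. The function name and the two-state example are hypothetical.

```python
def belief_update(belief, transition, obs_likelihood):
    """Exact Bayesian belief update for a discrete POMDP, for one fixed action.

    belief[s]          : current probability of hidden state s
    transition[s][s2]  : P(s2 | s, action)
    obs_likelihood[s2] : P(observation | s2, action)
    Returns the posterior belief over next states.
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n)) for s2 in range(n)]
    # Correction step: weight by how well each state explains the observation.
    unnormalized = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    evidence = sum(unnormalized)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / evidence for u in unnormalized]


if __name__ == "__main__":
    # Hypothetical two-state problem: the state does not change, and the
    # sensor reports the true state with probability 0.85.
    prior = [0.5, 0.5]
    stay = [[1.0, 0.0], [0.0, 1.0]]
    print(belief_update(prior, stay, obs_likelihood=[0.85, 0.15]))
```

The prediction step costs O(n^2) per update in the number of states and needs the transition and observation models explicitly, which is the scaling and modeling burden that learned belief inference aims to avoid.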

[AI-1] Securing Virtual Reality Experiences: Unveiling and Tackling Cybersickness Attacks with Explainable AI

Link: https://arxiv.org/abs/2503.13419
Authors: Ripan Kumar Kundu,Matthew Denton,Genova Mongalo,Prasad Calyam,Khaza Anuarul Hoque
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:The synergy between virtual reality (VR) and artificial intelligence (AI), specifically deep learning (DL)-based cybersickness detection models, has ushered in unprecedented advancements in immersive experiences by automatically detecting cybersickness severity and adaptively applying various mitigation techniques, offering a smooth and comfortable VR experience. While this DL-enabled cybersickness detection method provides promising solutions for enhancing user experiences, it also introduces new risks since these models are vulnerable to adversarial attacks; a small perturbation of the input data that is visually undetectable to human observers can fool the cybersickness detection model and trigger unexpected mitigation, thus disrupting user immersive experiences (UIX) and even posing safety risks. In this paper, we present a new type of VR attack, i.e., a cybersickness attack, which successfully stops the triggering of cybersickness mitigation by fooling DL-based cybersickness detection models and dramatically hinders the UIX. Next, we propose a novel explainable artificial intelligence (XAI)-guided cybersickness attack detection framework to detect such attacks in VR to ensure UIX and a comfortable VR experience. We evaluate the proposed attack and the detection framework using two state-of-the-art open-source VR cybersickness datasets: Simulation 2021 and Gameplay dataset. Finally, to verify the effectiveness of our proposed method, we implement the attack and the XAI-based detection using a testbed with a custom-built VR roller coaster simulation with an HTC Vive Pro Eye headset and perform a user study. Our study shows that such an attack can dramatically hinder the UIX. However, our proposed XAI-guided cybersickness attack detection can successfully detect cybersickness attacks and trigger the proper mitigation, effectively reducing VR cybersickness.

[AI-2] FLEX: A Framework for Learning Robot-Agnostic Force-based Skills Involving Sustained Contact Object Manipulation ICRA-2025

Link: https://arxiv.org/abs/2503.13418
Authors: Shijie Fang,Wenchang Gao,Shivam Goel,Christopher Thierauf,Matthias Scheutz,Jivko Sinapov
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE-ICRA-2025

Abstract:Learning to manipulate objects efficiently, particularly those involving sustained contact (e.g., pushing, sliding) and articulated parts (e.g., drawers, doors), presents significant challenges. Traditional methods, such as robot-centric reinforcement learning (RL), imitation learning, and hybrid techniques, require massive training and often struggle to generalize across different objects and robot platforms. We propose a novel framework for learning object-centric manipulation policies in force space, decoupling the robot from the object. By directly applying forces to selected regions of the object, our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead. This approach, trained in simulation on a small set of representative objects, captures object dynamics – such as joint configurations – allowing policies to generalize effectively to new, unseen objects. Decoupling these policies from robot-specific dynamics enables direct transfer to different robotic platforms (e.g., Kinova, Panda, UR5) without retraining. Our evaluations demonstrate that the method significantly outperforms baselines, achieving over an order of magnitude improvement in training efficiency compared to other state-of-the-art methods. Additionally, operating in force space enhances policy transferability across diverse robot platforms and object types. We further showcase the applicability of our method in a real-world robotic setting. For supplementary materials and videos, please visit: this https URL

[AI-3] A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives

Link: https://arxiv.org/abs/2503.13415
Authors: Weiqiang Jin,Hongyang Du,Biao Zhao,Xingwu Tian,Bohang Shi,Guang Yang
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 54 pages, 24 figures

Abstract:With the rapid development of artificial intelligence, intelligent decision-making techniques have gradually surpassed human levels in various human-machine competitions, especially in complex multi-agent cooperative task scenarios. Multi-agent cooperative decision-making involves multiple agents working together to complete established tasks and achieve specific objectives. These techniques are widely applicable in real-world scenarios such as autonomous driving, drone navigation, disaster rescue, and simulated military confrontations. This paper begins with a comprehensive survey of the leading simulation environments and platforms used for multi-agent cooperative decision-making. Specifically, we provide an in-depth analysis for these simulation environments from various perspectives, including task formats, reward allocation, and the underlying technologies employed. Subsequently, we provide a comprehensive overview of the mainstream intelligent decision-making approaches, algorithms and models for multi-agent systems (MAS). These approaches can be broadly categorized into five types: rule-based (primarily fuzzy logic), game theory-based, evolutionary algorithms-based, deep multi-agent reinforcement learning (MARL)-based, and large language models (LLMs) reasoning-based. Given the significant advantages of MARL and LLMs-based decision-making methods over the traditional rule, game theory, and evolutionary algorithms, this paper focuses on these multi-agent methods utilizing MARL and LLMs-based techniques. We provide an in-depth discussion of these approaches, highlighting their methodology taxonomies, advantages, and drawbacks. Further, several prominent research directions in the future and potential challenges of multi-agent cooperative decision-making are also detailed.

[AI-4] Reward Adaptation Via Q-Manipulation

Link: https://arxiv.org/abs/2503.13414
Authors: Kevin Vora,Yu Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors learned a priori under the same domain dynamics but different reward functions. Learning the target behavior from scratch is possible but often inefficient given the available source behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. Assuming that the target reward function is a known function of the source reward functions, our approach to RA computes bounds of the Q-function. We introduce an iterative process to tighten the bounds, similar to value iteration. This enables action pruning in the target domain before learning even starts. We refer to such a method as Q-Manipulation (Q-M). We formally prove that our pruning strategy does not affect the optimality of the returned policy, and we show empirically that it improves sample complexity. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
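
The idea of pruning actions from Q-function bounds can be sketched as follows. This is only an illustration under an assumed pruning rule (drop an action when its upper bound falls below the best lower bound in the same state); the abstract does not give the paper's exact rule or how the bounds are computed. The function name and the numbers are hypothetical.

```python
def prune_actions(q_lower, q_upper):
    """Return the actions that may still be optimal in one state.

    q_lower[a] and q_upper[a] are valid lower and upper bounds on the target
    Q-value of action a. An action whose upper bound is below the best lower
    bound cannot be optimal, so removing it does not affect optimality.
    """
    best_lower = max(q_lower.values())
    return {a for a, upper in q_upper.items() if upper >= best_lower}


if __name__ == "__main__":
    # Hypothetical bounds for a single state with three actions.
    lower = {"left": 1.0, "right": 3.0, "stay": 0.0}
    upper = {"left": 2.5, "right": 5.0, "stay": 3.5}
    print(sorted(prune_actions(lower, upper)))
```

Tighter bounds prune more actions, which is why an iterative tightening step before learning can reduce the amount of exploration needed afterwards.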

[AI-5] Fed-Joint: Joint Modeling of Nonlinear Degradation Signals and Failure Events for Remaining Useful Life Prediction using Federated Learning

Link: https://arxiv.org/abs/2503.13404
Authors: Cheoljoon Jeong,Xubo Yue,Seokhyun Chung
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Many failure mechanisms of machinery are closely related to the behavior of condition monitoring (CM) signals. To achieve a cost-effective preventive maintenance strategy, accurate remaining useful life (RUL) prediction based on the signals is of paramount importance. However, the CM signals are often recorded at different factories and production lines, with limited amounts of data. Unfortunately, these datasets have rarely been shared between the sites due to data confidentiality and ownership issues, a lack of computing and storage power, and high communication costs associated with data transfer between sites and a data center. Another challenge in real applications is that the CM signals are often not explicitly specified a priori, meaning that existing methods, which usually assume a parametric form, may not be applicable. To address these challenges, we propose a new prognostic framework for RUL prediction using the joint modeling of nonlinear degradation signals and time-to-failure data within a federated learning scheme. The proposed method constructs a nonparametric degradation model using a federated multi-output Gaussian process and then employs a federated survival model to predict failure times and probabilities for in-service machinery. The superiority of the proposed method over other alternatives is demonstrated through comprehensive simulation studies and a case study using turbofan engine degradation signal data that include run-to-failure events.

[AI-6] Scalable Runtime Architecture for Data-driven Hybrid HPC and ML Workflow Applications

Link: https://arxiv.org/abs/2503.13343
Authors: Andre Merzky,Mikhail Titov,Matteo Turilli,Ozgur Kilic,Tianle Wang,Shantenu Jha
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hybrid workflows combining traditional HPC and novel ML methodologies are transforming scientific computing. This paper presents the architecture and implementation of a scalable runtime system that extends RADICAL-Pilot with service-based execution to support AI-out-HPC workflows. Our runtime system enables distributed ML capabilities, efficient resource management, and seamless HPC/ML coupling across local and remote platforms. Preliminary experimental results show that our approach manages concurrent execution of ML models across local and remote HPC/cloud resources with minimal architectural overheads. This lays the foundation for prototyping three representative data-driven workflow applications and executing them at scale on leadership-class HPC platforms.

[AI-7] Generative AI for Software Architecture. Applications, Trends, Challenges, and Future Directions

Link: https://arxiv.org/abs/2503.13310
Authors: Matteo Esposito,Xiaozhou Li,Sergio Moreschini,Noman Ahmad,Tomas Cerny,Karthik Vaidhyanathan,Valentina Lenarduzzi,Davide Taibi
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments:

Abstract:Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieval-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.

[AI-8] Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation

Link: https://arxiv.org/abs/2503.13279
Authors: Xinkai Zou,Yan Liu,Xiongbo Shi,Chen Yang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As requirements drift with rapid iterations, agile development becomes the dominant paradigm. Goal-driven Requirements Elicitation (RE) is a pivotal yet challenging task in agile project development due to its heavy tangling with adaptive planning and efficient collaboration. Recently, AI agents have shown promising ability in supporting requirements analysis by saving significant time and effort for stakeholders. However, current research mainly focuses on functional RE, and research works have not been reported bridging the long journey from goal to user stories. Moreover, considering the cost of LLM facilities and the need for data and idea protection, privately hosted small-sized LLM should be further utilized in RE. To address these challenges, we propose Goal2Story, a multi-agent fleet that adopts the Impact Mapping (IM) framework while merely using cost-effective sLLMs for goal-driven RE. Moreover, we introduce a StorySeek dataset that contains over 1,000 user stories (USs) with corresponding goals and project context information, as well as the semi-automatic dataset construction method. For evaluation, we proposed two metrics: Factuality Hit Rate (FHR) to measure consistency between the generated USs with the dataset and Quality And Consistency Evaluation (QuACE) to evaluate the quality of the generated USs. Experimental results demonstrate that Goal2Story outperforms the baseline performance of the Super-Agent adopting powerful LLMs, while also showcasing the performance improvements in key metrics brought by CoT and Agent Profile to Goal2Story, as well as its exploration in identifying latent needs.

[AI-9] Knowledge-Aware Iterative Retrieval for Multi-Agent Systems

Link: https://arxiv.org/abs/2503.13275
Authors: Seyoung Song
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:We introduce a novel large language model (LLM)-driven agent framework, which iteratively refines queries and filters contextual evidence by leveraging dynamically evolving knowledge. A defining feature of the system is its decoupling of external sources from an internal knowledge cache that is progressively updated to guide both query generation and evidence selection. This design mitigates bias-reinforcement loops and enables dynamic, trackable search exploration paths, thereby optimizing the trade-off between exploring diverse information and maintaining accuracy through autonomous agent decision-making. Our approach is evaluated on a broad range of open-domain question answering benchmarks, including multi-step tasks that mirror real-world scenarios where integrating information from multiple sources is critical, especially given the vulnerabilities of LLMs that lack explicit reasoning or planning capabilities. The results show that the proposed system not only outperforms single-step baselines regardless of task difficulty but also, compared to conventional iterative retrieval methods, demonstrates pronounced advantages in complex tasks through precise evidence-based reasoning and enhanced efficiency. The proposed system supports both competitive and collaborative sharing of updated context, enabling multi-agent extension. The benefits of multi-agent configurations become especially prominent as task difficulty increases. The number of convergence steps scales with task difficulty, suggesting cost-effective scalability.

[AI-10] Robust Decision-Making Via Free Energy Minimization

Link: https://arxiv.org/abs/2503.13223
Authors: Allahkaram Shafiei,Hozefa Jesawada,Karl Friston,Giovanni Russo
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: Contains main text and supplementary information

Abstract:Despite their groundbreaking performance, state-of-the-art autonomous agents can misbehave when training and environmental conditions become inconsistent, with minor mismatches leading to undesirable behaviors or even catastrophic failures. Robustness towards these training/environment ambiguities is a core requirement for intelligent agents and its fulfillment is a long-standing challenge when deploying agents in the real world. Here, departing from mainstream views seeking robustness through training, we introduce DR-FREE, a free energy model that installs this core property by design. It directly wires robustness into the agent decision-making mechanisms via free energy minimization. By combining a robust extension of the free energy principle with a novel resolution engine, DR-FREE returns a policy that is optimal-yet-robust against ambiguity. Moreover, for the first time, it reveals the mechanistic role of ambiguity on optimal decisions and requisite Bayesian belief updating. We evaluate DR-FREE on an experimental testbed involving real rovers navigating an ambiguous environment filled with obstacles. Across all the experiments, DR-FREE enables robots to successfully navigate towards their goal even when, in contrast, standard free energy minimizing agents that do not use DR-FREE fail. In short, DR-FREE can tackle scenarios that elude previous methods: this milestone may inspire both deployment in multi-agent settings and, at a perhaps deeper level, the quest for a biologically plausible explanation of how natural agents - with little or no training - survive in capricious environments.

[AI-11] Timing the Match: A Deep Reinforcement Learning Approach for Ride-Hailing and Ride-Pooling Services

Link: https://arxiv.org/abs/2503.13200
Authors: Yiman Bao,Jie Gao,Jinke He,Frans A. Oliehoek,Oded Cats
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Efficient timing in ride-matching is crucial for improving the performance of ride-hailing and ride-pooling services, as it determines the number of drivers and passengers considered in each matching process. Traditional batched matching methods often use fixed time intervals to accumulate ride requests before assigning matches. While this approach increases the number of available drivers and passengers for matching, it fails to adapt to real-time supply-demand fluctuations, often leading to longer passenger wait times and driver idle periods. To address this limitation, we propose an adaptive ride-matching strategy using deep reinforcement learning (RL) to dynamically determine when to perform matches based on real-time system conditions. Unlike fixed-interval approaches, our method continuously evaluates system states and executes matching at moments that minimize total passenger wait time. Additionally, we incorporate a potential-based reward shaping (PBRS) mechanism to mitigate sparse rewards, accelerating RL training and improving decision quality. Extensive empirical evaluations using a realistic simulator trained on real-world data demonstrate that our approach outperforms fixed-interval matching strategies, significantly reducing passenger waiting times and detour delays, thereby enhancing the overall efficiency of ride-hailing and ride-pooling systems.
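
Potential-based reward shaping (PBRS), mentioned above as the remedy for sparse rewards, adds the term γΦ(s') − Φ(s) to the environment reward. The sketch below shows that standard formula. The potential used in the example (the negative number of waiting passengers) is a hypothetical choice for illustration and is not the one used in the paper.

```python
def shaped_reward(reward, phi_state, phi_next_state, gamma):
    """Potential-based reward shaping: r + gamma * Phi(s') - Phi(s).

    phi_state and phi_next_state are the potential values of the current and
    next state. Shaping terms of this form are known to leave the optimal
    policy of the underlying problem unchanged.
    """
    return reward + gamma * phi_next_state - phi_state


if __name__ == "__main__":
    # Hypothetical potential: minus the number of passengers still waiting.
    # A matching step that takes the queue from 5 to 3 passengers gets a
    # positive signal even though the environment reward is still zero.
    print(shaped_reward(reward=0.0, phi_state=-5.0, phi_next_state=-3.0, gamma=0.9))
```

Because the shaping terms telescope along a trajectory, they change how quickly the agent receives feedback without changing which policy is optimal.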

[AI-12] A representational framework for learning and encoding structurally enriched trajectories in complex agent environments

Link: https://arxiv.org/abs/2503.13194
Authors: Corina Catarau-Cotutiu,Esther Mondragon,Eduardo Alonso
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The ability of artificial intelligence agents to make optimal decisions and generalise them to different domains and tasks is compromised in complex scenarios. One way to address this issue has focused on learning efficient representations of the world and on how the actions of agents affect them, such as disentangled representations that exploit symmetries. Whereas such representations are procedurally efficient, they are based on the compression of low-level state-action transitions, which lack structural richness. To address this problem, we propose to enrich the agent’s ontology and extend the traditional conceptualisation of trajectories to provide a more nuanced view of task execution. Structurally Enriched Trajectories (SETs) extend the encoding of sequences of states and their transitions by incorporating hierarchical relations between objects, interactions and affordances. SETs are built as multi-level graphs, providing a detailed representation of the agent dynamics and a transferable functional abstraction of the task. SETs are integrated into an architecture, Structurally Enriched Trajectory Learning and Encoding (SETLE), that employs a heterogeneous graph-based memory structure of multi-level relational dependencies essential for generalisation. Using reinforcement learning as a data generation tool, we demonstrate that SETLE can support downstream tasks, enabling agents to recognise task-relevant structural patterns across diverse environments.

[AI-13] GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation

Link: https://arxiv.org/abs/2503.13180
Authors: Jungwon Seo,Ferhat Ozgur Catak,Chunming Rong,Kibeom Hong,Minhoe Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Multi-source information fusion (MSIF) leverages diverse data streams to enhance decision-making, situational awareness, and system resilience. Federated Learning (FL) enables MSIF while preserving privacy but suffers from client drift under high data heterogeneity, leading to performance degradation. Traditional mitigation strategies rely on reference-based gradient adjustments, which can be unstable in partial participation settings. To address this, we propose Gradient Centralized Federated Learning (GC-Fed), a reference-free gradient correction method inspired by Gradient Centralization (GC). We introduce Local GC and Global GC, applying GC during local training and global aggregation, respectively. Our hybrid GC-Fed approach selectively applies GC at the feature extraction layer locally and at the classifier layer globally, improving training stability and model performance. Theoretical analysis and empirical results demonstrate that GC-Fed mitigates client drift and achieves state-of-the-art accuracy gains of up to 20% in heterogeneous settings.
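For readers unfamiliar with Gradient Centralization, the basic operator can be sketched as follows: each output unit's gradient slice is shifted to zero mean. This is only the generic GC step as commonly described; how GC-Fed applies it locally to the feature extractor and globally to the classifier is not specified in the abstract and is not reproduced here.

```python
import numpy as np


def centralize_gradient(grad):
    """Subtract the per-output-unit mean from a weight gradient.

    Assumes the first axis indexes output units; 1-D gradients
    (e.g. biases) are returned unchanged.
    """
    grad = np.asarray(grad, dtype=float)
    if grad.ndim < 2:
        return grad
    axes = tuple(range(1, grad.ndim))
    return grad - grad.mean(axis=axes, keepdims=True)


g = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])
g_centered = centralize_gradient(g)  # every row now sums to zero
```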

[AI-14] Rapfi: Distilling Efficient Neural Network for the Game of Gomoku

Link: https://arxiv.org/abs/2503.13178
Authors: Zhanggen Jin,Haobin Duan,Zhiyang Hang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Games have played a pivotal role in advancing artificial intelligence, with AI agents using sophisticated techniques to compete. Despite the success of neural network based game AIs, their performance often requires significant computational resources. In this paper, we present Rapfi, an efficient Gomoku agent that outperforms CNN-based agents in limited computation environments. Rapfi leverages a compact neural network with a pattern-based codebook distilled from CNNs, and an incremental update scheme that minimizes computation when input changes are minor. This new network uses orders of magnitude less computation to reach accuracy similar to that of much larger neural networks such as ResNet. Thanks to our incremental update scheme, depth-first search methods such as the alpha-beta search can be significantly accelerated. With a carefully tuned evaluation and search, Rapfi reached strength surpassing Katagomo, the strongest open-source Gomoku AI based on AlphaZero’s algorithm, under limited computational resources where accelerators like GPUs are absent. Rapfi ranked first among 520 Gomoku agents on Botzone and won the championship in GomoCup 2024.

[AI-15] HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning

Link: https://arxiv.org/abs/2503.13171
Authors: Wensheng Wang,Ning Tan
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:The acquisition of large-scale and diverse demonstration data is essential for improving robotic imitation learning generalization. However, generating such data for complex manipulations is challenging in real-world settings. We introduce HybridGen, an automated framework that integrates Vision-Language Model (VLM) and hybrid planning. HybridGen uses a two-stage pipeline: first, a VLM parses expert demonstrations, decomposing tasks into expert-dependent segments (object-centric pose transformations for precise control) and plannable segments (synthesizing diverse trajectories via path planning); second, pose transformations substantially expand the first-stage data. Crucially, HybridGen generates a large volume of training data without requiring specific data formats, making it broadly applicable to a wide range of imitation learning algorithms, a characteristic which we also demonstrate empirically across multiple algorithms. Evaluations across seven tasks and their variants demonstrate that agents trained with HybridGen achieve substantial performance and generalization gains, averaging a 5% improvement over state-of-the-art methods. Notably, in the most challenging task variants, HybridGen reaches a 59.7% average success rate, significantly outperforming MimicGen’s 49.5%. These results demonstrate its effectiveness and practicality.

[AI-16] Collaborative AI Enhances Image Understanding in Materials Science

Link: https://arxiv.org/abs/2503.13169
Authors: Ruoyan Avery Yin,Zhichu Ren,Zongyou Yin,Zhen Zhang,So Yeon Kim,Chia-Wei Hsu,Ju Li
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:The Copilot for Real-world Experimental Scientist (CRESt) system empowers researchers to control autonomous laboratories through conversational AI, providing a seamless interface for managing complex experimental workflows. We have enhanced CRESt by integrating a multi-agent collaboration mechanism that utilizes the complementary strengths of the ChatGPT and Gemini models for precise image analysis in materials science. This innovative approach significantly improves the accuracy of experimental outcomes by fostering structured debates between the AI models, which enhances decision-making processes in materials phase analysis. Additionally, to evaluate the generalizability of this approach, we tested it on a quantitative task of counting particles. Here, the collaboration between the AI models also led to improved results, demonstrating the versatility and robustness of this method. By harnessing this dual-AI framework, this approach stands as a pioneering method for enhancing experimental accuracy and efficiency in materials research, with applications extending beyond CRESt to broader scientific experimentation and analysis.

[AI-17] Efficient Imitation Under Misspecification ICLR2025

Link: https://arxiv.org/abs/2503.13162
Authors: Nicolas Espinosa-Dice,Sanjiban Choudhury,Wen Sun,Gokul Swamy
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 37 pages, 5 figures. Published as a conference paper at ICLR 2025

Abstract:Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert’s policy lies within the learner’s policy class (i.e. the learner can perfectly imitate the expert in all states). However, in practice, perfect imitation of the expert is often impossible due to differences in state information and action space expressiveness (e.g. morphological differences between robots and humans). In this paper, we consider the more general misspecified setting, where no assumptions are made about the expert policy’s realizability. We introduce a novel structural condition, reward-agnostic policy completeness, and prove that it is sufficient for interactive IL algorithms to efficiently avoid the quadratically compounding errors that stymie offline approaches like behavioral cloning. We address an additional practical constraint, the case of limited expert data, and propose a principled method for using additional offline data to further improve the sample-efficiency of interactive IL algorithms. Finally, we empirically investigate the optimal reset distribution in efficient IL under misspecification with a suite of continuous control tasks.

[AI-18] MIXPINN: Mixed-Material Simulations by Physics-Informed Neural Network IROS2025

Link: https://arxiv.org/abs/2503.13123
Authors: Xintian Yuan,Yunke Ao,Boqi Chen,Philipp Fuernstahl
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE IROS 2025 for possible publication

Abstract:Simulating the complex interactions between soft tissues and rigid anatomy is critical for applications in surgical training, planning, and robotic-assisted interventions. Traditional Finite Element Method (FEM)-based simulations, while accurate, are computationally expensive and impractical for real-time scenarios. Learning-based approaches have shown promise in accelerating predictions but have fallen short in modeling soft-rigid interactions effectively. We introduce MIXPINN, a physics-informed Graph Neural Network (GNN) framework for mixed-material simulations, explicitly capturing soft-rigid interactions using graph-based augmentations. Our approach integrates Virtual Nodes (VNs) and Virtual Edges (VEs) to enhance rigid body constraint satisfaction while preserving computational efficiency. By leveraging a graph-based representation of biomechanical structures, MIXPINN learns high-fidelity deformations from FEM-generated data and achieves real-time inference with sub-millimeter accuracy. We validate our method in a realistic clinical scenario, demonstrating superior performance compared to baseline GNN models and traditional FEM methods. Our results show that MIXPINN reduces computational cost by an order of magnitude while maintaining high physical accuracy, making it a viable solution for real-time surgical simulation and robotic-assisted procedures.

[AI-19] Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization

Link: https://arxiv.org/abs/2503.13115
Authors: Chandan Tankala,Dheeraj M. Nagaraj,Anant Raj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)

Abstract:Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with n particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm’s output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.

[AI-20] Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided Self-Consistent MLLMs for Food Preparation Task Planning

Link: https://arxiv.org/abs/2503.13055
Authors: Yu-Hong Shen,Chuan-Yu Wu,Yi-Ru Yang,Yen-Ling Tai,Yi-Ting Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:We study Multimodal Large Language Models (MLLMs) with in-context learning for food preparation task planning. In this context, we identify two key challenges: cross-modal distraction and geometric feasibility. Cross-modal distraction occurs when the inclusion of visual input degrades the reasoning performance of an MLLM. Geometric feasibility refers to the ability of MLLMs to ensure that the selected skills are physically executable in the environment. To address these issues, we adapt Chain of Thought (CoT) with Self-Consistency to mitigate reasoning loss from cross-modal distractions and use an affordance predictor as skill preconditions to guide the MLLM on geometric feasibility. We construct a dataset to evaluate the ability of MLLMs on quantity estimation, reachability analysis, relative positioning and collision avoidance. We conducted a detailed evaluation to identify issues among different baselines and analyze the reasons for improvement, providing insights into each approach. Our method reaches a success rate of 76.7% on the entire dataset, showing a substantial improvement over the CoT baseline at 36.7%.
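The Self-Consistency step named in the abstract is, in its usual form, a majority vote over several sampled chain-of-thought answers. The sketch below shows only that voting step; the answer strings and the normalization (lowercasing, stripping whitespace) are illustrative assumptions, and no MLLM is called.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its vote share."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Three hypothetical sampled plans for the next skill to execute.
samples = ["Pick up knife", "pick up knife ", "Open fridge"]
answer, share = self_consistent_answer(samples)
```

The vote share can double as a rough confidence signal, for example to decide when to sample more reasoning paths.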

[AI-21] Robot Policy Transfer with Online Demonstrations: An Active Reinforcement Learning Approach

Link: https://arxiv.org/abs/2503.12993
Authors: Muhan Hou,Koen Hindriks,A.E. Eiben,Kim Baraka
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the intrinsic issue of covariate shift brought by LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariate shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into a policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that can optimize the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios, involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim of transferring a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines in terms of average success rate and sample efficiency, compared to two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in the real-world environment, demonstrating the policy effectiveness on a real robot manipulator.

[AI-22] ROMA: a Read-Only-Memory-based Accelerator for QLoRA-based On-Device LLM

Link: https://arxiv.org/abs/2503.12988
Authors: Wenqiang Wang,Yijia Zhang,Zikai Zhang,Guanting Huo,Hao Liang,Shijie Cao,Ningyi Xu
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:As large language models (LLMs) demonstrate powerful capabilities, deploying them on edge devices has become increasingly crucial, offering advantages in privacy and real-time interaction. QLoRA has emerged as the standard approach for on-device LLMs, leveraging quantized models to reduce memory and computational costs while utilizing LoRA for task-specific adaptability. In this work, we propose ROMA, a QLoRA accelerator with a hybrid storage architecture that uses ROM for quantized base models and SRAM for LoRA weights and KV cache. Our insight is that the quantized base model is stable and converged, making it well-suited for ROM storage. Meanwhile, LoRA modules offer the flexibility to adapt to new data without requiring updates to the base model. To further reduce the area cost of ROM, we introduce a novel B-ROM design and integrate it with the compute unit to form a fused cell for efficient use of chip resources. ROMA can effectively store both a 4-bit 3B and a 2-bit 8B LLaMA model entirely on-chip, achieving a notable generation speed exceeding 20,000 tokens/s without requiring external memory.
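To get a feel for the on-chip storage the abstract implies, the raw weight footprint can be estimated as parameters times bits per weight. This back-of-the-envelope sketch assumes nominal parameter counts of exactly 3B and 8B and ignores quantization metadata (scales, zero points), LoRA weights, and the KV cache, none of which are quantified in the abstract.

```python
def weight_bytes(num_params, bits_per_weight):
    """Raw storage for quantized weights, ignoring any per-group metadata."""
    return num_params * bits_per_weight // 8


GB = 10**9  # decimal gigabytes

size_3b_4bit = weight_bytes(3_000_000_000, 4)  # 4-bit, 3B parameters
size_8b_2bit = weight_bytes(8_000_000_000, 2)  # 2-bit, 8B parameters

print(f"4-bit 3B: {size_3b_4bit / GB:.1f} GB")
print(f"2-bit 8B: {size_8b_2bit / GB:.1f} GB")
```

Under these assumptions both configurations need on the order of 1.5 to 2 GB for the base model alone, which is the quantity the ROM has to hold.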

[AI-23] Open3DBench: Open-Source Benchmark for 3D-IC Backend Implementation and PPA Evaluation

Link: https://arxiv.org/abs/2503.12946
Authors: Yunqi Shi,Chengrui Gao,Wanqi Ren,Siyuan Xu,Ke Xue,Mingxuan Yuan,Chao Qian,Zhi-Hua Zhou
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:This work introduces Open3DBench, an open-source 3D-IC backend implementation benchmark built upon the OpenROAD-flow-scripts framework, enabling comprehensive evaluation of power, performance, area, and thermal metrics. Our proposed flow supports modular integration of 3D partitioning, placement, 3D routing, RC extraction, and thermal simulation, aligning with advanced 3D flows that rely on commercial tools and in-house scripts. We present two foundational 3D placement algorithms: Open3D-Tiling, which emphasizes regular macro placement, and Open3D-DMP, which enhances wirelength optimization through cross-die co-placement with the analytical placer DREAMPlace. Experimental results show significant improvements in area (51.19%), wirelength (24.06%), timing (30.84%), and power (5.72%) compared to 2D flows. The results also highlight that better wirelength does not necessarily lead to PPA gain, emphasizing the need to develop PPA-driven methods. Open3DBench offers a standardized, reproducible platform for evaluating 3D EDA methods, effectively bridging the gap between open-source tools and commercial solutions in 3D-IC design.

[AI-24] MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Link: https://arxiv.org/abs/2503.12931
Authors: Rui Pu,Chaozhuo Li,Rui Ha,Litian Zhang,Lirong Qiu,Xi Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies generally rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules are incapable of accommodating the inherent complexity and dynamic nature of real jailbreak attacks. In this paper, we propose a novel concept of “mirror” to enable dynamic and adaptive defense. A mirror refers to a dynamically generated prompt that mirrors the syntactic structure of the input while ensuring semantic safety. The personalized discrepancies between the input prompts and their corresponding mirrors serve as the guiding principles for defense. A new defense paradigm, MirrorGuard, is further proposed to detect and calibrate risky inputs based on such mirrors. An entropy-based detection metric, Relative Input Uncertainty (RIU), is integrated into MirrorGuard to quantify the discrepancies between input prompts and mirrors. MirrorGuard is evaluated on several popular datasets, demonstrating state-of-the-art defense performance while maintaining general effectiveness.

[AI-25] Verification Learning: Make Unsupervised Neuro-Symbolic System Feasible

Link: https://arxiv.org/abs/2503.12917
Authors: Lin-Han Jia,Wen-Chao Hu,Jie-Jing Shao,Lan-Zhe Guo,Yu-Feng Li
Subjects: Artificial Intelligence (cs.AI)

Abstract:The current Neuro-Symbolic (NeSy) Learning paradigm suffers from an over-reliance on labeled data. Completely disregarding labels leads to less symbol information, a larger solution space, and more shortcuts, issues that current NeSy systems cannot resolve. This paper introduces a novel learning paradigm, Verification Learning (VL), which addresses this challenge by transforming the label-based reasoning process in NeSy into a label-free verification process. VL achieves excellent learning results solely by relying on unlabeled data and a function that verifies whether the current predictions conform to the rules. We formalize this problem as a Constraint Optimization Problem (COP) and propose a Dynamic Combinatorial Sorting (DCS) algorithm that accelerates the solution by reducing verification attempts, effectively lowering computational costs to the level of a Constraint Satisfaction Problem (CSP). To further enhance performance, we introduce a prior alignment method to address potential shortcuts. Our theoretical analysis points out which tasks in NeSy systems can be completed without labels and explains why rules can replace infinite labels, such as in addition, for some tasks, while for others, like Sudoku, the rules have no effect. We validate the proposed framework through several fully unsupervised tasks including addition, sort, match, and chess, each showing significant performance and efficiency improvements.
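The "function that verifies whether the current predictions conform to the rules" can be pictured with the addition task. The sketch below is a hypothetical verifier, not the paper's implementation: it assumes a perception model has already predicted digit symbols for two operands and a result, and it only checks the arithmetic rule, using no labels.

```python
def digits_to_int(digits):
    """Convert a sequence of predicted digit symbols (0-9) to an integer."""
    value = 0
    for d in digits:
        if not 0 <= d <= 9:
            raise ValueError(f"not a digit: {d}")
        value = value * 10 + d
    return value


def verify_addition(a_digits, b_digits, c_digits):
    """Label-free check: do the predicted symbols satisfy a + b = c?"""
    return digits_to_int(a_digits) + digits_to_int(b_digits) == digits_to_int(c_digits)


ok = verify_addition([1, 2], [3, 9], [5, 1])   # 12 + 39 = 51
bad = verify_addition([1, 2], [3, 9], [5, 0])  # 12 + 39 != 50
```

A search procedure would then look for the most probable symbol assignment that passes such a check, which is where the abstract's constraint optimization framing comes in.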

[AI-26] Federated Continual Instruction Tuning

Link: https://arxiv.org/abs/2503.12897
Authors: Haiyang Guo,Fanhu Zeng,Fei Zhu,Wenzhuo Liu,Da-Han Wang,Jian Xu,Xu-Yao Zhang,Cheng-Lin Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.

[AI-27] SparseLUT: Sparse Connectivity Optimization for Lookup Table-based Deep Neural Networks

Link: https://arxiv.org/abs/2503.12829
Authors: Binglei Lou,Ruilin Wu,Philip Leong
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)

Abstract:The deployment of deep neural networks (DNNs) on resource-constrained edge devices such as field-programmable gate arrays (FPGAs) requires a careful balance of latency, power, and resource usage while maintaining high accuracy. Existing Lookup Table (LUT)-based DNNs, including LogicNets, PolyLUT, PolyLUT-Add, and NeuraLUT, exploit native FPGA resources with random sparse connectivity. This paper introduces SparseLUT, a connectivity-centric training technique tailored for LUT-based DNNs. SparseLUT leverages a non-greedy training strategy that prioritizes the pruning of less significant connections and strategically regrows alternative ones, resulting in efficient convergence to the target sparsity. Experimental results show consistent accuracy improvements across benchmarks, including up to a 2.13% increase on MNIST and a 0.94% improvement for Jet Substructure Classification compared to random sparsity. This is done without any hardware overhead and achieves state-of-the-art results for LUT-based DNNs.

[AI-28] Versatile Physics-based Character Control with Hybrid Latent Representation

Link: https://arxiv.org/abs/2503.12814
Authors: Jinseok Bae,Jungdam Won,Donggeun Lim,Inwoo Hwang,Young Min Kim
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:We present a versatile latent representation that enables physically simulated characters to efficiently utilize motion priors. To build a powerful motion embedding that is shared across multiple tasks, the physics controller should employ a rich latent space that is easily explored and capable of generating high-quality motion. We propose integrating continuous and discrete latent representations to build a versatile motion prior that can be adapted to a wide range of challenging control tasks. Specifically, we build a discrete latent model to capture a distinctive posterior distribution without collapse, and simultaneously augment the sampled vector with continuous residuals to generate high-quality, smooth motion without jittering. We further incorporate Residual Vector Quantization, which not only maximizes the capacity of the discrete motion prior, but also efficiently abstracts the action space during the task learning phase. We demonstrate that our agent can produce diverse yet smooth motions simply by traversing the learned motion prior through unconditional motion generation. Furthermore, our model robustly satisfies sparse goal conditions with highly expressive natural motions, including head-mounted device tracking and motion in-betweening at irregular intervals, which could not be achieved with existing latent representations.
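Residual Vector Quantization, as named in the abstract, can be sketched generically: each stage quantizes whatever the previous stages left over. The tiny hand-written codebooks below are illustrative assumptions; in practice they are learned, and the paper's hybrid continuous and discrete model is not reproduced here.

```python
import numpy as np


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; return code indices and the final residual."""
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    for codebook in codebooks:
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))  # nearest code to the current residual
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),      # coarse stage
    np.array([[0.0, 0.0], [0.25, -0.25]]),   # fine stage
]
x = np.array([1.2, 0.8])
codes, leftover = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

Each added stage shrinks the leftover residual, which is how stacking small codebooks increases the effective capacity of a discrete latent space.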

[AI-29] Analyzing sequential activity and travel decisions with interpretable deep inverse reinforcement learning

Link: https://arxiv.org/abs/2503.12761
Authors: Yuebing Liang,Shenhao Wang,Jiangbo Yu,Zhan Zhao,Jinhua Zhao,Sandy Pentland
Subjects: Artificial Intelligence (cs.AI)

Abstract:Travel demand modeling has shifted from aggregated trip-based models to behavior-oriented activity-based models because daily trips are essentially driven by human activities. To analyze the sequential activity-travel decisions, deep inverse reinforcement learning (DIRL) has proven effective in learning the decision mechanisms by approximating a reward function to represent preferences and a policy function to replicate observed behavior using deep neural networks (DNNs). However, most existing research has focused on using DIRL to enhance only prediction accuracy, with limited exploration into interpreting the underlying decision mechanisms guiding sequential decision-making. To address this gap, we introduce an interpretable DIRL framework for analyzing activity-travel decision processes, bridging the gap between data-driven machine learning and theory-driven behavioral models. Our proposed framework adapts an adversarial IRL approach to infer the reward and policy functions of activity-travel behavior. The policy function is interpreted through a surrogate interpretable model based on choice probabilities from the policy function, while the reward function is interpreted by deriving both short-term rewards and long-term returns for various activity-travel patterns. Our analysis of real-world travel survey data reveals promising results in two key areas: (i) behavioral pattern insights from the policy function, highlighting critical factors in decision-making and variations among socio-demographic groups, and (ii) behavioral preference insights from the reward function, indicating the utility individuals gain from specific activity sequences.

[AI-30] MAP: Multi-user Personalization with Collaborative LLM-powered Agents

Link: https://arxiv.org/abs/2503.12757
Authors: Christine Lee,Jihye Choi,Bilge Mutlu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25), April 26-May 1, 2025, Yokohama, Japan

Abstract:The widespread adoption of Large Language Models (LLMs) and LLM-powered agents in multi-user settings underscores the need for reliable, usable methods to accommodate diverse preferences and resolve conflicting directives. Drawing on conflict resolution theory, we introduce a user-centered workflow for multi-user personalization comprising three stages: Reflection, Analysis, and Feedback. We then present MAP, a Multi-Agent system for multi-user Personalization, to operationalize this workflow. By delegating subtasks to specialized agents, MAP (1) retrieves and reflects on relevant user information, while enhancing reliability through agent-to-agent interactions, (2) provides detailed analysis for improved transparency and usability, and (3) integrates user feedback to iteratively refine results. Our user study findings (n=12) highlight MAP’s effectiveness and usability for conflict resolution while emphasizing the importance of user involvement in resolution verification and failure management. This work highlights the potential of multi-agent systems to implement user-centered, multi-user personalization workflows and concludes by offering insights for personalization in multi-user contexts.

[AI-31] SafeSlice: Enabling SLA-Compliant O-RAN Slicing via Safe Deep Reinforcement Learning ICMLCN2025

Link: https://arxiv.org/abs/2503.12753
Authors: Ahmad M. Nagib,Hatem Abou-Zeid,Hossam S. Hassanein
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This article has been accepted for presentation in the IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2025

Abstract:Deep reinforcement learning (DRL)-based slicing policies have shown significant success in simulated environments but face challenges in physical systems such as open radio access networks (O-RANs) due to simulation-to-reality gaps. These policies often lack safety guarantees to ensure compliance with service level agreements (SLAs), such as the strict latency requirements of immersive applications. As a result, a deployed DRL slicing agent may make resource allocation (RA) decisions that degrade system performance, particularly in previously unseen scenarios. Real-world immersive applications require maintaining SLA constraints throughout deployment to prevent risky DRL exploration. In this paper, we propose SafeSlice to address both the cumulative (trajectory-wise) and instantaneous (state-wise) latency constraints of O-RAN slices. We incorporate the cumulative constraints by designing a sigmoid-based risk-sensitive reward function that reflects the slices’ latency requirements. Moreover, we build a supervised learning cost model as part of a safety layer that projects the slicing agent’s RA actions to the nearest safe actions, fulfilling instantaneous constraints. We conduct an exhaustive experiment that supports multiple services, including real virtual reality (VR) gaming traffic, to investigate the performance of SafeSlice under extreme and changing deployment conditions. SafeSlice achieves reductions of up to 83.23% in average cumulative latency, 93.24% in instantaneous latency violations, and 22.13% in resource consumption compared to the baselines. The results also indicate SafeSlice’s robustness to changing the threshold configurations of latency constraints, a vital deployment scenario that will be realized by the O-RAN paradigm to empower mobile network operators (MNOs).
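The safety layer described above, which projects the agent's resource allocation action to the nearest safe action, can be sketched as follows. Everything concrete here is an assumption for illustration: actions are scalar resource-block counts, the latency predictor is a toy stand-in for the paper's supervised cost model, and the fallback when no action is safe is a design choice of this sketch.

```python
def project_to_safe(proposed, candidates, predict_latency, threshold):
    """Return the candidate closest to `proposed` whose predicted latency
    stays within `threshold`; fall back to the lowest-latency candidate."""
    safe = [a for a in candidates if predict_latency(a) <= threshold]
    if not safe:
        return min(candidates, key=predict_latency)
    return min(safe, key=lambda a: abs(a - proposed))


def toy_latency_ms(resource_blocks):
    """Hypothetical cost model: more resource blocks, lower latency."""
    return 100.0 / resource_blocks


actions = list(range(1, 11))  # allocate 1..10 resource blocks
safe_action = project_to_safe(3, actions, toy_latency_ms, threshold=20.0)
unchanged = project_to_safe(7, actions, toy_latency_ms, threshold=20.0)
```

An action that is already safe passes through unchanged, so the layer only intervenes when the agent's choice is predicted to violate the instantaneous constraint.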

[AI-32] TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Link: https://arxiv.org/abs/2503.12730
Authors: Philip Quirke, Clement Neo, Abir Harrasse, Dhruv Nathawani, Amir Abdullah
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Notes: 9 pages, 19 figures, 7 tables, 18 trained models

Click to view abstract

Abstract:Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including edge attribution patching and sparse autoencoders, to identify minimal circuits and components supporting SQL generation. Our analysis reveals both the potential and limitations of current interpretability methods, showing how circuits can vary even across similar queries. Lastly, we demonstrate how mechanistic interpretability can identify flawed heuristics in models and improve synthetic dataset design. Our work provides a comprehensive framework for evaluating and advancing interpretability techniques while establishing clear boundaries for their reliable application.

[AI-33] Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

Link: https://arxiv.org/abs/2503.12721
Authors: Luca Collini, Andrew Hennessee, Ramesh Karri, Siddharth Garg
Categories: Artificial Intelligence (cs.AI)
Notes: 7 pages, submitted for peer review

Click to view abstract

Abstract:Recent Large Language Models (LLMs) such as OpenAI o3-mini and DeepSeek-R1 use enhanced reasoning through Chain-of-Thought (CoT). Their potential in hardware design, which relies on expert-driven iterative optimization, remains unexplored. This paper investigates whether reasoning LLMs can address challenges in High-Level Synthesis (HLS) design space exploration and optimization. During HLS, engineers manually define pragmas/directives to balance performance and resource constraints. We propose an agentic LLM-based optimization framework that automatically restructures code, inserts pragmas, and identifies optimal design points via feedback from HLS tools and access to integer-linear programming (ILP) solvers. Experiments compare reasoning models against conventional LLMs on benchmarks using success rate, efficiency, and design quality (area/latency) metrics, and provide the first-ever glimpse into the CoTs produced by a powerful open-source reasoning model like DeepSeek-R1.

[AI-34] AI Agents: Evolution, Architecture, and Real-World Applications

Link: https://arxiv.org/abs/2503.12687
Authors: Naveen Krishnan
Categories: Artificial Intelligence (cs.AI)
Notes: 52 pages, 4 figures, comprehensive survey and analysis of AI agent evolution, architecture, evaluation frameworks, and applications

Click to view abstract

Abstract:This paper examines the evolution, architecture, and practical applications of AI agents from their early, rule-based incarnations to modern sophisticated systems that integrate large language models with dedicated modules for perception, planning, and tool use. Emphasizing both theoretical foundations and real-world deployments, the paper reviews key agent paradigms, discusses limitations of current evaluation benchmarks, and proposes a holistic evaluation framework that balances task effectiveness, efficiency, robustness, and safety. Applications across enterprise, personal assistance, and specialized domains are analyzed, with insights into future research directions for more resilient and adaptive AI agent systems.

[AI-35] FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Link: https://arxiv.org/abs/2503.12649
Authors: Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) They are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) They struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target-merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique for existing merging methods, seamlessly integrating with them to further enhance accuracy performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with the state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models.
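
The Frank-Wolfe pattern the abstract describes (pick the candidate that minimizes a linear approximation of the objective, then take a convex step toward it) can be sketched on a toy problem. The assumptions are loud ones: models are plain lists of floats, the objective is a simple squared distance to a made-up `target` vector, and the step size is the textbook 2/(t+2) schedule. The paper's actual objective, merging step, and step-size choice are not given in the abstract and are not reproduced here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def frank_wolfe_merge(candidates, grad_fn, steps=2000):
    """Classic Frank-Wolfe over the convex hull of `candidates`.

    Each iteration selects the candidate with the smallest inner product
    with the current gradient (the linear minimization step) and moves the
    merged parameters toward it with step size 2 / (t + 2).
    """
    theta = list(candidates[0])
    for t in range(steps):
        grad = grad_fn(theta)
        best = min(candidates, key=lambda c: dot(grad, c))
        gamma = 2.0 / (t + 2.0)
        theta = [(1.0 - gamma) * th + gamma * b for th, b in zip(theta, best)]
    return theta

# Toy setup (assumption): three "fine-tuned models" and a quadratic objective
# 0.5 * ||theta - target||^2 whose minimizer lies inside their convex hull.
target = [0.25, 0.25]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

def grad_fn(theta):
    return [th - tg for th, tg in zip(theta, target)]

merged = frank_wolfe_merge(candidates, grad_fn)
```

Only the current merged vector and one selected candidate are combined per iteration, which gives some intuition for why a scheme of this shape can keep memory overhead flat as the candidate pool grows.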

[AI-36] Understanding Driver Cognition and Decision-Making Behaviors in High-Risk Scenarios: A Drift Diffusion Perspective

Link: https://arxiv.org/abs/2503.12637
Authors: Heye Huang, Zheng Li, Hao Cheng, Haoran Wang, Junkai Jiang, Xiaopeng Li, Arkady Zgonnikov
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Notes: 23 pages, 11 figures

Click to view abstract

Abstract:Ensuring safe interactions between autonomous vehicles (AVs) and human drivers in mixed traffic systems remains a major challenge, particularly in complex, high-risk scenarios. This paper presents a cognition-decision framework that integrates individual variability and commonalities in driver behavior to quantify risk cognition and model dynamic decision-making. First, a risk sensitivity model based on a multivariate Gaussian distribution is developed to characterize individual differences in risk cognition. Then, a cognitive decision-making model based on the drift diffusion model (DDM) is introduced to capture common decision-making mechanisms in high-risk environments. The DDM dynamically adjusts decision thresholds by integrating initial bias, drift rate, and boundary parameters, adapting to variations in speed, relative distance, and risk sensitivity to reflect diverse driving styles and risk preferences. By simulating high-risk scenarios with lateral, longitudinal, and multidimensional risk sources in a driving simulator, the proposed model accurately predicts cognitive responses and decision behaviors during emergency maneuvers. Specifically, by incorporating driver-specific risk sensitivity, the model enables dynamic adjustments of key DDM parameters, allowing for personalized decision-making representations in diverse scenarios. Comparative analysis with IDM, Gipps, and MOBIL demonstrates that DDM more precisely captures human cognitive processes and adaptive decision-making in high-risk scenarios. These findings provide a theoretical basis for modeling human driving behavior and offer critical insights for enhancing AV-human interaction in real-world traffic environments.
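
For readers unfamiliar with the drift diffusion model, a minimal simulation may help. It accumulates noisy evidence from a biased starting point until it crosses an upper or lower boundary, using the three quantities the abstract names (initial bias, drift rate, boundary). This is the generic textbook DDM with made-up parameter values; it does not include the paper's dependence on speed, relative distance, or driver-specific risk sensitivity, and the function name and defaults are assumptions.

```python
import random

def simulate_ddm(drift, boundary, start_bias=0.5, noise=1.0,
                 dt=0.001, max_time=5.0, rng=random):
    """Simulate one decision of a two-boundary drift diffusion process.

    Evidence starts at start_bias * boundary and evolves by
    drift * dt plus Gaussian noise scaled by sqrt(dt).
    Returns (outcome, decision_time) where outcome is
    "upper", "lower", or "timeout".
    """
    x = start_bias * boundary
    t = 0.0
    while t < max_time:
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
        if x >= boundary:
            return "upper", t
        if x <= 0.0:
            return "lower", t
    return "timeout", t

# Illustrative run with a fixed seed; the parameter values are arbitrary.
rng = random.Random(0)
outcomes = [simulate_ddm(drift=2.0, boundary=1.0, rng=rng)[0] for _ in range(200)]
upper_share = outcomes.count("upper") / len(outcomes)
```

A positive drift pushes most trials to the upper boundary, while raising `boundary` trades slower decisions for fewer noise-driven errors. That trade-off is what makes the parameters interpretable as decision style.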

[AI-37] Hybrid Learners Do Not Forget: A Brain-Inspired Neuro-Symbolic Approach to Continual Learning

Link: https://arxiv.org/abs/2503.12635
Authors: Amin Banayeeanzade, Mohammad Rostami
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Continual learning is crucial for creating AI agents that can learn and improve themselves autonomously. A primary challenge in continual learning is to learn new tasks without losing previously learned knowledge. Current continual learning methods primarily focus on enabling a neural network with mechanisms that mitigate forgetting effects. Inspired by the two distinct systems in the human brain, System 1 and System 2, we propose a Neuro-Symbolic Brain-Inspired Continual Learning (NeSyBiCL) framework that incorporates two subsystems to solve continual learning: A neural network model responsible for quickly adapting to the most recent task, together with a symbolic reasoner responsible for retaining previously acquired knowledge from previous tasks. Moreover, we design an integration mechanism between these components to facilitate knowledge transfer from the symbolic reasoner to the neural network. We also introduce two compositional continual learning benchmarks and demonstrate that NeSyBiCL is effective and leads to superior performance compared to continual learning methods that merely rely on neural architectures to address forgetting.

[AI-38] Automated Planning for Optimal Data Pipeline Instantiation

Link: https://arxiv.org/abs/2503.12626
Authors: Leonardo Rosa Amado, Adriano Vogel, Dalvan Griebler, Gabriel Paludo Licks, Eric Simon, Felipe Meneguzzi
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:

Click to view abstract

Abstract:Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires computing resources to be allocated in a data center, ideally minimizing the overhead for communicating data and executing operators in the pipeline while considering each operator’s execution requirements. In this paper, we model the problem of optimal data pipeline deployment as planning with action costs, where we propose heuristics aiming to minimize total execution time. Experimental results indicate that the heuristics can outperform the baseline deployment and that a heuristic based on connections outperforms other strategies.

[AI-39] Negotiative Alignment: Embracing Disagreement to Achieve Fairer Outcomes – Insights from Urban Studies

Link: https://arxiv.org/abs/2503.12613
Authors: Rashid Mushkani, Hugo Berard, Shin Koseki
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Notes: 16 pages, 13 figures

Click to view abstract

Abstract:Cities are not monolithic; they are arenas of negotiation among groups that hold varying needs, values, and experiences. Conventional methods of urban assessment – from standardized surveys to AI-driven evaluations – frequently rely on a single consensus metric (e.g., an average measure of inclusivity or safety). Although such aggregations simplify design decisions, they risk obscuring the distinct perspectives of marginalized populations. In this paper, we present findings from a community-centered study in Montreal involving 35 residents with diverse demographic and social identities, particularly wheelchair users, seniors, and LGBTQIA2+ individuals. Using rating and ranking tasks on 20 urban sites, we observe that disagreements are systematic rather than random, reflecting structural inequalities, differing cultural values, and personal experiences of safety and accessibility. Based on these empirical insights, we propose negotiative alignment, an AI framework that treats disagreement as an essential input to be preserved, analyzed, and addressed. Negotiative alignment builds on pluralistic models by dynamically updating stakeholder preferences through multi-agent negotiation mechanisms, ensuring no single perspective is marginalized. We outline how this framework can be integrated into urban analytics – and other decision-making contexts – to retain minority viewpoints, adapt to changing stakeholder concerns, and enhance fairness and accountability. The study demonstrates that preserving and engaging with disagreement, rather than striving for an artificial consensus, can produce more equitable and responsive AI-driven outcomes in urban design. 

[AI-40] HyConEx: Hypernetwork classifier with counterfactual explanations

Link: https://arxiv.org/abs/2503.12525
Authors: Patryk Marszałek, Ulvi Movsum-zada, Oleksii Furman, Kamil Książek, Przemysław Spurek, Marek Śmieja
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:In recent years, there has been a growing interest in explainable AI methods. We want not only to make accurate predictions using sophisticated neural networks but also to understand what the model’s decision is based on. One of the fundamental levels of interpretability is to provide counterfactual examples explaining the rationale behind the decision and identifying which features, and to what extent, must be modified to alter the model’s outcome. To address these requirements, we introduce HyConEx, a classification model based on deep hypernetworks specifically designed for tabular data. Owing to its unique architecture, HyConEx not only provides class predictions but also delivers local interpretations for individual data samples in the form of counterfactual examples that steer a given sample toward an alternative class. While many explainable methods generated counterfactuals for external models, there have been no interpretable classifiers simultaneously producing counterfactual samples so far. HyConEx achieves competitive performance on several metrics assessing classification accuracy and fulfilling the criteria of a proper counterfactual attack. This makes HyConEx a distinctive deep learning model, which combines predictions and explainers as an all-in-one neural network. The code is available at this https URL.

[AI-41] LLM-Driven Multi-step Translation from C to Rust using Static Analysis

Link: https://arxiv.org/abs/2503.12511
Authors: Tianyang Zhou, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Notes: 22 pages, 13 figures

Click to view abstract

Abstract:Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety while maintaining high performance. However, manual translation is cumbersome, error-prone, and produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees as they lack the ability to capture all the semantics differences between the source and target languages. To resolve this issue, we propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology: an “unidiomatic” step to translate C into Rust while preserving semantics, and an “idiomatic” step to refine the code to follow Rust’s semantic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing model (DeepSeek-R1) reaching 93% and (GPT-4o, Claude 3.5, DeepSeek-R1) reaching 84% correctness (on each dataset, respectively), while producing more natural and Rust-compliant translations compared to existing methods.

[AI-42] A General Close-loop Predictive Coding Framework for Auditory Working Memory

Link: https://arxiv.org/abs/2503.12506
Authors: Zhongju Yuan, Geraint Wiggins, Dick Botteldooren
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Notes:

Click to view abstract

Abstract:Auditory working memory is essential for various daily activities, such as language acquisition and conversation. It involves the temporary storage and manipulation of information that is no longer present in the environment. While extensively studied in neuroscience and cognitive science, research on its modeling within neural networks remains limited. To address this gap, we propose a general framework based on a close-loop predictive coding paradigm to perform short auditory signal memory tasks. The framework is evaluated on two widely used benchmark datasets for environmental sound and speech, demonstrating high semantic similarity across both datasets.

[AI-43] Facilitating Automated Online Consensus Building through Parallel Thinking

Link: https://arxiv.org/abs/2503.12499
Authors: Wen Gu, Zhaoxing Li, Jan Buermann, Jim Dilkes, Dimitris Michailidis, Shinobu Hasegawa, Vahid Yazdanpanah, Sebastian Stein
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Consensus building is inherently challenging due to the diverse opinions held by stakeholders. Effective facilitation is crucial to support the consensus building process and enable efficient group decision making. However, the effectiveness of facilitation is often constrained by human factors such as limited experience and scalability. In this research, we propose a Parallel Thinking-based Facilitation Agent (PTFA) that facilitates online, text-based consensus building processes. The PTFA automatically collects textual posts and leverages large language models (LLMs) to perform all of the six distinct roles of the well-established Six Thinking Hats technique in parallel thinking. To illustrate the potential of PTFA, a pilot study was carried out and PTFA’s ability in idea generation, emotional probing, and deeper analysis of ideas was demonstrated. Furthermore, a comprehensive dataset that contains not only the conversational content among the participants but also between the participants and the agent is constructed for future study.

[AI-44] Defense Against Model Stealing Based on Account-Aware Distribution Discrepancy AAAI2025

Link: https://arxiv.org/abs/2503.12497
Authors: Jian-Ping Mei, Weibin Zhang, Jie Chen, Xuyun Zhang, Tiantian Zhu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Notes: 11 pages, 7 figures, published in AAAI 2025

Click to view abstract

Abstract:Malicious users attempt to replicate commercial models functionally at low cost by training a clone model with query responses. It is challenging to timely prevent such model-stealing attacks to achieve strong protection and maintain utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) to recognize queries from malicious users by leveraging account-wise local dependency. We formulate each class as a Multivariate Normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancy. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Results of extensive experimental studies show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users for both soft and hard-label settings.

[AI-45] KDSelector: A Knowledge-Enhanced and Data-Efficient Model Selector Learning Framework for Time Series Anomaly Detection SIGMOD2025

Link: https://arxiv.org/abs/2503.12478
Authors: Zhiyu Liang, Dongrui Cai, Chenyuan Zhang, Zheng Liang, Chen Liang, Bo Zheng, Shi Qiu, Jin Wang, Hongzhi Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Notes: This paper has been accepted by SIGMOD 2025

Click to view abstract

Abstract:Model selection has been raised as an essential problem in the area of time series anomaly detection (TSAD), because there is no single best TSAD model for the highly heterogeneous time series in real-world applications. However, despite the success of existing model selection solutions that train a classification model (especially neural network, NN) using historical data as a selector to predict the correct TSAD model for each series, the NN-based selector learning methods used by existing solutions do not make full use of the knowledge in the historical data and require iterating over all training samples, which limits the accuracy and training speed of the selector. To address these limitations, we propose KDSelector, a novel knowledge-enhanced and data-efficient framework for learning the NN-based TSAD model selector, of which three key components are specifically designed to integrate available knowledge into the selector and dynamically prune less important and redundant samples during the learning. We develop a TSAD model selection system with KDSelector as the internal, to demonstrate how users improve the accuracy and training speed of their selectors by using KDSelector as a plug-and-play module. Our demonstration video is hosted at this https URL.

[AI-46] A Survey on the Optimization of Large Language Model-based Agents

Link: https://arxiv.org/abs/2503.12434
Authors: Shangheng Du, Jiabao Zhao, Jinxin Shi, Zhentao Xie, Xin Jiang, Yanhong Bai, Liang He
Categories: Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:With the rapid development of Large Language Models (LLMs), LLM-based agents have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks. However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs, which often leads to limited effectiveness or suboptimal performance in complex agent-related environments. Although LLM optimization techniques can improve model performance across many general tasks, they lack specialized optimization towards critical agent functionalities such as long-term planning, dynamic environmental interaction, and complex decision-making. Although numerous recent studies have explored various strategies to optimize LLM-based agents for complex agent tasks, a systematic review summarizing and comparing these methods from a holistic perspective is still lacking. In this survey, we provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods. We first focus on parameter-driven optimization, covering fine-tuning-based optimization, reinforcement learning-based optimization, and hybrid strategies, analyzing key aspects such as trajectory data construction, fine-tuning techniques, reward function design, and optimization algorithms. Additionally, we briefly discuss parameter-free strategies that optimize agent behavior through prompt engineering and external knowledge retrieval. Finally, we summarize the datasets and benchmarks used for evaluation and tuning, review key applications of LLM-based agents, and discuss major challenges and promising future directions. Our repository for related references is available at this https URL.

[AI-47] Towards Learnable Anchor for Deep Multi-View Clustering AAAI25

Link: https://arxiv.org/abs/2503.12427
Authors: Bocheng Wang, Chusheng Zeng, Mulin Chen, Xuelong Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted by AAAI25

Click to view abstract

Abstract:Deep multi-view clustering incorporating graph learning has presented tremendous potential. Most methods encounter costly square time consumption w.r.t. data size. Theoretically, anchor-based graph learning can alleviate this limitation, but related deep models mainly rely on manual discretization approaches to select anchors, which indicates that 1) the anchors are fixed during model training and 2) they may deviate from the true cluster distribution. Consequently, the unreliable anchors may corrupt clustering results. In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model that performs clustering in linear time. Concretely, the initial anchors are intervened by the positive-incentive noise sampled from Gaussian distribution, such that they can be optimized with a newly designed anchor learning loss, which promotes a clear relationship between samples and anchors. Afterwards, anchor graph convolution is devised to model the cluster structure formed by the anchors, and the mutual information maximization loss is built to provide cross-view clustering guidance. In this way, the learned anchors can better represent clusters. With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering. Extensive experiments on several datasets demonstrate the superior performance and efficiency of DMAC compared to state-of-the-art competitors.

[AI-48] Bio-Inspired Plastic Neural Networks for Zero-Shot Out-of-Distribution Generalization in Complex Animal-Inspired Robots

Link: https://arxiv.org/abs/2503.12406
Authors: Binggwong Leung, Worasuchad Haomachai, Joachim Winther Pedersen, Sebastian Risi, Poramate Manoonpong
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Artificial neural networks can be used to solve a variety of robotic tasks. However, they risk failing catastrophically when faced with out-of-distribution (OOD) situations. Several approaches have employed a type of synaptic plasticity known as Hebbian learning that can dynamically adjust weights based on local neural activities. Research has shown that synaptic plasticity can make policies more robust and help them adapt to unforeseen changes in the environment. However, networks augmented with Hebbian learning can lead to weight divergence, resulting in network instability. Furthermore, such Hebbian networks have not yet been applied to solve legged locomotion in complex real robots with many degrees of freedom. In this work, we improve the Hebbian network with a weight normalization mechanism for preventing weight divergence, analyze the principal components of the Hebbian’s weights, and perform a thorough evaluation of network performance in locomotion control for real 18-DOF dung beetle-like and 16-DOF gecko-like robots. We find that the Hebbian-based plastic network can execute zero-shot sim-to-real adaptation locomotion and generalize to unseen conditions, such as uneven terrain and morphological damage.
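
The weight-divergence problem and the idea of normalizing it away can be shown with a small sketch. It uses the plain Hebbian rule (weight change proportional to the product of pre- and post-synaptic activity) and a simple per-row L2 norm cap. Both are assumptions chosen for brevity: the abstract does not specify the paper's plasticity rule or its normalization mechanism, and the network size, learning rate, and inputs below are made up.

```python
import math

def forward(weights, pre):
    """One tanh layer: post-synaptic activity for each output neuron."""
    return [math.tanh(sum(w * x for w, x in zip(row, pre))) for row in weights]

def hebbian_step(weights, pre, post, lr=0.1):
    """Plain Hebbian update: w[i][j] += lr * post[i] * pre[j]."""
    return [[w + lr * post[i] * pre[j] for j, w in enumerate(row)]
            for i, row in enumerate(weights)]

def normalize_rows(weights, max_norm=1.0):
    """Rescale any row whose L2 norm exceeds max_norm (assumed normalization scheme)."""
    normalized = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max_norm / norm if norm > max_norm else 1.0
        normalized.append([w * scale for w in row])
    return normalized

weights = [[0.1, -0.2, 0.05], [0.0, 0.3, -0.1]]
pre = [1.0, 0.5, -0.5]
for _ in range(1000):
    post = forward(weights, pre)
    weights = normalize_rows(hebbian_step(weights, pre, post))

row_norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
```

Without the `normalize_rows` call, repeated updates on a correlated input keep growing the weights; with it, each neuron's incoming weight vector stays within the chosen norm while its direction can still adapt.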

[AI-49] FedGAI: Federated Style Learning with Cloud-Edge Collaboration for Generative AI in Fashion Design

Link: https://arxiv.org/abs/2503.12389
Authors: Mingzhu Wu, Jianan Jiang, Xinglin Li, Hanhui Deng, Di Wu
Categories: Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:Collaboration can amalgamate diverse ideas, styles, and visual elements, fostering creativity and innovation among different designers. In collaborative design, sketches play a pivotal role as a means of expressing design creativity. However, designers often tend to not openly share these meticulously crafted sketches. This phenomenon of data island in the design area hinders its digital transformation under the third wave of AI. In this paper, we introduce a Federated Generative Artificial Intelligence Clothing system, namely FedGAI, employing federated learning to aid in sketch design. FedGAI is committed to establishing an ecosystem wherein designers can exchange sketch styles among themselves. Through FedGAI, designers can generate sketches that incorporate various designers’ styles from their peers, drawing inspiration from collaboration without the need for data disclosure or upload. Extensive performance evaluations indicate that our FedGAI system can produce multi-styled sketches of comparable quality to human-designed ones while significantly enhancing efficiency compared to hand-drawn sketches.

[AI-50] Unveiling Pitfalls: Understanding Why AI-driven Code Agents Fail at GitHub Issue Resolution

Link: https://arxiv.org/abs/2503.12374
Authors: Zhi Chen, Wei Ma, Lingxiao Jiang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just generation of final code; they engage in multi-step reasoning, utilize various tools for code modification and debugging, and interact with execution environments to diagnose and iteratively resolve issues. However, most existing evaluations focus primarily on static analyses of final code outputs, yielding limited insights into the agents’ dynamic problem-solving processes. To fill this gap, we conduct an in-depth empirical study on 3,977 solving-phase trajectories and 3,931 testing-phase logs from 8 top-ranked agents evaluated on 500 GitHub issues in the SWE-Bench benchmark. Our exploratory analysis shows that Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors – such as ModuleNotFoundError and TypeError – and highlighted particularly challenging errors like OSError and database-related issues (e.g., IntegrityError) that demand significantly more debugging effort. Furthermore, we have discovered 3 bugs in the SWE-Bench platform that affect benchmark fairness and accuracy; these issues have been reported to and confirmed by the maintainers. To promote transparency and foster future research, we publicly share our datasets and analysis scripts.
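
The kind of tally behind "most prevalent errors" can be sketched as a count of Python exception names in a log. The log text, the regular expression, and the function name below are all hypothetical; the study's actual trajectory format and analysis scripts are not described in the abstract.

```python
import re
from collections import Counter

# Matches identifiers such as ModuleNotFoundError, TypeError, or KeyboardException.
EXCEPTION_PATTERN = re.compile(r"\b([A-Z][A-Za-z]*(?:Error|Exception))\b")

def count_exceptions(log_text):
    """Count how often each exception class name appears in a log."""
    return Counter(EXCEPTION_PATTERN.findall(log_text))

# Hypothetical excerpt of an agent's execution log.
sample_log = """\
Traceback (most recent call last):
  File "run.py", line 3, in <module>
ModuleNotFoundError: No module named 'foo'
Traceback (most recent call last):
  File "run.py", line 9, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Traceback (most recent call last):
  File "run.py", line 1, in <module>
ModuleNotFoundError: No module named 'bar'
"""

error_counts = count_exceptions(sample_log)
most_common_error = error_counts.most_common(1)[0][0]
```

Joining such counts with per-issue resolution outcomes would be the next step toward a correlation like the one the abstract reports, though that join depends on data fields not shown here.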

[AI-51] Synthetic Data for Robust AI Model Development in Regulated Enterprises

Link: https://arxiv.org/abs/2503.12353
Authors: Aditi Godbole
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Notes:

Click to view abstract

Abstract:In today’s business landscape, organizations need to find the right balance between using their customers’ data ethically to power AI solutions and being compliant regarding data privacy and data usage regulations. In this paper, we discuss synthetic data as a possible solution to this dilemma. Synthetic data is simulated data that mimics the real data. We explore how organizations in heavily regulated industries, such as financial institutions or healthcare organizations, can leverage synthetic data to build robust AI solutions while staying compliant. We demonstrate that synthetic data offers two significant advantages by allowing AI models to learn from more diverse data and by helping organizations stay compliant against data privacy laws with the use of synthetic data instead of customer information. We discuss case studies to show how synthetic data can be effectively used in the finance and healthcare sector while discussing the challenges of using synthetic data and some ethical questions it raises. Our research finds that synthetic data could be a game-changer for AI in regulated industries. The potential can be realized when industry, academia, and regulators collaborate to build solutions. We aim to initiate discussions on the use of synthetic data to build ethical, responsible, and effective AI systems in regulated enterprise industries.

[AI-52] SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

Link: https://arxiv.org/abs/2503.12349
Authors: Jianzhu Yao,Kevin Wang,Ryan Hsieh,Haisu Zhou,Tianqing Zou,Zerui Cheng,Zhangyang Wang,Pramod Viswanath
Subjects: Artificial Intelligence (cs.AI)
Comments: 51 pages, 7 figures

Abstract:Reasoning and strategic behavior in social interactions is a hallmark of intelligence. This form of reasoning is significantly more sophisticated than isolated planning or reasoning tasks in static settings (e.g., math problem solving). In this paper, we present Strategic Planning, Interaction, and Negotiation (SPIN-Bench), a new multi-domain evaluation designed to measure the intelligence of strategic planning and social reasoning. While many existing benchmarks focus on narrow planning or single-agent reasoning, SPIN-Bench combines classical PDDL tasks, competitive board games, cooperative card games, and multi-agent negotiation scenarios in one unified framework. The framework includes both a benchmark and an arena to simulate and evaluate the variety of social settings to test reasoning and strategic behavior of AI agents. We formulate the benchmark SPIN-Bench by systematically varying action spaces, state complexity, and the number of interacting agents to simulate a variety of social settings where success depends on not only methodical and step-wise decision making, but also conceptual inference of other (adversarial or cooperative) participants. Our experiments reveal that while contemporary LLMs handle basic fact retrieval and short-range planning reasonably well, they encounter significant performance bottlenecks in tasks requiring deep multi-hop reasoning over large state spaces and socially adept coordination under uncertainty. We envision SPIN-Bench as a catalyst for future research on robust multi-agent planning, social reasoning, and human–AI teaming.

[AI-53] Augmented Adversarial Trigger Learning

Link: https://arxiv.org/abs/2503.12339
Authors: Zhe Wang,Yanjun Qi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Gradient optimization-based adversarial attack methods automate the learning of adversarial triggers to generate jailbreak prompts or leak system prompts. In this work, we take a closer look at the optimization objective of adversarial trigger learning and propose ATLA: Adversarial Trigger Learning with Augmented objectives. ATLA improves the negative log-likelihood loss used by previous studies into a weighted loss formulation that encourages the learned adversarial triggers to optimize more towards response format tokens. This enables ATLA to learn an adversarial trigger from just one query-response pair, and the learned trigger generalizes well to other similar queries. We further design a variation to augment trigger optimization with an auxiliary loss that suppresses evasive responses. We showcase how to use ATLA to learn adversarial suffixes jailbreaking LLMs and to extract hidden system prompts. Empirically, we demonstrate that ATLA consistently outperforms current state-of-the-art techniques, achieving nearly 100% success in attacking while requiring 80% fewer queries. ATLA-learned jailbreak suffixes demonstrate high generalization to unseen queries and transfer well to new LLMs.

[AI-54] A Transformer-based survival model for prediction of all-cause mortality in heart failure patients: a multi-cohort study

Link: https://arxiv.org/abs/2503.12317
Authors: Shishir Rao,Nouman Ahmed,Gholamreza Salimi-Khorshidi,Christopher Yau,Huimin Su,Nathalie Conrad,Folkert W Asselbergs,Mark Woodward,Rod Jackson,John GF Cleland,Kazem Rahimi
Subjects: Artificial Intelligence (cs.AI)

Abstract:We developed and validated TRisk, a Transformer-based AI model predicting 36-month mortality in heart failure patients by analysing temporal patient journeys from UK electronic health records (EHR). Our study included 403,534 heart failure patients (ages 40-90) from 1,418 English general practices, with 1,063 practices for model derivation and 355 for external validation. TRisk was compared against the MAGGIC-EHR model across various patient subgroups. With median follow-up of 9 months, TRisk achieved a concordance index of 0.845 (95% confidence interval: [0.841, 0.849]), significantly outperforming MAGGIC-EHR’s 0.728 (0.723, 0.733) for predicting 36-month all-cause mortality. TRisk showed more consistent performance across sex, age, and baseline characteristics, suggesting less bias. We successfully adapted TRisk to US hospital data through transfer learning, achieving a C-index of 0.802 (0.789, 0.816) with 21,767 patients. Explainability analyses revealed TRisk captured established risk factors while identifying underappreciated predictors like cancers and hepatic failure that were important across both cohorts. Notably, cancers maintained strong prognostic value even a decade after diagnosis. TRisk demonstrated well-calibrated mortality prediction across both healthcare systems. Our findings highlight the value of tracking longitudinal health profiles and revealed risk factors not included in previous expert-driven models.
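
The concordance index (C-index) reported above measures how well predicted risks rank patients by survival time. The sketch below is a minimal, pure-Python version of Harrell's C-index, included only to illustrate the metric. It is not the authors' evaluation code; the function name `concordance_index` and the toy inputs are assumptions, and pairs with tied times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    times  -- observed follow-up time per subject
    events -- 1/True if the event (death) was observed, 0/False if censored
    risks  -- predicted risk score per subject (higher = earlier event expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification: skip tied times; skip pairs whose earlier subject
        # was censored, since their true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # higher risk assigned to the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the 0.845 and 0.728 figures are reported.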

[AI-55] Bi-Criteria Optimization for Combinatorial Bandits: Sublinear Regret and Constraint Violation under Bandit Feedback

Link: https://arxiv.org/abs/2503.12285
Authors: Vaneet Aggarwal,Shweta Jain,Subham Pokhriyal,Christopher John Quinn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY); Machine Learning (stat.ML)

Abstract:In this paper, we study bi-criteria optimization for combinatorial multi-armed bandits (CMAB) with bandit feedback. We propose a general framework that transforms discrete bi-criteria offline approximation algorithms into online algorithms with sublinear regret and cumulative constraint violation (CCV) guarantees. Our framework requires the offline algorithm to provide an $(\alpha, \beta)$-bi-criteria approximation ratio with $\delta$-resilience and utilize $\mathtt{N}$ oracle calls to evaluate the objective and constraint functions. We prove that the proposed framework achieves sub-linear regret and CCV, with both bounds scaling as $O\left(\delta^{2/3} \mathtt{N}^{1/3} T^{2/3} \log^{1/3}(T)\right)$. Crucially, the framework treats the offline algorithm with $\delta$-resilience as a black box, enabling flexible integration of existing approximation algorithms into the CMAB setting. To demonstrate its versatility, we apply our framework to several combinatorial problems, including submodular cover, submodular cost covering, and fair submodular maximization. These applications highlight the framework’s broad utility in adapting offline guarantees to online bi-criteria optimization under bandit feedback.

[AI-56] Toward Foundation Models for Online Complex Event Detection in CPS-IoT: A Case Study

Link: https://arxiv.org/abs/2503.12282
Authors: Liying Han,Gaofeng Dong,Xiaomin Ouyang,Lance Kaplan,Federico Cerutti,Mani Srivastava
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Complex events (CEs) play a crucial role in CPS-IoT applications, enabling high-level decision-making in domains such as smart monitoring and autonomous systems. However, most existing models focus on short-span perception tasks, lacking the long-term reasoning required for CE detection. CEs consist of sequences of short-time atomic events (AEs) governed by spatiotemporal dependencies. Detecting them is difficult due to long, noisy sensor data and the challenge of filtering out irrelevant AEs while capturing meaningful patterns. This work explores CE detection as a case study for CPS-IoT foundation models capable of long-term reasoning. We evaluate three approaches: (1) leveraging large language models (LLMs), (2) employing various neural architectures that learn CE rules from data, and (3) adopting a neurosymbolic approach that integrates neural models with symbolic engines embedding human knowledge. Our results show that the state-space model, Mamba, which belongs to the second category, outperforms all methods in accuracy and generalization to longer, unseen sensor traces. These findings suggest that state-space models could be a strong backbone for CPS-IoT foundation models for long-span reasoning tasks.

[AI-57] Agentic Search Engine for Real-Time IoT Data

Link: https://arxiv.org/abs/2503.12255
Authors: Abdelrahman Elewah,Khalid Elgazzar
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:The Internet of Things (IoT) has enabled diverse devices to communicate over the Internet, yet the fragmentation of IoT systems limits seamless data sharing and coordinated management. We have recently introduced SensorsConnect, a unified framework to enable seamless content and sensor data sharing in collaborative IoT systems, inspired by how the World Wide Web (WWW) enabled a shared and accessible space for information among humans. This paper presents the IoT Agentic Search Engine (IoT-ASE), a real-time search engine tailored for IoT environments. IoT-ASE leverages Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) techniques to address the challenge of searching vast, real-time IoT data, enabling it to handle complex queries and deliver accurate, contextually relevant results. We implemented a use-case scenario in Toronto to demonstrate how IoT-ASE can improve service quality recommendations by leveraging real-time IoT data. Our evaluation shows that IoT-ASE achieves 92% accuracy in retrieving intent-based services and produces responses that are concise, relevant, and context-aware, outperforming generalized responses from systems like Gemini. These findings highlight the potential of IoT-ASE to make real-time IoT data accessible and support effective, real-time decision-making.

[AI-58] GenOSIL: Generalized Optimal and Safe Robot Control using Parameter-Conditioned Imitation Learning

Link: https://arxiv.org/abs/2503.12243
Authors: Mumuksh Tayal,Manan Tayal,Ravi Prakash
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures

Abstract:Ensuring safe and generalizable control remains a fundamental challenge in robotics, particularly when deploying imitation learning in dynamic environments. Traditional behavior cloning (BC) struggles to generalize beyond its training distribution, as it lacks an understanding of the safety-critical reasoning behind expert demonstrations. To address this limitation, we propose GenOSIL, a novel imitation learning framework that explicitly incorporates environment parameters into policy learning via a structured latent representation. Unlike conventional methods that treat the environment as a black box, GenOSIL employs a variational autoencoder (VAE) to encode measurable safety parameters such as obstacle position, velocity, and geometry into a latent space that captures intrinsic correlations between expert behavior and environmental constraints. This enables the policy to infer the rationale behind expert trajectories rather than merely replicating them. We validate our approach on two robotic platforms, an autonomous ground vehicle and a Franka Emika Panda manipulator, demonstrating superior safety and goal-reaching performance compared to baseline methods. The simulation and hardware videos can be viewed on the project webpage: this https URL.

[AI-59] A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis

Link: https://arxiv.org/abs/2503.12239
Authors: Soufiane Bacha,Huansheng Ning,Belarbi Mostefa,Doreen Sebastian Sarwatt,Sahraoui Dhelim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:Accurate illness diagnosis is vital for effective treatment and patient safety. Machine learning models are widely used for cancer diagnosis based on historical medical data. However, data imbalance remains a major challenge, hindering classifier performance and reliability. The SMOTEBoost method addresses this issue by generating synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary and can produce noisy samples. This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations. Firstly, RE-SMOTEBoost focuses on generating synthetic samples in overlapping regions to better capture the decision boundary using roulette wheel selection. Secondly, it incorporates a filtering mechanism based on information entropy to reduce noise and borderline cases and improve the quality of generated data. Thirdly, we introduce a double regularization penalty to control the synthetic samples' proximity to the decision boundary and avoid class overlap. These enhancements enable higher-quality oversampling of the minority class, resulting in a more balanced and effective training dataset. The proposed method outperforms existing state-of-the-art techniques when evaluated on imbalanced datasets. Compared to the top-performing sampling algorithms, RE-SMOTEBoost demonstrates a notable improvement of 3.22% in accuracy and a variance reduction of 88.8%. These results indicate that the proposed model offers a solid solution for medical settings, effectively overcoming data scarcity and severe imbalance caused by limited samples, data collection difficulties, and privacy constraints.
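
Roulette wheel selection, mentioned in the abstract above, picks items with probability proportional to a non-negative weight. The sketch below shows only the generic technique. The abstract does not say how RE-SMOTEBoost scores minority samples, so the `weights` argument (for example, a per-sample "overlap" score) and the function name `roulette_wheel_select` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def roulette_wheel_select(weights, k, rng=None):
    """Draw k indices with replacement, each with probability weight / sum(weights).

    `weights` is a hypothetical per-sample score; how such scores would be
    computed in RE-SMOTEBoost is not specified in the abstract.
    """
    rng = rng or random.Random()
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    cumulative = list(accumulate(weights))  # right edges of the wheel slices
    if not cumulative or cumulative[-1] <= 0:
        raise ValueError("at least one weight must be positive")
    total = cumulative[-1]
    picks = []
    for _ in range(k):
        r = rng.random() * total  # spin the wheel: a point in [0, total)
        idx = bisect_right(cumulative, r)  # slice that contains the point
        picks.append(min(idx, len(weights) - 1))  # guard against float rounding
    return picks
```

Python's standard library offers the same weighted sampling with replacement through `random.choices(population, weights=..., k=...)`; the explicit version is shown only to make the cumulative-sum mechanism visible.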

[AI-60] Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments

Link: https://arxiv.org/abs/2503.12228
Authors: Yihong Jin,Ze Yang,Xinhe Xu,Yihan Zhang,Shuyang Ji
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE ICCEA 2025

Abstract:With the rapid evolution of Large Language Models (LLMs) and their large-scale experimentation in cloud-computing spaces, the challenge of guaranteeing their security and efficiency in a failure scenario has become a main issue. To ensure the reliability and availability of large-scale language models in cloud computing scenarios, such as frequent resource failures, network problems, and computational overheads, this study proposes a novel adaptive fault tolerance mechanism. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and prediction of failure based on real-time performance metrics. The hybrid model integrates a data-driven, deep learning-based anomaly detection technique, underlining the contribution of cloud orchestration middleware for predictive prevention of system failures. Additionally, the model integrates adaptive checkpointing and recovery strategies that dynamically adapt according to load and system state to minimize the influence on the performance of the model and minimize downtime. The experimental results demonstrate that the designed model considerably enhances fault tolerance in large-scale cloud environments, decreases system downtime by 30%, and has better modeling availability than the classical fault tolerance mechanism.

[AI-61] Research on Large Language Model Cross-Cloud Privacy Protection and Collaborative Training based on Federated Learning

Link: https://arxiv.org/abs/2503.12226
Authors: Ze Yang,Yihong Jin,Yihan Zhang,Juntian Liu,Xinhe Xu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE AINIT 2025

Abstract:The fast development of large language models (LLMs) and popularization of cloud computing have led to increasing concerns on privacy safeguarding and data security of cross-cloud model deployment and training as the key challenges. We present a new framework for addressing these issues along with enabling privacy preserving collaboration on training between distributed clouds based on federated learning. Our mechanism encompasses cutting-edge cryptographic primitives, dynamic model aggregation techniques, and cross-cloud data harmonization solutions to enhance security, efficiency, and scalability to the traditional federated learning paradigm. Furthermore, we proposed a hybrid aggregation scheme to mitigate the threat of Data Leakage and to optimize the aggregation of model updates, thus achieving substantial enhancement on the model effectiveness and stability. Experimental results demonstrate that the training efficiency, privacy protection, and model accuracy of the proposed model compare favorably to those of the traditional federated learning method.

[AI-62] Evaluation-Time Policy Switching for Offline Reinforcement Learning AAMAS2025

Link: https://arxiv.org/abs/2503.12222
Authors: Natinael Solomon Neggatu,Jeremie Houssineau,Giovanni Montana
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Abstract:Offline reinforcement learning (RL) looks at learning how to optimally solve tasks using a fixed dataset of interactions from the environment. Many off-policy algorithms developed for online learning struggle in the offline setting as they tend to over-estimate the behaviour of out-of-distribution actions. Existing offline RL algorithms adapt off-policy algorithms, employing techniques such as constraining the policy or modifying the value function to achieve good performance on individual datasets, but struggle to adapt to different tasks or datasets of different qualities without tuning hyper-parameters. We introduce a policy switching technique that dynamically combines the behaviour of a pure off-policy RL agent, for improving behaviour, and a behavioural cloning (BC) agent, for staying close to the data. We achieve this by using a combination of epistemic uncertainty, quantified by our RL model, and a metric for aleatoric uncertainty extracted from the dataset. We show empirically that our policy switching technique can not only outperform the individual algorithms used in the switching process but also compete with state-of-the-art methods on numerous benchmarks. Our use of epistemic uncertainty for policy switching also allows us to naturally extend our method to the domain of offline-to-online fine-tuning, allowing our model to adapt quickly and safely from online data, either matching or exceeding the performance of current methods that typically require additional modification or hyper-parameter fine-tuning.

[AI-63] Changing Base Without Losing Pace: A GPU-Efficient Alternative to MatMul in DNNs

Link: https://arxiv.org/abs/2503.12211
Authors: Nir Ailon,Akhiad Bercovich,Omri Weinstein
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)

Abstract:We propose a cheaper alternative bilinear operator to matrix-multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which does not decrease (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy-loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially fewer FLOPs to evaluate. We term this new operator Strassen-Tile (STL). The main idea behind STL(X,W) is a local change-of-basis (learnable encoder) on weights and activation tiles, after which we perform batched elementwise products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first one is SoTA T2T-ViT on Imagenet-1K. Here we show that replacing all linear layers with STL and training from scratch results in a factor x2.7 reduction in FLOPs with a 0.5 accuracy improvement. Our second speed-accuracy comparison benchmark for pretrained LLMs is the most practical GPU-acceleration technique, 2:4 structured sparsity. Finetuning TinyLlama with STL layers on the Slim Pajama dataset achieves similar accuracy to 2:4, with x2.2 FLOP speedup compared to x1.7 of the latter. Finally, we discuss a group-theoretic approach for discovering universal encoders for STL, which could lead to fast black-box acceleration via approximate matrix-multiplication (AMM).
Cite as: arXiv:2503.12211 [cs.LG], https://doi.org/10.48550/arXiv.2503.12211. Submitted by Omri Weinstein, [v1] Sat, 15 Mar 2025 17:31:36 UTC (603 KB).

[AI-64] Value Gradients with Action Adaptive Search Trees in Continuous (PO)MDPs

Link: https://arxiv.org/abs/2503.12181
Authors: Idan Lev-Yehudi,Michael Novitsky,Moran Barenboim,Ron Benchetrit,Vadim Indelman
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Solving Partially Observable Markov Decision Processes (POMDPs) in continuous state, action and observation spaces is key for autonomous planning in many real-world mobility and robotics applications. Current approaches are mostly sample based, and cannot hope to reach near-optimal solutions in reasonable time. We propose two complementary theoretical contributions. First, we formulate a novel Multiple Importance Sampling (MIS) tree for value estimation that allows value information to be shared between sibling action branches. The novel MIS tree supports action updates during search time, such as gradient-based updates. Second, we propose a novel methodology to compute value gradients with online sampling based on transition likelihoods. It is applicable to MDPs, and we extend it to POMDPs via particle beliefs with the application of the propagated belief trick. The gradient estimator is computed in practice using the MIS tree with efficient Monte Carlo sampling. These two parts are combined into a new planning algorithm, Action Gradient Monte Carlo Tree Search (AGMCTS). We demonstrate in a simulated environment its applicability and its advantages over continuous online POMDP solvers that rely solely on sampling, and we discuss further implications.

[AI-65] Probabilistic Graph Circuits: Deep Generative Models for Tractable Probabilistic Inference over Graphs

Link: https://arxiv.org/abs/2503.12162
Authors: Milan Papež,Martin Rektoris,Václav Šmídl,Tomáš Pevný
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Deep generative models (DGMs) have recently demonstrated remarkable success in capturing complex probability distributions over graphs. Although their excellent performance is attributed to powerful and scalable deep neural networks, it is, at the same time, exactly the presence of these highly non-linear transformations that makes DGMs intractable. Indeed, despite representing probability distributions, intractable DGMs deny probabilistic foundations by their inability to answer even the most basic inference queries without approximations or design choices specific to a very narrow range of queries. To address this limitation, we propose probabilistic graph circuits (PGCs), a framework of tractable DGMs that provide exact and efficient probabilistic inference over (arbitrary parts of) graphs. Nonetheless, achieving both exactness and efficiency is challenging in the permutation-invariant setting of graphs. We design PGCs that are inherently invariant and satisfy these two requirements, yet at the cost of low expressive power. Therefore, we investigate two alternative strategies to achieve the invariance: the first sacrifices the efficiency, and the second sacrifices the exactness. We demonstrate that ignoring the permutation invariance can have severe consequences in anomaly detection, and that the latter approach is competitive with, and sometimes better than, existing intractable DGMs in the context of molecular graph generation.

[AI-66] Aristotle's Original Idea: For and Against Logic in the era of AI

Link: https://arxiv.org/abs/2503.12161
Authors: Antonis C. Kakas
Subjects: Artificial Intelligence (cs.AI)
Comments: 40 pages

Abstract:Aristotle is generally accepted as the father of logic. The ideas that he raised in his study of logical reasoning carried the development of science over the centuries. Today, in the era of AI, this title of the fatherhood of logic has a renewed significance. Behind it lies his original idea that human reasoning could be studied as a process and that perhaps there exist universal systems of reasoning that underlie all human reasoning irrespective of the content of what we are reasoning about. In this article, we look into Aristotle’s work on human thought, his work on reasoning itself but also on how it relates to science and human endeavor more generally, from a modern perspective of Artificial Intelligence, and ask if this can help enlighten our understanding of AI and Science more generally.

[AI-67] Weighted Graph Structure Learning with Attention Denoising for Node Classification

Link: https://arxiv.org/abs/2503.12157
Authors: Tingting Wang,Jiaxin Su,Haobing Liu,Ruobing Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures, 2 tables

Abstract:Node classification in graphs aims to predict the categories of unlabeled nodes by utilizing a small set of labeled nodes. However, weighted graphs often contain noisy edges and anomalous edge weights, which can distort fine-grained relationships between nodes and hinder accurate classification. We propose the Edge Weight-aware Graph Structure Learning (EWGSL) method, which combines weight learning and graph structure learning to address these issues. EWGSL improves node classification by redefining attention coefficients in graph attention networks to incorporate node features and edge weights. It also applies graph structure learning to sparsify attention coefficients and uses a modified InfoNCE loss function to enhance performance by adapting to denoised graph weights. Extensive experimental results show that EWGSL has an average Micro-F1 improvement of 17.8% compared with the best baseline.

[AI-68] Robust Isolation Forest using Soft Sparse Random Projection and Valley Emphasis Method

Link: https://arxiv.org/abs/2503.12125
Authors: Hun Kang,Kyoungok Kim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Isolation Forest (iForest) is an unsupervised anomaly detection algorithm designed to effectively detect anomalies under the assumption that anomalies are "few and different." Various studies have aimed to enhance iForest, but the resulting algorithms often exhibited significant performance disparities across datasets. Additionally, the challenge of isolating rare and widely distributed anomalies persisted in research focused on improving splits. To address these challenges, we introduce Robust iForest (RiForest). RiForest leverages both existing features and random hyperplanes obtained through soft sparse random projection to identify superior split features for anomaly detection, independent of datasets. It utilizes the underutilized valley emphasis method for optimal split point determination and incorporates sparsity randomization in soft sparse random projection for enhanced anomaly detection robustness. Across 24 benchmark datasets, experiments demonstrate RiForest’s consistent outperformance of existing algorithms in anomaly detection, emphasizing stability and robustness to noise variables.
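
The valley emphasis method named in the abstract above comes from histogram thresholding: it weights an Otsu-style class-separation criterion by (1 - p_t), so that thresholds falling in sparsely populated "valley" bins are preferred. The sketch below illustrates that general idea on a plain histogram, assuming the commonly cited objective (1 - p_t)(w1·mu1² + w2·mu2²). How RiForest builds the histogram for a candidate split feature is not described in the abstract, so the function name `valley_emphasis_threshold` and the input format are assumptions.

```python
def valley_emphasis_threshold(hist):
    """Return the bin index t maximising (1 - p_t) * (w1*mu1^2 + w2*mu2^2).

    hist -- list of non-negative bin counts; bins 0..t form class 1 and
            bins t+1.. form class 2. This is a sketch of the general
            valley-emphasis idea, not RiForest's actual split routine.
    """
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must contain at least one sample")
    p = [h / total for h in hist]  # bin probabilities
    total_moment = sum(i * pi for i, pi in enumerate(p))  # overall mean bin index

    best_t, best_score = None, float("-inf")
    w1 = 0.0  # cumulative probability of class 1
    s1 = 0.0  # cumulative first moment of class 1
    for t in range(len(p) - 1):
        w1 += p[t]
        s1 += t * p[t]
        w2 = 1.0 - w1
        if w1 <= 0.0 or w2 <= 0.0:
            continue  # one class would be empty
        mu1 = s1 / w1
        mu2 = (total_moment - s1) / w2
        # Otsu-style separation term, down-weighted where bin t is crowded.
        score = (1.0 - p[t]) * (w1 * mu1 ** 2 + w2 * mu2 ** 2)
        if score > best_score:
            best_t, best_score = t, score
    if best_t is None:
        raise ValueError("no valid threshold: all mass lies in one class")
    return best_t
```

For a bimodal histogram such as `[10, 30, 10, 1, 10, 30, 10]`, the separation term is nearly flat around the middle, and the (1 - p_t) weight tips the choice toward the near-empty bin between the two modes.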

[AI-69] ICCO: Learning an Instruction-conditioned Coordinator for Language-guided Task-aligned Multi-robot Control

Link: https://arxiv.org/abs/2503.12122
Authors: Yoshiki Yano,Kazuki Shibata,Maarten Kokshoorn,Takamitsu Matsubara
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:Recent advances in Large Language Models (LLMs) have permitted the development of language-guided multi-robot systems, which allow robots to execute tasks based on natural language instructions. However, achieving effective coordination in distributed multi-agent environments remains challenging due to (1) misalignment between instructions and task requirements and (2) inconsistency in robot behaviors when they independently interpret ambiguous instructions. To address these challenges, we propose Instruction-Conditioned Coordinator (ICCO), a Multi-Agent Reinforcement Learning (MARL) framework designed to enhance coordination in language-guided multi-robot systems. ICCO consists of a Coordinator agent and multiple Local Agents, where the Coordinator generates Task-Aligned and Consistent Instructions (TACI) by integrating language instructions with environmental states, ensuring task alignment and behavioral consistency. The Coordinator and Local Agents are jointly trained to optimize a reward function that balances task efficiency and instruction following. A Consistency Enhancement Term is added to the learning objective to maximize mutual information between instructions and robot behaviors, further improving coordination. Simulation and real-world experiments validate the effectiveness of ICCO in achieving language-guided task-aligned multi-robot control. The demonstration can be found at this https URL.

[AI-70] Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Link: https://arxiv.org/abs/2503.12115
Authors: Xue Jiang,Xiulian Peng,Yuan Zhang,Yan Lu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)

Abstract:Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a semantic-modeling and acoustic-synthesis paradigm. However, semantic tokens discard paralinguistic attributes of speakers that are important for natural spoken communication, while prompt-based acoustic synthesis from semantic tokens has limits in recovering paralinguistic details and suffers from robustness issues, especially when there are domain gaps between the prompt and the target. This paper unifies the two types of tokens and proposes UniCodec, a universal speech token learning method that encapsulates all semantics of speech, including linguistic and paralinguistic information, into a compact and semantically-disentangled unified token. Such a unified token can not only benefit speech language models in understanding with paralinguistic hints but also help speech generation with high-quality output. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features. Extensive evaluations on multilingual datasets demonstrate its effectiveness in generating natural, expressive and long-term consistent output quality with paralinguistic attributes well preserved in several speech processing tasks.

[AI-71] ChronosX: Adapting Pretrained Time Series Models with Exogenous Variables AISTATS

Link: https://arxiv.org/abs/2503.12107
Authors: Sebastian Pineda Arango,Pedro Mercado,Shubham Kapoor,Abdul Fatir Ansari,Lorenzo Stella,Huibin Shen,Hugo Senetaire,Caner Turkmen,Oleksandr Shchur,Danielle C. Maddix,Michael Bohlke-Schneider,Yuyang Wang,Syama Sundar Rangapuram
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

Abstract:Covariates provide valuable information on external factors that influence time series and are critical in many real-world time series forecasting tasks. For example, in retail, covariates may indicate promotions or peak dates such as holiday seasons that heavily influence demand forecasts. Recent advances in pretraining large language model architectures for time series forecasting have led to highly accurate forecasters. However, the majority of these models do not readily use covariates, as covariates are often specific to a certain task or domain. This paper introduces a new method to incorporate covariates into pretrained time series forecasting models. Our proposed approach incorporates covariate information into pretrained forecasting models through modular blocks that inject past and future covariate information, without necessarily modifying the pretrained model under consideration. To evaluate our approach, we introduce a benchmark composed of 32 different synthetic datasets with varying dynamics to evaluate the effectiveness of forecasting models with covariates. Extensive evaluations on both synthetic and real datasets show that our approach effectively incorporates covariate information into pretrained models, outperforming existing baselines.

[AI-72] Automating the loop in traffic incident management on highway

Link: https://arxiv.org/abs/2503.12085
Authors: Matteo Cercola, Nicola Gatti, Pedro Huertas Leyva, Benedetto Carambia, Simone Formentin
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Effective traffic incident management is essential for ensuring safety, minimizing congestion, and reducing response times in emergency situations. Traditional highway incident management relies heavily on radio room operators, who must make rapid, informed decisions in high-stakes environments. This paper proposes an innovative solution to support and enhance these decisions by integrating Large Language Models (LLMs) into a decision-support system for traffic incident management. We introduce two approaches: (1) an LLM + Optimization hybrid that leverages both the flexibility of natural language interaction and the robustness of optimization techniques, and (2) a Full LLM approach that autonomously generates decisions using only LLM capabilities. We tested our solutions using historical event data from Autostrade per l’Italia. Experimental results indicate that while both approaches show promise, the LLM + Optimization solution demonstrates superior reliability, making it particularly suited to critical applications where consistency and accuracy are paramount. This research highlights the potential for LLMs to transform highway incident management by enabling accessible, data-driven decision-making support.

[AI-73] Comparing Human Expertise and Large Language Models Embeddings in Content Validity Assessment of Personality Tests

Link: https://arxiv.org/abs/2503.12080
Authors: Nicola Milano, Michela Ponticorvo, Davide Marocco
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:In this article we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio (CVR) to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. We highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
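
The Content Validity Ratio used for the human baseline is commonly computed with Lawshe's formula, CVR = (n_e - N/2) / (N/2), where n_e is the number of raters who judge an item essential and N is the panel size. The sketch below assumes that standard definition; the abstract does not say which variant the study used, and the function name is hypothetical.

```python
def content_validity_ratio(n_essential: int, n_raters: int) -> float:
    """Lawshe's CVR for one item (assumed standard definition).

    Returns a value in [-1, 1]: +1 when every rater marks the item
    essential, 0 when exactly half do, -1 when none do.
    """
    if n_raters <= 0:
        raise ValueError("n_raters must be positive")
    if not 0 <= n_essential <= n_raters:
        raise ValueError("n_essential must be between 0 and n_raters")
    half = n_raters / 2
    return (n_essential - half) / half


if __name__ == "__main__":
    # Example: 8 of 10 raters judge the item essential.
    print(content_validity_ratio(8, 10))
```

Under this definition an item rated essential by 8 of 10 raters scores (8 - 5) / 5 = 0.6.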

[AI-74] Maritime Mission Planning for Unmanned Surface Vessel using Large Language Model

Link: https://arxiv.org/abs/2503.12065
Authors: Muhayy Ud Din, Waseem Akram, Ahsan B Bakht, Yihao Dong, Irfan Hussain
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots

Abstract:Unmanned Surface Vessels (USVs) are essential for various maritime operations. USV mission planning approaches offer autonomous solutions for monitoring, surveillance, and logistics. Existing approaches, which are based on static methods, struggle to adapt to dynamic environments, leading to suboptimal performance, higher costs, and increased risk of failure. This paper introduces a novel mission planning framework that uses Large Language Models (LLMs), such as GPT-4, to address these challenges. LLMs are proficient at understanding natural language commands, executing symbolic reasoning, and flexibly adjusting to changing situations. Our approach integrates LLMs into maritime mission planning to bridge the gap between high-level human instructions and executable plans, allowing real-time adaptation to environmental changes and unforeseen obstacles. In addition, feedback from low-level controllers is used to refine symbolic mission plans, ensuring robustness and adaptability. This framework improves the robustness and effectiveness of USV operations by integrating the power of symbolic planning with the reasoning abilities of LLMs. It also simplifies mission specification, allowing operators to focus on high-level objectives without requiring complex programming. The simulation results validate the proposed approach, demonstrating its ability to optimize mission execution while seamlessly adapting to dynamic maritime conditions.

[AI-75] Revisiting Training-Inference Trigger Intensity in Backdoor Attacks USENIX-SECURITY25

Link: https://arxiv.org/abs/2503.12058
Authors: Chenhao Lin, Chenyang Zhao, Shiwei Wang, Longtian Wang, Chao Shen, Zhengyu Zhao
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: To Appear in the 34th USENIX Security Symposium (USENIX Security 25)

Abstract:Backdoor attacks typically place a specific trigger on certain training data, such that the model makes prediction errors on inputs with that trigger during inference. Despite the core role of the trigger, existing studies have commonly believed a perfect match between training-inference triggers is optimal. In this paper, for the first time, we systematically explore the training-inference trigger relation, particularly focusing on their mismatch, based on a Training-Inference Trigger Intensity Manipulation (TITIM) workflow. TITIM specifically investigates the training-inference trigger intensity, such as the size or the opacity of a trigger, and reveals new insights into trigger generalization and overfitting. These new insights challenge the above common belief by demonstrating that the training-inference trigger mismatch can facilitate attacks in two practical scenarios, posing more significant security threats than previously thought. First, when the inference trigger is fixed, using training triggers with mixed intensities leads to stronger attacks than using any single intensity. For example, on CIFAR-10 with ResNet-18, mixing training triggers with 1.0 and 0.1 opacities improves the worst-case attack success rate (ASR) (over different testing opacities) of the best single-opacity attack from 10.61% to 92.77%. Second, intentionally using certain mismatched training-inference triggers can improve the attack stealthiness, i.e., better bypassing defenses. For example, compared to the training/inference intensity of 1.0/1.0, using 1.0/0.7 decreases the area under the curve (AUC) of the Scale-Up defense from 0.96 to 0.62, while maintaining a high attack ASR (99.65% vs. 91.62%). The above new insights are validated to be generalizable across different backdoor attacks, models, datasets, tasks, and (digital/physical) domains. 
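
To make "trigger intensity" concrete, the sketch below alpha-blends a patch onto a grayscale image at a given opacity and draws the opacity for each poisoned training sample from a mixed set such as {1.0, 0.1}. This is an illustrative assumption about how opacity could be parameterized, not the TITIM workflow itself; all function names, the patch placement, and the image representation (nested lists of floats in [0, 1]) are hypothetical.

```python
import random


def apply_trigger(image, trigger, top, left, opacity):
    """Alpha-blend a small trigger patch onto a copy of a grayscale image.

    opacity=1.0 overwrites the covered pixels; opacity=0.0 leaves them unchanged.
    """
    out = [row[:] for row in image]  # copy so the input image is not mutated
    for r, trigger_row in enumerate(trigger):
        for c, t in enumerate(trigger_row):
            y, x = top + r, left + c
            out[y][x] = (1.0 - opacity) * image[y][x] + opacity * t
    return out


def poison_with_mixed_opacities(images, trigger, opacities=(1.0, 0.1), seed=0):
    """Stamp each image with the trigger at an opacity drawn from `opacities`.

    Returns a list of (poisoned_image, opacity_used) pairs.
    """
    rng = random.Random(seed)
    poisoned = []
    for img in images:
        alpha = rng.choice(opacities)
        poisoned.append((apply_trigger(img, trigger, 0, 0, alpha), alpha))
    return poisoned


if __name__ == "__main__":
    blank = [[0.0] * 4 for _ in range(4)]
    patch = [[1.0, 1.0], [1.0, 1.0]]
    for img, alpha in poison_with_mixed_opacities([blank, blank, blank], patch):
        print(alpha, img[0][:3])
```

A training/inference mismatch in the abstract's sense would then correspond to stamping test inputs with an opacity different from those used during training.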

[AI-76] Ferret: An Efficient Online Continual Learning Framework under Varying Memory Constraints CVPR2025

Link: https://arxiv.org/abs/2503.12053
Authors: Yuhao Zhou, Yuxin Tian, Jindi Lv, Mingjia Shi, Yuanxi Li, Qing Ye, Shuhao Zhang, Jiancheng Lv
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: CVPR 2025

Abstract:In the realm of high-frequency data streams, achieving real-time learning within varying memory constraints is paramount. This paper presents Ferret, a comprehensive framework designed to enhance the online accuracy of Online Continual Learning (OCL) algorithms while dynamically adapting to varying memory budgets. Ferret employs a fine-grained pipeline parallelism strategy combined with an iterative gradient compensation algorithm, ensuring seamless handling of high-frequency data with minimal latency, and effectively counteracting the challenge of stale gradients in parallel training. To adapt to varying memory budgets, its automated model partitioning and pipeline planning optimize performance regardless of memory limitations. Extensive experiments across 20 benchmarks and 5 integrated OCL algorithms show Ferret’s remarkable efficiency, achieving up to 3.7× lower memory overhead to reach the same online accuracy compared to competing methods. Furthermore, Ferret consistently outperforms these methods across diverse memory budgets, underscoring its superior adaptability. These findings position Ferret as a premier solution for efficient and adaptive OCL in real-time environments.

[AI-77] An LLM-Integrated Framework for Completion Management and Tracing of STPA

Link: https://arxiv.org/abs/2503.12043
Authors: Ali Raeisdanaei, Juho Kim, Michael Liao, Sparsh Kochhar
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:In many safety-critical engineering domains, hazard analysis techniques are an essential part of requirement elicitation. Of the methods proposed for this task, STPA (System-Theoretic Process Analysis) represents a relatively recent development in the field. The completion, management, and traceability of this hazard analysis technique present a time-consuming challenge to the requirements and safety engineers involved. In this paper, we introduce a free, open-source software framework to build STPA models with several automated workflows powered by large language models (LLMs). In past works, LLMs have been successfully integrated into a myriad of workflows across various fields. Here, we demonstrate that LLMs can be used to complete tasks associated with STPA with a high degree of accuracy, saving the time and effort of the human engineers involved. We experimentally validate our method on real-world STPA models built by requirement engineers and researchers. The source code of our software framework is available at the following link: this https URL.

[AI-78] Unsupervised Graph Anomaly Detection via Multi-Hypersphere Heterophilic Graph Learning

Link: https://arxiv.org/abs/2503.12037
Authors: Hang Ni, Jindong Han, Nengjun Zhu, Hao Liu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Anomaly Detection (GAD) plays a vital role in various data mining applications such as e-commerce fraud prevention and malicious user detection. Recently, Graph Neural Network (GNN) based approaches have demonstrated great effectiveness in GAD by first encoding graph data into low-dimensional representations and then identifying anomalies under the guidance of supervised or unsupervised signals. However, existing GNN-based approaches implicitly follow the homophily principle (i.e., the “like attracts like” phenomenon) and fail to learn discriminative embeddings for anomalies that connect to a vast number of normal nodes. Moreover, such approaches identify anomalies from a unified global perspective but overlook diversified abnormal patterns conditioned on local graph context, leading to suboptimal performance. To overcome these limitations, we propose a Multi-hypersphere Heterophilic Graph Learning (MHetGL) framework for unsupervised GAD. Specifically, we first devise a Heterophilic Graph Encoding (HGE) module to learn distinguishable representations for potential anomalies by purifying and augmenting their neighborhood in a fully unsupervised manner. Then, we propose a Multi-Hypersphere Learning (MHL) module to enhance the detection capability for context-dependent anomalies by jointly incorporating critical patterns from both global and local perspectives. Extensive experiments on ten real-world datasets show that MHetGL outperforms 14 baselines. Our code is publicly available at this https URL.

[AI-79] Variance-Dependent Regret Lower Bounds for Contextual Bandits

Link: https://arxiv.org/abs/2503.12020
Authors: Jiafan He, Quanquan Gu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 19 pages

Abstract:Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}\big(d\sqrt{\sum_{k=1}^K\sigma_k^2}\big)$, where $d$ is the context dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the noise variance in round $k$, have been widely studied in recent years. However, most existing works focus on regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension $d_{\textbf{elu}}$ and total variance budget $\Lambda$, there exists an instance with $\sum_{k=1}^K\sigma_k^2\leq\Lambda$ for which any algorithm incurs a variance-dependent lower bound of $\Omega(\sqrt{d_{\textbf{elu}}\Lambda})$. However, this lower bound has a $\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $\Lambda$ and does not apply to a general variance sequence $\sigma_1^2,\ldots,\sigma_K^2$. In this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $\Omega\big(d\sqrt{\sum_{k=1}^K\sigma_k^2}/\log K\big)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $\sigma_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $\sigma_k^2$ before observing the decision set $\mathcal{D}_k$, a similar lower bound of $\Omega\big(d\sqrt{\sum_{k=1}^K\sigma_k^2}/\log^6(dK)\big)$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al., 2023) up to logarithmic factors.
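
A quick numeric illustration of why the variance-dependent rate matters: ignoring constants and logarithmic factors, the sketch below compares the leading terms d·sqrt(K) and d·sqrt(sum of σ_k²). The values of d, K, and the variances are made up for the example, and the function names are hypothetical.

```python
import math


def worst_case_rate(d: int, num_rounds: int) -> float:
    """Leading term d * sqrt(K) of the classical bound (logs and constants dropped)."""
    return d * math.sqrt(num_rounds)


def variance_dependent_rate(d: int, variances) -> float:
    """Leading term d * sqrt(sum_k sigma_k^2) (logs and constants dropped)."""
    return d * math.sqrt(sum(variances))


if __name__ == "__main__":
    d, K = 10, 10_000
    low_noise = [0.01] * K  # assumed: every round has variance 0.01
    print(worst_case_rate(d, K))                  # 10 * sqrt(10000)
    print(variance_dependent_rate(d, low_noise))  # about 10 * sqrt(100)
```

With unit variance in every round the two terms coincide; with variance 0.01 per round the variance-dependent term is smaller by a factor of 10 in this example.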

[AI-80] Winning the MIDST Challenge: New Membership Inference Attacks on Diffusion Models for Tabular Data Synthesis

Link: https://arxiv.org/abs/2503.12008
Authors: Xiaoyu Wu, Yifei Pang, Terrance Liu, Steven Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Tabular data synthesis using diffusion models has gained significant attention for its potential to balance data utility and privacy. However, existing privacy evaluations often rely on heuristic metrics or weak membership inference attacks (MIA), leaving privacy risks inadequately assessed. In this work, we conduct a rigorous MIA study on diffusion-based tabular synthesis, revealing that state-of-the-art attacks designed for image models fail in this setting. We identify noise initialization as a key factor influencing attack efficacy and propose a machine-learning-driven approach that leverages loss features across different noises and time steps. Our method, implemented with a lightweight MLP, effectively learns membership signals, eliminating the need for manual optimization. Experimental results from the MIDST Challenge @ SaTML 2025 demonstrate the effectiveness of our approach, securing first place across all tracks. Code is available at this https URL.

[AI-81] SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

Link: https://arxiv.org/abs/2503.11951
Authors: Edward Y. Chang
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 tables, 5 figures

Abstract:Recent LLM-based agent frameworks have demonstrated impressive capabilities in task delegation and workflow orchestration, but face significant challenges in maintaining context awareness and ensuring planning consistency. This paper presents SagaLLM, a structured multi-agent framework that addresses four fundamental limitations in current LLM approaches: inadequate self-validation, context narrowing, lacking transaction properties, and insufficient inter-agent coordination. By implementing specialized context management agents and validation protocols, SagaLLM preserves critical constraints and state information throughout complex planning processes, enabling robust and consistent decision-making even during disruptions. We evaluate our approach using selected problems from the REALM benchmark, focusing on sequential and reactive planning scenarios that challenge both context retention and adaptive reasoning. Our experiments with state-of-the-art LLMs, Claude 3.7, DeepSeek R1, GPT-4o, and GPT-o1, demonstrate that while these models exhibit impressive reasoning capabilities, they struggle with maintaining global constraint awareness during complex planning tasks, particularly when adapting to unexpected changes. In contrast, the distributed cognitive architecture of SagaLLM shows significant improvements in planning consistency, constraint enforcement, and adaptation to disruptions in various scenarios.

[AI-82] Privacy Ethics Alignment in AI (PEA-AI): A Stakeholder-Centric Based Framework for Ethical AI

Link: https://arxiv.org/abs/2503.11950
Authors: Ankur Barthwal, Molly Campbell, Ajay Kumar Shrestha
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Preprint Version | To be submitted to peer-review venue

Abstract:The increasing integration of Artificial Intelligence (AI) in digital ecosystems has reshaped privacy dynamics, particularly for young digital citizens navigating data-driven environments. This study explores evolving privacy concerns across three key stakeholder groups (digital citizens aged 16-19, parents and educators, and AI professionals) and assesses differences in data ownership, trust, transparency, parental mediation, education, and risk-benefit perceptions. Employing a grounded theory methodology, this research synthesizes insights from 482 participants through structured surveys, qualitative interviews, and focus groups. The findings reveal distinct privacy expectations: young users emphasize autonomy and digital freedom, while parents and educators advocate for regulatory oversight and AI literacy programs. AI professionals, in contrast, prioritize the balance between ethical system design and technological efficiency. The data further highlight gaps in AI literacy and transparency, emphasizing the need for comprehensive, stakeholder-driven privacy frameworks that accommodate diverse user needs. Using comparative thematic analysis, this study identifies key tensions in privacy governance and develops the novel Privacy-Ethics Alignment in AI (PEA-AI) model, which structures privacy decision-making as a dynamic negotiation between stakeholders. By systematically analyzing themes such as transparency, user control, risk perception, and parental mediation, this research provides a scalable, adaptive foundation for AI governance, ensuring that privacy protections evolve alongside emerging AI technologies and youth-centric digital interactions.

[AI-83] Ethical AI for Young Digital Citizens: A Call to Action on Privacy Governance

Link: https://arxiv.org/abs/2503.11947
Authors: Austin Shouli, Ankur Barthwal, Molly Campbell, Ajay Kumar Shrestha
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint Version | To be submitted to peer-reviewed venue

Abstract:The rapid expansion of Artificial Intelligence (AI) in digital platforms used by youth has created significant challenges related to privacy, autonomy, and data protection. While AI-driven personalization offers enhanced user experiences, it often operates without clear ethical boundaries, leaving young users vulnerable to data exploitation and algorithmic biases. This paper presents a call to action for ethical AI governance, advocating for a structured framework that ensures youth-centred privacy protections, transparent data practices, and regulatory oversight. We outline key areas requiring urgent intervention, including algorithmic transparency, privacy education, parental data-sharing ethics, and accountability measures. Through this approach, we seek to empower youth with greater control over their digital identities and propose actionable strategies for policymakers, AI developers, and educators to build a fairer and more accountable AI ecosystem.

[AI-84] Human Digital Twins in Personalized Healthcare: An Overview and Future Perspectives

Link: https://arxiv.org/abs/2503.11944
Authors: Melvin Mokhtari
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Digital twins (DTs) are redefining healthcare by paving the way for more personalized, proactive, and intelligent medical interventions. As the shift toward personalized care intensifies, there is a growing need for an individual’s virtual replica that delivers the right treatment at the optimal time and in the most effective manner. The emerging concept of a Human Digital Twin (HDT) holds the potential to revolutionize the traditional healthcare system much like digital twins have transformed manufacturing and aviation. An HDT mirrors the physical entity of a human body through a dynamic virtual model that continuously reflects changes in molecular, physiological, emotional, and lifestyle factors. This digital representation not only supports remote monitoring, diagnosis, and prescription but also facilitates surgery, rehabilitation, and overall personalized care, thereby relieving pressure on conventional healthcare frameworks. Despite its promising advantages, there are considerable research challenges to overcome as HDT technology evolves. In this study, I will initially delineate the distinctions between traditional digital twins and HDTs, followed by an exploration of the networking architecture integral to their operation–from data acquisition and communication to computation, management, and decision-making–thereby offering insights into how these innovations may reshape the modern healthcare industry.

[AI-85] End-to-End Edge AI Service Provisioning Framework in 6G ORAN

Link: https://arxiv.org/abs/2503.11933
Authors: Yun Tang, Udhaya Chandhar Srinivasan, Benjamin James Scott, Obumneme Umealor, Dennis Kevogo, Weisi Guo
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures, submitted to IEEE VTC for possible publication

Abstract:With the advent of 6G, Open Radio Access Network (O-RAN) architectures are evolving to support intelligent, adaptive, and automated network orchestration. This paper proposes a novel Edge AI and Network Service Orchestration framework that leverages Large Language Model (LLM) agents deployed as O-RAN rApps. The proposed LLM-agent-powered system enables interactive and intuitive orchestration by translating the user’s use case description into deployable AI services and corresponding network configurations. The LLM agent automates multiple tasks, including AI model selection from repositories (e.g., Hugging Face), service deployment, network adaptation, and real-time monitoring via xApps. We implement a prototype using open-source O-RAN projects (OpenAirInterface and FlexRIC) to demonstrate the feasibility and functionality of our framework. Our demonstration showcases the end-to-end flow of AI service orchestration, from user interaction to network adaptation, ensuring Quality of Service (QoS) compliance. This work highlights the potential of integrating LLM-driven automation into 6G O-RAN ecosystems, paving the way for more accessible and efficient edge AI ecosystems.

[AI-86] Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Link: https://arxiv.org/abs/2503.11926
Authors: Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi
Subjects: Artificial Intelligence (cs.AI)

Abstract:Mitigating reward hacking–where AI systems misbehave due to flaws or misspecifications in their learning objectives–remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model’s chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that an LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent’s training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.

[AI-87] Sketch-to-Skill: Bootstrapping Robot Learning with Human Drawn Trajectory Sketches

Link: https://arxiv.org/abs/2503.11918
Authors: Peihong Yu, Amisha Bhaskar, Anukriti Singh, Zahiruddin Mahammad, Pratap Tokekar
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Peihong Yu and Amisha Bhaskar contributed equally to this work

Abstract:Training robotic manipulation policies traditionally requires numerous demonstrations and/or environmental rollouts. While recent Imitation Learning (IL) and Reinforcement Learning (RL) methods have reduced the number of required demonstrations, they still rely on expert knowledge to collect high-quality data, limiting scalability and accessibility. We propose Sketch-to-Skill, a novel framework that leverages human-drawn 2D sketch trajectories to bootstrap and guide RL for robotic manipulation. Our approach extends beyond previous sketch-based methods, which were primarily focused on imitation learning or policy conditioning, limited to specific trained tasks. Sketch-to-Skill employs a Sketch-to-3D Trajectory Generator that translates 2D sketches into 3D trajectories, which are then used to autonomously collect initial demonstrations. We utilize these sketch-generated demonstrations in two ways: to pre-train an initial policy through behavior cloning and to refine this policy through RL with guided exploration. Experimental results demonstrate that Sketch-to-Skill achieves ~96% of the performance of the baseline model that leverages teleoperated demonstration data, while exceeding the performance of a pure reinforcement learning policy by ~170%, only from sketch inputs. This makes robotic manipulation learning more accessible and potentially broadens its applications across various domains.

[AI-88] A Framework for Evaluating Emerging Cyberattack Capabilities of AI

Link: https://arxiv.org/abs/2503.11917
Authors: Mikel Rodriguez, Raluca Ada Popa, Four Flynn, Lihao Liang, Allan Dafoe, Anna Wang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:As frontier models become more capable, the community has attempted to evaluate their ability to enable cyberattacks. Performing a comprehensive evaluation and prioritizing defenses are crucial tasks in preparing for AGI safely. However, current cyber evaluation efforts are ad-hoc, with no systematic reasoning about the various phases of attacks, and do not provide a steer on how to use targeted defenses. In this work, we propose a novel approach to AI cyber capability evaluation that (1) examines the end-to-end attack chain, (2) helps to identify gaps in the evaluation of AI threats, and (3) helps defenders prioritize targeted mitigations and conduct AI-enabled adversary emulation to support red teaming. To achieve these goals, we propose adapting existing cyberattack chain frameworks to AI systems. We analyze over 12,000 instances of real-world attempts to use AI in cyberattacks catalogued by Google’s Threat Intelligence Group. Using this analysis, we curate a representative collection of seven cyberattack chain archetypes and conduct a bottleneck analysis to identify areas of potential AI-driven cost disruption. Our evaluation benchmark consists of 50 new challenges spanning different phases of cyberattacks. Based on this, we devise targeted cybersecurity model evaluations, report on the potential for AI to amplify offensive cyber capabilities across specific attack phases, and conclude with recommendations on prioritizing defenses. In all, we consider this to be the most comprehensive AI cyber risk evaluation framework published so far.

[AI-89] How Problematic Writer-AI Interactions (Rather than Problematic AI) Hinder Writers' Idea Generation

Link: https://arxiv.org/abs/2503.11915
Authors: Khonzoda Umarova, Talia Wise, Zhuoer Lyu, Mina Lee, Qian Yang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Writing about a subject enriches writers’ understanding of that subject. This cognitive benefit of writing – known as constructive learning – is essential to how students learn in various disciplines. However, does this benefit persist when students write with generative AI writing assistants? Prior research suggests the answer varies based on the type of AI, e.g., auto-complete systems tend to hinder ideation, while assistants that pose Socratic questions facilitate it. This paper adds an additional perspective. Through a case study, we demonstrate that the impact of genAI on students’ idea development depends not only on the AI but also on the students and, crucially, their interactions in between. Students who proactively explored ideas gained new ideas from writing, regardless of whether they used auto-complete or Socratic AI assistants. Those who engaged in prolonged, mindless copyediting developed few ideas even with a Socratic AI. These findings suggest opportunities in designing AI writing assistants, not merely by creating more thought-provoking AI, but also by fostering more thought-provoking writer-AI interactions.

[AI-90] RTD-Lite: Scalable Topological Analysis for Comparing Weighted Graphs in Learning Tasks AISTATS2025

Link: https://arxiv.org/abs/2503.11910
Authors: Eduard Tulchinskii, Daria Voronkova, Ilya Trofimov, Evgeny Burnaev, Serguei Barannikov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symplectic Geometry (math.SG)
Comments: Accepted for AISTATS 2025

Abstract:Topological methods for comparing weighted graphs are valuable in various learning tasks but often suffer from computational inefficiency on large datasets. We introduce RTD-Lite, a scalable algorithm that efficiently compares topological features, specifically connectivity or cluster structures at arbitrary scales, of two weighted graphs with one-to-one correspondence between vertices. Using minimal spanning trees in auxiliary graphs, RTD-Lite captures topological discrepancies with O(n^2) time and memory complexity. This efficiency enables its application in tasks like dimensionality reduction and neural network training. Experiments on synthetic and real-world datasets demonstrate that RTD-Lite effectively identifies topological differences while significantly reducing computation time compared to existing methods. Moreover, integrating RTD-Lite into neural network training as a loss function component enhances the preservation of topological structures in learned representations. Our code is publicly available at this https URL
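
The O(n^2) bound quoted above is what one would expect from a minimum-spanning-tree computation on a dense n-vertex graph. As a rough illustration only (this is not the authors' code, and the abstract does not say how RTD-Lite builds its auxiliary graphs or compares the resulting trees), here is Prim's algorithm on a dense weight matrix. The function name and the toy matrix are hypothetical.

```python
import math


def prim_mst_weight(weights):
    """Total weight of a minimum spanning tree of a dense, connected graph.

    `weights` is a symmetric n x n matrix given as a list of lists.
    Runs in O(n^2) time with O(n) extra memory on top of the matrix.
    """
    n = len(weights)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each vertex to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Linear scan for the cheapest vertex that is not in the tree yet.
        u = min((v for v in range(n) if not in_tree[v]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        # Relax every edge leaving u.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
    return total


# Hypothetical 3-vertex example: the tree keeps edges 0-1 and 1-2.
toy = [[0, 1, 4],
       [1, 0, 2],
       [4, 2, 0]]
print(prim_mst_weight(toy))
```

The linear scan replaces a priority queue on purpose: for a dense matrix it keeps the whole run at O(n^2), matching the complexity the abstract reports.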

[AI-91] Revisiting FastMap: New Applications

Link: https://arxiv.org/abs/2503.11908
Authors: Ang Li
Subjects: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
Comments: PhD dissertation

Abstract:FastMap was first introduced in the Data Mining community for generating Euclidean embeddings of complex objects. In this dissertation, we first present FastMap to generate Euclidean embeddings of graphs in near-linear time: The pairwise Euclidean distances approximate a desired graph-based distance function on the vertices. We then apply the graph version of FastMap to efficiently solve various graph-theoretic problems of significant interest in AI: including facility location, top-K centrality computations, community detection and block modeling, and graph convex hull computations. We also present a novel learning framework, called FastMapSVM, by combining FastMap and Support Vector Machines. We then apply FastMapSVM to predict the satisfiability of Constraint Satisfaction Problems and to classify seismograms in Earthquake Science.
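
For readers unfamiliar with FastMap, the sketch below shows the classic data-mining version that the abstract refers to: pick two far-apart pivot objects, project every object onto the line through them with the law of cosines, then repeat on the residual distances. It is a minimal sketch under assumptions: the graph variant in the dissertation works on graph distances and may differ in its details, and the function and parameter names here are hypothetical.

```python
import math


def fastmap(objects, dist, k):
    """Embed `objects` into k dimensions with the classic FastMap heuristic.

    `dist(x, y)` is any symmetric distance function; each dimension costs
    O(n) distance evaluations, so the whole embedding costs O(n * k) of them.
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance left over after the first `dim` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic: start at object 0, then hop to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, dim))
        a = max(range(n), key=lambda j: residual_sq(b, j, dim))
        dab2 = residual_sq(a, b, dim)
        if dab2 == 0.0:
            break  # every remaining residual distance is zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Law-of-cosines projection onto the line through the two pivots.
            coords[i][dim] = (
                residual_sq(a, i, dim) + dab2 - residual_sq(b, i, dim)
            ) / (2 * dab)
    return coords


# Hypothetical usage: four points on a line, embedded in one dimension.
points = [0.0, 1.0, 3.0, 7.0]
print(fastmap(points, lambda x, y: abs(x - y), 1))
```

Because only distances to the two pivots are needed per dimension, the method never materializes the full pairwise distance matrix, which is where the near-linear running time comes from.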

[AI-92] Characterizing GPU Resilience and Impact on AI/HPC Systems

Link: https://arxiv.org/abs/2503.11901
Authors: Shengkun Cui, Archit Patke, Ziheng Chen, Aditya Ranjan, Hung Nguyen, Phuong Cao, Saurabh Jha, Brett Bode, Gregory Bauer, Chandra Narayanaswami, Daby Sow, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:In this study, we characterize GPU failures in Delta, the current large-scale AI system with over 600 petaflops of peak compute throughput. The system comprises GPU and non-GPU nodes with modern AI accelerators, such as NVIDIA A40, A100, and H100 GPUs. The study uses two and a half years of data on GPU errors. We evaluate the resilience of GPU hardware components to determine the vulnerability of different GPU components to failure and their impact on the GPU and node availability. We measure the key propagation paths in GPU hardware, GPU interconnect (NVLink), and GPU memory. Finally, we evaluate the impact of the observed GPU errors on user jobs. Our key findings are: (i) Contrary to common beliefs, GPU memory is over 30x more reliable than GPU hardware in terms of MTBE (mean time between errors). (ii) The newly introduced GSP (GPU System Processor) is the most vulnerable GPU hardware component. (iii) NVLink errors did not always lead to user job failure, and we attribute it to the underlying error detection and retry mechanisms employed. (iv) We show multiple examples of hardware errors originating from one of the key GPU hardware components, leading to application failure. (v) We project the impact of GPU node availability on larger scales with emulation and find that significant overprovisioning between 5-20% would be necessary to handle GPU failures. If GPU availability were improved to 99.9%, the overprovisioning would be reduced by 4x.

[AI-93] Expressive Music Data Processing and Generation

Link: https://arxiv.org/abs/2503.11896
Authors: Jingwei Liu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 4 figures

Abstract:Musical expressivity and coherence are indispensable in music composition and performance, while often neglected in modern AI generative models. In this work, we introduce a listening-based data-processing technique that captures the expressivity in musical performance. This technique derived from Weber’s law reflects the human perceptual truth of listening and preserves musical subtlety and expressivity in the training input. To facilitate musical coherence, we model the output interdependencies among multiple arguments in the music data such as pitch, duration, velocity, etc. in the neural networks based on the probabilistic chain rule. In practice, we decompose the multi-output sequential model into single-output submodels and condition previously sampled outputs on the subsequent submodels to induce conditional distributions. Finally, to select eligible sequences from all generations, a tentative measure based on the output entropy was proposed. The entropy sequence is set as a criterion to select predictable and stable generations, which is further studied under the context of informational aesthetic measures to quantify musical pleasure and information gain along the music tendency.
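
As a reading aid, the chain-rule decomposition described above can be written out for three note attributes. The ordering (pitch, then duration, then velocity) and the symbols $p_t$, $d_t$, $v_t$ for the attributes at step $t$ and $h_t$ for the preceding context are assumptions made for illustration; the abstract does not state which ordering the paper uses.

```latex
P(p_t, d_t, v_t \mid h_t)
  = P(p_t \mid h_t)\,
    P(d_t \mid p_t, h_t)\,
    P(v_t \mid p_t, d_t, h_t)
```

Each factor corresponds to one single-output submodel, and each submodel is conditioned on the values already sampled by the submodels before it.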

[AI-94] Counterfactual Realizability ICLR’25

Link: https://arxiv.org/abs/2503.11870
Authors: Arvind Raghavan, Elias Bareinboim
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at ICLR’25 (spotlight)

Abstract:It is commonly believed that, in a real-world environment, samples can only be drawn from observational and interventional distributions, corresponding to Layers 1 and 2 of the Pearl Causal Hierarchy. Layer 3, representing counterfactual distributions, is believed to be inaccessible by definition. However, Bareinboim, Forney, and Pearl (2015) introduced a procedure that allows an agent to sample directly from a counterfactual distribution, leaving open the question of what other counterfactual quantities can be estimated directly via physical experimentation. We resolve this by introducing a formal definition of realizability, the ability to draw samples from a distribution, and then developing a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable given fundamental physical constraints, such as the inability to go back in time and subject the same unit to a different experimental condition. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning. While the baseline approach in these motivating settings typically follows an interventional or observational strategy, we show that a counterfactual strategy provably dominates both.

[AI-95] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning

Link: https://arxiv.org/abs/2503.11832
Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the “safety mirage” where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address this issue, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under one-word attacks, MU-based alignment reduces the attack success rate by up to 60.17% and cuts unnecessary rejections by over 84.20%. Codes are available at this https URL. WARNING: There exist AI generations that may be offensive in nature.

[AI-96] Semi-Supervised Co-Training of Time and Time-Frequency Models: Application to Bearing Fault Diagnosis

Link: https://arxiv.org/abs/2503.11824
Authors: Tuomas Jalonen, Mohammad Al-Sa’d, Serkan Kiranyaz, Moncef Gabbouj
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)

Abstract:Neural networks require massive amounts of annotated data to train intelligent solutions. Acquiring many labeled data in industrial applications is often difficult; therefore, semi-supervised approaches are preferred. We propose a new semi-supervised co-training method, which combines time and time-frequency (TF) machine learning models to improve performance and reliability. The developed framework collaboratively co-trains fast time-domain models by utilizing high-performing TF techniques without increasing the inference complexity. Besides, it operates in cloud-edge networks and offers holistic support for many applications covering edge-real-time monitoring and cloud-based updates and corrections. Experimental results on bearing fault diagnosis verify the superiority of our technique compared to a competing self-training method. The results from two case studies show that our method outperforms self-training for different noise levels and amounts of available data with accuracy gains reaching from 10.6% to 33.9%. They demonstrate that fusing time-domain and TF-based models offers opportunities for developing high-performance industrial solutions.

[AI-97] An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models

Link: https://arxiv.org/abs/2503.11820
Authors: Antonio Lorenzin, Fabio Zanasi
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Category Theory (math.CT)

Abstract:Moralisation and Triangulation are transformations that allow switching between different ways of factoring a probability distribution into a graphical model. Moralisation allows one to view a Bayesian network (a directed model) as a Markov network (an undirected model), whereas triangulation works in the opposite direction. We present a categorical framework where these transformations are modelled as functors between a category of Bayesian networks and one of Markov networks. The two kinds of network (the objects of these categories) are themselves represented as functors, from a ‘syntax’ domain to a ‘semantics’ codomain. Notably, moralisation and triangulation are definable inductively on such syntax, and operate as a form of functor pre-composition. This approach introduces a modular, algebraic perspective in the theory of probabilistic graphical models.

[AI-98] Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs

Link: https://arxiv.org/abs/2503.11790
Authors: Nasim Borazjanizadeh, Roei Herzig, Eduard Oks, Trevor Darrell, Rogerio Feris, Leonid Karlinsky
Subjects: Artificial Intelligence (cs.AI)

Abstract:Human reasoning relies on constructing and manipulating mental models: simplified internal representations of situations that we use to understand and solve problems. Conceptual diagrams (for example, sketches drawn by humans to aid reasoning) externalize these mental models, abstracting irrelevant details to efficiently capture relational and spatial information. In contrast, Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations, limiting their effectiveness in complex multi-step combinatorial and planning tasks. In this paper, we propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated intermediate conceptual diagrams, significantly enhancing their combinatorial planning capabilities. Our approach does not require any human initialization beyond a natural language description of the task. It integrates both textual and diagrammatic reasoning within an optimized graph-of-thought inference framework, enhanced by beam search and depth-wise backtracking. Evaluated on multiple challenging PDDL planning domains, our method substantially improves GPT-4o’s performance (for example, from 35.5% to 90.2% in Blocksworld). On more difficult planning domains with solution depths up to 40, our approach outperforms even the o1-preview reasoning model (for example, over 13% improvement in Parking). These results highlight the value of conceptual diagrams as a complementary reasoning medium in LMMs.

[AI-99] PUBLICSPEAK: Hearing the Public with a Probabilistic Framework in Local Government AAAI

Link: https://arxiv.org/abs/2503.11743
Authors: Tianliang Xu, Eva Maxfield Brown, Dustin Dwyer, Sabina Tomkins
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 3 figures, in the 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Local governments around the world are making consequential decisions on behalf of their constituents, and these constituents are responding with requests, advice, and assessments of their officials at public meetings. So many small meetings cannot be covered by traditional newsrooms at scale. We propose PUBLICSPEAK, a probabilistic framework which can utilize meeting structure, domain knowledge, and linguistic information to discover public remarks in local government meetings. We then use our approach to inspect the issues raised by constituents in 7 cities across the United States. We evaluate our approach on a novel dataset of local government meetings and find that PUBLICSPEAK improves over state-of-the-art by 10% on average, and by up to 40%.

[AI-100] BioMamba: Leveraging Spectro-Temporal Embedding in Bidirectional Mamba for Enhanced Biosignal Classification

Link: https://arxiv.org/abs/2503.11741
Authors: Jian Qian, Teck Lun Goh, Bingyu Xie, Chengyao Zhu, Biao Wan, Yawen Guan, Patrick Yin Chiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2405.19363, arXiv:2410.03057 by other authors

Abstract:Biological signals, such as electroencephalograms (EEGs) and electrocardiograms (ECGs), play a pivotal role in numerous clinical practices, such as diagnosing brain and cardiac arrhythmic diseases. Existing methods for biosignal classification rely on Attention-based frameworks with dense Feed Forward layers, which lead to inefficient learning, high computational overhead, and suboptimal performance. In this work, we introduce BioMamba, a Spectro-Temporal Embedding strategy applied to the Bidirectional Mamba framework with Sparse Feed Forward layers to enable effective learning of biosignal sequences. By integrating these three key components, BioMamba effectively addresses the limitations of existing methods. Extensive experiments demonstrate that BioMamba significantly outperforms state-of-the-art methods with marked improvement in classification performance. The advantages of the proposed BioMamba include (1) Reliability: BioMamba consistently delivers robust results, confirmed across six evaluation metrics. (2) Efficiency: We assess both model and training efficiency, the BioMamba demonstrates computational effectiveness by reducing model size and resource consumption compared to existing approaches. (3) Generality: With the capacity to effectively classify a diverse set of tasks, BioMamba demonstrates adaptability and effectiveness across various domains and applications.

[AI-101] Multi-View Node Pruning for Accurate Graph Representation

Link: https://arxiv.org/abs/2503.11737
Authors: Jiseong Park, Hanjin Kim, Seojin Kim, Jueun Choi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph pooling, which compresses a whole graph into a smaller coarsened graph, is an essential component of graph representation learning. To efficiently compress a given graph, graph pooling methods often drop nodes using attention-based scoring trained with the task loss. However, this often results in simply removing nodes with lower degrees without consideration of their feature-level relevance to the given task. To fix this problem, we propose Multi-View Pruning (MVP), a graph pruning method based on a multi-view framework and reconstruction loss. Given a graph, MVP first constructs multiple graphs for different views, either by utilizing the predefined modalities or by randomly partitioning the input features, to consider the importance of each node from diverse perspectives. Then, it learns the score for each node by considering both the reconstruction and the task loss. MVP can be incorporated with any hierarchical pooling framework to score the nodes. We validate MVP on multiple benchmark datasets by coupling it with two graph pooling methods, and show that it significantly improves the performance of the base graph pooling method, outperforming all baselines. Further analysis shows that both the encoding of multiple views and the consideration of reconstruction loss are the key to the success of MVP, and that it indeed identifies nodes that are less important according to domain knowledge.

[AI-102] Class-Level Feature Selection Method Using Feature Weighted Growing Self-Organising Maps

Link: https://arxiv.org/abs/2503.11732
Authors: Andrew Starkey, Uduak Idio Akpan, Omaimah AL Hosni, Yaseen Pullissery
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 15 figures

Abstract:There have been several attempts to develop Feature Selection (FS) algorithms capable of identifying features that are relevant in a dataset. Although in certain applications the FS algorithms can be seen to be successful, they have similar basic limitations. In all cases, the global feature selection algorithms seek to select features that are relevant and common to all classes of the dataset. This is a major limitation since there could be features that are specifically useful for a particular class while irrelevant for other classes, and full explanation of the relationship at class level therefore cannot be determined. While the inclusion of such features for all classes could cause improved predictive ability for the relevant class, the same features could be problematic for other classes. In this paper, we examine this issue and also develop a class-level feature selection method called the Feature Weighted Growing Self-Organising Map (FWGSOM). The proposed method carries out feature analysis at class level which enhances its ability to identify relevant features for each class. Results from experiments indicate that our method performs better than other methods, gives explainable results at class level, and has a low computational footprint when compared to other methods.

[AI-103] BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction

Link: https://arxiv.org/abs/2503.11730
Authors: Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been received as a research paper at CollaborateCom 2024

Abstract:Prognostic and Health Management (PHM) are crucial ways to avoid unnecessary maintenance for Cyber-Physical Systems (CPS) and improve system reliability. Predicting the Remaining Useful Life (RUL) is one of the most challenging tasks for PHM. Existing methods require prior knowledge about the system, contrived assumptions, or temporal mining to model the life cycles of machine equipment/devices, resulting in diminished accuracy and limited applicability in real-world scenarios. This paper proposes a Bi-directional Adversarial network with Covariate Encoding for machine Remaining Useful Life (BACE-RUL) prediction, which only adopts sensor measurements from the current life cycle to predict RUL rather than relying on previous consecutive cycle recordings. The current sensor measurements of mechanical devices are encoded to a conditional space to better understand the implicit inner mechanical status. The predictor is trained as a conditional generative network with the encoded sensor measurements as its conditions. Various experiments on several real-world datasets, including the turbofan aircraft engine dataset and the dataset collected from degradation experiments of Li-Ion battery cells, show that the proposed model is a general framework and outperforms state-of-the-art methods.

[AI-104] Forecasting Empty Container availability for Vehicle Booking System Application

Link: https://arxiv.org/abs/2503.11728
Authors: Arthur Cartel Foahom Gouabou (AMU, LIS, I&M), Mohammed Al-Kharaz (LIS), Faouzi Hakimi (AMU), Tarek Khaled (LIS, LIRICA), Kenza Amzil (LISPEN)
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Abstract:Container terminals, pivotal nodes in the network of empty container movement, hold significant potential for enhancing operational efficiency within terminal depots through effective collaboration between transporters and terminal operators. This collaboration is crucial for achieving optimization, leading to streamlined operations and reduced congestion, thereby benefiting both parties. Consequently, there is a pressing need to develop the most suitable forecasting approaches to address this challenge. This study focuses on developing and evaluating a data-driven approach for forecasting empty container availability at container terminal depots within a Vehicle Booking System (VBS) framework. It addresses the gap in research concerning optimizing empty container dwell time and aims to enhance operational efficiencies in container terminal operations. Four forecasting models (Naive, ARIMA, Prophet, and LSTM) are comprehensively analyzed for their predictive capabilities, with LSTM emerging as the top performer due to its ability to capture complex time series patterns. The research underscores the significance of selecting appropriate forecasting techniques tailored to the specific requirements of container terminal operations, contributing to improved operational planning and management in maritime logistics.

[AI-105] SPECTra: Scalable Multi-Agent Reinforcement Learning with Permutation-Free Networks

Link: https://arxiv.org/abs/2503.11726
Authors: Hyunwoo Park, Baekryun Seong, Sang-Ki Ko
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 14 figures

Abstract:In cooperative multi-agent reinforcement learning (MARL), the permutation problem where the state space grows exponentially with the number of agents reduces sample efficiency. Additionally, many existing architectures struggle with scalability, relying on a fixed structure tied to a specific number of agents, limiting their applicability to environments with a variable number of entities. While approaches such as graph neural networks (GNNs) and self-attention mechanisms have progressed in addressing these challenges, they have significant limitations as dense GNNs and self-attention mechanisms incur high computational costs. To overcome these limitations, we propose a novel agent network and a non-linear mixing network that ensure permutation-equivariance and scalability, allowing them to generalize to environments with various numbers of agents. Our agent network significantly reduces computational complexity, and our scalable hypernetwork enables efficient weight generation for non-linear mixing. Additionally, we introduce curriculum learning to improve training efficiency. Experiments on SMACv2 and Google Research Football (GRF) demonstrate that our approach achieves superior learning performance compared to existing methods. By addressing both permutation-invariance and scalability in MARL, our work provides a more efficient and adaptable framework for cooperative MARL. Our code is available at this https URL.

[AI-106] Physics-based simulation ontology: an ontology to support modelling and reuse of data for physics-based simulation

Link: https://arxiv.org/abs/2503.11723
Authors: Hyunmin Cheong, Adrian Butscher
Subjects: Artificial Intelligence (cs.AI)

Abstract:The current work presents an ontology developed for physics-based simulation in engineering design, called Physics-based Simulation Ontology (PSO). The purpose of the ontology is to assist in modelling the physical phenomenon of interest in a veridical manner, while capturing the necessary and reusable information for physics-based simulation solvers. The development involved extending an existing upper ontology, Basic Formal Ontology (BFO), to define lower-level terms of PSO. PSO has two parts: PSO-Physics, which consists of terms and relations used to model physical phenomena based on the perspective of classical mechanics involving partial differential equations, and PSO-Sim, which consists of terms used to represent the information artefacts that are about the physical phenomena modelled with PSO-Physics. The former terms are used to model the physical phenomenon of interest independent of solver-specific interpretations, which can be reused across different solvers, while the latter terms are used to instantiate solver-specific input data. A case study involving two simulation solvers was conducted to demonstrate this capability of PSO. Discussion around the benefits and limitations of using BFO for the current work is also provided, which should be valuable for any future work that extends an existing upper ontology to develop ontologies for engineering applications.

[AI-107] Fine-Tuning Diffusion Generative Models via Rich Preference Optimization

Link: https://arxiv.org/abs/2503.11720
Authors: Hanyang Zhao, Haoxian Chen, Yucheng Guo, Genta Indra Winata, Tingting Ou, Ziyu Huang, David D. Yao, Wenpin Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce Rich Preference Optimization (RPO), a novel pipeline that leverages rich feedback signals to improve the curation of preference pairs for fine-tuning text-to-image diffusion models. Traditional methods, like Diffusion-DPO, often rely solely on reward model labeling, which can be opaque, offer limited insights into the rationale behind preferences, and are prone to issues such as reward hacking or overfitting. In contrast, our approach begins with generating detailed critiques of synthesized images to extract reliable and actionable image editing instructions. By implementing these instructions, we create refined images, resulting in synthetic, informative preference pairs that serve as enhanced tuning datasets. We demonstrate the effectiveness of our pipeline and the resulting datasets in fine-tuning state-of-the-art diffusion models.

[AI-108] The Relativity of Causal Knowledge

Link: https://arxiv.org/abs/2503.11718
Authors: Gabriele D’Acunto, Claudio Battiloro
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Category Theory (math.CT); Methodology (stat.ME)
Comments: 14 pages, 2 figures

Abstract:Recent advances in artificial intelligence reveal the limits of purely predictive systems and call for a shift toward causal and collaborative reasoning. Drawing inspiration from the revolution of Grothendieck in mathematics, we introduce the relativity of causal knowledge, which posits structural causal models (SCMs) are inherently imperfect, subjective representations embedded within networks of relationships. By leveraging category theory, we arrange SCMs into a functor category and show that their observational and interventional probability measures naturally form convex structures. This result allows us to encode non-intervened SCMs with convex spaces of probability measures. Next, using sheaf theory, we construct the network sheaf and cosheaf of causal knowledge. These structures enable the transfer of causal knowledge across the network while incorporating interventional consistency and the perspective of the subjects, ultimately leading to the formal, mathematical definition of relative causal knowledge.

[AI-109] Privacy-Preserved Automated Scoring using Federated Learning for Educational Research

Link: https://arxiv.org/abs/2503.11711
Authors: Ehsan Latif, Xiaoming Zhai
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to AIED25

Abstract:Data privacy remains a critical concern in educational research, necessitating Institutional Review Board (IRB) certification and stringent data handling protocols to ensure compliance with ethical standards. Traditional approaches rely on anonymization and controlled data-sharing mechanisms to facilitate research while mitigating privacy risks. However, these methods still involve direct access to raw student data, posing potential vulnerabilities and being time-consuming. This study proposes a federated learning (FL) framework for automatic scoring in educational assessments, eliminating the need to share raw data. Our approach leverages client-side model training, where student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server. To effectively aggregate heterogeneous model updates, we introduce an adaptive weighted averaging strategy, which dynamically adjusts weight contributions based on client-specific learning characteristics. This method ensures robust model convergence while preserving privacy. We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models. A statistical significance test (paired t-test, t(8) = 2.29, p = 0.051 ) confirms that the accuracy difference between the two approaches is not statistically significant, demonstrating that federated learning achieves comparable performance while safeguarding student data. Furthermore, our method significantly reduces data collection, processing, and deployment overhead, accelerating the adoption of AI-driven educational assessments in a privacy-compliant manner.
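
The aggregation step described above, where only model parameters leave each client and the server combines them with per-client weights, can be sketched as follows. This is a plain weighted average in the style of FedAvg, used here as a stand-in: the abstract does not specify how the paper's adaptive weights are computed, so the weights (for example, local sample counts) and all names are assumptions.

```python
def weighted_average(client_params, client_weights):
    """Combine per-client parameter vectors into one global vector.

    client_params:  list of equal-length lists of floats, one per client.
    client_weights: non-negative floats, one per client (assumed here to be
                    something like local sample counts; the paper's adaptive
                    weighting rule is not given in the abstract).
    """
    if len(client_params) != len(client_weights):
        raise ValueError("need exactly one weight per client")
    total = sum(client_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(client_params[0])
    global_params = [0.0] * dim
    for params, weight in zip(client_params, client_weights):
        if len(params) != dim:
            raise ValueError("all clients must send vectors of the same length")
        share = weight / total  # normalized contribution of this client
        for i, value in enumerate(params):
            global_params[i] += share * value
    return global_params


# Hypothetical round with two clients; the second holds three times as much data.
print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the parameter vectors and the weights reach the server in this sketch; the student responses used to fit each local model never appear in the aggregation code.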

[AI-110] ConjointNet: Enhancing Conjoint Analysis for Preference Prediction with Representation Learning

Link: https://arxiv.org/abs/2503.11710
Authors: Yanxia Zhang, Francine Chen, Shabnam Hakimi, Totte Harinen, Alex Filipowicz, Yan-Ying Chen, Rumen Iliev, Nikos Arechiga, Kalani Murakami, Kent Lyons, Charlene Wu, Matt Klenk
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Understanding consumer preferences is essential to product design and predicting market response to these new products. Choice-based conjoint analysis is widely used to model user preferences using their choices in surveys. However, traditional conjoint estimation techniques assume simple linear models. This assumption may lead to limited predictability and inaccurate estimation of product attribute contributions, especially on data that has underlying non-linear relationships. In this work, we employ representation learning to efficiently alleviate this issue. We propose ConjointNet, which is composed of two novel neural architectures, to predict user preferences. We demonstrate that the proposed ConjointNet models outperform traditional conjoint estimate techniques on two preference datasets by over 5%, and offer insights into non-linear feature interactions.

[AI-111] Conformal Prediction and Human Decision Making

Link: https://arxiv.org/abs/2503.11709
Authors: Jessica Hullman,Yifan Wu,Dawei Xie,Ziyang Guo,Andrew Gelman
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Methods to quantify uncertainty in predictions from arbitrary models are in demand in high-stakes domains like medicine and finance. Conformal prediction has emerged as a popular method for producing a set of predictions with specified average coverage, in place of a single prediction and confidence value. However, the value of conformal prediction sets to assist human decisions remains elusive due to the murky relationship between coverage guarantees and decision makers’ goals and strategies. How should we think about conformal prediction sets as a form of decision support? Under what conditions do we expect the support they provide to be superior versus inferior to that of alternative presentations of predictive uncertainty? We outline a decision theoretic framework for evaluating predictive uncertainty as informative signals, then contrast what can be said within this framework about idealized use of calibrated probabilities versus conformal prediction sets. Informed by prior empirical results and theories of human decisions under uncertainty, we formalize a set of possible strategies by which a decision maker might use a prediction set. We identify ways in which conformal prediction sets and posthoc predictive uncertainty quantification more broadly are in tension with common goals and needs in human-AI decision making. We give recommendations for future research in predictive uncertainty quantification to support human decision makers.
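
For readers unfamiliar with how a conformal prediction set is built, here is a minimal sketch of the standard split conformal procedure for classification. It is not taken from the paper. It assumes the nonconformity score is one minus the model's probability for the true label, and the calibration scores and probabilities are made-up illustrations.

```python
import math
from typing import List


def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is included.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: List[float], threshold: float) -> List[int]:
    """Keep every label whose score 1 - p(label) does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Assumed calibration scores: 1 - p(true label) on four held-out examples.
cal_scores = [0.1, 0.4, 0.2, 0.7]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Assumed class probabilities for a new input with three candidate labels.
labels = prediction_set([0.5, 0.35, 0.15], threshold)
```

If calibration and test examples are exchangeable, sets built this way contain the true label with probability at least 1 - alpha on average. That is the "specified average coverage" the abstract refers to, and it says nothing about how a decision maker should act on a set of a given size.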

[AI-112] Refining Filter Global Feature Weighting for Fully-Unsupervised Clustering

Link: https://arxiv.org/abs/2503.11706
Authors: Fabian Galis,Darian Onchis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In the context of unsupervised learning, effective clustering plays a vital role in revealing patterns and insights from unlabeled data. However, the success of clustering algorithms often depends on the relevance and contribution of features, which can differ between various datasets. This paper explores feature weighting for clustering and presents new weighting strategies, including methods based on SHAP (SHapley Additive exPlanations), a technique commonly used for providing explainability in various supervised machine learning tasks. By taking advantage of SHAP values in a way other than just to gain explainability, we use them to weight features and ultimately improve the clustering process itself in unsupervised scenarios. Our empirical evaluations across five benchmark datasets and clustering methods demonstrate that feature weighting based on SHAP can enhance unsupervised clustering quality, achieving up to a 22.69% improvement over other weighting methods (from 0.586 to 0.719 in terms of the Adjusted Rand Index). Additionally, these situations where the weighted data boosts the results are highlighted and thoroughly explored, offering insight for practical applications.

[AI-113] Balancing SoC in Battery Cells using Safe Action Perturbations

Link: https://arxiv.org/abs/2503.11696
Authors: E Harshith Kumar Yadav,Rahul Narava,Anshika,Shashi Shekher Jha
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Managing equal charge levels in active cell balancing while charging a Li-ion battery is challenging. An imbalance in charge levels affects the state of health of the battery, along with the concerns of thermal runaway and fire hazards. Traditional methods focus on safety assurance as a trade-off between safety and charging time. Others deal with battery-specific conditions to ensure safety, therefore losing on the generalization of the control strategies over various configurations of batteries. In this work, we propose a method to learn safe battery charging actions by using a safety-layer as an add-on over a Deep Reinforcement Learning (RL) agent. The safety layer perturbs the agent’s action to prevent the battery from encountering unsafe or dangerous states. Further, our Deep RL framework focuses on learning a generalized policy that can be effectively employed with varying configurations of batteries. Our experimental results demonstrate that the safety-layer based action perturbation incurs fewer safety violations by avoiding unsafe states along with learning a robust policy for several battery configurations.

[AI-114] MELON: Multimodal Mixture-of-Experts with Spectral-Temporal Fusion for Long-Term Mobility Estimation in Critical Care

Link: https://arxiv.org/abs/2503.11695
Authors: Jiaqing Zhang,Miguel Contreras,Jessica Sena,Andrea Davidson,Yuanfang Ren,Ziyuan Guan,Tezcan Ozrazgat-Baslanti,Tyler J. Loftus,Subhash Nerella,Azra Bihorac,Parisa Rashidi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Patient mobility monitoring in intensive care is critical for ensuring timely interventions and improving clinical outcomes. While accelerometry-based sensor data are widely adopted in training artificial intelligence models to estimate patient mobility, existing approaches face two key limitations highlighted in clinical practice: (1) modeling the long-term accelerometer data is challenging due to the high dimensionality, variability, and noise, and (2) the absence of efficient and robust methods for long-term mobility assessment. To overcome these challenges, we introduce MELON, a novel multimodal framework designed to predict 12-hour mobility status in the critical care setting. MELON leverages the power of a dual-branch network architecture, combining the strengths of spectrogram-based visual representations and sequential accelerometer statistical features. MELON effectively captures global and fine-grained mobility patterns by integrating a pre-trained image encoder for rich frequency-domain feature extraction and a Mixture-of-Experts encoder for sequence modeling. We trained and evaluated the MELON model on the multimodal dataset of 126 patients recruited from nine Intensive Care Units at the University of Florida Health Shands Hospital main campus in Gainesville, Florida. Experiments showed that MELON outperforms conventional approaches for 12-hour mobility status estimation with an overall area under the receiver operating characteristic curve (AUROC) of 0.82 (95% confidence interval 0.78-0.86). Notably, our experiments also revealed that accelerometer data collected from the wrist provides robust predictive performance compared with data from the ankle, suggesting a single-sensor solution that can reduce patient burden and lower deployment costs…

[AI-115] Exploring Causality for HRI: A Case Study on Robotic Mental Well-being Coaching

Link: https://arxiv.org/abs/2503.11684
Authors: Micol Spitale,Srikar Babu,Serhan Cakmak,Jiaee Cheong,Hatice Gunes
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:One of the primary goals of Human-Robot Interaction (HRI) research is to develop robots that can interpret human behavior and adapt their responses accordingly. Adaptive learning models, such as continual and reinforcement learning, play a crucial role in improving robots’ ability to interact effectively in real-world settings. However, these models face significant challenges due to the limited availability of real-world data, particularly in sensitive domains like healthcare and well-being. This data scarcity can hinder a robot’s ability to adapt to new situations. To address these challenges, causality provides a structured framework for understanding and modeling the underlying relationships between actions, events, and outcomes. By moving beyond mere pattern recognition, causality enables robots to make more explainable and generalizable decisions. This paper presents an exploratory causality-based analysis through a case study of an adaptive robotic coach delivering positive psychology exercises over four weeks in a workplace setting. The robotic coach autonomously adapts to multimodal human behaviors, such as facial valence and speech duration. By conducting both macro- and micro-level causal analyses, this study aims to gain deeper insights into how adaptability can enhance well-being during interactions. Ultimately, this research seeks to advance our understanding of how causality can help overcome challenges in HRI, particularly in real-world applications.

[AI-116] Timing-Driven Global Placement by Efficient Critical Path Extraction DATE’25

Link: https://arxiv.org/abs/2503.11674
Authors: Yunqi Shi,Siyuan Xu,Shixiong Kai,Xi Lin,Ke Xue,Mingxuan Yuan,Chao Qian
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted by DATE’25 as a Best Paper Award

Abstract:Timing optimization during the global placement of integrated circuits has been a significant focus for decades, yet it remains a complex, unresolved issue. Recent analytical methods typically use pin-level timing information to adjust net weights, which is fast and simple but neglects the path-based nature of the timing graph. The existing path-based methods, however, cannot balance the accuracy and efficiency due to the exponential growth of number of critical paths. In this work, we propose a GPU-accelerated timing-driven global placement framework, integrating accurate path-level information into the efficient DREAMPlace infrastructure. It optimizes the fine-grained pin-to-pin attraction objective and is facilitated by efficient critical path extraction. We also design a quadratic distance loss function specifically to align with the RC timing model. Experimental results demonstrate that our method significantly outperforms the current leading timing-driven placers, achieving an average improvement of 40.5% in total negative slack (TNS) and 8.3% in worst negative slack (WNS), as well as an improvement in half-perimeter wirelength (HPWL).

[AI-117] Optimizing Coverage-Driven Verification Using Machine Learning and PyUVM: A Novel Approach

Link: https://arxiv.org/abs/2503.11666
Authors: Suruchi Kumari,Deepak Narayan Gadde,Aman Kumar
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: To appear at 2025 IEEE International Symposium on Circuits and Systems, May 25-28 2025, London, United Kingdom

Abstract:The escalating complexity of System-on-Chip (SoC) designs has created a bottleneck in verification, with traditional techniques struggling to achieve complete coverage. Existing techniques, such as Constrained Random Verification (CRV) and coverage-driven methodologies, rely on time-consuming and redundant simulation regression, leading to higher verification costs and longer time-to-market due to the manual effort required to adjust constraints and drive the stimuli to achieve coverage objectives. To address this challenge, we propose a novel methodology that leverages supervised Machine Learning (ML) to optimize simulation regressions, resulting in reduced simulation run-time and the number of test simulations required to achieve target coverage goals. We also investigate and compare the effectiveness of various supervised learning algorithms from scikit-learn. Our results demonstrate that these algorithms can achieve at least 99% coverage regain with significantly reduced simulation cycles. We utilize Python Universal Verification Methodology (PyUVM) over SystemVerilog-Universal Verification Methodology (SV-UVM) for testbench creation, enabling simpler constructs using Python and facilitating the reuse of existing ML libraries. Our methodology is applied to three diverse designs, and our results show that it can significantly reduce verification costs, manual efforts, and time-to-market, while enhancing verification productivity and completeness, by automating the testbench update process and achieving target coverage goals.

[AI-118] MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs

Link: https://arxiv.org/abs/2503.11663
Authors: Abhishek Moitra,Arkapravo Ghosh,Shrey Agarwal,Aporva Amarnath,Karthik Swaminathan,Priyadarshini Panda
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 13 figures. Accepted to The Eighth Annual Conference on Machine Learning and Systems (MLSys), 2025

Abstract:The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior LLM-targeted quantization, and prior works on sparse acceleration have significantly mitigated the memory and computation bottleneck, they do so assuming high power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths and employ a generalized matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from an off-chip memory, computed and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth constrained memory further aggravates the memory bottleneck problem. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing that performs loss-less decomposition of large weight matrices to their unique elements thereby, reducing the enormous weight fetch latency. MEADOW demonstrates 1.5x and 2.5x lower decode and prefill latency, respectively, compared to a GEMM-based LLM implementation on the low power Xilinx ZCU102 FPGA platform that consumes less than 10W. Additionally, MEADOW achieves an end-to-end latency improvement of over 40%, compared to prior LLM optimization works.

[AI-119] A 28 nm AI microcontroller with tightly coupled zero-standby power weight memory featuring standard logic compatible 4 Mb 4-bits/cell embedded flash technology

Link: https://arxiv.org/abs/2503.11660
Authors: Daewung Kim,Seong Hwan Jeon,Young Hee Jeon,Kyung-Bae Kwon,Jigon Kim,Yeounghun Choi,Hyunseung Cha,Kitae Kwon,Daesik Park,Jongseuk Lee,Sihwan Kim,Seung-Hwan Song
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 6 pages, 8 figures, Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

Abstract:This study introduces a novel AI microcontroller optimized for cost-effective, battery-powered edge AI applications. Unlike traditional single bit/cell memory configurations, the proposed microcontroller integrates zero-standby power weight memory featuring standard logic compatible 4-bits/cell embedded flash technology tightly coupled to a Near-Memory Computing Unit. This architecture enables efficient and low-power AI acceleration. Advanced state mapping and an overstress-free word line (WL) driver circuit extend verify levels, ensuring robust 16 state cell margin. A ping-pong buffer reduces internal data movement while supporting simultaneous multi-bit processing. The fabricated microcontroller demonstrated high reliability, maintaining accuracy after 160 hours of unpowered baking at 125 °C.

[AI-120] Circuit Diagram Retrieval Based on Hierarchical Circuit Graph Representation

Link: https://arxiv.org/abs/2503.11658
Authors: Ming Gao,Ruichen Qiu,Zeng Hui Chang,Kanjian Zhang,Haikun Wei,Hong Cai Chen
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 10 figures, 7 tables, under review paper

Abstract:In the domain of analog circuit design, the retrieval of circuit diagrams has drawn great interest, primarily due to its vital role in the consultation of legacy designs and the detection of design plagiarism. Existing image retrieval techniques are adept at handling natural images: they convert images into feature vectors and retrieve similar images according to the closeness of these vectors. Nonetheless, these approaches exhibit limitations when applied to the more specialized and intricate domain of circuit diagrams. This paper presents a novel approach to circuit diagram retrieval by employing a graph representation of circuit diagrams, effectively reformulating the retrieval task as a graph retrieval problem. The proposed methodology consists of two principal components: a circuit diagram recognition algorithm designed to extract the circuit components and topological structure of the circuit using the proposed GAM-YOLO model and a 2-step connected domain filtering algorithm, and a hierarchical retrieval strategy based on graph similarity and different graph representation methods for analog circuits. Our methodology pioneers the utilization of graph representation in the retrieval of circuit diagrams, incorporating topological features that are commonly overlooked by standard image retrieval methods. The results of our experiments substantiate the efficacy of our approach in retrieving circuit diagrams across different types.

[AI-121] RainScaleGAN: a Conditional Generative Adversarial Network for Rainfall Downscaling

Link: https://arxiv.org/abs/2503.13316
Authors: Marcello Iotti,Paolo Davini,Jost von Hardenberg,Giuseppe Zappa
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 38 pages, 16 figures

Abstract:To this day, accurately simulating local-scale precipitation and reliably reproducing its distribution remains a challenging task. The limited horizontal resolution of Global Climate Models is among the primary factors undermining their skill in this context. The physical mechanisms driving the onset and development of precipitation, especially in extreme events, operate at spatio-temporal scales smaller than those numerically resolved, thus struggling to be captured accurately. In order to circumvent this limitation, several downscaling approaches have been developed over the last decades to address the discrepancy between the spatial resolution of models output and the resolution required by local-scale applications. In this paper, we introduce RainScaleGAN, a conditional deep convolutional Generative Adversarial Network (GAN) for precipitation downscaling. GANs have been effectively used in image super-resolution, an approach highly relevant for downscaling tasks. RainScaleGAN’s capabilities are tested in a perfect-model setup, where the spatial resolution of a precipitation dataset is artificially degraded from 0.25° × 0.25° to 2° × 2°, and RainScaleGAN is used to restore it. The developed model outperforms one of the leading precipitation downscaling methods found in the literature. RainScaleGAN not only generates a synthetic dataset featuring plausible high-resolution spatial patterns and intensities, but also produces a precipitation distribution with statistics closely mirroring those of the ground-truth dataset. Given that RainScaleGAN’s approach is agnostic with respect to the underlying physics, the method has the potential to be applied to other physical variables such as surface winds or temperature.

[AI-122] Quantum-Enhanced LLM Efficient Fine Tuning

Link: https://arxiv.org/abs/2503.12790
Authors: Xiaofei Kong,Lei Li,Menghan Dou,Zhaoyun Chen,Yuchun Wu,Guoping Guo
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract:Low-Rank Adaptation (LoRA) enables efficient fine-tuning of pre-trained language models via low-rank matrix approximation, which is effective in many scenarios. However, its low-rank representation capacity is constrained in complex tasks or high-rank dependency settings, potentially limiting model adaptability. Addressing the expressive bottleneck of classical low-rank approximation in fine-tuning large language models, this paper proposes a parameter-efficient fine-tuning method based on a Quantum Weighted Tensor Hybrid Network (QWTHN), which leverages Quantum Neural Network (QNN). The study investigates quantum-classical hybrid parameter-efficient fine-tuning in low-rank spaces. QWTHN decomposes pre-trained weights into quantum neural network and tensor network representations, utilizing quantum state superposition and other methods to break through classical rank limitations. Experiments show that the proposed quantum fine-tuning technique for large models approaches or even surpasses the parameter efficiency of LoRA. On the CPsyCounD and R1-Distill-SFT datasets, QWTHN, compared to classical LoRA, reduces training loss by up to 15% while using 76% fewer parameters, and achieves an 8.4% performance improvement on the CPsyCounD test set. This research not only realizes lightweight and efficient adaptation of quantum resources to billion-parameter models but also validates the practical path of quantum hardware driven by large model tasks, laying the first engineering-ready technical foundation for future quantum-enhanced AGI systems.
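
As background for the comparison in this abstract, the sketch below shows the classical LoRA baseline only, not the quantum QWTHN method. It assumes a frozen weight matrix W0 that is adapted by adding a rank-r product B·A, so only the r·(d_out + d_in) entries of B and A are trained. The helper names and the tiny matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w0: Matrix, b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Return W0 + scale * (B @ A); W0 stays frozen, only B and A are trained."""
    delta = matmul(b, a)
    return [
        [w0[i][j] + scale * delta[i][j] for j in range(len(w0[0]))]
        for i in range(len(w0))
    ]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` update: B is d_out x r, A is r x d_in."""
    return rank * (d_out + d_in)


# Assumed toy example: a 2 x 2 frozen weight with a rank-1 update.
w0 = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
w_eff = lora_effective_weight(w0, b, a)

# Parameter savings for an assumed 4096 x 4096 layer at rank 8.
full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 8)
```

Because the update B·A has rank at most r, it cannot represent higher-rank weight changes. This is the expressive bottleneck the abstract says the quantum-classical hybrid is meant to relax.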

[AI-123] Fourier-Based 3D Multistage Transformer for Aberration Correction in Multicellular Specimens

Link: https://arxiv.org/abs/2503.12593
Authors: Thayer Alshaabi,Daniel E. Milkie,Gaoxiang Liu,Cyna Shirazinejad,Jason L. Hong,Kemal Achour,Frederik Görlitz,Ana Milunovic-Jevtic,Cat Simmons,Ibrahim S. Abuzahriyeh,Erin Hong,Samara Erin Williams,Nathanael Harrison,Evan Huang,Eun Seok Bae,Alison N. Killilea,David G. Drubin,Ian A. Swinburne,Srigokul Upadhyayula,Eric Betzig
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Comments: 52 pages, 6 figures, 23 si figures, 8 si tables

Abstract:High-resolution tissue imaging is often compromised by sample-induced optical aberrations that degrade resolution and contrast. While wavefront sensor-based adaptive optics (AO) can measure these aberrations, such hardware solutions are typically complex, expensive to implement, and slow when serially mapping spatially varying aberrations across large fields of view. Here, we introduce AOViFT (Adaptive Optical Vision Fourier Transformer) – a machine learning-based aberration sensing framework built around a 3D multistage Vision Transformer that operates on Fourier domain embeddings. AOViFT infers aberrations and restores diffraction-limited performance in puncta-labeled specimens with substantially reduced computational cost, training time, and memory footprint compared to conventional architectures or real-space networks. We validated AOViFT on live gene-edited zebrafish embryos, demonstrating its ability to correct spatially varying aberrations using either a deformable mirror or post-acquisition deconvolution. By eliminating the need for the guide star and wavefront sensing hardware and simplifying the experimental workflow, AOViFT lowers technical barriers for high-resolution volumetric microscopy across diverse biological samples.

[AI-124] A Reservoir-based Model for Human-like Perception of Complex Rhythm Pattern

Link: https://arxiv.org/abs/2503.12509
Authors: Zhongju Yuan,Geraint Wiggins,Dick Botteldooren
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Rhythm is a fundamental aspect of human behaviour, present from infancy and deeply embedded in cultural practices. Rhythm anticipation is a spontaneous cognitive process that typically occurs before the onset of actual beats. While most research in both neuroscience and artificial intelligence has focused on metronome-based rhythm tasks, studies investigating the perception of complex musical rhythm patterns remain limited. To address this gap, we propose a hierarchical oscillator-based model to better understand the perception of complex musical rhythms in biological systems. The model consists of two types of coupled neurons that generate oscillations, with different layers tuned to respond to distinct perception levels. We evaluate the model using several representative rhythm patterns spanning the upper, middle, and lower bounds of human musical perception. Our findings demonstrate that, while maintaining a high degree of synchronization accuracy, the model exhibits human-like rhythmic behaviours. Additionally, the beta band neuronal activity in the model mirrors patterns observed in the human brain, further validating the biological plausibility of the approach.

[AI-125] SING: Semantic Image Communications using Null-Space and INN-Guided Diffusion Models

Link: https://arxiv.org/abs/2503.12484
Authors: Jiakang Chen,Selim F. Yilmaz,Di You,Pier Luigi Dragotti,Deniz Gündüz
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Joint source-channel coding systems based on deep neural networks (DeepJSCC) have recently demonstrated remarkable performance in wireless image transmission. Existing methods primarily focus on minimizing distortion between the transmitted image and the reconstructed version at the receiver, often overlooking perceptual quality. This can lead to severe perceptual degradation when transmitting images under extreme conditions, such as low bandwidth compression ratios (BCRs) and low signal-to-noise ratios (SNRs). In this work, we propose SING, a novel two-stage JSCC framework that formulates the recovery of high-quality source images from corrupted reconstructions as an inverse problem. Depending on the availability of information about the DeepJSCC encoder/decoder and the channel at the receiver, SING can either approximate the stochastic degradation as a linear transformation, or leverage invertible neural networks (INNs) for precise modeling. Both approaches enable the seamless integration of diffusion models into the reconstruction process, enhancing perceptual quality. Experimental results demonstrate that SING outperforms DeepJSCC and other approaches, delivering superior perceptual quality even under extremely challenging conditions, including scenarios with significant distribution mismatches between the training and test data.

[AI-126] When neural implant meets multimodal LLM: A dual-loop system for neuromodulation and naturalistic neuralbehavioral research

Link: https://arxiv.org/abs/2503.12334
Authors: Edward Hong Wang,Cynthia Xin Wen
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:We propose a novel dual-loop system that synergistically combines responsive neurostimulation (RNS) implants with artificial intelligence-driven wearable devices for treating post-traumatic stress disorder (PTSD) and enabling naturalistic brain research. In PTSD Therapy Mode, an implanted closed-loop neural device monitors amygdala activity and provides on-demand stimulation upon detecting pathological theta oscillations, while an ensemble of wearables (smart glasses, smartwatches, smartphones) uses multimodal large language model (LLM) analysis of sensory data to detect environmental or physiological PTSD triggers and deliver timely audiovisual interventions. Logged events from both the neural and wearable loops are analyzed to personalize trigger detection and progressively transition patients to non-invasive interventions. In Neuroscience Research Mode, the same platform is adapted for real-world brain activity capture. Wearable-LLM systems recognize naturalistic events (social interactions, emotional situations, compulsive behaviors, decision making) and signal implanted RNS devices (via wireless triggers) to record synchronized intracranial data during these moments. This approach builds on recent advances in mobile intracranial EEG recording and closed-loop neuromodulation in humans (BRAIN Initiative, 2023) (Mobbs et al., 2021). We discuss how our interdisciplinary system could revolutionize PTSD therapy and cognitive neuroscience by enabling 24/7 monitoring, context-aware intervention, and rich data collection outside traditional labs. The vision is a future where AI-enhanced devices continuously collaborate with the human brain, offering therapeutic support and deep insights into neural function, with the resulting real-world context rich neural data, in turn, accelerating the development of more biologically-grounded and human-centric AI.

[AI-127] Language Models for Automated Classification of Brain MRI Reports and Growth Chart Generation

Link: https://arxiv.org/abs/2503.12143
Authors: Maryam Daniali,Shivaram Karandikar,Dabriel Zimmerman,J. Eric Schmitt,Matthew J. Buczek,Benjamin Jung,Laura Mercedes,Jakob Seidlitz,Vanessa Troiani,Lena Dorfschmidt,Eren Kafadar,Remo Williams,Susan Sotardi,Arastoo Vosough,Scott Haag,Jenna M. Schabdach,Aaron Alexander-Bloch
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Clinically acquired brain MRIs and radiology reports are valuable but underutilized resources due to the challenges of manual analysis and data heterogeneity. We developed fine-tuned language models (LMs) to classify brain MRI reports as normal (reports with limited pathology) or abnormal, fine-tuning BERT, BioBERT, ClinicalBERT, and RadBERT on 44,661 reports. We also explored the reasoning capabilities of a leading LM, Gemini 1.5-Pro, for normal report categorization. Automated image processing and modeling generated brain growth charts from LM-classified normal scans, comparing them to human-derived charts. Fine-tuned LMs achieved high classification performance (F1-Score 97%), with unbalanced training mitigating class imbalance. Performance was robust on out-of-distribution data, with full text outperforming summary (impression) sections. Gemini 1.5-Pro showed a promising categorization performance, especially with clinical inference. LM-derived brain growth charts were nearly identical to human-annotated charts (r = 0.99, p < 2.2e-16). Our LMs offer scalable analysis of radiology reports, enabling automated classification of brain MRIs in large datasets. One application is automated generation of brain growth charts for benchmarking quantitative image features. Further research is needed to address data heterogeneity and optimize LM reasoning.

[AI-128] Adaptive Stochastic Gradient Descents on Manifolds with an Application on Weighted Low-Rank Approximation

Link: https://arxiv.org/abs/2503.11833
Authors: Peiqi Yang,Conglong Xu,Hao Wu
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:We prove a convergence theorem for stochastic gradient descents on manifolds with adaptive learning rate and apply it to the weighted low-rank approximation problem.

Machine Learning

[LG-0] Uncovering Utility Functions from Observed Outcomes

Link: https://arxiv.org/abs/2503.13432
Authors: Marta Grzeskiewicz
Subjects: Machine Learning (cs.LG)
Comments: Working paper

Abstract:Determining consumer preferences and utility is a foundational challenge in economics. They are central in determining consumer behaviour through the utility-maximising consumer decision-making process. However, preferences and utilities are not observable and may not even be known to the individual making the choice; only the outcome is observed in the form of demand. Without the ability to observe the decision-making mechanism, demand estimation becomes a challenging task and current methods fall short due to lack of scalability or ability to identify causal effects. Estimating these effects is critical when considering changes in policy, such as pricing, the impact of taxes and subsidies, and the effect of a tariff. To address the shortcomings of existing methods, we combine revealed preference theory and inverse reinforcement learning to present a novel algorithm, Preference Extraction and Reward Learning (PEARL) which, to the best of our knowledge, is the only algorithm that can uncover a representation of the utility function that best rationalises observed consumer choice data given a specified functional form. We introduce a flexible utility function, the Input-Concave Neural Network which captures complex relationships across goods, including cross-price elasticities. Results show PEARL outperforms the benchmark on both noise-free and noisy synthetic data.

[LG-1] Measuring In-Context Computation Complexity via Hidden State Prediction

Link: https://arxiv.org/abs/2503.13431
Authors: Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
Subjects: Machine Learning (cs.LG)

Abstract:Detecting when a neural sequence model does “interesting” computation is an open problem. The next token prediction loss is a poor indicator: Low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model’s ability to predict its own future hidden states. We show empirically that this metric – in contrast to the next token prediction loss – correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic “prediction of hidden states” (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.

[LG-2] Investigating the effect of CPT in lateral spreading prediction using Explainable AI

Link: https://arxiv.org/abs/2503.13389
Authors: Cheng-Hsi Hsiao, Ellen Rathje, Krishna Kumar
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:This study proposes an autoencoder approach to extract latent features from cone penetration test (CPT) profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent features with site parameters to train XGBoost models for predicting lateral spreading occurrences in the 2011 Christchurch earthquake. Models using the latent CPT features outperformed models with conventional CPT metrics or no CPT data, achieving over 83% accuracy. Explainable AI revealed that the most crucial latent feature corresponds to soil behavior at depths of 1 to 3 meters, highlighting the importance of this depth range for liquefaction evaluation. The autoencoder approach provides an automated technique for condensing CPT profiles into informative latent features for machine-learning liquefaction models.

[LG-3] SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization WACV2025

Link: https://arxiv.org/abs/2503.13371
Authors: Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson
Subjects: Machine Learning (cs.LG)
Comments: Accepted to WACV 2025

Abstract:Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated mainly on two aspects: lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models achieving superior image fidelity but weaker synchronization than their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach to improve diffusion-based models using a temporal pose frame with information bottleneck and facial-informative audio features extracted from AVHuBERT, as conditioning input into the diffusion process. We evaluate SyncDiff on two canonical talking head datasets, LRS2 and LRS3, for direct comparison with other SOTA models. Experiments on the LRS2/LRS3 datasets show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods, while preserving their high-fidelity characteristics.

[LG-4] Follow-the-Regularized-Leader with Adversarial Constraints

Link: https://arxiv.org/abs/2503.13366
Authors: Ricardo N. Ferreira, Cláudia Soares
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Constrained Online Convex Optimization (COCO) can be seen as a generalization of the standard Online Convex Optimization (OCO) framework. At each round, a cost function and a constraint function are revealed after a learner chooses an action. The goal is to minimize both the regret and the cumulative constraint violation (CCV) against an adaptive adversary. We show for the first time that it is possible to obtain the optimal $O(\sqrt{T})$ bound on both regret and CCV, improving the best known bounds of $O(\sqrt{T})$ and $\tilde{O}(\sqrt{T})$ for the regret and CCV, respectively.
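
For reference, one common way to formalise the two quantities named in this abstract is sketched below. The symbols $f_t$ (cost), $g_t$ (constraint), $x_t$ (action), $\mathcal{X}$ (action set) and $\mathcal{X}^\star$ (comparator set) are notation assumed here rather than taken from the abstract, and the paper's exact definitions may differ.

```latex
\[
\mathcal{X}^\star = \{\, x \in \mathcal{X} : g_t(x) \le 0 \ \text{for all } t = 1, \dots, T \,\}
\]
\[
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{X}^\star} \sum_{t=1}^{T} f_t(x),
\qquad
\mathrm{CCV}_T = \sum_{t=1}^{T} \max\{ g_t(x_t),\, 0 \}
\]
```

Under this formulation, a bound of $O(\sqrt{T})$ on both quantities means that the average per-round regret and the average per-round violation both vanish at rate $1/\sqrt{T}$.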

[LG-5] Agents Play Thousands of 3D Video Games

Link: https://arxiv.org/abs/2503.13356
Authors: Zhongwen Xu, Xianliang Wang, Siyi Li, Tao Yu, Liang Wang, Qiang Fu, Wei Yang
Subjects: Machine Learning (cs.LG)

Abstract:We present PORTAL, a novel framework for developing artificial intelligence agents capable of playing thousands of 3D video games through language-guided policy generation. By transforming decision-making problems into language modeling tasks, our approach leverages large language models (LLMs) to generate behavior trees represented in a domain-specific language (DSL). This method eliminates the computational burden associated with traditional reinforcement learning approaches while preserving strategic depth and rapid adaptability. Our framework introduces a hybrid policy structure that combines rule-based nodes with neural network components, enabling both high-level strategic reasoning and precise low-level control. A dual-feedback mechanism incorporating quantitative game metrics and vision-language model analysis facilitates iterative policy improvement at both tactical and strategic levels. The resulting policies are instantaneously deployable, human-interpretable, and capable of generalizing across diverse gaming environments. Experimental results demonstrate PORTAL’s effectiveness across thousands of first-person shooter (FPS) games, showcasing significant improvements in development efficiency, policy generalization, and behavior diversity compared to traditional approaches. PORTAL represents a significant advancement in game AI development, offering a practical solution for creating sophisticated agents that can operate across thousands of commercial video games with minimal development overhead. Experimental results on the 3D video games are best viewed on this https URL.

[LG-6] PERC: a suite of software tools for the curation of cryoEM data with application to simulation modelling and machine learning

Link: https://arxiv.org/abs/2503.13329
Authors: Beatriz Costa-Gomes, Joel Greer, Nikolai Juraschko, James Parkhurst, Jola Mirecka, Marjan Famili, Camila Rangel-Smith, Oliver Strickson, Alan Lowe, Mark Basham, Tom Burnley
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
Comments: 22 pages, 4 figures

Abstract:Ease of access to data, tools and models expedites scientific research. In structural biology there are now numerous open repositories of experimental and simulated datasets. Being able to easily access and utilise these is crucial for allowing researchers to make optimal use of their research effort. The tools presented here are useful for collating existing public cryoEM datasets and/or creating new synthetic cryoEM datasets to aid the development of novel data processing and interpretation algorithms. In recent years, structural biology has seen the development of a multitude of machine-learning based algorithms for aiding numerous steps in the processing and reconstruction of experimental datasets and the use of these approaches has become widespread. Developing such techniques in structural biology requires access to large datasets which can be cumbersome to curate and unwieldy to make use of. In this paper we present a suite of Python software packages which we collectively refer to as PERC (profet, EMPIARreader and CAKED). These are designed to reduce the burden which data curation places upon structural biology research. The protein structure fetcher (profet) package allows users to conveniently download and cleave sequences or structures from the Protein Data Bank or Alphafold databases. EMPIARreader allows lazy loading of Electron Microscopy Public Image Archive datasets in a machine-learning compatible structure. The Class Aggregator for Key Electron-microscopy Data (CAKED) package is designed to seamlessly facilitate the training of machine learning models on electron microscopy data, including electron-cryo-microscopy-specific data augmentation and labelling. These packages may be utilised independently or as building blocks in workflows. All are available in open source repositories and designed to be easily extensible to facilitate more advanced workflows if required.

[LG-7] SMPR: A structure-enhanced multimodal drug-disease prediction model for drug repositioning and cold start

Link: https://arxiv.org/abs/2503.13322
Authors: Xin Dong, Rui Miao, Suyan Zhang, Shuaibing Jia, Leifeng Zhang, Yong Liang, Jianhua Zhang, Yi Zhun Zhu
Subjects: Machine Learning (cs.LG)

Abstract:Predicting drug-disease relationships for drug repositioning has long been an active field of research. However, biologically validated cases of drug repositioning remain very limited, and existing models have not yet fully utilized the structural information of the drug. Furthermore, most repositioning models are only used to complete the relationship matrix, and their practicality is poor when dealing with drug cold start problems. This paper proposes a structure-enhanced multimodal relationship prediction model (SMPR). SMPR starts from the SMILES structure of the drug, uses the Mol2VEC method to generate drug embeddings, and learns disease embeddings through a heterogeneous-network graph neural network. Finally, a drug-disease relationship matrix is constructed. In addition, to make the model easier to use, SMPR provides a cold start interface, based on structural similarity and the repositioning results, that quickly predicts drug-related diseases. The repositioning ability and cold start capability of the model are verified from multiple perspectives. The AUC and AUCPR scores for repositioning reach 99% and 61%, respectively, and the AUC for cold start reaches 80%. In particular, the cold start recall exceeds 70%, which means that SMPR is sensitive to positive samples. Finally, case analysis verifies the practical value of the model, and visual analysis directly demonstrates how the structural information improves the model. For quick use, we also provide local deployment of the model and package it into an executable program.

[LG-8] GFSNetwork: Differentiable Feature Selection via Gumbel-Sigmoid Relaxation

Link: https://arxiv.org/abs/2503.13304
Authors: Witold Wydmański, Marek Śmieja
Subjects: Machine Learning (cs.LG)

Abstract:Feature selection in deep learning remains a critical challenge, particularly for high-dimensional tabular data where interpretability and computational efficiency are paramount. We present GFSNetwork, a novel neural architecture that performs differentiable feature selection through temperature-controlled Gumbel-Sigmoid sampling. Unlike traditional methods, where the user has to specify the number of features to select, GFSNetwork determines it automatically during an end-to-end process. Moreover, GFSNetwork maintains constant computational overhead regardless of the number of input features. We evaluate GFSNetwork on a series of classification and regression benchmarks, where it consistently outperforms recent methods, including DeepLasso and attention maps, as well as traditional feature selectors, while using significantly fewer features. Furthermore, we validate our approach on real-world metagenomic datasets, demonstrating its effectiveness on high-dimensional biological data. In conclusion, our method provides a scalable solution that bridges the gap between neural network flexibility and traditional feature selection interpretability. We share our Python implementation of GFSNetwork at this https URL, as well as a PyPI package (gfs_network).
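
As a rough illustration of the temperature-controlled Gumbel-Sigmoid sampling mentioned in this abstract, here is a minimal standard-library sketch of the generic relaxation. It is not the GFSNetwork implementation and does not use the gfs_network package; the function names, the logits and the temperatures are all assumptions made for the example.

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) sample by inverse transform; u is clamped to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_sigmoid(logits, temperature, rng):
    # Relaxed Bernoulli gate per feature: perturb each logit with the
    # difference of two Gumbel samples, then squash with a tempered sigmoid.
    gates = []
    for logit in logits:
        noise = sample_gumbel(rng) - sample_gumbel(rng)
        gates.append(sigmoid((logit + noise) / temperature))
    return gates


rng = random.Random(0)
feature_logits = [4.0, 0.0, -4.0]  # hypothetical per-feature selection logits
soft_gates = gumbel_sigmoid(feature_logits, temperature=1.0, rng=rng)
sharp_gates = gumbel_sigmoid(feature_logits, temperature=0.01, rng=rng)
```

At a high temperature the gates are soft values between 0 and 1; as the temperature is lowered they concentrate near 0 or 1, which is what lets a gate act as an approximately binary feature mask. In a real model the gates would be produced by a differentiable framework so that the logits can be trained.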

[LG-9] On Local Posterior Structure in Deep Ensembles

Link: https://arxiv.org/abs/2503.13296
Authors: Mikkel Jordahn, Jonas Vestergaard Jensen, Mikkel N. Schmidt, Michael Riis Andersen
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Code and models available at this https URL

Abstract:Bayesian Neural Networks (BNNs) often improve model calibration and predictive uncertainty quantification compared to point estimators such as maximum-a-posteriori (MAP). Similarly, deep ensembles (DEs) are also known to improve calibration, and therefore, it is natural to hypothesize that deep ensembles of BNNs (DE-BNNs) should provide even further improvements. In this work, we systematically investigate this across a number of datasets, neural network architectures, and BNN approximation methods and surprisingly find that when the ensembles grow large enough, DEs consistently outperform DE-BNNs on in-distribution data. To shed light on this observation, we conduct several sensitivity and ablation studies. Moreover, we show that even though DE-BNNs outperform DEs on out-of-distribution metrics, this comes at the cost of decreased in-distribution performance. As a final contribution, we open-source the large pool of trained models to facilitate further research on this topic.

[LG-10] Graph Generative Models Evaluation with Masked Autoencoder

Link: https://arxiv.org/abs/2503.13271
Authors: Chengen Wang, Murat Kantarcioglu
Subjects: Machine Learning (cs.LG)

Abstract:In recent years, numerous graph generative models (GGMs) have been proposed. However, evaluating these models remains a considerable challenge, primarily due to the difficulty in extracting meaningful graph features that accurately represent real-world graphs. The traditional evaluation techniques, which rely on graph statistical properties like node degree distribution, clustering coefficients, or Laplacian spectrum, overlook node features and lack scalability. Newly proposed deep learning-based methods employ graph random neural networks or contrastive learning to extract graph features and demonstrate superior performance compared to traditional statistical methods, but their experimental results also show that these methods do not always work well across different metrics. Although there are overlaps among these metrics, they are generally not interchangeable, each evaluating generative models from a different perspective. In this paper, we propose a novel method that leverages graph masked autoencoders to effectively extract graph features for GGM evaluations. We conduct extensive experiments on graphs and empirically demonstrate that our method can be more reliable and effective than previously proposed methods across a number of GGM evaluation metrics, such as “Fréchet Distance (FD)” and “MMD Linear”. However, no single method stands out consistently across all metrics and datasets. Therefore, this study also aims to raise awareness of the significance and challenges associated with GGM evaluation techniques, especially in light of recent advances in generative models.

[LG-11] Neural network-based Godunov corrections for approximate Riemann solvers using bi-fidelity learning

Link: https://arxiv.org/abs/2503.13248
Authors: Akshay Thakur, Matthew J. Zahr
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 22 pages, 16 figures

Abstract:The Riemann problem is fundamental in the computational modeling of hyperbolic partial differential equations, enabling the development of stable and accurate upwind schemes. While exact solvers provide robust upwinding fluxes, their high computational cost necessitates approximate solvers. Although approximate solvers achieve accuracy in many scenarios, they produce inaccurate solutions in certain cases. To overcome this limitation, we propose constructing neural network-based surrogate models, trained using supervised learning, designed to map interior and exterior conservative state variables to the corresponding exact flux. Specifically, we propose two distinct approaches: one utilizing a vanilla neural network and the other employing a bi-fidelity neural network. The performance of the proposed approaches is demonstrated through applications to one-dimensional and two-dimensional partial differential equations, showcasing their robustness and accuracy.

[LG-12] Highly Efficient Direct Analytics on Semantic-aware Time Series Data Compression

Link: https://arxiv.org/abs/2503.13246
Authors: Guoyou Sun, Panagiotis Karras, Qi Zhang
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Semantic communication has emerged as a promising paradigm to tackle the challenges of massively growing data traffic and sustainable data communication. It shifts the focus from data fidelity to goal-oriented or task-oriented semantic transmission. While deep learning-based methods are commonly used for semantic encoding and decoding, they struggle with the sequential nature of time series data and high computation cost, particularly in resource-constrained IoT environments. Data compression plays a crucial role in reducing transmission and storage costs, yet traditional data compression methods fall short of the demands of goal-oriented communication systems. In this paper, we propose a novel method for direct analytics on time series data compressed by the SHRINK compression algorithm. Through experimentation using outlier detection as a case study, we show that our method outperforms baselines running on uncompressed data in multiple cases, with merely 1% difference in the worst case. Additionally, it achieves four times lower runtime on average and accesses approximately 10% of the data volume, which enables edge analytics with limited storage and computation power. These results demonstrate that our approach offers reliable, high-speed outlier detection analytics for diverse IoT applications while extracting semantics from time-series data, achieving high compression, and reducing data transmission.

[LG-13] ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction ICLR

Link: https://arxiv.org/abs/2503.13224
Authors: Tong Zhou, Shijin Duan, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

Abstract:Pre-trained models are valuable intellectual property, capturing both domain-specific and domain-invariant features within their weight spaces. However, model extraction attacks threaten these assets by enabling unauthorized source-domain inference and facilitating cross-domain transfer via the exploitation of domain-invariant features. In this work, we introduce ProDiF, a novel framework that leverages targeted weight space manipulation to secure pre-trained models against extraction attacks. ProDiF quantifies the transferability of filters and perturbs the weights of critical filters in unsecured memory, while preserving actual critical weights in a Trusted Execution Environment (TEE) for authorized users. A bi-level optimization further ensures resilience against adaptive fine-tuning attacks. Experimental results show that ProDiF reduces source-domain accuracy to near-random levels and decreases cross-domain transferability by 74.65%, providing robust protection for pre-trained models. This work offers comprehensive protection for pre-trained DNN models and highlights the potential of weight space manipulation as a novel approach to model security.

[LG-14] MAME: Multidimensional Adaptive Metamer Exploration with Human Perceptual Feedback

Link: https://arxiv.org/abs/2503.13212
Authors: Mina Kamao, Hayato Ono, Ayumu Yamashita, Kaoru Amano, Masataka Sawayama
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 9 figures

Abstract:Alignment between human brain networks and artificial models is actively studied in machine learning and neuroscience. A widely adopted approach to explore their functional alignment is to identify metamers for both humans and models. Metamers refer to input stimuli that are physically different but equivalent within a given system. If a model’s metameric space completely matched the human metameric space, the model would achieve functional alignment with humans. However, conventional methods lack direct ways to search for human metamers. Instead, researchers first develop biologically inspired models and then draw inferences about human metamers indirectly by testing whether model metamers also appear as metamers to humans. Here, we propose the Multidimensional Adaptive Metamer Exploration (MAME) framework, enabling direct high-dimensional exploration of human metameric space. MAME leverages online image generation guided by human perceptual feedback. Specifically, it modulates reference images across multiple dimensions by leveraging hierarchical responses from convolutional neural networks (CNNs). Generated images are presented to participants whose perceptual discriminability is assessed in a behavioral task. Based on participants’ responses, subsequent image generation parameters are adaptively updated online. Using our MAME framework, we successfully measured a human metameric space of over fifty dimensions within a single experiment. Experimental results showed that human discrimination sensitivity was lower for metameric images based on low-level features compared to high-level features, which image contrast metrics could not explain. The finding suggests that the model computes low-level information not essential for human perception. Our framework has the potential to contribute to developing interpretable AI and understanding of brain function in neuroscience.

[LG-15] Deep Learning Advancements in Anomaly Detection: A Comprehensive Survey

Link: https://arxiv.org/abs/2503.13195
Authors: Haoqi Huang, Ping Wang, Jianhua Pei, Jiacheng Wang, Shahen Alexanian, Dusit Niyato
Subjects: Machine Learning (cs.LG)

Abstract:The rapid expansion of data from diverse sources has made anomaly detection (AD) increasingly essential for identifying unexpected observations that may signal system failures, security breaches, or fraud. As datasets become more complex and high-dimensional, traditional detection methods struggle to effectively capture intricate patterns. Advances in deep learning have made AD methods more powerful and adaptable, improving their ability to handle high-dimensional and unstructured data. This survey provides a comprehensive review of over 180 recent studies, focusing on deep learning-based AD techniques. We categorize and analyze these methods into reconstruction-based and prediction-based approaches, highlighting their effectiveness in modeling complex data distributions. Additionally, we explore the integration of traditional and deep learning methods, highlighting how hybrid approaches combine the interpretability of traditional techniques with the flexibility of deep learning to enhance detection accuracy and model transparency. Finally, we identify open issues and propose future research directions to advance the field of AD. This review bridges gaps in existing literature and serves as a valuable resource for researchers and practitioners seeking to enhance AD techniques using deep learning.

[LG-16] PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Link: https://arxiv.org/abs/2503.13173
Authors: Ori Peleg, Natalie Lang, Stefano Rini, Nir Shlezinger, Kobi Cohen
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.

[LG-17] Laplace-Net: Learning Dynamical Systems with External Forcing

Link: https://arxiv.org/abs/2503.13158
Authors: Bernd Zimmering, Cecília Coelho, Vaibhav Gupta, Maria Maleshkova, Oliver Niggemann
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint - under review

Abstract:Modelling forced dynamical systems - where an external input drives the system state - is critical across diverse domains such as engineering, finance, and the natural sciences. In this work, we propose Laplace-Net, a decoupled, solver-free neural framework for learning forced and delay-aware systems. It leverages a Laplace transform-based approach to decompose internal dynamics, external inputs, and initial values into established theoretical concepts, enhancing interpretability. Laplace-Net promotes transferability since the system can be rapidly re-trained or fine-tuned for new forcing signals, providing flexibility in applications ranging from controller adaptation to long-horizon forecasting. Experimental results on eight benchmark datasets - including linear, non-linear, and delayed systems - demonstrate the method’s improved accuracy and robustness compared to state-of-the-art approaches, particularly in handling complex and previously unseen inputs.

[LG-18] High-entropy Advantage in Neural Networks Generalizability

Link: https://arxiv.org/abs/2503.13145
Authors: Entao Yang, Xiaotian Zhang, Yue Shang, Ge Zhang
Subjects: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)

Abstract:While the 2024 Nobel Prize in Physics has ignited a worldwide discussion on the origins of neural networks and their foundational links to physics, modern machine learning research predominantly focuses on computational and algorithmic advancements, overlooking the physical picture. Here we introduce the concept of entropy into neural networks by reconceptualizing them as hypothetical physical systems where each parameter is a non-interacting ‘particle’ within a one-dimensional space. By employing the Wang-Landau algorithm, we construct the entropy landscapes of neural networks (with up to 1 million parameters) as functions of training loss and test accuracy (or loss) across four distinct machine learning tasks: arithmetic questions, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of an entropy advantage, where the high-entropy states generally outperform the states reached via classical training optimizers such as stochastic gradient descent. We also find that this advantage is more pronounced in narrower networks, indicating a need for different training optimizers tailored to different sizes of neural networks.

[LG-19] VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding

Link: https://arxiv.org/abs/2503.13116
Authors: Zeng Wang, Minghao Shao, Mohammed Nabeel, Prithwish Basu Roy, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Subjects: Cryptography and Security (cs.CR); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) offer significant potential for coding, yet fine-tuning (FT) with curated data is essential for niche languages like Verilog. Using proprietary intellectual property (IP) for FT presents a serious risk, as FT data can be leaked through LLM inference. This leads to a critical dilemma for design houses: seeking to build externally accessible LLMs offering competitive Verilog coding, how can they leverage in-house IP to enhance FT utility while ensuring IP protection? For the first time in the literature, we study this dilemma. Using LLaMA 3.1-8B, we conduct in-house FT on a baseline Verilog dataset (RTLCoder) supplemented with our own in-house IP, which is validated through multiple tape-outs. To rigorously assess IP leakage, we quantify structural similarity (AST/Dolos) and functional equivalence (Synopsys Formality) between generated codes and our in-house IP. We show that our IP can indeed be leaked, confirming the threat. As defense, we evaluate logic locking of Verilog codes (ASSURE). This offers some level of protection, yet reduces the IP’s utility for FT and degrades the LLM’s performance. Our study shows the need for novel strategies that are both effective and minimally disruptive to FT, an essential effort for enabling design houses to fully utilize their proprietary IP toward LLM-driven Verilog coding.

[LG-20] Exploring the Potential of Bilevel Optimization for Calibrating Neural Networks

Link: https://arxiv.org/abs/2503.13113
Authors: Gabriele Sanguin, Arjun Pakrashi, Marco Viola, Francesco Rinaldi
Subjects: Machine Learning (cs.LG)

Abstract:Handling uncertainty is critical for ensuring reliable decision-making in intelligent systems. Modern neural networks are known to be poorly calibrated, resulting in predicted confidence scores that are difficult to use. This article explores improving confidence estimation and calibration through the application of bilevel optimization, a framework designed to solve hierarchical problems with interdependent optimization levels. A self-calibrating bilevel neural-network training approach is introduced to improve a model’s predicted confidence scores. The effectiveness of the proposed framework is analyzed using toy datasets, such as Blobs and Spirals, as well as more practical simulated datasets, such as Blood Alcohol Concentration (BAC). It is compared with a well-known and widely used calibration strategy, isotonic regression. The reported experimental results reveal that the proposed bilevel optimization approach reduces the calibration error while preserving accuracy.
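
The baseline named in this abstract, isotonic regression, can be sketched with the pool-adjacent-violators algorithm. The snippet below is a minimal standard-library illustration of that baseline only, not of the bilevel method; the function names and the four example predictions are made up for the illustration.

```python
def isotonic_fit(values):
    # Pool adjacent violators: keep blocks as [sum, count] and merge the last
    # two blocks whenever their means are out of non-decreasing order.
    blocks = []
    for v in values:
        blocks.append([float(v), 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def calibrate(confidences, labels):
    # Sort predictions by confidence, then fit a non-decreasing map from
    # confidence to the empirical frequency of the positive label.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    fitted = isotonic_fit([labels[i] for i in order])
    return [(confidences[i], p) for i, p in zip(order, fitted)]


# Hypothetical binary predictions: raw confidence and whether it was correct.
pairs = calibrate([0.2, 0.6, 0.7, 0.9], [0, 1, 0, 1])
```

Here the two middle predictions violate monotonicity (confidence 0.6 was correct, 0.7 was not), so they are pooled and both map to a calibrated value of 0.5.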

[LG-21] Towards Better Sample Efficiency in Multi-Agent Reinforcement Learning via Exploration

链接: https://arxiv.org/abs/2503.13077
作者: Amir Baghi,Jens Sjölund,Joakim Bergdahl,Linus Gisslén,Alessandro Sestini
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:Multi-agent reinforcement learning has shown promise in learning cooperative behaviors in team-based environments. However, such methods often demand extensive training time. For instance, the state-of-the-art method TiZero takes 40 days to train high-quality policies for a football environment. In this paper, we hypothesize that better exploration mechanisms can improve the sample efficiency of multi-agent methods. We propose two different approaches for better exploration in TiZero: a self-supervised intrinsic reward and a random network distillation bonus. Additionally, we introduce architectural modifications to the original algorithm to enhance TiZero’s computational efficiency. We evaluate the sample efficiency of these approaches through extensive experiments. Our results show that random network distillation improves training sample efficiency by 18.8% compared to the original TiZero. Furthermore, we evaluate the qualitative behavior of the models produced by both variants against a heuristic AI, with the self-supervised reward encouraging possession and random network distillation leading to a more offensive performance. Our results highlight the applicability of our random network distillation variant in practical settings. Lastly, due to the nature of the proposed method, we acknowledge its use beyond football simulation, especially in environments with strong multi-agent and strategic aspects.

[LG-22] Linear-Size Neural Network Representation of Piecewise Affine Functions in $\mathbb{R}^2$

链接: https://arxiv.org/abs/2503.13001
作者: Leo Zanotti
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Metric Geometry (math.MG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:It is shown that any continuous piecewise affine (CPA) function $\mathbb{R}^2\to\mathbb{R}$ with $p$ pieces can be represented by a ReLU neural network with two hidden layers and $O(p)$ neurons. Unlike prior work, which focused on convex pieces, this analysis considers CPA functions with connected but potentially non-convex pieces.

[LG-23] Enhancing Job Salary Prediction with Disentangled Composition Effect Modeling: A Neural Prototyping Approach

链接: https://arxiv.org/abs/2503.12978
作者: Yang Ji,Ying Sun,Hengshu Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the era of the knowledge economy, understanding how job skills influence salary is crucial for promoting recruitment with competitive salary systems and aligned salary expectations. Despite efforts on salary prediction based on job positions and talent demographics, methods that effectively discern the intricate composition effect of set-structured skills on job salary are still lacking. While recent advances in neural networks have significantly improved the accuracy of set-based quantitative modeling, their lack of explainability hinders obtaining insights into the skills’ composition effects. Indeed, model explanation for set data is challenging due to the combinatorial nature, rich semantics, and unique format. To this end, in this paper, we propose a novel intrinsically explainable set-based neural prototyping approach, namely LGDESetNet, for explainable salary prediction that can reveal disentangled skill sets that impact salary from both local and global perspectives. Specifically, we propose a skill graph-enhanced disentangled discrete subset selection layer to identify multi-faceted influential input subsets with varied semantics. Furthermore, we propose a set-oriented prototype learning method to extract globally influential prototypical sets. The resulting output is transparently derived from the semantic interplay between these input subsets and global prototypes. Extensive experiments on four real-world datasets demonstrate that our method outperforms state-of-the-art baselines in salary prediction while providing explainable insights into salary-influencing patterns.

[LG-24] Optimal Denoising in Score-Based Generative Models: The Role of Data Regularity

链接: https://arxiv.org/abs/2503.12966
作者: Eliot Beyler(SIERRA),Francis Bach(SIERRA)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Score-based generative models achieve state-of-the-art sampling performance by denoising a distribution perturbed by Gaussian noise. In this paper, we focus on a single deterministic denoising step, and compare the optimal denoiser for the quadratic loss, which we name “full-denoising”, to the alternative “half-denoising” introduced by Hyvärinen (2024). We show that looking at the performances in terms of distance between distributions tells a more nuanced story, with different assumptions on the data leading to very different conclusions. We prove that half-denoising is better than full-denoising for regular enough densities, while full-denoising is better for singular densities such as mixtures of Dirac measures or densities supported on a low-dimensional subspace. In the latter case, we prove that full-denoising can alleviate the curse of dimensionality under a linear manifold hypothesis.

[LG-25] Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs ICLR2025

链接: https://arxiv.org/abs/2503.12932
作者: Wei Hung,Shao-Hua Sun,Ping-Chun Hsieh
类目: Machine Learning (cs.LG)
*备注: 23 pages, 14 figures. Accepted at ICLR 2025

点击查看摘要

Abstract:Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which includes additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentivizes the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods. We have made the source code publicly available to encourage further research in this direction.
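
Modification (i) above can be illustrated with a minimal acceptance-rejection sketch: an unconstrained policy acts as the proposal distribution, and samples are redrawn until one lands in the feasible set. This is not the authors' implementation; the Gaussian proposal, the unit-ball constraint, and every function name here are hypothetical choices made only for illustration.

```python
import random


def sample_feasible_action(propose, is_feasible, max_tries=1000):
    """Draw from the proposal until a feasible action is accepted.

    Returns (action, rejected_count). Accepted actions follow the proposal
    distribution restricted to the feasible set and renormalized.
    """
    for rejected in range(max_tries):
        action = propose()
        if is_feasible(action):
            return action, rejected
    raise RuntimeError("no feasible action found within max_tries")


def gaussian_policy(mean, std, rng):
    """Hypothetical unconstrained policy: independent Gaussian per dimension."""
    return lambda: [rng.gauss(m, std) for m in mean]


def in_unit_ball(action):
    """Hypothetical action constraint: Euclidean norm at most 1."""
    return sum(a * a for a in action) <= 1.0


rng = random.Random(0)
action, rejected = sample_feasible_action(
    gaussian_policy([0.8, 0.0], 0.5, rng), in_unit_ball
)
```

The rejection count is returned because a low acceptance rate makes sampling slow; a quantity like it could plausibly drive the penalty signal described in modification (ii), although the abstract does not specify how that penalty is computed.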

[LG-26] Augmented Invertible Koopman Autoencoder for long-term time series forecasting

链接: https://arxiv.org/abs/2503.12930
作者: Anthony Frion(Lab-STICC_OSE, IMT Atlantique - MEE),Lucas Drumetz(IMT Atlantique - MEE, Lab-STICC_OSE),Mauro Dalla Mura(GIPSA-SIGMAPHY),Guillaume Tochon(GIPSA-SIGMAPHY),Abdeldjalil Aïssa-El-Bey(IMT Atlantique - MEE, Lab-STICC_COSYDE)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Following the introduction of Dynamic Mode Decomposition and its numerous extensions, many neural autoencoder-based implementations of the Koopman operator have recently been proposed. This class of methods appears to be of interest for modeling dynamical systems, either through direct long-term prediction of the evolution of the state or as a powerful embedding for downstream methods. In particular, a recent line of work has developed invertible Koopman autoencoders (IKAEs), which provide an exact reconstruction of the input state thanks to their analytically invertible encoder, based on coupling layer normalizing flow models. We identify that the conservation of the dimension imposed by the normalizing flows is a limitation for the IKAE models, and thus we propose to augment the latent state with a second, non-invertible encoder network. This results in our new model: the Augmented Invertible Koopman AutoEncoder (AIKAE). We demonstrate the relevance of the AIKAE through a series of long-term time series forecasting experiments, on satellite image time series as well as on a benchmark involving predictions based on a large lookback window of observations.

[LG-27] Lifelong Reinforcement Learning with Similarity-Driven Weighting by Large Models

链接: https://arxiv.org/abs/2503.12923
作者: Zhiyi Huang,Xiaohan Shan,Jianmin Li
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Lifelong Reinforcement Learning (LRL) holds significant potential for addressing sequential tasks, but it still faces considerable challenges. A key difficulty lies in effectively preventing catastrophic forgetting and facilitating knowledge transfer while maintaining reliable decision-making performance across subsequent tasks in dynamic environments. To tackle this, we propose a novel framework, SDW (Similarity-Driven Weighting Framework), which leverages large-language-model-generated dynamic functions to precisely control the training process. The core of SDW lies in two functions pre-generated by large models: the task similarity function and the weight computation function. The task similarity function extracts multidimensional features from task descriptions to quantify the similarities and differences between tasks in terms of states, actions, and rewards. The weight computation function dynamically generates critical training parameters based on the similarity information, including the proportion of old task data stored in the Replay Buffer and the strategy consistency weight in the loss function, enabling an adaptive balance between learning new tasks and transferring knowledge from previous tasks. By generating function code offline prior to training, rather than relying on large-model inference during the training process, the SDW framework reduces computational overhead while maintaining efficiency in sequential task scenarios. Experimental results on Atari and MiniHack sequential tasks demonstrate that SDW significantly outperforms existing lifelong reinforcement learning methods.

[LG-28] COSMOS: Continuous Simplicial Neural Networks

链接: https://arxiv.org/abs/2503.12919
作者: Aref Einizade,Dorina Thanou,Fragkiskos D. Malliaros,Jhony H. Giraldo
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures

点击查看摘要

Abstract:Simplicial complexes provide a powerful framework for modeling high-order interactions in structured data, making them particularly suitable for applications such as trajectory prediction and mesh processing. However, existing simplicial neural networks (SNNs), whether convolutional or attention-based, rely primarily on discrete filtering techniques, which can be restrictive. In contrast, partial differential equations (PDEs) on simplicial complexes offer a principled approach to capture continuous dynamics in such structures. In this work, we introduce COntinuous SiMplicial neural netwOrkS (COSMOS), a novel SNN architecture derived from PDEs on simplicial complexes. We provide theoretical and experimental justifications of COSMOS’s stability under simplicial perturbations. Furthermore, we investigate the over-smoothing phenomenon, a common issue in geometric deep learning, demonstrating that COSMOS offers better control over this effect than discrete SNNs. Our experiments on real-world datasets of ocean trajectory prediction and regression on partial deformable shapes demonstrate that COSMOS achieves competitive performance compared to state-of-the-art SNNs in complex and noisy environments.

[LG-29] Experiments with Optimal Model Trees

链接: https://arxiv.org/abs/2503.12902
作者: Sabino Francesco Roselli,Eibe Frank
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to “classic” decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

[LG-30] Early Detection of Forest Calamities in Homogeneous Stands – Deep Learning Applied to Bark-Beetle Outbreaks

链接: https://arxiv.org/abs/2503.12883
作者: Maximilian Kirsch,Jakob Wernicke,Pawan Datta,Christine Preisach
类目: Machine Learning (cs.LG)
*备注: 24 pages, 18 figures, submitted to IEEE: Journal of Selected Topics in Applied Earth Observations and Remote Sensing

点击查看摘要

Abstract:Climate change has increased the vulnerability of forests to insect-related damage, resulting in widespread forest loss in Central Europe and highlighting the need for effective, continuous monitoring systems. Remote-sensing-based forest health monitoring often relies on supervised machine learning algorithms that require labeled training data. Monitoring temporal patterns through time series analysis offers a potential alternative for earlier detection of disturbance but requires substantial storage resources. This study investigates the potential of a Deep Learning algorithm based on a Long Short Term Memory (LSTM) Autoencoder for the detection of anomalies in forest health (e.g. bark beetle outbreaks), utilizing Sentinel-2 time series data. This approach is an alternative to supervised machine learning methods, avoiding the necessity for labeled training data. Furthermore, it is more memory-efficient than other time series analysis approaches, as a robust model can be created using only a 26-week-long time series as input. In this study, we monitored pure stands of spruce in Thuringia, Germany, over a 7-year period from 2018 to the end of 2024. Our best model achieved a detection accuracy of 87% on test data and was able to detect 61% of all anomalies at a very early stage (more than a month before visible signs of forest degradation). Compared to another widely used time series break detection algorithm, BFAST (Breaks For Additive Season and Trend), our approach consistently detected a higher percentage of anomalies at an earlier stage. These findings suggest that LSTM-based Autoencoders could provide a promising, resource-efficient approach to forest health monitoring, enabling more timely responses to emerging threats.

[LG-31] Island-Based Evolutionary Computation with Diverse Surrogates and Adaptive Knowledge Transfer for High-Dimensional Data-Driven Optimization

链接: https://arxiv.org/abs/2503.12856
作者: Xian-Rong Zhang,Yue-Jiao Gong,Zhiguang Cao,Jun Zhang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 31 pages

点击查看摘要

Abstract:In recent years, there has been a growing interest in data-driven evolutionary algorithms (DDEAs) employing surrogate models to approximate the objective functions with limited data. However, current DDEAs are primarily designed for lower-dimensional problems and their performance drops significantly when applied to large-scale optimization problems (LSOPs). To address the challenge, this paper proposes an offline DDEA named DSKT-DDEA. DSKT-DDEA leverages multiple islands that utilize different data to establish diverse surrogate models, fostering diverse subpopulations and mitigating the risk of premature convergence. In the intra-island optimization phase, a semi-supervised learning method is devised to fine-tune the surrogates. It not only facilitates data augmentation, but also incorporates the distribution information gathered during the search process to align the surrogates with the evolving local landscapes. Then, in the inter-island knowledge transfer phase, the algorithm incorporates an adaptive strategy that periodically transfers individual information and evaluates the transfer effectiveness in the new environment, facilitating global optimization efficacy. Experimental results demonstrate that our algorithm is competitive with state-of-the-art DDEAs on problems with up to 1000 dimensions, while also exhibiting decent parallelism and scalability. Our DSKT-DDEA is open-source and accessible at: this https URL.

[LG-32] An Optimization Framework for Differentially Private Sparse Fine-Tuning

链接: https://arxiv.org/abs/2503.12822
作者: Mehdi Makni,Kayhan Behdin,Gabriel Afriat,Zheng Xu,Sergei Vassilvitskii,Natalia Ponomareva,Hussein Hazimeh,Rahul Mazumder
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain – most notably, the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights and keeping the rest of the weights fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse fine-tuning often used a fixed choice of trainable weights (e.g., updating only the last layer), or relied on a public model’s weights to choose the subset of weights to modify. Such a choice of weights remains suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information, while using off-the-shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy, compared to full-model private fine-tuning or existing private sparse fine-tuning approaches.

[LG-33] BLIA: Detect model memorization in binary classification model through passive Label Inference attack

链接: https://arxiv.org/abs/2503.12801
作者: Mohammad Wahiduzzaman Khan,Sheng Chen,Ilya Mironov,Leizhen Zhang,Rabib Noor
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Model memorization has implications for both the generalization capacity of machine learning models and the privacy of their training data. This paper investigates label memorization in binary classification models through two novel passive label inference attacks (BLIA). These attacks operate passively, relying solely on the outputs of pre-trained models, such as confidence scores and log-loss values, without interacting with or modifying the training process. By intentionally flipping 50% of the labels in controlled subsets, termed “canaries,” we evaluate the extent of label memorization under two conditions: models trained without label differential privacy (Label-DP) and those trained with randomized response-based Label-DP. Despite the application of varying degrees of Label-DP, the proposed attacks consistently achieve success rates exceeding 50%, surpassing the baseline of random guessing and conclusively demonstrating that models memorize training labels, even when these labels are deliberately uncorrelated with the features.
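
For readers unfamiliar with randomized-response Label-DP, the sketch below shows the textbook binary mechanism: each label is kept with probability e^ε / (1 + e^ε) and flipped otherwise. Whether the paper uses exactly this variant is an assumption, and the function names are hypothetical.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true binary label under epsilon-label-DP."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(labels, epsilon, rng):
    """Flip each 0/1 label independently with probability 1 - keep_probability."""
    p_keep = keep_probability(epsilon)
    return [y if rng.random() < p_keep else 1 - y for y in labels]


rng = random.Random(0)
noisy = randomized_response([0, 1, 1, 0, 1], epsilon=1.0, rng=rng)
```

At ε = 0 the keep probability is 0.5, so the released labels carry no information about the true ones; as ε grows the mechanism approaches releasing the labels unchanged.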

[LG-34] Asynchronous Predictive Counterfactual Regret Minimization Algorithm in Solving Extensive-Form Games

链接: https://arxiv.org/abs/2503.12770
作者: Linjian Meng,Youzhi Zhang,Zhenxing Ge,Tianpei Yang,Yang Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual Regret Minimization (CFR) algorithms are widely used to compute a Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIGs). Among them, Predictive CFR$^+$ (PCFR$^+$) is particularly powerful, achieving an exceptionally fast empirical convergence rate via the prediction in many games. However, the empirical convergence rate of PCFR$^+$ would significantly degrade if the prediction is inaccurate, leading to unstable performance on certain IIGs. To enhance the robustness of PCFR$^+$, we propose a novel variant, Asynchronous PCFR$^+$ (APCFR$^+$), which employs an adaptive asynchronization of step-sizes between the updates of implicit and explicit accumulated counterfactual regrets to mitigate the impact of the prediction inaccuracy on convergence. We present a theoretical analysis demonstrating why APCFR$^+$ can enhance the robustness. Finally, we propose a simplified version of APCFR$^+$ called Simple APCFR$^+$ (SAPCFR$^+$), which uses a fixed asynchronization of step-sizes to simplify the implementation that only needs a single-line modification of the original PCFR$^+$. Interestingly, SAPCFR$^+$ achieves a constant-factor lower theoretical regret bound than PCFR$^+$ in the worst case. Experimental results demonstrate that (i) both APCFR$^+$ and SAPCFR$^+$ outperform PCFR$^+$ in most of the tested games, as well as (ii) SAPCFR$^+$ achieves a comparable empirical convergence rate with APCFR$^+$.

[LG-35] Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection

链接: https://arxiv.org/abs/2503.12755
作者: Longfei Wei,Fang Sheng,Jianfei Zhang
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) has significantly improved medical screening accuracy, particularly in cancer detection and risk assessment. However, traditional classification metrics often fail to account for imbalanced data, varying performance across cohorts, and patient-level inconsistencies, leading to biased evaluations. We propose the Cohort-Attention Evaluation Metrics (CAT) framework to address these challenges. CAT introduces patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity. Key metrics like CATSensitivity (CATSen), CATSpecificity (CATSpe), and CATMean ensure balanced and fair evaluation across diverse populations. This approach enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.

[LG-36] Finite Samples for Shallow Neural Networks

链接: https://arxiv.org/abs/2503.12744
作者: Yu Xia,Zhiqiang Xu
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Numerical Analysis (math.NA)
*备注: 25 pages, 1 figure

点击查看摘要

Abstract:This paper investigates the ability of finite samples to identify two-layer irreducible shallow networks with various nonlinear activation functions, including rectified linear units (ReLU) and analytic functions such as the logistic sigmoid and hyperbolic tangent. An “irreducible” network is one whose function cannot be represented by another network with fewer neurons. For ReLU activation functions, we first establish necessary and sufficient conditions for determining the irreducibility of a network. Subsequently, we prove a negative result: finite samples are insufficient for definitive identification of any irreducible ReLU shallow network. Nevertheless, we demonstrate that for a given irreducible network, one can construct a finite set of sampling points that can distinguish it from other networks with the same neuron count. Conversely, for logistic sigmoid and hyperbolic tangent activation functions, we provide a positive result. We construct finite samples that enable the recovery of two-layer irreducible shallow analytic networks. To the best of our knowledge, this is the first study to investigate the exact identification of two-layer irreducible networks using finite sample function values. Our findings provide insights into the comparative performance of networks with different activation functions under limited sampling conditions.

[LG-37] In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

链接: https://arxiv.org/abs/2503.12734
作者: Jianliang He,Xintian Pan,Siyu Chen,Zhuoran Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor – one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.
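
To make the phrase "gradient descent predictor" concrete, the sketch below computes a plain one-step gradient descent prediction for in-context linear regression: starting from zero weights, one step on the squared loss over the context examples gives w = (lr / n) · Σ y_i x_i, and the prediction is w · x_query. This is only the basic, non-debiased baseline and involves no attention mechanism; the paper's debiased variant is not specified in the abstract, and the function name and learning rate are assumptions.

```python
def one_step_gd_prediction(xs, ys, x_query, lr=1.0):
    """Predict y for x_query after one GD step from w = 0.

    The loss is (1 / (2n)) * sum((w . x_i - y_i)^2), whose gradient at w = 0
    is -(1 / n) * sum(y_i * x_i), so one step gives w = (lr / n) * sum(y_i * x_i).
    """
    n = len(xs)
    d = len(x_query)
    w = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return sum(wj * qj for wj, qj in zip(w, x_query))


# Context generated by w_true = [2, 3] on the two standard basis vectors.
prediction = one_step_gd_prediction([[1.0, 0.0], [0.0, 1.0]], [2.0, 3.0], [1.0, 1.0], lr=2.0)
```

In this toy case the step size lr = 2 exactly recovers w_true because the context inputs are orthonormal; with generic covariates a single step is only an approximation.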

[LG-38] A Linearized Alternating Direction Multiplier Method for Federated Matrix Completion Problems

链接: https://arxiv.org/abs/2503.12733
作者: Patrick Hytla,Tran T. A. Nghia,Duy Nhat Phan,Andrew Rice
类目: Machine Learning (cs.LG)
*备注: 29 pages, 4 figures

点击查看摘要

Abstract:Matrix completion is fundamental for predicting missing data with a wide range of applications in personalized healthcare, e-commerce, recommendation systems, and social network analysis. Traditional matrix completion approaches typically assume centralized data storage, which raises challenges in terms of computational efficiency, scalability, and user privacy. In this paper, we address the problem of federated matrix completion, focusing on scenarios where user-specific data is distributed across multiple clients, and privacy constraints are uncompromising. Federated learning provides a promising framework to address these challenges by enabling collaborative learning across distributed datasets without sharing raw data. We propose FedMC-ADMM for solving federated matrix completion problems, a novel algorithmic framework that combines the Alternating Direction Method of Multipliers with a randomized block-coordinate strategy and alternating proximal gradient steps. Unlike existing federated approaches, FedMC-ADMM effectively handles multi-block nonconvex and nonsmooth optimization problems, allowing efficient computation while preserving user privacy. We analyze the theoretical properties of our algorithm, demonstrating subsequential convergence and establishing a convergence rate of $\mathcal{O}(K^{-1/2})$, leading to a communication complexity of $\mathcal{O}(\epsilon^{-2})$ for reaching an $\epsilon$-stationary point. This work is the first to establish these theoretical guarantees for federated matrix completion in the presence of multi-block variables. To validate our approach, we conduct extensive experiments on real-world datasets, including MovieLens 1M, 10M, and Netflix. The results demonstrate that FedMC-ADMM outperforms existing methods in terms of convergence speed and testing accuracy.
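
The step from the convergence rate to the communication complexity is a short calculation, sketched here under two assumptions not stated in the abstract: that the rate bounds some stationarity measure (written $\mathrm{gap}_K$, a hypothetical symbol) by $C K^{-1/2}$ for a constant $C > 0$, and that each of the $K$ iterations costs one communication round.

```latex
\[
  \mathrm{gap}_K \;\le\; \frac{C}{\sqrt{K}} \;\le\; \epsilon
  \quad\Longleftrightarrow\quad
  K \;\ge\; \frac{C^{2}}{\epsilon^{2}}
  \quad\Longrightarrow\quad
  K = \mathcal{O}\!\left(\epsilon^{-2}\right).
\]
```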

[LG-39] Can LLMs Formally Reason as Abstract Interpreters for Program Analysis?

链接: https://arxiv.org/abs/2503.12686
作者: Jacqueline L. Mitchell,Brian Hyeongseok Kim,Chenyu Zhou,Chao Wang
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:LLMs have demonstrated impressive capabilities in code generation and comprehension, but their potential in being able to perform program analysis in a formal, automatic manner remains under-explored. To that end, we systematically investigate whether LLMs can reason about programs using a program analysis framework called abstract interpretation. We prompt LLMs to follow two different strategies, denoted as Compositional and Fixed Point Equation, to formally reason in the style of abstract interpretation, which has never been done before to the best of our knowledge. We validate our approach using state-of-the-art LLMs on 22 challenging benchmark programs from the Software Verification Competition (SV-COMP) 2019 dataset, widely used in program analysis. Our results show that our strategies are able to elicit abstract interpretation-based reasoning in the tested models, but LLMs are susceptible to logical errors, especially while interpreting complex program structures, as well as general hallucinations. This highlights key areas for improvement in the formal reasoning capabilities of LLMs.

[LG-40] Algebraic Adversarial Attacks on Explainability Models

链接: https://arxiv.org/abs/2503.12683
作者: Lachlan Simpson,Federico Costanza,Kyle Millar,Adriel Cheng,Cheng-Chew Lim,Hong Gunn Chew
类目: Machine Learning (cs.LG); Group Theory (math.GR)
*备注:

点击查看摘要

Abstract:Classical adversarial attacks are phrased as a constrained optimisation problem. Despite the efficacy of a constrained optimisation approach to adversarial attacks, one cannot trace how an adversarial point was generated. In this work, we propose an algebraic approach to adversarial attacks and study the conditions under which one can generate adversarial examples for post-hoc explainability models. Phrasing neural networks in the framework of geometric deep learning, algebraic adversarial attacks are constructed through analysis of the symmetry groups of neural networks. Algebraic adversarial examples provide a mathematically tractable approach to adversarial examples. We validate our approach of algebraic adversarial examples on two well-known datasets and one real-world dataset.

[LG-41] Discovering uncertainty: Gaussian constitutive neural networks with correlated weights

Link: https://arxiv.org/abs/2503.12679
Authors: Jeremy A. McCulloch,Ellen Kuhl
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 1 table

Click to view abstract

Abstract:When characterizing materials, it can be important to not only predict their mechanical properties, but also to estimate the probability distribution of these properties across a set of samples. Constitutive neural networks allow for the automated discovery of constitutive models that exactly satisfy physical laws given experimental testing data, but are only capable of predicting the mean stress response. Stochastic methods treat each weight as a random variable and are capable of learning their probability distributions. Bayesian constitutive neural networks combine both methods, but their weights lack physical interpretability and we must sample each weight from a probability distribution to train or evaluate the model. Here we introduce a more interpretable network with fewer parameters, simpler training, and the potential to discover correlated weights: Gaussian constitutive neural networks. We demonstrate the performance of our new Gaussian network on biaxial testing data, and discover a sparse and interpretable four-term model with correlated weights. Importantly, the discovered distributions of material parameters across a set of samples can serve as priors to discover better constitutive models for new samples with limited data. We anticipate that Gaussian constitutive neural networks are a natural first step towards generative constitutive models informed by physical laws and parameter uncertainty.

[LG-42] RL-TIME: Reinforcement Learning-based Task Replication in Multicore Embedded Systems

Link: https://arxiv.org/abs/2503.12677
Authors: Roozbeh Siyadatzadeh,Mohsen Ansari,Muhammad Shafique,Alireza Ejlali
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:Embedded systems power many modern applications and must often meet strict reliability, real-time, thermal, and power requirements. Task replication can improve reliability by duplicating a task’s execution to handle transient and permanent faults, but blindly applying replication often leads to excessive overhead and higher temperatures. Existing design-time methods typically choose the number of replicas based on worst-case conditions, which can waste resources under normal operation. In this paper, we present RL-TIME, a reinforcement learning-based approach that dynamically decides the number of replicas according to actual system conditions. By considering both the reliability target and a core-level Thermal Safe Power (TSP) constraint at run-time, RL-TIME adapts the replication strategy to avoid unnecessary overhead and overheating. Experimental results show that, compared to state-of-the-art methods, RL-TIME reduces power consumption by 63%, increases schedulability by 53%, and respects TSP 72% more often.

[LG-43] ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

Link: https://arxiv.org/abs/2503.12668
Authors: Liangyu Wang,Jie Ren,Hang Xu,Junxiao Wang,Huanyi Xie,David E. Keyes,Di Wang
Subjects: Machine Learning (cs.LG); Performance (cs.PF)
Comments: 14 pages, 7 figures

Click to view abstract

Abstract:Fine-tuning large pre-trained LLMs generally demands extensive GPU memory. Traditional first-order optimizers like SGD encounter substantial difficulties due to increased memory requirements from storing activations and gradients during both the forward and backward phases as the model size expands. Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations. Furthermore, by leveraging CPU capabilities, it’s feasible to enhance both the memory and processing power available to a single GPU. We propose a novel framework, ZO2 (Zeroth-Order Offloading), for efficient zeroth-order fine-tuning of LLMs with only limited GPU memory. Our framework dynamically shifts model parameters between the CPU and GPU as required, optimizing computation flow and maximizing GPU usage by minimizing downtime. This integration of parameter adjustments with ZO’s double forward operations reduces unnecessary data movement, enhancing the fine-tuning efficacy. Additionally, our framework supports an innovative low-bit precision approach in AMP mode to streamline data exchanges between the CPU and GPU. Employing this approach allows us to fine-tune extraordinarily large models, such as the OPT-175B with more than 175 billion parameters, on a mere 18GB GPU–achievements beyond the reach of traditional methods. Moreover, our framework achieves these results with almost no additional time overhead and absolutely no accuracy loss compared to standard zeroth-order methods. ZO2’s code has been open-sourced in this https URL.
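
The abstract's point that zeroth-order methods need only forward operations can be illustrated with a generic two-point estimator. The sketch below is not the ZO2 framework: it has no CPU offloading or low-bit transfer, and the names `zo_gradient_step` and `quadratic` are hypothetical. It assumes the common trick of regenerating the random perturbation from a seed, so that no perturbation vector, activation, or gradient has to be stored.

```python
import random


def zo_gradient_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One zeroth-order SGD step using two forward evaluations.

    The perturbation z is regenerated from `seed` each time it is needed,
    so only the parameters themselves are kept in memory.
    """
    def perturb(scale):
        rng = random.Random(seed)  # same seed -> same z on every call
        for i in range(len(params)):
            params[i] += scale * eps * rng.gauss(0.0, 1.0)

    perturb(+1.0)                  # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(-2.0)                  # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(+1.0)                  # back to theta

    # Scalar estimate of the directional derivative along z.
    g = (loss_plus - loss_minus) / (2.0 * eps)

    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] -= lr * g * rng.gauss(0.0, 1.0)
    return params


def quadratic(theta):
    """Toy objective (an assumption for the demo), minimised at theta = 1."""
    return sum((t - 1.0) ** 2 for t in theta)


theta = [0.0, 0.0, 0.0, 0.0]
initial_loss = quadratic(theta)
for step in range(500):
    zo_gradient_step(theta, quadratic, lr=0.05, eps=1e-3, seed=step)
final_loss = quadratic(theta)
```

Each step calls `loss_fn` exactly twice and never forms a gradient vector, which is the memory property the abstract relies on; the price is a noisier update than a first-order step.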

[LG-44] TuneNSearch: a hybrid transfer learning and local search approach for solving vehicle routing problems

Link: https://arxiv.org/abs/2503.12662
Authors: Arthur Corrêa,Cristóvão Silva,Liming Xu,Alexandra Brintrup,Samuel Moniz
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This paper introduces TuneNSearch, a hybrid transfer learning and local search approach for addressing different variants of vehicle routing problems (VRP). Recently, multi-task learning has gained much attention for solving VRP variants. However, this adaptability often compromises the performance of the models. To address this challenge, we first pre-train a reinforcement learning model on the multi-depot VRP, followed by a short fine-tuning phase to adapt it to different variants. By leveraging the complexity of the multi-depot VRP, the pre-trained model learns richer node representations and gains more transferable knowledge compared to models trained on simpler routing problems, such as the traveling salesman problem. TuneNSearch employs, in the first stage, a Transformer-based architecture, augmented with a residual edge-graph attention network to capture the impact of edge distances and residual connections between layers. This architecture allows for a more precise capture of graph-structured data, improving the encoding of VRP’s features. After inference, our model is also coupled with a second stage composed of a local search algorithm, which yields substantial performance gains with minimal computational overhead added. Results show that TuneNSearch outperforms many existing state-of-the-art models trained for each VRP variant, requiring only one-fifth of the training epochs. Our approach demonstrates strong generalization, achieving high performance across different tasks, distributions and problem sizes, thus addressing a long-standing gap in the literature.

[LG-45] Realized Volatility Forecasting for New Issues and Spin-Offs using Multi-Source Transfer Learning

Link: https://arxiv.org/abs/2503.12648
Authors: Andreas Teller,Uta Pigorsch,Christian Pigorsch
Subjects: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: Submitted to the International Journal of Forecasting

Click to view abstract

Abstract:Forecasting the volatility of financial assets is essential for various financial applications. This paper addresses the challenging task of forecasting the volatility of financial assets with limited historical data, such as new issues or spin-offs, by proposing a multi-source transfer learning approach. Specifically, we exploit complementary source data of assets with a substantial historical data record by selecting source time series instances that are most similar to the limited target data of the new issue/spin-off. Based on these instances and the target data, we estimate linear and non-linear realized volatility models and compare their forecasting performance to forecasts of models trained exclusively on the target data, and models trained on the entire source and target data. The results show that our transfer learning approach outperforms the alternative models and that the integration of complementary data is also beneficial immediately after the initial trading day of the new issue/spin-off.

[LG-46] Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization

Link: https://arxiv.org/abs/2503.12645
Authors: Dmitry Kovalev
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust-region is defined in terms of the matrix spectral norm. Motivated by this observation, we provide the first theoretical analysis of the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case. In addition, we establish the convergence of the normalized SGD with momentum (Cutkosky and Mehta, 2020) in the constrained and composite setting, show that its iteration complexity of finding an $\varepsilon$-accurate solution can be improved from $\mathcal{O}(\varepsilon^{-3.5})$ to $\mathcal{O}(\varepsilon^{-3})$ under the star-convexity assumption, and obtain similar results for the Muon algorithm. Finally, our theoretical findings provide an explanation for the practical superiority of Muon compared to the Orthogonal-SGDM algorithm of Tuddenham et al. (2022).
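
The trust-region reading of gradient orthogonalization can be sketched in a few lines. For a gradient matrix with singular value decomposition G = U S V^T, the step that minimises the linearised loss over a spectral-norm ball of radius r is -r U V^T. The code below approximates U V^T with the classical cubic Newton-Schulz iteration; that choice, and the helper names `orthogonalize` and `trust_region_step`, are assumptions made for illustration and are not taken from the paper or from any Muon implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(g, iters=40):
    """Approximate the polar factor U V^T of a full-rank matrix g = U S V^T."""
    fro = sum(x * x for row in g for x in row) ** 0.5
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # where the cubic iteration s -> 1.5 s - 0.5 s^3 converges to 1.
    x = [[v / fro for v in row] for row in g]
    for _ in range(iters):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def trust_region_step(weights, grad, radius):
    """Move by -radius * U V^T, a step of spectral norm `radius`."""
    direction = orthogonalize(grad)
    return [[w - radius * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weights, direction)]


grad = [[3.0, 1.0], [0.0, 2.0]]
q = orthogonalize(grad)
gram = matmul(q, transpose(q))          # close to the identity matrix
new_weights = trust_region_step([[0.0, 0.0], [0.0, 0.0]], grad, radius=0.1)
```

Because every singular value of the step equals the radius, the update size is independent of the gradient's scale, which is what distinguishes it from a plain gradient step.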

[LG-47] Real-Time Cell Sorting with Scalable In Situ FPGA-Accelerated Deep Learning

Link: https://arxiv.org/abs/2503.12622
Authors: Khayrul Islam,Ryan F. Forelli,Jianzhong Han,Deven Bhadane,Jian Huang,Joshua C. Agar,Nhan Tran,Seda Ogrenci,Yaling Liu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Precise cell classification is essential in biomedical diagnostics and therapeutic monitoring, particularly for identifying diverse cell types involved in various diseases. Traditional cell classification methods such as flow cytometry depend on molecular labeling which is often costly, time-intensive, and can alter cell integrity. To overcome these limitations, we present a label-free machine learning framework for cell classification, designed for real-time sorting applications using bright-field microscopy images. This approach leverages a teacher-student model architecture enhanced by knowledge distillation, achieving high efficiency and scalability across different cell types. Demonstrated through a use case of classifying lymphocyte subsets, our framework accurately classifies T4, T8, and B cell types with a dataset of 80,000 preprocessed images, accessible via an open-source Python package for easy adaptation. Our teacher model attained 98% accuracy in differentiating T4 cells from B cells and 93% accuracy in zero-shot classification between T8 and B cells. Remarkably, our student model operates with only 0.02% of the teacher model’s parameters, enabling field-programmable gate array (FPGA) deployment. Our FPGA-accelerated student model achieves an ultra-low inference latency of just 14.5 μs and a complete cell detection-to-sorting trigger time of 24.7 μs, delivering 12x and 40x improvements over the previous state-of-the-art real-time cell analysis algorithm in inference and total latency, respectively, while preserving accuracy comparable to the teacher model. This framework provides a scalable, cost-effective solution for lymphocyte classification, as well as a new SOTA real-time cell sorting implementation for rapid identification of subsets using in situ deep learning on off-the-shelf computing hardware.

[LG-48] SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Link: https://arxiv.org/abs/2503.12602
Authors: Kunyang Sun,Dorian Bagni,Joseph M. Cavanagh,Yingze Wang,Jacob M. Sawyer,Andrew Gritsevskiy,Teresa Head-Gordon
Subjects: Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments:

Click to view abstract

Abstract:Generative machine learning models for small molecule drug discovery have shown immense promise, but many molecules generated by this approach are too difficult to synthesize to be worth further investigation or further development. We present a novel approach by fine-tuning Meta’s Llama3 large language models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible Enamine building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data compared to other state-of-the-art methods, and offers strong performance in bottom-up synthesis, synthesizable analog generation, and hit expansion, offering medicinal chemists a valuable tool for drug discovery developments. We find that SynLlama can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data.

[LG-49] GraphEval: A Lightweight Graph-Based LLM Framework for Idea Evaluation

Link: https://arxiv.org/abs/2503.12600
Authors: Tao Feng,Yihang Sun,Jiaxuan You
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The powerful capabilities of Large Language Models (LLMs) have led to their growing use in evaluating human-generated content, particularly in evaluating research ideas within academic settings. Existing solutions primarily rely on prompt-based LLM methods or fine-tuned lightweight language models for idea evaluation. However, these methods are often unstable and struggle to comprehend the complex semantic information embedded in the ideas, impeding their ability to perform high-quality evaluations. To address the above challenges, we propose GraphEval, a lightweight graph-based LLM framework for idea evaluation. Our insight is that a complex idea can be broken down into comprehensible viewpoint nodes using prompts from small LLMs. These viewpoint nodes can then be linked together through edges created from LLM-based relation extraction and/or BERT similarity scores. The created viewpoint-graph can be used to conveniently propagate scores across view-nodes to improve the robustness of the idea evaluations. In particular, we propose two lightweight graph-based methods for idea evaluation: (1) GraphEval-LP: a training-free label propagation algorithm that propagates evaluation scores from known view-nodes to unknown nodes; (2) GraphEval-GNN: a Graph Neural Networks (GNN) that is trained to predict the evaluation scores given the observed graph with minimal computation resources. Moreover, to overcome LLM’s limitation in objectively assessing the novelty of ideas, we further propose a novelty detection model to GraphEval-GNN to enhance its capability in judging idea novelty. Experiments on two datasets show GraphEval improves F1 scores by at least 14% with low computation and API costs. Additionally, GraphEval can effectively detect plagiarized ideas.
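
The idea of a training-free label propagation over a viewpoint graph can be sketched generically. The code below is not GraphEval-LP itself; it is a standard harmonic label propagation, written under the assumptions that edge weights stand in for positive similarity scores and that scores of labelled nodes stay fixed. The function name `propagate_scores` and the toy graph are hypothetical.

```python
def propagate_scores(edges, known, num_nodes, iters=100):
    """Spread scores from labelled nodes to unlabelled ones.

    edges: list of (u, v, weight) with positive weights (undirected).
    known: dict mapping a labelled node index to its fixed score.
    """
    neighbors = {n: [] for n in range(num_nodes)}
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    scores = [known.get(n, 0.0) for n in range(num_nodes)]
    for _ in range(iters):
        updated = list(scores)
        for n in range(num_nodes):
            if n in known or not neighbors[n]:
                continue  # labelled nodes stay clamped; isolated nodes keep their value
            total = sum(w for _, w in neighbors[n])
            # Each unlabelled node takes the weighted mean of its neighbours.
            updated[n] = sum(w * scores[m] for m, w in neighbors[n]) / total
        scores = updated
    return scores


# Toy chain 0 - 1 - 2 - 3 with the two end nodes labelled.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
scores = propagate_scores(edges, known={0: 1.0, 3: 0.0}, num_nodes=4)
```

On this chain the unlabelled nodes settle between the two labelled ends in proportion to their distance from each, which shows how a few scored view-nodes can inform the rest of the graph without any training.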

[LG-50] Focusing Robot Open-Ended Reinforcement Learning Through Users Purposes

Link: https://arxiv.org/abs/2503.12579
Authors: Emilio Cartoni,Gianluca Cioccolini,Gianluca Baldassarre
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, accepted at RLDM 2025

Click to view abstract

Abstract:Open-Ended Learning (OEL) autonomous robots can acquire new skills and knowledge through direct interaction with their environment, relying on mechanisms such as intrinsic motivations and self-generated goals to guide learning processes. OEL robots are highly relevant for applications as they can autonomously leverage acquired knowledge to perform tasks beneficial to human users in unstructured environments, addressing challenges unforeseen at design time. However, OEL robots face a significant limitation: their openness may lead them to waste time learning information that is irrelevant to tasks desired by specific users. Here, we propose a solution called ‘Purpose-Directed Open-Ended Learning’ (POEL), based on the novel concept of ‘purpose’ introduced in previous work. A purpose specifies what users want the robot to achieve. The key insight of this work is that purpose can focus OEL on learning self-generated classes of tasks that, while unknown during autonomous learning (as typical in OEL), involve objects relevant to the purpose. This concept is operationalised in a novel robot architecture capable of receiving a human purpose through speech-to-text, analysing the scene to identify objects, and using a Large Language Model to reason about which objects are purpose-relevant. These objects are then used to bias OEL exploration towards their spatial proximity and to self-generate rewards that favour interactions with them. The solution is tested in a simulated scenario where a camera-arm-gripper robot interacts freely with purpose-related and distractor objects. For the first time, the results demonstrate the potential advantages of purpose-focused OEL over state-of-the-art OEL methods, enabling robots to handle unstructured environments while steering their learning toward knowledge acquisition relevant to users.

[LG-51] Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Link: https://arxiv.org/abs/2503.12563
Authors: Yancheng Wang,Changyu Liu,Yingzhen Yang
Subjects: Machine Learning (cs.LG)
Comments: Published in Transactions on Machine Learning Research (TMLR) 2025

Click to view abstract

Abstract:Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at this https URL.

[LG-52] EmoBipedNav: Emotion-aware Social Navigation for Bipedal Robots with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2503.12538
Authors: Wei Zhu,Abirath Raju,Abdulaziz Shamsah,Anqi Wu,Seth Hutchinson,Ye Zhao
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 13 pages

Click to view abstract

Abstract:This study presents an emotion-aware navigation framework – EmoBipedNav – using deep reinforcement learning (DRL) for bipedal robots walking in socially interactive environments. The inherent locomotion constraints of bipedal robots challenge their safe maneuvering capabilities in dynamic environments. When combined with the intricacies of social environments, including pedestrian interactions and social cues, such as emotions, these challenges become even more pronounced. To address these coupled problems, we propose a two-stage pipeline that considers both bipedal locomotion constraints and complex social environments. Specifically, social navigation scenarios are represented using sequential LiDAR grid maps (LGMs), from which we extract latent features, including collision regions, emotion-related discomfort zones, social interactions, and the spatio-temporal dynamics of evolving environments. The extracted features are directly mapped to the actions of reduced-order models (ROMs) through a DRL architecture. Furthermore, the proposed framework incorporates full-order dynamics and locomotion constraints during training, effectively accounting for tracking errors and restrictions of the locomotion controller while planning the trajectory with ROMs. Comprehensive experiments demonstrate that our approach exceeds both model-based planners and DRL-based baselines. The hardware videos and open-source code are available at this https URL.

[LG-53] Time-EAPCR-T: A Universal Deep Learning Approach for Anomaly Detection in Industrial Equipment

Link: https://arxiv.org/abs/2503.12534
Authors: Huajie Liang,Di Wang,Yuchao Lu,Mengke Song,Lei Liu,Ling An,Ying Liang,Xingjie Ma,Zhenyu Zhang,Chichun Zhou
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With the advancement of Industry 4.0, intelligent manufacturing extensively employs sensors for real-time multidimensional data collection, playing a crucial role in equipment monitoring, process optimisation, and efficiency enhancement. Industrial data exhibit characteristics such as multi-source heterogeneity, nonlinearity, strong coupling, and temporal interactions, while also being affected by noise interference. These complexities make it challenging for traditional anomaly detection methods to extract key features, impacting detection accuracy and stability. Traditional machine learning approaches often struggle with such complex data due to limitations in processing capacity and generalisation ability, making them inadequate for practical applications. While deep learning feature extraction modules have demonstrated remarkable performance in image and text processing, they remain ineffective when applied to multi-source heterogeneous industrial data lacking explicit correlations. Moreover, existing multi-source heterogeneous data processing techniques still rely on dimensionality reduction and feature selection, which can lead to information loss and difficulty in capturing high-order interactions. To address these challenges, this study applies the EAPCR and Time-EAPCR models proposed in previous research and introduces a new model, Time-EAPCR-T, where Transformer replaces the LSTM module in the time-series processing component of Time-EAPCR. This modification effectively addresses multi-source data heterogeneity, facilitates efficient multi-source feature fusion, and enhances the temporal feature extraction capabilities of multi-source industrial data. Experimental results demonstrate that the proposed method outperforms existing approaches across four industrial datasets, highlighting its broad application potential.

[LG-54] Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Link: https://arxiv.org/abs/2503.12533
Authors: Haoqi Yuan,Yu Bai,Yuhui Fu,Bohan Zhou,Yicheng Feng,Xinrun Xu,Yi Zhan,Börje F. Karlsson,Zongqing Lu
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Building autonomous robotic agents capable of achieving human-level performance in real-world embodied tasks is an ultimate goal in humanoid robot research. Recent advances have made significant progress in high-level cognition with Foundation Models (FMs) and low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency due to compounding errors in long-horizon tasks and the varied latency of different modules. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM’s embodied capabilities by translating language-based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components, except the FM, deployable on low-cost onboard computation devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0’s effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks. For further details and videos, visit this https URL.

[LG-55] Convergence Analysis of alpha-SVRG under Strong Convexity ICASSP2025

Link: https://arxiv.org/abs/2503.12454
Authors: Sean Xiao,Sangwoo Park,Stefan Vlaski
Subjects: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: ICASSP 2025

Click to view abstract

Abstract:Stochastic first-order methods for empirical risk minimization employ gradient approximations based on sampled data in lieu of exact gradients. Such constructions introduce noise into the learning dynamics, which can be corrected through variance-reduction techniques. There is increasing evidence in the literature that in many modern learning applications noise can have a beneficial effect on optimization and generalization. To this end, the recently proposed variance-reduction technique, alpha-SVRG [Yin et al., 2023] allows for fine-grained control of the level of residual noise in the learning dynamics, and has been reported to empirically outperform both SGD and SVRG in modern deep learning scenarios. By focusing on strongly convex environments, we first provide a unified convergence rate expression for alpha-SVRG under fixed learning rate, which reduces to that of either SGD or SVRG by setting alpha=0 or alpha=1, respectively. We show that alpha-SVRG has faster convergence rate compared to SGD and SVRG under suitable choice of alpha. Simulation results on linear regression validate our theory.
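
The role of alpha as a dial between SGD and SVRG can be made concrete with a small sketch. It assumes the usual form of the alpha-SVRG update, in which the sampled gradient at the current point is corrected by alpha times the difference between the sampled and full gradients at a snapshot. The linear regression data, step size, epoch count and function names below are illustrative assumptions, not the paper's experimental setup.

```python
import random


def grad_sample(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def full_grad(w, data):
    grads = [grad_sample(w, x, y) for x, y in data]
    return [sum(col) / len(data) for col in zip(*grads)]


def mean_loss(w, data):
    return sum(0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in data) / len(data)


def alpha_svrg(data, alpha, lr=0.05, epochs=30, seed=0):
    """alpha = 0 gives plain SGD, alpha = 1 gives standard SVRG."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        snapshot = list(w)
        mu = full_grad(snapshot, data)       # full gradient at the snapshot
        for _ in range(len(data)):
            x, y = data[rng.randrange(len(data))]
            g_now = grad_sample(w, x, y)
            g_snap = grad_sample(snapshot, x, y)
            # Variance-reduction term scaled by alpha.
            w = [wi - lr * (gn - alpha * (gs - m))
                 for wi, gn, gs, m in zip(w, g_now, g_snap, mu)]
    return w


# Hypothetical synthetic regression problem.
data_rng = random.Random(1)
w_true = [2.0, -1.0, 0.5]
data = []
for _ in range(50):
    x = [data_rng.uniform(-1.0, 1.0) for _ in w_true]
    y = sum(wt * xi for wt, xi in zip(w_true, x)) + data_rng.gauss(0.0, 0.1)
    data.append((x, y))

initial_loss = mean_loss([0.0, 0.0, 0.0], data)
final_losses = {a: mean_loss(alpha_svrg(data, a), data) for a in (0.0, 0.5, 1.0)}
```

Comparing the entries of `final_losses` across values of alpha is one simple way to explore the trade-off the abstract describes; which alpha does best will depend on the data and the step size.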

[LG-56] XAI-Driven Client Selection for Federated Learning in Scalable 6G Network Slicing

Link: https://arxiv.org/abs/2503.12435
Authors: Martino Chiarani,Swastika Roy,Christos Verikoukis,Fabrizio Granelli
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 8 pages, 7 Figures

Click to view abstract

Abstract:In recent years, network slicing has embraced artificial intelligence (AI) models to manage the growing complexity of communication networks. In such a situation, AI-driven zero-touch network automation should present a high degree of flexibility and viability, especially when deployed in live production networks. However, centralized controllers suffer from high data communication overhead due to the vast amount of user data, and most network slices are reluctant to share private data. In federated learning systems, selecting trustworthy clients to participate in training is critical for ensuring system performance and reliability. The present paper proposes a new approach to client selection by leveraging an XAI method to guarantee scalable and fast operation of federated learning based analytic engines that implement slice-level resource provisioning at the RAN-Edge in a non-IID scenario. Attributions from XAI are used to guide the selection of devices participating in training. This approach enhances network trustworthiness for users and addresses the black-box nature of neural network models. The simulations conducted outperformed the standard approach in terms of both convergence time and computational cost, while also demonstrating high scalability.

[LG-57] TERL: Large-Scale Multi-Target Encirclement Using Transformer-Enhanced Reinforcement Learning IROS

Link: https://arxiv.org/abs/2503.12395
Authors: Heng Zhang,Guoxiang Zhao,Xiaoqiang Ren
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: This paper is currently under review at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Click to view abstract

Abstract:Pursuit-evasion (PE) problem is a critical challenge in multi-robot systems (MRS). While reinforcement learning (RL) has shown its promise in addressing PE tasks, research has primarily focused on single-target pursuit, with limited exploration of multi-target encirclement, particularly in large-scale settings. This paper proposes a Transformer-Enhanced Reinforcement Learning (TERL) framework for large-scale multi-target encirclement. By integrating a transformer-based policy network with target selection, TERL enables robots to adaptively prioritize targets and safely coordinate robots. Results show that TERL outperforms existing RL-based methods in terms of encirclement success rate and task completion time, while maintaining good performance in large-scale scenarios. Notably, TERL, trained on small-scale scenarios (15 pursuers, 4 targets), generalizes effectively to large-scale settings (80 pursuers, 20 targets) without retraining, achieving a 100% success rate.

[LG-58] GCBLANE: A graph-enhanced convolutional BiLSTM attention network for improved transcription factor binding site prediction

Link: https://arxiv.org/abs/2503.12377
Authors: Jonas Chris Ferrao,Dickson Dias,Sweta Morajkar,Manisha Gokuldas Fal Dessai
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract

Abstract:Identifying transcription factor binding sites (TFBS) is crucial for understanding gene regulation, as these sites enable transcription factors (TFs) to bind to DNA and modulate gene expression. Despite advances in high-throughput sequencing, accurately identifying TFBS remains challenging due to the vast genomic data and complex binding patterns. GCBLANE, a graph-enhanced convolutional bidirectional Long Short-Term Memory (LSTM) attention network, is introduced to address this issue. It integrates convolutional, multi-head attention, and recurrent layers with a graph neural network to detect key features for TFBS prediction. On 690 ENCODE ChIP-Seq datasets, GCBLANE achieved an average AUC of 0.943, and on 165 ENCODE datasets, it reached an AUC of 0.9495, outperforming advanced models that utilize multimodal approaches, including DNA shape information. This result underscores GCBLANE’s effectiveness compared to other methods. By combining graph-based learning with sequence analysis, GCBLANE significantly advances TFBS prediction.

[LG-59] Integrating mobile and fixed monitoring data for high-resolution PM2.5 mapping using machine learning

Link: https://arxiv.org/abs/2503.12367
Authors: Rui Xu,Dawen Yao,Yuzhuang Pian,Ruhui Cao,Yixin Fu,Xinru Yang,Ting Gan,Yonghong Liu
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Click to view abstract

Abstract:Constructing high resolution air pollution maps at lower cost is crucial for sustainable city management and public health risk assessment. However, traditional fixed-site monitoring lacks spatial coverage, while mobile low-cost sensors exhibit significant data instability. This study integrates PM2.5 data from 320 taxi-mounted mobile low-cost sensors and 52 fixed monitoring stations to address these limitations. By employing the machine learning methods, an appropriate mapping relationship was established between fixed and mobile monitoring concentration. The resulting pollution maps achieved 500-meter spatial and 5-minute temporal resolutions, showing close alignment with fixed monitoring data (+4.35% bias) but significant deviation from raw mobile data (-31.77%). The fused map exhibits the fine-scale spatial variability also observed in the mobile pollution map, while showing the stable temporal variability closer to that of the fixed pollution map (fixed: 1.12 ± 0.73%, mobile: 3.15 ± 2.44%, mapped: 1.01 ± 0.65%). These findings demonstrate the potential of large-scale mobile low-cost sensor networks for high-resolution air quality mapping, supporting targeted urban environmental governance and health risk mitigation.

[LG-60] ASD Classification on Dynamic Brain Connectome using Temporal Random Walk with Transformer-based Dynamic Network Embedding

Link: https://arxiv.org/abs/2503.12366
Authors: Suchanuch Piriyasatit,Chaohao Yuan,Ercan Engin Kuruoglu
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Autism Spectrum Disorder (ASD) is a complex neurological condition characterized by varied developmental impairments, especially in communication and social interaction. Accurate and early diagnosis of ASD is crucial for effective intervention, which is enhanced by richer representations of brain activity. The brain functional connectome, which refers to the statistical relationships between different brain regions measured through neuroimaging, provides crucial insights into brain function. Traditional static methods often fail to capture the dynamic nature of brain activity; in contrast, dynamic brain connectome analysis provides a more comprehensive view by capturing the temporal variations in the brain. We propose BrainTWT, a novel dynamic network embedding approach that captures the temporal evolution of brain connectivity and also considers the dynamics between different temporal network snapshots. BrainTWT employs temporal random walks to capture dynamics across different temporal network snapshots and leverages the Transformer’s ability to model long-term dependencies in sequential data to learn discriminative embeddings from these temporal sequences using temporal structure prediction tasks. The experimental evaluation, utilizing the Autism Brain Imaging Data Exchange (ABIDE) dataset, demonstrates that BrainTWT outperforms baseline methods in ASD classification.

[LG-61] Probabilistic Neural Networks (PNNs) with t-Distributed Outputs: Adaptive Prediction Intervals Beyond Gaussian Assumptions

Link: https://arxiv.org/abs/2503.12354
Authors: Farhad Pourkamali-Anaraki
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 9 Figures, 1 Table

Abstract:Traditional neural network regression models provide only point estimates, failing to capture predictive uncertainty. Probabilistic neural networks (PNNs) address this limitation by producing output distributions, enabling the construction of prediction intervals. However, the common assumption of Gaussian output distributions often results in overly wide intervals, particularly in the presence of outliers or deviations from normality. To enhance the adaptability of PNNs, we propose t-Distributed Neural Networks (TDistNNs), which generate t-distributed outputs, parameterized by location, scale, and degrees of freedom. The degrees of freedom parameter allows TDistNNs to model heavy-tailed predictive distributions, improving robustness to non-Gaussian data and enabling more adaptive uncertainty quantification. We develop a novel loss function tailored for the t-distribution and derive efficient gradient computations for seamless integration into deep learning frameworks. Empirical evaluations on synthetic and real-world data demonstrate that TDistNNs improve the balance between coverage and interval width. Notably, for identical architectures, TDistNNs consistently produce narrower prediction intervals than Gaussian-based PNNs while maintaining proper coverage. This work contributes a flexible framework for uncertainty estimation in neural networks tasked with regression, particularly suited to settings involving complex output distributions.
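As a rough illustration of why a t-distributed output can be more forgiving of outliers than a Gaussian one, the sketch below compares the two negative log-likelihoods. It uses the standard location-scale Student-t density, not the paper's own loss function (which the abstract does not spell out), and the function names `student_t_nll` and `gaussian_nll` are made up for this example.

```python
import math


def student_t_nll(y, loc, scale, df):
    """Negative log-likelihood of y under a location-scale Student-t.

    Standard textbook density; assumed here for illustration only.
    """
    if scale <= 0 or df <= 0:
        raise ValueError("scale and df must be positive")
    z = (y - loc) / scale
    # Log of the normalizing constant of the t density.
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
        - math.log(scale)
    )
    # Log of the heavy-tailed kernel (1 + z^2/df)^(-(df+1)/2).
    log_kernel = -(df + 1) / 2 * math.log1p(z * z / df)
    return -(log_norm + log_kernel)


def gaussian_nll(y, loc, scale):
    """Negative log-likelihood of y under a Gaussian, for comparison."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    z = (y - loc) / scale
    return 0.5 * math.log(2 * math.pi) + math.log(scale) + 0.5 * z * z
```

For a point ten scale units from the predicted location, the t loss with a small number of degrees of freedom grows only logarithmically in the residual while the Gaussian loss grows quadratically, so a single outlier pulls far less on the fitted scale. As the degrees of freedom grow large, the two losses coincide.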

[LG-62] EXPRESS: An LLM-Generated Explainable Property Valuation System with Neighbor Imputation

Link: https://arxiv.org/abs/2503.12344
Authors: Wei-Wei Du,Yung-Chien Wang,Wen-Chih Peng
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:The demand for property valuation has attracted significant attention from sellers, buyers, and customers applying for loans. Reviews of existing approaches have revealed two shortcomings: they cannot handle missing values, and they lack interpretability, which means they cannot be used in real-world applications. To address these challenges, we propose an LLM-Generated EXplainable PRopErty valuation SyStem with neighbor imputation called EXPRESS, which provides a customizable missing-value imputation technique and addresses the opaqueness of prediction by providing feature-wise explanations generated by an LLM. The dynamic nearest neighbor search finds similar properties for different application scenarios based on a property configuration set by users (e.g., house age as the criterion for houses in rural areas, and location for buildings in urban areas). Motivated by the human appraisal procedure, we generate feature-wise explanations to provide users with a more intuitive understanding of the prediction results.

[LG-63] Empirical Privacy Variance

Link: https://arxiv.org/abs/2503.12314
Authors: Yuzheng Hu,Fan Wu,Ruicheng Xian,Yuhang Liu,Lydia Zakynthinou,Pritish Kamath,Chiyuan Zhang,David Forsyth
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:We propose the notion of empirical privacy variance and study it in the context of differentially private fine-tuning of language models. Specifically, we show that models calibrated to the same (ε, δ)-DP guarantee using DP-SGD with different hyperparameter configurations can exhibit significant variations in empirical privacy, which we quantify through the lens of memorization. We investigate the generality of this phenomenon across multiple dimensions and discuss why it is surprising and relevant. Through regression analysis, we examine how individual and composite hyperparameters influence empirical privacy. The results reveal a no-free-lunch trade-off: existing practices of hyperparameter tuning in DP-SGD, which focus on optimizing utility under a fixed privacy budget, often come at the expense of empirical privacy. To address this, we propose refined heuristics for hyperparameter selection that explicitly account for empirical privacy, showing that they are both precise and practically useful. Finally, we take preliminary steps to understand empirical privacy variance. We propose two hypotheses, identify limitations in existing techniques like privacy auditing, and outline open questions for future research.
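To make the hyperparameters in question concrete, here is a minimal sketch of one generic DP-SGD update: clip each per-example gradient, sum, add Gaussian noise, and average. It is the textbook recipe rather than the authors' training code, and the names `clip_gradient`, `dp_sgd_step`, `clip_norm`, and `noise_multiplier` are illustrative assumptions. The sketch does not compute the resulting privacy budget, which would require a privacy accountant.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One generic DP-SGD update (illustrative; no privacy accounting here)."""
    dim = len(params)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Gaussian noise with standard deviation noise_multiplier * clip_norm
    # is added to the clipped gradient sum, coordinate by coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]
```

Batch size, learning rate, clipping norm, noise multiplier, and the number of steps all enter this loop, so many different combinations can be tuned to reach the same nominal guarantee while training quite differently.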

[LG-64] Towards Learning High-Precision Least Squares Algorithms with Sequence Models ICLR2025

Link: https://arxiv.org/abs/2503.12295
Authors: Jerry Liu,Jessica Grogan,Owen Dugan,Ashish Rao,Simran Arora,Atri Rudra,Christopher Ré
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 75 pages, 18 figures. ICLR 2025

Abstract:This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
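For readers who want the target algorithm in front of them, the sketch below runs plain gradient descent on a small least squares problem, using the iterate x ← x − η·Aᵀ(Ax − b). This is the classical numerical method the models are asked to imitate, not the paper's learned architecture. The helper names and the example matrix are assumptions chosen for illustration.

```python
def matvec(A, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def gradient_descent_least_squares(A, b, step_size, num_iters):
    """Minimize 0.5 * ||Ax - b||^2 with fixed-step gradient descent."""
    x = [0.0] * len(A[0])
    At = transpose(A)
    for _ in range(num_iters):
        residual = [r - bi for r, bi in zip(matvec(A, x), b)]
        grad = matvec(At, residual)  # gradient is A^T (Ax - b)
        x = [xi - step_size * g for xi, g in zip(x, grad)]
    return x


# A consistent system whose exact solution is x = [1, -2].
A_example = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_example = [2.0, -2.0, -1.0]
x_hat = gradient_descent_least_squares(A_example, b_example, 0.3, 300)
```

The step size must stay below 2 divided by the largest eigenvalue of AᵀA for the iteration to converge. With enough iterations the error falls to roughly floating-point level, which is the "machine precision" bar the abstract sets for learned models.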

[LG-65] Unified Modeling Language Code Generation from Diagram Images Using Multimodal Large Language Models

Link: https://arxiv.org/abs/2503.12293
Authors: Averi Bates,Ryan Vavricka,Shane Carleton,Ruosi Shao,Chongle Pan
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Number of pages: 32, Number of figures: 23, Number of tables: 7, Submitted to the Journal of Machine Learning with Applications, Author Contributions: Averi Bates: Methodology, Development, Analysis, Data Curation, Drafting, Review. Ryan Vavricka: Data Curation, Development, Review. Shane Carleton: Supervision, Funding. Ruosi Shao: Review. Chongle Pan: Supervision, Review

Abstract:The Unified Modeling Language is a standardized visual language widely used for modeling and documenting the design of software systems. Although many tools generate UML diagrams from UML code, generating executable UML code from image-based UML diagrams remains challenging. This paper proposes a new approach that automatically generates UML code using a large multimodal language model. Synthetic UML activity and sequence diagram datasets were created to train and test the model. We compared standard fine-tuning with LoRA techniques to optimize base models. The experiments measured code generation accuracy across different model sizes and training strategies. The results demonstrate that domain-adapted MM-LLMs are viable for automating UML code generation: the best model achieved BLEU and SSIM scores of 0.779 and 0.942 on sequence diagrams. This will enable the modernization of legacy systems and decrease the manual effort in software development workflows.

[LG-66] A Bubble-Cluster Federated Learning Framework for Privacy-Preserving Demand Forecasting on Heterogeneous Retail Data

Link: https://arxiv.org/abs/2503.12220
Authors: Yunbo Long,Liming Xu,Ge Zheng,Alexandra Brintrup
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Federated learning (FL) enables retailers to share model parameters for demand forecasting while maintaining privacy. However, heterogeneous data across diverse regions, driven by factors such as varying consumer behavior, poses challenges to the effectiveness of federated learning. To tackle this challenge, we propose Bubble-Cluster Federated Learning (BFL), a novel clustering-based federated learning framework tailored for sales prediction. By leveraging differential privacy and feature importance distribution, BFL groups retailers into distinct “bubbles”, each forming its own federated learning (FL) system to effectively isolate data heterogeneity. Within each bubble, Transformer models are designed to predict local sales for each client. Our experiments demonstrate that BFL significantly surpasses FedAvg and outperforms local learning in demand forecasting performance across all participating clients. Compared to local learning, BFL can achieve a 5.4% improvement in R², a 69% reduction in RMSE, and a 45% decrease in MAE. Our study highlights BFL’s adaptability in enabling effective federated learning through dynamic adjustments to noise levels and the range of clients participating in each bubble. This approach strategically groups participants into distinct “bubbles” while proactively identifying and filtering out risky clients that could compromise the FL system. The findings demonstrate BFL’s ability to enhance collaborative learning in regression tasks on heterogeneous data, achieving a balance between forecasting accuracy and privacy preservation in retail applications. Additionally, BFL’s capability to detect and neutralize poisoned data from clients enhances the system’s robustness and reliability, ensuring more secure and effective federated learning.

[LG-67] FHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation

Link: https://arxiv.org/abs/2503.12217
Authors: Mayank Kumar,Jiaqi Xue,Mengxin Zheng,Qian Lou
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 8 pages, 7 figures

Abstract:Fully Homomorphic Encryption over the torus (TFHE) enables computation on encrypted data without decryption, making it a cornerstone of secure and confidential computing. Despite its potential in privacy-preserving machine learning, secure multi-party computation, private blockchain transactions, and secure medical diagnostics, its adoption remains limited due to cryptographic complexity and usability challenges. While various TFHE libraries and compilers exist, practical code generation remains a hurdle. We propose a compiler-integrated framework to evaluate LLM inference and agentic optimization for TFHE code generation, focusing on logic gates and ReLU activation. Our methodology assesses error rates, compilability, and structural similarity across open- and closed-source LLMs. Results highlight significant limitations in off-the-shelf models, while agentic optimizations such as retrieval-augmented generation (RAG) and few-shot prompting reduce errors and enhance code fidelity. This work establishes the first benchmark for TFHE code generation, demonstrating how LLMs, when augmented with domain-specific feedback, can bridge the expertise gap in FHE code generation.

[LG-68] Cross-Modal Diffusion for Biomechanical Dynamical Systems Through Local Manifold Alignment

Link: https://arxiv.org/abs/2503.12214
Authors: Sharmita Dey,Sarath Ravindran Nair
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles (X) and ground reaction forces (Y), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of X and Y represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.

[LG-69] Multi-Agent Systems Execute Arbitrary Malicious Code

Link: https://arxiv.org/abs/2503.12188
Authors: Harold Triedman,Rishi Jha,Vitaly Shmatikov
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 30 pages, 5 figures, 8 tables

Abstract:Multi-agent systems coordinate LLM-based agents to perform tasks on users’ behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web content, files, email attachments, etc. Using several recently proposed multi-agent frameworks as concrete examples, we demonstrate that adversarial content can hijack control and communication within the system to invoke unsafe agents and functionalities. This results in a complete security breach, up to execution of arbitrary malicious code on the user’s device and/or exfiltration of sensitive data from the user’s containerized environment. We show that control-flow hijacking attacks succeed even if the individual agents are not susceptible to direct or indirect prompt injection, and even if they refuse to perform harmful actions.

[LG-70] Efficient and Privacy-Preserved Link Prediction via Condensed Graphs

Link: https://arxiv.org/abs/2503.12156
Authors: Yunbo Long,Liming Xu,Alexandra Brintrup
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Link prediction is crucial for uncovering hidden connections within complex networks, enabling applications such as identifying potential customers and products. However, this research faces significant challenges, including concerns about data privacy, as well as high computational and storage costs, especially when dealing with large-scale networks. Condensed graphs, which are much smaller than the original graphs while retaining essential information, have become an effective solution to both maintain data utility and preserve privacy. Existing methods, however, initialize synthetic graphs through random node selection without considering node connectivity, and are mainly designed for node classification tasks. As a result, their potential for privacy-preserving link prediction remains largely unexplored. We introduce HyDRO+, a graph condensation method guided by algebraic Jaccard similarity, which leverages local connectivity information to optimize condensed graph structures. Extensive experiments on four real-world networks show that our method outperforms state-of-the-art methods and even the original networks in balancing link prediction accuracy and privacy preservation. Moreover, our method achieves nearly 20x faster training and reduces storage requirements by 452x, as demonstrated on the Computers dataset, compared to link prediction on the original networks. This work represents the first attempt to leverage condensed graphs for privacy-preserving link prediction information sharing in real-world complex networks. It offers a promising pathway for preserving link prediction information while safeguarding privacy, advancing the use of graph condensation in large-scale networks with privacy concerns.
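The abstract does not define its "algebraic Jaccard similarity", so the sketch below shows only the ordinary set-based Jaccard similarity between two nodes' neighborhoods, as a reference point for what "local connectivity information" can mean. The function names and the toy edge list are assumptions for illustration, and the paper's variant may differ.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_similarity(neighbors_u, neighbors_v):
    """Standard Jaccard index: |intersection| / |union| of two neighbor sets."""
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0  # two isolated nodes share nothing
    return len(neighbors_u & neighbors_v) / len(union)


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_adjacency = build_adjacency([(0, 1), (0, 2), (1, 2), (2, 3)])
```

Nodes with many shared neighbors score close to 1, which is why neighborhood overlap is a common heuristic for both link prediction and connectivity-aware node selection.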

[LG-71] A State Alignment-Centric Approach to Federated System Identification: The FedAlign Framework

Link: https://arxiv.org/abs/2503.12137
Authors: Ertuğrul Keçeci,Müjde Güzelkaya,Tufan Kumbasar
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:This paper presents FedAlign, a Federated Learning (FL) framework particularly designed for System Identification (SYSID) tasks by aligning state representations. Local workers can learn State-Space Models (SSMs) with equivalent representations but different dynamics. We demonstrate that directly aggregating these local SSMs via FedAvg results in a global model with altered system dynamics. FedAlign overcomes this problem by employing similarity transformation matrices to align state representations of local SSMs, thereby establishing a common parameter basin that retains the dynamics of local SSMs. FedAlign computes similarity transformation matrices via two distinct approaches: FedAlign-A and FedAlign-O. In FedAlign-A, we represent the global SSM in controllable canonical form (CCF). We apply control theory to analytically derive similarity transformation matrices that convert each local SSM into this form. Yet, establishing the global SSM in CCF brings additional alignment challenges in multi-input multi-output (MIMO) SYSID, as the CCF representation is not unique, unlike in single-input single-output (SISO) SYSID. In FedAlign-O, we address these alignment challenges by reformulating the local parameter basin alignment problem as an optimization task. We determine the parameter basin of a local worker as the common parameter basin and solve least-squares problems to obtain the similarity transformation matrices needed to align the remaining local SSMs. Through experiments conducted on synthetic and real-world datasets, we show that FedAlign outperforms FedAvg, converges faster, and provides improved stability of the global SSM thanks to the efficient alignment of local parameter basins.
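The core problem can be seen on a 2x2 toy case. A similarity transformation A' = T·A·T⁻¹ leaves the eigenvalues of the state matrix unchanged, but averaging A with A' entry by entry generally does not. The sketch below is a deliberate simplification and not the FedAlign algorithm: it takes two copies of the same dynamics in different state coordinates and assumes T is known, and all names in it are hypothetical.

```python
import cmath


def matmul(X, Y):
    """Multiply two 2x2 matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(T):
    """Invert a 2x2 matrix."""
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    if det == 0:
        raise ValueError("transformation matrix must be invertible")
    return [[T[1][1] / det, -T[0][1] / det],
            [-T[1][0] / det, T[0][0] / det]]


def change_basis(A, T):
    """Apply the similarity transformation T A T^{-1}."""
    return matmul(matmul(T, A), inverse(T))


def eigenvalues(A):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = cmath.sqrt(tr * tr - 4 * det)
    return sorted([(tr - root) / 2, (tr + root) / 2],
                  key=lambda z: (z.real, z.imag))


def average(A, B):
    """Entry-wise mean of two matrices, as naive parameter averaging would do."""
    return [[(A[i][j] + B[i][j]) / 2 for j in range(2)] for i in range(2)]


A_local = [[0.5, 0.1], [0.0, 0.8]]      # state matrix with poles 0.5 and 0.8
T = [[0.0, 1.0], [1.0, 0.0]]            # swaps the two state coordinates
A_other = change_basis(A_local, T)      # same dynamics, different coordinates

naive = average(A_local, A_other)                              # poles move
aligned = average(A_local, change_basis(A_other, inverse(T)))  # poles kept
```

Here the naive average has poles at 0.6 and 0.7 instead of 0.5 and 0.8, while mapping the second model back into the first model's coordinates before averaging preserves them. In practice T is not known in advance, which is the gap the analytical (FedAlign-A) and optimization-based (FedAlign-O) routes described in the abstract aim to fill.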

[LG-72] Eval-PPO: Building an Efficient Threat Evaluator Using Proximal Policy Optimization

Link: https://arxiv.org/abs/2503.12098
Authors: Wuzhou Sun,Siyi Li,Qingxiang Zou,Zixing Liao
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 8 figures

Abstract:In various game scenarios, selecting a fixed number of targets from multiple enemy units is an extremely challenging task. This difficulty stems from the complex relationship between the threat levels of enemy units and their feature characteristics, which complicates the design of rule-based evaluators. Moreover, traditional supervised learning methods face the challenge of lacking explicit labels during training when applied to this threat evaluation problem. In this study, we redefine the threat evaluation problem as a reinforcement learning task and introduce an efficient evaluator training algorithm, Eval-PPO, based on the Proximal Policy Optimization (PPO) algorithm. Eval-PPO integrates multidimensional enemy features and the state information of friendly units through systematic training, thereby achieving precise threat assessment. Compared with rule-based methods, Eval-PPO demonstrates a significant improvement in average success rate, with an increase of 17.84%.
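For context, the clipped surrogate objective at the heart of standard PPO can be written in a few lines. The sketch below covers only that generic objective. How Eval-PPO defines its states, actions, and rewards for threat evaluation is not given in the abstract, so nothing here should be read as the authors' implementation, and `ppo_clipped_objective` is a name chosen for this example.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate: E[min(r * A, clip(r, 1-eps, 1+eps) * A)].

    r is the probability ratio between the new and old policies, computed
    from log-probabilities of the actions that were actually taken.
    """
    if not advantages:
        raise ValueError("need at least one sample")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective a pessimistic bound, so the
        # policy gains nothing from moving far outside the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

The objective is maximized during training. Clipping caps how much a single batch can change the policy, which is one reason PPO is a common choice when rewards are available but explicit labels are not.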

[LG-73] Proof-Driven Clause Learning in Neural Network Verification

Link: https://arxiv.org/abs/2503.12083
Authors: Omri Isac,Idan Refaeli,Haoze Wu,Clark Barrett,Guy Katz
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)

Abstract:The widespread adoption of deep neural networks (DNNs) requires efficient techniques for safety verification. Existing methods struggle to scale to real-world DNNs, and tremendous efforts are being put into improving their scalability. In this work, we propose an approach for improving the scalability of DNN verifiers using Conflict-Driven Clause Learning (CDCL) – an approach that has proven highly successful in SAT and SMT solving. We present a novel algorithm for deriving conflict clauses using UNSAT proofs, and propose several optimizations for expediting it. Our approach allows a modular integration of SAT solvers and DNN verifiers, and we implement it on top of an interface designed for this purpose. The evaluation of our implementation over several benchmarks suggests a 2X–3X improvement over a similar approach, with specific cases outperforming the state of the art.

[LG-74] Impact of Data Patterns on Biotype identification Using Machine Learning

Link: https://arxiv.org/abs/2503.12066
Authors: Yuetong Yu,Ruiyang Ge,Ilker Hacihaliloglu,Alexander Rauscher,Roger Tam,Sophia Frangou
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)

Abstract:Background: Patient stratification in brain disorders remains a significant challenge, despite advances in machine learning and multimodal neuroimaging. Automated machine learning algorithms have been widely applied for identifying patient subtypes (biotypes), but results have been inconsistent across studies. These inconsistencies are often attributed to algorithmic limitations, yet an overlooked factor may be the statistical properties of the input data. This study investigates the contribution of data patterns to algorithm performance by leveraging synthetic brain morphometry data as an exemplar. Methods: Four widely used algorithms (SuStaIn, HYDRA, SmileGAN, and SurrealGAN) were evaluated using multiple synthetic pseudo-patient datasets designed to include varying numbers and sizes of clusters and degrees of complexity of morphometric changes. Ground truth, representing predefined clusters, allowed for the evaluation of performance accuracy across algorithms and datasets. Results: SuStaIn failed to process datasets with more than 17 variables, highlighting computational inefficiencies. HYDRA was able to perform individual-level classification in multiple datasets with no clear pattern explaining failures. SmileGAN and SurrealGAN outperformed other algorithms in identifying variable-based disease patterns, but these patterns were not able to provide individual-level classification. Conclusions: Dataset characteristics significantly influence algorithm performance, often more than algorithmic design. The findings emphasize the need for rigorous validation using synthetic data before real-world application and highlight the limitations of current clustering approaches in capturing the heterogeneity of brain disorders. These insights extend beyond neuroimaging and have implications for machine learning applications in biomedical research.
Submission history: [v1] Sat, 15 Mar 2025 09:44:00 UTC

[LG-75] Hierarchical Reinforcement Learning for Safe Mapless Navigation with Congestion Estimation

Link: https://arxiv.org/abs/2503.12036
Authors: Jianqi Gao,Xizheng Pang,Qi Liu,Yanjie Li
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Reinforcement learning-based mapless navigation holds significant potential. However, it faces challenges in indoor environments with local-minimum areas. This paper introduces a safe mapless navigation framework utilizing hierarchical reinforcement learning (HRL) to enhance navigation through such areas. The high-level policy creates a sub-goal to direct the navigation process. Notably, we have developed a sub-goal update mechanism that considers environment congestion, efficiently avoiding the entrapment of the robot in local-minimum areas. The low-level motion planning policy, trained through safe reinforcement learning, outputs real-time control instructions based on the acquired sub-goal. Specifically, to enhance the robot’s environmental perception, we introduce a new obstacle encoding method that evaluates the impact of obstacles on the robot’s motion planning. To validate the performance of our HRL-based navigation framework, we conduct simulations in office, home, and restaurant environments. The findings demonstrate that our HRL-based navigation framework excels in both static and dynamic scenarios. Finally, we implement the HRL-based navigation framework on a TurtleBot3 robot for physical validation experiments, which demonstrate its strong generalization capabilities.

[LG-76] A Survey on Federated Fine-tuning of Large Language Models

Link: https://arxiv.org/abs/2503.12016
Authors: Yebo Wu,Chunlin Tian,Jingguang Li,He Sun,Kahou Tam,Li Li,Chengzhong Xu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, with fine-tuning playing a pivotal role in adapting them to specific downstream applications. Federated Learning (FL) offers a promising approach that enables collaborative model adaptation while ensuring data privacy, i.e., FedLLM. In this survey, we provide a systematic and thorough review of the integration of LLMs with FL. Specifically, we first trace the historical evolution of both LLMs and FL, while summarizing relevant prior surveys. We then present an in-depth analysis of the fundamental challenges encountered in deploying FedLLM. Following this, we conduct an extensive study of existing parameter-efficient fine-tuning (PEFT) methods and explore their applicability in FL. Furthermore, we introduce a comprehensive evaluation benchmark to rigorously assess FedLLM performance and discuss its diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to drive future advancements in FedLLM. We maintain an active GitHub repository tracking cutting-edge advancements. This survey serves as a foundational resource for researchers and practitioners, offering insights into the evolving landscape of federated fine-tuning for LLMs while guiding future innovations in privacy-preserving AI.

[LG-77] Mixed-feature Logistic Regression Robust to Distribution Shifts AISTATS

Link: https://arxiv.org/abs/2503.12012
Authors: Qingshi Sun,Nathan Justin,Andres Gomez,Phebe Vayanos
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments: The 28th International Conference on Artificial Intelligence and Statistics (AISTATS), 2025

Abstract:Logistic regression models are widely used in the social and behavioral sciences and in high-stakes domains, due to their simplicity and interpretability properties. At the same time, such domains are permeated by distribution shifts, where the distribution generating the data changes between training and deployment. In this paper, we study a distributionally robust logistic regression problem that seeks the model that will perform best against adversarial realizations of the data distribution drawn from a suitably constructed Wasserstein ambiguity set. Our model and solution approach differ from prior work in that we can capture settings where the likelihood of distribution shifts can vary across features, significantly broadening the applicability of our model relative to the state-of-the-art. We propose a graph-based solution approach that can be integrated into off-the-shelf optimization solvers. We evaluate the performance of our model and algorithms on numerous publicly available datasets. Our solution achieves a 408x speed-up relative to the state-of-the-art. Additionally, compared to the state-of-the-art, our model reduces average calibration error by up to 36.19% and worst-case calibration error by up to 41.70%, while increasing the average area under the ROC curve (AUC) by up to 18.02% and worst-case AUC by up to 48.37%.

[LG-78] Automation and Feature Selection Enhancement with Reinforcement Learning (RL)

Link: https://arxiv.org/abs/2503.11991
Authors: Sumana Sanyasipura Nagaraju
Categories: Machine Learning (cs.LG)

Abstract:Effective feature selection, representation and transformation are principal steps in machine learning to improve prediction accuracy, model generalization and computational efficiency. Reinforcement learning provides a new perspective on balanced exploration of the optimal feature subset using multi-agent and single-agent models. Interactive reinforcement learning integrated with decision trees improves feature knowledge, state representation and selection efficiency, while diversified teaching strategies improve both selection quality and efficiency. The state representation can be further enhanced by scanning features sequentially and using a convolutional auto-encoder. Monte Carlo-based reinforced feature selection (MCRFS), a single-agent feature selection method, reduces computational burden by incorporating early-stopping and reward-level interactive strategies. A dual-agent RL framework is also introduced that jointly selects features and instances, capturing the interactions between them. This enables the agents to navigate complex data spaces. To outperform traditional feature engineering, cascading reinforced agents are used to iteratively improve the feature space, forming a self-optimizing framework. The blend of reinforcement learning, multi-agent systems, and bandit-based approaches offers exciting paths for studying scalable and interpretable machine learning solutions to handle high-dimensional data and challenging predictive tasks.

[LG-79] Machine Learning-Based Model for Postoperative Stroke Prediction in Coronary Artery Disease

链接: https://arxiv.org/abs/2503.11973
作者: Haonan Pan,Shuheng Chen,Elham Pishgar,Kamiar Alaei,Greg Placencia,Maryam Pishgar
类目: Machine Learning (cs.LG)
*备注: 19 pages, 7 figures, submitted to PLOS One. The study employs machine learning techniques, particularly Support Vector Machines, to predict postoperative stroke risk in coronary artery disease patients undergoing revascularization. It utilizes the MIMIC-IV v3.1 database and incorporates SHapley Additive exPlanations (SHAP) analysis for model interpretation

点击查看摘要

Abstract:Coronary artery disease remains one of the leading causes of mortality globally. Despite advances in revascularization treatments like PCI and CABG, postoperative stroke is inevitable. This study aims to develop and evaluate a sophisticated machine learning prediction model to assess postoperative stroke risk in coronary revascularization patients. This research employed data from the MIMIC-IV database, consisting of a cohort of 7023 individuals. Study data included clinical, laboratory, and comorbidity variables. To reduce multicollinearity, variables with over 30% missing values and features with a correlation coefficient larger than 0.9 were deleted. The dataset was split into 70% training and 30% test sets. The Random Forest technique was used to impute the remaining missing values. Numerical values were normalized, whereas categorical variables were one-hot encoded. LASSO regularization selected features, and grid search found model hyperparameters. Finally, Logistic Regression, XGBoost, SVM, and CatBoost were employed for predictive modeling, and SHAP analysis assessed the contribution of each variable to stroke risk. With an AUC of 0.855 (0.829-0.878), the SVM model outperformed the logistic regression and CatBoost models from prior research. SHAP analysis showed that the Charlson Comorbidity Index (CCI), diabetes, chronic kidney disease, and heart failure are significant prognostic factors for postoperative stroke. This study shows that improved machine learning reduces overfitting and improves model predictive accuracy. Models using the CCI alone cannot predict postoperative stroke risk as accurately as those using independent comorbidity variables. The suggested technique provides a more thorough and individualized risk assessment by encompassing a wider range of clinically relevant characteristics, making it a better reference for preoperative risk assessments and targeted intervention.

[LG-80] Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning

链接: https://arxiv.org/abs/2503.11965
作者: Xi Wang,Hideaki Shimazaki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel framework for learning in neural networks by decomposing each neuron’s weight vector into two distinct parts, W_1 and W_2, thereby modeling contrastive information directly at the neuron level. Traditional gradient descent stores both positive (target) and negative (non-target) feature information in a single weight vector, often obscuring fine-grained distinctions. Our approach, by contrast, maintains separate updates for target and non-target features, ultimately forming a single effective weight W = W_1 - W_2 that is more robust to noise and class imbalance. Experimental results on both regression (California Housing, Wine Quality) and classification (MNIST, Fashion-MNIST, CIFAR-10) tasks suggest that this decomposition enhances generalization and resists overfitting, especially when training data are sparse or noisy. Crucially, the inference complexity remains the same as in the standard WX + \text{bias} setup, offering a practical solution for improved learning without additional inference-time overhead.
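
The claim that inference cost is unchanged can be illustrated with a small sketch: once training ends, the two parts can be folded into one effective weight. Everything below is an assumption made for illustration (the function names, the toy values, and the single-neuron setting); the paper's actual update rules for W_1 and W_2 are not reproduced here.

```python
def effective_weight(w1, w2):
    """Fold the two weight parts into a single vector W = W_1 - W_2."""
    return [a - b for a, b in zip(w1, w2)]


def neuron_output(w, x, bias):
    """Standard affine neuron: W . x + bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + bias


# Hypothetical toy values for one neuron (not taken from the paper).
w1 = [0.5, 1.0, -0.25]   # part assumed to track target features
w2 = [0.25, 0.5, 0.25]   # part assumed to track non-target features
x = [2.0, 4.0, 8.0]
bias = 1.0

# Evaluating the two parts separately ...
two_part = neuron_output(w1, x, 0.0) - neuron_output(w2, x, 0.0) + bias

# ... gives the same result as one pass with the folded weight.
w = effective_weight(w1, w2)
folded = neuron_output(w, x, bias)
```

Because the subtraction is done once, ahead of time, a deployed model under this sketch would store and apply only `w`, which is the same shape as an ordinary weight vector.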

[LG-81] Entropy-regularized Gradient Estimators for Approximate Bayesian Inference

链接: https://arxiv.org/abs/2503.11964
作者: Jasmeet Kaur
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Effective uncertainty quantification is important for training modern predictive models with limited data, enhancing both accuracy and robustness. While Bayesian methods are effective for this purpose, they can be challenging to scale. When employing approximate Bayesian inference, ensuring the quality of samples from the posterior distribution in a computationally efficient manner is essential. This paper addresses the estimation of the Bayesian posterior to generate diverse samples by approximating the gradient flow of the Kullback-Leibler (KL) divergence and the cross entropy of the target approximation under the metric induced by the Stein Operator. It presents empirical evaluations on classification tasks to assess the method’s performance and discusses its effectiveness for Model-Based Reinforcement Learning that uses uncertainty-aware network dynamics models.

[LG-82] Effective and Efficient Cross-City Traffic Knowledge Transfer: A Privacy-Preserving Perspective

链接: https://arxiv.org/abs/2503.11963
作者: Zhihao Zeng,Ziquan Fang,Yuting Huang,Lu Chen,Yunjun Gao
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Traffic prediction targets forecasting future traffic conditions using historical traffic data, serving a critical role in urban computing and transportation management. To mitigate the scarcity of traffic data while maintaining data privacy, numerous Federated Traffic Knowledge Transfer (FTT) approaches have been developed, which use transfer learning and federated learning to transfer traffic knowledge from data-rich cities to data-scarce cities, enhancing traffic prediction capabilities for the latter. However, current FTT approaches face challenges such as privacy leakage, cross-city data distribution discrepancies, low data quality, and inefficient knowledge transfer, limiting their privacy protection, effectiveness, robustness, and efficiency in real-world applications. To this end, we propose FedTT, an effective, efficient, and privacy-aware cross-city traffic knowledge transfer framework that transforms the traffic data domain from the data-rich cities and trains traffic models using the transformed data for the data-scarce cities. First, to safeguard data privacy, we propose a traffic secret transmission method that securely transmits and aggregates traffic domain-transformed data from source cities using a lightweight secret aggregation approach. Second, to mitigate the impact of traffic data distribution discrepancies on model performance, we introduce a traffic domain adapter to uniformly transform traffic data from the source cities’ domains to that of the target city. Third, to improve traffic data quality, we design a traffic view imputation method to fill in and predict missing traffic data. Finally, to enhance transfer efficiency, FedTT is equipped with a federated parallel training method that enables the simultaneous training of multiple modules. Extensive experiments using 4 real-life datasets demonstrate that FedTT outperforms the 14 state-of-the-art baselines. 

[LG-83] RePanda: Pandas-powered Tabular Verification and Reasoning

链接: https://arxiv.org/abs/2503.11921
作者: Atoosa Malemir Chegini,Keivan Rezaei,Hamid Eghbalzadeh,Soheil Feizi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fact-checking tabular data is essential for ensuring the accuracy of structured information. However, existing methods often rely on black-box models with opaque reasoning. We introduce RePanda, a structured fact verification approach that translates claims into executable pandas queries, enabling interpretable and verifiable reasoning. To train RePanda, we construct PanTabFact, a structured dataset derived from the TabFact train set, where claims are paired with executable queries generated using DeepSeek-Chat and refined through automated error correction. Fine-tuning DeepSeek-coder-7B-instruct-v1.5 on PanTabFact, RePanda achieves 84.09% accuracy on the TabFact test set. To evaluate Out-of-Distribution (OOD) generalization, we interpret question-answer pairs from WikiTableQuestions as factual claims and refer to this dataset as WikiFact. Without additional fine-tuning, RePanda achieves 84.72% accuracy on WikiFact, significantly outperforming all other baselines and demonstrating strong OOD robustness. Notably, these results closely match the zero-shot performance of DeepSeek-Chat (671B), indicating that our fine-tuning approach effectively distills structured reasoning from a much larger model into a compact, locally executable 7B model. Beyond fact verification, RePanda extends to tabular question answering by generating executable queries that retrieve precise answers. To support this, we introduce PanWiki, a dataset mapping WikiTableQuestions to pandas queries. Fine-tuning on PanWiki, RePanda achieves 75.1% accuracy in direct answer retrieval. These results highlight the effectiveness of structured execution-based reasoning for tabular verification and question answering. We have publicly released the dataset on Hugging Face at datasets/AtoosaChegini/PanTabFact. 
Submission history: [v1] Fri, 14 Mar 2025 23:12:36 UTC (186 KB), submitted by Atoosa Malemir Chegini

[LG-84] Heterogenous graph neural networks for species distribution modeling

链接: https://arxiv.org/abs/2503.11900
作者: Lauren Harrell,Christine Kaeser-Chen,Burcu Karagol Ayan,Keith Anderson,Michelangelo Conserva,Elise Kleeman,Maxim Neumann,Matt Overlan,Melissa Chapman,Drew Purves
类目: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Machine Learning (stat.ML)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Species distribution models (SDMs) are necessary for measuring and predicting occurrences and habitat suitability of species and their relationship with environmental factors. We introduce a novel presence-only SDM with graph neural networks (GNN). In our model, species and locations are treated as two distinct node sets, and the learning task is predicting detection records as the edges that connect locations to species. Using GNN for SDM allows us to model fine-grained interactions between species and the environment. We evaluate the potential of this methodology on the six-region dataset compiled by National Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For each of the regions, the heterogeneous GNN model is comparable to or outperforms previously-benchmarked single-species SDMs as well as a feed-forward neural network baseline model.

[LG-85] Spatio-temporal Fourier Transformer (StFT) for Long-term Dynamics Prediction

链接: https://arxiv.org/abs/2503.11899
作者: Da Long,Shandian Zhe,Samuel Williams,Leonid Oliker,Zhe Bai
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 16 pages, 10 figures

点击查看摘要

Abstract:Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes. Neural operators have emerged as promising models for predicting such dynamics due to their flexibility and computational efficiency. However, they often fail to effectively capture multi-scale interactions or quantify the uncertainties inherent in the predictions. These limitations lead to rapid error accumulation, particularly in long-term forecasting of systems characterized by complex and coupled dynamics. To address these challenges, we propose a spatio-temporal Fourier transformer (StFT), in which each transformer block is designed to learn dynamics at a specific scale. By leveraging a structured hierarchy of StFT blocks, the model explicitly captures dynamics across both macro- and micro- spatial scales. Furthermore, a generative residual correction mechanism is integrated to estimate and mitigate predictive uncertainties, enhancing both the accuracy and reliability of long-term forecasts. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.

[LG-86] PREAMBLE: Private and Efficient Aggregation of Block Sparse Vectors and Applications

链接: https://arxiv.org/abs/2503.11897
作者: Hilal Asi,Vitaly Feldman,Hannah Keller,Guy N. Rothblum,Kunal Talwar
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We revisit the problem of secure aggregation of high-dimensional vectors in a two-server system such as Prio. These systems are typically used to aggregate vectors such as gradients in private federated learning, where the aggregate itself is protected via noise addition to ensure differential privacy. Existing approaches require communication scaling with the dimensionality, and thus limit the dimensionality of vectors one can efficiently process in this setup. We propose PREAMBLE: Private Efficient Aggregation Mechanism for BLock-sparse Euclidean Vectors. PREAMBLE is a novel extension of distributed point functions that enables communication- and computation-efficient aggregation of block-sparse vectors, which are sparse vectors where the non-zero entries occur in a small number of clusters of consecutive coordinates. We then show that PREAMBLE can be combined with random sampling and privacy amplification by sampling results, to allow asymptotically optimal privacy-utility trade-offs for vector aggregation, at a fraction of the communication cost. When coupled with recent advances in numerical privacy accounting, our approach incurs a negligible overhead in noise variance, compared to the Gaussian mechanism used with Prio.

[LG-87] Training Diagonal Linear Networks with Stochastic Sharpness-Aware Minimization

链接: https://arxiv.org/abs/2503.11891
作者: Gabriel Clara,Sophie Langer,Johannes Schmidt-Hieber
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 54 pages, 3 figures

点击查看摘要

Abstract:We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task, with the network parameters being perturbed by small isotropic normal noise. The addition of such noise may be interpreted as a stochastic form of sharpness-aware minimization (SAM) and we prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss. In particular, the noise changes the expected gradient to force balancing of the weight matrices at a fast rate along the descent trajectory. In the diagonal linear model, we show that this equates to minimizing the average sharpness, as well as the trace of the Hessian matrix, among all possible factorizations of the same matrix. Further, the noise forces the gradient descent iterates towards a shrinkage-thresholding of the underlying true parameter, with the noise level explicitly regulating both the shrinkage factor and the threshold.
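
The link between parameter noise, the trace of the Hessian, and balancing can be seen in a scalar toy case. The derivation below is a sketch under assumptions that are not taken from the paper: a single product parameter uv, a single target y, squared loss, and independent Gaussian perturbations ξ and η with variance σ².

```latex
\begin{align*}
\ell(u, v) &= \tfrac{1}{2}(uv - y)^2, \qquad \xi, \eta \sim \mathcal{N}(0, \sigma^2) \text{ independent}, \\
\mathbb{E}\,\ell(u + \xi, v + \eta)
  &= \tfrac{1}{2}\,\mathbb{E}\bigl[(uv - y + u\eta + v\xi + \xi\eta)^2\bigr] \\
  &= \tfrac{1}{2}(uv - y)^2 + \tfrac{\sigma^2}{2}(u^2 + v^2) + \tfrac{\sigma^4}{2} \\
  &= \ell(u, v) + \tfrac{\sigma^2}{2}\operatorname{tr}\nabla^2 \ell(u, v) + \tfrac{\sigma^4}{2}.
\end{align*}
```

All cross terms vanish because ξ and η are independent with zero mean, and the diagonal Hessian entries are v² and u². For a fixed product uv = w, the penalty u² + v² is at least 2|w|, with equality exactly when |u| = |v|, so in this toy case the noise-averaged loss favors the balanced factorization.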

[LG-88] Banking on Feedback: Text Analysis of Mobile Banking iOS and Google App Reviews

链接: https://arxiv.org/abs/2503.11861
作者: Yekta Amirkhalili,Ho Yi Wong
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:The rapid growth of mobile banking (m-banking), especially after the COVID-19 pandemic, has reshaped the financial sector. This study analyzes consumer reviews of m-banking apps from five major Canadian banks, collected from Google Play and iOS App stores. Sentiment analysis and topic modeling classify reviews as positive, neutral, or negative, highlighting user preferences and areas for improvement. Data pre-processing was performed with NLTK, a Python language processing tool, and topic modeling used Latent Dirichlet Allocation (LDA). Sentiment analysis compared methods, with Long Short-Term Memory (LSTM) achieving 82% accuracy for iOS reviews and Multinomial Naive Bayes 77% for Google Play. Positive reviews praised usability, reliability, and features, while negative reviews identified login issues, glitches, and dissatisfaction with updates. This is the first study to analyze both iOS and Google Play m-banking app reviews, offering insights into app strengths and weaknesses. Findings underscore the importance of user-friendly designs, stable updates, and better customer service. Advanced text analytics provide actionable recommendations for improving user satisfaction and experience.

[LG-89] Local Pan-Privacy for Federated Analytics

链接: https://arxiv.org/abs/2503.11850
作者: Vitaly Feldman,Audra McMillan,Guy N. Rothblum,Kunal Talwar
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pan-privacy was proposed by Dwork et al. as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system’s internal state. Motivated by federated telemetry applications, we study local pan-privacy, where privacy should be retained under repeated unannounced intrusions on the local state. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the goal of providing information-theoretic differential privacy under intrusion is incompatible with collecting telemetry information. We then show that this problem can be solved in a scalable way using standard cryptographic primitives.

[LG-90] Optimization-Augmented Machine Learning for Vehicle Operations in Emergency Medical Services

链接: https://arxiv.org/abs/2503.11848
作者: Maximiliane Rautenstrauß,Maximilian Schiffer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Minimizing response times to meet legal requirements and serve patients in a timely manner is crucial for Emergency Medical Service (EMS) systems. Achieving this goal necessitates optimizing operational decision-making to efficiently manage ambulances. Against this background, we study a centrally controlled EMS system for which we learn an online ambulance dispatching and redeployment policy that aims at minimizing the mean response time of ambulances within the system by dispatching an ambulance upon receiving an emergency call and redeploying it to a waiting location upon the completion of its service. We propose a novel combinatorial optimization-augmented machine learning pipeline for learning efficient ambulance dispatching and redeployment policies. In this context, we further show how to solve the underlying full-information problem to generate training data and propose an augmentation scheme that improves our pipeline’s generalization performance by mitigating a possible distribution mismatch with respect to the considered state space. Compared to existing methods that rely on augmentation during training, our approach offers substantial runtime savings of up to 87.9% while yielding competitive performance. To evaluate the performance of our pipeline against current industry practices, we conduct a numerical case study using San Francisco’s 911 call data. Results show that the learned policies outperform the online benchmarks across various resource and demand scenarios, yielding a reduction in mean response time of up to 30%.

[LG-91] Systematic Classification of Studies Investigating Social Media Conversations about Long COVID Using a Novel Zero-Shot Transformer Framework

链接: https://arxiv.org/abs/2503.11845
作者: Nirmalya Thakur,Niven Francis Da Guia Fernandes,Madje Tobi Marc’Avent Tchona
类目: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long COVID continues to challenge public health by affecting a considerable number of individuals who have recovered from acute SARS-CoV-2 infection yet endure prolonged and often debilitating symptoms. Social media has emerged as a vital resource for those seeking real-time information, peer support, and validating their health concerns related to Long COVID. This paper examines recent works focusing on mining, analyzing, and interpreting user-generated content on social media platforms to capture the broader discourse on persistent post-COVID conditions. A novel transformer-based zero-shot learning approach serves as the foundation for classifying research papers in this area into four primary categories: Clinical or Symptom Characterization, Advanced NLP or Computational Methods, Policy Advocacy or Public Health Communication, and Online Communities and Social Support. This methodology achieved an average confidence of 0.7788, with the minimum and maximum confidence being 0.1566 and 0.9928, respectively. This model showcases the ability of advanced language models to categorize research papers without any training data or predefined classification labels, thus enabling a more rapid and scalable assessment of existing literature. This paper also highlights the multifaceted nature of Long COVID research by demonstrating how advanced computational techniques applied to social media conversations can reveal deeper insights into the experiences, symptoms, and narratives of individuals affected by Long COVID.

[LG-92] Test-Time Training Provably Improves Transformers as In-context Learners

链接: https://arxiv.org/abs/2503.11842
作者: Halil Alperen Gozeten,M. Emrullah Ildiz,Xuechen Zhang,Mahdi Soltanolkotabi,Marco Mondelli,Samet Oymak
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.
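
A single-gradient-step update of the kind the abstract analyzes can be sketched on a plain linear predictor. This is a deliberate simplification and an assumption: the paper studies linear transformers, whereas the code below only shows one gradient step on mean squared error over the in-context demonstrations. The names `one_step_ttt`, `w_pre`, `demos`, and the learning rate are hypothetical.

```python
def predict(w, x):
    """Linear prediction w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mse(w, demos):
    """Mean squared error of w on (x, y) demonstration pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in demos) / len(demos)


def one_step_ttt(w, demos, lr):
    """Take one gradient step on the demonstrations' mean squared error."""
    n = len(demos)
    grad = [0.0] * len(w)
    for x, y in demos:
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return [wj - lr * gj for wj, gj in zip(w, grad)]


# Hypothetical setup: "pretrained" weights that are misaligned with the
# test task, whose demonstrations follow y = 1*x0 + 2*x1.
w_pre = [1.0, 0.0]
demos = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]

w_ttt = one_step_ttt(w_pre, demos, lr=0.25)
loss_before = mse(w_pre, demos)
loss_after = mse(w_ttt, demos)
```

In this toy case the adapted weights fit the demonstrations better than the pretrained ones, which is the basic mechanism by which adaptation at test time can offset a mismatch between the pretraining distribution and the target task.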

[LG-93] Trust Under Siege: Label Spoofing Attacks against Machine Learning for Android Malware Detection

链接: https://arxiv.org/abs/2503.11841
作者: Tianwei Lan,Luca Demetrio,Farid Nait-Abdesselam,Yufei Han,Simone Aonzo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) malware detectors rely heavily on crowd-sourced AntiVirus (AV) labels, with platforms like VirusTotal serving as a trusted source of malware annotations. But what if attackers could manipulate these labels to classify benign software as malicious? We introduce label spoofing attacks, a new threat that contaminates crowd-sourced datasets by embedding minimal and undetectable malicious patterns into benign samples. These patterns coerce AV engines into misclassifying legitimate files as harmful, enabling poisoning attacks against ML-based malware classifiers trained on those data. We demonstrate this scenario by developing AndroVenom, a methodology for polluting realistic data sources, causing consequent poisoning attacks against ML malware detectors. Experiments show not only that state-of-the-art feature extractors are unable to filter such injections, but also that various ML models experience Denial of Service with as few as 1% poisoned samples. Additionally, attackers can flip the decisions on specific unaltered benign samples by modifying only 0.015% of the training data, threatening the reputation and market share of the targeted software, and anomaly detectors applied to the training data cannot stop such attacks. We conclude our manuscript by raising the alarm on the trustworthiness of the training process based on AV annotations, requiring further investigation on how to produce proper labels for ML malware detectors.

[LG-94] Performance Analysis of Decentralized Federated Learning Deployments

链接: https://arxiv.org/abs/2503.11828
作者: Chengyan Jiang,Jiamin Fan,Talal Halabi,Israat Haque
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:The widespread adoption of smartphones and smart wearable devices has led to the widespread use of Centralized Federated Learning (CFL) for training powerful machine learning models while preserving data privacy. However, CFL faces limitations due to its overreliance on a central server, which impacts latency and system robustness. Decentralized Federated Learning (DFL) is introduced to address these challenges. It facilitates direct collaboration among participating devices without relying on a central server. Each device can independently connect with other devices and share model parameters. This work explores crucial factors influencing the convergence and generalization capacity of DFL models, emphasizing network topologies, non-IID data distribution, and training strategies. We first derive the convergence rate of different DFL model deployment strategies. Then, we comprehensively analyze various network topologies (e.g., linear, ring, star, and mesh) with different degrees of non-IID data and evaluate them over widely adopted machine learning models (e.g., classical, deep neural networks, and Large Language Models) and real-world datasets. The results reveal that models converge to the optimal one for IID data. However, the convergence rate is inversely proportional to the degree of non-IID data distribution. Our findings will serve as valuable guidelines for designing effective DFL model deployments in practical applications.
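
The idea of devices sharing parameters directly with neighbors, with no central server, can be sketched as repeated neighbor averaging on a ring. This is only an illustrative toy: each device holds a single scalar standing in for its model parameters, the uniform averaging rule and the helper names `ring_neighbors` and `gossip_round` are assumptions, and no local training step is included.

```python
def ring_neighbors(i, n):
    """Indices of the two neighbors of device i on a ring of n devices."""
    return [(i - 1) % n, (i + 1) % n]


def gossip_round(values, neighbors_fn):
    """Each device replaces its value with the mean of itself and its neighbors."""
    n = len(values)
    updated = []
    for i in range(n):
        group = [i] + neighbors_fn(i, n)
        updated.append(sum(values[j] for j in group) / len(group))
    return updated


# Hypothetical initial "parameters" on five devices.
values = [0.0, 4.0, 8.0, 12.0, 16.0]
global_mean = sum(values) / len(values)

# Repeated local exchanges pull every device toward the global mean.
for _ in range(50):
    values = gossip_round(values, ring_neighbors)
```

Swapping `ring_neighbors` for a function that returns a different neighbor set (for example all other devices, for a mesh) changes how quickly the devices agree, which is one way to build intuition for why topology matters in such deployments.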

[LG-95] Online Assortment and Price Optimization Under Contextual Choice Models AISTATS2025

链接: https://arxiv.org/abs/2503.11819
作者: Yigit Efe Erginbas,Thomas A. Courtade,Kannan Ramchandran
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
*备注: to be published in AISTATS 2025

点击查看摘要

Abstract:We consider an assortment selection and pricing problem in which a seller has N different items available for sale. In each round, the seller observes a d-dimensional contextual preference information vector for the user, and offers to the user an assortment of K items at prices chosen by the seller. The user selects at most one of the products from the offered assortment according to a multinomial logit choice model whose parameters are unknown. The seller observes which, if any, item is chosen at the end of each round, with the goal of maximizing cumulative revenue over a selling horizon of length T. For this problem, we propose an algorithm that learns from user feedback and achieves a revenue regret of order \widetilde{O}(d \sqrt{KT} / L_0), where L_0 is the minimum price sensitivity parameter. We also obtain a lower bound of order \Omega(d \sqrt{T} / L_0) for the regret achievable by any algorithm.
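
For readers unfamiliar with the multinomial logit (MNL) choice model, the sketch below computes purchase probabilities and expected revenue for one offered assortment. The parameterization is an assumption made for illustration: each item's score is exp(utility − sensitivity × price), and a no-purchase option has score 1. In the paper the parameters are unknown and depend on the user's context; here they are fixed hypothetical numbers.

```python
import math


def mnl_choice_probs(utilities, prices, sensitivities):
    """Return (per-item purchase probabilities, no-purchase probability)."""
    scores = [
        math.exp(u - a * p)
        for u, p, a in zip(utilities, prices, sensitivities)
    ]
    denom = 1.0 + sum(scores)  # the 1.0 is the outside (no-purchase) option
    return [s / denom for s in scores], 1.0 / denom


def expected_revenue(utilities, prices, sensitivities):
    """Expected revenue of one round: sum of price times purchase probability."""
    probs, _ = mnl_choice_probs(utilities, prices, sensitivities)
    return sum(p * q for p, q in zip(prices, probs))


# Hypothetical two-item assortment.
utilities = [1.0, 0.5]
sensitivities = [0.5, 0.5]

probs_low, none_low = mnl_choice_probs(utilities, [1.0, 2.0], sensitivities)
probs_high, none_high = mnl_choice_probs(utilities, [3.0, 2.0], sensitivities)
revenue_low = expected_revenue(utilities, [1.0, 2.0], sensitivities)
```

Raising the first item's price lowers its purchase probability and shifts mass to the other item and to the no-purchase option, which is the trade-off the seller has to learn when setting prices.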

[LG-96] The Architecture and Evaluation of Bayesian Neural Networks

链接: https://arxiv.org/abs/2503.11808
作者: Alisa Sheinkman,Sara Wade
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 24 pages

点击查看摘要

Abstract:As modern neural networks get more complex, specifying a model with high predictive performance and sound uncertainty quantification becomes a more challenging task. Despite some promising theoretical results on the true posterior predictive distribution of Bayesian neural networks, the properties of even the most commonly used posterior approximations are often questioned. Computational burdens and intractable posteriors expose miscalibrated Bayesian neural networks to poor accuracy and unreliable uncertainty estimates. Approximate Bayesian inference aims to replace unknown and intractable posterior distributions with some simpler but feasible distributions. The dimensions of modern deep models coupled with the lack of identifiability make Markov chain Monte Carlo tremendously expensive and unable to fully explore the multimodal posterior. On the other hand, variational inference benefits from improved computational complexity but lacks the asymptotic guarantees of sampling-based inference and tends to concentrate around a single mode. The performance of both approaches heavily depends on architectural choices; this paper aims to shed some light on this, by considering the computational costs, accuracy and uncertainty quantification in different scenarios including large width and out-of-sample data. To improve posterior exploration, different model averaging and ensembling techniques are studied, along with their benefits on predictive performance. In our experiments, variational inference overall provided better uncertainty quantification than Markov chain Monte Carlo; further, stacking and ensembles of variational approximations provided accuracy comparable to Markov chain Monte Carlo at a much-reduced cost.

[LG-97] Diffuse-CLoC: Guided Diffusion for Physics-based Character Look-ahead Control

Link: https://arxiv.org/abs/2503.11801
Authors: Xiaoyu Huang,Takara Truong,Yunbo Zhang,Fangzhou Yu,Jean Pierre Sleiman,Jessica Hodgins,Koushil Sreenath,Farbod Farshidian
Categories: Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:We present Diffuse-CLoC, a guided diffusion framework for physics-based look-ahead control that enables intuitive, steerable, and physically realistic motion generation. While existing kinematic motion generation methods based on diffusion models offer intuitive steering capabilities with inference-time conditioning, they often fail to produce physically viable motions. In contrast, recent diffusion-based control policies have shown promise in generating physically realizable motion sequences, but the lack of kinematics prediction limits their steerability. Diffuse-CLoC addresses these challenges through a key insight: modeling the joint distribution of states and actions within a single diffusion model makes action generation steerable by conditioning it on the predicted states. This approach allows us to leverage established conditioning techniques from kinematic motion generation while producing physically realistic motions. As a result, we achieve planning capabilities without the need for a high-level planner. Our method handles a diverse set of unseen long-horizon downstream tasks through a single pre-trained model, including static and dynamic obstacle avoidance, motion in-betweening, and task-space control. Experimental results show that our method significantly outperforms the traditional hierarchical framework of high-level motion diffusion and low-level tracking.

[LG-98] Tensor Convolutional Network for Higher-Order Interaction Prediction in Sparse Tensors

Link: https://arxiv.org/abs/2503.11786
Authors: Jun-Gi Jang,Jingrui He,Andrew Margenot,Hanghang Tong
Categories: Machine Learning (cs.LG)
Comments: 11 pages

Abstract:Many real-world data, such as recommendation data and temporal graphs, can be represented as incomplete sparse tensors where most entries are unobserved. For such sparse tensors, identifying the top-k higher-order interactions that are most likely to occur among unobserved ones is crucial. Tensor factorization (TF) has gained significant attention in various tensor-based applications, serving as an effective method for finding these top-k potential interactions. However, existing TF methods primarily focus on effectively fusing latent vectors of entities, which limits their expressiveness. Since most entities in sparse tensors have only a few interactions, their latent representations are often insufficiently trained. In this paper, we propose TCN, an accurate and compatible tensor convolutional network that integrates seamlessly with existing TF methods for predicting higher-order interactions. We design a highly effective encoder to generate expressive latent vectors of entities. To achieve this, we propose to (1) construct a graph structure derived from a sparse tensor and (2) develop a relation-aware encoder, TCN, that learns latent representations of entities by leveraging the graph structure. Since TCN complements traditional TF methods, we seamlessly integrate TCN with existing TF methods, enhancing the performance of predicting top-k interactions. Extensive experiments show that TCN integrated with a TF method outperforms competitors, including TF methods and a hyperedge prediction method. Moreover, TCN is broadly compatible with various TF methods and GNNs (Graph Neural Networks), making it a versatile solution.

[LG-99] UBMF: Uncertainty-Aware Bayesian Meta-Learning Framework for Fault Diagnosis with Imbalanced Industrial Data

Link: https://arxiv.org/abs/2503.11774
Authors: Zhixuan Lian,Shangyu Li,Qixuan Huang,Zijian Huang,Haifei Liu,Jianan Qiu,Puyu Yang,Laifa Tao
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Fault diagnosis of mechanical equipment involves data collection, feature extraction, and pattern recognition but is often hindered by the imbalanced nature of industrial data, introducing significant uncertainty and reducing diagnostic reliability. To address these challenges, this study proposes the Uncertainty-Aware Bayesian Meta-Learning Framework (UBMF), which integrates four key modules: data perturbation injection for enhancing feature robustness, cross-task self-supervised feature extraction for improving transferability, uncertainty-based sample filtering for robust out-of-domain generalization, and Bayesian meta-knowledge integration for fine-grained classification. Experimental results on ten open-source datasets under various imbalanced conditions, including cross-task, small-sample, and unseen-sample scenarios, demonstrate the superiority of UBMF, achieving an average improvement of 42.22% across ten Any-way 1-5-shot diagnostic tasks. This integrated framework effectively enhances diagnostic accuracy, generalization, and adaptability, providing a reliable solution for complex industrial fault diagnosis.

[LG-100] Revisiting the Predictability of Performative Social Events

Link: https://arxiv.org/abs/2503.11713
Authors: Juan C. Perdomo
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Theoretical Economics (econ.TH); Machine Learning (stat.ML)
Comments: 20 pages

Abstract:Social predictions do not passively describe the future; they actively shape it. They inform actions and change individual expectations in ways that influence the likelihood of the predicted outcome. Given these dynamics, to what extent can social events be predicted? This question was discussed throughout the 20th century by authors like Merton, Morgenstern, Simon, and others who considered it a central issue in social science methodology. In this work, we provide a modern answer to this old problem. Using recent ideas from performative prediction and outcome indistinguishability, we establish that one can always efficiently predict social events accurately, regardless of how predictions influence data. While achievable, we also show that these predictions are often undesirable, highlighting the limitations of previous desiderata. We end with a discussion of various avenues forward.

[LG-101] Physical knowledge improves prediction of EM Fields

Link: https://arxiv.org/abs/2503.11703
Authors: Andrzej Dulny,Farzad Jabbarigargari,Andreas Hotho,Laura Maria Schreiber,Maxim Terekhov,Anna Krause
Categories: Machine Learning (cs.LG)

Abstract:We propose a 3D U-Net model to predict the spatial distribution of electromagnetic fields inside a radio-frequency (RF) coil with a subject present, using the phase, amplitude, and position of the coils, along with the density, permittivity, and conductivity of the surrounding medium as inputs. To improve accuracy, we introduce a physics-augmented variant, U-Net Phys, which incorporates Gauss’s law of magnetism into the loss function using finite differences. We train our models on electromagnetic field simulations from CST Studio Suite for an eight-channel dipole array RF coil at 7T MRI. Experimental results show that U-Net Phys significantly outperforms the standard U-Net, particularly in predicting fields within the subject, demonstrating the advantage of integrating physical constraints into deep learning-based field prediction.
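
The physics term described here, Gauss's law of magnetism ($\nabla \cdot \mathbf{B} = 0$) enforced through finite differences, can be sketched as a penalty on the divergence of a predicted field. The array layout `(3, X, Y, Z)`, the uniform grid spacing, and the name `divergence_penalty` are assumptions made for illustration; the paper's actual loss weighting and discretization are not given in the abstract.

```python
import numpy as np

def divergence_penalty(B, spacing=1.0):
    """Mean squared finite-difference divergence of a field B of shape (3, X, Y, Z)."""
    # np.gradient uses central differences inside the grid and
    # one-sided differences at the boundaries.
    dBx_dx = np.gradient(B[0], spacing, axis=0)
    dBy_dy = np.gradient(B[1], spacing, axis=1)
    dBz_dz = np.gradient(B[2], spacing, axis=2)
    div = dBx_dx + dBy_dy + dBz_dz
    return float(np.mean(div ** 2))
```

In training, a term like this would be added to the data-fitting loss, so that predictions violating the divergence-free condition are penalized.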

[LG-102] A Survey of Direct Preference Optimization

Link: https://arxiv.org/abs/2503.11701
Authors: Shunyu Liu,Wenkai Fang,Zetian Hu,Junjie Zhang,Yang Zhou,Kongcheng Zhang,Rongcheng Tu,Ting-En Lin,Fei Huang,Mingli Song,Yongbin Li,Dacheng Tao
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have demonstrated unprecedented generative capabilities, yet their alignment with human values remains critical for ensuring helpful and harmless deployments. While Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful paradigm for aligning LLMs with human preferences, its reliance on complex reward modeling introduces inherent trade-offs in computational efficiency and training stability. In this context, Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative that directly optimizes LLMs using human preferences, thereby circumventing the need for explicit reward modeling. Owing to its theoretical elegance and computational efficiency, DPO has rapidly attracted substantial research efforts exploring its various implementations and applications. However, this field currently lacks systematic organization and comparative analysis. In this survey, we conduct a comprehensive overview of DPO and introduce a novel taxonomy, categorizing previous works into four key dimensions: data strategy, learning framework, constraint mechanism, and model property. We further present a rigorous empirical analysis of DPO variants across standardized benchmarks. Additionally, we discuss real-world applications, open challenges, and future directions for DPO. This work delivers both a conceptual framework for understanding DPO and practical guidance for practitioners, aiming to advance robust and generalizable alignment paradigms. All collected resources are available and will be continuously updated at this https URL.
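
For readers new to DPO, the standard objective can be written in a few lines: it is a logistic loss on the difference of policy-versus-reference log-ratios for the preferred and rejected responses. The sketch below assumes the sequence log-probabilities have already been computed and passed in as arrays; the function name and the default `beta=0.1` are illustrative choices, and none of the variants covered by the survey are shown.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss averaged over a batch of preference pairs."""
    logp_chosen = np.asarray(logp_chosen, dtype=float)
    logp_rejected = np.asarray(logp_rejected, dtype=float)
    ref_logp_chosen = np.asarray(ref_logp_chosen, dtype=float)
    ref_logp_rejected = np.asarray(ref_logp_rejected, dtype=float)
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); logaddexp is stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

No separate reward model appears anywhere in the computation, which is the point the abstract makes about circumventing explicit reward modeling.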

[LG-103] Generalization of Video-Based Heart Rate Estimation Methods To Low Illumination and Elevated Heart Rates

Link: https://arxiv.org/abs/2503.11697
Authors: Bhargav Acharya,William Saakyan,Barbara Hammer,Hanna Drimalla
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 10 pages, 4 figures

Abstract:Heart rate is a physiological signal that provides information about an individual’s health and affective state. Remote photoplethysmography (rPPG) allows the estimation of this signal from video recordings of a person’s face. Classical rPPG methods make use of signal processing techniques, while recent rPPG methods utilize deep learning networks. Methods are typically evaluated on datasets collected in well-lit environments with participants at resting heart rates. However, little investigation has been done on how well these methods adapt to variations in illumination and heart rate. In this work, we systematically evaluate representative state-of-the-art methods for remote heart rate estimation. Specifically, we evaluate four classical methods and four deep learning-based rPPG estimation methods in terms of their generalization ability to changing scenarios, including low lighting conditions and elevated heart rates. For a thorough evaluation of existing approaches, we collected a novel dataset called CHILL, which systematically varies heart rate and lighting conditions. The dataset consists of recordings from 45 participants in four different scenarios. The video data was collected under two different lighting conditions (high and low) and normal and elevated heart rates. In addition, we selected two public datasets to conduct within- and cross-dataset evaluations of the rPPG methods. Our experimental results indicate that classical methods are not significantly impacted by low-light conditions. Meanwhile, some deep learning methods were found to be more robust to changes in lighting conditions but encountered challenges in estimating high heart rates. The cross-dataset evaluation revealed that the selected deep learning methods underperformed when influencing factors such as elevated heart rates and low lighting conditions were not present in the training set.

[LG-104] Towards Resilient and Sustainable Global Industrial Systems: An Evolutionary-Based Approach

Link: https://arxiv.org/abs/2503.11688
Authors: Václav Jirkovský,Jiří Kubalík,Petr Kadera,Arnd Schirrmann,Andreas Mitschke,Andreas Zindel
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:This paper presents a new complex optimization problem in the field of automatic design of advanced industrial systems and proposes a hybrid optimization approach to solve the problem. The problem is multi-objective as it aims at finding solutions that minimize CO2 emissions, transportation time, and costs. The optimization approach combines an evolutionary algorithm and classical mathematical programming to design resilient and sustainable global manufacturing networks. Further, it makes use of the OWL ontology for data consistency and constraint management. The experimental validation demonstrates the effectiveness of the approach in both single and double sourcing scenarios. The proposed methodology, in general, can be applied to any industry case with complex manufacturing and supply chain challenges.
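
With three objectives to minimize (CO2 emissions, transportation time, and costs), there is usually no single best design, only a set of trade-offs. One common way to express this, shown below as an assumption rather than the paper's stated method, is to keep the non-dominated (Pareto-optimal) candidates. The tuple layout `(co2, time, cost)` and the function name are hypothetical.

```python
def pareto_front(solutions):
    """Return the solutions not dominated by any other (all objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse on every objective and better on one.
        no_worse = all(x <= y for x, y in zip(a, b))
        strictly_better = any(x < y for x, y in zip(a, b))
        return no_worse and strictly_better

    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]
```

An evolutionary algorithm would typically apply a filter of this kind to each generation when ranking candidate network designs.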

[LG-105] Review of Machine Learning for Micro-Electronic Design Verification

Link: https://arxiv.org/abs/2503.11687
Authors: Christopher Bennett,Kerstin Eder
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 40 pages, 13 figures

Abstract:Microelectronic design verification remains a critical bottleneck in device development, traditionally mitigated by expanding verification teams and computational resources. Since the late 1990s, machine learning (ML) has been proposed to enhance verification efficiency, yet many techniques have not achieved mainstream adoption. This review, from the perspective of verification and ML practitioners, examines the application of ML in dynamic-based techniques for functional verification of microelectronic designs, and provides a starting point for those new to this interdisciplinary field. Historical trends, techniques, ML types, and evaluation baselines are analysed to understand why previous research has not been widely adopted in industry. The review highlights the application of ML, the techniques used and critically discusses their limitations and successes. Although there is a wealth of promising research, real-world adoption is hindered by challenges in comparing techniques, identifying suitable applications, and the expertise required for implementation. This review proposes that the field can progress through the creation and use of open datasets, common benchmarks, and verification targets. By establishing open evaluation criteria, industry can guide future research. Parallels with ML in software verification suggest potential for collaboration. Additionally, greater use of open-source designs and verification environments can allow more researchers from outside the hardware verification discipline to contribute to the challenge of verifying microelectronic designs.

[LG-106] Strain Problems got you in a Twist? Try StrainRelief: A Quantum-Accurate Tool for Ligand Strain Calculations

Link: https://arxiv.org/abs/2503.13352
Authors: Ewan R. S. Wallace,Nathan C. Frey,Joshua A. Rackers
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:Ligand strain energy, the energy difference between the bound and unbound conformations of a ligand, is an important component of structure-based small molecule drug design. A large majority of observed ligands in protein-small molecule co-crystal structures bind in low-strain conformations, making strain energy a useful filter for structure-based drug design. In this work we present a tool for calculating ligand strain with a high accuracy. StrainRelief uses a MACE Neural Network Potential (NNP), trained on a large database of Density Functional Theory (DFT) calculations to estimate ligand strain of neutral molecules with quantum accuracy. We show that this tool estimates strain energy differences relative to DFT to within 1.4 kcal/mol, more accurately than alternative NNPs. These results highlight the utility of NNPs in drug discovery, and provide a useful tool for drug discovery teams.

[LG-107] Do you understand epistemic uncertainty? Think again! Rigorous frequentist epistemic uncertainty estimation in regression

Link: https://arxiv.org/abs/2503.13317
Authors: Enrico Foglia,Benjamin Bobbia,Nikita Durasov,Michael Bauerheim,Pascal Fua,Stephane Moreau,Thierry Jardin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Quantifying model uncertainty is critical for understanding prediction reliability, yet distinguishing between aleatoric and epistemic uncertainty remains challenging. We extend recent work from classification to regression to provide a novel frequentist approach to epistemic and aleatoric uncertainty estimation. We train models to generate conditional predictions by feeding their initial output back as an additional input. This method allows for a rigorous measurement of model uncertainty by observing how prediction responses change when conditioned on the model’s previous answer. We provide a complete theoretical framework to analyze epistemic uncertainty in regression in a frequentist way, and explain how it can be exploited in practice to gauge a model’s uncertainty, with minimal changes to the original architecture.

[LG-108] Deep Hedging of Green PPAs in Electricity Markets

Link: https://arxiv.org/abs/2503.13056
Authors: Richard Biegler-König,Daniel Oeltz
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Risk Management (q-fin.RM)

Abstract:In power markets, Green Power Purchase Agreements have become an important contractual tool of the energy transition from fossil fuels to renewable sources such as wind or solar radiation. Trading Green PPAs exposes agents to price risks and weather risks. Also, developed electricity markets feature the so-called cannibalisation effect: large infeeds induce low prices and vice versa. As weather is a non-tradable entity, the question arises how to hedge and risk-manage in this highly incomplete setting. We propose a "deep hedging" framework utilising machine learning methods to construct hedging strategies. The resulting strategies outperform static and dynamic benchmark strategies with respect to different risk measures.

[LG-109] E-Values Expand the Scope of Conformal Prediction

Link: https://arxiv.org/abs/2503.13050
Authors: Etienne Gauthier,Francis Bach,Michael I. Jordan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Code available at: this https URL

Abstract:Conformal prediction is a powerful framework for distribution-free uncertainty quantification. The standard approach to conformal prediction relies on comparing the ranks of prediction scores: under exchangeability, the rank of a future test point cannot be too extreme relative to a calibration set. This rank-based method can be reformulated in terms of p-values. In this paper, we explore an alternative approach based on e-values, known as conformal e-prediction. E-values offer key advantages that cannot be achieved with p-values, enabling new theoretical and practical capabilities. In particular, we present three applications that leverage the unique strengths of e-values: batch anytime-valid conformal prediction, fixed-size conformal sets with data-dependent coverage, and conformal prediction under ambiguous ground truth. Overall, these examples demonstrate that e-value-based constructions provide a flexible expansion of the toolbox of conformal prediction.
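
The rank-based baseline that this abstract starts from can be sketched briefly: a conformal p-value compares the nonconformity score of a test point against the scores of a calibration set. The code below shows only this standard p-value construction, not the e-value method the paper proposes; the function names and the dictionary of candidate scores are illustrative assumptions.

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Fraction of scores (calibration plus the test point) at least as extreme."""
    cal = np.asarray(cal_scores, dtype=float)
    return (1 + int(np.sum(cal >= test_score))) / (len(cal) + 1)

def prediction_set(cal_scores, candidate_scores, alpha=0.1):
    """Keep candidate labels whose conformal p-value exceeds alpha."""
    return [label for label, score in candidate_scores.items()
            if conformal_p_value(cal_scores, score) > alpha]
```

Under exchangeability, a set built this way contains the true label with probability at least `1 - alpha`, which is the distribution-free guarantee mentioned above.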

[LG-110] Edgeworth Expansion for Semi-hard Triplet Loss

Link: https://arxiv.org/abs/2503.12893
Authors: Masanari Kimura
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We develop a higher-order asymptotic analysis for the semi-hard triplet loss using the Edgeworth expansion. It is known that this loss function enforces that embeddings of similar samples are close while those of dissimilar samples are separated by a specified margin. By refining the classical central limit theorem, our approach quantifies the impact of the margin parameter and the skewness of the underlying data distribution on the loss behavior. In particular, we derive explicit Edgeworth expansions that reveal first-order corrections in terms of the third cumulant, thereby characterizing non-Gaussian effects present in the distribution of distance differences between anchor-positive and anchor-negative pairs. Our findings provide detailed insight into the sensitivity of the semi-hard triplet loss to its parameters and offer guidance for choosing the margin to ensure training stability.
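
As a reminder of what the loss under analysis computes, here is a minimal sketch for a single anchor-positive pair. A negative is treated as semi-hard when it is farther from the anchor than the positive but still inside the margin. Euclidean (not squared) distance, the default margin, and the function name are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def semi_hard_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Average hinge loss over the semi-hard negatives of one anchor-positive pair."""
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(negatives - anchor, axis=1)
    # Semi-hard: d(a, p) < d(a, n) < d(a, p) + margin.
    semi_hard = (d_an > d_ap) & (d_an < d_ap + margin)
    if not semi_hard.any():
        return 0.0
    return float(np.mean(d_ap - d_an[semi_hard] + margin))
```

The quantity `d_ap - d_an` is the "distance difference" whose distribution the abstract characterizes, and the margin controls which triplets contribute.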

[LG-111] Epidemic Forecasting with a Hybrid Deep Learning Method Using CNN LSTM With WOA GWO Optimization: Global COVID-19 Case Study

Link: https://arxiv.org/abs/2503.12813
Authors: Mousa Alizadeh,Mohammad Hossein Samaei,Azam Seilsepour,Mohammad TH Beheshti
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Effective epidemic modeling is essential for managing public health crises, requiring robust methods to predict disease spread and optimize resource allocation. This study introduces a novel deep learning framework that advances time series forecasting for infectious diseases, with its application to COVID-19 data as a critical case study. Our hybrid approach integrates Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) models to capture spatial and temporal dynamics of disease transmission across diverse regions. The CNN extracts spatial features from raw epidemiological data, while the LSTM models temporal patterns, yielding precise and adaptable predictions. To maximize performance, we employ a hybrid optimization strategy combining the Whale Optimization Algorithm (WOA) and Gray Wolf Optimization (GWO) to fine-tune hyperparameters, such as learning rates, batch sizes, and training epochs, enhancing model efficiency and accuracy. Applied to COVID-19 case data from 24 countries across six continents, our method outperforms established benchmarks, including ARIMA and standalone LSTM models, with statistically significant gains in predictive accuracy (e.g., reduced RMSE). This framework demonstrates its potential as a versatile method for forecasting epidemic trends, offering insights for resource planning and decision-making in both historical contexts, like the COVID-19 pandemic, and future outbreaks.

[LG-112] Estimating stationary mass frequency by frequency

Link: https://arxiv.org/abs/2503.12808
Authors: Milind Nakul,Vidya Muthukumar,Ashwin Pananjady
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)

Abstract:Suppose we observe a trajectory of length $n$ from an $\alpha$-mixing stochastic process over a finite but potentially large state space. We consider the problem of estimating the probability mass placed by the stationary distribution of any such process on elements that occur with a certain frequency in the observed sequence. We estimate this vector of probabilities in total variation distance, showing universal consistency in $n$ and recovering known results for i.i.d. sequences as special cases. Our proposed methodology carefully combines the plug-in (or empirical) estimator with a recently-proposed modification of the Good–Turing estimator called WingIt, which was originally developed for Markovian sequences. En route to controlling the error of our estimator, we develop new performance bounds on WingIt and the plug-in estimator for $\alpha$-mixing stochastic processes. Importantly, the extensively used method of Poissonization can no longer be applied in our non-i.i.d. setting, and so we develop complementary tools – including concentration inequalities for a natural self-normalized statistic of mixing sequences – that may prove independently useful in the design and analysis of estimators for related problems.
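
The two building blocks named in this abstract have simple classical forms, sketched below for the mass of symbols seen exactly `r` times. This is the textbook plug-in estimate and the textbook Good–Turing estimate for i.i.d. data, not the WingIt modification or the paper's combined estimator; the function name is hypothetical.

```python
from collections import Counter

def frequency_mass_estimates(sequence, r):
    """Return (plug-in, Good-Turing) estimates of the mass of symbols seen r times."""
    n = len(sequence)
    counts = Counter(sequence)
    # freq_of_freq[k] is N_k, the number of distinct symbols seen exactly k times.
    freq_of_freq = Counter(counts.values())
    plug_in = r * freq_of_freq[r] / n
    good_turing = (r + 1) * freq_of_freq[r + 1] / n
    return plug_in, good_turing
```

For `r = 0` the plug-in estimate is always zero, while Good–Turing assigns the unseen symbols a mass of `N_1 / n`; this gap is why the two estimators are usually combined across frequencies.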

[LG-113] Causal Feature Learning in the Social Sciences

Link: https://arxiv.org/abs/2503.12784
Authors: Jingzhou Huang,Jiuyao Lu,Alexander Williams Tolbert
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Variable selection poses a significant challenge in causal modeling, particularly within the social sciences, where constructs often rely on inter-related factors such as age, socioeconomic status, gender, and race. Indeed, it has been argued that such attributes must be modeled as macro-level abstractions of lower-level manipulable features, in order to preserve the modularity assumption essential to causal inference. This paper accordingly extends the theoretical framework of Causal Feature Learning (CFL). Empirically, we apply the CFL algorithm to diverse social science datasets, evaluating how CFL-derived macrostates compare with traditional microstates in downstream modeling tasks.

[LG-114] Stabilization Analysis and Mode Recognition of Kerosene Supersonic Combustion: A Deep Learning Approach Based on Res-CNN-beta-VAE

Link: https://arxiv.org/abs/2503.12765
Authors: Weiming Xu,Tao Yang,Chang Liu,Kun Wu,Peng Zhang
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures

Abstract:The scramjet engine is a key propulsion system for hypersonic vehicles, leveraging supersonic airflow to achieve high specific impulse, making it a promising technology for aerospace applications. Understanding and controlling the complex interactions between fuel injection, turbulent combustion, and aerodynamic effects of compressible flows are crucial for ensuring stable combustion in scramjet engines. However, identifying stable modes in scramjet combustors is often challenging due to limited experimental measurement means and extremely complex spatiotemporal evolution of supersonic turbulent combustion. This work introduces an innovative deep learning framework that combines dimensionality reduction via the Residual Convolutional Neural Network-beta-Variational Autoencoder (Res-CNN-beta-VAE) model with unsupervised clustering (K-means) to identify and analyze dynamical combustion modes in a supersonic combustor. By mapping high-dimensional data of combustion snapshots to a reduced three-dimensional latent space, the Res-CNN-beta-VAE model captures the essential temporal and spatial features of flame behaviors and enables the observation of transitions between combustion states. By analyzing the standard deviation of latent variable trajectories, we introduce a novel method for objectively distinguishing between dynamic transitions, which provides a scalable and expert-independent alternative to traditional classification methods. Besides, the unsupervised K-means clustering approach effectively identifies the complex interplay between the cavity and the jet-wake stabilization mechanisms, offering new insights into the system’s behavior across different gas-to-liquid mass flow ratios (GLRs).

[LG-115] SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement

Link: https://arxiv.org/abs/2503.12760
Authors: Brian Cho,Ana-Roxana Pop,Ariel Evince,Nathan Kallus
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)

Abstract:To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail outcomes. To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce. In practice, however, policy classes are often large, and digital experiments tend to produce datasets with small effect sizes relative to noise. In this setting, standard approaches such as data splitting or multiple testing often result in unstable policy selection and/or insufficient statistical power. In this paper, we provide safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges. Our method enables policy learning while simultaneously providing high-confidence guarantees using the entire dataset, avoiding the need for data-splitting. We present finite-sample and asymptotic versions of our algorithm that ensure the recommended policy satisfies high-probability guarantees for avoiding guardrail regressions and/or achieving goal outcome improvements. We test both variants of our approach empirically on a real-world application of personalizing SMS delivery. Our results on real-world data suggest that our approach offers dramatic improvements in settings with large policy classes and low signal-to-noise across both finite-sample and asymptotic safety guarantees, offering up to 300% improvements in detection rates and 150% improvements in policy gains at significantly smaller sample sizes.

[LG-116] Enhancing Circuit Trainability with Selective Gate Activation Strategy

Link: https://arxiv.org/abs/2503.12738
Authors: Jeihee Cho,Junyong Lee,Daniel Justice,Shiho Kim
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:Hybrid quantum-classical computing relies heavily on Variational Quantum Algorithms (VQAs) to tackle challenges in diverse fields like quantum chemistry and machine learning. However, VQAs face a critical limitation: the balance between circuit trainability and expressibility. Trainability, the ease of optimizing circuit parameters for problem-solving, is often hampered by the Barren Plateau, where gradients vanish and hinder optimization. On the other hand, increasing expressibility, the ability to represent a wide range of quantum states, often necessitates deeper circuits with more parameters, which in turn exacerbates trainability issues. In this work, we investigate selective gate activation strategies as a potential solution to these challenges within the context of Variational Quantum Eigensolvers (VQEs). We evaluate three different approaches: activating gates randomly without considering their type or parameter magnitude, activating gates randomly but limited to a single gate type, and activating gates based on the magnitude of their parameter values. Experiment results reveal that the Magnitude-based strategy surpasses other methods, achieving improved convergence.

[LG-117] Fast filtering of non-Gaussian models using Amortized Optimal Transport Maps

Link: https://arxiv.org/abs/2503.12633
Authors: Mohammad Al-Jarrah,Bamdad Hosseini,Amirhossein Taghvaei
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 6 pages, 4 figures

Abstract:In this paper, we present the amortized optimal transport filter (A-OTF) designed to mitigate the computational burden associated with the real-time training of optimal transport filters (OTFs). OTFs can perform accurate non-Gaussian Bayesian updates in the filtering procedure, but they require training at every time step, which makes them expensive. The proposed A-OTF framework exploits the similarity between OTF maps during an initial/offline training stage in order to reduce the cost of inference during online calculations. More precisely, we use clustering algorithms to select relevant subsets of pre-trained maps whose weighted average is used to compute the A-OTF model akin to a mixture of experts. A series of numerical experiments validate that A-OTF achieves substantial computational savings during online inference while preserving the inherent flexibility and accuracy of OTF.

[LG-118] Ensemble Kalman-Bucy filtering for nonlinear model predictive control

Link: https://arxiv.org/abs/2503.12474
Authors: Sebastian Reich
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Numerical Analysis (math.NA); Computation (stat.CO)

Abstract:We consider the problem of optimal control for partially observed dynamical systems. Despite its prevalence in practical applications, there are still very few algorithms available, which take uncertainties in the current state estimates and future observations into account. In other words, most current approaches separate state estimation from the optimal control problem. In this paper, we extend the popular ensemble Kalman filter to receding horizon optimal control problems in the spirit of nonlinear model predictive control. We provide an interacting particle approximation to the forward-backward stochastic differential equations arising from Pontryagin’s maximum principle with the forward stochastic differential equation provided by the time-continuous ensemble Kalman-Bucy filter equations. The receding horizon control laws are approximated as linear and are continuously updated as in nonlinear model predictive control. We illustrate the performance of the proposed methodology for an inverted pendulum example.

[LG-119] Nonlinear Principal Component Analysis with Random Bernoulli Features for Process Monitoring

Link: https://arxiv.org/abs/2503.12456
Authors: Ke Chen, Dandan Jiang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

Abstract:The process generates substantial amounts of data with highly complex structures, leading to the development of numerous nonlinear statistical methods. However, most of these methods rely on computations involving large-scale dense kernel matrices. This dependence poses significant challenges in meeting the high computational demands and real-time responsiveness required by online monitoring systems. To alleviate the computational burden of dense large-scale matrix multiplication, we incorporate the bootstrap sampling concept into random feature mapping and propose a novel random Bernoulli principal component analysis method to efficiently capture nonlinear patterns in the process. We derive a convergence bound for the kernel matrix approximation constructed using random Bernoulli features, ensuring theoretical robustness. Subsequently, we design four fast process monitoring methods based on random Bernoulli principal component analysis to extend its nonlinear capabilities for handling diverse fault scenarios. Finally, numerical experiments and real-world data analyses are conducted to evaluate the performance of the proposed methods. Results demonstrate that the proposed methods offer excellent scalability and reduced computational complexity, achieving substantial cost savings with minimal performance loss compared to traditional kernel-based approaches.

[LG-120] Central and Central-Parietal EEG Signatures of Parkinson's Disease

Link: https://arxiv.org/abs/2503.12392
Authors: Artem Lensky
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:This study investigates EEG as a potential early biomarker by applying deep learning techniques to resting-state EEG recordings from 31 subjects (15 with PD and 16 healthy controls). EEG signals were rigorously preprocessed to remove tremor artifacts, then converted to wavelet-based images by grouping spatially adjacent electrodes into triplets for convolutional neural network (CNN) classification. Our analysis across different brain regions and frequency bands showed distinct spatial-spectral patterns of PD-related neural oscillations. We identified high classification accuracy (74%) in the gamma band (40-62.4 Hz) for central-parietal electrodes (CP1, Pz, CP2), and 76% accuracy using central electrodes (C3, Cz, C4) with full-spectrum 0.4-62.4 Hz. In particular, we observed pronounced right-hemisphere involvement, specifically in parieto-occipital regions. Unlike previous studies that achieved higher accuracies by potentially including tremor artifacts, our approach isolates genuine neurophysiological alterations in cortical activity. These findings suggest that specific EEG-based oscillatory patterns, especially central-parietal gamma activity, may provide diagnostic information for PD, potentially before the onset of motor symptoms.

[LG-121] A Comparative Study of Invariance-Aware Loss Functions for Deep Learning-based Gridless Direction-of-Arrival Estimation ICASSP2025

Link: https://arxiv.org/abs/2503.12386
Authors: Kuan-Lin Chen, Bhaskar D. Rao
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 5 pages. Accepted at ICASSP 2025

Abstract:Covariance matrix reconstruction has been the most widely used guiding objective in gridless direction-of-arrival (DoA) estimation for sparse linear arrays. Many semidefinite programming (SDP)-based methods fall under this category. Although deep learning-based approaches enable the construction of more sophisticated objective functions, most methods still rely on covariance matrix reconstruction. In this paper, we propose new loss functions that are invariant to the scaling of the matrices and provide a comparative study of losses with varying degrees of invariance. The proposed loss functions are formulated based on the scale-invariant signal-to-distortion ratio between the target matrix and the Gram matrix of the prediction. Numerical results show that a scale-invariant loss outperforms its non-invariant counterpart but is inferior to the recently proposed subspace loss that is invariant to the change of basis. These results provide evidence that designing loss functions with greater degrees of invariance is advantageous in deep learning-based gridless DoA estimation.
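The scale invariance mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard scale-invariant signal-to-distortion ratio (SI-SDR) definition applied to matrices via the Frobenius inner product. The function name `si_sdr_db`, the `eps` stabilizer, and the random test matrices are illustrative assumptions, not the authors' implementation; a training loss could plausibly be taken as the negative of this value.

```python
import numpy as np

def si_sdr_db(target, estimate, eps=1e-12):
    """SI-SDR in dB between two matrices (hypothetical helper, not from the paper)."""
    t = np.asarray(target).ravel()
    e = np.asarray(estimate).ravel()
    # Project the estimate onto the target (Frobenius inner product;
    # np.vdot conjugates its first argument, so complex input is handled).
    alpha = np.vdot(t, e) / (np.vdot(t, t).real + eps)
    projection = alpha * t
    residual = e - projection
    num = np.vdot(projection, projection).real
    den = np.vdot(residual, residual).real
    return 10.0 * np.log10((num + eps) / (den + eps))

# Illustrative data: Gram matrix of a random "prediction" and a noisy target.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
gram = X @ X.conj().T
target = gram + 0.1 * rng.standard_normal((4, 4))

# Rescaling the Gram matrix leaves the score unchanged (up to eps).
print(np.isclose(si_sdr_db(target, gram), si_sdr_db(target, 3.7 * gram)))
```

Because the projection and the residual both scale linearly with the estimate, their energy ratio does not depend on the overall scale of the prediction.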

[LG-122] Simulation-based Bayesian inference under model misspecification

Link: https://arxiv.org/abs/2503.12315
Authors: Ryan P. Kelly, David J. Warne, David T. Frazier, David J. Nott, Michael U. Gutmann, Christopher Drovandi
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 46 pages, 8 figures

Abstract:Simulation-based Bayesian inference (SBI) methods are widely used for parameter estimation in complex models where evaluating the likelihood is challenging but generating simulations is relatively straightforward. However, these methods commonly assume that the simulation model accurately reflects the true data-generating process, an assumption that is frequently violated in realistic scenarios. In this paper, we focus on the challenges faced by SBI methods under model misspecification. We consolidate recent research aimed at mitigating the effects of misspecification, highlighting three key strategies: i) robust summary statistics, ii) generalised Bayesian inference, and iii) error modelling and adjustment parameters. To illustrate both the vulnerabilities of popular SBI methods and the effectiveness of misspecification-robust alternatives, we present empirical results on an illustrative example.

[LG-123] Probabilistic Forecasting for Dynamical Systems with Missing or Imperfect Data

Link: https://arxiv.org/abs/2503.12273
Authors: Siddharth Rout, Eldad Haber, Stéphane Gaudreault
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Dynamical Systems (math.DS); Atmospheric and Oceanic Physics (physics.ao-ph)

Abstract:The modeling of dynamical systems is essential in many fields, but applying machine learning techniques is often challenging due to incomplete or noisy data. This study introduces a variant of stochastic interpolation (SI) for probabilistic forecasting, estimating future states as distributions rather than single-point predictions. We explore its mathematical foundations and demonstrate its effectiveness on various dynamical systems, including the challenging WeatherBench dataset.

[LG-124] Support Collapse of Deep Gaussian Processes with Polynomial Kernels for a Wide Regime of Hyperparameters

Link: https://arxiv.org/abs/2503.12266
Authors: Daryna Chernobrovkina, Steffen Grünewälder
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We analyze the prior that a Deep Gaussian Process with polynomial kernels induces. We observe that, even for relatively small depths, averaging effects occur within such a Deep Gaussian Process and that the prior can be analyzed and approximated effectively by means of the Berry-Esseen Theorem. One of the key findings of this analysis is that, in the absence of careful hyper-parameter tuning, the prior of a Deep Gaussian Process either collapses rapidly towards zero as the depth increases or places negligible mass on low norm functions. This aligns well with experimental findings and mirrors known results for convolution based Deep Gaussian Processes.

[LG-125] Auditing Differential Privacy in the Black-Box Setting

Link: https://arxiv.org/abs/2503.12045
Authors: Kaining Shi, Cong Ma
Categories: Methodology (stat.ME); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: work in progress, comments are welcomed

Abstract:This paper introduces a novel theoretical framework for auditing differential privacy (DP) in a black-box setting. Leveraging the concept of f-differential privacy, we explicitly define type I and type II errors and propose an auditing mechanism based on conformal inference. Our approach robustly controls the type I error rate under minimal assumptions. Furthermore, we establish a fundamental impossibility result, demonstrating the inherent difficulty of simultaneously controlling both type I and type II errors without additional assumptions. Nevertheless, under a monotone likelihood ratio (MLR) assumption, our auditing mechanism effectively controls both errors. We also extend our method to construct valid confidence bands for the trade-off function in the finite-sample regime.

[LG-126] Bayes and Biased Estimators Without Hyper-parameter Estimation: Comparable Performance to the Empirical-Bayes-Based Regularized Estimator

Link: https://arxiv.org/abs/2503.11854
Authors: Yue Ju, Bo Wahlberg, Håkan Hjalmarsson
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Regularized system identification has become a significant complement to more classical system identification. It has been numerically shown that kernel-based regularized estimators often perform better than the maximum likelihood estimator in terms of minimizing mean squared error (MSE). However, regularized estimators often require hyper-parameter estimation. This paper focuses on ridge regression and the regularized estimator by employing the empirical Bayes hyper-parameter estimator. We utilize the excess MSE to quantify the MSE difference between the empirical-Bayes-based regularized estimator and the maximum likelihood estimator for large sample sizes. We then exploit the excess MSE expressions to develop both a family of generalized Bayes estimators and a family of closed-form biased estimators. They have the same excess MSE as the empirical-Bayes-based regularized estimator but eliminate the need for hyper-parameter estimation. Moreover, we conduct numerical simulations to show that the performance of these new estimators is comparable to the empirical-Bayes-based regularized estimator, while computationally, they are more efficient.

[LG-127] Ranking and Selection with Simultaneous Input Data Collection

Link: https://arxiv.org/abs/2503.11773
Authors: Yuhao Wang, Enlu Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:In this paper, we propose a general and novel formulation of ranking and selection with the existence of streaming input data. The collection of multiple streams of such data may consume different types of resources, and hence can be conducted simultaneously. To utilize the streaming input data, we aggregate simulation outputs generated under heterogeneous input distributions over time to form a performance estimator. By characterizing the asymptotic behavior of the performance estimators, we formulate two optimization problems to optimally allocate budgets for collecting input data and running simulations. We then develop a multi-stage simultaneous budget allocation procedure and provide its statistical guarantees such as consistency and asymptotic normality. We conduct several numerical studies to demonstrate the competitive performance of the proposed procedure.

[LG-128] Generalized Bayesian Ensemble Survival Tree (GBEST) model

Link: https://arxiv.org/abs/2503.11738
Authors: Elena Ballante, Pietro Muliere, Silvia Figini
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper proposes a new class of predictive models for survival analysis called Generalized Bayesian Ensemble Survival Tree (GBEST). It is well known that survival analysis poses many different challenges, in particular when applied to small data or censorship mechanism. Our contribution is the proposal of an ensemble approach that uses Bayesian bootstrap and beta Stacy bootstrap methods to improve the outcome in survival application with a special focus on small datasets. More precisely, a novel approach to integrate Beta Stacy Bayesian bootstrap in bagging tree models for censored data is proposed in this paper. Empirical evidence achieved on simulated and real data underlines that our approach performs better in terms of predictive performances and stability of the results compared with classical survival models available in the literature. In terms of methodology our novel contribution considers the adaptation of recent Bayesian ensemble approaches to survival data, providing a new model called Generalized Bayesian Ensemble Survival Tree (GBEST). A further result in terms of computational novelty is the implementation in R of GBEST, available in a public GitHub repository.

Information Retrieval

[IR-0] Federated Mixture-of-Expert for Non-Overlapped Cross-Domain Sequential Recommendation

Link: https://arxiv.org/abs/2503.13254
Authors: Yu Liu, Hanbin Jiang, Lei Zhu, Yu Zhang, Yuqi Mao, Jiangxia Cao, Shuchao Pang
Categories: Information Retrieval (cs.IR)
Comments: Work in progress

Abstract:In the real world, users always have multiple interests while surfing different services to enrich their daily lives, e.g., watching hot short videos/live streamings. To describe user interests precisely for a better user experience, the recent literature proposes cross-domain techniques by transferring the other related services (a.k.a. domain) knowledge to enhance the accuracy of target service prediction. In practice, naive cross-domain techniques typically require there exist some overlapped users, and sharing overall information across domains, including user historical logs, user/item embeddings, and model parameter checkpoints. Nevertheless, other domain’s user-side historical logs and embeddings are not always available in real-world RecSys designing, since users may be totally non-overlapped across domains, or the privacy-preserving policy limits the personalized information sharing across domains. Thereby, a challenging but valuable problem is raised: How to empower target domain prediction accuracy by utilizing the other domain model parameters checkpoints only? To answer the question, we propose the FMoE-CDSR, which explores the non-overlapped cross-domain sequential recommendation scenario from the federated learning perspective.

[IR-1] Disentangling the Power Dynamics in Participatory Data Physicalisation

Link: https://arxiv.org/abs/2503.13018
Authors: Silvia Cazacu, Georgia Panagiotidou, Therese Steenberghen, Andrew Vande Moere
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: In CHI’25, ACM (2025)

Abstract:Participatory data physicalisation (PDP) is recognised for its potential to support data-driven decisions among stakeholders who collaboratively construct physical elements into commonly insightful visualisations. Like all participatory processes, PDP is however influenced by underlying power dynamics that might lead to issues regarding extractive participation, marginalisation, or exclusion, among others. We first identified the decisions behind these power dynamics by developing an ontology that synthesises critical theoretical insights from both visualisation and participatory design research, which were then systematically applied unto a representative corpus of 23 PDP artefacts. By revealing how shared decisions are guided by different agendas, this paper presents three contributions: 1) a cross-disciplinary ontology that facilitates the systematic analysis of existing and novel PDP artefacts and processes; which leads to 2) six PDP agendas that reflect the key power dynamics in current PDP practice, revealing the diversity of orientations towards stakeholder participation in PDP practice; and 3) a set of critical considerations that should guide how power dynamics can be balanced, such as by reflecting on how issues are represented, data is contextualised, participants express their meanings, and how participants can dissent with flexible artefact construction. Consequently, this study advances a feminist research agenda by guiding researchers and practitioners in openly reflecting on and sharing responsibilities in data physicalisation and participatory data visualisation.

[IR-2] Leveraging the Dynamics of Leadership in Group Recommendation Systems

Link: https://arxiv.org/abs/2503.12877
Authors: Peijin Yu, Shin’ichi Konomi
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 16 figures

Abstract:In the field of group recommendation systems (GRS), effectively addressing the diverse preferences of group members poses a significant challenge. Traditional GRS approaches often aggregate individual preferences into a collective group preference to generate recommendations, which may overlook the intricate interactions between group members. We introduce a novel approach to group recommendation, with a specific focus on small groups sharing common interests. In particular, we present a web-based restaurant recommendation system that enhances user satisfaction by modeling mutual interactions among group members. Drawing inspiration from group decision-making literature and leveraging graph theory, we propose a recommendation algorithm that emphasizes the dynamics of relationships and trust within the group. By representing group members as nodes and their interactions as directed edges, the algorithm captures pairwise relationships to foster consensus and improve the alignment of recommendations with group preferences. This interaction-focused framework ultimately seeks to enhance overall group satisfaction with the recommended choices.

[IR-3] LLMSeR: Enhancing Sequential Recommendation via LLM-based Data Augmentation

Link: https://arxiv.org/abs/2503.12547
Authors: Yuqi Sun, Qidong Liu, Haiping Zhu, Feng Tian
Categories: Information Retrieval (cs.IR)

Abstract:Sequential Recommender Systems (SRS) have become a cornerstone of online platforms, leveraging users’ historical interaction data to forecast their next potential engagement. Despite their widespread adoption, SRS often grapple with the long-tail user dilemma, resulting in less effective recommendations for individuals with limited interaction records. The advent of Large Language Models (LLMs), with their profound capability to discern semantic relationships among items, has opened new avenues for enhancing SRS through data augmentation. Nonetheless, current methodologies encounter obstacles, including the absence of collaborative signals and the prevalence of hallucination. In this work, we present LLMSeR, an innovative framework that utilizes Large Language Models (LLMs) to generate pseudo-prior items, thereby improving the efficacy of Sequential Recommender Systems (SRS). To alleviate the challenge of insufficient collaborative signals, we introduce the Semantic Interaction Augmentor (SIA), a method that integrates both semantic and collaborative information to comprehensively augment user interaction data. Moreover, to weaken the adverse effects of hallucination in SRS, we develop the Adaptive Reliability Validation (ARV), a validation technique designed to assess the reliability of the generated pseudo items. Complementing these advancements, we also devise a Dual-Channel Training strategy, ensuring seamless integration of data augmentation into the SRS training. Experiments conducted with three widely-used SRS models demonstrate the generalizability and efficacy of LLMSeR.

[IR-4] A novel association and ranking approach identifies factors affecting educational outcomes of STEM majors

Link: https://arxiv.org/abs/2503.12321
Authors: Kira Adaricheva, Jonathan T. Brockman, Gillian Z. Elston, Lawrence Hobbie, Skylar Homan, Mohamad Khalefa, Jiyun V. Kim, Rochelle K. Nelson, Sarah Samad, Oren Segal
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: 34 pages, 7 figures

Abstract:Improving undergraduate success in STEM requires identifying actionable factors that impact student outcomes, allowing institutions to prioritize key leverage points for change. We examined academic, demographic, and institutional factors that might be associated with graduation rates at two four-year colleges in the northeastern United States using a novel association algorithm called D-basis to rank attributes associated with graduation. Importantly, the data analyzed included tracking data from the National Student Clearinghouse on students who left their original institutions to determine outcomes following transfer. Key predictors of successful graduation include performance in introductory STEM courses, the choice of first mathematics class, and flexibility in major selection. High grades in introductory biology, general chemistry, and mathematics courses were strongly correlated with graduation. At the same time, students who switched majors - especially from STEM to non-STEM - had higher overall graduation rates. Additionally, Pell eligibility and demographic factors, though less predictive overall, revealed disparities in time to graduation and retention rates. The findings highlight the importance of early academic support in STEM gateway courses and the implementation of institutional policies that provide flexibility in major selection. Enhancing student success in introductory mathematics, biology, and chemistry courses could greatly influence graduation rates. Furthermore, customized mathematics pathways and focused support for STEM courses may assist institutions in optimizing student outcomes. This study offers data-driven insights to guide strategies to increase STEM degree completion. 

[IR-5] Bridging Textual-Collaborative Gap through Semantic Codes for Sequential Recommendation

Link: https://arxiv.org/abs/2503.12183
Authors: Enze Liu, Bowen Zheng, Wayne Xin Zhao, Ji-Rong Wen
Categories: Information Retrieval (cs.IR)

Abstract:In recent years, substantial research efforts have been devoted to enhancing sequential recommender systems by integrating abundant side information with ID-based collaborative information. This study specifically focuses on leveraging the textual metadata (e.g., titles and brands) associated with items. While existing methods have achieved notable success by combining text and ID representations, they often struggle to strike a balance between textual information embedded in text representations and collaborative information from sequential patterns of user behavior. In light of this, we propose CoCoRec, a novel Code-based textual and Collaborative semantic fusion method for sequential Recommendation. The key idea behind our approach is to bridge the gap between textual and collaborative information using semantic codes. Specifically, we generate fine-grained semantic codes from multi-view text embeddings through vector quantization techniques. Subsequently, we develop a code-guided semantic-fusion module based on the cross-attention mechanism to flexibly extract and integrate relevant information from text representations. In order to further enhance the fusion of textual and collaborative semantics, we introduce an optimization strategy that employs code masking with two specific objectives: masked code modeling and masked sequence alignment. The merit of these objectives lies in leveraging mask prediction tasks and augmented item representations to capture code correlations within individual items and enhance the sequence modeling of the recommendation backbone. Extensive experiments conducted on four public datasets demonstrate the superiority of CoCoRec, showing significant improvements over various sequential recommendation models. Our code is available at this https URL.

[IR-6] Genicious: Contextual Few-shot Prompting for Insights Discovery

Link: https://arxiv.org/abs/2503.12062
Authors: Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh Mandora, Kanishka, Chetna Bansal, Anindya Moitra
Categories: Information Retrieval (cs.IR)
Comments: 5 pages, 3 figures, CODS-COMAD Dec 24, Jodhpur, India

Abstract:Data and insights discovery is critical for decision-making in modern organizations. We present Genicious, an LLM-aided interface that enables users to interact with tabular datasets and ask complex queries in natural language. By benchmarking various prompting strategies and language models, we have developed an end-to-end tool that leverages contextual few-shot prompting, achieving superior performance in terms of latency, accuracy, and scalability. Genicious empowers stakeholders to explore, analyze and visualize their datasets efficiently while ensuring data security through role-based access control and a Text-to-SQL approach.

Attachment Download

Click to download today's full paper list
