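This post lists the latest papers retrieved from arXiv.org on 2025-03-24. It is updated automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Friendly reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-03-24)
539 papers were updated today, including:
- Natural Language Processing: 79 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 164 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 154 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 139 papers (Machine Learning (cs.LG))
Natural Language Processing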
[NLP-0] Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique
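Quick read: This paper addresses the weak multi-step logical reasoning of large language models (LLMs) on complex tasks. Traditional methods evaluate candidate reasoning steps with scalar reward signals from process reward models, but scalar rewards lack the nuanced qualitative information needed to guide each step. The paper proposes a new inference-time scaling method, stepwise natural language self-critique (PANEL), whose key idea is to use self-generated natural language critiques as feedback to guide the step-level search. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains important qualitative information and supports better-informed decisions during inference. The approach avoids task-specific verifiers and their training overhead, so it applies across diverse tasks. Experiments show that PANEL significantly improves reasoning performance and outperforms traditional scalar-reward methods on challenging benchmarks such as AIME and GPQA. The code is publicly available to support further research.
Link: https://arxiv.org/abs/2503.17363
Authors: Yansi Li, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Qiuzhi Liu, Rui Wang, Zhuosheng Zhang, Zhaopeng Tu, Haitao Mi, Dong Yu
Affiliations: Tencent; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: (none)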
Abstract:Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference time scaling methods utilize scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach – stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at this https URL to support and encourage future research in this promising field.
[NLP-1] OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
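Quick read: This paper explores whether large vision-language models (LVLMs) can integrate complex reasoning abilities and assesses the effect on multimodal reasoning tasks. Inspired by large language models (LLMs) that acquire advanced reasoning behaviors such as self-verification and self-correction through reinforcement learning (RL) with verifiable rewards, the study proposes a method that combines supervised fine-tuning (SFT) with iterative RL. The key is to distill reasoning ability from text-only models by generating reasoning steps from high-quality image captions, then to use the iterative process to keep improving generalization and reasoning performance, yielding OpenVLThinker, a model with consistently improved reasoning performance.
Link: https://arxiv.org/abs/2503.17352
Authors: Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Affiliations: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 23 pages, 11 figures, 8 tables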
Abstract: Recent advancements demonstrated by DeepSeek-R1 have shown that complex reasoning abilities in large language models (LLMs), including sophisticated behaviors such as self-verification and self-correction, can be achieved by RL with verifiable rewards and significantly improve model performance on challenging tasks such as AIME. Motivated by these findings, our study investigates whether similar reasoning capabilities can be successfully integrated into large vision-language models (LVLMs) and assesses their impact on challenging multimodal reasoning tasks. We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization. Initially, reasoning capabilities were distilled from pure-text R1 models by generating reasoning steps using high-quality captions of the images sourced from diverse visual datasets. Subsequently, iterative RL training further enhances reasoning skills, with each iteration's RL-improved model generating refined SFT datasets for the next round. This iterative process yielded OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrating the potential of our strategy for robust vision-language reasoning. The code, model and data are held at this https URL.
[NLP-2] Efficient Intent-Based Filtering for Multi-Party Conversations Using Knowledge Distillation from LLMs
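Quick read: This paper addresses the high cost of processing multi-party conversations with large language models (LLMs), which demand substantial memory and compute. The key solution is an intent-based filter: knowledge is distilled from LLMs and used to fine-tune a MobileBERT model for multi-label intent classification. The method combines several strategies to build a diverse multi-party conversational dataset annotated with the target intents, so the filter can run efficiently in compute-constrained environments and pass only the snippets relevant to the target application to the LLM for further processing, which significantly reduces overall operational cost. Experiments confirm a good balance between efficiency and performance.
Link: https://arxiv.org/abs/2503.17336
Authors: Reem Gody, Mohamed Abdelghaffar, Mohammed Jabreel, Ahmed Tawfik
Affiliations: Microsoft AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)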
Abstract:Large language models (LLMs) have showcased remarkable capabilities in conversational AI, enabling open-domain responses in chat-bots, as well as advanced processing of conversations like summarization, intent classification, and insights generation. However, these models are resource-intensive, demanding substantial memory and computational power. To address this, we propose a cost-effective solution that filters conversational snippets of interest for LLM processing, tailored to the target downstream application, rather than processing every snippet. In this work, we introduce an innovative approach that leverages knowledge distillation from LLMs to develop an intent-based filter for multi-party conversations, optimized for compute power constrained environments. Our method combines different strategies to create a diverse multi-party conversational dataset, that is annotated with the target intents and is then used to fine-tune the MobileBERT model for multi-label intent classification. This model achieves a balance between efficiency and performance, effectively filtering conversation snippets based on their intents. By passing only the relevant snippets to the LLM for further processing, our approach significantly reduces overall operational costs depending on the intents and the data distribution as demonstrated in our experiments.
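The filtering step described in this abstract can be pictured with a small sketch. It assumes the multi-label classifier has already produced per-intent probabilities for each snippet; the function name `filter_snippets`, the intent labels, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
def filter_snippets(snippets, intent_probs, target_intents, threshold=0.5):
    """Keep snippets whose predicted intents overlap the target intents.

    intent_probs[i] maps intent name -> probability for snippets[i]. In a real
    system these would come from the fine-tuned multi-label classifier; here
    they are hand-written placeholders.
    """
    kept = []
    for snippet, probs in zip(snippets, intent_probs):
        # Multi-label: a snippet passes if ANY target intent clears the threshold.
        if any(probs.get(intent, 0.0) >= threshold for intent in target_intents):
            kept.append(snippet)
    return kept


snippets = [
    "Can we move the call to Friday?",
    "Nice weather today.",
    "My invoice is wrong.",
]
intent_probs = [
    {"schedule": 0.9, "smalltalk": 0.1},
    {"smalltalk": 0.8},
    {"schedule": 0.4, "billing": 0.7},
]
# Only snippets matching the downstream application's intents go on to the LLM.
relevant = filter_snippets(snippets, intent_probs, {"schedule", "billing"})
```

Only the first and third snippets would be forwarded to the LLM in this toy run, which is where the cost saving would come from.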
[NLP-3] FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models
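Quick read: This paper addresses the inefficiency of reinforcement learning training for R1-like reasoning models and their limited performance on complex reasoning tasks with long chain-of-thought rationales, particularly for a 1.5B-parameter language model. It proposes FastCuRL, an efficient curriculum reinforcement learning method whose key idea is to combine length-aware data segmentation with training under a progressively extended context window. The original training data is split into three levels by input prompt length, and the segmented datasets are then used for training with a progressively increasing context window length. FastCuRL-1.5B-Preview outperforms DeepScaleR-1.5B-Preview on five benchmarks (MATH 500, AIME 2024, AMC 2023, Minerva Math, and OlympiadBench) while using only 50% of the training steps, and all training stages run on a single node with 8 GPUs.
Link: https://arxiv.org/abs/2503.17287
Authors: Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, Feng Zhang
Affiliations: Tencent Hunyuan
Categories: Computation and Language (cs.CL)
Comments: (none)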
Abstract: In this paper, we propose FastCuRL, a simple yet efficient Curriculum Reinforcement Learning approach with context window extending strategy to accelerate the reinforcement learning training efficiency for R1-like reasoning models while enhancing their performance in tackling complex reasoning tasks with long chain-of-thought rationales, particularly with a 1.5B parameter language model. FastCuRL consists of two main procedures: length-aware training data segmentation and context window extension training. Specifically, the former first splits the original training data into three different levels by the input prompt length, and then the latter leverages segmented training datasets with a progressively increasing context window length to train the reasoning model. Experimental results demonstrate that FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview across all five datasets (including MATH 500, AIME 2024, AMC 2023, Minerva Math, and OlympiadBench) while only utilizing 50% of training steps. Furthermore, all training stages for FastCuRL-1.5B-Preview are completed using just a single node with 8 GPUs.
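The two procedures named in the abstract (length-aware segmentation, then training with a growing context window) might be sketched as follows. The equal-thirds split, the pairing of shorter prompts with smaller windows, and the window sizes are all assumptions made for illustration; the abstract does not specify them.

```python
def segment_by_prompt_length(prompts, length_fn=len):
    """Split prompts into three levels (short, medium, long) by prompt length.

    Assumption: equal-sized thirds after sorting. The paper may use different
    cut points.
    """
    ordered = sorted(prompts, key=length_fn)
    n = len(ordered)
    cut1, cut2 = n // 3, (2 * n) // 3
    return ordered[:cut1], ordered[cut1:cut2], ordered[cut2:]


def build_curriculum(prompts, context_windows=(8192, 16384, 24576)):
    """Pair each length level with a context window, smallest window first.

    The window sizes are hypothetical placeholders.
    """
    levels = segment_by_prompt_length(prompts)
    return list(zip(context_windows, levels))


# Toy prompts of varying length stand in for real training data.
prompts = ["x" * n for n in (5, 40, 12, 300, 90, 7, 150, 60, 20)]
stages = build_curriculum(prompts)
for window, level in stages:
    print(f"context window {window}: {len(level)} prompts")
```

A training loop would then run one RL stage per entry in `stages`, in order.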
[NLP-4] An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection
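Quick read: This paper addresses how non-technical users can flexibly define new target classes for varied applications while keeping Automatic Target Recognition (ATR) systems sustainable and reusable. The key is an approach that combines analysis techniques on text embeddings with suitable combinations of embeddings for contrastive examples to improve the natural language descriptions that non-technical users provide, so new class definitions can be adopted quickly without retraining the model. The feedback mechanism is validated on multiple publicly available open-vocabulary object detection models.
Link: https://arxiv.org/abs/2503.17285
Authors: Louis Y. Kim, Michelle Karker, Victoria Valledor, Seiyoung C. Lee, Karl F. Brzoska, Margaret Duff, Anthony Palladino
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: To appear in the Proceedings of SPIE 13463 Automatic Target Recognition XXXV, Orlando, FL, 2025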
Abstract:Recent advances in open-vocabulary object detection models will enable Automatic Target Recognition systems to be sustainable and repurposed by non-technical end-users for a variety of applications or missions. New, and potentially nuanced, classes can be defined with natural language text descriptions in the field, immediately before runtime, without needing to retrain the model. We present an approach for improving non-technical users’ natural language text descriptions of their desired targets of interest, using a combination of analysis techniques on the text embeddings, and proper combinations of embeddings for contrastive examples. We quantify the improvement that our feedback mechanism provides by demonstrating performance with multiple publicly-available open-vocabulary object detection models.
[NLP-5] CASE – Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement
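Quick read: This paper addresses how to effectively modify a sentence embedding according to a contextual condition. Despite progress in sentence embedding methods, it remains unclear how best to condition a sentence embedding on its context. The paper proposes Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method for creating an embedding of a sentence under a given condition. The key is, first, to use a large language model (LLM) to create an embedding for the condition, where the sentence influences the attention scores computed for the condition's tokens during pooling, and second, to learn a supervised nonlinear projection that reduces the dimensionality of the LLM-based text embeddings. Experiments show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark. In particular, subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings, and the supervised dimensionality reduction both shrinks the embeddings and significantly improves performance.
Link: https://arxiv.org/abs/2503.17279
Authors: Gaifan Zhang, Yi Zhou, Danushka Bollegala
Affiliations: University of Liverpool; Cardiff University
Categories: Computation and Language (cs.CL)
Comments: (none)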
Abstract:The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM), where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised nonlinear projection is learned to reduce the dimensionality of the LLM-based text embeddings. We show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark dataset. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings. Moreover, we propose a supervised dimensionality reduction method that not only reduces the dimensionality of LLM-based embeddings but also significantly improves their performance.
[NLP-6] KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
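Quick read: This paper addresses the weak performance of existing tokenizers on professional domains (legal, financial, and governmental text). The key solution is the KL3M tokenizers, a family of tokenizers built for these domains. First, it introduces domain-specific BPE tokenizers: on domain documents, the kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 despite a smaller vocabulary. For specialized terminology the cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer for financial terms. Second, it develops character-level BPE tokenizers with 4K, 8K, and 16K vocabularies for text correction tasks such as OCR post-processing; these keep token boundaries consistent between error-containing and correct text, which makes correction patterns easier for models to learn. Together these improve the processing of long legal and financial documents, reduce computational needs, and preserve the meaning of domain terms.
Link: https://arxiv.org/abs/2503.17247
Authors: Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
Affiliations: ALEA Institute; Illinois Tech - Chicago Kent Law; Bucerius Law School; Stanford CodeX
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 tables, 3 figures; Source code available at this https URL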
Abstract: We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
Submitted: Fri, 21 Mar 2025 15:51:43 UTC (v1); DOI: https://doi.org/10.48550/arXiv.2503.17247
[NLP-7] SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
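Quick read: This paper addresses the problem that fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even when the fine-tuning dataset is benign. The key solution is SafeMERGE, a post-fine-tuning framework that selectively merges fine-tuned layers with safety-aligned layers only when those layers deviate from safe behavior, as measured by a cosine similarity criterion, so safety is preserved while task utility is maintained. Experiments show that SafeMERGE consistently reduces harmful outputs compared with other baselines without significantly sacrificing performance, and sometimes improves it.
Link: https://arxiv.org/abs/2503.17239
Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
Affiliations: Technical University Munich; IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)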
Abstract:Fine-tuning large language models (LLMs) on downstream tasks can inadvertently erode their safety alignment, even for benign fine-tuning datasets. We address this challenge by proposing SafeMERGE, a post-fine-tuning framework that preserves safety while maintaining task utility. It achieves this by selectively merging fine-tuned and safety-aligned model layers only when those deviate from safe behavior, measured by a cosine similarity criterion. We evaluate SafeMERGE against other fine-tuning- and post-fine-tuning-stage approaches for Llama-2-7B-Chat and Qwen-2-7B-Instruct models on GSM8K and PubMedQA tasks while exploring different merging strategies. We find that SafeMERGE consistently reduces harmful outputs compared to other baselines without significantly sacrificing performance, sometimes even enhancing it. The results suggest that our selective, subspace-guided, and per-layer merging method provides an effective safeguard against the inadvertent loss of safety in fine-tuned LLMs while outperforming simpler post-fine-tuning-stage defenses.
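To make the idea of selective, per-layer merging concrete, here is a deliberately simplified sketch. It is not the paper's criterion: the abstract mentions a subspace-guided cosine similarity test, whereas this toy version compares flattened layer weights directly. The threshold, the mixing factor `alpha`, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def selective_merge(finetuned, safe, threshold=0.9, alpha=0.5):
    """Merge a layer with its safety-aligned counterpart only if it has drifted.

    finetuned / safe: dict of layer name -> flattened weight list.
    A layer counts as drifted when its cosine similarity to the safe layer is
    below `threshold` (a stand-in for the paper's criterion).
    """
    merged = {}
    for name, w_ft in finetuned.items():
        w_safe = safe[name]
        if cosine(w_ft, w_safe) < threshold:
            # Linear interpolation towards the safety-aligned weights.
            merged[name] = [(1 - alpha) * a + alpha * b for a, b in zip(w_ft, w_safe)]
        else:
            # Layer still looks safe: keep the fine-tuned weights untouched.
            merged[name] = list(w_ft)
    return merged


finetuned = {"layer0": [1.0, 0.0], "layer1": [0.0, 1.0]}
safe = {"layer0": [1.0, 0.0], "layer1": [1.0, 0.0]}
merged = selective_merge(finetuned, safe)
```

In this toy run `layer0` is left alone while `layer1` is pulled halfway back towards the safe weights, so task-specific layers that did not drift keep their fine-tuned behavior.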
[NLP-8] FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs
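Quick read: This paper addresses hallucinated content generated by large language models (LLMs), which is a serious obstacle in applications where factuality is crucial. Existing methods usually detect hallucinations at the sentence or passage level; the paper proposes FactSelfCheck, a black-box sampling-based method that enables fine-grained fact-level detection. The key is to represent text as knowledge graphs consisting of facts in the form of triples and to compute fine-grained hallucination scores by analyzing factual consistency across multiple LLM responses, without external resources or training data. The method performs competitively with leading sampling-based methods and markedly improves hallucination correction: fact-level detection yields a 35% increase in factual content over the baseline, compared with only 8% for sentence-level SelfCheckGPT.
Link: https://arxiv.org/abs/2503.17229
Authors: Albert Sawczyn, Jakub Binkowski, Denis Janiak, Bogdan Gabrys, Tomasz Kajdanowicz
Affiliations: Wrocław University of Science and Technology; University of Technology Sydney
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint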
Abstract:Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as knowledge graphs consisting of facts in the form of triples. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sampling-based methods while providing more detailed insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only an 8% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content.
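A rough sketch of fact-level consistency scoring is shown below. It assumes the triples have already been extracted from the main response and from each sampled response, and it uses exact triple matching as a simplification; the paper's actual scoring functions are not described in the abstract, so every name here is hypothetical.

```python
def fact_hallucination_scores(response_facts, sampled_facts):
    """Score each fact by how rarely the sampled responses support it.

    response_facts: list of (subject, relation, object) triples from the
        response being checked.
    sampled_facts: list of sets of triples, one set per sampled response.
    Returns a dict triple -> score in [0, 1]; higher means more likely
    hallucinated. Exact matching is an illustrative simplification.
    """
    n = len(sampled_facts)
    scores = {}
    for fact in response_facts:
        support = sum(1 for sample in sampled_facts if fact in sample)
        scores[fact] = 1.0 - support / n if n else 1.0
    return scores


capital = ("Paris", "capital_of", "France")
population = ("Paris", "population", "10 million")
samples = [
    {capital},
    {capital, ("Paris", "population", "2 million")},
    {capital},
]
scores = fact_hallucination_scores([capital, population], samples)
```

Because the score is attached to an individual triple rather than a whole sentence, a correction step could rewrite only the unsupported fact.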
[NLP-9] Automating Adjudication of Cardiovascular Events Using Large Language Models
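Quick read: This paper addresses the challenges of manually adjudicating cardiovascular events (such as heart attacks and strokes) in clinical trials: the process is time-consuming, resource-intensive, and subject to inter-reviewer variability, which can introduce bias and slow trial progress. The key solution is a new framework based on large language models (LLMs) that automates adjudication in two stages: the first uses an LLM-based pipeline to extract event information from unstructured clinical data, and the second performs adjudication guided by a Tree of Thoughts approach and clinical endpoint committee (CEC) guidelines. On cardiovascular event-specific clinical trial data, the framework reaches an F1-score of 0.82 for event extraction and an accuracy of 0.68 for adjudication. The paper also introduces the CLEART score, a new metric for evaluating the quality of AI-generated clinical reasoning. The approach shows potential to substantially reduce adjudication time and cost while keeping outcomes high-quality, consistent, and auditable, which supports faster identification and mitigation of risks associated with cardiovascular therapies.
Link: https://arxiv.org/abs/2503.17222
Authors: Sonish Sivarajkumar, Kimia Ameri, Chuqin Li, Yanshan Wang, Min Jiang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)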
Abstract:Cardiovascular events, such as heart attacks and strokes, remain a leading cause of mortality globally, necessitating meticulous monitoring and adjudication in clinical trials. This process, traditionally performed manually by clinical experts, is time-consuming, resource-intensive, and prone to inter-reviewer variability, potentially introducing bias and hindering trial progress. This study addresses these critical limitations by presenting a novel framework for automating the adjudication of cardiovascular events in clinical trials using Large Language Models (LLMs). We developed a two-stage approach: first, employing an LLM-based pipeline for event information extraction from unstructured clinical data and second, using an LLM-based adjudication process guided by a Tree of Thoughts approach and clinical endpoint committee (CEC) guidelines. Using cardiovascular event-specific clinical trial data, the framework achieved an F1-score of 0.82 for event extraction and an accuracy of 0.68 for adjudication. Furthermore, we introduce the CLEART score, a novel, automated metric specifically designed for evaluating the quality of AI-generated clinical reasoning in adjudicating cardiovascular events. This approach demonstrates significant potential for substantially reducing adjudication time and costs while maintaining high-quality, consistent, and auditable outcomes in clinical trials. The reduced variability and enhanced standardization also allow for faster identification and mitigation of risks associated with cardiovascular therapies.
[NLP-10] A Language Anchor-Guided Method for Robust Noisy Domain Generalization
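Quick read: This paper addresses two major challenges for real-world machine learning applications: distribution shift and label noise. Models tend to overfit to redundant, uninformative features in the training data, which limits generalization; label noise worsens the overfitting, so existing methods struggle to tell truly invariant features from misleading spurious ones. The paper proposes the Anchor Alignment and Adaptive Weighting (A3W) algorithm. The key is sample reweighting guided by natural language processing (NLP) anchors to extract more representative features, with each sample's contribution to the loss adjusted to make the model more robust to noisy labels. By bringing in domain-invariant prior knowledge from semantic representations, the method improves accuracy and robustness across datasets and noise levels.
Link: https://arxiv.org/abs/2503.17211
Authors: Zilin Dai, Lehong Wang, Fangzhou Lin, Yidong Wang, Zhigang Li, Kazunori D Yamada, Ziming Zhang, Wang Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: (none)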
Abstract:Real-world machine learning applications often struggle with two major challenges: distribution shift and label noise. Models tend to overfit by focusing on redundant and uninformative features in the training data, which makes it hard for them to generalize to the target domain. Noisy data worsens this problem by causing further overfitting to the noise, meaning that existing methods often fail to tell the difference between true, invariant features and misleading, spurious ones. To tackle these issues, we introduce Anchor Alignment and Adaptive Weighting (A3W). This new algorithm uses sample reweighting guided by natural language processing (NLP) anchors to extract more representative features. In simple terms, A3W leverages semantic representations from natural language models as a source of domain-invariant prior knowledge. Additionally, it employs a weighted loss function that adjusts each sample’s contribution based on its similarity to the corresponding NLP anchor. This adjustment makes the model more robust to noisy labels. Extensive experiments on standard benchmark datasets show that A3W consistently outperforms state-of-the-art domain generalization methods, offering significant improvements in both accuracy and robustness across different datasets and noise levels.
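The weighted loss described in the abstract could look roughly like the sketch below, in which each sample's weight depends on the similarity between its feature vector and the text anchor of its (possibly noisy) label. The clipping at zero, the normalisation, and the function names are assumptions for illustration, not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anchor_weights(features, labels, anchors):
    """Weight each sample by similarity to the anchor of its assigned label."""
    raw = [max(0.0, cosine(f, anchors[y])) for f, y in zip(features, labels)]
    total = sum(raw)
    if total == 0.0:
        # Fall back to uniform weights when no sample resembles its anchor.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_nll(probs, labels, weights):
    """Weighted negative log-likelihood over predicted class probabilities."""
    return sum(-w * math.log(p[y]) for p, y, w in zip(probs, labels, weights))


# Hypothetical 2-D text anchors for two classes.
anchors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
features = [[2.0, 0.0], [1.0, 0.0]]  # both samples look like class 0
labels = [0, 1]                      # the second label is probably noisy
probs = [[0.5, 0.5], [0.9, 0.1]]

weights = anchor_weights(features, labels, anchors)
loss = weighted_nll(probs, labels, weights)
```

The second sample's feature does not resemble the anchor of its assigned label, so it receives zero weight and contributes nothing to the loss.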
[NLP-11] CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
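Quick read: This paper addresses the difficulty of using language models to evaluate creative text such as human-written stories, which stems from the subjectivity of multi-annotator ratings. Widely used Self-Consistency (SC) reasoning suffers from an objective mismatch between generating fluent-looking explanations and actually predicting a good rating, which leads to suboptimal results. To overcome this, the paper proposes Chain-of-Keywords (CoKe), whose key idea is to generate a sequence of keywords before generating the free-text rationale; the keywords guide the evaluation language model's rating prediction. A diverse set of such keyword sequences is generated and the corresponding scores are aggregated. On the StoryER dataset, CoKe with small fine-tuned evaluation models reaches human-level performance and significantly outperforms GPT-4 while requiring far fewer parameters.
Link: https://arxiv.org/abs/2503.17136
Authors: Brihi Joshi, Sriram Venkatapathy, Mohit Bansal, Nanyun Peng, Haw-Shiuan Chang
Affiliations: University of Southern California; Amazon AGI Foundations; University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments: (none)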
Abstract: Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations vs. actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords, and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
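The aggregation step at the end of the abstract can be illustrated in a few lines. The abstract does not say how scores are aggregated, so the arithmetic mean used here is an assumption, and the keyword sets and ratings are made up for the example.

```python
from statistics import mean


def aggregate_rating(generations):
    """Average the ratings predicted under several keyword generations.

    generations: list of (keywords, score) pairs, where each score is the
    rating the evaluator predicted after producing that keyword sequence.
    Using the mean is an illustrative choice, not the paper's stated rule.
    """
    return mean([score for _, score in generations])


generations = [
    (["plot", "twist"], 4),
    (["pacing", "ending"], 2),
    (["voice"], 3),
]
final_rating = aggregate_rating(generations)
```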
[NLP-12] Modifying Large Language Model Post-Training for Diverse Creative Writing
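Quick read: This paper addresses the problem that post-training of large language models (LLMs) for creative writing focuses on generation quality and neglects output diversity. The key idea is to include deviation, the degree of difference between a training sample and all other samples with the same prompt, in the training objective so that the model learns from rare high-quality instances. Applying this idea to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), the paper shows that output diversity can be increased with only a minimal decrease in quality.
Link: https://arxiv.org/abs/2503.17126
Authors: John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, Max Kreminski
Affiliations: Midjourney; New York University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: (none)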
Abstract:As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality but neglects to facilitate output diversity. Hence, in creative writing generation, we investigate post-training approaches to promote both output diversity and quality. Our core idea is to include deviation – the degree of difference between a training sample and all other samples with the same prompt – in the training objective to facilitate learning from rare high-quality instances. By adopting our approach to direct preference optimization (DPO) and odds ratio preference optimization (ORPO), we demonstrate that we can promote the output diversity of trained models while minimally decreasing quality. Our best model with 8B parameters could achieve on-par diversity as a human-created dataset while having output quality similar to the best instruction-tuned models we examined, GPT-4o and DeepSeek-R1. We further validate our approaches with a human evaluation, an ablation, and a comparison to an existing diversification approach, DivPO.
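One way to picture "deviation in the training objective" is to scale a standard DPO loss term by how different the chosen sample is from the other samples for the same prompt. The sketch below does exactly that, with two loud assumptions: deviation is measured by Jaccard distance over word sets (the paper may use embeddings or another measure), and it enters the loss as a simple multiplicative weight.

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over whitespace-separated word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return 1.0 - len(set_a & set_b) / len(union) if union else 0.0


def deviation(sample, others):
    """Mean distance between a sample and the other samples for its prompt."""
    if not others:
        return 0.0
    return sum(jaccard_distance(sample, o) for o in others) / len(others)


def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO term: -log sigmoid(beta * (chosen - rejected)).

    Each logratio is log pi_theta(y|x) - log pi_ref(y|x).
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return math.log1p(math.exp(-margin))


def deviation_weighted_dpo_loss(chosen, others, logratio_chosen,
                                logratio_rejected, beta=0.1):
    """Scale the DPO term by the chosen sample's deviation (an assumption)."""
    return deviation(chosen, others) * dpo_loss(
        logratio_chosen, logratio_rejected, beta
    )


common = "the knight saved the village"
rare = "a lighthouse keeper befriends the fog"
others = [common, common]
loss_common = deviation_weighted_dpo_loss(common, others, 1.0, 0.0)
loss_rare = deviation_weighted_dpo_loss(rare, others, 1.0, 0.0)
```

With identical preference margins, the rare sample produces a larger loss term than the common one, so gradient updates would emphasise unusual high-quality outputs.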
[NLP-13] A Study into Investigating Temporal Robustness of LLMs
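Quick read: This paper addresses the weak performance of large language models (LLMs) on questions that require processing temporal information, temporal reasoning, and temporal factual knowledge. In particular, it focuses on limited robustness in understanding temporal scope and orientation and in time-sensitive tasks. The key is a set of eight time-sensitive robustness tests used to evaluate six popular LLMs in the zero-shot setting, which reveal fragility to temporal reformulations and to temporal references of different granularities. The paper also shows how a selection of these tests can automatically judge a model's temporal robustness for user questions on the fly, and applies the findings to improve temporal question answering (Temporal QA) performance by up to 55 percent.
Link: https://arxiv.org/abs/2503.17073
Authors: Jonas Wallat, Abdelrahman Abdallah, Adam Jatowt, Avishek Anand
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 8 pages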
Abstract:Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model’s temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55 percent.
[NLP-14] Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
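Quick read: This paper addresses the lack of research on how well automatic summarization metrics and LLM-as-a-Judge models work for languages other than English. The key is a new dataset, BASSE, which collects human judgments on 2,040 abstractive summaries in Basque and Spanish, written manually or generated by five large language models (LLMs) with four different prompts. Annotators rated each summary on a 5-point Likert scale for five criteria: coherence, consistency, fluency, relevance, and 5W1H. These data are used to reevaluate traditional automatic metrics and several LLM-as-a-Judge models that perform well on this task in English. Proprietary judge LLMs currently show the highest correlation with human judgments, while open-source judge LLMs perform poorly. The paper releases BASSE and the code, along with the first large-scale Basque summarization dataset (22,525 news articles with their subheads).
Link: https://arxiv.org/abs/2503.17039
Authors: Jeremy Barnes, Naiara Perez, Alba Bonet-Jover, Begoña Altuna
Affiliations: HiTZ Center, University of the Basque Country (UPV/EHU); Department of Software and Computing Systems, University of Alicante, Spain; GOI Institute, Basque Summer University (UEU)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: (none)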
Abstract:Studies on evaluation metrics and LLM-as-a-Judge models for automatic text summarization have largely been focused on English, limiting our understanding of their effectiveness in other languages. Through our new dataset BASSE (BAsque and Spanish Summarization Evaluation), we address this situation by collecting human judgments on 2,040 abstractive summaries in Basque and Spanish, generated either manually or by five LLMs with four different prompts. For each summary, annotators evaluated five criteria on a 5-point Likert scale: coherence, consistency, fluency, relevance, and 5W1H. We use these data to reevaluate traditional automatic metrics used for evaluating summaries, as well as several LLM-as-a-Judge models that show strong performance on this task in English. Our results show that currently proprietary judge LLMs have the highest correlation with human judgments, followed by criteria-specific automatic metrics, while open-sourced judge LLMs perform poorly. We release BASSE and our code publicly, along with the first large-scale Basque summarization dataset containing 22,525 news articles with their subheads.
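Checking whether a metric or judge model "correlates with humans" usually comes down to a rank correlation between its scores and the human ratings. The abstract does not name the coefficient used, so the Spearman correlation below is an assumption; the implementation is a plain-Python sketch that averages ranks for ties, and the example scores are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: human Likert ratings vs. an automatic metric's scores.
human_ratings = [5, 4, 2, 1, 3, 4]
metric_scores = [0.81, 0.64, 0.35, 0.40, 0.52, 0.70]
rho = spearman(human_ratings, metric_scores)
```

Likert ratings produce many ties, which is why the ranks are averaged rather than assigned in arbitrary order.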
[NLP-15] Text2Model: Generating dynamic chemical reactor models using large language models (LLMs)
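Quick read: This paper explores how large language models (LLMs) can assist chemical engineers in research and industry with domain-specific tasks. Specifically, it generates dynamic chemical reactor models in Modelica code format from textual descriptions given as user input, and fine-tunes Llama 3.1 8B Instruct on synthetically generated Modelica code for different reactor scenarios. The key finding is that fine-tuning considerably improves both the semantic and the syntactic accuracy of the generated Modelica models, although the fine-tuned model still generalizes to unseen scenarios less well than GPT4o.
Link: https://arxiv.org/abs/2503.17004
Authors: Sophia Rupprecht, Yassine Hounat, Monisha Kumar, Giacomo Lastrucci, Artur M. Schweidtmann
Affiliations: Delft University of Technology
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments: (none)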
Abstract:As large language models have shown remarkable capabilities in conversing via natural language, the question arises as to how LLMs could potentially assist chemical engineers in research and industry with domain-specific tasks. We generate dynamic chemical reactor models in Modelica code format from textual descriptions as user input. We fine-tune Llama 3.1 8B Instruct on synthetically generated Modelica code for different reactor scenarios. We compare the performance of our fine-tuned model to the baseline Llama 3.1 8B Instruct model and GPT4o. We manually assess the models’ predictions regarding the syntactic and semantic accuracy of the generated dynamic models. We find that considerable improvements are achieved by the fine-tuned model with respect to both the semantic and the syntactic accuracy of the Modelica models. However, the fine-tuned model lacks a satisfactory ability to generalize to unseen scenarios compared to GPT4o.
[NLP-16] A Survey on Personalized Alignment – The Missing Piece for Large Language Models in Real-World Applications
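Quick read: This paper addresses a key limitation of large language models (LLMs) in real-world applications: they cannot adapt to individual preferences while staying aligned with universal human values. Current alignment techniques take a one-size-fits-all approach that fails to accommodate users' diverse backgrounds and needs. The core of the proposed solution is the personalized alignment paradigm: the survey builds a unified framework comprising preference memory management, personalized generation, and feedback-based alignment, systematically analyzes implementation approaches, and evaluates their effectiveness across scenarios. The key question is how LLMs can flexibly adapt to individual needs within ethical boundaries.
Link: https://arxiv.org/abs/2503.17003
Authors: Jian Guan, Junfei Wu, Jia-Nan Li, Chuanqi Cheng, Wei Wu
Affiliations: Ant Group; Institute of Automation, Chinese Academy of Sciences; Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Categories: Computation and Language (cs.CL)
Comments: 9 pages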
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their transition to real-world applications reveals a critical limitation: the inability to adapt to individual preferences while maintaining alignment with universal human values. Current alignment techniques adopt a one-size-fits-all approach that fails to accommodate users’ diverse backgrounds and needs. This paper presents the first comprehensive survey of personalized alignment, a paradigm that enables LLMs to adapt their behavior within ethical boundaries based on individual preferences. We propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment, systematically analyzing implementation approaches and evaluating their effectiveness across various scenarios. By examining current techniques, potential risks, and future challenges, this survey provides a structured foundation for developing more adaptable and ethically-aligned LLMs.
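[NLP-17] Token Dynamics: Towards Efficient and Dynamic Video Token Representation for Video Large Language Models
Quick read: This paper addresses the need for extreme token compression in video large language models (Video LLMs): representing long video sequences with very few tokens while preserving spatial-temporal coherence. Existing methods such as token pruning and token merging often disrupt essential spatial-temporal positional embeddings and fail to balance computational efficiency against token count, so the resulting token sequences remain relatively long, which limits their use in scenarios that require extreme token compression.
The key solution is a new video representation framework called Token Dynamics. It disentangles visual embeddings from grid-level motion information and restructures them into a concise token base, formed by clustering tokens that describe object-level content, and a token dynamics map that captures detailed spatial-temporal motion patterns across grids, thereby reducing the token count dynamically. A cross-dynamics attention mechanism integrates motion features into the token base without increasing token length, preserving compactness and spatial-temporal integrity. Experiments show that the token count can be reduced to 0.07% of the original with a performance drop of only 1.13%. The paper also proposes two new subtasks, fixed-length and adaptive-length compression, for representing long token sequences.
Link: https://arxiv.org/abs/2503.16980
Authors: Haichao Zhang,Zhuowei Li,Dimitris Metaxas,Yun Fu
Affiliations: Northeastern University; Rutgers University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: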
Abstract:Token-based video representation has emerged as a promising approach for enabling large language models to interpret video content. However, existing token reduction techniques, such as token pruning and token merging, often disrupt essential spatial-temporal positional embeddings, failing to adequately balance computational efficiency with fewer tokens. Consequently, these methods result in relatively lengthy token sequences, limiting their applicability in scenarios requiring extreme token compression, such as video large language models. In this paper, we introduce the novel task of extreme short token reduction, aiming to represent extensive video sequences with a minimal number of tokens. To address this challenge, we propose Token Dynamics, a new video representation framework that dynamically reduces token count while preserving spatial-temporal coherence. Specifically, we disentangle video representations by separating visual embeddings from grid-level motion information, structuring them into: 1. a concise token base, created by clustering tokens that describe object-level content; 2. a token dynamics map, capturing detailed spatial-temporal motion patterns across grids. Furthermore, we introduce a cross-dynamics attention mechanism that integrates motion features into the token base without increasing token length, thereby maintaining compactness and spatial-temporal integrity. The experiments demonstrate a reduction of token count to merely 0.07% of the original tokens, with only a minor performance drop of 1.13%. Additionally, we propose two novel subtasks within extreme token reduction (fixed-length and adaptive-length compression), both effectively representing long token sequences for video-language tasks. Our method offers significantly lower theoretical complexity, fewer tokens, and enhanced throughput, thus providing an efficient solution for video LLMs.
[NLP-18] When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
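Quick read: This paper addresses the limited ability of Visual Language Models (VLMs) to make complex decisions in multimodal human-centered decision-making tasks, particularly in situations that require deep reasoning about human needs and values. The study finds that Large Language Models (LLMs) receiving only textual descriptions unexpectedly outperform VLMs of similar scale that process the actual images, which suggests that visual alignment may hinder VLM abilities.
The key to the solution is a novel text-only training approach that uses synthesized textual data to strengthen the VLM's language component and transfers the learned abilities to multimodal inference, removing the need for expensive paired image-text data. In addition, VLMs can achieve substantial gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models such as GPT-4. These findings establish a more efficient and scalable way to enhance VLMs on human-centered decision-making and open new avenues for optimizing VLMs through self-improvement mechanisms.
Link: https://arxiv.org/abs/2503.16965
Authors: Zhe Hu,Jing Li,Yu Yin
Affiliations: Department of Computing, The Hong Kong Polytechnic University; Research Centre for Data Science & Artificial Intelligence; Department of Computer and Data Sciences, Case Western Reserve University
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: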
Abstract:Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-sourced VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLM abilities. To address this challenge, we propose a novel text-only training approach with synthesized textual data. This method strengthens VLMs’ language components and transfers the learned abilities to multimodal inference, eliminating the need for expensive image-text paired data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models like GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs’ human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.
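[NLP-19] Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings
Quick read: This paper studies GPT-4 as a reader-annotator of 21 specific appraisal ratings, aiming to evaluate and improve its performance relative to human annotators. The key findings are that majority voting over five completions significantly improves GPT-4's results, that GPT-4 effectively predicts appraisal ratings and emotion labels from a single prompt, and that added instruction complexity degrades performance. Longer event descriptions also lead to more accurate ratings from both the model and human annotators. The work contributes to the growing use of large language models (LLMs) in psychology and to strategies for improving GPT-4's performance in annotating appraisals.
Link: https://arxiv.org/abs/2503.16883
Authors: Deniss Ruder,Andero Uusberg,Kairit Sirts
Affiliations: Institute of Computer Science, University of Tartu; Institute of Psychology, University of Tartu
Subjects: Computation and Language (cs.CL)
Comments: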
Abstract:Appraisal theories suggest that emotions arise from subjective evaluations of events, referred to as appraisals. The taxonomy of appraisals is quite diverse, and they are usually given ratings on a Likert scale to be annotated in an experiencer-annotator or reader-annotator paradigm. This paper studies GPT-4 as a reader-annotator of 21 specific appraisal ratings in different prompt settings, aiming to evaluate and improve its performance compared to human annotators. We found that GPT-4 is an effective reader-annotator that performs close to or even slightly better than human annotators, and its results can be significantly improved by using a majority voting of five completions. GPT-4 also effectively predicts appraisal ratings and emotion labels using a single prompt, but adding instruction complexity results in poorer performance. We also found that longer event descriptions lead to more accurate annotations for both model and human annotator ratings. This work contributes to the growing usage of LLMs in psychology and the strategies for improving GPT-4 performance in annotating appraisals.
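The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. The ratings below are hypothetical, and the tie-breaking rule (prefer the smallest rating) is an assumption made here for determinism; the abstract does not say how ties are resolved.

```python
from collections import Counter

def majority_vote(ratings):
    """Return the most frequent rating; ties go to the smallest value (assumed rule)."""
    counts = Counter(ratings)
    best = max(counts.values())
    return min(r for r, c in counts.items() if c == best)

# Five hypothetical completions for one appraisal dimension on a 1-5 Likert scale.
completions = [4, 4, 3, 5, 4]
print(majority_vote(completions))
```

Aggregating several completions this way smooths out the sampling noise of any single completion, which is one plausible reason the aggregated ratings are reported to improve.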
[NLP-20] Federated Cross-Domain Click-Through Rate Prediction With Large Language Model Augmentation
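Quick read: This paper tackles cross-domain click-through rate (CCTR) prediction under stringent privacy constraints, particularly when user-item interactions are sparse and fragmented across domains. Conventional methods typically assume homogeneous feature spaces and rely on centralized data sharing, neglecting inter-domain discrepancies and the trade-offs imposed by privacy-preserving protocols. The paper proposes FedCCTR-LM (Federated Cross-Domain CTR Prediction with Large Language Model Augmentation), a federated framework that addresses these limitations by synchronizing data augmentation, representation disentanglement, and adaptive privacy protection. It has three key innovations. First, PrivAugNet uses large language models to enrich user and item representations and expand interaction sequences, mitigating data sparsity and feature incompleteness. Second, the IDST-CL module separates domain-specific from shared preferences through intra-domain representation alignment and cross-domain representation disentanglement, improving knowledge transfer. Finally, the AdaLDP mechanism dynamically calibrates noise injection to balance privacy guarantees against predictive accuracy. Empirical evaluation shows that FedCCTR-LM substantially outperforms existing baselines in heterogeneous federated environments, offering robust, privacy-preserving, and generalizable cross-domain CTR prediction.
Link: https://arxiv.org/abs/2503.16875
Authors: Jiangcheng Qin,Xueyuan Zhang,Baisong Liu,Jiangbo Qian,Yangyang Wang
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: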
Abstract:Accurately predicting click-through rates (CTR) under stringent privacy constraints poses profound challenges, particularly when user-item interactions are sparse and fragmented across domains. Conventional cross-domain CTR (CCTR) methods frequently assume homogeneous feature spaces and rely on centralized data sharing, neglecting complex inter-domain discrepancies and the subtle trade-offs imposed by privacy-preserving protocols. Here, we present Federated Cross-Domain CTR Prediction with Large Language Model Augmentation (FedCCTR-LM), a federated framework engineered to address these limitations by synchronizing data augmentation, representation disentanglement, and adaptive privacy protection. Our approach integrates three core innovations. First, the Privacy-Preserving Augmentation Network (PrivAugNet) employs large language models to enrich user and item representations and expand interaction sequences, mitigating data sparsity and feature incompleteness. Second, the Independent Domain-Specific Transformer with Contrastive Learning (IDST-CL) module disentangles domain-specific and shared user preferences, employing intra-domain representation alignment (IDRA) and cross-domain representation disentanglement (CDRD) to refine the learned embeddings and enhance knowledge transfer across domains. Finally, the Adaptive Local Differential Privacy (AdaLDP) mechanism dynamically calibrates noise injection to achieve an optimal balance between rigorous privacy guarantees and predictive accuracy. Empirical evaluations on four real-world datasets demonstrate that FedCCTR-LM substantially outperforms existing baselines, offering robust, privacy-preserving, and generalizable cross-domain CTR prediction in heterogeneous, federated environments.
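[NLP-21] Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Quick read: This paper addresses the challenges of applying knowledge distillation during the pre-training of large language models. In particular, existing sparse knowledge distillation methods, such as caching Top-K probabilities, give biased estimates of the teacher distribution, which leads to suboptimal student performance and calibration. The key contribution is Random Sampling Knowledge Distillation, an importance-sampling-based method that provides unbiased estimates, preserves the gradient in expectation, and stores significantly sparser logits. It enables faster student training with marginal overhead (10%) compared with cross-entropy training, while remaining competitive with full distillation across model sizes from 300M to 3B parameters.
Link: https://arxiv.org/abs/2503.16870
Authors: Anshumann,Mohd Abbas Zaidi,Akhil Kedia,Jinwoo Ahn,Taehwak Kwon,Kangwook Lee,Haejun Lee,Joohyung Lee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Anshumann, Mohd Abbas Zaidi and Akhil Kedia have Equal Contribution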
Abstract:Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method ‘Random Sampling Knowledge Distillation’, which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
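The contrast between Top-K caching and sampling can be illustrated on a toy distribution. The sketch below is not the authors' implementation: it assumes the proposal distribution equals the teacher distribution, so each importance weight is 1 and the sparse target is simply the sample count divided by the number of samples. The toy probabilities and function names are hypothetical.

```python
import random
from collections import Counter

def top_k_renormalized(p, k):
    """Keep the k largest probabilities and renormalize (the biased baseline)."""
    top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
    mass = sum(p[i] for i in top)
    return [p[i] / mass if i in top else 0.0 for i in range(len(p))]

def sampled_sparse_target(p, n_samples, rng):
    """Draw n_samples token ids from p and return count / n_samples per token."""
    draws = rng.choices(range(len(p)), weights=p, k=n_samples)
    counts = Counter(draws)
    return [counts.get(i, 0) / n_samples for i in range(len(p))]

teacher = [0.5, 0.2, 0.15, 0.1, 0.05]  # toy teacher distribution (hypothetical)
rng = random.Random(0)

biased = top_k_renormalized(teacher, k=2)

# Average many sparse sampled targets; the mean should approach the teacher.
trials = 20000
mean = [0.0] * len(teacher)
for _ in range(trials):
    target = sampled_sparse_target(teacher, n_samples=2, rng=rng)
    mean = [m + t / trials for m, t in zip(mean, target)]

print("top-k:", [round(x, 3) for x in biased])
print("sampled mean:", [round(x, 3) for x in mean])
```

Each sampled target stores at most two nonzero entries, yet its average over many draws stays close to the full teacher distribution, whereas the Top-K target assigns zero mass to the tail no matter how often it is computed.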
[NLP-22] Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction
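Quick read: This paper addresses a gap in visual question answering (VQA) for information extraction from document images: existing methods typically query each field in isolation and overlook potential dependencies among fields. The key is a comparison of joint versus separate extraction of multiple fields. Experiments on multiple large vision language models and datasets show that joint extraction often improves accuracy, especially when fields share strong numeric or contextual dependencies. The paper also analyzes how performance scales with the number of requested items and uses a regression-based metric to quantify inter-field relationships. The results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, offering practical guidance for designing robust VQA systems for document information extraction.
Link: https://arxiv.org/abs/2503.16868
Authors: Mengsay Loem,Taiju Hosaka
Affiliations: Sansan, Inc.
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: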
Abstract:Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies across multiple items. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression based metric to quantify inter field relationships. Our results suggest that multi field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems in document information extraction tasks.
[NLP-23] MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering
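Quick read: This paper addresses the relationship between textual news and time-series evolution, a critical yet under-explored challenge in applied data science. Existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing the interactions between narrative information and temporal patterns. To close this gap, the paper introduces the Multimodal Time Series Benchmark (MTBench), a large-scale benchmark for evaluating large language models (LLMs) on time-series and text understanding in the financial and weather domains. MTBench provides paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records, so that models must reason jointly over structured numerical trends and unstructured textual narratives. This design supports diverse tasks such as time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA), all of which require a deep understanding of both text and time-series data. The key contribution is this comprehensive cross-modal testbed, which targets a model's ability to capture temporal dependencies, extract key insights from textual context, and integrate multimodal information.
Link: https://arxiv.org/abs/2503.16858
Authors: Jialin Chen,Aosong Feng,Ziyu Zhao,Juan Garza,Gaukhar Nurbek,Cheng Qin,Ali Maatouk,Leandros Tassiulas,Yifeng Gao,Rex Ying
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages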
Abstract:Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing complex interactions between narrative information and temporal patterns. To bridge this gap, we introduce Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTbench comprises paired time series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTbench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTbench enables formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target the model’s ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTbench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.
[NLP-24] MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
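Quick read: This paper addresses machine comprehension of scientific papers, which reflects a high level of Artificial General Intelligence (AGI) and requires reasoning across fragmented and heterogeneous sources of information, a complex and practically significant challenge. Vision-Language Models (VLMs) have made notable progress on tasks whose evidence comes from a single image or text page, but their ability to reason with cross-source information remains an open problem. The paper presents MMCR, a high-difficulty benchmark for evaluating how well VLMs reason with cross-source information from scientific papers. It comprises 276 high-quality, human-annotated questions covering 7 subjects and 10 task types. Experiments show that cross-source reasoning is a substantial challenge for existing models: even the top-performing model, GPT-4o, reaches only 48.55% overall accuracy and only 20% on multi-table comprehension. The study also examines the Chain-of-Thought (CoT) technique and finds that it hurts small models while substantially improving larger ones.
The key to the solution is a dataset (MMCR) that effectively evaluates cross-source reasoning in VLMs, together with experiments that expose the limits of existing models on this task and the potential of CoT, clarifying directions for future research on stronger VLMs.
Link: https://arxiv.org/abs/2503.16856
Authors: Yang Tian,Zheng Lu,Mingqi Gao,Zheng Liu,Bo Zhao
Affiliations: School of AI, Shanghai Jiao Tong University; South China University of Technology; Southeast University; Beijing Academy of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: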
Abstract:Fully comprehending scientific papers by machines reflects a high level of Artificial General Intelligence, requiring the ability to reason across fragmented and heterogeneous sources of information, presenting a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence source from single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs’ capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with only 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed a detrimental effect on small models, whereas larger models demonstrated substantially enhanced performance. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
[NLP-25] Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
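Quick read: This paper addresses the poor performance of language models on tasks that require auditory commonsense knowledge. Previous work augments the language model to retrieve knowledge from external audio databases, but this approach is limited by the potential lack of relevant audio and by the high cost of constructing and querying the databases. The key solution is Imagine to Hear, a new approach that dynamically generates auditory knowledge with generative models. The framework detects multiple audio-related textual spans in the given prompt and generates the corresponding auditory knowledge. It also includes several mechanisms for processing multiple pieces of auditory knowledge efficiently, including a CLAP-based rejection sampler and a language-audio fusion module. Experiments show state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of the generation-based approach.
Link: https://arxiv.org/abs/2503.16853
Authors: Suho Yoo,Hyunjong Ok,Jaeho Lee
Affiliations: HJ AILAB; POSTECH
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint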
Abstract:Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing and querying the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
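[NLP-26] Towards LLM Guardrails via Sparse Representation Steering
Quick read: This paper addresses the significant ethical and safety risks posed by the uncontrolled outputs of Large Language Models (LLMs) in natural language generation. Existing representation engineering methods steer model behavior by modifying the rich semantic information encoded in activation vectors, but they suffer from three major limitations: lack of fine-grained control, quality degradation of generated content, and poor interpretability. To address these, the paper proposes SRE, a sparse encoding-based representation engineering method whose key idea is to decompose polysemantic activations into a structured, monosemantic feature space. Using sparse autoencoding, SRE adjusts only task-specific sparse feature dimensions, enabling precise and interpretable steering while preserving content quality. Experiments in three critical domains (safety, fairness, and truthfulness) show strong results, validating SRE as a fine-grained and interpretable activation steering framework.
Link: https://arxiv.org/abs/2503.16851
Authors: Zeqing He,Zhibo Wang,Huiyu Xu,Kui Ren
Affiliations: The State Key Laboratory of Blockchain and Data Security, Zhejiang University, China; School of Cyber Science and Technology, Zhejiang University, China
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: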
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in natural language generation tasks, yet their uncontrolled outputs pose significant ethical and safety risks. Recently, representation engineering methods have shown promising results in steering model behavior by modifying the rich semantic information encoded in activation vectors. However, due to the difficulty of precisely disentangling semantic directions within high-dimensional representation space, existing approaches suffer from three major limitations: lack of fine-grained control, quality degradation of generated content, and poor interpretability. To address these challenges, we propose a sparse encoding-based representation engineering method, named SRE, which decomposes polysemantic activations into a structured, monosemantic feature space. By leveraging sparse autoencoding, our approach isolates and adjusts only task-specific sparse feature dimensions, enabling precise and interpretable steering of model behavior while preserving content quality. We validate our method on three critical domains, i.e., safety, fairness, and truthfulness using the open-source LLM Gemma-2-2B-it. Experimental results show that SRE achieves superior controllability while maintaining the overall quality of generated content (i.e., controllability and quality), demonstrating its effectiveness as a fine-grained and interpretable activation steering framework.
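The general idea of steering through a sparse feature space can be sketched with a tiny sparse autoencoder. This is a generic illustration and not the SRE method itself: the weights, the choice of feature, and the practice of adding back the reconstruction error are all assumptions made here, and the abstract does not describe how SRE selects or scales features.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def steer(x, w_enc, b_enc, w_dec, b_dec, feature, value):
    """Set one sparse feature to `value` and map the change back to activation space."""
    # Encode the activation into (hypothetical) sparse features.
    f = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    recon = [a + b for a, b in zip(matvec(w_dec, f), b_dec)]
    # Residual the autoencoder fails to reconstruct; kept so only one feature changes.
    error = [xi - ri for xi, ri in zip(x, recon)]
    f_steered = list(f)
    f_steered[feature] = value
    steered = [a + b for a, b in zip(matvec(w_dec, f_steered), b_dec)]
    return [s + e for s, e in zip(steered, error)]

# Toy weights: a 2-dimensional activation and 3 sparse features (all hypothetical).
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25]]
b_dec = [0.0, 0.0]
x = [1.0, 2.0]

steered_x = steer(x, w_enc, b_enc, w_dec, b_dec, feature=2, value=0.0)
print(steered_x)
```

Because only one feature is changed and the residual is restored, the activation moves solely along that feature's decoder direction, which is the sense in which such an edit is targeted.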
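[NLP-27] The Deployment of End-to-End Audio Language Models Should Take into Account the Principle of Least Privilege
Quick read: This paper asks how end-to-end audio language models (Audio LMs) should be built and deployed so that safety and privacy are protected, given the new risks that arise from processing speech directly, such as potential misuse of sensitive vocal attributes and the associated legal implications. The key argument is that the principle of least privilege should guide the choice of model architecture: evaluations should assess whether end-to-end modeling is necessary for a given application and what scope of information access is appropriate. The paper also highlights gaps in current audio LM benchmarks and identifies key technical and policy-related open research questions that must be addressed for responsible deployment.
Link: https://arxiv.org/abs/2503.16833
Authors: Luxi He,Xiangyu Qi,Michel Liao,Inyoung Cheong,Prateek Mittal,Danqi Chen,Peter Henderson
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: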
Abstract:We are at a turning point for language models that accept audio input. The latest end-to-end audio language models (Audio LMs) process speech directly instead of relying on a separate transcription step. This shift preserves detailed information, such as intonation or the presence of multiple speakers, that would otherwise be lost in transcription. However, it also introduces new safety risks, including the potential misuse of speaker identity cues and other sensitive vocal attributes, which could have legal implications. In this position paper, we urge a closer examination of how these models are built and deployed. We argue that the principle of least privilege should guide decisions on whether to deploy cascaded or end-to-end models. Specifically, evaluations should assess (1) whether end-to-end modeling is necessary for a given application; and (2) the appropriate scope of information access. Finally, we highlight related gaps in current audio LM benchmarks and identify key open research questions, both technical and policy-related, that must be addressed to enable the responsible deployment of end-to-end Audio LMs.
[NLP-28] When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts
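Quick read: This paper addresses cultural bias in multimodal large language models (MLLMs) on mixed-cultural inputs, in particular the over-reliance on the visual features of the person in the image (their ethnicity), which leads to misclassification of entities. To evaluate the robustness of MLLMs to different ethnicities, the paper introduces MixCuBe, a cross-cultural bias benchmark, and studies elements from five countries and four ethnicities. The benchmark reveals that models perform better on high-resource cultures than on low-resource ones, and that GPT-4o shows up to a 58% difference in accuracy between the original and perturbed settings for low-resource cultures. The dataset is publicly available.
Link: https://arxiv.org/abs/2503.16826
Authors: Jun Seong Kim,Kyaw Ye Thu,Javad Ismayilzada,Junyeong Park,Eunsu Kim,Huzama Ahmad,Na Min An,James Thorne,Alice Oh
Affiliations: School of Computing, KAIST; Graduate School of AI, KAIST
Subjects: Computation and Language (cs.CL)
Comments: 12 pages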
Abstract:In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it, as well as an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicity, we introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbation for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: this https URL.
[NLP-29] Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation ACL
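Quick read: This paper addresses the difficulty users have in eliciting helpful responses from large language models (LLMs) because their prompts are poorly crafted. The key idea is to use LLMs to rewrite suboptimal user prompts so that they express the user's information need more accurately, which improves the responses of the conversational system while preserving the user's original intent. Rewrites perform better in longer conversations, where contextual inferences about user needs can be made more accurately. In addition, LLMs often need to, and do, make plausible assumptions about a user's intentions and goals when interpreting prompts, which further supports prompt rewriting as a way to improve human-AI interaction.
Link: https://arxiv.org/abs/2503.16789
Authors: Rupak Sarkar,Bahareh Sarrafzadeh,Nirupama Chandrasekaran,Nagu Rangan,Philip Resnik,Longqi Yang,Sujay Kumar Jauhar
Affiliations: Microsoft Corporation, Redmond; University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, ACL style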
Abstract:Human-LLM conversations are increasingly becoming more pervasive in people’s professional and personal lives, yet many users still struggle to elicit helpful responses from LLM Chatbots. One of the reasons for this issue is users’ lack of understanding in crafting effective prompts that accurately convey their information needs. Meanwhile, the existence of real-world conversational datasets on the one hand, and the text understanding faculties of LLMs on the other, present a unique opportunity to study this problem, and its potential solutions at scale. Thus, in this paper we present the first LLM-centric study of real human-AI chatbot conversations, focused on investigating aspects in which user queries fall short of expressing information needs, and the potential of using LLMs to rewrite suboptimal user prompts. Our findings demonstrate that rephrasing ineffective prompts can elicit better responses from a conversational system, while preserving the user’s original intent. Notably, the performance of rewrites improves in longer conversations, where contextual inferences about user needs can be made more accurately. Additionally, we observe that LLMs often need to – and inherently do – make plausible assumptions about a user’s intentions and goals when interpreting prompts. Our findings largely hold true across conversational domains, user intents, and LLMs of varying sizes and families, indicating the promise of using prompt rewriting as a solution for better human-AI interactions.
[NLP-30] Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
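Quick read: This paper addresses two problems with existing tool learning methods: fine-tuning restricts the model to tools seen in the training data, and adding tool demonstrations to the prompt is inefficient. It proposes a new tool learning method called Chain-of-Tools, which exploits the strong semantic representation capability of frozen large language models (LLMs) to perform tool calling during CoT reasoning over a huge and flexible tool pool that may contain unseen tools. To validate the approach in the massive unseen-tool scenario, the paper constructs a new dataset, SimpleToolQuestions, and experiments on several benchmarks show that the method outperforms the baseline.
Link: https://arxiv.org/abs/2503.16779
Authors: Mengsong Wu,Tong Zhu,Han Han,Xiang Zhang,Wenbiao Shao,Wenliang Chen
Affiliations: Soochow University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 10 figures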
Abstract:Tool learning can further broaden the usage scenarios of large language models (LLMs). However, most existing methods either require fine-tuning, so that the model can only use tools seen in the training data, or add tool demonstrations to the prompt at lower efficiency. In this paper, we present a new Tool Learning method Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to finish tool calling in CoT reasoning with a huge and flexible tool pool which may contain unseen tools. Especially, to validate the effectiveness of our approach in the massive unseen tool scenario, we construct a new dataset SimpleToolQuestions. We conduct experiments on two numerical reasoning benchmarks (GSM8K-XL and FuncQA) and two knowledge-based question answering benchmarks (KAMEL and SimpleToolQuestions). Experimental results show that our approach performs better than the baseline. We also identify dimensions of the model output that are critical in tool selection, enhancing the model interpretability. Our code and data are available at: this https URL.
[NLP-31] SPACER: A Parallel Dataset of Speech Production And Comprehension of Error Repairs
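Quick read: This paper studies how speech errors are detected and corrected in natural communication, and in particular how speakers and comprehenders differ in correcting single-word substitution errors. To this end it presents SPACER, a parallel dataset of naturalistic speech errors and their corrections, drawn from the Switchboard corpus and combining speakers' self-repairs with comprehenders' responses from an offline text-editing experiment. The key contribution is that this parallel dataset allows a direct comparison of the two, and exploratory analysis suggests asymmetries in error correction: speakers are more likely to repair errors that introduce larger semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or that do not fit the prior context. The dataset is a resource for future integrated research on language production and comprehension.
Link: https://arxiv.org/abs/2503.16745
Authors: Shiva Upadhye,Jiaxuan Li,Richard Futrell
Affiliations: University of California, Irvine
Subjects: Computation and Language (cs.CL)
Comments: 11 pages, 11 figures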
Abstract:Speech errors are a natural part of communication, yet they rarely lead to complete communicative failure because both speakers and comprehenders can detect and correct errors. Although prior research has examined error monitoring and correction in production and comprehension separately, integrated investigation of both systems has been impeded by the scarcity of parallel data. In this study, we present SPACER, a parallel dataset that captures how naturalistic speech errors are corrected by both speakers and comprehenders. We focus on single-word substitution errors extracted from the Switchboard corpus, accompanied by speaker’s self-repairs and comprehenders’ responses from an offline text-editing experiment. Our exploratory analysis suggests asymmetries in error correction strategies: speakers are more likely to repair errors that introduce greater semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or do not fit into prior contexts. Our dataset enables future research on integrated approaches toward studying language production and comprehension.
[NLP-32] Design and Implementation of an FPGA-Based Tiled Matrix Multiplication Accelerator for Transformer Self-Attention on the Xilinx KV260 SoM
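[Quick read]: This paper targets the large matrix multiplications in the attention and feed-forward layers of Transformer-based large language models (LLMs), which dominate their compute. It focuses on the Q, K, and V linear projections in the Multi-Head Self-Attention (MHA) module, identified as a critical bottleneck. The key contribution is a tiled matrix multiplication accelerator for the FPGA on a Xilinx KV260 board, built around persistent on-chip storage for one matrix operand, two-level tiling for data reuse, and a systolic-like unrolled compute engine. Implemented with high-level synthesis (HLS) and integrated with DistilBERT for the Q, K, V projections, it improves speed and energy efficiency over CPU baselines: standalone GEMM benchmarks show up to a 7x speedup over an ARM CPU (PyTorch) and about 200x over naive NumPy. The end-to-end DistilBERT speedup is more modest, but the results support FPGA-based acceleration for critical Transformer components.
Link: https://arxiv.org/abs/2503.16731
Authors: Zhaoqin “Richie” Li, Sicheng Chen
Affiliations: University of California, Irvine
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 2 tables. Prepared in ACM conference style. Preprint under review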
Abstract:Transformer-based LLMs spend most of their compute in large matrix multiplications for attention and feed-forward layers. Recognizing that the Q, K, and V linear projections within the Multi-Head Self-Attention (MHA) module represent a critical computational bottleneck, we strategically focused our efforts on accelerating these operations. We present a tiled matrix multiplication accelerator optimized for such workloads on a Xilinx KV260 on-board FPGA. Key innovations include persistent on-chip storage for one matrix operand, two-level tiling for data reuse, and a systolic-like unrolled compute engine. Implemented via high-level synthesis (HLS) and integrated with DistilBERT for Q, K, V projections, our accelerator achieves significant speedup and energy efficiency gains over CPU baselines. Standalone GEMM benchmarks show up to a 7x speedup over an ARM CPU (PyTorch) and ~200x over naive numpy, with a throughput of up to 3.1 GFLOPs on 768x3072 matrices. Although the overall end-to-end DistilBERT acceleration is more modest, our results validate the potential of FPGA-based acceleration for critical components of Transformer models.
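To make the idea of tiling concrete, here is a minimal software sketch in Python. It is not the authors' HLS design: it uses a single level of tiling rather than the two levels the abstract mentions, and the function names and the `tile` parameter are assumptions made for illustration. The point it shows is that the output is computed block by block, so each small block of the operands can be reused while it is held in fast storage.

```python
def tiled_matmul(a, b, tile=4):
    """Multiply a (n x k) by b (k x m), visiting the operands one tile at a time.

    Illustrative only: `tile` is a hypothetical block size, not a value
    taken from the paper.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of the output and of the shared dimension.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner loops do the multiply-accumulate inside one tile.
                # In hardware, this is the part that would be unrolled.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c


def naive_matmul(a, b):
    """Reference triple loop, used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]
```

Both functions return the same product; only the order in which elements are visited differs, and that ordering is what an accelerator exploits for data reuse.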
[NLP-33] Natural Language Generation
[Quick read]: This article gives a brief overview of the scope of Natural Language Generation (NLG) and its relation to other subfields of Natural Language Processing (NLP). It discusses where the boundaries of NLG lie with respect to neighbouring areas such as Machine Translation (MT) and Dialog Systems, and how, with the rise of Large Language Models (LLMs), these subfields have converged on similar methods for producing natural language and evaluating automatically generated text. NLG is framed broadly as the study of systems that verbalize information, covering data-to-text, summarisation (text-to-text), and image captioning (image-to-text).
Link: https://arxiv.org/abs/2503.16728
Authors: Emiel van Miltenburg, Chenghua Lin
Affiliations: Tilburg University, Department of Cognition and Communication; University of Manchester, Department of Computer Science
Categories: Computation and Language (cs.CL)
Comments: 3 pages + references. Submitted for publication in the Encyclopedia of Language & Linguistics
Abstract:This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since there is no content selection involved where the system has to determine what to say. Conversely, dialog systems do not typically fall under the header of Natural Language Generation since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.
[NLP-34] CAARMA: Class Augmentation with Adversarial Mixup Regularization
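[Quick read]: This paper addresses the limited class diversity of real-world speaker verification data: speaker datasets often contain too few classes for a model to learn, in a generalizable way, embeddings that cluster same-class instances compactly while keeping classes apart. It proposes CAARMA, a class augmentation framework that generates synthetic classes by mixing data in the embedding space, which expands the number of training classes. To keep the synthetic classes realistic, CAARMA adds an adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. Across multiple speaker verification tasks and other zero-shot comparison-based speech analysis tasks, the authors report a significant improvement of 8% over all baseline models.
Link: https://arxiv.org/abs/2503.16718
Authors: Massa Baali, Xiang Li, Hao Chen, Rita Singh, Bhiksha Raj
Affiliations: Carnegie Mellon University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: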
Abstract:Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8% over all baseline models. Code for CAARMA will be released.
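The abstract does not spell out how the mixing is done, so the following Python sketch only illustrates the general idea of mixup in embedding space. The function names, the fixed mixing weight `lam`, and the pairing of embeddings by position are all assumptions; the adversarial refinement step is omitted entirely.

```python
def mix_embeddings(emb_a, emb_b, lam):
    """Convex combination of two embedding vectors (standard mixup)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(emb_a, emb_b)]


def synthesize_class(class_a, class_b, lam, new_label):
    """Mix embeddings of two real classes into one synthetic class.

    class_a, class_b: lists of embedding vectors from two real speakers.
    new_label: hypothetical identifier for the synthetic speaker.
    """
    return [(mix_embeddings(a, b, lam), new_label)
            for a, b in zip(class_a, class_b)]
```

Every mixed vector carries the same new label, so a classifier sees one additional "speaker" per pair of real speakers, which is how this kind of mixing increases the number of training classes.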
[NLP-35] WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching NAACL2025
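[Quick read]: This paper addresses the poor audio quality that results from applying flow matching directly to neural vocoders. It proposes WaveFM, a reparameterized flow matching model for mel-spectrogram-conditioned speech synthesis. Its key design choice is a mel-conditioned prior distribution in place of a standard Gaussian prior, which reduces unnecessary transportation cost during synthesis. The model also adds auxiliary losses, including a refined multi-resolution STFT loss, to improve audio quality, and uses a tailored consistency distillation method to speed up inference without significantly degrading sample quality. Experiments show that WaveFM outperforms previous diffusion vocoders in both quality and efficiency and supports waveform generation in a single inference step.
Link: https://arxiv.org/abs/2503.16689
Authors: Tianze Luo, Xingchen Miao, Wenbo Duan
Affiliations: Tsinghua University
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted to the main conference of NAACL 2025. The codes are available at this https URL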
Abstract:Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without degrading sample quality significantly, we introduce a tailored consistency distillation method for WaveFM. Experiment results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
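For readers unfamiliar with flow matching, the sketch below shows the common linear-path formulation in plain Python. It is not WaveFM's reparameterized objective, and the function names are assumptions. A prior sample `x0` is moved toward a data sample `x1` along a straight line, and a network is trained to predict the constant velocity `x1 - x0`.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from prior sample x0 to data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity the model should predict along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target)) / len(target)
```

Under this formulation, a prior sample that already lies close to the target makes `x1 - x0` small. That is one way to read the abstract's remark that a mel-conditioned prior reduces unnecessary transportation cost.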
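[NLP-36] Through the LLM Looking Glass: A Socratic Self-Assessment of Donkeys, Elephants, and Markets
[Quick read]: This paper asks how to detect and mitigate media bias in text generated by large language models (LLMs), especially subtle and subjective ideological bias. Its approach is to measure bias directly through the models' own self-assessment rather than relying on external interpretation. Using two datasets, PoliGen and EconoLex, which cover political and economic discourse respectively, the authors prompt eight widely used LLMs to generate articles and then analyze the ideological preferences the models report. The results show a consistent preference for Democratic over Republican positions across all models; on economic topics, biases vary among Western LLMs, while those developed in China lean more strongly toward socialism.
Link: https://arxiv.org/abs/2503.16674
Authors: Molly Kennedy, Ayyoob Imani, Timo Spinde, Hinrich Schütze
Affiliations: Ludwig-Maximilians-Universität München; Microsoft; University of Göttingen
Categories: Computation and Language (cs.CL)
Comments: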
Abstract:While detecting and avoiding bias in LLM-generated text is becoming increasingly important, media bias often remains subtle and subjective, making it particularly difficult to identify and mitigate. In this study, we assess media bias in LLM-generated content and LLMs’ ability to detect subtle ideological bias. We conduct this evaluation using two datasets, PoliGen and EconoLex, covering political and economic discourse, respectively. We evaluate eight widely used LLMs by prompting them to generate articles and analyze their ideological preferences via self-assessment. By using self-assessment, the study aims to directly measure the models’ biases rather than relying on external interpretations, thereby minimizing subjective judgments about media bias. Our results reveal a consistent preference of Democratic over Republican positions across all models. Conversely, in economic topics, biases vary among Western LLMs, while those developed in China lean more strongly toward socialism.
[NLP-37] Accelerating Antibiotic Discovery with Large Language Models and Knowledge Graphs
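[Quick read]: This paper addresses the high cost, long timelines, and high failure rate of antibiotic discovery, which are made worse by the rediscovery of known compounds. It proposes a large language model (LLM)-based pipeline that acts as an alarm system: it integrates organism and chemical literature into a Knowledge Graph (KG), with taxonomic resolution, synonym handling, and multi-level evidence classification, in order to detect prior evidence of antibiotic activity and prevent costly rediscoveries. According to the authors, the pipeline is effective for evidence reviewing, reduces false negatives, and accelerates decision-making.
Link: https://arxiv.org/abs/2503.16655
Authors: Maxime Delmas, Magdalena Wysocka, Danilo Gusicuma, André Freitas
Affiliations: Idiap Research Institute; National Biomarker Centre (NBC), CRUK Manchester Institute; Department of Computer Science, Univ. of Manchester
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 9 figures, 3 tables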
Abstract:The discovery of novel antibiotics is critical to address the growing antimicrobial resistance (AMR). However, pharmaceutical industries face high costs (over 1 billion), long timelines, and a high failure rate, worsened by the rediscovery of known compounds. We propose an LLM-based pipeline that acts as an alarm system, detecting prior evidence of antibiotic activity to prevent costly rediscoveries. The system integrates organism and chemical literature into a Knowledge Graph (KG), ensuring taxonomic resolution, synonym handling, and multi-level evidence classification. We tested the pipeline on a private list of 73 potential antibiotic-producing organisms, disclosing 12 negative hits for evaluation. The results highlight the effectiveness of the pipeline for evidence reviewing, reducing false negatives, and accelerating decision-making. The KG for negative hits and the user interface for interactive exploration will be made publicly available.
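[NLP-38] Leveraging Large Language Models for Explainable Activity Recognition in Smart Homes: A Critical Evaluation
[Quick read]: This paper addresses the limited flexibility and scalability of existing explanation methods for sensor-based recognition of Activities of Daily Living (ADLs). Existing approaches produce rigid explanations that lack the flexibility of natural language and do not scale. The paper investigates combining Explainable Artificial Intelligence (XAI) with Large Language Models (LLMs), drawing on LLMs' knowledge of human activities to improve explanation generation. It evaluates two uses: (a) LLMs as explainable zero-shot ADL recognition models, which avoids costly labeled data collection, and (b) LLMs that automate explanation generation for existing data-driven XAI approaches when training data is available and the goal is higher recognition rates. The evaluation reports both benefits and challenges of using LLMs for explainable ADL recognition.
Link: https://arxiv.org/abs/2503.16622
Authors: Michele Fiori, Gabriele Civitarese, Priyankar Choudhary, Claudio Bettini
Affiliations: University of Milan
Categories: Computation and Language (cs.CL)
Comments: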
Abstract:Explainable Artificial Intelligence (XAI) aims to uncover the inner reasoning of machine learning models. In IoT systems, XAI improves the transparency of models processing sensor data from multiple heterogeneous devices, ensuring end-users understand and trust their outputs. Among the many applications, XAI has also been applied to sensor-based Activities of Daily Living (ADLs) recognition in smart homes. Existing approaches highlight which sensor events are most important for each predicted activity, using simple rules to convert these events into natural language explanations for non-expert users. However, these methods produce rigid explanations lacking natural language flexibility and are not scalable. With the recent rise of Large Language Models (LLMs), it is worth exploring whether they can enhance explanation generation, considering their proven knowledge of human activities. This paper investigates potential approaches to combine XAI and LLMs for sensor-based ADL recognition. We evaluate if LLMs can be used: a) as explainable zero-shot ADL recognition models, avoiding costly labeled data collection, and b) to automate the generation of explanations for existing data-driven XAI approaches when training data is available and the goal is higher recognition rates. Our critical evaluation provides insights into the benefits and challenges of using LLMs for explainable ADL recognition.
[NLP-39] Classification of User Reports for Detection of Faulty Computer Components using NLP Models: A Case Study
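[Quick read]: This paper addresses a gap in existing computer fault-reporting platforms: they make poor use of textual reports, which prevents users from describing problems in their own words. It proposes using Natural Language Processing (NLP) models to classify user reports and identify the faulty computer component (such as CPU, memory, motherboard, or video card). The authors build a dataset of 341 user reports and report a classification accuracy of 79% on it.
Link: https://arxiv.org/abs/2503.16614
Authors: Maria de Lourdes M. Silva, André L. C. Mendonça, Eduardo R. D. Neto, Iago C. Chaves, Felipe T. Brito, Victor A. E. Farias, Javam C. Machado
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures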
Abstract:Computer manufacturers typically offer platforms for users to report faults. However, there remains a significant gap in these platforms’ ability to effectively utilize textual reports, which impedes users from describing their issues in their own words. In this context, Natural Language Processing (NLP) offers a promising solution, by enabling the analysis of user-generated text. This paper presents an innovative approach that employs NLP models to classify user reports for detecting faulty computer components, such as CPU, memory, motherboard, video card, and more. In this work, we build a dataset of 341 user reports obtained from many sources. Additionally, through extensive experimental evaluation, our approach achieved an accuracy of 79% with our dataset.
[NLP-40] Distributed LLMs and Multimodal Large Language Models: A Survey on Advances, Challenges, and Future Directions
[Quick read]: This survey addresses the scalability, privacy, and multimodal challenges of language models (LMs) in distributed computing settings. It reviews distributed solutions for several kinds of LMs: large language models (LLMs), vision language models (VLMs), multimodal LLMs (MLLMs), and small language models (SLMs). It covers advances in MLLMs, which handle and integrate multiple data modalities (e.g., text, images, and audio), across distributed training, inference, fine-tuning, and deployment. It also categorizes the literature by six primary focus areas of decentralization, identifies gaps in current methods, and outlines future research directions for making distributed LMs more robust and more widely applicable.
Link: https://arxiv.org/abs/2503.16585
Authors: Hadi Amini, Md Jueal Mia, Yasaman Saadati, Ahmed Imteaj, Seyedsina Nabavirazavi, Urmish Thakker, Md Zarif Hossain, Awal Ahmed Fime, S.S. Iyengar
Affiliations: Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA; Security, Optimization, and Learning for InterDependent networks laboratory (solid lab), Florida International University, Miami, FL, USA; School of Computing, Southern Illinois University, Carbondale, IL, USA; Deep Learning Research, SambaNova Systems, Palo Alto, CA, USA
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Language models (LMs) are machine learning models designed to predict linguistic patterns by estimating the probability of word sequences based on large-scale datasets, such as text. LMs have a wide range of applications in natural language processing (NLP) tasks, including autocomplete and machine translation. Although larger datasets typically enhance LM performance, scalability remains a challenge due to constraints in computational power and resources. Distributed computing strategies offer essential solutions for improving scalability and managing the growing computational demand. Further, the use of sensitive datasets in training and deployment raises significant privacy concerns. Recent research has focused on developing decentralized techniques to enable distributed training and inference while utilizing diverse computational resources and enabling edge AI. This paper presents a survey on distributed solutions for various LMs, including large language models (LLMs), vision language models (VLMs), multimodal LLMs (MLLMs), and small language models (SLMs). While LLMs focus on processing and generating text, MLLMs are designed to handle multiple modalities of data (e.g., text, images, and audio) and to integrate them for broader applications. To this end, this paper reviews key advancements across the MLLM pipeline, including distributed training, inference, fine-tuning, and deployment, while also identifying the contributions, limitations, and future areas of improvement. Further, it categorizes the literature based on six primary focus areas of decentralization. Our analysis describes gaps in current methodologies for enabling distributed solutions for LMs and outlines future research directions, emphasizing the need for novel solutions to enhance the robustness and applicability of distributed LMs.
[NLP-41] Investigating Retrieval-Augmented Generation in Quranic Studies: A Study of 13 Open-Source Large Language Models
[Quick read]: This paper addresses hallucinations in large language models (LLMs) on sensitive, domain-specific tasks such as question answering in Quranic studies, where responses that deviate from authoritative sources undermine reliability in religious contexts. The approach uses Retrieval-Augmented Generation (RAG) over a descriptive dataset of the 114 Quranic surahs (their meanings, historical context, and qualities), so that the model can gather relevant knowledge before responding. The study evaluates 13 open-source LLMs of large, medium, and small size with human-rated context relevance, answer faithfulness, and answer relevance. Large models consistently outperform smaller ones, although the small Llama3.2:3b scores well on faithfulness (4.619) and relevance (4.857), which points to the promise of well-optimized smaller architectures.
Link: https://arxiv.org/abs/2503.16581
Authors: Zahra Khalila, Arbi Haza Nasution, Winda Monika, Aytug Onan, Yohei Murakami, Yasir Bin Ismail Radi, Noor Mohammad Osmani
Affiliations: Department of Informatics Engineering, Universitas Islam Riau, Pekanbaru 28284, Indonesia; Department of Library Information, Universitas Lancang Kuning, Riau 28266, Indonesia; Department of Computer Engineering, College of Engineering and Architecture, Izmir Katip Celebi University, Izmir, 35620 Turkey; Faculty of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan; Faculty of Al-Quran & Sunnah, Universiti Islam Antarabangsa Tuanku Syed Sirajuddin, Kuala Perlis, Perlis 02000, Malaysia; Department of Qur’an and Sunnah Studies, AHAS KIRKHS, International Islamic University Malaysia, Malaysia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, keywords: Large-language-models; retrieval-augmented generation; question answering; Quranic studies; Islamic teachings
Abstract:Accurate and contextually faithful responses are critical when applying large language models (LLMs) to sensitive and domain-specific tasks, such as answering queries related to quranic studies. General-purpose LLMs often struggle with hallucinations, where generated responses deviate from authoritative sources, raising concerns about their reliability in religious contexts. This challenge highlights the need for systems that can integrate domain-specific knowledge while maintaining response accuracy, relevance, and faithfulness. In this study, we investigate 13 open-source LLMs categorized into large (e.g., Llama3:70b, Gemma2:27b, QwQ:32b), medium (e.g., Gemma2:9b, Llama3:8b), and small (e.g., Llama3.2:3b, Phi3:3.8b). A Retrieval-Augmented Generation (RAG) is used to make up for the problems that come with using separate models. This research utilizes a descriptive dataset of Quranic surahs including the meanings, historical context, and qualities of the 114 surahs, allowing the model to gather relevant knowledge before responding. The models are evaluated using three key metrics set by human evaluators: context relevance, answer faithfulness, and answer relevance. The findings reveal that large models consistently outperform smaller models in capturing query semantics and producing accurate, contextually grounded responses. The Llama3.2:3b model, even though it is considered small, does very well on faithfulness (4.619) and relevance (4.857), showing the promise of smaller architectures that have been well optimized. This article examines the trade-offs between model size, computational efficiency, and response quality while using LLMs in domain-specific applications.
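The retrieval step of a RAG pipeline can be sketched in a few lines of Python. This is a toy illustration, not the study's setup: it ranks passages by bag-of-words cosine similarity instead of a learned embedding model, and the function names and prompt wording are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, top_k=1):
    """Return the top_k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(passages,
                    key=lambda p: cosine(q, Counter(tokenize(p))),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages, top_k=1):
    """Prepend the retrieved context to the question (hypothetical wording)."""
    context = "\n".join(retrieve(query, passages, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The string returned by `build_prompt` would then be sent to the language model, so that the answer is grounded in the retrieved description rather than in the model's parametric memory alone.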
[NLP-42] SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors
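[Quick read]: This paper addresses the performance gaps of voice technologies for aging populations, which stem from training data that fails to capture elderly-specific vocal characteristics such as presbyphonia and dialectal variation. Existing elderly speech datasets contain little data from super-aged speakers (aged 75 and above), and their recording styles and annotation dimensions are overly simple. The paper introduces SeniorTalk, a carefully annotated Chinese spoken dialogue dataset with 55.53 hours of speech from 101 natural conversations involving 202 participants, balanced across gender, region, and age. Its multi-dimensional annotations support a wide range of speech tasks, and the authors report experiments on speaker verification, speaker diarization, speech recognition, and speech editing.
Link: https://arxiv.org/abs/2503.16578
Authors: Yang Chen, Hui Wang, Shiyao Wang, Junyang Chen, Jiabei He, Jiaming Zhou, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Affiliations: College of Computer Science, Nankai University; Beijing Academy of Artificial Intelligence
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: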
Abstract:While voice technologies increasingly serve aging populations, current systems exhibit significant performance gaps due to inadequate training data capturing elderly-specific vocal characteristics like presbyphonia and dialectal variations. The limited data available on super-aged individuals in existing elderly speech datasets, coupled with overly simple recording styles and annotation dimensions, exacerbates this issue. To address the critical scarcity of speech data from individuals aged 75 and above, we introduce SeniorTalk, a carefully annotated Chinese spoken dialogue dataset. This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants, ensuring a strategic balance across gender, region, and age. Through detailed annotation across multiple dimensions, it can support a wide range of speech tasks. We perform extensive experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks, offering crucial insights for the development of speech technologies targeting this age group.
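[NLP-43] Extract, Match, and Score: An Evaluation Paradigm for Long Question-context-answer Triplets in Financial Analysis
[Quick read]: This paper addresses the weakness of conventional metrics for evaluating long-form output from Large Language Models (LLMs). Such metrics remain applicable to shorter texts, but their efficacy drops when judging long-form answers, especially in real-world settings such as financial analysis or regulatory compliance. Using a practical financial use case, the paper illustrates applications that handle “long question-context-answer triplets”, constructs a real-world financial dataset of such triplets, and demonstrates the inadequacies of traditional metrics. It then proposes the Extract, Match, and Score (EMS) evaluation approach, tailored to long-form LLM outputs, as a reliable way for practitioners to assess LLM performance in complex real-world scenarios.
Link: https://arxiv.org/abs/2503.16575
Authors: Bo Hu, Han Yuan, Vlad Pandelea, Wuqiong Luo, Yingzhu Zhao, Zheng Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: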
Abstract:The rapid advancement of large language models (LLMs) has sparked widespread adoption across diverse applications, making robust evaluation frameworks crucial for assessing their performance. While conventional evaluation metrics remain applicable for shorter texts, their efficacy diminishes when evaluating the quality of long-form answers. This limitation is particularly critical in real-world scenarios involving extended questions, extensive context, and long-form answers, such as financial analysis or regulatory compliance. In this paper, we use a practical financial use case to illustrate applications that handle “long question-context-answer triplets”. We construct a real-world financial dataset comprising long triplets and demonstrate the inadequacies of traditional metrics. To address this, we propose an effective Extract, Match, and Score (EMS) evaluation approach tailored to the complexities of long-form LLMs’ outputs, providing practitioners with a reliable methodology for assessing LLMs’ performance in complex real-world scenarios.
[NLP-44] Gene42: Long-Range Genomic Foundation Model With Dense Attention
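[Quick read]: This paper addresses the modeling of very long genomic sequences (up to 192,000 base pairs) at single-nucleotide resolution. It introduces Gene42, a family of Genomic Foundation Models (GFMs) that use a decoder-only (LLaMA-style) architecture with dense self-attention. The models are first trained on fixed-length sequences of 4,096 bp and then continually pretrained to extend the context length step by step to 192,000 bp. The authors describe Gene42 as the first dense-attention model to handle such long contexts in genomics, in contrast to state-space models that often rely on convolutional operators. They report low perplexity, high reconstruction accuracy, and state-of-the-art results on biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification.
Link: https://arxiv.org/abs/2503.16565
Authors: Kirill Vishniakov, Boulbaba Ben Amor, Engin Tekin, Nancy A. ElNaker, Karthik Viswanathan, Aleksandr Medvedev, Aahan Singh, Maryam Nadeem, Mohammad Amaan Sayeed, Praveenkumar Kanithi, Tiago Magalhaes, Natalia Vassilieva, Dwarikanath Mahapatra, Marco Pimentel, Shadab Khan
Affiliations: M42; Inception Institute of Artificial Intelligence, Abu Dhabi, UAE; Cerebras Systems, Sunnyvale, CA, USA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: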
Abstract:We introduce Gene42, a novel family of Genomic Foundation Models (GFMs) designed to manage context lengths of up to 192,000 base pairs (bp) at a single-nucleotide resolution. Gene42 models utilize a decoder-only (LLaMA-style) architecture with a dense self-attention mechanism. Initially trained on fixed-length sequences of 4,096 bp, our models underwent continuous pretraining to extend the context length to 192,000 bp. This iterative extension allowed for the comprehensive processing of large-scale genomic data and the capture of intricate patterns and dependencies within the human genome. Gene42 is the first dense attention model capable of handling such extensive long context lengths in genomics, challenging state-space models that often rely on convolutional operators among other mechanisms. Our pretrained models exhibit notably low perplexity values and high reconstruction accuracy, highlighting their strong ability to model genomic data. Extensive experiments on various genomic benchmarks have demonstrated state-of-the-art performance across multiple tasks, including biotype classification, regulatory region identification, chromatin profiling prediction, variant pathogenicity prediction, and species classification. The models are publicly available at this http URL.
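Since the abstract reports perplexity without defining it, here is the standard definition as a short Python sketch: the exponential of the mean negative log-likelihood the model assigns to each observed token. The function name and the input format (a list of per-token probabilities) are assumptions, and no values from the paper are used.

```python
import math


def perplexity(token_probs):
    """exp(mean negative log-likelihood) over the observed tokens.

    token_probs: probability the model assigned to each true token,
    e.g. to each nucleotide in a sequence.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that guesses uniformly over the four nucleotides assigns probability 0.25 to every token and therefore has a perplexity of about 4. Values below that baseline indicate that the model has learned some structure in the sequence.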
[NLP-45] Chem42: a Family of chemical Language Models for Target-aware Ligand Generation
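[Quick read]: This paper addresses a limitation of conventional chemical Language Models (cLMs) in drug discovery: most do not incorporate target-specific information, which restricts their ability to generate novel ligands de novo. It introduces Chem42, a family of generative chemical Language Models that combines atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, to form a cross-modal representation of molecular structures, interactions, and binding patterns. This enables structurally valid, synthetically accessible ligands with improved target specificity. In evaluations across diverse protein targets, the authors report that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By narrowing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline.
Link: https://arxiv.org/abs/2503.16563
Authors: Aahan Singh, Engin Tekin, Maryam Nadeem, Nancy A. ElNaker, Mohammad Amaan Sayeed, Natalia Vassilieva, Boulbaba Ben Amor
Affiliations: Inception Institute of Artificial Intelligence; Cerebras Systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: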
Abstract:Revolutionizing drug discovery demands more than just understanding molecular interactions - it requires generative models that can design novel ligands tailored to specific biological targets. While chemical Language Models (cLMs) have made strides in learning molecular properties, most fail to incorporate target-specific insights, restricting their ability to drive de-novo ligand generation. Chem42, a cutting-edge family of generative chemical Language Models, is designed to bridge this gap. By integrating atomic-level interactions with multimodal inputs from Prot42, a complementary protein Language Model, Chem42 achieves a sophisticated cross-modal representation of molecular structures, interactions, and binding patterns. This innovative framework enables the creation of structurally valid, synthetically accessible ligands with enhanced target specificity. Evaluations across diverse protein targets confirm that Chem42 surpasses existing approaches in chemical validity, target-aware design, and predicted binding affinity. By reducing the search space of viable drug candidates, Chem42 could accelerate the drug discovery pipeline, offering a powerful generative AI tool for precision medicine. Our Chem42 models set a new benchmark in molecule property prediction, conditional molecule generation, and target-aware ligand design. The models are publicly available at this http URL.
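[NLP-46] FutureGen: LLM-RAG Approach to Generate the Future Work of Scientific Article
[Quick read]: This paper addresses how to generate future-work suggestions for scientific articles. It combines Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), generating suggestions from key sections of an article together with related papers. An LLM feedback mechanism refines the generated content, and an LLM-as-a-judge approach is used for evaluation. The authors report that the RAG-based approach with LLM feedback outperforms the other methods on qualitative and quantitative metrics, and they run a human evaluation to assess the LLM as an extractor and as a judge.
Link: https://arxiv.org/abs/2503.16561
Authors: Ibrahim Al Azher, Miftahul Jannat Mokarrama, Zhishuai Guo, Sagnik Ray Choudhury, Hamed Alhoori
Affiliations: Northern Illinois University, DeKalb, IL, USA; University of North Texas, Denton, TX, USA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures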
Abstract:The future work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. This section serves as a valuable resource for early-career researchers seeking unexplored areas and experienced researchers looking for new projects or collaborations. In this study, we generate future work suggestions from key sections of a scientific article alongside related papers and analyze how the trends have evolved. We experimented with various Large Language Models (LLMs) and integrated Retrieval-Augmented Generation (RAG) to enhance the generation process. We incorporate a LLM feedback mechanism to improve the quality of the generated content and propose an LLM-as-a-judge approach for evaluation. Our results demonstrated that the RAG-based approach with LLM feedback outperforms other methods evaluated through qualitative and quantitative metrics. Moreover, we conduct a human evaluation to assess the LLM as an extractor and judge. The code and dataset for this project are here, code: HuggingFace
[NLP-47] Explainable AI Components for Narrative Map Extraction ECIR2025
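[Quick read]: This paper addresses how to establish user trust in increasingly complex narrative extraction systems, where interpretable and explainable outputs matter more and more. It evaluates an Explainable Artificial Intelligence (XAI) system for narrative map extraction that explains its output at several levels of abstraction: topical clusters for low-level document relationships, connection explanations for event relationships, and high-level structure explanations for overall narrative patterns. In a user study with 10 participants who examined narratives from the 2021 Cuban protests, the explanations increased users' trust in the system's decisions, with connection explanations and important event detection being particularly effective, and the multi-level approach helped users develop appropriate trust in the system's narrative extraction capabilities.
Link: https://arxiv.org/abs/2503.16554
Authors: Brian Keith, Fausto German, Eric Krokos, Sarah Joseph, Chris North
Affiliations: Universidad Católica del Norte; Virginia Tech; U.S. Government
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Text2Story Workshop 2025 at ECIR 2025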
Abstract:As narrative extraction systems grow in complexity, establishing user trust through interpretable and explainable outputs becomes increasingly critical. This paper presents an evaluation of an Explainable Artificial Intelligence (XAI) system for narrative map extraction that provides meaningful explanations across multiple levels of abstraction. Our system integrates explanations based on topical clusters for low-level document relationships, connection explanations for event relationships, and high-level structure explanations for overall narrative patterns. In particular, we evaluate the XAI system through a user study involving 10 participants that examined narratives from the 2021 Cuban protests. The analysis of results demonstrates that the explanations increased participants' trust in the system’s decisions, with connection explanations and important event detection proving particularly effective at building user confidence. Survey responses indicate that the multi-level explanation approach helped users develop appropriate trust in the system’s narrative extraction capabilities. This work advances the state-of-the-art in explainable narrative extraction while providing practical insights for developing reliable narrative extraction systems that support effective human-AI collaboration.
[NLP-48] A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models
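[Quick read]: This paper addresses the limited ability of existing individual mobility prediction models based on Large Language Models (LLMs) to adapt to different cities and to users with diverse contexts. These models are usually trained on specific datasets only, or rely on a single carefully designed prompt, so they generalize poorly to other settings. To close this gap, the paper proposes a unified fine-tuning framework for training a foundational individual mobility prediction model on top of an open-source LLM. Extensive experiments on six real-world mobility datasets show that the proposed model outperforms existing deep-learning-based and LLM-based methods in both prediction accuracy and transferability.
Link: https://arxiv.org/abs/2503.16553
Authors: Zhenlin Qin,Leizhen Wang,Francisco Camara Pereira,Zhenliang Ma
Affiliations: Department of Civil and Architectural Engineering, KTH Royal Institute of Technology, Sweden; Department of Data Science and Artificial Intelligence, Monash University, Australia; Department of Technology, Management and Economics, Technical University of Denmark, Denmark
Categories: Computation and Language (cs.CL)
Comments: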
Abstract:Large Language Models (LLMs) are widely applied to domain-specific tasks due to their massive general knowledge and remarkable inference capacities. Current studies on LLMs have shown immense potential in applying LLMs to model individual mobility prediction problems. However, most LLM-based mobility prediction models only train on specific datasets or use single well-designed prompts, leading to difficulty in adapting to different cities and users with diverse contexts. To fill these gaps, this paper proposes a unified fine-tuning framework to train a foundational open source LLM-based mobility prediction model. We conducted extensive experiments on six real-world mobility datasets to validate the proposed model. The results showed that the proposed model achieved the best performance in prediction accuracy and transferability over state-of-the-art models based on deep learning and LLMs.
[NLP-49] Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization
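[Quick read]: This paper tackles the generalization and robustness challenges of neural network language models (LMs). Most current work improves either generalization or robustness in isolation, and methods that handle both at once are lacking. The paper proposes a bi-stage optimization framework, UEGR (Uniformly Enhancing Generalization and Robustness). In the forward propagation stage, adaptive dropout generates diverse sub-models and enriches the output probability distributions of adversarial samples, and the JS divergence of these distributions is combined with adversarial losses to reinforce output stability. In the backward propagation stage, parameter saliency scores are computed and only the most critical parameters are updated, which minimizes unnecessary deviations and consolidates the model's resilience. Theoretical analysis shows that the framework limits sensitivity to input perturbations through gradient regularization and flattens the loss landscape through selective parameter updates, improving both generalization and robustness. Experiments on 13 publicly available language datasets show that the method significantly outperforms existing methods and reaches state-of-the-art performance.
Link: https://arxiv.org/abs/2503.16550
Authors: Yudao Sun,Juan Yin,Juan Zhao,Fan Zhang,Yongheng Liu,Hongji Chen
Affiliations: Department of New Networks, Peng Cheng Laboratory, Shenzhen 518055, China; Department of Industrial Systems Engineering and Management, National University of Singapore; School of Data Science and Artificial Intelligence, Chang’an University, Shaanxi, 710064, China
Categories: Computation and Language (cs.CL)
Comments: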
Abstract:Neural network language models (LMs) are confronted with significant challenges in generalization and robustness. Currently, many studies focus on improving either generalization or robustness in isolation, without methods addressing both aspects simultaneously, which presents a significant challenge in developing LMs that are both robust and generalized. In this paper, we propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of LMs, termed UEGR. Specifically, during the forward propagation stage, we enrich the output probability distributions of adversarial samples by adaptive dropout to generate diverse sub models, and incorporate JS divergence and adversarial losses of these output distributions to reinforce output stability. During backward propagation stage, we compute parameter saliency scores and selectively update only the most critical parameters to minimize unnecessary deviations and consolidate the model’s resilience. Theoretical analysis shows that our framework includes gradient regularization to limit the model’s sensitivity to input perturbations and selective parameter updates to flatten the loss landscape, thus improving both generalization and robustness. The experimental results show that our method significantly improves the generalization and robustness of LMs compared to other existing methods across 13 publicly available language datasets, achieving state-of-the-art (SOTA) performance.
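The JS-divergence term mentioned in the abstract can be illustrated with a minimal sketch. It assumes two sub-models have already produced class-probability vectors for the same adversarial sample; the function names and the example distributions are hypothetical, and this is not the authors' implementation.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical output distributions from two dropout sub-models
# evaluated on the same adversarial sample.
sub_model_a = [0.7, 0.2, 0.1]
sub_model_b = [0.5, 0.3, 0.2]

# A small value means the sub-models agree, i.e. the output is stable.
consistency_penalty = js_divergence(sub_model_a, sub_model_b)
```

In a training loop, a penalty of this kind would presumably be added to the adversarial loss with some weighting; the abstract does not state the weighting, so none is assumed here.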
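[NLP-50] EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
[Quick read]: Existing research has studied the task-solving and social-interaction abilities of embodied agents but has overlooked whether such agents can understand empathetic needs and provide human-like empathetic support. To address this, the paper introduces EmpathyAgent, the first benchmark for evaluating and enhancing agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and defines three different challenges. The paper also designs an empathy-specific evaluation suite that systematically assesses the agent's empathy process. Fine-tuning Llama3-8B on EmpathyAgent shows potential for improving empathetic behavior. The authors hope that a standard benchmark will advance research on empathetic embodied agents. Code and data are publicly available.
Link: https://arxiv.org/abs/2503.16545
Authors: Xinyan Chen,Jiaxin Ge,Hongming Dai,Qiang Zhou,Qiuxuan Feng,Jingtong Hu,Yizhou Wang,Jiaming Liu,Shanghang Zhang
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: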
Abstract:Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents’ task-solving and social-interaction abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents’ empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three different challenges. To systematically evaluate the agents’ empathetic actions, we propose an empathy-specific evaluation suite that evaluates the agents’ empathy process. We benchmark current models and find that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at this https URL.
[NLP-51] Causal Discovery and Counterfactual Reasoning to Optimize Persuasive Dialogue Policies
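[Quick read]: This paper addresses the difficulty existing dialogue systems have in adapting to dynamically evolving user states, which limits how effectively they persuade. The key to the solution is a new method that combines causal discovery and counterfactual reasoning. Specifically, the paper uses the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors that influence system responses, and these inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances. A Dueling Double Deep Q-Network (D3QN) then uses the counterfactual data to optimize the policy for selecting system utterances. Experiments on the PersuasionForGood dataset show that the method improves persuasion outcomes over baseline methods, supporting the value of causal discovery for counterfactual reasoning and reinforcement learning policy optimization.
Link: https://arxiv.org/abs/2503.16544
Authors: Donghuo Zeng,Roberto Legaspi,Yuewen Sun,Xinshuai Dong,Kazushi Ikeda,Peter Spirtes,Kun Zhang
Affiliations: Human-Centered AI Labs, KDDI Research, Inc.; Department of Philosophy, Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 8 figures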
Abstract:Tailoring persuasive conversations to users leads to more effective persuasion. However, existing dialogue systems often struggle to adapt to dynamically evolving user states. This paper presents a novel method that leverages causal discovery and counterfactual reasoning for optimizing system persuasion capability and outcomes. We employ the Greedy Relaxation of the Sparsest Permutation (GRaSP) algorithm to identify causal relationships between user and system utterance strategies, treating user strategies as states and system strategies as actions. GRaSP identifies user strategies as causal factors influencing system responses, which inform Bidirectional Conditional Generative Adversarial Networks (BiCoGAN) in generating counterfactual utterances for the system. Subsequently, we use the Dueling Double Deep Q-Network (D3QN) model to utilize counterfactual data to determine the best policy for selecting system utterances. Our experiments with the PersuasionForGood dataset show measurable improvements in persuasion outcomes using our approach over baseline methods. The observed increase in cumulative rewards and Q-values highlights the effectiveness of causal discovery in enhancing counterfactual reasoning and optimizing reinforcement learning policies for online dialogue systems.
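The two ideas behind the D3QN name can be sketched in a few lines: the dueling head combines a state value with mean-centered advantages, and the double-DQN target lets the online network choose the next action while the target network scores it. The sketch below uses plain lists in place of network outputs; all function names and numbers are illustrative assumptions, not taken from the paper.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

def double_dqn_target(reward, gamma, next_q_online, next_q_target, done):
    """Double-DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]

# Hypothetical example: one user-strategy state, three system strategies.
q_values = dueling_q_values(state_value=1.0, advantages=[1.0, 2.0, 3.0])
target = double_dqn_target(
    reward=1.0,
    gamma=0.5,
    next_q_online=[0.1, 0.9],   # online network prefers action 1
    next_q_target=[4.0, 2.0],   # target network values action 1 at 2.0
    done=False,
)
```

Decoupling action selection from action evaluation is the standard motivation for double DQN, since it reduces the overestimation that a single max over noisy Q-values tends to produce.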
[NLP-52] Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models
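[Quick read]: This paper addresses the reliability challenge that hallucinations in generative AI pose for multilingual applications, in particular the lack of cross-lingual consistency evaluation for Large Language Models (LLMs). Existing hallucination detection benchmarks focus on English and a few other widely spoken languages and do not comprehensively assess how model performance varies across diverse linguistic contexts. To fill this gap, the paper proposes Poly-FEVER, a large-scale multilingual fact verification benchmark designed specifically to evaluate hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims in 11 languages, sourced from FEVER, Climate-FEVER, and SciFact, and is the first large-scale dataset tailored to analyzing hallucination patterns across languages. The key contribution is that this benchmark reveals how topic distribution and web resource availability affect hallucination frequency and identifies language-specific biases that affect model accuracy, supporting systematic cross-lingual evaluation of LLMs and the development of more reliable, more inclusive AI systems.
Link: https://arxiv.org/abs/2503.16541
Authors: Hanzhi Zhang,Sumera Anjum,Heng Fan,Weijian Zheng,Yan Huang,Yunhe Feng
Affiliations: University of North Texas; Argonne National Laboratory
Categories: Computation and Language (cs.CL)
Comments: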
Abstract:Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages, lacking the breadth to assess inconsistencies in model performance across diverse linguistic contexts. To address this gap, we introduce Poly-FEVER, a large-scale multilingual fact verification benchmark specifically designed for evaluating hallucination detection in LLMs. Poly-FEVER comprises 77,973 labeled factual claims spanning 11 languages, sourced from FEVER, Climate-FEVER, and SciFact. It provides the first large-scale dataset tailored for analyzing hallucination patterns across languages, enabling systematic evaluation of LLMs such as ChatGPT and the LLaMA series. Our analysis reveals how topic distribution and web resource availability influence hallucination frequency, uncovering language-specific biases that impact model accuracy. By offering a multilingual benchmark for fact verification, Poly-FEVER facilitates cross-linguistic comparisons of hallucination detection and contributes to the development of more reliable, language-inclusive AI systems. The dataset is publicly available to advance research in responsible AI, fact-checking methodologies, and multilingual NLP, promoting greater transparency and robustness in LLM performance. The proposed Poly-FEVER is available at: this https URL.
[NLP-53] Do Multimodal Large Language Models Understand Welding?
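[Quick read]: This paper evaluates the performance of Multimodal LLMs (MLLMs) in skilled production work, focusing on welding. Using a dataset of real-world and online weld images annotated by a domain expert, it studies how two state-of-the-art MLLMs assess weld acceptability in three contexts: RV & Marine, Aeronautical, and Farming. Both models perform better on online images, possibly because of prior exposure or memorization, yet they also perform relatively well on unseen real-world weld images. The paper further introduces WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to reduce hallucinations and improve reasoning. WeldPrompt improves recall in some contexts but performs inconsistently in others. These results expose both the limitations and the potential of MLLMs in high-stakes technical domains and underline the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies for model reliability. The study opens new research directions for multimodal learning in industrial applications. The key contribution is the proposal and evaluation of WeldPrompt.
Link: https://arxiv.org/abs/2503.16537
Authors: Grigorii Khvatskii,Yong Suk Lee,Corey Angst,Maria Gibbs,Robert Landers,Nitesh V. Chawla
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages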
Abstract:This paper examines the performance of Multimodal LLMs (MLLMs) in skilled production work, with a focus on welding. Using a novel data set of real-world and online weld images, annotated by a domain expert, we evaluate the performance of two state-of-the-art MLLMs in assessing weld acceptability across three contexts: RV & Marine, Aeronautical, and Farming. While both models perform better on online images, likely due to prior exposure or memorization, they also perform relatively well on unseen, real-world weld images. Additionally, we introduce WeldPrompt, a prompting strategy that combines Chain-of-Thought generation with in-context learning to mitigate hallucinations and improve reasoning. WeldPrompt improves model recall in certain contexts but exhibits inconsistent performance across others. These results underscore the limitations and potential of MLLMs in high-stakes technical domains and highlight the importance of fine-tuning, domain-specific data, and more sophisticated prompting strategies to improve model reliability. The study opens avenues for further research into multimodal learning in industry applications.
[NLP-54] Word2Minecraft: Generating 3D Game Levels through Large Language Models
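[Quick read]: This paper addresses how to use large language models to generate playable Minecraft game levels from structured stories. The key to the solution is a flexible framework that turns narrative elements such as protagonist goals, antagonist challenges, and environmental settings into game levels with spatial and gameplay constraints. A scaling algorithm maintains spatial consistency while adapting key game elements, which enables dynamic level generation with customizable complexity.
Link: https://arxiv.org/abs/2503.16536
Authors: Shuo Huang,Muhammad Umair Nasir,Steven James,Julian Togelius
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: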
Abstract:We present Word2Minecraft, a system that leverages large language models to generate playable game levels in Minecraft based on structured stories. The system transforms narrative elements, such as protagonist goals, antagonist challenges, and environmental settings, into game levels with both spatial and gameplay constraints. We introduce a flexible framework that allows for the customization of story complexity, enabling dynamic level generation. The system employs a scaling algorithm to maintain spatial consistency while adapting key game elements. We evaluate Word2Minecraft using both metric-based and human-based methods. Our results show that GPT-4-Turbo outperforms GPT-4o-Mini in most areas, including story coherence and objective enjoyment, while the latter excels in aesthetic appeal. We also demonstrate the system’s ability to generate levels with high map enjoyment, offering a promising step forward in the intersection of story generation and game design. We open-source the code at this https URL
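[NLP-55] Gender and content bias in Large Language Models: a case study on Google Gemini 2.0 Flash Experimental
[Quick read]: This paper examines ethical bias in the content moderation of a Large Language Model (LLM), in particular gender bias and differences in the acceptance of violent content. It compares Google's Gemini 2.0 with ChatGPT-4o to assess how their moderation practices differ. The key finding is that Gemini 2.0 makes progress on gender bias, with a substantial rise in acceptance rates for female-specific prompts, but it is also more permissive toward sexual content and keeps relatively high acceptance rates for violent content, including in gender-specific cases. The reduction in some biases may therefore come at the cost of content safety, which makes it debatable whether the change is a real improvement. The paper stresses that continued refinement is needed to reach transparent, fair, and inclusive moderation that does not amplify harmful content.
Link: https://arxiv.org/abs/2503.16534
Authors: Roberto Balestri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: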
Abstract:This study evaluates the biases in Gemini 2.0 Flash Experimental, a state-of-the-art large language model (LLM) developed by Google, focusing on content moderation and gender disparities. By comparing its performance to ChatGPT-4o, examined in a previous work of the author, the analysis highlights some differences in ethical moderation practices. Gemini 2.0 demonstrates reduced gender bias, notably with female-specific prompts achieving a substantial rise in acceptance rates compared to results obtained by ChatGPT-4o. It adopts a more permissive stance toward sexual content and maintains relatively high acceptance rates for violent prompts, including gender-specific cases. Despite these changes, whether they constitute an improvement is debatable. While gender bias has been reduced, this reduction comes at the cost of permitting more violent content toward both males and females, potentially normalizing violence rather than mitigating harm. Male-specific prompts still generally receive higher acceptance rates than female-specific ones. These findings underscore the complexities of aligning AI systems with ethical standards, highlighting progress in reducing certain biases while raising concerns about the broader implications of the model’s permissiveness. Ongoing refinements are essential to achieve moderation practices that ensure transparency, fairness, and inclusivity without amplifying harmful content.
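[NLP-56] From Patient Consultations to Graphs: Leveraging LLMs for Patient Journey Knowledge Graph Construction
[Quick read]: This paper addresses the fragmentation of existing healthcare data systems, which struggle to provide a holistic view of patient journeys and thereby hinder coordinated care and personalized interventions. The key solution is to build Patient Journey Knowledge Graphs (PJKGs), which integrate diverse patient information into a unified, structured representation. Specifically, the paper proposes a method that uses Large Language Models (LLMs) to process and structure both formal clinical documentation and unstructured patient-provider conversations, producing knowledge graphs that capture temporal and causal relationships among clinical encounters. The approach supports advanced temporal reasoning and yields personalized care insights.
Link: https://arxiv.org/abs/2503.16533
Authors: Hassan S. Al Khatib,Sudip Mittal,Shahram Rahimi,Nina Marhamati,Sean Bozorgzad
Affiliations: Mississippi State University; University of Alabama; Potentia Analytics Inc
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: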
Abstract:The transition towards patient-centric healthcare necessitates a comprehensive understanding of patient journeys, which encompass all healthcare experiences and interactions across the care spectrum. Existing healthcare data systems are often fragmented and lack a holistic representation of patient trajectories, creating challenges for coordinated care and personalized interventions. Patient Journey Knowledge Graphs (PJKGs) represent a novel approach to addressing the challenge of fragmented healthcare data by integrating diverse patient information into a unified, structured representation. This paper presents a methodology for constructing PJKGs using Large Language Models (LLMs) to process and structure both formal clinical documentation and unstructured patient-provider conversations. These graphs encapsulate temporal and causal relationships among clinical encounters, diagnoses, treatments, and outcomes, enabling advanced temporal reasoning and personalized care insights. The research evaluates four LLMs (Claude 3.5, Mistral, Llama 3.1, and ChatGPT-4o) in their ability to generate accurate and computationally efficient knowledge graphs. Results demonstrate that while all models achieved perfect structural compliance, they exhibited variations in medical entity processing and computational efficiency. The paper concludes by identifying key challenges and future research directions. This work contributes to advancing patient-centric healthcare through the development of comprehensive, actionable knowledge graphs that support improved care coordination and outcome prediction.
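[NLP-57] EEG-CLIP: Learning EEG representations from natural language descriptions
[Quick read]: This paper addresses a limitation of existing deep networks for electroencephalogram (EEG) decoding, which are usually trained for a single task such as pathology or gender decoding. It proposes a more general approach that uses the medical reports of clinical EEG recordings to learn a mapping between reports and recordings. This kind of approach was first used in computer vision to match images with their text captions, and later enabled zero-shot decoding from textual class prompts. The key contribution is EEG-CLIP, a contrastive learning framework that aligns EEG time series and their corresponding clinical text descriptions in a shared embedding space, supporting a range of EEG decoding tasks including few-shot and zero-shot settings. Experiments show that EEG-CLIP aligns text and EEG representations nontrivially, which makes it a promising route to general EEG representations that could simplify analysis through zero-shot decoding or through training task-specific models from fewer examples.
Link: https://arxiv.org/abs/2503.16531
Authors: Tidiane Camaret N’dir,Robin Tibor Schirrmeister
Affiliations: Faculty of Engineering, Albert-Ludwigs-Universität Freiburg
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: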
Abstract:Deep networks for electroencephalogram (EEG) decoding are currently often trained to only solve a specific task like pathology or gender decoding. A more general approach leveraging the medical reports of clinical EEG recordings is to learn mappings between medical reports and EEG recordings. This approach was pioneered in the computer vision domain, matching images and their text captions, and subsequently enabled successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework EEG-CLIP that aligns EEG time series and their corresponding clinical text descriptions in a shared embedding space. We investigate its potential for versatile EEG decoding, assessing performance on a range of few-shot and zero-shot settings. Overall, results show that EEG-CLIP manages to nontrivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at this https URL.
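Aligning two modalities in a shared embedding space is commonly done with a symmetric contrastive (CLIP-style) loss over a batch of matched pairs. The following is a minimal pure-Python sketch of that loss under stated assumptions: the embeddings are toy 2-D vectors, the temperature of 0.07 is an illustrative choice, and the function names are hypothetical. It is not the EEG-CLIP code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(eeg_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss: EEG-to-text and text-to-EEG, averaged."""
    eeg = [l2_normalize(v) for v in eeg_embeddings]
    txt = [l2_normalize(v) for v in text_embeddings]
    logits = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt] for e in eeg]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))

# Toy batch: pair i of EEG and text embeddings belongs together.
eeg_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(eeg_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss(eeg_batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each recording is most similar to its own report and high when the pairing is scrambled, which is the signal that pulls matched pairs together during training.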
[NLP-58] Enhancing LLM Generation with Knowledge Hypergraph for Evidence-Based Medicine
[Quick read]: This paper addresses two challenges that Evidence-based Medicine (EBM) faces when applying Large Language Models (LLMs): collecting dispersed evidence, and organizing that evidence effectively to support complex queries. The solution has two parts. First, LLMs gather scattered evidence from multiple sources, and a knowledge-hypergraph-based evidence management model integrates the evidence and captures its intricate relationships. Second, an Importance-Driven Evidence Prioritization (IDEP) algorithm uses the LLM to generate multiple evidence features, each with an importance score, which are then used to rank the evidence and produce the final retrieval results. Experiments show that the method outperforms existing Retrieval-Augmented Generation (RAG) techniques in EBM-related areas such as medical quizzing, hallucination detection, and decision support.
Link: https://arxiv.org/abs/2503.16530
Authors: Chengfeng Dou,Ying Zhang,Zhi Jin,Wenpin Jiao,Haiyan Zhao,Yongqiang Zhao,Zhengwei Tao
Affiliations: School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies (PKU), MOE, China; Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Evidence-based medicine (EBM) plays a crucial role in the application of large language models (LLMs) in healthcare, as it provides reliable support for medical decision-making processes. Although it benefits from current retrieval-augmented generation (RAG) technologies, it still faces two significant challenges: the collection of dispersed evidence and the efficient organization of this evidence to support the complex queries necessary for EBM. To tackle these issues, we propose using LLMs to gather scattered evidence from multiple sources and present a knowledge hypergraph-based evidence management model to integrate this evidence while capturing intricate relationships. Furthermore, to better support complex queries, we have developed an Importance-Driven Evidence Prioritization (IDEP) algorithm that utilizes the LLM to generate multiple evidence features, each with an associated importance score, which are then used to rank the evidence and produce the final retrieval results. Experimental results from six datasets demonstrate that our approach outperforms existing RAG techniques in application domains of interest to EBM, such as medical quizzing, hallucination detection, and decision support. Testsets and the constructed knowledge graph can be accessed at this https URL.
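Ranking evidence by feature importance can be pictured with a small sketch. The abstract says only that features carry importance scores that drive the ranking; the scoring rule below (an importance-weighted sum of per-feature match values), the field names, and the sample data are all assumptions made for illustration and are not the IDEP algorithm itself.

```python
def evidence_score(item):
    """Importance-weighted sum over features (assumed scoring rule).

    Each feature is a dict with a hypothetical 'match' value in [0, 1]
    (how well the evidence satisfies the feature) and an 'importance' weight.
    """
    return sum(f["match"] * f["importance"] for f in item["features"])

def rank_evidence(evidence_items, top_k=2):
    """Return the top_k evidence items by descending score."""
    return sorted(evidence_items, key=evidence_score, reverse=True)[:top_k]

# Hypothetical evidence pool for a single clinical query.
evidence_pool = [
    {"id": "trial_A", "features": [{"match": 0.9, "importance": 0.6},
                                   {"match": 0.5, "importance": 0.4}]},
    {"id": "case_report_B", "features": [{"match": 0.4, "importance": 0.6},
                                         {"match": 0.9, "importance": 0.4}]},
    {"id": "review_C", "features": [{"match": 0.8, "importance": 0.6},
                                    {"match": 0.8, "importance": 0.4}]},
]

top_ids = [item["id"] for item in rank_evidence(evidence_pool)]
```

In the paper's setting both the features and their importance scores come from an LLM; here they are hard-coded so the ranking step can be read in isolation.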
[NLP-59] Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts
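[Quick read]: This paper addresses the insufficient safety evaluation of the DeepSeek-R1 series distilled models in Chinese contexts and the potential adverse effect of distillation on model safety. The key solution is to use CHiSafetyBench, a comprehensive Chinese safety benchmark, to evaluate the safety of the DeepSeek-R1 series distilled models in depth, and then to apply targeted safety enhancements to the vulnerabilities found. The resulting safety-enhanced models are significantly safer while keeping their reasoning capabilities without notable degradation. The enhanced models are open-sourced as a resource for future research on and optimization of DeepSeek models.
Link: https://arxiv.org/abs/2503.16529
Authors: Wenjing Zhang,Xuejiao Lei,Zhaoxiang Liu,Limin Han,Jiaojiao Zhao,Beibei Huang,Zhenhong Long,Junting Guo,Meijuan An,Rongjia Du,Ning Wang,Kai Wang,Shiguo Lian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 21 pages, 13 figures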
Abstract:DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for six distilled models. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at this https URL to serve as a valuable resource for future research and optimization of DeepSeek models.
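[NLP-60] HDLCoRe: A Training-Free Framework for Mitigating Hallucinations in LLM-Generated HDL
[Quick read]: This paper addresses the significant limitations that Large Language Models (LLMs) show when generating Hardware Description Language (HDL) code, where data scarcity leads to hallucinations and incorrect code. The key solution is HDLCoRe, a framework that needs no model fine-tuning and improves HDL generation through prompt engineering and Retrieval-Augmented Generation (RAG). HDLCoRe has two core components. The first is an HDL-aware Chain-of-Thought (CoT) prompting technique with self-verification, which classifies tasks by complexity and type, incorporates domain-specific knowledge, and guides the LLM through step-by-step self-simulation for error correction. The second is a two-stage heterogeneous RAG system, which resolves formatting inconsistencies through key component extraction and retrieves relevant HDL examples efficiently through sequential filtering and re-ranking. On the RTLLM2.0 benchmark, the method performs strongly, significantly reducing hallucinations and improving both syntactic and functional correctness.
Link: https://arxiv.org/abs/2503.16528
Authors: Heng Ping,Shixuan Li,Peiyu Zhang,Anzhe Cheng,Shukai Duan,Nikos Kanakaris,Xiongye Xiao,Wei Yang,Shahin Nazarian,Andrei Irimia,Paul Bogdan
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: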
Abstract:Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, when applied to hardware description languages (HDL), these models exhibit significant limitations due to data scarcity, resulting in hallucinations and incorrect code generation. To address these challenges, we propose HDLCoRe, a training-free framework that enhances LLMs’ HDL generation capabilities through prompt engineering techniques and retrieval-augmented generation (RAG). Our approach consists of two main components: (1) an HDL-aware Chain-of-Thought (CoT) prompting technique with self-verification that classifies tasks by complexity and type, incorporates domain-specific knowledge, and guides LLMs through step-by-step self-simulation for error correction; and (2) a two-stage heterogeneous RAG system that addresses formatting inconsistencies through key component extraction and efficiently retrieves relevant HDL examples through sequential filtering and re-ranking. HDLCoRe eliminates the need for model fine-tuning while substantially improving LLMs’ HDL generation capabilities. Experimental results demonstrate that our framework achieves superior performance on the RTLLM2.0 benchmark, significantly reducing hallucinations and improving both syntactic and functional correctness.
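The "sequential filtering and re-ranking" pattern can be sketched as a cheap first pass that shortlists candidates followed by a finer second pass over the shortlist. In the sketch below, word overlap on descriptions stands in for the filter and Jaccard similarity on extracted port names stands in for the re-ranker; both similarity measures, the record fields, and the example library are assumptions for illustration, not the HDLCoRe retrieval pipeline.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude stand-in)."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity between two iterables, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def two_stage_retrieve(query, examples, keep=2, top_k=1):
    # Stage 1: cheap filter on description word overlap.
    query_tokens = tokenize(query["description"])
    shortlist = sorted(
        examples,
        key=lambda ex: len(query_tokens & tokenize(ex["description"])),
        reverse=True,
    )[:keep]
    # Stage 2: re-rank the shortlist by similarity of extracted key components.
    reranked = sorted(
        shortlist,
        key=lambda ex: jaccard(query["components"], ex["components"]),
        reverse=True,
    )
    return reranked[:top_k]

# Hypothetical HDL example library and query.
library = [
    {"name": "counter8", "description": "8-bit synchronous counter with reset",
     "components": ["clk", "rst", "count"]},
    {"name": "adder", "description": "4-bit ripple carry adder",
     "components": ["a", "b", "sum", "cout"]},
    {"name": "shift_reg", "description": "shift register with synchronous reset",
     "components": ["clk", "rst", "din", "dout"]},
    {"name": "mux", "description": "2-to-1 multiplexer",
     "components": ["sel", "a", "b", "y"]},
]
query = {"description": "4-bit synchronous counter with reset",
         "components": ["clk", "rst", "count"]}

best = [ex["name"] for ex in two_stage_retrieve(query, library)]
```

Running the expensive comparison only on the shortlist is what keeps a two-stage design efficient as the example library grows.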
[NLP-61] LLM Generated Persona is a Promise with a Catch
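[Quick read]: Collecting realistic persona data by traditional methods is expensive, logistically difficult, and constrained by privacy. This paper examines how Large Language Models (LLMs) can generate synthetic personas for more reliable and scalable persona simulation. Current LLM-based persona generation, however, relies on ad hoc and heuristic techniques that lack methodological rigor and simulation precision, which causes systematic biases in downstream tasks. Large-scale experiments show that these biases can lead to significant deviations from real-world outcomes. The paper therefore argues for a rigorous science of persona generation and outlines the methodological innovations, organizational and institutional support, and empirical foundations needed to make LLM-driven persona simulation more reliable and scalable. To support further research, the authors open-source approximately one million generated personas for public access and analysis.
Link: https://arxiv.org/abs/2503.16527
Authors: Ang Li,Haozhe Chen,Hongseok Namkoong,Tianyi Peng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: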
Abstract:The use of large language models (LLMs) to simulate human behavior has gained significant attention, particularly through personas that approximate individual characteristics. Persona-based simulations hold promise for transforming disciplines that rely on population-level feedback, including social science, economic analysis, marketing research, and business operations. Traditional methods to collect realistic persona data face significant challenges. They are prohibitively expensive and logistically challenging due to privacy constraints, and often fail to capture multi-dimensional attributes, particularly subjective qualities. Consequently, synthetic persona generation with LLMs offers a scalable, cost-effective alternative. However, current approaches rely on ad hoc and heuristic generation techniques that do not guarantee methodological rigor or simulation precision, resulting in systematic biases in downstream tasks. Through extensive large-scale experiments including presidential election forecasts and general opinion surveys of the U.S. population, we reveal that these biases can lead to significant deviations from real-world outcomes. Our findings underscore the need to develop a rigorous science of persona generation and outline the methodological innovations, organizational and institutional support, and empirical foundations required to enhance the reliability and scalability of LLM-driven persona simulations. To support further research and development in this area, we have open-sourced approximately one million generated personas, available for public access and analysis at this https URL.
[NLP-62] KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference
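[Quick read]: This paper addresses two main problems of existing Key-Value (KV) cache techniques for inference with Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs): (1) prefix caching requires strict text prefix matching, so cache reuse is coarse-grained; (2) semantic caching can reduce the diversity of responses. To solve them, the paper proposes KVShare, whose key idea is fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. This significantly raises the cache hit rate and reduces GPU resource consumption while maintaining output quality.
Link: https://arxiv.org/abs/2503.16525
Authors: Huan Yang,Renji Zhang,Deyu Zhang
Affiliations: Central South University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: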
Abstract:This paper presents KVShare, a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60%, while maintaining output quality comparable to full computation (no significant degradation in BLEU and Rouge-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
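The contrast with strict prefix matching can be made concrete with a toy reuse planner. The sketch below aligns a cached prompt with a new prompt at the token level using Python's `difflib.SequenceMatcher` and marks which spans could reuse cached entries and which would need recomputation. Exact token matching stands in for the paper's semantic alignment, and the plan format is hypothetical; real KV reuse must also deal with position-dependent attention states, which this sketch ignores.

```python
from difflib import SequenceMatcher

def plan_kv_reuse(cached_tokens, new_tokens):
    """Split the new prompt into spans to reuse from cache and spans to recompute."""
    matcher = SequenceMatcher(a=cached_tokens, b=new_tokens, autojunk=False)
    plan = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Tokens new[j1:j2] match cached[i1:i2]; their KV entries are candidates for reuse.
            plan.append(("reuse", i1, i2, j1, j2))
        elif j2 > j1:
            # 'replace' or 'insert': these new tokens have no cached counterpart.
            plan.append(("recompute", j1, j2))
        # 'delete' spans exist only in the cached prompt and need no work.
    return plan

def reuse_ratio(plan, new_length):
    """Fraction of new-prompt tokens covered by reusable spans."""
    reused = sum(step[4] - step[3] for step in plan if step[0] == "reuse")
    return reused / new_length

cached = "how do i treat a mild fever".split()
new = "how do i treat a severe fever".split()
plan = plan_kv_reuse(cached, new)
ratio = reuse_ratio(plan, len(new))
```

A prefix cache would stop reusing at the first differing token, whereas an edit-based plan like this one can also pick up matching spans that follow the difference.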
[NLP-63] Mind2: Mind-to-Mind Emotional Support System with Bidirectional Cognitive Discourse Analysis
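[Quick read]: This paper addresses two challenges that Emotional Support (ES) systems face in generating effective supportive dialogue: a lack of timely context and a lack of interpretability, both of which undermine credibility and public trust. In response, the paper proposes Mind-to-Mind (Mind2), an ES framework driven by cognitive models that approaches interpretable ES context modeling from a discourse analysis perspective. The key is a dynamic discourse context propagation window, which accommodates context that evolves as the conversation progresses. Mind2 bidirectionally prioritizes details that reflect each speaker's belief about the other speaker, and it integrates Theory-of-Mind, physiological expected utility, and cognitive rationality to extract cognitive knowledge, which improves interpretability. Experiments show that Mind2, trained with only 10% of the available training data, performs competitively with state-of-the-art ES systems.
Link: https://arxiv.org/abs/2503.16523
Authors: Shi Yin Hong,Uttamasha Oyshi,Quan Mai,Gibson Nkhata,Susan Gauch
Affiliations: University of Arkansas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures, and 3 tables; WI-IAT 2024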
Abstract:Emotional support (ES) systems alleviate users’ mental distress by generating strategic supportive dialogues based on diverse user situations. However, ES systems are limited in their ability to generate effective ES dialogues that include timely context and interpretability, hindering them from earning public trust. Driven by cognitive models, we propose Mind-to-Mind (Mind2), an ES framework that approaches interpretable ES context modeling for the ES dialogue generation task from a discourse analysis perspective. Specifically, we perform cognitive discourse analysis on ES dialogues according to our dynamic discourse context propagation window, which accommodates evolving context as the conversation between the ES system and user progresses. To enhance interpretability, Mind2 prioritizes details that reflect each speaker’s belief about the other speaker with bidirectionality, integrating Theory-of-Mind, physiological expected utility, and cognitive rationality to extract cognitive knowledge from ES conversations. Experimental results support that Mind2 achieves competitive performance versus state-of-the-art ES systems while trained with only 10% of the available training data.
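[NLP-64] Not All Personas Are Worth It: Culture-Reflective Persona Data Augmentation
【Quick Read】: This paper addresses the limited cultural diversity and adaptability of existing persona datasets, which reduce their effectiveness for building culturally aware conversational AI systems. The key contribution is a two-step pipeline for generating culture-specific personas, together with KoPersona, a dataset of 200,000 personas designed to capture Korean cultural values, behaviors, and social nuances. A comprehensive evaluation across multiple metrics validates the quality of KoPersona and its relevance to Korean culture. Beyond advancing persona-based research, the approach provides a scalable framework for creating relevant, adapted personas for other languages and cultural contexts.
Link: https://arxiv.org/abs/2503.16520
Authors: Ji-Eun Han,Yoonseok Heo
Affiliations: KT (KT Corporation)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: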
Abstract:Incorporating personas into conversational AI models is crucial for achieving authentic and engaging interactions. However, the cultural diversity and adaptability of existing persona datasets are often overlooked, reducing their efficacy in building culturally aware AI systems. To address this issue, we propose a two-step pipeline for generating culture-specific personas and introduce KoPersona, a dataset comprising 200,000 personas designed to capture Korean cultural values, behaviors, and social nuances. A comprehensive evaluation through various metrics validates the quality of KoPersona and its relevance to Korean culture. This work not only contributes to persona-based research, but also establishes a scalable approach for creating culturally relevant personas adaptable to various languages and cultural contexts.
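[NLP-65] Using LLMs for Automated Privacy Policy Analysis: Prompt Engineering, Fine-Tuning and Explainability
【Quick Read】: This paper studies how large language models (LLMs) can automate privacy policy analysis and how well they perform on concept classification. Traditional machine-learning methods have had some success in this area, but the use of LLMs remains under-studied, and their performance and explainability in particular have not been systematically evaluated.
The key to the solution is combining prompt engineering with LoRA (low-rank adaptation) fine-tuning and evaluating the result comprehensively on four state-of-the-art privacy policy corpora and their taxonomies. LLM-based classifiers built this way significantly outperform other methods and do so consistently across datasets and concepts. The paper also evaluates the explainability of the LLM-based classifiers: scores on all three metrics (completeness, logicality, and comprehensibility) exceed 91.1%, indicating that LLMs improve not only classification performance but also the explainability of detection results.
Link: https://arxiv.org/abs/2503.16516
Authors: Yuxin Chen,Peng Tang,Weidong Qiu,Shujun Li
Affiliations: School of Cyber Science and Engineering, Shanghai Jiao Tong University; Institute of Cyber Security for Society (iCSS), University of Kent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: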
Abstract:Privacy policies are widely used by digital services and often required for legal purposes. Many machine learning based classifiers have been developed to automate detection of different concepts in a given privacy policy, which can help facilitate other automated tasks such as producing a more reader-friendly summary and detecting legal compliance issues. Despite the successful applications of large language models (LLMs) to many NLP tasks in various domains, there is very little work studying the use of LLMs for automated privacy policy analysis; therefore, if and how LLMs can help automate privacy policy analysis remains under-explored. To fill this research gap, we conducted a comprehensive evaluation of LLM-based privacy policy concept classifiers, employing both prompt engineering and LoRA (low-rank adaptation) fine-tuning, on four state-of-the-art (SOTA) privacy policy corpora and taxonomies. Our experimental results demonstrated that combining prompt engineering and fine-tuning can make LLM-based classifiers outperform other SOTA methods, significantly and consistently across privacy policy corpora/taxonomies and concepts. Furthermore, we evaluated the explainability of the LLM-based classifiers using three metrics: completeness, logicality, and comprehensibility. For all three metrics, a score exceeding 91.1% was observed in our evaluation, indicating that LLMs are not only useful to improve the classification performance, but also to enhance the explainability of detection results.
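The abstract names LoRA (low-rank adaptation) without describing it. As a minimal sketch of the general technique, and not of this paper's training setup, the snippet below applies a frozen weight matrix plus a scaled low-rank update to one input vector. The matrices, the rank of 1, and the `alpha` value are toy assumptions chosen only for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x), where r is the LoRA rank."""
    r = len(A)  # A has shape (r, d_in); B has shape (d_out, r)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy 2x2 values)
A = [[0.5, 0.5]]              # trainable down-projection, rank r = 1
B = [[0.0], [0.0]]            # trainable up-projection, zero at initialisation
x = [2.0, 3.0]
y = lora_forward(W, A, B, x, alpha=1.0)  # equals W x while B is all zeros
```

Only `A` and `B` would be trained in this scheme, which is why the number of trainable parameters stays small relative to `W`.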
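[NLP-66] Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science
【Quick Read】: This paper evaluates how well large language models (LLMs) assist with systematic literature review (SLR) tasks, and how accurately they extract evidence to answer specific research questions. The central question is whether LLMs can faithfully reproduce quotes from academic literature and answer research questions correctly. To this end, the authors developed a semantic text highlighting tool to help domain experts review LLM responses, and used two methods to judge response correctness: expert review and the cosine similarity of transformer embeddings. State-of-the-art LLMs reproduced quotes with greater than 95% accuracy and answered research questions with approximately 83% accuracy. The paper also provides evidence that cosine similarity is a valid metric for semantic similarity. The core of the approach is thus the combination of expert knowledge and computational methods to evaluate and improve LLM performance on SLR tasks.
Link: https://arxiv.org/abs/2503.16515
Authors: Lachlan McGinness,Peter Baumgartner
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: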
Abstract:Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers to perform systematic literature reviews (SLR). We evaluate the performance of LLMs for SLR tasks in these case studies. In each, we explore the impact of changing parameters on the accuracy of LLM responses. The LLM was tasked with extracting evidence from chosen academic papers to answer specific research questions. We evaluate the models’ performance in faithfully reproducing quotes from the literature and subject experts were asked to assess the model performance in answering the research questions. We developed a semantic text highlighting tool to facilitate expert review of LLM responses. We found that state of the art LLMs were able to reproduce quotes from texts with greater than 95% accuracy and answer research questions with an accuracy of approximately 83%. We use two methods to determine the correctness of LLM responses: expert review and the cosine similarity of transformer embeddings of LLM and expert answers. The correlation between these methods ranged from 0.48 to 0.77, providing evidence that the latter is a valid metric for measuring semantic similarity. Journal reference: AI 2024: 37th Australasian Joint Conference on Artificial Intelligence, Melbourne, 2024. Related DOI: https://doi.org/10.1007/978-981-96-0348-0_3
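As a rough illustration of the second scoring method mentioned in the abstract, the sketch below computes the cosine similarity between two embedding vectors. The vectors are invented toy values; in practice they would come from a transformer embedding model, which the abstract does not name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of an LLM answer and an expert answer (toy values).
llm_answer = [0.2, 0.7, 0.1]
expert_answer = [0.25, 0.65, 0.05]
score = cosine_similarity(llm_answer, expert_answer)
```

A score near 1 means the two vectors point in nearly the same direction; how well that tracks expert judgment is exactly what the reported correlation of 0.48 to 0.77 measures.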
[NLP-67] Medifact at PerAnsSumm 2025: Leveraging Lightweight Models for Perspective-Specific Summarization of Clinical QA Forums ALT NAACL2025
【Quick Read】: This paper addresses perspective-aware healthcare answer summarization (the PerAnsSumm 2025 challenge), specifically for open-ended community question answering (CQA). The key contribution is a learning framework that needs little labeled data, using a Snorkel-BART-SVM pipeline for classification and summarization. An SVM model is trained with weak supervision via Snorkel to strengthen zero-shot learning; perspective-relevant sentences are extracted and then summarized with a pretrained BART-CNN model. The approach placed 12th in the shared task, demonstrating computational efficiency and contextual accuracy, advancing medical CQA research and contributing to clinical decision support systems.
Link: https://arxiv.org/abs/2503.16513
Authors: Nadia Saeed
Affiliations: National University of Computer and Emerging Sciences (NUCES-FAST)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper was accepted at PerAnsSumm: Perspective-aware Healthcare answer summarization, a shared task organized at the CL4Health workshop colocated with NAACL 2025
Abstract:The PerAnsSumm 2025 challenge focuses on perspective-aware healthcare answer summarization (Agarwal et al., 2025). This work proposes a few-shot learning framework using a Snorkel-BART-SVM pipeline for classifying and summarizing open-ended healthcare community question-answering (CQA). An SVM model is trained with weak supervision via Snorkel, enhancing zero-shot learning. Extractive classification identifies perspective-relevant sentences, which are then summarized using a pretrained BART-CNN model. The approach achieved 12th place among 100 teams in the shared task, demonstrating computational efficiency and contextual accuracy. By leveraging pretrained summarization models, this work advances medical CQA research and contributes to clinical decision support systems.
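To make the weak-supervision step more concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library and is not the author's pipeline; it only mimics the general idea of combining several noisy labeling functions by majority vote. The keyword rules, label names, and example sentences are all hypothetical.

```python
from collections import Counter

ABSTAIN = None


# Hypothetical keyword-based labeling functions; each returns a label or abstains.
def lf_suggestion(sentence):
    return "SUGGESTION" if "you should" in sentence.lower() else ABSTAIN


def lf_experience(sentence):
    s = sentence.lower()
    return "EXPERIENCE" if s.startswith("i ") or " my " in s else ABSTAIN


def lf_cause(sentence):
    return "CAUSE" if "caused by" in sentence.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_suggestion, lf_experience, lf_cause]


def weak_label(sentence):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


sentences = [
    "You should drink more water and rest.",
    "I had the same headache after my surgery.",
    "Migraines are often caused by stress.",
    "The clinic opens at nine.",
]
weak_labels = [weak_label(s) for s in sentences]
```

Sentences that receive a weak label could then serve as training data for a downstream classifier such as the SVM the abstract mentions, while unlabeled ones are simply left out.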
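[NLP-68] Token-Level Uncertainty-Aware Objective for Language Model Post-Training
【Quick Read】: This paper addresses token-level uncertainty in causal language models and proposes to mitigate it by combining two training objectives: masked maximum likelihood (MLE) and self-distillation. The key finding is that masked MLE effectively reduces epistemic uncertainty but is prone to overfitting. To counter this, the authors add self-distillation regularization to improve or maintain performance on out-of-distribution tasks. Experiments show that the combined objective yields significant gains across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while preserving adaptability during post-training. The central contribution is evidence that uncertainty-aware training offers an effective mechanism for improving language model training.
Link: https://arxiv.org/abs/2503.16511
Authors: Tingkai Liu,Ari S. Benjamin,Anthony M. Zador
Affiliations: Cold Spring Harbor Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: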
Abstract:In the current work, we connect token-level uncertainty in causal language modeling to two types of training objectives: 1) masked maximum likelihood (MLE), 2) self-distillation. We show that masked MLE is effective in reducing epistemic uncertainty, and serves as an effective token-level automatic curriculum learning technique. However, masked MLE is prone to overfitting and requires self-distillation regularization to improve or maintain performance on out-of-distribution tasks. We demonstrate significant performance gain via the proposed training objective - combined masked MLE and self-distillation - across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while maintaining adaptability during post-training. Our findings suggest that uncertainty-aware training provides an effective mechanism for enhancing language model training.
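The abstract does not say how tokens are masked or how the two terms are weighted, so the following is only one possible way to combine a masked negative log-likelihood term with a distillation term. The entropy threshold used as the mask, the weight `alpha`, the use of a frozen copy of the model as the teacher, and all probabilities are assumptions made for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def combined_loss(model_probs, teacher_probs, targets, threshold, alpha):
    """Average per-token loss: masked NLL plus alpha * KL(teacher || model)."""
    total = 0.0
    for p, t, y in zip(model_probs, teacher_probs, targets):
        # Assumed masking rule: apply the MLE term only where the model is uncertain.
        mask = 1.0 if entropy(p) > threshold else 0.0
        nll = -math.log(p[y])
        total += mask * nll + alpha * kl_divergence(t, p)
    return total / len(targets)


# Toy next-token distributions over a 3-word vocabulary for two positions.
model_probs = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
teacher_probs = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]  # frozen copy of the model
targets = [0, 1]
loss = combined_loss(model_probs, teacher_probs, targets, threshold=0.5, alpha=1.0)
```

In this toy case the first position is confident and is masked out of the MLE term, and the distillation term is zero because teacher and model agree, so only the second position's negative log-likelihood contributes.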
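[NLP-69] Earthquake Response Analysis with AI
【Quick Read】: This paper addresses earthquake response analysis using social media data, Twitter in particular, to help emergency responders, government agencies, humanitarian organizations, and NGOs improve their disaster response strategies and allocate resources more efficiently. The key to the solution is a machine learning (ML) framework that incorporates natural language processing (NLP) techniques to extract relevant information from tweets posted during earthquakes, including location data for identifying affected areas, to generate severity maps, and to display valuable information through WebGIS.
Link: https://arxiv.org/abs/2503.16509
Authors: Deep Patel,Panthadeep Bhattacharjee,Amit Reza,Priodyuti Pradhan
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 10 pages, 5 figures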
Abstract:A timely and effective response is crucial to minimize damage and save lives during natural disasters like earthquakes. Microblogging platforms, particularly Twitter, have emerged as valuable real-time information sources for such events. This work explores the potential of leveraging Twitter data for earthquake response analysis. We develop a machine learning (ML) framework by incorporating natural language processing (NLP) techniques to extract and analyze relevant information from tweets posted during earthquake events. The approach primarily focuses on extracting location data from tweets to identify affected areas, generating severity maps, and utilizing WebGIS to display valuable information. The insights gained from this analysis can aid emergency responders, government agencies, humanitarian organizations, and NGOs in enhancing their disaster response strategies and facilitating more efficient resource allocation during earthquake events.
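The abstract describes extracting locations from tweets and turning them into severity maps, without giving the method. The sketch below is a deliberately simple stand-in: it matches tokens against a small gazetteer and counts damage-related words per place as a crude severity proxy. The gazetteer contents, the damage vocabulary, and the tweets are invented for illustration, and the paper's framework is likely more sophisticated.

```python
import re
from collections import Counter

# Hypothetical gazetteer mapping place names to (latitude, longitude).
GAZETTEER = {
    "kathmandu": (27.7172, 85.3240),
    "pokhara": (28.2096, 83.9856),
}
DAMAGE_TERMS = {"collapsed", "trapped", "injured", "destroyed"}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def extract_locations(tweet):
    """Return gazetteer place names mentioned in the tweet."""
    return [t for t in tokenize(tweet) if t in GAZETTEER]


def severity_by_location(tweets):
    """Count damage-related terms per mentioned place (a crude severity proxy)."""
    scores = Counter()
    for tweet in tweets:
        damage = sum(1 for t in tokenize(tweet) if t in DAMAGE_TERMS)
        for place in set(extract_locations(tweet)):
            scores[place] += damage
    return scores


# Invented example tweets.
tweets = [
    "Building collapsed in Kathmandu, people trapped",
    "Mild shaking felt in Pokhara",
    "Kathmandu hospital full of injured",
]
scores = severity_by_location(tweets)
```

Pairing each score with the coordinates from the gazetteer gives the kind of point data a WebGIS layer could display.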
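[NLP-70] Scalable Evaluation of Online Moderation Strategies via Synthetic Simulations
【Quick Read】: This paper addresses the lack of large-scale evaluation of alternative online content moderation strategies, which stems mainly from the scarcity of suitable datasets and the difficulty of recruiting human participants (discussants, moderators, and evaluators) for repeated experiments. To address this, the paper proposes running synthetic experiments with large language models (LLMs), bypassing the need for human participation in the initial stages. The key components are SynDisco, an efficient open-source Python framework for simulating hundreds of discussions, and the Virtual Moderation Dataset (VMD), a large dataset of LLM-generated and LLM-annotated discussions. Building on a reinforcement learning (RL) formulation of the problem, the paper proposes a new moderation strategy that significantly outperforms established moderation guidelines and out-of-the-box LLM moderation.
Link: https://arxiv.org/abs/2503.16505
Authors: Dimitris Tsirmpas,Ion Androutsopoulos,John Pavlopoulos
Affiliations: Dept. of Informatics, Athens University of Economics and Business, Greece; Archimedes, Athena Research Center, Greece
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, 6 tables, 9 figures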
Abstract:Despite the ever-growing importance of online moderation, there has been no large-scale study evaluating the effectiveness of alternative moderation strategies. This is largely due to the lack of appropriate datasets, and the difficulty of getting human discussants, moderators, and evaluators involved in multiple experiments. In this paper, we propose a methodology for leveraging synthetic experiments performed exclusively by Large Language Models (LLMs) to initially bypass the need for human participation in experiments involving online moderation. We evaluate six LLM moderation configurations: two currently used real-life moderation strategies (guidelines issued for human moderators for online moderation and real-life facilitation), two baseline strategies (guidelines elicited for LLM alignment work, and LLM moderation with minimal prompting), a baseline with no moderator at all, as well as our own proposed strategy inspired by a Reinforcement Learning (RL) formulation of the problem. We find that our own moderation strategy significantly outperforms established moderation guidelines, as well as out-of-the-box LLM moderation. We also find that smaller LLMs, with less intensive instruction-tuning, can create more varied discussions than larger models. In order to run these experiments, we create and release an efficient, purpose-built, open-source Python framework, dubbed “SynDisco”, to easily simulate hundreds of discussions using LLM user-agents and moderators. Additionally, we release the Virtual Moderation Dataset (VMD), a large dataset of LLM-generated and LLM-annotated discussions, generated by three families of open-source LLMs, accompanied by an exploratory analysis of the dataset.
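[NLP-71] LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data
【Quick Read】: This paper addresses the inefficiency and high cost of traditional survey methods and explores the potential of large language models (LLMs) to simulate human survey responses. The key idea is to use LLMs to create virtual populations that answer survey questions, predicting outcomes comparable to human responses. Comparing LLMs (such as GPT-4o and GPT-3.5) with a traditional Random Forests algorithm on World Values Survey (WVS) data, the study finds that LLMs are competitive overall and, notably, require no additional training data. However, LLMs show biases when predicting responses for certain religious and population groups, while Random Forests perform better when sufficient data is available. The study further finds that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented groups. The takeaway is that reducing LLM bias and reconsidering censorship approaches are key to improving reliability and fairness in public opinion research.
Link: https://arxiv.org/abs/2503.16498
Authors: Enzo Sinacola,Arnault Pachot,Thierry Petit
Affiliations: emotia.com
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted, proceedings of the 17th International Conference on Machine Learning and Computing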
Abstract:Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs (including GPT-4o, GPT-3.5, Claude 3.5-Sonnet, and versions of the Llama and Mistral models), comparing their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. On the other hand, Random Forests demonstrate stronger performance than LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research.
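[NLP-72] Human Preferences for Constructive Interactions in Language Model Alignment
【Quick Read】: This paper asks how individualized and multicultural alignment of language models (LMs) can foster constructive dialogue rather than exacerbate societal divisions. The key to the approach is a dataset of over 7,500 conversations between individuals from 74 countries and 21 large language models (LLMs), which is used to analyze how linguistic attributes linked to constructive interactions are reflected in human preference data, and to reveal how users with different value orientations trade off attributes such as reasoning and curiosity. Users preferred well-reasoned and nuanced responses and rejected those high in personal storytelling, while users who believed that AI should reflect their values placed more weight on curiosity than on reasoning. The paper also finds that users can set the tone of a conversation, because LLMs mirror the linguistic attributes of user queries, including toxicity.
Link: https://arxiv.org/abs/2503.16480
Authors: Yara Kyrychenko,Jon Roozenbeek,Brandon Davidson,Sander van der Linden,Ramit Debnath
Affiliations: University of Cambridge; King’s College London
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 1 Figure, 1 Table, 11 pages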
Abstract:As large language models (LLMs) enter the mainstream, aligning them to foster constructive dialogue rather than exacerbate societal divisions is critical. Using an individualized and multicultural alignment dataset of over 7,500 conversations of individuals from 74 countries engaging with 21 LLMs, we examined how linguistic attributes linked to constructive interactions are reflected in human preference data used for training AI. We found that users consistently preferred well-reasoned and nuanced responses while rejecting those high in personal storytelling. However, users who believed that AI should reflect their values tended to place less preference on reasoning in LLM responses and more on curiosity. Encouragingly, we observed that users could set the tone for how constructive their conversation would be, as LLMs mirrored linguistic attributes, including toxicity, in user queries.
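[NLP-73] Human-Centered AI in Multidisciplinary Medical Discussions: Evaluating the Feasibility of a Chat-Based Approach to Case Assessment
【Quick Read】: This paper explores the feasibility of a human-centered AI chat platform that helps medical specialists from multiple disciplines collaboratively assess complex cases. It focuses on patients with cardiovascular disease and multimorbidity, evaluating simulated cases together with physicians and analyzing the efficiency gains from AI and the quantification of discussion content. The key to the approach is constructing simulated cases from past case reports and medical error reports, and using centrality metrics from a knowledge graph to show the advantage of multidisciplinary over single-physician assessments in knowledge representation. AI-assisted summarization significantly reduced the time required for summarization (by 79.98% on average) while maintaining structured knowledge representation, and the hallucination rate of AI-generated summaries was low (3.62% overall, with an average harmful hallucination rate of 0.49%). The core contribution is evidence that AI-assisted chat-based discussion is feasible as a human-centered approach to multidisciplinary medical decision-making.
Link: https://arxiv.org/abs/2503.16464
Authors: Shinnosuke Sawano,Satoshi Kodera
Affiliations: The University of Tokyo Hospital
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 2 figures, 3 tables, 2 supplemental figures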
Abstract:In this study, we investigate the feasibility of using a human-centered artificial intelligence (AI) chat platform where medical specialists collaboratively assess complex cases. As the target population for this platform, we focus on patients with cardiovascular diseases who are in a state of multimorbidity, that is, suffering from multiple chronic conditions. We evaluate simulated cases with multiple diseases using a chat application by collaborating with physicians to assess feasibility, efficiency gains through AI utilization, and the quantification of discussion content. We constructed simulated cases based on past case reports, medical error reports, and complex cases of cardiovascular diseases experienced by the physicians. The analysis of discussions across five simulated cases demonstrated a significant reduction in the time required for summarization using AI, with an average reduction of 79.98%. Additionally, we examined hallucination rates in AI-generated summaries used in multidisciplinary medical discussions. The overall hallucination rate ranged from 1.01% to 5.73%, with an average of 3.62%, whereas the harmful hallucination rate varied from 0.00% to 2.09%, with an average of 0.49%. Furthermore, morphological analysis demonstrated that multidisciplinary assessments enabled a more complex and detailed representation of medical knowledge compared with single physician assessments. We examined structural differences between multidisciplinary and single physician assessments using centrality metrics derived from the knowledge graph. In this study, we demonstrated that AI-assisted summarization significantly reduced the time required for medical discussions while maintaining structured knowledge representation. These findings can support the feasibility of AI-assisted chat-based discussions as a human-centered approach to multidisciplinary medical decision-making.
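The abstract mentions centrality metrics derived from a knowledge graph but does not say which ones. As a minimal sketch under that caveat, the snippet below computes degree centrality, one common choice, on a small undirected graph. The concept names and edges are invented and do not come from the study.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree of each node divided by (n - 1), for an undirected simple graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Invented toy graph of concepts mentioned together in a discussion.
edges = [
    ("heart failure", "diuretics"),
    ("heart failure", "renal function"),
    ("diuretics", "renal function"),
    ("heart failure", "anticoagulation"),
]
centrality = degree_centrality(edges)
```

Comparing such scores between a multidisciplinary discussion graph and a single-physician graph is one way structural differences of the kind the abstract describes could be quantified.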
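[NLP-74] Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning
【Quick Read】: This paper addresses the performance degradation of large language models (LLMs) in interactive diagnostic settings, in particular their inefficient information gathering and weak initial diagnosis formation in the initial diagnosis phase. Although LLMs perform well on specific non-interactive diagnostic tasks, their performance deteriorates significantly in interactive diagnosis, which requires active information gathering. To address this limitation, the study proposes a plug-and-play method enhanced (PPME) LLM agent. Drawing on over 3.5 million electronic medical records from Chinese and American healthcare facilities, the authors train specialized models for initial disease diagnosis and for inquiry into the history of the present illness, using supervised and reinforcement learning. The PPME LLM improves on baselines by more than 30%, and its diagnostic accuracy in interactive scenarios approaches the level achieved with complete clinical data. The key to the approach is therefore the use of specialized models to strengthen LLM information handling in the initial diagnosis phase.
Link: https://arxiv.org/abs/2503.16463
Authors: Zhoujian Sun,Ziyi Liu,Cheng Luo,Jiebin Chu,Zhengxing Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 30 pages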
Abstract:Recent advances in large language models (LLMs) have shown promising results in medical diagnosis, with some studies indicating superior performance compared to human physicians in specific scenarios. However, the diagnostic capabilities of LLMs are often overestimated, as their performance significantly deteriorates in interactive diagnostic settings that require active information gathering. This study investigates the underlying mechanisms behind the performance degradation phenomenon and proposes a solution. We identified that the primary deficiency of LLMs lies in the initial diagnosis phase, particularly in information-gathering efficiency and initial diagnosis formation, rather than in the subsequent differential diagnosis phase. To address this limitation, we developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques. The experimental results indicate that the PPME LLM achieved over 30% improvement compared to baselines. The final diagnostic accuracy of the PPME LLM in interactive diagnostic scenarios approached levels comparable to those achieved using complete clinical data. These findings suggest a promising potential for developing autonomous diagnostic systems, although further validation studies are needed.
[NLP-75] Integrating Personality into Digital Humans: A Review of LLM-Driven Approaches for Virtual Reality
【Quick Read】: This paper reviews how large language models (LLMs) can give digital humans in virtual reality (VR) environments nuanced personality traits. It surveys methods for achieving this, including zero-shot, few-shot, and fine-tuning approaches. It also highlights challenges that need to be addressed: high computational demands, latency, and the lack of standardized evaluation frameworks for multimodal interactions. By addressing these gaps, the paper lays a foundation for applications in education, therapy, and gaming, and encourages interdisciplinary collaboration to redefine human-computer interaction in VR.
Link: https://arxiv.org/abs/2503.16457
Authors: Iago Alves Brito,Julia Soares Dollis,Fernanda Bufon Färber,Pedro Schindler Freire Brasil Ribeiro,Rafael Teixeira Sousa,Arlindo Rodrigues Galvão Filho
Affiliations: Advanced Knowledge Center in Immersive Technologies; Federal University of Goiás; Federal University of Mato Grosso
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The integration of large language models (LLMs) into virtual reality (VR) environments has opened new pathways for creating more immersive and interactive digital humans. By leveraging the generative capabilities of LLMs alongside multimodal outputs such as facial expressions and gestures, virtual agents can simulate human-like personalities and emotions, fostering richer and more engaging user experiences. This paper provides a comprehensive review of methods for enabling digital humans to adopt nuanced personality traits, exploring approaches such as zero-shot, few-shot, and fine-tuning. Additionally, it highlights the challenges of integrating LLM-driven personality traits into VR, including computational demands, latency issues, and the lack of standardized evaluation frameworks for multimodal interactions. By addressing these gaps, this work lays a foundation for advancing applications in education, therapy, and gaming, while fostering interdisciplinary collaboration to redefine human-computer interaction in VR.
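[NLP-76] The Application of MATEC (Multi-AI Agent Team Care) Framework in Sepsis Care
【Quick Read】: This paper addresses poor outcomes for sepsis patients in under-resourced or rural hospitals caused by limited access to medical specialists and healthcare professionals. The key to the solution is the MATEC (Multi-AI Agent Team Care) framework, which integrates an AI team of five doctor agents, four health professional agents, and a risk prediction model agent, with an additional 33 doctor agents available for consultations, to support sepsis care. Ten attending physicians at a teaching hospital evaluated the framework and rated it very useful and very accurate, suggesting that MATEC may be useful in assisting medical professionals, particularly in under-resourced hospital settings.
Link: https://arxiv.org/abs/2503.16433
Authors: Andrew Cho,Jason M. Woo,Brian Shi,Aishwaryaa Udeshi,Jonathan S. H. Woo
Affiliations: Princeton University; University of Pittsburgh; Department of Medicine, Penn Medicine Princeton Health, University of Pennsylvania Health System
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 15 pages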
Abstract:Under-resourced or rural hospitals have limited access to medical specialists and healthcare professionals, which can negatively impact patient outcomes in sepsis. To address this gap, we developed the MATEC (Multi-AI Agent Team Care) framework, which integrates a team of specialized AI agents for sepsis care. The sepsis AI agent team includes five doctor agents, four health professional agents, and a risk prediction model agent, with an additional 33 doctor agents available for consultations. Ten attending physicians at a teaching hospital evaluated this framework, spending approximately 40 minutes on the web-based MATEC application and participating in the 5-point Likert scale survey (rated from 1-unfavorable to 5-favorable). The physicians found the MATEC framework very useful (Median=4, P=0.01), and very accurate (Median=4, P<0.01). This pilot study demonstrates that a Multi-AI Agent Team Care framework (MATEC) can potentially be useful in assisting medical professionals, particularly in under-resourced hospital settings.
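[NLP-77] Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay
【Quick Read】: This paper studies multimodal turn-taking prediction in human-agent interaction (HAI), particularly in cooperative gaming environments. Combining model development with a user study, it aims to improve conversational dynamics in spoken dialogue systems (SDSs). The key to the solution is a novel transformer-based deep learning model that integrates text, vision, audio, and contextual in-game data in real time, using a Crossmodal Transformer architecture to fuse these heterogeneous sources for more comprehensive turn-taking prediction. The model outperforms baseline models, reaching 87.3% accuracy and an 83.0% macro F1 score. A user study with a mix of English and Korean speakers playing “Don’t Starve Together” further indicates that deploying the model enhances the fluidity and naturalness of human-agent conversations while maintaining a balanced conversational dynamic without significantly altering dialogue frequency. The work offers insight into how turn-taking abilities affect user perceptions and interaction quality, underscoring the potential of more contextually adaptive and responsive conversational agents.
Link: https://arxiv.org/abs/2503.16432
Authors: Young-Ho Bae,Casey C. Bennett
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 36 pages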
Abstract:This study investigates multimodal turn-taking prediction within human-agent interactions (HAI), particularly focusing on cooperative gaming environments. It comprises both model development and a subsequent user study, aiming to refine our understanding and improve conversational dynamics in spoken dialogue systems (SDSs). For the modeling phase, we introduce a novel transformer-based deep learning (DL) model that simultaneously integrates multiple modalities (text, vision, audio, and contextual in-game data) to predict turn-taking events in real time. Our model employs a Crossmodal Transformer architecture to effectively fuse information from these diverse modalities, enabling more comprehensive turn-taking predictions. The model demonstrates superior performance compared to baseline models, achieving 87.3% accuracy and 83.0% macro F1 score. A human user study was then conducted to empirically evaluate the turn-taking DL model in an interactive scenario with a virtual avatar while playing the game “Don’t Starve Together”, comparing a control condition without turn-taking prediction (n=20) to an experimental condition with our model deployed (n=40). Both conditions included a mix of English and Korean speakers, since turn-taking cues are known to vary by culture. We then analyzed the interaction quality, examining aspects such as utterance counts, interruption frequency, and participant perceptions of the avatar. Results from the user study suggest that our multimodal turn-taking model not only enhances the fluidity and naturalness of human-agent conversations, but also maintains a balanced conversational dynamic without significantly altering dialogue frequency. The study provides in-depth insights into the influence of turn-taking abilities on user perceptions and interaction quality, underscoring the potential for more contextually adaptive and responsive conversational agents.
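The abstract reports both accuracy and macro F1. The sketch below shows how the two metrics are computed and why they can differ when classes are imbalanced. The label names ("hold" for keeping the turn, "shift" for a turn change) and the toy predictions are assumptions for illustration, not the paper's label set or data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented toy labels with more "hold" than "shift" examples.
y_true = ["hold", "hold", "hold", "shift", "shift", "hold"]
y_pred = ["hold", "hold", "shift", "shift", "hold", "hold"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Because macro F1 averages the classes equally, weak performance on the rarer class pulls it below accuracy, which is consistent with the gap between the two figures reported above.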
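[NLP-78] Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks
【Quick Read】: This paper addresses the consistency and reproducibility of large language model (LLM) outputs in finance and accounting research. The key to the approach is a large-scale experiment that evaluates output consistency on five common tasks (classification, sentiment analysis, summarization, text generation, and prediction), using repeated runs (over 3.4 million outputs in total) to measure reproducibility. The study finds that variability is task-dependent, but that simple aggregation strategies (combining results across 3-5 runs) dramatically improve consistency. Simulation analysis further shows that downstream statistical inferences remain robust despite some inconsistency in LLM outputs. Together, these results address concerns about “G-hacking” (the selective reporting of favorable outcomes from multiple runs), showing that such risks are relatively low for finance and accounting tasks.
Link: https://arxiv.org/abs/2503.16974
Authors: Julian Junyan Wang,Victor Xiaoqi Wang
Affiliations: Unknown
Categories: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 96 pages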
Abstract:This study provides the first comprehensive assessment of consistency and reproducibility in Large Language Model (LLM) outputs in finance and accounting research. We evaluate how consistently LLMs produce outputs given identical inputs through extensive experimentation with 50 independent runs across five common tasks: classification, sentiment analysis, summarization, text generation, and prediction. Using three OpenAI models (GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse financial source texts and data, covering MD&As, FOMC statements, finance news articles, earnings call transcripts, and financial statements. Our findings reveal substantial but task-dependent consistency, with binary classification and sentiment analysis achieving near-perfect reproducibility, while complex tasks show greater variability. More advanced models do not consistently demonstrate better consistency and reproducibility, with task-specific patterns emerging. LLMs significantly outperform expert human annotators in consistency and maintain high agreement even where human experts significantly disagree. We further find that simple aggregation strategies across 3-5 runs dramatically improve consistency. Simulation analysis reveals that despite measurable inconsistency in LLM outputs, downstream statistical inferences remain remarkably robust. These findings address concerns about what we term “G-hacking,” the selective reporting of favorable outcomes from multiple Generative AI runs, by demonstrating that such risks are relatively low for finance and accounting tasks.
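The abstract says that simple aggregation across 3-5 runs improves consistency but does not spell out the strategies. Majority voting is one simple option for categorical outputs, sketched below on invented sentiment labels from three hypothetical runs; the paper's exact aggregation rules may differ.

```python
from collections import Counter


def majority_vote(labels):
    """Most common label across runs; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def aggregate_runs(runs):
    """runs[i][j] is the label that run i assigned to item j."""
    return [majority_vote(item_labels) for item_labels in zip(*runs)]


# Invented outputs of three runs over the same three documents.
runs = [
    ["positive", "negative", "neutral"],
    ["positive", "neutral", "neutral"],
    ["positive", "negative", "positive"],
]
aggregated = aggregate_runs(runs)
```

A single disagreeing run is outvoted for each item, which is the intuition behind aggregation reducing run-to-run variability.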
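Computer Vision
[CV-0] Position: Interactive Generative Video as Next-Generation Game Engine
【Quick Read】: This paper addresses the creativity bottleneck and high development cost caused by predetermined content in traditional game engines. The key to the solution is the Generative Game Engine (GGE), built on Interactive Generative Video (IGV). IGV offers unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory, and causal reasoning, which together let GGE generate unlimited novel content for next-generation games. The paper also presents a framework with five core modules and a hierarchical maturity roadmap (L0-L4) to guide GGE's evolution. The work charts a new direction for game development in the AI era, envisioning a future in which AI-powered generative systems fundamentally reshape how games are created and experienced.
Link: https://arxiv.org/abs/2503.17359
Authors: Jiwen Yu,Yiran Qin,Haoxuan Che,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Xihui Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: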
Abstract:Modern game development faces significant challenges in creativity and cost due to predetermined content in traditional game engines. Recent breakthroughs in video generation models, capable of synthesizing realistic and interactive virtual environments, present an opportunity to revolutionize game creation. In this position paper, we propose Interactive Generative Video (IGV) as the foundation for Generative Game Engines (GGE), enabling unlimited novel content generation in next-generation gaming. GGE leverages IGV’s unique strengths in unlimited high-quality content synthesis, physics-aware world modeling, user-controlled interactivity, long-term memory capabilities, and causal reasoning. We present a comprehensive framework detailing GGE’s core modules and a hierarchical maturity roadmap (L0-L4) to guide its evolution. Our work charts a new course for game development in the AI era, envisioning a future where AI-powered generative systems fundamentally reshape how games are created and experienced.
[CV-1] Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
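[Quick read]: This paper addresses the severe motion blur caused by fast camera motion in robotics and VR/AR applications, which makes existing camera pose estimation methods that rely on sharp images fail. The key idea is a framework that treats motion blur as a rich cue for motion estimation rather than as an unwanted artifact. The method predicts a dense motion flow field and a monocular depth map directly from a single motion-blurred image, then recovers the instantaneous camera velocity by solving a linear least squares problem under the small-motion assumption, producing an IMU-like measurement that robustly captures fast and aggressive camera movements. For training, the authors build a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine the model end-to-end on real data using a fully differentiable pipeline.
Link: https://arxiv.org/abs/2503.17358
Authors: Jerred Chen, Ronald Clark
Affiliations: University of Oxford; Department of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL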
Abstract:In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
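The least-squares step described above can be illustrated with a minimal sketch. It assumes the classical instantaneous motion-field equations in normalized image coordinates, which may differ from the paper's exact formulation; the function names `motion_field_matrix` and `solve_camera_velocity` are hypothetical, and the flow is taken to be expressed per unit time.

```python
import numpy as np

def motion_field_matrix(x, y, depth):
    """Build the (2N, 6) matrix A with flow = A @ [vx, vy, vz, wx, wy, wz].

    x, y are normalized image coordinates (intrinsics already removed) and
    depth is the per-pixel depth Z. Assumed model: the classical
    instantaneous motion field for a moving camera in a static scene.
    """
    inv_z = 1.0 / depth
    zero = np.zeros_like(x)
    # Horizontal flow rows: translation terms scale with 1/Z, rotation terms do not.
    rows_u = np.stack([-inv_z, zero, x * inv_z, x * y, -(1.0 + x**2), y], axis=1)
    # Vertical flow rows.
    rows_v = np.stack([zero, -inv_z, y * inv_z, 1.0 + y**2, -x * y, -x], axis=1)
    return np.concatenate([rows_u, rows_v], axis=0)

def solve_camera_velocity(x, y, depth, flow_u, flow_v):
    """Recover linear and angular velocity from flow and depth by least squares."""
    A = motion_field_matrix(x, y, depth)
    b = np.concatenate([flow_u, flow_v])
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return solution[:3], solution[3:]
```

Because every pixel contributes two linear equations in the same six unknowns, the system is heavily overdetermined, which is what makes a plain least-squares solve plausible here. Depth variation across the image is what separates translation from rotation in this model.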
[CV-2] Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography
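[Quick read]: This paper addresses remote, contactless estimation of vital signs for situations in which contact-based devices such as sensors are unavailable, too intrusive, or too expensive. It presents a modular, interpretable pipeline for estimating the pulse signal from video of the face, which achieves state-of-the-art performance on publicly available datasets. The key is the modular design, with three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation. Unlike many deep learning methods that use a single black-box model mapping input video directly to an output signal or heart rate, this design allows each part of the pipeline to be interpreted individually. The proposed TURNIP module uses a time-series U-Net with recurrence to reconstruct the underlying pulse waveform accurately even in the presence of motion, and to compute heart rate and pulse rate variability metrics. The system also detects face regions that are self-occluded because of extreme head poses and keeps its estimates robust. The algorithm provides reliable heart rate estimates without specialized sensors or skin contact, outperforming previous imaging photoplethysmography (iPPG) methods on both RGB and near-infrared datasets.
Link: https://arxiv.org/abs/2503.17351
Authors: Vineet R. Shenoy, Shaoju Wu, Armand Comas, Tim K. Marks, Suhas Lohit, Hassan Mansour
Affiliations: Johns Hopkins University, Baltimore, MD, USA; Boston Children’s Hospital and Harvard Medical School, Boston, MA, USA; Google; Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures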
Abstract:Remote estimation of vital signs enables health monitoring for situations in which contact-based devices are either not available, too intrusive, or too expensive. In this paper, we present a modular, interpretable pipeline for pulse signal estimation from video of the face that achieves state-of-the-art results on publicly available datasets. Our imaging photoplethysmography (iPPG) system consists of three modules: face and landmark detection, time-series extraction, and pulse signal/pulse rate estimation. Unlike many deep learning methods that make use of a single black-box model that maps directly from input video to output signal or heart rate, our modular approach enables each of the three parts of the pipeline to be interpreted individually. The pulse signal estimation module, which we call TURNIP (Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography), allows the system to faithfully reconstruct the underlying pulse signal waveform and uses it to measure heart rate and pulse rate variability metrics, even in the presence of motion. When parts of the face are occluded due to extreme head poses, our system explicitly detects such “self-occluded” regions and maintains estimation robustness despite the missing information. Our algorithm provides reliable heart rate estimates without the need for specialized sensors or contact with the skin, outperforming previous iPPG methods on both color (RGB) and near-infrared (NIR) datasets.
[CV-3] Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
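[Quick read]: This paper addresses the difficulty of decoupling motion from appearance in the motion transfer task. The problem is more pronounced in video Diffusion Transformers (DiT), whose 3D full attention does not explicitly separate temporal and spatial information. The paper proposes DeT, which introduces a simple yet effective temporal kernel that smooths DiT features along the temporal dimension, helping to decouple foreground motion from background appearance. The temporal kernel also captures temporal variations in the features that are closely related to motion. In addition, the authors add explicit supervision along dense trajectories in the latent feature space to further improve motion consistency. The paper also builds MTBench, a general and challenging benchmark for motion transfer, and proposes a hybrid motion fidelity metric that considers both global and local motion similarity, giving a more comprehensive evaluation than previous work. In short, the key to DeT is the combination of the temporal kernel and dense-trajectory supervision, which yields the best trade-off between motion fidelity and edit fidelity.
Link: https://arxiv.org/abs/2503.17350
Authors: Qingyu Shi, Jianzong Wu, Jinbin Bai, Jiangning Zhang, Lu Qi, Xiangtai Li, Yunhai Tong
Affiliations: PKU (Peking University); NTU (Nanyang Technological University); NUS (National University of Singapore); ZJU (Zhejiang University); UC Merced
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: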
Abstract:The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within 3D U-Net. In contrast, state-of-the-art video Diffusion Transformers (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both the global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
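As a rough illustration of smoothing features along the temporal dimension, here is a minimal sketch. The abstract does not specify the form of the temporal kernel, so the fixed, normalized 1D kernel, the `(frames, tokens, channels)` layout, and the name `temporal_smooth` are all assumptions rather than DeT's actual design.

```python
import numpy as np

def temporal_smooth(features, kernel):
    """Smooth features of shape (T, ...) along the first (time) axis.

    Assumption: a fixed, odd-length 1D kernel with edge padding. The kernel
    used in the paper may be learned and shaped differently.
    """
    kernel = np.asarray(kernel, dtype=float)
    if kernel.size % 2 == 0:
        raise ValueError("kernel length must be odd")
    kernel = kernel / kernel.sum()  # keep the overall feature scale unchanged
    pad = kernel.size // 2
    pad_width = ((pad, pad),) + ((0, 0),) * (features.ndim - 1)
    padded = np.pad(features, pad_width, mode="edge")
    num_frames = features.shape[0]
    out = np.zeros(features.shape, dtype=float)
    # Weighted sum of time-shifted copies, i.e. a 1D convolution over time.
    for offset, weight in enumerate(kernel):
        out += weight * padded[offset:offset + num_frames]
    return out
```

Averaging over neighboring frames suppresses fast temporal changes and leaves static content intact, which is the intuition behind using temporal smoothing to pull moving foreground apart from stable background appearance.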
[CV-4] Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
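[Quick read]: This paper addresses the weakness of Vision-Language Models (VLMs) on spatial reasoning tasks despite their strong object recognition. Inspired by the dual-pathway (ventral-dorsal) model of human vision, the study finds that vision embeddings in VLMs are treated mainly as a semantic "bag of tokens", and that their disproportionately large embedding norms overshadow subtle but important positional cues. To validate this insight, the authors run extensive diagnostic experiments showing that removing token order or fine-grained spatial detail has little effect on performance. Based on these findings, they propose simple, interpretable interventions, including normalizing vision embedding norms and extracting spatially rich mid-layer features, to restore spatial awareness. Experiments on both synthetic data and standard benchmarks show improved spatial reasoning, underlining the value of interpretability-informed design. The study reveals fundamental limitations of current VLM architectures and offers practical insights for improving structured perception of visual scenes.
Link: https://arxiv.org/abs/2503.17349
Authors: Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu
Affiliations: CUNY Graduate Center; Borough of Manhattan Community College; The City College of New York
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: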
Abstract:Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as semantic “bag-of-tokens,” overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token orders or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
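One way to picture the norm-normalization intervention is the sketch below. The abstract does not give the exact rule, so rescaling each vision token to the mean norm of the text tokens is an assumption made purely for illustration, and `rescale_vision_embeddings` is a hypothetical name.

```python
import numpy as np

def rescale_vision_embeddings(vision_tokens, text_tokens, eps=1e-8):
    """Rescale each vision token to the mean L2 norm of the text tokens.

    Assumption: matching the mean text-token norm. The paper may use a
    different target norm or a learned normalization.
    """
    target_norm = np.linalg.norm(text_tokens, axis=-1).mean()
    vision_norms = np.linalg.norm(vision_tokens, axis=-1, keepdims=True)
    # Direction of every token is preserved; only its magnitude changes.
    return vision_tokens / (vision_norms + eps) * target_norm
```

Keeping the direction and changing only the magnitude leaves the semantic content of each token untouched while stopping large norms from dominating whatever positional signal is added downstream.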
[CV-5] Dereflection Any Image with Diffusion Priors and Diversified Data
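[Quick read]: This paper addresses single-image reflection removal, a highly challenging task because of the complex entanglement between the target scene and unwanted reflections. Existing methods are limited by the scarcity of high-quality, diverse data and by insufficient restoration priors, so they generalize poorly across real-world scenarios. The solution has two key parts. First, the authors build a dataset named Diverse Reflection Removal (DRR) by randomly rotating reflective mediums in target scenes, which varies reflection angles and intensities and sets a new benchmark in scale, quality, and diversity. Second, they propose a diffusion-based framework that uses one-step diffusion for deterministic outputs and fast inference, together with a three-stage progressive training strategy that includes reflection-invariant finetuning to encourage stable, consistent outputs across different reflection patterns. Experiments show state-of-the-art (SOTA) performance on both common benchmarks and challenging in-the-wild images, with strong generalization across diverse real-world scenes.
Link: https://arxiv.org/abs/2503.17347
Authors: Jichen Hu, Chen Yang, Zanwei Zhou, Jiemin Fang, Xiaokang Yang, Qi Tian, Wei Shen
Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: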
Abstract:Reflection removal of a single image remains a highly challenging task due to the complex entanglement between target scenes and unwanted reflections. Despite significant progress, existing methods are hindered by the scarcity of high-quality, diverse data and insufficient restoration priors, resulting in limited generalization across various real-world scenarios. In this paper, we propose Dereflection Any Image, a comprehensive solution with an efficient data preparation pipeline and a generalizable model for robust reflection removal. First, we introduce a dataset named Diverse Reflection Removal (DRR) created by randomly rotating reflective mediums in target scenes, enabling variation of reflection angles and intensities, and setting a new benchmark in scale, quality, and diversity. Second, we propose a diffusion-based framework with one-step diffusion for deterministic outputs and fast inference. To ensure stable learning, we design a three-stage progressive training strategy, including reflection-invariant finetuning to encourage consistent outputs across varying reflection patterns that characterize our dataset. Extensive experiments show that our method achieves SOTA performance on both common benchmarks and challenging in-the-wild images, showing superior generalization across diverse real-world scenes.
[CV-6] Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
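[Quick read]: This paper addresses the generation of natural, diverse, and rhythmic human dance movements driven by music, where existing methods fall short in beat alignment and motion dynamics. It proposes the Danceba framework, whose key idea is to use a gating mechanism to strengthen rhythm-aware feature representation. Specifically, Phase-Based Rhythm Extraction (PRE) extracts rhythmic information precisely from musical phase data, and Temporal-Gated Causal Attention (TGCA) focuses on global rhythmic features so that dance movements closely follow the musical rhythm. In addition, the Parallel Mamba Motion Modeling (PMMM) architecture models upper-body and lower-body motion separately, along with musical features, which improves the naturalness and diversity of the generated dance. Experiments show that Danceba significantly outperforms existing methods in rhythmic alignment and motion diversity.
Link: https://arxiv.org/abs/2503.17340
Authors: Congyi Fan, Jian Guan, Xuanjia Zhao, Dongli Xu, Youtian Lin, Tong Ye, Pengming Feng, Haiwei Pan
Affiliations: Harbin Engineering University; Shanghai Academy of AI for Science; Nanjing University; State Key Laboratory of Space-Ground Integrated Information Technology
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 10 pages, 6 figures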
Abstract:Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity. Project page: this https URL .
[CV-7] Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors CVPR2025
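[Quick read]: This paper addresses the under-use of available information in multi-modal 3D vision regression, that is, how to combine auxiliary information (such as camera intrinsics, relative pose, and dense or sparse depth) with the input images more effectively to improve prediction accuracy. The key contribution is a new model named Pow3R, which uses a lightweight and versatile conditioning mechanism to integrate any combination of auxiliary information as additional guidance within a single network, so that it predicts more accurate results when priors are available. Pow3R is trained on random subsets of modalities, which lets the model operate under different levels of known priors at test time and enables new capabilities such as inference at native image resolution and point-cloud completion.
Link: https://arxiv.org/abs/2503.17316
Authors: Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, Jerome Revaud
Affiliations: UCL; Naver Labs Europe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025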
Abstract:We present Pow3r, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. Unlike previous feed-forward models that lack any mechanism to exploit known camera or scene priors at test time, Pow3r incorporates any combination of auxiliary information such as intrinsics, relative pose, dense or sparse depth, alongside input images, within a single network. Building upon the recent DUSt3R paradigm, a transformer-based architecture that leverages powerful pre-training, our lightweight and versatile conditioning acts as additional guidance for the network to predict more accurate estimates when auxiliary information is available. During training we feed the model with random subsets of modalities at each iteration, which enables the model to operate under different levels of known priors at test time. This in turn opens up new capabilities, such as performing inference in native image resolution, or point-cloud completion. Our experiments on 3D reconstruction, depth completion, multi-view depth prediction, multi-view stereo, and multi-view pose estimation tasks yield state-of-the-art results and confirm the effectiveness of Pow3r at exploiting all available information. The project webpage is this https URL.
[CV-8] Exploring a Principled Framework for Deep Subspace Clustering ICLR2025
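[Quick read]: This paper addresses the fact that real-world data often deviate from the union of subspaces (UoS) assumption. Existing deep subspace clustering algorithms, which jointly learn UoS representations and self-expressive coefficients, suffer from feature collapse and lack theoretical guarantees. The key contribution is PRO-DSC (Principled fRamewOrk for Deep Subspace Clustering), which adds an effective regularization term to the self-expressive model in order to learn structured representations and self-expressive coefficients. The authors prove that the regularized model prevents feature space collapse and that, under certain conditions, the learned optimal representations lie on a union of orthogonal subspaces. The paper also provides a scalable and efficient implementation and verifies both the theory and the performance of the method through extensive experiments.
Link: https://arxiv.org/abs/2503.17288
Authors: Xianghan Meng, Zhiyuan Huang, Wei He, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICLR 2025. The first two authors contributed equally.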
Abstract:Subspace clustering is a classical unsupervised learning task, built on a basic assumption that high-dimensional data can be approximated by a union of subspaces (UoS). Nevertheless, the real-world data are often deviating from the UoS assumption. To address this challenge, state-of-the-art deep subspace clustering algorithms attempt to jointly learn UoS representations and self-expressive coefficients. However, the general framework of the existing algorithms suffers from a catastrophic feature collapse and lacks a theoretical guarantee to learn desired UoS representation. In this paper, we present a Principled fRamewOrk for Deep Subspace Clustering (PRO-DSC), which is designed to learn structured representations and self-expressive coefficients in a unified manner. Specifically, in PRO-DSC, we incorporate an effective regularization on the learned representations into the self-expressive model, prove that the regularized self-expressive model is able to prevent feature space collapse, and demonstrate that the learned optimal representations under certain condition lie on a union of orthogonal subspaces. Moreover, we provide a scalable and efficient approach to implement our PRO-DSC and conduct extensive experiments to verify our theoretical findings and demonstrate the superior performance of our proposed deep subspace clustering approach. The code is available at this https URL.
[CV-9] HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks CVPR2025
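[Quick read]: This paper addresses the long training time that existing video layer decomposition models need for each new video. These models typically rely on implicit neural representations (INRs) trained independently per video, which makes adapting to new videos costly. To solve this, the paper proposes a meta-learning strategy that learns a generic video decomposition model in order to speed up training on new videos. The key is a hypernetwork architecture that takes a video-encoder embedding as input and generates the parameters of a compact INR-based neural video decomposition model. The approach mitigates single-video overfitting and significantly shortens convergence when decomposing new, unseen videos.
Link: https://arxiv.org/abs/2503.17276
Authors: Maria Pilligua, Danna Xue, Javier Vazquez-Corral
Affiliations: Universitat Autònoma de Barcelona; Computer Vision Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, project page: this https URL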
Abstract:Decomposing a video into a layer-based representation is crucial for easy video editing for the creative industries, as it enables independent editing of specific layers. Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy to learn a generic video decomposition model to speed up the training on new videos. Our model is based on a hypernetwork architecture which, given a video-encoder embedding, generates the parameters for a compact INR-based neural video decomposition model. Our strategy mitigates the problem of single-video overfitting and, importantly, shortens the convergence of video decomposition on new, unseen videos. Our code is available at: this https URL
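The hypernetwork idea, one network emitting the weights of another, can be sketched as follows. Everything concrete here is an assumption for illustration: the hypernetwork is a single linear map, the INR is a tiny MLP with sine activations mapping `(x, y, t)` to three output channels, and the names `init_hypernet`, `generate_inr_params`, and `inr_forward` are hypothetical.

```python
import numpy as np

def init_hypernet(embed_dim, inr_sizes, rng):
    """Create a linear hypernetwork that outputs all INR weights and biases."""
    n_params = sum(i * o + o for i, o in zip(inr_sizes[:-1], inr_sizes[1:]))
    return {"W": rng.normal(0.0, 0.01, (n_params, embed_dim)),
            "b": np.zeros(n_params)}

def generate_inr_params(hyper, embedding, inr_sizes):
    """Map a video embedding to per-layer (weight, bias) pairs for the INR."""
    flat = hyper["W"] @ embedding + hyper["b"]
    params, offset = [], 0
    for n_in, n_out in zip(inr_sizes[:-1], inr_sizes[1:]):
        weight = flat[offset:offset + n_in * n_out].reshape(n_out, n_in)
        offset += n_in * n_out
        bias = flat[offset:offset + n_out]
        offset += n_out
        params.append((weight, bias))
    return params

def inr_forward(params, coords):
    """Evaluate the generated INR at coordinates of shape (N, in_dim)."""
    h = coords
    for index, (weight, bias) in enumerate(params):
        h = h @ weight.T + bias
        if index < len(params) - 1:
            h = np.sin(h)  # sine activation on hidden layers (an assumption)
    return h
```

In this setup only the hypernetwork holds trainable state. A new video needs just a new embedding, which is the mechanism that lets training on unseen videos start from a useful point instead of from scratch.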
[CV-10] Recovering Pulse Waves from Video Using Deep Unrolling and Deep Equilibrium Models
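[Quick read]: This paper addresses signal recovery and heart rate estimation in camera-based vital sign monitoring, known as imaging photoplethysmography (iPPG). Previous methods either impose model-based sparse priors and recover the pulse wave by iterative optimization, or use end-to-end black-box deep learning. The paper proposes methods that combine signal processing with deep learning. The key is to learn deep-network-based denoising operators and to use deep algorithm unfolding and deep equilibrium models within an inverse problem framework to estimate the pulse signal and infer the heart rate. Experiments show that the methods denoise signals acquired from facial video and infer the correct pulse rate, achieving state-of-the-art heart rate estimation on well-known benchmarks with fewer than one-fifth of the learnable parameters of the closest competing method.
Link: https://arxiv.org/abs/2503.17269
Authors: Vineet R Shenoy, Suhas Lohit, Hassan Mansour, Rama Chellappa, Tim K. Marks
Affiliations: Johns Hopkins University; Mitsubishi Electric Research Laboratories (MERL)
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 13 pages, 9 figures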
Abstract:Camera-based monitoring of vital signs, also known as imaging photoplethysmography (iPPG), has seen applications in driver-monitoring, perfusion assessment in surgical settings, affective computing, and more. iPPG involves sensing the underlying cardiac pulse from video of the skin and estimating vital signs such as the heart rate or a full pulse waveform. Some previous iPPG methods impose model-based sparse priors on the pulse signals and use iterative optimization for pulse wave recovery, while others use end-to-end black-box deep learning methods. In contrast, we introduce methods that combine signal processing and deep learning methods in an inverse problem framework. Our methods estimate the underlying pulse signal and heart rate from facial video by learning deep-network-based denoising operators that leverage deep algorithm unfolding and deep equilibrium models. Experiments show that our methods can denoise an acquired signal from the face and infer the correct underlying pulse rate, achieving state-of-the-art heart rate estimation performance on well-known benchmarks, all with less than one-fifth the number of learnable parameters as the closest competing method.
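The general shape of algorithm unfolding can be sketched as a fixed number of proximal-gradient steps in which a denoiser replaces the proximal operator. This is an illustrative stand-in, not the paper's model: the linear measurement operator `A`, the soft-threshold denoiser (used here in place of a learned network), and the name `unrolled_recovery` are all assumptions.

```python
import numpy as np

def soft_threshold(x, threshold):
    """Classical sparse-prior denoiser, standing in for a learned network."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def unrolled_recovery(y, A, step, denoisers):
    """Run one data-consistency step plus one denoising step per 'layer'.

    Each entry of `denoisers` plays the role of one unrolled layer. In a
    deep unrolled model these would be trainable networks.
    """
    x = np.zeros(A.shape[1])
    for denoise in denoisers:
        gradient = A.T @ (A @ x - y)     # gradient of 0.5 * ||A x - y||^2
        x = denoise(x - step * gradient)
    return x
```

A deep equilibrium variant would reuse a single denoiser and iterate the same update until `x` stops changing, rather than fixing the number of layers in advance.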
[CV-11] Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment CVPR2025
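[Quick read]: This paper addresses the implausible predictions that arise because existing Human Trajectory Prediction (HTP) methods do not use human pose cues explicitly. It proposes a framework named Locomotion Embodiment, whose key idea is to replace a non-differentiable physics simulator with a differentiable Locomotion Value function that explicitly evaluates the physical plausibility of a predicted trajectory and guides the data-driven training of the HTP network. The proposed Embodied Locomotion loss helps train stochastic HTP networks with multiple heads efficiently, and the Locomotion Value filter removes implausible trajectories at inference. Experiments show that the method improves HTP performance across diverse datasets and problem settings.
Link: https://arxiv.org/abs/2503.17267
Authors: Hiromu Taketsugu, Takeru Oba, Takahiro Maeda, Shohei Nobuhara, Norimichi Ukita
Affiliations: Toyota Technological Institute, Japan; Kyoto Institute of Technology, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025. Project page: this https URL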
Abstract:Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. While the plausibility of locomotion is learned with an indifferentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network using multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method enhances even the state-of-the-art HTP methods across diverse datasets and problem settings. Our code is available at: this https URL.
[CV-12] Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
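[Quick read]: This paper addresses the fact that previous methods treat the recovery of motion (optical flow) and appearance (image intensity) from event cameras as separate tasks, which does not fit the nature of event cameras and neglects the inherent relation between the two. It proposes an unsupervised learning framework that jointly estimates optical flow and image intensity with a single network. The key is to start from the event generation model, derive an event-based photometric error as a function of optical flow and image intensity, and combine it with the contrast maximization framework into a comprehensive loss function that properly constrains both flow and intensity estimation. In the unsupervised learning category, the model improves optical flow endpoint error (EPE) and angular error (AE) by 20% and 25% respectively, and it achieves competitive intensity estimation in high dynamic range scenes. Its inference time is shorter than that of all the other optical flow models and many of the image reconstruction models.
Link: https://arxiv.org/abs/2503.17262
Authors: Shuang Guo, Friedhelm Hamann, Guillermo Gallego
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 14 pages, 8 figures, 9 tables. Project page: this https URL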
Abstract:Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are seen both or neither, which are encoded in the output event stream. Previous works consider recovering these two visual quantities as separate tasks, which does not fit with the nature of event cameras and neglects the inherent relations between both tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance), with a single network. Starting from the event generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity, which is further combined with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show that our model achieves state-of-the-art performance for both optical flow (achieves 20% and 25% improvement in EPE and AE respectively in the unsupervised learning category) and intensity estimation (produces competitive results with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all the other optical flow models and many of the image reconstruction models, while they output only one quantity. Project page: this https URL
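The contrast maximization component mentioned above can be illustrated with a minimal sketch that scores a candidate flow by the sharpness of the image of warped events. It covers only the contrast term, not the paper's event-based photometric error, and it assumes a single constant flow for all events, nearest-pixel accumulation, and variance as the contrast measure; `warped_event_contrast` is a hypothetical name.

```python
import numpy as np

def warped_event_contrast(xs, ys, ts, flow, shape, t_ref=0.0):
    """Warp events to t_ref along a constant flow and return image variance.

    xs, ys, ts: event pixel coordinates and timestamps (1D arrays).
    flow: (vx, vy) in pixels per unit time, assumed constant over the image.
    shape: (height, width) of the accumulation image.
    """
    vx, vy = flow
    warped_x = np.rint(xs - vx * (ts - t_ref)).astype(int)
    warped_y = np.rint(ys - vy * (ts - t_ref)).astype(int)
    height, width = shape
    inside = (warped_x >= 0) & (warped_x < width) & (warped_y >= 0) & (warped_y < height)
    image = np.zeros(shape)
    # Count events per pixel; np.add.at handles repeated indices correctly.
    np.add.at(image, (warped_y[inside], warped_x[inside]), 1.0)
    return image.var()
```

When the candidate flow matches the true motion, events triggered by the same scene edge pile up on the same pixels, so the accumulated image is sharper and its variance is higher than under a wrong flow.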
[CV-13] Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology
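[Quick read]: This paper addresses few-shot classification of histopathology whole slide images (WSIs). Conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level predictions from patch representations, and they need extensive slide-level (bag-level) labels for training. Methods based on vision-language models (VLMs) align the visual embeddings of patches well with candidate class text prompts but lack essential pathological prior knowledge. The key innovation is to use pathological prior knowledge from language models to identify the local tissue types (patches) that matter for WSI classification, and to integrate this into a VLM-based MIL framework. With slide-level prompt learning, the model is fine-tuned using only a few labeled WSIs per category, and it outperforms existing MIL- and VLM-based methods on few-shot WSI classification.
Link: https://arxiv.org/abs/2503.17238
Authors: Devavrat Tomar, Guillaume Vray, Dwarikanath Mahapatra, Sudipta Roy, Jean-Philippe Thiran, Behzad Bozorgtabar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2025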
Abstract:In this paper, we address the challenge of few-shot classification in histopathology whole slide images (WSIs) by utilizing foundational vision-language models (VLMs) and slide-level prompt learning. Given the gigapixel scale of WSIs, conventional multiple instance learning (MIL) methods rely on aggregation functions to derive slide-level (bag-level) predictions from patch representations, which require extensive bag-level labels for training. In contrast, VLM-based approaches excel at aligning visual embeddings of patches with candidate class text prompts but lack essential pathological prior knowledge. Our method distinguishes itself by utilizing pathological prior knowledge from language models to identify crucial local tissue types (patches) for WSI classification, integrating this within a VLM-based MIL framework. Our approach effectively aligns patch images with tissue types, and we fine-tune our model via prompt learning using only a few labeled WSIs per category. Experimentation on real-world pathological WSI datasets and ablation studies highlight our method’s superior performance over existing MIL- and VLM-based methods in few-shot WSI classification tasks. Our code is publicly available at this https URL.
[CV-14] Strong Baseline: Multi-UAV Tracking via YOLOv12 with BoT-SORT-ReID
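[Quick read]: This paper addresses the detection and tracking of multiple unmanned aerial vehicles (UAVs) in thermal infrared video, which is difficult because of low contrast, environmental noise, and small target sizes. Instead of relying on the YOLOv5 with DeepSORT pipeline, it presents a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. The key point is that the improved detection and tracking components achieve competitive performance without contrast enhancement or temporal information fusion to enrich UAV features, which establishes the approach as a "Strong Baseline" for multi-UAV tracking.
Link: https://arxiv.org/abs/2503.17237
Authors: Yu-Hsi Chen
Affiliations: The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, 5 tables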
Abstract:Detecting and tracking multiple unmanned aerial vehicles (UAVs) in thermal infrared video is inherently challenging due to low contrast, environmental noise, and small target sizes. This paper provides a straightforward approach to address multi-UAV tracking in thermal infrared video, leveraging recent advances in detection and tracking. Instead of relying on the YOLOv5 with the DeepSORT pipeline, we present a tracking framework built on YOLOv12 and BoT-SORT, enhanced with tailored training and inference strategies. We evaluate our approach following the metrics from the 4th Anti-UAV Challenge and demonstrate competitive performance. Notably, we achieve strong results without using contrast enhancement or temporal information fusion to enrich UAV features, highlighting our approach as a “Strong Baseline” for the multi-UAV tracking task. We provide implementation details, in-depth experimental analysis, and a discussion of potential improvements. The code is available at this https URL .
[CV-15] Leveraging Text-to-Image Generation for Handling Spurious Correlation
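[Quick read]: This paper addresses the poor generalization of deep neural networks to out-of-distribution samples. Models may rely on spurious correlations between labels and irrelevant image features, which makes their predictions unreliable. To tackle this, the paper proposes a technique that generates training samples with text-to-image (T2I) diffusion models. The key steps are: first, use a textual inversion mechanism to find the token that best describes the visual features of the causal component; second, combine a language segmentation method with a diffusion model to generate new samples that contain the causal component, and prune generated samples that do not meet the objective; finally, retrain the Empirical Risk Minimization (ERM) model to reduce its reliance on spurious correlations. Experiments show that the method achieves better worst-group accuracy than existing state-of-the-art methods across different benchmarks.
Link: https://arxiv.org/abs/2503.17226
Authors: Aryan Yazdan Parast, Basim Azam, Naveed Akhtar
Affiliations: University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: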
Abstract:Deep neural networks trained with Empirical Risk Minimization (ERM) perform well when both training and test data come from the same domain, but they often fail to generalize to out-of-distribution samples. In image classification, these models may rely on spurious correlations that often exist between labels and irrelevant features of images, making predictions unreliable when those features do not exist. We propose a technique to generate training samples with text-to-image (T2I) diffusion models for addressing the spurious correlation problem. First, we compute the best describing token for the visual features pertaining to the causal components of samples by a textual inversion mechanism. Then, leveraging a language segmentation method and a diffusion model, we generate new samples by combining the causal component with the elements from other classes. We also meticulously prune the generated samples based on the prediction probabilities and attribution scores of the ERM model to ensure their correct composition for our objective. Finally, we retrain the ERM model on our augmented dataset. This process reduces the model’s reliance on spurious correlations by learning from carefully crafted samples in which this correlation does not exist. Our experiments show that across different benchmarks, our technique achieves better worst-group accuracy than the existing state-of-the-art methods.
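Worst-group accuracy, the metric reported above, can be computed as in the sketch below. It assumes the common convention that each sample carries a group identifier (for example, a class label paired with a spurious attribute); how the paper defines its groups is not stated in the abstract, and `worst_group_accuracy` is a hypothetical helper name.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return the lowest per-group accuracy and the full per-group breakdown."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    per_group = {}
    for group in np.unique(groups):
        mask = groups == group
        per_group[group] = float(np.mean(y_true[mask] == y_pred[mask]))
    return min(per_group.values()), per_group
```

Reporting the minimum rather than the average exposes models that do well overall by exploiting a spurious feature while failing on the minority group where that feature is absent or misleading.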
[CV-16] Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation
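[Quick read]: This paper addresses the shortage of training data caused by high acquisition costs, privacy constraints, and data scarcity in specialised domains, which is a growing bottleneck for large and complex machine learning models. It proposes using Neuro-Symbolic conditioning to generate high-quality synthetic image datasets in order to improve Scene Graph Generation models. The key is to use structured symbolic representations, in the form of scene graphs, to encode relational constraints explicitly and thereby improve synthetic data quality. When used for dataset augmentation, Neuro-Symbolic conditioning improves standard Recall metrics by up to +2.59% and No Graph Constraint Recall metrics by up to +2.83%, which suggests that combining Neuro-Symbolic and generative approaches offers a new way to overcome data scarcity in complex visual reasoning tasks.
Link: https://arxiv.org/abs/2503.17224
Authors: Giacomo Savazzi, Eugenio Lomurno, Cristian Sbrolli, Agnese Chiatti, Matteo Matteucci
Affiliations: Politecnico di Milano; Department of Electronics, Information and Bioengineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: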
Abstract:As machine learning models increase in scale and complexity, obtaining sufficient training data has become a critical bottleneck due to acquisition costs, privacy constraints, and data scarcity in specialised domains. While synthetic data generation has emerged as a promising alternative, a notable performance gap remains compared to models trained on real data, particularly as task complexity grows. Concurrently, Neuro-Symbolic methods, which combine neural networks’ learning strengths with symbolic reasoning’s structured representations, have demonstrated significant potential across various cognitive tasks. This paper explores the utility of Neuro-Symbolic conditioning for synthetic image dataset generation, focusing specifically on improving the performance of Scene Graph Generation models. The research investigates whether structured symbolic representations in the form of scene graphs can enhance synthetic data quality through explicit encoding of relational constraints. The results demonstrate that Neuro-Symbolic conditioning yields significant improvements of up to +2.59% in standard Recall metrics and +2.83% in No Graph Constraint Recall metrics when used for dataset augmentation. These findings establish that merging Neuro-Symbolic and generative approaches produces synthetic data with complementary structural information that enhances model performance when combined with real data, providing a novel approach to overcome data scarcity limitations even for complex visual reasoning tasks.
[CV-17] UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models ICLR
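[Quick Read]: This paper targets the computational cost and memory footprint of training adapters for large-scale diffusion models. It proposes UniCon, an architecture in which information flows in one direction only, from the diffusion network to the control adapter, so that the adapter alone generates the final output. Because the diffusion model no longer has to compute and store gradients during adapter training, UniCon cuts GPU memory usage by one-third and increases training speed by 2.3 times at the same adapter parameter size. Without extra computational resources, it also allows training adapters with twice the parameter count of existing ControlNets, and it shows precise responsiveness to control inputs and strong generation quality.
Link: https://arxiv.org/abs/2503.17221
Authors: Fanghua Yu, Jinjin Gu, Jinfan Hu, Zheyuan Li, Chao Dong
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted for publication at the International Conference on Learning Representations (ICLR) 2025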
Abstract:We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image conditional generation tasks, UniCon has demonstrated precise responsiveness to control inputs and exceptional generation capabilities.
[CV-18] PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
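[Quick Read]: This paper addresses the difficulty that document layout detection models have in generalizing across diverse document types, handling complex layouts, and running in real time on large-scale data. It presents PP-DocLayout, an efficient model that recognizes 23 types of layout regions across diverse document formats. Three model scales are offered (PP-DocLayout-L, PP-DocLayout-M, and PP-DocLayout-S) to suit different needs; the high-precision PP-DocLayout-L is based on the RT-DETR-L detector and reaches 90.4% mAP@0.5 with an end-to-end inference time of 13.4 ms per page on a T4 GPU. Beyond advancing layout analysis, the work provides a robust way to construct high-quality training data for document intelligence and multimodal AI systems.
Link: https://arxiv.org/abs/2503.17213
Authors: Ting Sun, Cheng Cui, Yuning Du, Yi Liu
Affiliations: PaddlePaddle Team, Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Github Repo: this https URL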
Abstract:Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at this https URL .
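The mAP@0.5 figures count a predicted layout region as correct when its intersection-over-union (IoU) with a ground-truth region is at least 0.5. The sketch below shows only that matching criterion, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function names are hypothetical and this is not the PP-DocLayout evaluation code.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth region when IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold


# Hypothetical title block: the prediction covers 80% of the union.
print(is_true_positive((0, 0, 10, 10), (0, 0, 10, 8)))
```

A full mAP computation would additionally rank predictions by confidence per class and integrate the precision-recall curve; that part is omitted here.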
[CV-19] A Deep Learning Framework for Visual Attention Prediction and Analysis of News Interfaces
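[Quick Read]: News outlets compete for user attention in news interfaces, and this paper addresses the lack of demographic awareness in existing saliency prediction models. Despite recent progress in saliency detection for user interfaces (UI), existing datasets are limited in size and demographic representation. The paper proposes a deep learning framework that enhances the SaRa (Saliency Ranking) model with DeepGaze IIE, improving Salient Object Ranking (SOR) performance by 10.7%. The framework optimizes three components: saliency map generation, grid segment scoring, and map normalization. A two-part experiment using eye-tracking (30 participants) and mouse-tracking (375 participants aged 13–70) reveals attention patterns across demographic groups and validates mouse-tracking data for large-scale studies. The authors conclude that saliency studies should collect larger, demographically representative samples and report exact demographic distributions.
Link: https://arxiv.org/abs/2503.17212
Authors: Matthew Kenely, Dylan Seychell, Carl James Debono, Chris Porter
Affiliations: University of Malta
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: This is a preprint submitted to the 2025 IEEE Conference on Artificial Intelligence (CAI)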
Abstract:News outlets’ competition for attention in news interfaces has highlighted the need for demographically-aware saliency prediction models. Despite recent advancements in saliency detection applied to user interfaces (UI), existing datasets are limited in size and demographic representation. We present a deep learning framework that enhances the SaRa (Saliency Ranking) model with DeepGaze IIE, improving Salient Object Ranking (SOR) performance by 10.7%. Our framework optimizes three key components: saliency map generation, grid segment scoring, and map normalization. Through a two-fold experiment using eye-tracking (30 participants) and mouse-tracking (375 participants aged 13–70), we analyze attention patterns across demographic groups. Statistical analysis reveals significant age-based variations (p < 0.05, \epsilon^2 = 0.042), with older users (36–70) engaging more with textual content and younger users (13–35) interacting more with images. Mouse-tracking data closely approximates eye-tracking behavior (sAUC = 0.86) and identifies UI elements that immediately stand out, validating its use in large-scale studies. We conclude that saliency studies should prioritize gathering data from a larger, demographically representative sample and report exact demographic distributions.
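Grid segment scoring, one of the three components named above, can be pictured as cutting a saliency map into a grid and ranking the cells. The sketch below assumes the simplest possible score, the mean saliency of each cell; the abstract does not specify the scoring rule the framework actually uses, so the function and the toy map are hypothetical.

```python
def rank_grid_segments(saliency, rows, cols):
    """Split a 2D saliency map into rows x cols segments, ranked by mean saliency.

    saliency: list of equal-length lists of floats (height x width).
    Returns [((row, col), mean_saliency), ...] from most to least salient.
    Assumes the map has at least `rows` rows and `cols` columns.
    """
    h, w = len(saliency), len(saliency[0])
    scores = []
    for r in range(rows):
        for c in range(cols):
            # Pixel bounds of this grid cell.
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            cells = [saliency[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            scores.append(((r, c), sum(cells) / len(cells)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical 4x4 map whose top-right quadrant (e.g. a headline image) stands out.
toy_map = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.0],
    [0.2, 0.2, 0.0, 0.0],
]
for cell, score in rank_grid_segments(toy_map, rows=2, cols=2):
    print(cell, round(score, 2))
```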
[CV-20] Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising CVPR
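[Quick Read]: This paper examines the security of non-transferable learning (NTL) models deployed as black-box systems, asking whether such deployments can be trusted. Prior work showed that fine-tuning an NTL model can restore unauthorized-domain performance in the white-box setting, but that attack modifies model weights and therefore fails in the black-box setting. The paper reveals the first loophole of black-box NTL models and proposes a new attack, JailNTL, which breaks the "non-transferable barrier" through test-time data disguising.
JailNTL disguises data at two levels: (i) data-intrinsic disguising (DID), which eliminates the discrepancy between authorized and unauthorized domains while preserving class-related content at the input level; and (ii) model-guided disguising (MGD), which reduces differences in the output-level statistics of the NTL model. When attacking state-of-the-art NTL models in the black-box scenario with only 1% of authorized samples, JailNTL raises unauthorized-domain accuracy by up to 55.7%, largely exceeding existing state-of-the-art white-box attacks.
Link: https://arxiv.org/abs/2503.17198
Authors: Yongli Xiang, Ziming Hong, Lina Yao, Dadong Wang, Tongliang Liu
Affiliations: Sydney AI Centre, The University of Sydney; Data61, CSIRO; The University of New South Wales
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code is released at this https URL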
Abstract:Non-transferable learning (NTL) has been proposed to protect model intellectual property (IP) by creating a “non-transferable barrier” to restrict generalization from authorized to unauthorized domains. Recently, a well-designed attack, which restores the unauthorized-domain performance by fine-tuning NTL models on a few authorized samples, has highlighted the security risks of NTL-based applications. However, such an attack requires modifying model weights and is therefore invalid in the black-box scenario. This raises a critical question: can we trust the security of NTL models deployed as black-box systems? In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. Specifically, JailNTL encourages unauthorized-domain disguising at two levels: (i) data-intrinsic disguising (DID) for eliminating domain discrepancy and preserving class-related content at the input level, and (ii) model-guided disguising (MGD) for mitigating output-level statistics differences of the NTL model. Empirically, when attacking state-of-the-art (SOTA) NTL models in the black-box scenario, JailNTL achieves an accuracy increase of up to 55.7% in the unauthorized domain by using only 1% of authorized samples, largely exceeding existing SOTA white-box attacks.
[CV-21] FreeUV: Ground-Truth-Free Realistic Facial UV Texture Recovery via Cross-Assembly Inference Strategy CVPR2025
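[Quick Read]: This paper tackles the recovery of high-quality 3D facial textures from single-view 2D images, which is especially hard with limited data and complex facial details such as makeup, wrinkles, and occlusions. It proposes FreeUV, a ground-truth-free UV texture recovery framework that needs no annotated or synthetic UV data. FreeUV combines a pre-trained Stable Diffusion model with a Cross-Assembly inference strategy: separate networks are trained independently to focus on realistic appearance and on structural consistency, and they are combined at inference time to generate coherent textures. The approach captures intricate facial features accurately and remains robust across diverse poses and occlusions.
Link: https://arxiv.org/abs/2503.17197
Authors: Xingchao Yang, Takafumi Taketomi, Yuki Endo, Yoshihiro Kanamori
Affiliations: CyberAgent; University of Tsukuba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project: this https URL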
Abstract:Recovering high-quality 3D facial textures from single-view 2D images is a challenging task, especially under constraints of limited data and complex facial details such as makeup, wrinkles, and occlusions. In this paper, we introduce FreeUV, a novel ground-truth-free UV texture recovery framework that eliminates the need for annotated or synthetic UV data. FreeUV leverages a pre-trained Stable Diffusion model alongside a Cross-Assembly inference strategy to fulfill this objective. In FreeUV, separate networks are trained independently to focus on realistic appearance and structural consistency, and these networks are combined during inference to generate coherent textures. Our approach accurately captures intricate facial features and demonstrates robust performance across diverse poses and occlusions. Extensive experiments validate FreeUV’s effectiveness, with results surpassing state-of-the-art methods in both quantitative and qualitative metrics. Additionally, FreeUV enables new applications, including local editing, facial feature interpolation, and multi-view texture recovery. By reducing data requirements, FreeUV offers a scalable solution for generating high-fidelity 3D facial textures suitable for real-world scenarios.
[CV-22] MSCA-Net: Multi-Scale Context Aggregation Network for Infrared Small Target Detection
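[Quick Read]: This paper addresses infrared small target detection in complex backgrounds, where low contrast and high noise cause crucial details to be lost during feature extraction, and where existing methods integrate global and local information poorly, limiting efficiency and accuracy. It proposes MSCA-Net, a network built from three components: the Multi-Scale Enhanced Detection Attention mechanism (MSEDA), the Positional Convolutional Block Attention Module (PCBAM), and the Channel Aggregation Block (CAB). MSEDA uses multi-scale feature fusion attention to adaptively aggregate information across scales and enrich the feature representation; PCBAM captures the correlation between global and local features with a correlation-matrix-based strategy, enabling deep feature interaction; CAB redistributes input feature channels so that useful features are passed on efficiently. MSCA-Net attains mIoU scores of 78.43%, 94.56%, and 67.08% on the NUAA-SIRST, NUDT-SIRST, and IRTSD-1K datasets, respectively.
Link: https://arxiv.org/abs/2503.17193
Authors: Xiaojin Lu, Taoran Yue, Jiaxi Cai, Shibing Chu
Affiliations: School of Physics and Electronic Engineering, Jiangsu University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: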
Abstract:Detecting infrared small targets in complex backgrounds remains a challenging task because of the low contrast and high noise levels inherent in infrared images. These factors often lead to the loss of crucial details during feature extraction. Moreover, existing detection methods have limitations in adequately integrating global and local information, which constrains the efficiency and accuracy of infrared small target detection. To address these challenges, this paper proposes a novel network architecture named MSCA-Net, which integrates three key components: Multi-Scale Enhanced Detection Attention mechanism (MSEDA), Positional Convolutional Block Attention Module (PCBAM), and Channel Aggregation Block (CAB). Specifically, MSEDA employs a multi-scale feature fusion attention mechanism to adaptively aggregate information across different scales, enriching feature representation. PCBAM captures the correlation between global and local features through a correlation matrix-based strategy, enabling deep feature interaction. Moreover, CAB redistributes input feature channels, facilitating the efficient transmission of beneficial features and further enhancing the model's detection capability in complex backgrounds. The experimental results demonstrate that MSCA-Net achieves outstanding small target detection performance in complex backgrounds. Specifically, it attains mIoU scores of 78.43%, 94.56%, and 67.08% on the NUAA-SIRST, NUDT-SIRST, and IRTSD-1K datasets, respectively, underscoring its effectiveness and strong potential for real-world applications.
[CV-23] D2Fusion: Dual-domain Fusion with Feature Superposition for Deepfake Detection
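[Quick Read]: This paper addresses the insufficient exploration of cross-domain artifact information in existing Deepfake detectors: without intrinsic interaction between domains after feature extraction, complex forgery clues are hard to recognize. It introduces a bi-directional attention module that captures the local positional information of artifact clues in the spatial domain, enabling accurate artifact localization and mitigating coarse processing of artifact features. To capture global, subtle forgery information (e.g., textures or edges), it adds a fine-grained frequency attention module in the frequency domain. Each module improves its own features, but fusing the multi-domain features directly does not improve detection, so the paper proposes a feature superposition strategy that turns feature components into wave-like tokens and updates them based on their phase, amplifying the distinction between authentic and artifact features. The method improves significantly over state-of-the-art (SOTA) methods on five public Deepfake datasets, particularly in capturing abnormalities across different manipulation operations and real-life scenarios.
Link: https://arxiv.org/abs/2503.17184
Authors: Xueqi Qiu, Xingyu Miao, Fan Wan, Haoran Duan, Tejal Shah, Varun Ojha, Yang Long, Rajiv Ranjan
Affiliations: Durham University; Newcastle University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: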
Abstract:Deepfake detection is crucial for curbing the harm it causes to society. However, current Deepfake detection methods fail to thoroughly explore artifact information across different domains due to insufficient intrinsic interactions. These interactions refer to the fusion and coordination after feature extraction processes across different domains, which are crucial for recognizing complex forgery clues. Focusing on more generalized Deepfake detection, in this work, we introduce a novel bi-directional attention module to capture the local positional information of artifact clues from the spatial domain. This enables accurate artifact localization, thus addressing the coarse processing with artifact features. To further address the limitation that the proposed bi-directional attention module may not well capture global subtle forgery information in the artifact feature (e.g., textures or edges), we employ a fine-grained frequency attention module in the frequency domain. By doing so, we can obtain high-frequency information in the fine-grained features, which contains the global and subtle forgery information. Although these features from the diverse domains can be effectively and independently improved, fusing them directly does not effectively improve the detection performance. Therefore, we propose a feature superposition strategy that complements information from spatial and frequency domains. This strategy turns the feature components into the form of wave-like tokens, which are updated based on their phase, such that the distinctions between authentic and artifact features can be amplified. Our method demonstrates significant improvements over state-of-the-art (SOTA) methods on five public Deepfake datasets in capturing abnormalities across different manipulated operations and real-life.
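To see why phase matters when features are superposed as waves, the sketch below treats each token as a complex number with an amplitude and a phase and adds a spatial token to a frequency token. This is the generic physics of wave superposition, shown under the assumption that tokens can be modelled this way; the abstract does not give the paper's actual formulation, and the function names are hypothetical.

```python
import cmath
import math


def to_wave(amplitude, phase):
    """Represent a token as a complex wave: amplitude * exp(i * phase)."""
    return amplitude * cmath.exp(1j * phase)


def superpose(spatial, frequency):
    """Add paired wave tokens and return the magnitudes of the results.

    spatial, frequency: lists of (amplitude, phase) pairs of equal length.
    """
    return [abs(to_wave(*s) + to_wave(*f)) for s, f in zip(spatial, frequency)]


# In-phase tokens reinforce each other; opposite-phase tokens cancel.
print(superpose([(1.0, 0.0)], [(1.0, 0.0)]))
print(superpose([(1.0, 0.0)], [(1.0, math.pi)]))
```

The same two amplitudes therefore yield very different fused magnitudes depending on phase, which is the kind of contrast a phase-based update can exploit.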
[CV-24] Radar-Guided Polynomial Fitting for Metric Depth Estimation
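[Quick Read]: Pretrained monocular depth estimation (MDE) models produce scaleless depth maps, and this paper addresses how to convert them into metric depth. Unlike existing approaches that rely on complex architectures or expensive sensors, the proposed method, PolyRad, introduces polynomial fitting: polynomial coefficients are predicted from cheap, ubiquitous radar data and used to adjust depth predictions non-uniformly across depth ranges. Whereas a scale-and-shift approach supports only a linear transformation, PolyRad generalizes beyond it by introducing inflection points, which lets it correct misalignments between local regions. A novel training objective that enforces monotonicity through first-derivative regularization preserves structural consistency. PolyRad achieves state-of-the-art performance on the nuScenes, ZJU-4DRadarCam, and View-of-Delft datasets, outperforming existing methods by 30.3% in mean absolute error (MAE) and 37.2% in root mean squared error (RMSE).
Link: https://arxiv.org/abs/2503.17182
Authors: Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong
Affiliations: Yale University; University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: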
Abstract:We propose PolyRad, a novel radar-guided depth estimation method that introduces polynomial fitting to transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a simple yet fundamental insight: using polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust depth predictions non-uniformly across depth ranges. Although MDE models often infer reasonably accurate local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale-and-shift transformation insufficient given three or more of these regions. In contrast, PolyRad generalizes beyond linear transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces monotonicity via first-derivative regularization. PolyRad achieves state-of-the-art performance on the nuScenes, ZJU-4DRadarCam, and View-of-Delft datasets, outperforming existing methods by 30.3% in MAE and 37.2% in RMSE.
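The monotonicity constraint can be illustrated with a small sketch: a polynomial maps scaleless depth to metric depth, and a penalty is charged wherever its first derivative is negative, since a decreasing mapping would flip the depth ordering of pixels. The hinge form of the penalty, the sampling of depths, and all names are assumptions for illustration; the abstract states only that monotonicity is enforced via first-derivative regularization.

```python
def poly_eval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the first derivative of the polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


def monotonicity_penalty(coeffs, samples):
    """Mean hinge penalty on negative first derivatives at sampled depths."""
    deriv = poly_derivative(coeffs)
    return sum(max(0.0, -poly_eval(deriv, x)) for x in samples) / len(samples)


# Hypothetical coefficients: the first mapping is increasing on [0, 1],
# the second turns downward past x = 0.5 and is penalised.
samples = [0.0, 0.5, 1.0]
print(monotonicity_penalty([0.0, 2.0, 0.5], samples))
print(monotonicity_penalty([0.0, 1.0, -1.0], samples))
```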
[CV-25] Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection
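[Quick Read]: This paper addresses the trade-off between perception performance and communication cost in multi-agent collaborative perception. In practice, limited communication bandwidth restricts how much data agents can exchange, which degrades collaborative perception. The paper proposes Which2comm, a framework that exchanges object-level sparse features. By integrating the semantic information of objects into 3D detection boxes, it introduces semantic detection boxes (SemDBs); transmitting these information-rich sparse features between agents sharply reduces communication volume and also improves 3D object detection performance. A fully sparse network extracts SemDBs from each agent, and a temporal fusion approach with a relative temporal encoding mechanism yields comprehensive spatiotemporal features.
Link: https://arxiv.org/abs/2503.17175
Authors: Duanrui Yu, Jing You, Xin Pei, Anqi Qu, Dingyu Wang, Shaocheng Jia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: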
Abstract:Collaborative perception allows real-time inter-agent information exchange and thus offers invaluable opportunities to enhance the perception capabilities of individual agents. However, limited communication bandwidth in practical scenarios restricts the inter-agent data transmission volume, consequently resulting in performance declines in collaborative perception systems. This implies a trade-off between perception performance and communication cost. To address this issue, we propose Which2comm, a novel multi-agent 3D object detection framework leveraging object-level sparse features. By integrating semantic information of objects into 3D object detection boxes, we introduce semantic detection boxes (SemDBs). Transmitting these information-rich object-level sparse features among agents not only significantly reduces the required communication volume, but also improves 3D object detection performance. Specifically, a fully sparse network is constructed to extract SemDBs from individual agents; a temporal fusion approach with a relative temporal encoding mechanism is utilized to obtain the comprehensive spatiotemporal features. Extensive experiments on the V2XSet and OPV2V datasets demonstrate that Which2comm consistently outperforms other state-of-the-art methods on both perception performance and communication cost, exhibiting better robustness to real-world latency. These results show that for multi-agent collaborative 3D object detection, transmitting only object-level sparse features is sufficient to achieve high-precision and robust performance.
[CV-26] Hi-ALPS – An Experimental Robustness Quantification of Six LiDAR-based Object Detection Systems for Autonomous Driving
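[Quick Read]: This paper addresses how to evaluate and improve the robustness of point-cloud-based 3D object detection systems (3D OD) under adversarial perturbations. Testing robustness with adversarial examples alone is limited, because it cannot separate the vulnerability of the OD itself from the strength of the attack algorithm that generated the examples. The paper proposes Hi-ALPS (Hierarchical Adversarial-example-based LiDAR Perturbation Level System), whose perturbation levels become progressively harder for the OD to withstand, allowing a more complete robustness assessment. Hi-ALPS combines a heuristic with established adversarial example approaches, quantifies the robustness of six state-of-the-art 3D OD under different perturbation types, and discusses the applicability of existing countermeasures along with further suggestions.
Link: https://arxiv.org/abs/2503.17168
Authors: Alexandra Arzberger, Ramin Tavakoli Kolagari
Affiliations: Nuremberg Institute of Technology; Faculty of Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: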
Abstract:Light Detection and Ranging (LiDAR) is an essential sensor technology for autonomous driving as it can capture high-resolution 3D data. As 3D object detection systems (OD) can interpret such point cloud data, they play a key role in the driving decisions of autonomous vehicles. Consequently, such 3D OD must be robust against all types of perturbations and must therefore be extensively tested. One approach is the use of adversarial examples, which are small, sometimes sophisticated perturbations in the input data that change, i.e., falsify, the prediction of the OD. These perturbations are carefully designed based on the weaknesses of the OD. The robustness of the OD cannot be quantified with adversarial examples in general, because if the OD is vulnerable to a given attack, it is unclear whether this is due to the robustness of the OD or whether the attack algorithm produces particularly strong adversarial examples. The contribution of this work is Hi-ALPS – Hierarchical Adversarial-example-based LiDAR Perturbation Level System, where higher robustness of the OD is required to withstand the perturbations as the perturbation levels increase. In doing so, the Hi-ALPS levels successively implement a heuristic followed by established adversarial example approaches. In a series of comprehensive experiments using Hi-ALPS, we quantify the robustness of six state-of-the-art 3D OD under different types of perturbations. The results of the experiments show that none of the OD is robust against all Hi-ALPS levels; an important factor for the ranking is that human observers can still correctly recognize the perturbed objects, as the respective perturbations are small. To increase the robustness of the OD, we discuss the applicability of state-of-the-art countermeasures. In addition, we derive further suggestions for countermeasures based on our experimental results.
[CV-27] CoRLD: Contrastive Representation Learning Of Deformable Shapes In Images
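[Quick Read]: This paper addresses two problems with deformable shape representations: existing methods rely on a known template at test time, which limits flexibility, and they struggle to capture fine-grained, voxel-level differences between similar shapes. It proposes CoRLD (Contrastive Representation Learning of Deformable shapes), a contrastive representation learning framework in learned deformation spaces. The key is a class-aware supervised contrastive objective in the latent deformation space that pulls representations of the same class together and keeps different classes apart, while predicting deformations without a reference image as input, which makes the model more flexible and general. Experiments show that CoRLD extracts deformable shape features that can be combined with existing classifiers to substantially improve classification accuracy.
Link: https://arxiv.org/abs/2503.17162
Authors: Tonmoy Hossain, Miaomiao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: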
Abstract:Deformable shape representations, parameterized by deformations relative to a given template, have proven effective for improved image analysis tasks. However, their broader applicability is hindered by two major challenges. First, existing methods mainly rely on a known template during testing, which is impractical and limits flexibility. Second, they often struggle to capture fine-grained, voxel-level distinctions between similar shapes (e.g., anatomical variations among healthy individuals, those with mild cognitive impairment, and diseased states). To address these limitations, we propose a novel framework - Contrastive Representation Learning of Deformable shapes (CoRLD) in learned deformation spaces and demonstrate its effectiveness in the context of image classification. Our CoRLD leverages a class-aware contrastive supervised learning objective in latent deformation spaces, promoting proximity among representations of similar classes while ensuring separation of dissimilar groups. In contrast to previous deep learning networks that require a reference image as input to predict deformation changes, our approach eliminates this dependency. Instead, template images are utilized solely as ground truth in the loss function during the training process, making our model more flexible and generalizable to a wide range of medical applications. We validate CoRLD on diverse datasets, including real brain magnetic resonance imaging (MRIs) and adrenal shapes derived from computed tomography (CT) scans. Experimental results show that our model effectively extracts deformable shape features, which can be easily integrated with existing classifiers to substantially boost the classification accuracy. Our code is available at GitHub.
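A class-aware contrastive objective of the kind described above can be sketched with the widely used supervised contrastive (SupCon) loss. The version below is a plain-Python illustration over toy 2D vectors that stand in for latent deformation features; it is an assumed stand-in, not the CoRLD loss itself, and the temperature value is arbitrary.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalised embeddings.

    embeddings: list of non-zero vectors (lists of floats); labels: class ids.
    Anchors without any same-class partner are skipped.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a same-class sample.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical features: two tight clusters. Labels that agree with the
# clusters give a much lower loss than labels that cut across them.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(supcon_loss(feats, [0, 0, 1, 1]))
print(supcon_loss(feats, [0, 1, 0, 1]))
```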
[CV-28] D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens
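[Quick Read]: This paper addresses the largely unexplored potential of combining discrete-valued and continuous-valued tokens in image generation. Latent-based generative models depend on an image tokenizer: autoregressive models are scalable and flexible but produce lower-quality images with discrete-valued tokenizers, whereas diffusion models achieve better quality with continuous-valued tokenizers at the cost of efficiency and complexity. The proposed solution, D2C (Discrete-to-Continuous), is a two-stage method: in the first stage, a small discrete-valued generator samples discrete tokens representing coarse-grained image features; in the second stage, continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence, with two kinds of fusion modules designed for seamless interaction. On ImageNet-256 class-conditional image generation, the method outperforms several continuous-valued and discrete-valued generative models.
Link: https://arxiv.org/abs/2503.17155
Authors: Panpan Wang, Liqiang Niu, Fandong Meng, Jinan Xu, Yufeng Chen, Jie Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: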
Abstract:In the domain of image generation, latent-based generative models occupy a dominant status; however, these models rely heavily on an image tokenizer. To meet modeling requirements, autoregressive models possessing the characteristics of scalability and flexibility embrace a discrete-valued tokenizer, but face the challenge of poor image generation quality. In contrast, diffusion models take advantage of the continuous-valued tokenizer to achieve better generation quality but are subject to low efficiency and complexity. The existing hybrid models mainly aim to compensate for information loss and simplify the diffusion learning process. The potential of merging discrete-valued and continuous-valued tokens in the field of image generation has not yet been explored. In this paper, we propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. Then in the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence. In addition, we design two kinds of fusion modules for seamless interaction. On the ImageNet-256 benchmark, extensive experimental results validate that our model achieves superior performance compared with several continuous-valued and discrete-valued generative models on the class-conditional image generation tasks.
[CV-29] Enhancing Steering Estimation with Semantic-Aware GNNs ICCV2025
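[Quick Read]: This paper addresses steering estimation for autonomous driving, which has traditionally relied on 2D image-based models. Its key contribution is to incorporate 3D spatial information through hybrid architectures that combine 3D neural network models with recurrent neural networks (RNNs) for temporal modeling, using LiDAR point clouds as input. To reduce reliance on LiDAR, a pretrained unified model estimates depth from monocular images to reconstruct pseudo-3D point clouds, and the graph construction strategy is optimized so that connections are formed mainly between points of the same semantic class, with only 20% of inter-class connections retained. This reduces computational cost while preserving critical spatial relationships. On the KITTI dataset, the approach improves on 2D-only models by 71%, showing the value of 3D spatial information and efficient graph construction for steering estimation while keeping the cost-effectiveness of monocular images and avoiding the expense of LiDAR systems.
Link: https://arxiv.org/abs/2503.17153
Authors: Fouad Makiyeh, Huy-Dung Nguyen, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus
Affiliations: Hybrid Intelligence part of Capgemini Engineering; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ICCV 2025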
Abstract:Steering estimation is a critical task in autonomous driving, traditionally relying on 2D image-based models. In this work, we explore the advantages of incorporating 3D spatial information through hybrid architectures that combine 3D neural network models with recurrent neural networks (RNNs) for temporal modeling, using LiDAR-based point clouds as input. We systematically evaluate four hybrid 3D models, all of which outperform the 2D-only baseline, with the Graph Neural Network (GNN) - RNN model yielding the best results. To reduce reliance on LiDAR, we leverage a pretrained unified model to estimate depth from monocular images, reconstructing pseudo-3D point clouds. We then adapt the GNN-RNN model, originally designed for LiDAR-based point clouds, to work with these pseudo-3D representations, achieving comparable or even superior performance compared to the LiDAR-based model. Additionally, the unified model provides semantic labels for each point, enabling a more structured scene representation. To further optimize graph construction, we introduce an efficient connectivity strategy where connections are predominantly formed between points of the same semantic class, with only 20% of inter-class connections retained. This targeted approach reduces graph complexity and computational cost while preserving critical spatial relationships. Finally, we validate our approach on the KITTI dataset, achieving a 71% improvement over 2D-only models. Our findings highlight the advantages of 3D spatial information and efficient graph construction for steering estimation, while maintaining the cost-effectiveness of monocular images and avoiding the expense of LiDAR-based systems.
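The connectivity strategy can be sketched as a filter over candidate edges: every edge between points of the same semantic class is kept, and only a fraction of the edges that cross classes survives. The abstract gives the 20% figure but not how the retained inter-class edges are chosen, so the uniform random sampling below, the candidate edge list, and the function name are all assumptions.

```python
import random


def build_semantic_edges(candidate_edges, labels, inter_class_keep=0.2, seed=0):
    """Keep all same-class edges and a random fraction of inter-class edges.

    candidate_edges: list of (u, v) point-index pairs, e.g. from a k-NN search.
    labels: semantic class of each point, indexed by point id.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    intra = [(u, v) for u, v in candidate_edges if labels[u] == labels[v]]
    inter = [(u, v) for u, v in candidate_edges if labels[u] != labels[v]]
    kept = rng.sample(inter, round(inter_class_keep * len(inter)))
    return intra + kept


# Hypothetical scene: three road points, three car points, fully connected.
labels = ["road", "road", "road", "car", "car", "car"]
candidates = [(i, j) for i in range(6) for j in range(i + 1, 6)]
edges = build_semantic_edges(candidates, labels)
print(len(candidates), "candidate edges ->", len(edges), "kept")
```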
[CV-30] Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models CVPR2025
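[Quick Read]: This paper investigates compositionality in the image domain of Vision-Language Models (VLMs), specifically whether the visual embedding space of pre-trained VLMs shows structured compositional patterns like those found in natural language representations. Analyzing the compositional properties of visual data is complicated by noise and sparsity, so the paper proposes Geodesically Decomposable Embeddings (GDE), a framework that approximates image representations with geometry-aware compositional structures in the latent space. GDE outperforms its counterpart that assumes linear geometry of the latent space on compositional classification, and on group robustness it achieves higher results than task-specific solutions. The findings indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable.
Link: https://arxiv.org/abs/2503.17142
Authors: Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci, Nicola Strisciuglio
Affiliations: Fondazione Bruno Kessler; University of Trento; University of Twente
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Camera-ready version for CVPR 2025 (with this http URL .)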
Abstract:Vision-Language Models (VLMs) learn a shared feature space for text and images, enabling the comparison of inputs of different modalities. While prior works demonstrated that VLMs organize natural language representations into regular structures encoding composite meanings, it remains unclear if compositional patterns also emerge in the visual embedding space. In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. We address these problems and propose a framework, called Geodesically Decomposable Embeddings (GDE), that approximates image representations with geometry-aware compositional structures in the latent space. We demonstrate that visual embeddings of pre-trained VLMs exhibit a compositional arrangement, and evaluate the effectiveness of this property in the tasks of compositional classification and group robustness. GDE achieves stronger performance in compositional classification compared to its counterpart method that assumes linear geometry of the latent space. Notably, it is particularly effective for group robustness, where we achieve higher results than task-specific solutions. Our results indicate that VLMs can automatically develop a human-like form of compositional reasoning in the visual domain, making their underlying processes more interpretable. Code is available at this https URL.
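The contrast between geometry-aware and linear composition can be illustrated on the unit sphere, assuming (as is common for VLM embeddings, though not stated in the abstract) that embeddings are L2-normalised. The sketch compares a straight-line average, which leaves the sphere, with a point on the geodesic (great-circle arc) between two vectors. It shows the general idea of geodesic interpolation only and is not the GDE algorithm.

```python
import math


def slerp(u, v, t):
    """Point at fraction t along the great-circle arc from unit vector u to v.

    Not defined for exactly opposite (antipodal) vectors.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between the two vectors
    if omega < 1e-12:
        return list(u)  # vectors coincide, nothing to interpolate
    s = math.sin(omega)
    wu, wv = math.sin((1 - t) * omega) / s, math.sin(t * omega) / s
    return [wu * a + wv * b for a, b in zip(u, v)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


u, v = [1.0, 0.0], [0.0, 1.0]
linear_mid = [(a + b) / 2 for a, b in zip(u, v)]
geodesic_mid = slerp(u, v, 0.5)
print(norm(linear_mid))    # shorter than 1: the linear midpoint leaves the sphere
print(norm(geodesic_mid))  # stays on the unit sphere (up to rounding)
```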
[CV-31] Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition
[Quick Read]: This paper studies the combination of event-based cameras and spiking neural networks (SNNs) for privacy-preserving human action recognition (HAR), and in particular the limited ability of SNNs to process long-term temporal information, which is essential for precise HAR. It proposes two frameworks: the temporal segment-based SNN (TS-SNN) and the 3D convolutional SNN (3D-SNN). TS-SNN extracts long-term temporal information by dividing actions into shorter segments, while 3D-SNN replaces 2D spatial elements with 3D components to facilitate the transmission of temporal information. Both improve the handling of long-range temporal information and thereby event-based HAR performance.
Link: https://arxiv.org/abs/2503.17132
Authors: Siyuan Yang, Shilin Lu, Shizheng Wang, Meng Hwa Er, Zengwei Zheng, Alex C. Kot
Affiliations: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore; Institute of Microelectronics, Chinese Academy of Sciences, China; Department of Computer Science and Computing, Zhejiang University City College, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:This paper explores the promising interplay between spiking neural networks (SNNs) and event-based cameras for privacy-preserving human action recognition (HAR). The unique feature of event cameras in capturing only the outlines of motion, combined with SNNs’ proficiency in processing spatiotemporal data through spikes, establishes a highly synergistic compatibility for event-based HAR. Previous studies, however, have been limited by SNNs’ ability to process long-term temporal information, essential for precise HAR. In this paper, we introduce two novel frameworks to address this: temporal segment-based SNN (TS-SNN) and 3D convolutional SNN (3D-SNN). The TS-SNN extracts long-term temporal information by dividing actions into shorter segments, while the 3D-SNN replaces 2D spatial elements with 3D components to facilitate the transmission of temporal information. To promote further research in event-based HAR, we create a dataset, FallingDetection-CeleX, collected using the high-resolution CeleX-V event camera (1280 × 800), comprising 7 distinct actions. Extensive experimental results show that our proposed frameworks surpass state-of-the-art SNN methods on our newly collected dataset and three other neuromorphic datasets, showcasing their effectiveness in handling long-range temporal information for event-based HAR.
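As a rough illustration of the temporal-segment idea behind TS-SNN, the sketch below splits a sequence of event frames into a fixed number of contiguous segments. The function name `split_into_segments` and the near-equal split rule are assumptions made for illustration; the abstract does not say how the authors segment actions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(frames: Sequence[T], num_segments: int) -> List[List[T]]:
    """Split a frame sequence into contiguous, near-equal temporal segments."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    base, extra = divmod(len(frames), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments each take one additional frame.
        end = start + base + (1 if i < extra else 0)
        segments.append(list(frames[start:end]))
        start = end
    return segments


# Ten placeholder "frames" (integers here) split into three segments.
segments = split_into_segments(list(range(10)), 3)
```

Each segment could then be processed on its own before the per-segment outputs are combined; how that combination is done is not described in the abstract.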
[CV-32] R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception ICCV2025
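[Quick read]: This paper addresses the weakness of roadside perception systems for autonomous driving in detecting Vulnerable Road Users (VRUs) under extreme lighting conditions. LiDAR and visual (RGB) sensors are widely used, but thermal imaging remains underrepresented in existing datasets despite its advantages in poor lighting. The key contribution is the R-LiViT dataset, the first to combine LiDAR, RGB, and thermal imaging from a roadside perspective with a focus on VRUs. Recorded at three intersections during both day and night, it contains 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across more than 150 traffic scenarios, with 6 and 8 annotated classes respectively, making it a resource for tasks such as object detection and tracking.
Link: https://arxiv.org/abs/2503.17122
Authors: Jonas Mirlach, Lei Wan, Andreas Wiedholz, Hannan Ejaz Keen, Andreas Eich
Affiliations: XITASO GmbH; LiangDao GmbH
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, submitted to ICCV2025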
Abstract:In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of Vulnerable Road Users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across over 150 traffic scenarios, with 6 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available.
[CV-33] The CASTLE 2024 Dataset: Advancing the Art of Multimodal Understanding
[Quick read]: This paper addresses the limitation that most existing egocentric video datasets cover only a single perspective, and proposes a multimodal alternative. The key contribution is the CASTLE 2024 dataset, which contains first- and third-person video and audio from 15 time-aligned sources (the point of view of 10 volunteer participants plus 5 fixed cameras). It comprises over 600 hours of UHD video and contains no partial censoring such as blurred faces or distorted audio, providing a complete and diverse resource for research.
Link: https://arxiv.org/abs/2503.17116
Authors: Luca Rossetto, Werner Bailer, Duc-Tien Dang-Nguyen, Graham Healy, Björn Þór Jónsson, Onanong Kongmeesub, Hoang-Bao Le, Stevan Rudinac, Klaus Schöffmann, Florian Spiess, Allie Tran, Minh-Triet Tran, Quang-Linh Tran, Cathal Gurrin
Affiliations: Dublin City University; JOANNEUM RESEARCH; University of Bergen; Dublin City University; Reykjavik University; Dublin City University; Dublin City University; University of Amsterdam; Klagenfurt University; University of Basel; Dublin City University; VNU Ho Chi Minh University of Science; Dublin City University; Dublin City University
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 7 pages, 6 figures, dataset available via this https URL
Abstract:Egocentric video has seen increased interest in recent years, as it is used in a range of areas. However, most existing datasets are limited to a single perspective. In this paper, we present the CASTLE 2024 dataset, a multimodal collection containing ego- and exo-centric (i.e., first- and third-person perspective) video and audio from 15 time-aligned sources, as well as other sensor streams and auxiliary data. The dataset was recorded by volunteer participants over four days in a fixed location and includes the point of view of 10 participants, with an additional 5 fixed cameras providing an exocentric perspective. The entire dataset contains over 600 hours of UHD video recorded at 50 frames per second. In contrast to other datasets, CASTLE 2024 does not contain any partial censoring, such as blurred faces or distorted audio. The dataset is available via this https URL.
[CV-34] Beyond Accuracy: What Matters in Designing Well-Behaved Models?
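[Quick read]: This paper addresses the fact that deep neural networks (DNNs), despite strong predictive performance, often fall short in other quality dimensions such as robustness, calibration, and fairness, and that existing studies cover only a subset of these dimensions rather than a more general notion of “well-behavedness”. It presents a large-scale study that analyzes nine quality dimensions simultaneously, evaluating 326 backbone models and examining how training paradigms and model architectures affect each dimension. It also introduces the QUBA score (Quality Understanding Beyond Accuracy), a metric that ranks models across multiple quality dimensions to enable tailored recommendations.
Link: https://arxiv.org/abs/2503.17110
Authors: Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code: this https URL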
Abstract:Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of “well-behavedness” of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird’s-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
[CV-35] Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval CVPR2025
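[Quick read]: This paper targets a key challenge in Zero-Shot Composed Image Retrieval (ZS-CIR): retrieving the target image accurately from a reference image and manipulation text when the reference image is missing essential target content. The core question is how to predict that missing visual content so that retrieval can succeed.
To this end, the paper proposes PrediCIR, a prediction-based mapping network that adaptively predicts the missing target visual content of the reference image in the latent space before mapping. First, a world view generation module constructs a source view by omitting certain visual content of a target view, paired with an action that carries the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor that, guided by the user intent expressed in the manipulation text, adaptively predicts the missing visual information in the latent space. Finally, the image together with the predicted information is mapped to a pseudo-word token without extra supervision. The model generalizes well across six ZS-CIR tasks and improves on the best prior methods by 1.73% to 4.45%, setting a new state of the art.
Link: https://arxiv.org/abs/2503.17109
Authors: Yuanmin Tang, Jing Yu, Keke Gai, Jiamin Zhuang, Gang Xiong, Gaopeng Gou, Qi Wu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Information Engineering, Minzu University of China; Beijing Institute of Technology; University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted to CVPR 2025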
Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent across domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to modify a reference image according to manipulation text to accurately retrieve a target image, especially when the reference image is missing essential target content. In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. Specifically, a world view generation module first constructs a source view by omitting certain visual content of a target view, coupled with an action that includes the manipulation intent derived from existing image-caption pairs. Then, a target content prediction module trains a world model as a predictor to adaptively predict the missing visual information guided by user intention in manipulating text at the latent space. The two modules map an image with the predicted relevant information to a pseudo-word token without extra supervision. Our model shows strong generalization ability on six ZS-CIR tasks. It obtains consistent and significant performance boosts ranging from 1.73% to 4.45% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at this https URL.
[CV-36] GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects
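[Quick read]: This paper addresses the incomplete and inaccurate depth measurements of transparent and specular objects, which pose significant challenges for downstream robotics tasks. It proposes a geometry-aware assisted depth completion method that focuses on the 3D structural cues of the scene. Besides extracting 2D features from the RGB-D input, the method back-projects the input depth to a point cloud and builds a 3D branch that extracts hierarchical scene-level 3D structural features. Several gated cross-modal fusion modules propagate the multi-level 3D geometric features to the image branch, and an adaptive correlation aggregation strategy assigns 3D features to their corresponding 2D features. Experiments on the ClearGrasp, OOD, TransCG, and STD datasets show that the method outperforms other state-of-the-art methods and significantly improves downstream robotic grasping.
Link: https://arxiv.org/abs/2503.17106
Authors: Yizhe Liu, Tong Jia, Da Cai, Hao Wang, Dongyue Chen
Affiliations: College of Information Science and Engineering, Northeastern University, Shenyang 110819, China; National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Shenyang 110819, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: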
Abstract:Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to the unique optical properties, the depth information on these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. Therefore, it is crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Due to the poor-texture characteristics of transparent and specular objects, these methods that rely heavily on color information tend to generate structure-less depth predictions. Moreover, these 2D methods cannot effectively explore the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from RGB-D input, we back-project the input depth to a point cloud and build the 3D branch to extract hierarchical scene-level 3D structural features. To exploit 3D geometric information, we design several gated cross-modal fusion modules to effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
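The back-projection step mentioned in the abstract (input depth to point cloud) can be sketched with the standard pinhole camera model. This is a minimal illustration, not the authors' code: the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that a non-positive depth marks a missing reading are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column
            if z <= 0.0:                 # assumed marker for a missing depth reading
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one missing reading and made-up intrinsics.
depth_map = [[1.0, 0.0],
             [2.0, 2.0]]
cloud = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Pixels without a valid reading, which are common on transparent and specular surfaces, simply produce no point here, so the resulting cloud is sparse exactly where depth completion is needed.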
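[CV-37] R2LDM: An Efficient 4D Radar Super-Resolution Framework Leveraging Diffusion Model
[Quick read]: This paper addresses 4D radar point cloud super-resolution, that is, generating denser and more accurate LiDAR-like point clouds from raw radar data. The proposed R2LDM represents both LiDAR and radar point clouds with voxel features rather than range images or bird’s eye view (BEV) images, which captures 3D shape information more effectively. It introduces a Latent Voxel Diffusion Model (LVDM) that runs the diffusion process in the latent space, and a Latent Point Cloud Reconstruction (LPCR) module that reconstructs point clouds from high-dimensional latent voxel features. Together these enhance radar point clouds and improve downstream tasks, with up to 31.7% improvement in point cloud registration recall and 24.9% improvement in object detection accuracy.
Link: https://arxiv.org/abs/2503.17097
Authors: Boyuan Zheng, Shouyi Lu, Renbo Huang, Minqing Huang, Fan Lu, Wei Tian, Guirong Zhuo, Lu Xiong
Affiliations: School of Automotive Studies, Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: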
Abstract:We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird’s eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. Additionally, a novel Latent Point Cloud Reconstruction (LPCR) module is utilized to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, achieving up to 31.7% improvement in point cloud registration recall rate and 24.9% improvement in object detection accuracy.
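To make the voxel representation concrete, here is a small sketch that groups points into a regular voxel grid and uses the per-voxel centroid as a simple feature. The names `voxelize` and `voxel_centroids`, the grid anchored at the origin, and the centroid feature are illustrative assumptions; the abstract does not describe how R2LDM computes its voxel features.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float, float]
VoxelIndex = Tuple[int, int, int]


def voxelize(points: Iterable[Point], voxel_size: float) -> Dict[VoxelIndex, List[Point]]:
    """Group points by the integer index of the voxel that contains them."""
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")
    voxels: Dict[VoxelIndex, List[Point]] = defaultdict(list)
    for x, y, z in points:
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def voxel_centroids(voxels: Dict[VoxelIndex, List[Point]]) -> Dict[VoxelIndex, Point]:
    """One simple per-voxel feature: the mean position of the points inside."""
    return {key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
            for key, pts in voxels.items()}


points = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.0, -0.1)]
grid = voxelize(points, voxel_size=0.5)
features = voxel_centroids(grid)
```

Unlike a range image or a BEV projection, this grid keeps all three spatial axes, which is the property the abstract credits for capturing 3D shape.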
[CV-38] Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
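[Quick read]: This paper addresses person re-identification (ReID) under multi-modality and multi-platform conditions. Conventional methods are limited to single-modality sensor data from static cameras and struggle with the multi-modal signals (such as RGB, infrared, and thermal imaging) that are increasingly common in real-world scenarios. The paper builds a new benchmark dataset, MP-ReID, and on top of it proposes the Uni-Prompt ReID framework, which uses specifically designed prompts for cross-modality and cross-platform ReID. The method outperforms state-of-the-art approaches and provides a foundation for ReID research in complex, dynamic environments.
Link: https://arxiv.org/abs/2503.17096
Authors: Ruiyang Ha, Songyi Jiang, Bin Li, Bikang Pan, Yihang Zhu, Junjie Zhang, Xiatian Zhu, Shaogang Gong, Jingya Wang
Affiliations: ShanghaiTech University; The Xi’an Jiaotong-Liverpool University; University of Surrey; Queen Mary University London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: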
Abstract:Conventional person re-identification (ReID) research is often limited to single-modality sensor data from static cameras, which fails to address the complexities of real-world scenarios where multi-modal signals are increasingly prevalent. For instance, consider an urban ReID system integrating stationary RGB cameras, nighttime infrared sensors, and UAVs equipped with dynamic tracking capabilities. Such systems face significant challenges due to variations in camera perspectives, lighting conditions, and sensor modalities, hindering effective person ReID. To address these challenges, we introduce the MP-ReID benchmark, a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark uniquely compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging, captured by both UAVs and ground-based cameras in indoor and outdoor environments. Building on this benchmark, we introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios. Our method consistently outperforms state-of-the-art approaches, establishing a robust foundation for future research in complex and dynamic ReID environments. Our dataset is available at: this https URL.
[CV-39] FFaceNeRF: Few-shot Face Editing in Neural Radiance Fields CVPR2025
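[Quick read]: This paper addresses the limited user control in existing mask-based 3D face editing methods, which stems from their use of pre-trained segmentation masks. Although methods based on Neural Radiance Fields (NeRF) produce high-quality edited images, their fixed mask layouts restrict user control, and adapting to a desired mask layout requires an extensive training dataset that is hard to collect. The paper proposes FFaceNeRF, a NeRF-based face editing technique. Its key components are a geometry adapter with feature injection, which enables effective manipulation of geometry attributes, and latent mixing for tri-plane augmentation, which enables training with only a few samples and thus rapid adaptation to a desired mask layout. The approach improves flexibility and user control while preserving image quality, which matters for customized, high-fidelity 3D face editing in areas such as personalized medical imaging and creative face editing.
Link: https://arxiv.org/abs/2503.17095
Authors: Kwan Yun, Chaelin Kim, Hangyeul Shin, Junyong Noh
Affiliations: KAIST, Visual Media Lab; Handong Global University
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025, 11 pages, 14 figures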
Abstract:Recent 3D face editing methods using masks have produced high-quality edited images by leveraging Neural Radiance Fields (NeRF). Despite their impressive performance, existing methods often provide limited user control due to the use of pre-trained segmentation masks. To utilize masks with a desired layout, an extensive training dataset is required, which is challenging to gather. We present FFaceNeRF, a NeRF-based face editing technique that can overcome the challenge of limited user control due to the use of fixed mask layouts. Our method employs a geometry adapter with feature injection, allowing for effective manipulation of geometry attributes. Additionally, we adopt latent mixing for tri-plane augmentation, which enables training with a few samples. This facilitates rapid model adaptation to desired mask layouts, crucial for applications in fields like personalized medical imaging or creative face editing. Our comparative evaluations demonstrate that FFaceNeRF surpasses existing mask based face editing methods in terms of flexibility, control, and generated image quality, paving the way for future advancements in customized and high-fidelity 3D face editing. The code is available on the project page: this https URL.
[CV-40] ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration CVPR2025
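[Quick read]: This paper addresses registration, that is, estimating a joint reference frame, for sharing distributed Structure-from-Motion (SfM) reconstructions. Scalable methods and training datasets for registering SfM reconstructions are currently lacking. The solution has two parts. First, the paper formulates the scalable task of point cloud registration for SfM reconstructions and proposes a dataset generation pipeline that builds partial reconstructions of each scene from synthetically generated camera trajectories. Second, it adds a simple but effective neural refiner on top of the state-of-the-art registration method RoITr, called RefineRoITr, which yields significant improvements. Together these enable Collaborative SfM (ColabSfM).
Link: https://arxiv.org/abs/2503.17093
Authors: Johan Edstedt, André Mateus, Alberto Jaenal
Affiliations: Linköping University; Ericsson Research, Sweden
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025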
Abstract:Structure-from-Motion (SfM) is the task of estimating 3D structure and camera poses from images. We define Collaborative SfM (ColabSfM) as sharing distributed SfM reconstructions. Sharing maps requires estimating a joint reference frame, which is typically referred to as registration. However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. We find that current registration methods cannot register SfM point clouds when trained on existing datasets. To this end, we propose a SfM registration dataset generation pipeline, leveraging partial reconstructions from synthetically generated camera trajectories for each scene. Finally, we propose a simple but impactful neural refiner on top of the SotA registration method RoITr that yields significant improvements, which we call RefineRoITr. Our extensive experimental evaluation shows that our proposed pipeline and model enables ColabSfM. Code is available at this https URL
[CV-41] Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection CVPR2025
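[Quick read]: This paper addresses the high computational cost of CLIP’s large-scale image-text pre-training, with its heavy data processing and memory demands, while avoiding the loss of key semantic information that existing masking strategies cause by selectively removing image patches, which degrades the alignment between visual features and text descriptions. It proposes Patch Generation-to-Selection (PGS), a gradual masking process that improves training efficiency while preserving important semantic content. The method first pre-selects a small set of candidate patches as potential mask regions, then applies Sobel edge detection to generate an edge mask that prioritizes retaining the primary object areas, and finally refines the selection using similarity scores combined with optimal transport normalization to ensure a balanced similarity matrix.
Link: https://arxiv.org/abs/2503.17080
Authors: Gensheng Pei, Tao Chen, Yujia Wang, Xinhao Cai, Xiangbo Shu, Tianfei Zhou, Yazhou Yao
Affiliations: Nanjing University of Science and Technology; Zhejiang Sci-Tech University; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by CVPR 2025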
Abstract:The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP’s training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP’s training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and language compositionality benchmarks.
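The Sobel step in the pipeline above can be illustrated with a small pure-Python sketch that computes the gradient magnitude of a grayscale image and thresholds it into an edge mask. The helper names `sobel_magnitude` and `edge_mask`, the zeroed border, and the fixed threshold are assumptions for illustration; the abstract does not say how CLIP-PGS turns edge responses into patch-level decisions.

```python
import math
from typing import List

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image: List[List[float]]) -> List[List[float]]:
    """Gradient magnitude of a grayscale image; border pixels are left at 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


def edge_mask(image: List[List[float]], threshold: float) -> List[List[bool]]:
    """Mark pixels whose gradient magnitude exceeds the threshold."""
    return [[m > threshold for m in row] for row in sobel_magnitude(image)]


# A vertical step edge: dark left half, bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]
mask = edge_mask(image, threshold=1.0)
```

In a patch-based setting, one plausible use is to aggregate such a mask per patch and avoid masking patches with many edge pixels, since those are more likely to cover the main object; that aggregation rule is a guess, not something the source specifies.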
[CV-42] Halton Scheduler For Masked Generative Image Transformer
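[Quick read]: This paper addresses the token unmasking scheduler in the Masked Generative Image Transformer (MaskGIT) framework. It argues that sampling with the existing Confidence scheduler incurs non-recoverable sampling errors, which limits image quality and complicates hyper-parameter tuning. As a remedy, it proposes a new sampling strategy based on the Halton sequence, a quasi-random, low-discrepancy sequence that progressively covers the image uniformly, which reduces sampling errors and yields higher-quality images. The method requires no retraining or noise injection and can serve as a drop-in replacement for the original sampling strategy. Experiments on class-to-image synthesis on ImageNet and text-to-image generation on COCO show that the Halton scheduler outperforms the Confidence scheduler, both quantitatively by lowering the Fréchet Inception Distance (FID) and qualitatively by generating more diverse and more detailed images.
Link: https://arxiv.org/abs/2503.17076
Authors: Victor Besnier, Mickael Chen, David Hurych, Eduardo Valle, Matthieu Cord
Affiliations: valeo.ai; Sorbonne Université; H company
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: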
Abstract:Masked Generative Image Transformers (MaskGIT) have emerged as a scalable and efficient image generation framework, able to deliver high-quality visuals with low inference costs. However, MaskGIT’s token unmasking scheduler, an essential component of the framework, has not received the attention it deserves. We analyze the sampling objective in MaskGIT, based on the mutual information between tokens, and elucidate its shortcomings. We then propose a new sampling strategy based on our Halton scheduler instead of the original Confidence scheduler. More precisely, our method selects the token’s position according to a quasi-random, low-discrepancy Halton sequence. Intuitively, that method spreads the tokens spatially, progressively covering the image uniformly at each step. Our analysis shows that it allows reducing non-recoverable sampling errors, leading to simpler hyper-parameters tuning and better quality images. Our scheduler does not require retraining or noise injection and may serve as a simple drop-in replacement for the original sampling strategy. Evaluation of both class-to-image synthesis on ImageNet and text-to-image generation on the COCO dataset demonstrates that the Halton scheduler outperforms the Confidence scheduler quantitatively by reducing the FID and qualitatively by generating more diverse and more detailed images. Our code is at this https URL.
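The idea of choosing token positions from a low-discrepancy sequence can be sketched as follows. The code builds a 2D Halton sequence from radical inverses in bases 2 and 3, maps each point to a cell of the token grid, and skips cells that were already scheduled. The function names, the choice of bases, and the skip-duplicates rule are assumptions for illustration and may differ from the authors' implementation.

```python
from typing import List, Tuple


def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of `index` in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_token_order(height: int, width: int) -> List[Tuple[int, int]]:
    """Visit every (row, col) of a token grid in 2D Halton order."""
    order, seen = [], set()
    index = 1
    while len(order) < height * width:
        row = int(radical_inverse(index, 2) * height)
        col = int(radical_inverse(index, 3) * width)
        if (row, col) not in seen:   # skip cells that are already scheduled
            seen.add((row, col))
            order.append((row, col))
        index += 1
    return order


# Order in which the 16 tokens of a 4x4 grid would be unmasked.
order = halton_token_order(4, 4)
```

Consecutive entries of `order` tend to land far apart on the grid, which matches the intuition in the abstract that the scheduler spreads tokens spatially. A scheduler would then unmask successive slices of this list at each step, with the slice sizes set by whatever unmasking schedule is in use.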
[CV-43] Zero-Shot Styled Text Image Generation but Make It Autoregressive CVPR2025
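[Quick read]: This paper addresses the limitations of existing Handwritten Text Generation (HTG) methods, which fail to generalize to novel styles and are constrained in maximum output length and training efficiency. It proposes Emuru, a framework that combines a powerful text image representation model (a variational autoencoder) with an autoregressive Transformer. This enables the generation of text images conditioned on textual content and style examples such as specific fonts or handwriting styles. Trained on a diverse synthetic dataset rendered in over 100,000 typewritten and calligraphy fonts, the model can reproduce unseen styles in a zero-shot manner. Emuru also generates images without background artifacts, which makes them easier to use in downstream applications.
Link: https://arxiv.org/abs/2503.17074
Authors: Vittorio Pippi, Fabio Quattrini, Silvia Cascianelli, Alessio Tonioni, Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR2025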
Abstract:Styled Handwritten Text Generation (HTG) has recently received attention from the computer vision and document analysis communities, which have developed several solutions, either GAN- or diffusion-based, that achieved promising results. Nonetheless, these strategies fail to generalize to novel styles and have technical constraints, particularly in terms of maximum output length and training efficiency. To overcome these limitations, in this work, we propose a novel framework for text image generation, dubbed Emuru. Our approach leverages a powerful text image representation model (a variational autoencoder) combined with an autoregressive Transformer. Our approach enables the generation of styled text images conditioned on textual content and style examples, such as specific fonts or handwriting styles. We train our model solely on a diverse, synthetic dataset of English text rendered in over 100,000 typewritten and calligraphy fonts, which gives it the capability to reproduce unseen styles (both fonts and users’ handwriting) in zero-shot. To the best of our knowledge, Emuru is the first autoregressive model for HTG, and the first designed specifically for generalization to novel styles. Moreover, our model generates images without background artifacts, which are easier to use for downstream applications. Extensive evaluation on both typewritten and handwritten, any-length text image generation scenarios demonstrates the effectiveness of our approach.
[CV-44] Superpowering Open-Vocabulary Object Detectors for X-ray Vision
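[Quick read]: This paper addresses the application of open-vocabulary object detection (OvOD) to X-ray imaging. RGB-based OvOD methods cannot be applied directly to X-ray scans because of data scarcity and the modality gap. To overcome this, the paper proposes RAXO, a training-free framework that builds high-quality X-ray class descriptors with a dual-source retrieval strategy. RAXO gathers relevant RGB images from the web and enriches them through a novel X-ray material transfer mechanism, eliminating the need for labeled databases. The key idea is to replace text-based classification with these visual descriptors and rely on intra-modal feature distances for robust detection. Experiments show that RAXO improves OvOD performance, with an average mAP increase of up to 17.0 points. The authors also release the DET-COMPASS benchmark to support research in this area.
Link: https://arxiv.org/abs/2503.17071
Authors: Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: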
Abstract:Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: this https URL.
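Classifying a detection by its distance to visual class descriptors, rather than to text embeddings, can be sketched as a nearest-descriptor lookup. The example below uses cosine similarity over toy 3-dimensional vectors; the function names, the similarity measure, and the single-vector-per-class layout are assumptions for illustration, since the abstract only states that intra-modal feature distances are used.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def classify_by_descriptor(feature: List[float],
                           descriptors: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the class whose visual descriptor is most similar to `feature`."""
    scores = {name: cosine_similarity(feature, vec)
              for name, vec in descriptors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy descriptors; real ones would be features of X-ray-like exemplar images.
descriptors = {"knife": [1.0, 0.0, 0.0], "laptop": [0.0, 1.0, 0.0]}
label, score = classify_by_descriptor([0.9, 0.1, 0.0], descriptors)
```

Because the query feature and the descriptors come from the same visual encoder, the comparison stays within one modality, which is the property the abstract contrasts with text-based classification.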
[CV-45] PVChat: Personalized Video Chat with One-Shot Learning
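[Quick read]: This paper addresses the limitations of video large language models (ViLLMs) in identity-aware comprehension: they struggle with questions that require recognizing specific people and their actions, such as “Wilson is receiving chemotherapy” or “Tom is discussing with Sarah”. This limits their use in smart healthcare and smart home scenarios. To address it, the paper proposes PVChat, a one-shot learning framework and the first personalized ViLLM that enables subject-aware question answering (QA) from a single video per subject.
The key to the solution is a Mixture-of-Heads (MoH) enhanced ViLLM optimized on a synthetically augmented video-QA dataset. An automated augmentation pipeline synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, producing a diverse training set with four QA types: existence, appearance, action, and location inquiries. The paper further proposes a ReLU Routing MoH attention mechanism with two objectives: (1) Smooth Proximity Regularization, for progressive learning through exponential distance scaling, and (2) Head Activation Enhancement, for balanced attention routing. A two-stage training strategy moves from image pre-training to video fine-tuning, so that learning progresses gradually from static attributes to dynamic representations. Experiments show that, after learning from a single video, PVChat understands personalized features better than existing ViLLMs.
Link: https://arxiv.org/abs/2503.17069
Authors: Yufei Shi, Weilong Yan, Gang Xu, Yumeng Li, Yuchen Li, Zhenxi Li, Fei Richard Yu, Ming Li, Si Yong Yeo
Affiliations: Nanyang Technological University; National University of Singapore; Nankai University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: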
Abstract:Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as “Wilson is receiving chemotherapy” or “Tom is discussing with Sarah”, limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage, demonstrating its superiority in personalized feature understanding after learning from a single video, compared to state-of-the-art ViLLMs.
[CV-46] DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech AAAI2025
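Quick read: This paper addresses the limited practicality of diffusion models for co-speech gesture generation, which stems from their computationally intensive sampling steps. The key solution is DIDiffGes, a method built on a decoupled semi-implicit diffusion framework. It uses generative adversarial networks (GANs) to enable large-step sampling, cutting the number of sampling steps to just 10 while preserving quality and expressiveness. The central innovation is to decouple gesture data into body and hand distributions and to further decompose these into marginal and conditional distributions: the GAN models the marginal distribution implicitly, while an L2 reconstruction loss learns the conditional distributions explicitly, which stabilizes training and keeps the generated gestures expressive. The framework also denoises root noise conditioned on a local body representation, which keeps the results stable and realistic. Together, these techniques let DIDiffGes use 100 times fewer sampling steps than existing methods while scoring higher on human likeness, appropriateness, and style correctness in a user study.
Link: https://arxiv.org/abs/2503.17059
Authors: Yongkang Cheng, Shaoli Huang, Xuelin Chen, Jifeng Ning, Mingming Gong
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: Accepted by AAAI 2025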
Abstract:Diffusion models have demonstrated remarkable synthesis quality and diversity in generating co-speech gestures. However, the computationally intensive sampling steps associated with diffusion models hinder their practicality in real-world applications. Hence, we present DIDiffGes, a Decoupled Semi-Implicit Diffusion model-based framework that can synthesize high-quality, expressive gestures from speech using only a few sampling steps. Our approach leverages Generative Adversarial Networks (GANs) to enable large-step sampling for diffusion models. We decouple gesture data into body and hands distributions and further decompose them into marginal and conditional distributions. GANs model the marginal distribution implicitly, while L2 reconstruction loss learns the conditional distributions explicitly. This strategy enhances GAN training stability and ensures expressiveness of generated full-body gestures. Our framework also learns to denoise root noise conditioned on local body representation, guaranteeing stability and realism. DIDiffGes can generate gestures from speech with just 10 sampling steps, without compromising quality and expressiveness, reducing the number of sampling steps by a factor of 100 compared to existing methods. Our user study reveals that our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness. Project page: this https URL.
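[CV-47] Scoring, Remember, and Reference: Catching Camouflaged Objects in Videos
Quick read: This paper tackles video camouflaged object detection (VCOD), the task of segmenting objects whose appearance closely resembles their surroundings. Existing vision models struggle here because camouflaged objects have indistinguishable appearance features and because the dynamic information in videos is underexploited. The paper proposes an end-to-end VCOD framework inspired by the human memory-recognition mechanism, which integrates memory reference frames to exploit historical video information when processing camouflaged sequences. The key design is a dual-purpose decoder that outputs both predicted masks and scores, so that reference frames can be selected by score while auxiliary supervision strengthens feature extraction. A novel reference-guided multilevel asymmetric attention mechanism then combines long-term reference information with short-term motion cues for comprehensive feature extraction. The resulting Scoring, Remember, and Reference (SRR) framework extracts information efficiently to locate targets and uses memory guidance to improve subsequent processing, surpassing existing methods by 10% on benchmark datasets with only 54M parameters and a single pass through the video.
Link: https://arxiv.org/abs/2503.17050
Authors: Yuang Feng, Shuyong Gao, Fuzhen Yan, Yicheng Song, Lingyi Hong, Junjie Hu, Wenqiang Zhang
Affiliations: Fudan University; Keenon Robotics Co. Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: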
Abstract:Video Camouflaged Object Detection (VCOD) aims to segment objects whose appearances closely resemble their surroundings, posing a challenging and emerging task. Existing vision models often struggle in such scenarios due to the indistinguishable appearance of camouflaged objects and the insufficient exploitation of dynamic information in videos. To address these challenges, we propose an end-to-end VCOD framework inspired by human memory-recognition, which leverages historical video information by integrating memory reference frames for camouflaged sequence processing. Specifically, we design a dual-purpose decoder that simultaneously generates predicted masks and scores, enabling reference frame selection based on scores while introducing auxiliary supervision to enhance feature extraction. Furthermore, this study introduces a novel reference-guided multilevel asymmetric attention mechanism, effectively integrating long-term reference information with short-term motion cues for comprehensive feature extraction. By combining these modules, we develop the Scoring, Remember, and Reference (SRR) framework, which efficiently extracts information to locate targets and employs memory guidance to improve subsequent processing. With its optimized module design and effective utilization of video data, our model achieves significant performance improvements, surpassing existing approaches by 10% on benchmark datasets while requiring fewer parameters (54M) and only a single pass through the video. The code will be made publicly available.
[CV-48] HAPI: A Model for Learning Robot Facial Expressions from Human Preferences
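Quick read: This paper addresses the lack of nuance and realism in automatically generated robot facial expressions, which results from limited degrees of freedom and insufficient perceptual integration. Handcrafted methods based on fixed joint configurations often produce rigid, unnatural behavior, and existing automated techniques, although they reduce manual tuning, still fail to close the gap between human preferences and model predictions. The paper therefore proposes a novel learning-to-rank framework that uses human feedback to correct this discrepancy and make robotic faces more expressive. The key component is the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach trained on pairwise comparison annotations to refine expression evaluation. Bayesian optimization and an online expression survey on a 35-DOF android platform show that the approach yields more realistic and socially resonant expressions of anger, happiness, and surprise than baseline and expert-designed methods. This indicates that the framework bridges the gap between human preferences and model predictions and aligns robotic expression generation with human affective responses.
Link: https://arxiv.org/abs/2503.17046
Authors: Dongsheng Yang, Qianying Liu, Wataru Sato, Takashi Minato, Chaoran Liu, Shin’ya Nishida
Affiliations: Graduate School of Informatics, Kyoto University, Japan; Psychological Process Research Team, Guardian Robot Project, RIKEN, Japan; Field Science Education and Research Center, Kyoto University, Japan; Interactive Robot Research Team, Guardian Robot Project, RIKEN, Japan; NII
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: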
Abstract:Automatic robotic facial expression generation is crucial for human-robot interaction, as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions, resulting in a deficiency of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhance the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.
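The abstract names a Siamese RankNet-based model trained on pairwise comparisons. As a minimal sketch of the generic RankNet pairwise loss (not the authors' implementation), one shared scoring function rates both expressions, and the loss is the binary cross-entropy on the sigmoid of the score difference. The linear scorer, its weights, and the feature vectors below are hypothetical and serve only as an illustration.

```python
import math


def score(features, weights):
    """Hypothetical shared scorer: one linear model applied to both inputs (the Siamese part)."""
    return sum(f * w for f, w in zip(features, weights))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranknet_pair_loss(score_a, score_b, a_preferred):
    """Cross-entropy on P(a preferred over b) = sigmoid(score_a - score_b)."""
    diff = score_a - score_b
    # -log(sigmoid(diff)) == softplus(-diff); -log(1 - sigmoid(diff)) == softplus(diff)
    return softplus(-diff) if a_preferred else softplus(diff)


if __name__ == "__main__":
    weights = [0.5, -1.0]                    # illustrative parameters, not learned
    expr_a, expr_b = [2.0, 0.1], [0.5, 0.8]  # illustrative expression features
    s_a, s_b = score(expr_a, weights), score(expr_b, weights)
    print(ranknet_pair_loss(s_a, s_b, a_preferred=True))
```

The loss shrinks as the preferred expression's score pulls ahead of the other, which is what lets a scorer trained this way rank unseen expressions.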
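[CV-49] ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Quick read: This paper asks how to generate text descriptions of objects in 3D indoor scenes at multiple levels of detail. Existing methods usually describe objects at a single level and miss fine-grained details such as texture, material, and shape. The paper therefore proposes the task of expressive 3D captioning, which requires both a high-level object description and a low-level description of part properties for every object in an input 3D scene.
The key to the solution is ExCap3D, an expressive 3D captioning model that takes a 3D scan as input and, for each detected object, generates a collective description of its parts together with an object-level description conditioned on the part-level description. ExCap3D is designed to encourage semantic consistency between the generated descriptions, as well as textual similarity in the latent space, to further improve caption quality. The accompanying ExCap3D Dataset was built by multi-view captioning with a vision-language model (VLM) and contains 190k text descriptions of 34k 3D objects in 947 indoor scenes from ScanNet++. Experiments show that ExCap3D produces higher-quality object-level and part-level captions than state-of-the-art methods, with CIDEr improvements of 17% and 124%, respectively.
Link: https://arxiv.org/abs/2503.17044
Authors: Chandan Yeshwanth, David Rozenberszki, Angela Dai
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL, Video: this https URL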
Abstract:Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments show that the object- and part-level captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with a CIDEr score improvement of 17% and 124% for object- and part-level details respectively. Our code, dataset and models will be made publicly available.
[CV-50] An Attentive Representative Sample Selection Strategy Combined with Balanced Batch Training for Skin Lesion Segmentation
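Quick read: This paper addresses the effective selection of training subsets in medical image segmentation research. In the minimal-supervision setting in particular, choosing the training set at random can lead to suboptimal model performance. The key solution is to use prototypical contrastive learning and clustering to extract representative and diverse samples for annotation, together with a bespoke cluster-based image selection process. The paper also introduces unsupervised balanced batch dataloading to improve learning when annotated data is limited. These methods performed well in experiments on the public skin lesion dataset ISIC 2018.
Link: https://arxiv.org/abs/2503.17034
Authors: Stephen Lloyd-Brown, Susan Francis, Caroline Hoad, Penny Gowland, Karen Mullinger, Andrew French, Xin Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ISBI 2025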
Abstract:An often overlooked problem in medical image segmentation research is the effective selection of training subsets to annotate from a complete set of unlabelled data. Many studies select their training sets at random, which may lead to suboptimal model performance, especially in the minimal supervision setting where each training image has a profound effect on performance outcomes. This work aims to address this issue. We use prototypical contrastive learning and clustering to extract representative and diverse samples for annotation. We improve upon prior works with a bespoke cluster-based image selection process. Additionally, we introduce the concept of unsupervised balanced batch dataloading to medical image segmentation, which aims to improve model learning with minimally annotated data. We evaluated our method on a public skin lesion dataset (ISIC 2018) and compared it to another state-of-the-art data sampling method. Our method achieved superior performance in a low annotation budget scenario.
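The abstract does not spell out how the balanced batches are formed. One plausible reading, sketched here under that assumption, is that each batch draws equally from clusters found without labels. The function name, the round-robin scheme, and the toy cluster assignment are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_batch(cluster_ids, batch_size, rng):
    """Draw sample indices so that clusters are visited in round-robin order."""
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)
    clusters = sorted(by_cluster)
    batch = []
    for slot in range(batch_size):
        cluster = clusters[slot % len(clusters)]
        # Sampling with replacement keeps small clusters represented in every batch.
        batch.append(rng.choice(by_cluster[cluster]))
    return batch


if __name__ == "__main__":
    # Eight images fall in cluster 0 and only two in cluster 1 (illustrative assignment).
    cluster_ids = [0] * 8 + [1] * 2
    print(balanced_batch(cluster_ids, batch_size=4, rng=random.Random(0)))
```

With uniform random sampling the small cluster would supply about one image in five; the round-robin draw gives it half of every batch instead.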
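[CV-51] TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting CVPR2025
Quick read: This paper addresses the shortcomings of existing 3D full-body avatar generation methods in fidelity, real-time performance, and fine-grained control of facial expressions and body movements. Specifically, existing methods based on 3D Gaussian Splatting (3DGS) struggle with fine-grained expression and motion control, lack detail, and cannot run in real time on mobile devices.
The key solution is TaoAvatar, which builds a personalized clothed human parametric template and binds Gaussians to it to represent appearance, then uses a pre-trained StyleUnet-based network to handle complex pose-dependent non-rigid deformation and capture high-frequency detail. To address the resource cost, the authors use distillation to bake the non-rigid deformations into a lightweight MLP-based network and develop blend shapes to compensate for lost detail. The method achieves high-quality rendering while running in real time on a range of devices, for example sustaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
Link: https://arxiv.org/abs/2503.17032
Authors: Jianchuan Chen, Jingchuan Hu, Gaige Wang, Zhonghua Jiang, Tiansong Zhou, Zhiwen Chen, Chengfei Lv
Affiliations: Alibaba Group, Hangzhou, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025, project page: this https URL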
Abstract:Realistic 3D full-body talking avatars hold great potential in AR, with applications ranging from e-commerce live streaming to holographic communication. Despite advances in 3D Gaussian Splatting (3DGS) for lifelike avatar creation, existing methods struggle with fine-grained control of facial expressions and body movements in full-body talking tasks. Additionally, they often lack sufficient details and cannot run in real-time on mobile devices. We present TaoAvatar, a high-fidelity, lightweight, 3DGS-based full-body talking avatar driven by various signals. Our approach starts by creating a personalized clothed human parametric template that binds Gaussians to represent appearances. We then pre-train a StyleUnet-based network to handle complex pose-dependent non-rigid deformation, which can capture high-frequency appearance details but is too resource-intensive for mobile devices. To overcome this, we “bake” the non-rigid deformations into a lightweight MLP-based network using a distillation technique and develop blend shapes to compensate for details. Extensive experiments show that TaoAvatar achieves state-of-the-art rendering quality while running in real-time across various devices, maintaining 90 FPS on high-definition stereo devices such as the Apple Vision Pro.
[CV-52] AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process
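Quick read: This paper addresses the fact that existing methods for generating painting processes are limited to specific data types and usually depend on expensive human-annotated datasets. It proposes a novel self-supervised framework that generates a painting process from any type of image and treats the task as video generation. The key idea is to reverse the human creation sequence by progressively removing strokes from a reference image, and to build a self-supervised dataset from depth estimation and stroke rendering rather than from real human painting process data. The paper also models human painting as "refinement" and "layering" processes and introduces depth fusion layers so that video generation models can learn and replicate human painting behavior.
Link: https://arxiv.org/abs/2503.17029
Authors: Junjie Hu, Shuyong Gao, Qianyu Guo, Yan Wang, Qishan Wang, Yuang Feng, Wenqiang Zhang
Affiliations: Fudan University; Keenon Robotics Co. Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: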
Abstract:Humans can intuitively decompose an image into a sequence of strokes to create a painting, yet existing methods for generating drawing processes are limited to specific data types and often rely on expensive human-annotated datasets. We propose a novel self-supervised framework for generating drawing processes from any type of image, treating the task as a video generation problem. Our approach reverses the drawing process by progressively removing strokes from a reference image, simulating a human-like creation sequence. Crucially, our method does not require costly datasets of real human drawing processes; instead, we leverage depth estimation and stroke rendering to construct a self-supervised dataset. We model human drawings as “refinement” and “layering” processes and introduce depth fusion layers to enable video generation models to learn and replicate human drawing behavior. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to generate realistic drawings without the need for real drawing process data.
[CV-53] RAW-Adapter: Adapting Pre-trained Visual Model to Camera RAW Images and A Benchmark ECCV2024
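Quick read: This paper addresses a weakness of existing visual perception methods for camera RAW images: methods that use RAW data directly typically integrate image signal processing (ISP) with the downstream network modules and overlook potential synergies at the model level. The key contribution is RAW-Adapter, a framework that uses learnable ISP modules as input-level adapters to adjust RAW inputs and model-level adapters to bridge ISP processing with high-level downstream architectures. RAW-Adapter is a general framework applicable to a variety of computer vision architectures. The paper also introduces RAW-Bench, a benchmark with 17 types of RAW-based common corruptions, to systematically compare RAW-Adapter with state-of-the-art ISP methods and other RAW-based high-level vision algorithms, and proposes a RAW-based data augmentation strategy that further improves performance and out-of-domain generalization. Extensive experiments support the effectiveness and efficiency of RAW-Adapter and show robust performance across scenarios.
Link: https://arxiv.org/abs/2503.17027
Authors: Ziteng Cui, Jianfei Yang, Tatsuya Harada
Affiliations: RCAST, The University of Tokyo; MARS Lab, Nanyang Technological University; RIKEN AIP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 17 figures, extension of ECCV 2024 work: arXiv:2408.14802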
Abstract:In the computer vision community, the preference for pre-training visual models has largely shifted toward sRGB images due to their ease of acquisition and compact storage. However, camera RAW images preserve abundant physical details across diverse real-world scenarios. Despite this, most existing visual perception methods that utilize RAW data directly integrate image signal processing (ISP) stages with subsequent network modules, often overlooking potential synergies at the model level. Building on recent advances in adapter-based methodologies in both NLP and computer vision, we propose RAW-Adapter, a novel framework that incorporates learnable ISP modules as input-level adapters to adjust RAW inputs. At the same time, it employs model-level adapters to seamlessly bridge ISP processing with high-level downstream architectures. Moreover, RAW-Adapter serves as a general framework applicable to various computer vision frameworks. Furthermore, we introduce RAW-Bench, which incorporates 17 types of RAW-based common corruptions, including lightness degradations, weather effects, blurriness, camera imaging degradations, and variations in camera color response. Using this benchmark, we systematically compare the performance of RAW-Adapter with state-of-the-art (SOTA) ISP methods and other RAW-based high-level vision algorithms. Additionally, we propose a RAW-based data augmentation strategy to further enhance RAW-Adapter’s performance and improve its out-of-domain (OOD) generalization ability. Extensive experiments substantiate the effectiveness and efficiency of RAW-Adapter, highlighting its robust performance across diverse scenarios. 
[CV-54] A Tale of Two Classes: Adapting Supervised Contrastive Learning to Binary Imbalanced Datasets
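Quick read: This paper addresses the failure of supervised contrastive learning (SupCon) to learn a well-conditioned representation space on binary imbalanced datasets. Experiments on seven binary datasets of natural and medical images show that SupCon's performance drops markedly as class imbalance increases. To address this, the paper introduces two new evaluation metrics that expose structural deficiencies of the representation space which classical metrics cannot detect, and on that basis proposes two new SupCon strategies tailored to binary imbalanced datasets. By improving the structure of the representation space, these strategies raise downstream classification accuracy by up to 35% over standard SupCon. The key is to pair the new metrics with targeted strategies that improve the structural properties of the representation space.
Link: https://arxiv.org/abs/2503.17024
Authors: David Mildenberger, Paul Hager, Daniel Rueckert, Martin J Menten
Affiliations: Technical University of Munich; Munich Center for Machine Learning; Imperial College London
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: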
Abstract:Supervised contrastive learning (SupCon) has proven to be a powerful alternative to the standard cross-entropy loss for classification of multi-class balanced datasets. However, it struggles to learn well-conditioned representations of datasets with long-tailed class distributions. This problem is potentially exacerbated for binary imbalanced distributions, which are commonly encountered during many real-world problems such as medical diagnosis. In experiments on seven binary datasets of natural and medical images, we show that the performance of SupCon decreases with increasing class imbalance. To substantiate these findings, we introduce two novel metrics that evaluate the quality of the learned representation space. By measuring the class distribution in local neighborhoods, we are able to uncover structural deficiencies of the representation space that classical metrics cannot detect. Informed by these insights, we propose two new supervised contrastive learning strategies tailored to binary imbalanced datasets that improve the structure of the representation space and increase downstream classification accuracy over standard SupCon by up to 35%. We make our code available.
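For reference, the standard SupCon objective that the paper takes as its baseline can be sketched as follows. This is the usual formulation (each anchor is pulled toward same-class samples and pushed from all others, with a temperature-scaled softmax over cosine similarities), not either of the two new strategies the paper proposes. The toy embeddings, labels, and temperature are illustrative assumptions.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss, averaged over anchors that have positives."""
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])  # L2-normalize so dot product = cosine

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without any same-class partner contributes nothing
        logits = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n) if a != i
        }
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    emb = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]  # toy 2-D embeddings
    print(supcon_loss(emb, [0, 0, 1, 1]))  # labels agree with the geometry
    print(supcon_loss(emb, [0, 1, 0, 1]))  # labels contradict the geometry
```

In a heavily imbalanced binary batch, minority-class anchors have few or no positives, so they contribute little to this average; that is one intuition for why the objective may degrade as imbalance grows.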
[CV-55] Specifying What You Know or Not for Multi-Label Class-Incremental Learning AAAI2025
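Quick read: This paper studies the conflict between learning objectives caused by incompletely labeled samples in multi-label class-incremental learning (MLCIL). Existing methods are designed mainly for single-label classification and perform poorly in multi-label settings. The core challenge is that the model cannot clearly distinguish known from unknown knowledge, and this ambiguity hinders its ability to retain historical knowledge, master current classes, and prepare for future learning at the same time. The paper proposes a framework called HCP. Its key idea is to clarify the known classes through dynamic feature purification and recall enhancement with a distribution prior, which improves the precision and retention of known information, and to design a prospective knowledge mining strategy that probes the unknown in preparation for future learning. Experiments show that the method surpasses the previous state of the art by 3.3% in average accuracy on the MS-COCO B0-C10 setting without a replay buffer.
Link: https://arxiv.org/abs/2503.17017
Authors: Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025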
Abstract:Existing class-incremental learning is mainly designed for single-label classification tasks and is ill-equipped for multi-label scenarios due to the inherent contradiction of learning objectives for samples with incomplete labels. We argue that the main challenge to overcome this contradiction in multi-label class-incremental learning (MLCIL) lies in the model’s inability to clearly distinguish between known and unknown knowledge. This ambiguity hinders the model’s ability to retain historical knowledge, master current classes, and prepare for future learning simultaneously. In this paper, we aim to specify what is known or not in order to accommodate Historical, Current, and Prospective knowledge for MLCIL, and propose a novel framework termed HCP. Specifically, (i) we clarify the known classes by dynamic feature purification and recall enhancement with distribution prior, enhancing the precision and retention of known information. (ii) We design prospective knowledge mining to probe the unknown, preparing the model for future learning. Extensive experiments validate that our method effectively alleviates catastrophic forgetting in MLCIL, surpassing the previous state-of-the-art by 3.3% on average accuracy for MS-COCO B0-C10 setting without replay buffers.
[CV-56] Steady Progress Beats Stagnation: Mutual Aid of Foundation and Conventional Models in Mixed Domain Semi-Supervised Medical Image Segmentation CVPR2025
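Quick read: This paper addresses a problem in semi-supervised medical image segmentation: under domain shift, pretrained visual foundation models such as MedSAM tend to make overconfident, partly incorrect predictions when adapted to domain-specific downstream tasks. The accumulated errors hinder effective use of unlabeled data and limit further gains. The paper proposes the Synergistic training framework for Foundation and Conventional models (SynFoC). Its key idea is that a conventional model trained from scratch corrects the foundation model's high-confidence mispredictions, while the foundation model supplies high-quality pseudo-labels in the early training stages. A consensus-divergence consistency regularization further strengthens the collaborative training of the two models and promotes reliable convergence. The method performs well on four public multi-domain datasets and improves the Dice score by 10.31% on the Prostate dataset.
Link: https://arxiv.org/abs/2503.16997
Authors: Qinghe Ma, Jian Zhang, Zekun Li, Lei Qi, Qian Yu, Yinghuan Shi
Affiliations: Nanjing University; Southeast University; Shandong Women’s University; State Key Laboratory for Novel Software Technology and National Institute of Healthcare Data Science, Nanjing University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025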
Abstract:Large pretrained visual foundation models exhibit impressive general capabilities. However, the extensive prior knowledge inherent in these models can sometimes be a double-edged sword when adapting them to downstream tasks in specific domains. In the context of semi-supervised medical image segmentation with domain shift, foundation models like MedSAM tend to make overconfident predictions, some of which are incorrect. The error accumulation hinders the effective utilization of unlabeled data and limits further improvements. In this paper, we introduce a Synergistic training framework for Foundation and Conventional models (SynFoC) to address the issue. We observe that a conventional model trained from scratch has the ability to correct the high-confidence mispredictions of the foundation model, while the foundation model can supervise it with high-quality pseudo-labels in the early training stages. Furthermore, to enhance the collaborative training effectiveness of both models and promote reliable convergence towards optimization, the consensus-divergence consistency regularization is proposed. We demonstrate the superiority of our method across four public multi-domain datasets. In particular, our method improves the Dice score by 10.31% on the Prostate dataset. Our code is available at this https URL.
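The headline result is reported as a Dice score. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. The function name and the empty-mask convention are choices made for this illustration, not details from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly; scoring that case as 1.0 is a convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]    # illustrative 4-pixel masks
    ground_truth = [1, 0, 0, 0]
    print(dice_score(prediction, ground_truth))
```

Here one pixel overlaps out of three foreground pixels in total, so the score is 2 × 1 / 3, or about 0.667.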
[CV-57] Enabling Versatile Controls for Video Diffusion Models
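Quick read: This paper addresses the lack of precise and flexible control over fine-grained spatiotemporal attributes of pre-trained video diffusion models in text-to-video generation. The key solution is VCtrl (also termed PP-VCtrl), a framework whose generalizable conditional module integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, in a unified manner without modifying the underlying generator. A unified control signal encoding pipeline and a sparse residual connection mechanism incorporate the control representations efficiently. Experiments show that VCtrl improves both controllability and generation quality.
Link: https://arxiv.org/abs/2503.16983
Authors: Xu Zhang, Hao Zhou, Haoming Qin, Xiaobin Lu, Jiaxing Yan, Guanzhong Wang, Zeyu Chen, Yi Liu
Affiliations: PaddlePaddle Team, Baidu Inc.; Xiamen University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Codes and Supplementary Material: this http URL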
Abstract:Despite substantial progress in text-to-video generation, achieving precise and flexible control over fine-grained spatiotemporal attributes remains a significant unresolved challenge in video generation research. To address these limitations, we introduce VCtrl (also termed PP-VCtrl), a novel framework designed to enable fine-grained control over pre-trained video diffusion models in a unified manner. VCtrl integrates diverse user-specified control signals, such as Canny edges, segmentation masks, and human keypoints, into pretrained video diffusion models via a generalizable conditional module capable of uniformly encoding multiple types of auxiliary signals without modifying the underlying generator. Additionally, we design a unified control signal encoding pipeline and a sparse residual connection mechanism to efficiently incorporate control representations. Comprehensive experiments and human evaluations demonstrate that VCtrl effectively enhances controllability and generation quality. The source code and pre-trained models are publicly available and implemented using the PaddlePaddle framework at this http URL.
[CV-58] Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting
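Quick read: This paper addresses the long per-frame reconstruction time (over 10 seconds) and the error accumulation of existing free-viewpoint video streaming methods, which limit their wider use. It proposes Instant Gaussian Stream (IGS), a fast and generalizable streaming framework. The solution rests on two components. First, a generalized Anchor-driven Gaussian Motion Network projects multi-view 2D motion features into 3D space and uses anchor points to drive the motion of all Gaussians, producing the Gaussian motion for each target frame in the time of a single inference. Second, a Key-frame-guided Streaming Strategy refines each key frame, which allows accurate reconstruction of temporally complex scenes while mitigating error accumulation. Experiments show an average per-frame reconstruction time of 2s+ together with improved view synthesis quality.
Link: https://arxiv.org/abs/2503.16979
Authors: Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, Ronggang Wang
Affiliations: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: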
Abstract:Building Free-Viewpoint Videos in a streaming manner offers the advantage of rapid responsiveness compared to offline training methods, greatly enhancing user experience. However, current streaming approaches face challenges of high per-frame reconstruction time (10s+) and error accumulation, limiting their broader application. In this paper, we propose Instant Gaussian Stream (IGS), a fast and generalizable streaming framework, to address these issues. First, we introduce a generalized Anchor-driven Gaussian Motion Network, which projects multi-view 2D motion features into 3D space, using anchor points to drive the motion of all Gaussians. This generalized Network generates the motion of Gaussians for each target frame in the time required for a single inference. Second, we propose a Key-frame-guided Streaming Strategy that refines each key frame, enabling accurate reconstruction of temporally complex scenes while mitigating error accumulation. We conducted extensive in-domain and cross-domain evaluations, demonstrating that our approach can achieve streaming with an average per-frame reconstruction time of 2s+, alongside an enhancement in view synthesis quality.
[CV-59] GeoT: Geometry-guided Instance-dependent Transition Matrix for Semi-supervised Tooth Point Cloud Segmentation
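Quick read: This paper addresses the challenge posed by noisy pseudo-labels in semi-supervised segmentation of dental point clouds. The main difficulty for existing semi-supervised medical segmentation methods is handling the noise in pseudo-labels generated for unlabeled data. The core of the proposed solution is to use geometric priors to improve the estimation of the instance-dependent transition matrix (IDTM). Specifically, point-level geometric regularization (PLGR) enforces consistency between point adjacency relationships in 3D space and in IDTM space, and class-level geometric smoothing (CLGS) exploits the fixed spatial distribution of tooth categories. Together they tame the very large solution space created by tens of thousands of dental points. Experiments show that the method makes full use of unlabeled data and approaches the performance of fully supervised methods with only 20% of the labeled data.
Link: https://arxiv.org/abs/2503.16976
Authors: Weihao Yu, Xiaoqing Guo, Chenxin Li, Yifan Liu, Yixuan Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IPMI2025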
Abstract:Achieving meticulous segmentation of tooth point clouds from intra-oral scans stands as an indispensable prerequisite for various orthodontic applications. Given the labor-intensive nature of dental annotation, a significant amount of data remains unlabeled, driving increasing interest in semi-supervised approaches. One primary challenge of existing semi-supervised medical segmentation methods lies in noisy pseudo labels generated for unlabeled data. To address this challenge, we propose GeoT, the first framework that employs instance-dependent transition matrix (IDTM) to explicitly model noise in pseudo labels for semi-supervised dental segmentation. Specifically, to handle the extensive solution space of IDTM arising from tens of thousands of dental points, we introduce tooth geometric priors through two key components: point-level geometric regularization (PLGR) to enhance consistency between point adjacency relationships in 3D and IDTM spaces, and class-level geometric smoothing (CLGS) to leverage the fixed spatial distribution of tooth categories for optimal IDTM estimation. Extensive experiments performed on the public Teeth3DS dataset and a private dataset demonstrate that our method can make full use of unlabeled data to facilitate segmentation, achieving performance comparable to fully supervised methods with only 20% of the labeled data.
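A transition matrix models label noise by relating the true class of a point to the pseudo-label it is likely to receive. The sketch below shows only that generic relationship for a single point, assuming the common convention that entry (i, j) is the probability of pseudo-label j given true class i. It does not reproduce GeoT's estimation procedure, PLGR, or CLGS, and the numbers are made up.

```python
def noisy_posterior(clean_probs, transition):
    """Map clean class probabilities to pseudo-label probabilities.

    transition[i][j] is P(pseudo-label = j | true label = i) for one point;
    an instance-dependent model would supply a different matrix per point.
    """
    num_classes = len(clean_probs)
    for row in transition:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of the transition matrix must sum to 1")
    return [
        sum(clean_probs[i] * transition[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]


if __name__ == "__main__":
    clean = [0.5, 0.5]                      # illustrative clean posterior for one point
    matrix = [[0.8, 0.2], [0.1, 0.9]]       # illustrative noise model for that point
    print(noisy_posterior(clean, matrix))
```

With one matrix per point and tens of thousands of points per scan, the number of free parameters grows quickly, which is the large solution space the abstract refers to.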
[CV-60] EasyRobust: A Comprehensive and Easy-to-use Toolkit for Robust and Generalized Vision
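Quick read: This paper addresses the insufficient robustness of deep neural networks (DNNs) in computer vision, in particular the performance drop under adversarial attacks and data distribution shifts. The key contribution is EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluating, and analyzing robust vision models. EasyRobust targets two types of robustness: adversarial robustness, which lets a model defend against malicious inputs crafted by worst-case perturbations (adversarial examples), and non-adversarial robustness, which improves performance on natural test images with corruptions or distribution shifts. Thorough image classification benchmarks allow EasyRobust to give accurate robustness evaluations of vision models. The toolkit is open source and is intended to help train practically robust models and to advance academic and industrial progress in closing the gap between human and machine vision.
Link: https://arxiv.org/abs/2503.16975
Authors: Xiaofeng Mao, Yuefeng Chen, Rong Zhang, Hui Xue, Zhao Li, Hang Su
Affiliations: Alibaba Group; Zhejiang University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: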
Abstract:Deep neural networks (DNNs) have shown great promise in computer vision tasks. However, machine vision achieved by DNNs cannot be as robust as human perception. Adversarial attacks and data distribution shifts are known as two major scenarios which degrade machine performance and obstruct the wide deployment of machines “in the wild”. In order to remove these obstacles and facilitate research on model robustness, we develop EasyRobust, a comprehensive and easy-to-use toolkit for training, evaluation and analysis of robust vision models. EasyRobust targets two types of robustness: 1) Adversarial robustness enables the model to defend against malicious inputs crafted by worst-case perturbations, also known as adversarial examples; 2) Non-adversarial robustness enhances the model performance on natural test images with corruptions or distribution shifts. Thorough benchmarks on image classification enable EasyRobust to provide an accurate robustness evaluation of vision models. We hope EasyRobust can help train practically robust models and promote academic and industrial progress in closing the gap between human and machine vision. Codes and models of EasyRobust have been open-sourced at this https URL.
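To make "worst-case perturbations" concrete, here is a minimal sketch of the fast gradient sign method (FGSM) applied to a tiny linear logistic classifier. It does not use the EasyRobust API, which is not described in the abstract. The weights, input, and epsilon are illustrative assumptions, and the closed-form gradient sign is specific to this linear model.

```python
import math


def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)


def logistic_loss(weights, x, label):
    """Loss -log(sigmoid(label * w.x)) for label in {-1, +1}, computed stably."""
    margin = label * sum(w * xi for w, xi in zip(weights, x))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def fgsm_linear(weights, x, label, epsilon):
    """One-step L-infinity attack: x + epsilon * sign(d loss / d x).

    For this linear model the input gradient is -label * sigmoid(-margin) * w,
    so its sign is simply -label * sign(w).
    """
    return [xi - epsilon * label * sign(w) for xi, w in zip(x, weights)]


if __name__ == "__main__":
    weights = [1.0, -2.0]   # illustrative classifier
    x = [0.5, -0.5]         # illustrative input, correctly classified as +1
    x_adv = fgsm_linear(weights, x, label=1, epsilon=0.1)
    print(logistic_loss(weights, x, 1), logistic_loss(weights, x_adv, 1))
```

Each coordinate moves by at most epsilon, yet the loss rises, which is the sense in which the perturbation is small but adversarial.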
[CV-61] ARFlow: Human Action-Reaction Flow Matching with Physical Guidance
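[Quick Read]: This paper addresses two key problems in human action-reaction synthesis: diffusion models rely on complex noise-to-reaction generators with intricate conditioning mechanisms, and the generated motions frequently violate physical constraints. To address them, the paper proposes the Action-Reaction Flow Matching (ARFlow) framework, which establishes a direct action-to-reaction mapping and thereby removes the need for complex conditioning mechanisms. It introduces two innovations: an x1-prediction method that directly outputs human motion rather than a velocity field, enabling explicit constraint enforcement; and a training-free, gradient-based physical guidance mechanism that effectively prevents body penetration artifacts during sampling.
Link: https://arxiv.org/abs/2503.16973
Authors: Wentao Jiang,Jingya Wang,Haotao Lu,Kaiyang Ji,Baoxiong Jia,Siyuan Huang,Ye Shi
Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: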
Abstract:Human action-reaction synthesis, a fundamental challenge in modeling causal human interactions, plays a critical role in applications ranging from virtual reality to social robotics. While diffusion-based models have demonstrated promising performance, they exhibit two key limitations for interaction synthesis: reliance on complex noise-to-reaction generators with intricate conditional mechanisms, and frequent physical violations in generated motions. To address these issues, we propose Action-Reaction Flow Matching (ARFlow), a novel framework that establishes direct action-to-reaction mappings, eliminating the need for complex conditional mechanisms. Our approach introduces two key innovations: an x1-prediction method that directly outputs human motions instead of velocity fields, enabling explicit constraint enforcement; and a training-free, gradient-based physical guidance mechanism that effectively prevents body penetration artifacts during sampling. Extensive experiments on NTU120 and Chi3D datasets demonstrate that ARFlow not only outperforms existing methods in terms of Fréchet Inception Distance and motion diversity but also significantly reduces body collisions, as measured by our new Intersection Volume and Intersection Frequency metrics.
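The difference between predicting x1 and predicting a velocity field can be sketched as follows. The sketch assumes a linear interpolation path x_t = (1 - t) * x0 + t * x1, which the abstract does not state; under that assumption the velocity is (x1 - x_t) / (1 - t), so a network that outputs x1 can still drive an Euler sampler. The `oracle` predictor is a hypothetical stand-in for a trained model.

```python
def velocity_from_x1(x_t, x1_pred, t, eps=1e-6):
    """Convert a clean-sample (x1) prediction into a velocity on the linear path."""
    denom = max(1.0 - t, eps)  # guard against division by zero near t = 1
    return [(p - x) / denom for p, x in zip(x1_pred, x_t)]


def euler_sample(x0, predict_x1, steps=10):
    """Integrate from x0 (t = 0) towards t = 1 using x1 predictions."""
    x = list(x0)
    for k in range(steps):
        t = k / steps
        v = velocity_from_x1(x, predict_x1(x, t), t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained network: it always returns the target.
target = [1.0, -2.0, 0.5]
oracle = lambda x, t: target
result = euler_sample([0.0, 0.0, 0.0], oracle, steps=10)
```

Because the predictor returns a full motion at every step, a constraint or guidance term could in principle be applied to `predict_x1(x, t)` before it is converted to a velocity; how ARFlow does this is not specified in the abstract.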
[CV-62] Distilling Monocular Foundation Model for Fine-grained Depth Completion
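[Quick Read]: This paper addresses the lack of dense supervision in depth completion caused by sparse LiDAR inputs, that is, how to learn detailed geometric features without accurate dense depth annotations. Its key solution is a two-stage knowledge distillation framework that uses powerful monocular foundation models to provide dense supervision for depth completion. The first stage generates diverse training data from natural images, simulating LiDAR scans with monocular depth estimation and mesh reconstruction so that knowledge can be distilled without ground-truth depth labels. The second stage introduces a scale- and shift-invariant loss (SSI Loss) to handle the scale ambiguity of monocular depth estimation while fine-tuning on real-world datasets. The framework lets depth completion models exploit the strengths of monocular foundation models and achieves state-of-the-art performance on the KITTI benchmark.
Link: https://arxiv.org/abs/2503.16970
Authors: Yingping Liang,Yutao Hu,Wenqi Shao,Ying Fu
Affiliations: Beijing Institute of Technology; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: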
Abstract:Depth completion involves predicting dense depth maps from sparse LiDAR inputs. However, sparse depth annotations from sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. In this paper, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images, which distills geometric knowledge to depth completion. Specifically, we simulate LiDAR scans by utilizing monocular depth and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Besides, monocular depth estimation suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage, we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scales when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results demonstrate that models trained with our two-stage distillation framework achieve state-of-the-art performance, ranking first place on the KITTI benchmark. Code is available at this https URL
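A scale- and shift-invariant loss can be sketched as below. The abstract does not give the exact formulation, so this uses a common least-squares variant as an assumption: fit the scale and shift that best align one depth map to the other, then measure the remaining absolute error. All names are illustrative.

```python
def align_scale_shift(pred, target):
    """Closed-form least-squares scale s and shift b minimizing sum((s*pred + b - target)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var if var > 0 else 0.0  # constant prediction: fall back to shift only
    shift = mean_t - scale * mean_p
    return scale, shift


def ssi_loss(pred, target):
    """Mean absolute error after the optimal affine alignment of pred to target."""
    s, b = align_scale_shift(pred, target)
    return sum(abs(s * p + b - t) for p, t in zip(pred, target)) / len(pred)


# Flattened toy depth values (hypothetical).
pred = [1.0, 2.0, 3.0, 4.0]
loss_affine = ssi_loss(pred, [2.0 * p + 5.0 for p in pred])  # differs only by scale and shift
loss_other = ssi_loss(pred, [1.0, 4.0, 9.0, 16.0])           # differs in shape as well
```

The first loss is essentially zero because the two inputs differ only by an affine map, which is exactly the ambiguity a monocular depth estimate carries; the second stays positive because the shapes differ.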
[CV-63] DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery
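[Quick Read]: This paper addresses two main challenges in 3D reconstruction from in-the-wild drone imagery: dynamic distractors violate the static-scene assumption of radiance fields, and limited viewpoints hinder accurate capture of scene geometry. Its key solution is the DroneSplat framework, which adaptively adjusts masking thresholds by combining local-global segmentation heuristics with statistical methods, so that dynamic distractors in static scenes are precisely identified and eliminated. It also enhances 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited viewpoints.
Link: https://arxiv.org/abs/2503.16964
Authors: Jiadong Tang,Yu Gao,Dianyi Yang,Liqi Yan,Yufeng Yue,Yi Yang
Affiliations: Beijing Institute of Technology; Hangzhou Dianzi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: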
Abstract:Drones have become essential tools for reconstructing wild scenes due to their outstanding maneuverability. Recent advances in radiance field methods have achieved remarkable rendering quality, providing a new avenue for 3D reconstruction from drone imagery. However, dynamic distractors in wild environments challenge the static scene assumption in radiance fields, while limited view constraints hinder the accurate capture of underlying scene geometry. To address these challenges, we introduce DroneSplat, a novel framework designed for robust 3D reconstruction from in-the-wild drone imagery. Our method adaptively adjusts masking thresholds by integrating local-global segmentation heuristics with statistical approaches, enabling precise identification and elimination of dynamic distractors in static scenes. We enhance 3D Gaussian Splatting with multi-view stereo predictions and a voxel-guided optimization strategy, supporting high-quality rendering under limited view constraints. For comprehensive evaluation, we provide a drone-captured 3D reconstruction dataset encompassing both dynamic and static scenes. Extensive experiments demonstrate that DroneSplat outperforms both 3DGS and NeRF baselines in handling in-the-wild drone imagery.
[CV-64] Center-guided Classifier for Semantic Segmentation of Remote Sensing Images
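[Quick Read]: This paper addresses challenges in semantic segmentation of remote sensing images (RSIs), in particular the difficulty caused by large intraclass variance and three drawbacks of the vanilla softmax classifier used in existing segmentation models: (1) no direct supervision of pixel representations during training; (2) inadequate modeling ability of a parametric softmax classifier under large intraclass variance; and (3) an opaque classification decision process. To address these, the paper proposes a new classifier, CenterSeg, whose key ideas are modeling pixel features with multiple prototypes, providing direct supervision on a Grassmann manifold, and adding an interpretability strategy. Specifically, CenterSeg obtains local class centers by aggregating pixel features under the ground-truth masks and generates multiple prototypes through hard attention assignment and momentum updating. It introduces the Grassmann manifold and constrains the joint embedding space of pixel features and prototypes with two additional regularization terms. At inference, CenterSeg further improves interpretability by restricting each prototype to be a sample from the training set. Experiments confirm CenterSeg's advantages in performance, simplicity, light weight, compatibility, and interpretability.
Link: https://arxiv.org/abs/2503.16963
Authors: Wei Zhang,Mengting Ma,Yizhen Jiang,Rongrong Lian,Zhenkai Wu,Kangning Cui,Xiaowen Ma
Affiliations: School of Software Technology, Zhejiang University; Innovation Center of Yangtze River Delta, Zhejiang University; School of Computer Science and Technology, Zhejiang University; Department of Mathematics, City University of Hong Kong; Noah’s Ark Lab, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: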
Abstract:Compared with natural images, remote sensing images (RSIs) have a unique characteristic, i.e., larger intraclass variance, which makes semantic segmentation for remote sensing images more challenging. Moreover, existing semantic segmentation models for remote sensing images usually employ a vanilla softmax classifier, which has three drawbacks: (1) non-direct supervision for the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) opaque process of classification decision. In this paper, we propose a novel classifier (called CenterSeg) customized for RSI semantic segmentation, which solves the abovementioned problems with multiple prototypes, direct supervision under Grassmann manifold, and interpretability strategy. Specifically, for each class, our CenterSeg obtains local class centers by aggregating corresponding pixel features based on ground-truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes based on two additional regularization terms. Notably, during inference, CenterSeg can further provide interpretability to the model by restricting each prototype to be a sample of the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides the superior performance, CenterSeg has the advantages of simplicity, lightweight, compatibility, and interpretability. Code is available at this https URL.
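The prototype mechanism described above (class centers from ground-truth masks, hard assignment, momentum update) can be sketched roughly as follows. This is a simplified illustration and not the CenterSeg implementation: the dot-product similarity, the momentum value, and all names are assumptions, and the Grassmann-manifold regularizers are omitted.

```python
def class_centers(features, labels):
    """Average the feature vectors of each class present in the label mask."""
    sums, counts = {}, {}
    for f, c in zip(features, labels):
        acc = sums.setdefault(c, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def update_prototypes(prototypes, centers, momentum=0.9):
    """Hard-assign each class center to its most similar prototype, then
    move only that prototype towards the center with a momentum update."""
    for c, center in centers.items():
        protos = prototypes[c]
        k = max(range(len(protos)),
                key=lambda j: sum(a * b for a, b in zip(protos[j], center)))
        protos[k] = [momentum * p + (1.0 - momentum) * v
                     for p, v in zip(protos[k], center)]
    return prototypes


# Three "pixels" with 2-D features and their ground-truth classes (toy data).
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
centers = class_centers(features, labels)
prototypes = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[1.0, 0.0], [0.0, 1.0]]}
prototypes = update_prototypes(prototypes, centers)
```

Keeping several prototypes per class and updating only the closest one lets different prototypes settle on different modes of a class, which is one way to cope with large intraclass variance.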
[CV-65] MagicColor: Multi-Instance Sketch Colorization
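[Quick Read]: This paper addresses multi-instance line-art colorization, where the industry-standard workflow (character design, coloring of individual objects, and refinement) forces artists to color instances one by one, which is inefficient and error-prone, and where existing generative methods fail because paired multi-instance data is hard to collect. The paper proposes three technical designs: a self-play training strategy to mitigate the lack of training data; an instance guider that supplies the color of each instance; and fine-grained color matching with an edge loss for accurate color consistency. Together these modules let the model colorize multi-instance line art in a single pass while maintaining stylistic consistency and multi-instance control. Experiments show that the method clearly outperforms existing techniques in chromatic precision and automates colorization without manual adjustment, so novice users can also produce high-quality artwork.
Link: https://arxiv.org/abs/2503.16948
Authors: Yinhan Zhang,Yue Ma,Bingyuan Wang,Qifeng Chen,Zeyu Wang
Affiliations: HKUST(GZ); HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: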
Abstract:We present MagicColor, a diffusion-based framework for multi-instance sketch colorization. The production of multi-instance 2D line art colorization adheres to an industry-standard workflow, which consists of three crucial stages: the design of line art characters, the coloring of individual objects, and the refinement process. The artists are required to repeat the process of coloring each instance one by one, which is inaccurate and inefficient. Meanwhile, current generative methods fail to solve this task due to the challenge of multi-instance pair data collection. To tackle these challenges, we incorporate three technical designs to ensure precise character detail transcription and achieve multi-instance sketch colorization in a single forward pass. Specifically, we first propose the self-play training strategy to solve the lack of training data. Then we introduce an instance guider to feed the color of the instance. To achieve accurate color matching, we present fine-grained color matching with edge loss to enhance visual quality. Equipped with the proposed modules, MagicColor enables automatically transforming sketches into vividly-colored images with accurate consistency and multi-instance control. Experiments on our collected datasets show that our model outperforms existing methods regarding chromatic precision. Specifically, our model critically automates the colorization process with zero manual adjustments, so novice users can produce stylistically consistent artwork by providing reference instances and the original line art. Our code and additional details are available at this https URL
[CV-66] PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition
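[Quick Read]: This paper addresses the challenges faced by vision-language models (VLMs) such as CLIP in Dynamic Facial Expression Recognition (DFER): inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations, along with the weak temporal modeling of existing methods. It proposes PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework. The key is two specialized adapters, a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA), combined with Multi-modal Prompt Learning (MaPLe) to strengthen cross-modal semantic alignment. The TDA uses a GRU-based temporal mechanism to capture sequential dependencies and emphasize informative temporal features; the ShA is a lightweight adapter that refines representations in both the textual and visual encoders, ensuring consistency and efficiency. Together they reduce the number of trainable parameters while maintaining high accuracy, balancing resource efficiency and performance.
Link: https://arxiv.org/abs/2503.16945
Authors: Ibtissam Saadi,Abdenour Hadid,Douglas W. Cunningham,Abdelmalik Taleb-Ahmed,Yassin El Hillali
Affiliations: Univ. BTU Cottbus-Senftenberg (Cottbus); Sorbonne Center for Artificial Intelligence, Sorbonne University (Abu Dhabi, UAE); Faculty of Graphical Systems, Univ. BTU Cottbus-Senftenberg (Cottbus); Univ. Polytechnique Hauts-de-France (France)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: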
Abstract:Vision-Language Models (VLMs) like CLIP offer promising solutions for Dynamic Facial Expression Recognition (DFER) but face challenges such as inefficient full fine-tuning, high complexity, and poor alignment between textual and visual representations. Additionally, existing methods struggle with ineffective temporal modeling. To address these issues, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for DFER, significantly reducing trainable parameters while maintaining high accuracy. PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with dynamic scaling that captures sequential dependencies while emphasizing informative temporal features and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both textual and visual encoders, ensuring consistency and efficiency. Additionally, we integrate Multi-modal Prompt Learning (MaPLe), introducing learnable prompts for visual and action unit-based textual inputs, enhancing semantic alignment between modalities and enabling efficient CLIP adaptation for dynamic tasks. We evaluate PE-CLIP on two benchmark datasets, DFEW and FERV39K, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By balancing efficiency and accuracy, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at this https URL.
[CV-67] HyperLoRA: Parameter-Efficient Adaptive Generation for Portrait Synthesis
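[Quick Read]: This paper addresses the trade-off between personalization quality and computational efficiency in personalized portrait synthesis. Existing methods such as LoRA and DreamBooth produce highly photorealistic outputs but require fine-tuning on each individual's samples, which is time-consuming, resource-intensive, and prone to instability; adapter-based methods support zero-shot inference but fall short in naturalness and authenticity. The key contribution is HyperLoRA, a parameter-efficient adaptive generation method that uses an adaptive plug-in network to generate LoRA weights, combining the high performance of LoRA with the zero-shot capability of adapter schemes. With a carefully designed network structure and training strategy, it achieves zero-shot personalized portrait generation from single or multiple input images with high photorealism, fidelity, and editability.
Link: https://arxiv.org/abs/2503.16944
Authors: Mengtian Li,Jinshu Chen,Wanquan Feng,Bingchuan Li,Fei Dai,Songtao Zhao,Qian He
Affiliations: Intelligent Creation, ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: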
Abstract:Personalized portrait synthesis, essential in domains like social entertainment, has recently made significant progress. Person-wise fine-tuning based methods, such as LoRA and DreamBooth, can produce photorealistic outputs but need training on individual samples, consuming time and resources and posing a risk of instability. Adapter-based techniques such as IP-Adapter freeze the foundational model parameters and employ a plug-in architecture to enable zero-shot inference, but they often exhibit a lack of naturalness and authenticity, which are not to be overlooked in portrait synthesis tasks. In this paper, we introduce a parameter-efficient adaptive generation method, namely HyperLoRA, that uses an adaptive plug-in network to generate LoRA weights, merging the superior performance of LoRA with the zero-shot capability of the adapter scheme. Through our carefully designed network structure and training strategy, we achieve zero-shot personalized portrait generation (supporting both single and multiple image inputs) with high photorealism, fidelity, and editability.
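For readers unfamiliar with what "generating LoRA weights" produces, the sketch below shows how a pair of low-rank factors modifies a frozen weight matrix: W' = W + (alpha / r) * B A. In HyperLoRA the factors would come from the plug-in network; here they are hard-coded toy values, and the function names and the `alpha / r` scaling convention are assumptions rather than details from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_lora(weight, lora_b, lora_a, alpha=1.0):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) without modifying weight."""
    rank = len(lora_a)  # lora_a has shape (rank x in_features)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Frozen 2x2 weight and rank-1 factors (toy values standing in for generated ones).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # out_features x rank
lora_a = [[3.0, 4.0]]     # rank x in_features
adapted = apply_lora(weight, lora_b, lora_a)
```

Because only the two small factors change per identity, swapping them is cheap compared with fine-tuning the full model, which is the efficiency argument the abstract makes.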
[CV-68] Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model
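[Quick Read]: This paper addresses the limitations of current digital human research, which focuses on lip-syncing and body movement, and the lack of human video generation techniques that support interaction with real-world environments such as objects. Hand synthesis is already intricate, and generating objects in contact with hands, along with their interactions, is harder still, especially when objects vary considerably in size and shape. The paper proposes Re-HOLD, a video reenactment framework for human-object interaction built on an adaptive layout-instructed diffusion model. Its key idea is to use specialized layout representations for hands and objects separately, which effectively disentangles hand modeling from object adaptation to diverse motion sequences. It also designs an interactive textural enhancement module for hands and objects using two independent memory banks, and proposes a layout-adjusting strategy for cross-object reenactment that corrects unreasonable layouts caused by diverse object sizes at inference. Comprehensive qualitative and quantitative evaluations show that the framework significantly outperforms existing methods.
Link: https://arxiv.org/abs/2503.16942
Authors: Yingying Fan,Quanwei Yang,Kaisiyuan Wang,Hang Zhou,Yingying Li,Haocheng Feng,Yu Wu,Jingdong Wang
Affiliations: Wuhan University; University of Science and Technology of China; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: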
Abstract:Current digital human studies focusing on lip-syncing and body movement are no longer sufficient to meet the growing industrial demand, while human video generation techniques that support interacting with real-world environments (e.g., objects) have not been well investigated. Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To cope with these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. Such representations enable effective disentanglement of hand modeling and object adaptation to diverse motion sequences. To further improve the generation quality of HOI, we have designed an interactive textural enhancement module for both hands and objects by introducing two independent memory banks. We also propose a layout-adjusting strategy for the cross-object reenactment scenario to adaptively adjust unreasonable layouts caused by diverse object sizes during inference. Comprehensive qualitative and quantitative evaluations demonstrate that our proposed framework significantly outperforms existing methods. Project page: this https URL.
[CV-69] Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks CVPR2025
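[Quick Read]: This paper addresses image restoration under dynamic degradations such as noise, blur, and lighting inconsistencies. Existing Deep Unfolding Networks (DUNs) offer stable performance but require a manually selected degradation matrix for each degradation type, which limits their adaptability across diverse scenarios. The paper proposes the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework that handles multiple degradation types simultaneously. The key is a Vision-Language Model (VLM) refined on degraded image-text pairs, which aligns image features with degradation descriptions to select the appropriate transform for the target degradation. Integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm lets VLU-Net tackle complex multi-degradation restoration while remaining interpretable. A hierarchical feature unfolding structure further strengthens the framework by efficiently synthesizing degradation patterns across levels.
Link: https://arxiv.org/abs/2503.16930
Authors: Haijin Zeng,Xiangming Wang,Yongyong Chen,Jingyong Su,Jie Liu
Affiliations: Harvard University; Harbin Institute of Technology (Shenzhen); github.com/xianggkl/VLU-Net
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025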
Abstract:Dynamic image degradations, including noise, blur and lighting inconsistencies, pose significant challenges in image restoration, often due to sensor limitations or adverse environmental conditions. Existing Deep Unfolding Networks (DUNs) offer stable restoration performance but require manual selection of degradation matrices for each degradation type, limiting their adaptability across diverse scenarios. To address this issue, we propose the Vision-Language-guided Unfolding Network (VLU-Net), a unified DUN framework for handling multiple degradation types simultaneously. VLU-Net leverages a Vision-Language Model (VLM) refined on degraded image-text pairs to align image features with degradation descriptions, selecting the appropriate transform for the target degradation. By integrating an automatic VLM-based gradient estimation strategy into the Proximal Gradient Descent (PGD) algorithm, VLU-Net effectively tackles complex multi-degradation restoration tasks while maintaining interpretability. Furthermore, we design a hierarchical feature unfolding structure to enhance the VLU-Net framework, efficiently synthesizing degradation patterns across various levels. VLU-Net is the first all-in-one DUN framework and outperforms current leading one-by-one and all-in-one end-to-end methods by 3.74 dB on the SOTS dehazing dataset and 1.70 dB on the Rain100L deraining dataset.
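Deep unfolding networks turn the iterations of an optimization algorithm into network stages. As background, here is a minimal sketch of the classical Proximal Gradient Descent iteration that such networks unfold, x_{k+1} = prox(x_k - step * A^T (A x_k - y)), with a hand-picked degradation matrix A and soft-thresholding as the proximal operator. In VLU-Net the gradient step is estimated with VLM guidance and the proximal step is learned; neither is reproduced here, and all names and values are illustrative assumptions.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||x||_1, applied element-wise."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def pgd(A, y, lam=0.1, step=0.5, iters=200):
    """Minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 by proximal gradient descent."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Gradient of the data term: A^T (A x - y).
        residual = [sum(a * xi for a, xi in zip(row, x)) - yi
                    for row, yi in zip(A, y)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A)))
                for j in range(n)]
        # Gradient step followed by the proximal (shrinkage) step.
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], lam * step)
    return x


# Toy denoising problem: A is the identity, so the solution is soft_threshold(y, lam).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [1.0, 0.05, -2.0]
x_hat = pgd(A, y)
```

The manual choice of `A` in this sketch is exactly the step that the abstract says limits conventional DUNs when the degradation type is not known in advance.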
[CV-70] TEMPO: Temporal Preference Optimization of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment
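[Quick Read]: This paper addresses the limited temporal reasoning of existing Video LLMs, which stems from weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training. It proposes TEMPO (TEMporal Preference Optimization), a systematic framework that enhances temporal reasoning through Direct Preference Optimization (DPO). TEMPO introduces an automated preference data generation pipeline that builds preference pairs by selecting videos rich in temporal information, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. It has two further innovations: curriculum learning that progressively increases perturbation difficulty to improve robustness and adaptability, and "Pre-SFT Alignment", which applies preference optimization before instruction tuning to prioritize fine-grained temporal comprehension. Experiments show consistent gains across multiple benchmarks, and the paper also analyzes the transferability of DPO data across architectures and the role of difficulty scheduling.
Link: https://arxiv.org/abs/2503.16929
Authors: Shicheng Li,Lei Li,Kun Ouyang,Shuhuai Ren,Yuanxin Liu,Yuanxing Zhang,Fuzheng Zhang,Lingpeng Kong,Qi Liu,Xu Sun
Affiliations: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; The University of Hong Kong; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: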
Abstract:Video Large Language Models (Video LLMs) have achieved significant success by leveraging a two-stage paradigm: pretraining on large-scale video-text data for vision-language alignment, followed by supervised fine-tuning (SFT) for task-specific capabilities. However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the data and reliance on the next-token prediction paradigm during training. To address these limitations, we propose TEMPO (TEMporal Preference Optimization), a systematic framework that enhances Video LLMs’ temporal reasoning capabilities through Direct Preference Optimization (DPO). To facilitate this, we introduce an automated preference data generation pipeline that systematically constructs preference pairs by selecting videos that are rich in temporal information, designing video-specific perturbation strategies, and finally evaluating model responses on clean and perturbed video inputs. Our temporal alignment features two key innovations: curriculum learning that progressively increases perturbation difficulty to improve model robustness and adaptability; and “Pre-SFT Alignment”, applying preference optimization before instruction tuning to prioritize fine-grained temporal comprehension. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization. Our findings highlight TEMPO as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
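The DPO objective that TEMPO builds on can be written for a single preference pair as -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]). The sketch below implements that standard form; which responses count as chosen and rejected in TEMPO's pipeline, and the value of `beta`, are not given in the abstract and are left as assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities
    under the policy and under the frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Toy log-probabilities (hypothetical values).
loss_at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)       # policy prefers the chosen answer more
```

When the policy matches the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, which is the signal a difficulty schedule would present in progressively harder pairs.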
[CV-71] Optimized Minimal 3D Gaussian Splatting
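[Quick Read]: This paper addresses the high storage and computational cost of representing 3D scenes with 3D Gaussian Splatting (3DGS). Existing compression methods focus mainly on attribute compression and still rely on a relatively large number of Gaussians, because a smaller set of Gaussians is highly sensitive to lossy attribute compression and suffers severe quality degradation. The paper proposes the Optimized Minimal Gaussians representation (OMG), whose core components are: (1) identifying distinct Gaussians among nearby ones to minimize redundancy without sacrificing rendering quality; (2) a compact and precise attribute representation that efficiently captures both continuity and irregularity among primitives; and (3) a sub-vector quantization technique that improves the representation of irregularity while keeping training fast and the codebook size negligible. Experiments show that OMG reduces storage by nearly 50% relative to the previous state of the art and enables rendering at over 600 FPS while maintaining high rendering quality.
Link: https://arxiv.org/abs/2503.16924
Authors: Joo Chan Lee,Jong Hwan Ko,Eunbyung Park
Affiliations: Sungkyunkwan University; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL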
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for real-time, high-performance rendering, enabling a wide range of applications. However, representing 3D scenes with numerous explicit Gaussian primitives imposes significant storage and memory overhead. Recent studies have shown that high-quality rendering can be achieved with a substantially reduced number of Gaussians when represented with high-precision attributes. Nevertheless, existing 3DGS compression methods still rely on a relatively large number of Gaussians, focusing primarily on attribute compression. This is because a smaller set of Gaussians becomes increasingly sensitive to lossy attribute compression, leading to severe quality degradation. Since the number of Gaussians is directly tied to computational costs, it is essential to reduce the number of Gaussians effectively rather than only optimizing storage. In this paper, we propose Optimized Minimal Gaussians representation (OMG), which significantly reduces storage while using a minimal number of primitives. First, we identify distinct Gaussians among nearby ones, minimizing redundancy without sacrificing quality. Second, we propose a compact and precise attribute representation that efficiently captures both continuity and irregularity among primitives. Additionally, we propose a sub-vector quantization technique for improved irregularity representation, maintaining fast training with a negligible codebook size. Extensive experiments demonstrate that OMG reduces storage requirements by nearly 50% compared to the previous state-of-the-art and enables 600+ FPS rendering while maintaining high rendering quality. Our source code is available at this https URL.
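The general idea behind sub-vector quantization is to split an attribute vector into short chunks and quantize each chunk against its own small codebook, so only one index per chunk needs to be stored. The sketch below illustrates that idea with fixed toy codebooks; how OMG learns its codebooks and which attributes it quantizes are not described in the abstract, so everything here is an assumption.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def svq_encode(vec, codebooks):
    """Split vec into len(codebooks) equal sub-vectors and store one index per sub-vector."""
    size = len(vec) // len(codebooks)
    return [nearest(cb, vec[i * size:(i + 1) * size])
            for i, cb in enumerate(codebooks)]


def svq_decode(indices, codebooks):
    """Rebuild an approximate vector by concatenating the selected codebook entries."""
    out = []
    for idx, cb in zip(indices, codebooks):
        out.extend(cb[idx])
    return out


# Two tiny codebooks, one per 2-D sub-vector (toy values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [2.0, 2.0]],
]
codes = svq_encode([0.9, 1.2, 1.8, 2.1], codebooks)
approx = svq_decode(codes, codebooks)
```

With two entries per codebook, the four-dimensional vector is stored as two one-bit indices, while the pair of codebooks can still represent four distinct combinations; this is why sub-vector codebooks can stay small.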
[CV-72] When Preferences Diverge: Aligning Diffusion Models with Minority-Aware Adaptive DPO
[Quick Read]: This paper addresses the performance degradation that minority samples in preference data cause in diffusion-based image generation. It notes that so-called universal human preferences in image generation are subjective and that minority samples can harm model training. The paper proposes Adaptive-DPO, whose key is a minority-instance-aware metric that combines intra-annotator confidence and inter-annotator stability to distinguish majority from minority samples. By modifying the DPO objective, Adaptive-DPO improves the model's learning of majority labels while mitigating the negative impact of minority samples. Experiments show that the method handles both synthetic minority data and real-world preference data effectively, providing a more effective training strategy for image generation.
Link: https://arxiv.org/abs/2503.16921
Authors: Lingfan Zhang,Chen Liu,Chengming Xu,Kai Hu,Donghao Luo,Chengjie Wang,Yanwei Fu,Yuan Yao
Affiliations: Fudan University; The Hong Kong University of Science and Technology; Tencent; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, the field of image generation has witnessed significant advancements, particularly in fine-tuning methods that align models with universal human preferences. This paper explores the critical role of preference data in the training process of diffusion models, particularly in the context of Diffusion-DPO and its subsequent adaptations. We investigate the complexities surrounding universal human preferences in image generation, highlighting the subjective nature of these preferences and the challenges posed by minority samples in preference datasets. Through pilot experiments, we demonstrate the existence of minority samples and their detrimental effects on model performance. We propose Adaptive-DPO – a novel approach that incorporates a minority-instance-aware metric into the DPO objective. This metric, which includes intra-annotator confidence and inter-annotator stability, distinguishes between majority and minority samples. We introduce an Adaptive-DPO loss function which improves the DPO loss in two ways: enhancing the model’s learning of majority labels while mitigating the negative impact of minority samples. Our experiments demonstrate that this method effectively handles both synthetic minority data and real-world preference data, paving the way for more effective training methodologies in image generation tasks.
[CV-73] Temporal Action Detection Model Compression by Progressive Block Drop CVPR2025
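[Quick Read]: This paper addresses the growing computational demand that accompanies performance gains from larger feature extractors and datasets, which is a challenge for applications with limited compute such as autonomous driving and robotics. Existing channel pruning methods can compress models but reduce GPU parallelization efficiency. The paper proposes Progressive Block Drop, which compresses a model by reducing its depth rather than its layer width. The key steps are to iteratively remove the redundant blocks with the least impact on model performance, and then to fine-tune the pruned model with a parameter-efficient cross-depth alignment technique to restore accuracy. Experiments on two temporal action detection benchmarks (THUMOS14 and ActivityNet-1.3) show a 25% reduction in computational overhead, and the method is shown to be orthogonal to channel pruning, so the two can be combined for further efficiency gains.
Link: https://arxiv.org/abs/2503.16916
Authors: Xiaoyong Chen,Yong Guo,Jiaming Liang,Sitong Zhuang,Runhao Zeng,Xiping Hu
Affiliations: Guangdong-Hong Kong-Macao Joint Laboratory for Emotional Intelligence and Pervasive Computing, Shenzhen MSU-BIT University; Shenzhen University; South China University of Technology; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025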
Abstract:Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs, due to the inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; and second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore model accuracy. Our method achieves lossless compression with a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3). More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
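The two-step loop described above (drop the least harmful block, then fine-tune to recover accuracy) can be sketched as a greedy procedure. The `evaluate` and `finetune` callables, the stopping rule based on a tolerance, and the toy "importance" scores are all hypothetical placeholders; the paper's cross-depth alignment technique is not reproduced here.

```python
def progressive_block_drop(blocks, evaluate, finetune, max_drop, tolerance):
    """Greedily drop the block whose removal hurts the score least, then fine-tune.

    Stops once the fine-tuned model falls more than `tolerance` below the
    score of the original, unpruned model.
    """
    kept = list(blocks)
    baseline = evaluate(kept)
    for _ in range(max_drop):
        # Step 1: score every single-block removal and pick the least harmful one.
        candidates = [(evaluate(kept[:i] + kept[i + 1:]), i) for i in range(len(kept))]
        _, best_i = max(candidates)
        # Step 2: fine-tune the pruned model to recover accuracy.
        trial = finetune(kept[:best_i] + kept[best_i + 1:])
        if baseline - evaluate(trial) > tolerance:
            break
        kept = trial
    return kept


# Toy stand-ins: each block carries a made-up contribution to the score,
# and fine-tuning is a no-op.
blocks = [("a", 0.5), ("b", 0.01), ("c", 0.3), ("d", 0.02)]
evaluate = lambda model: sum(score for _, score in model)
finetune = lambda model: model
remaining = progressive_block_drop(blocks, evaluate, finetune, max_drop=3, tolerance=0.05)
remaining_names = [name for name, _ in remaining]
```

Dropping whole blocks keeps every remaining layer at full width, which matches the abstract's argument that depth reduction preserves large, GPU-friendly matrix multiplications.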
[CV-74] Salient Object Detection in Traffic Scene through the TSOD10K Dataset
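[Quick Read]: This paper addresses the lack of dedicated datasets and methods for traffic salient object detection (TSOD). Whereas salient object detection in natural scene images (NSI-SOD) focuses on visually distinctive regions, TSOD emphasizes objects that demand the driver's immediate attention because of their semantic impact, even when their visual contrast is low. The paper's key contributions are Tramba, a Mamba-based TSOD model, and TSOD10K, a large-scale dataset with pixel-wise saliency annotations. Tramba introduces a Dual-Frequency Visual State Space module and a Helix 2D-Selective-Scan mechanism, which address the difficulty of distinguishing inconspicuous visual information from complex traffic backgrounds while capturing global multi-direction spatial dependencies, enabling accurate perception of critical regions in traffic scenes.
Link: https://arxiv.org/abs/2503.16910
Authors: Yu Qiu,Yuhang Sun,Jie Mei,Lin Xiao,Jing Xu
Affiliations: College of Information Science and Engineering, Hunan Normal University; National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, College of Artificial Intelligence, Nankai University; National Engineering Research Center of Robot Visual Perception and Control Technology, School of Robotics, Hunan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 12 figures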
Abstract:Traffic Salient Object Detection (TSOD) aims to segment the objects critical to driving safety by combining semantic (e.g., collision risks) and visual saliency. Unlike SOD in natural scene images (NSI-SOD), which prioritizes visually distinctive regions, TSOD emphasizes the objects that demand immediate driver attention due to their semantic impact, even with low visual contrast. This dual criterion, i.e., bridging perception and contextual risk, re-defines saliency for autonomous and assisted driving systems. To address the lack of task-specific benchmarks, we collect the first large-scale TSOD dataset with pixel-wise saliency annotations, named TSOD10K. TSOD10K covers the diverse object categories in various real-world traffic scenes under various challenging weather/illumination variations (e.g., fog, snowstorms, low-contrast, and low-light). Methodologically, we propose a Mamba-based TSOD model, termed Tramba. Considering the challenge of distinguishing inconspicuous visual information from complex traffic backgrounds, Tramba introduces a novel Dual-Frequency Visual State Space module equipped with shifted window partitioning and dilated scanning to enhance the perception of fine details and global structure by hierarchically decomposing high/low-frequency components. To emphasize critical regions in traffic scenes, we propose a traffic-oriented Helix 2D-Selective-Scan (Helix-SS2D) mechanism that injects driving attention priors while effectively capturing global multi-direction spatial dependencies. We establish a comprehensive benchmark by evaluating Tramba and 22 existing NSI-SOD models on TSOD10K, demonstrating Tramba’s superiority. Our research establishes the first foundation for safety-aware saliency analysis in intelligent transportation systems.
[CV-75] Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification CVPR2025
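[Quick read]: This paper addresses the difficulty and cost of obtaining accurate annotations for multi-label image classification. It builds on recent work that applies CLIP to unsupervised multi-label classification. CLIP, however, suffers from view-dependent predictions and inherent bias, which limit its performance. The key contribution is Classifier-guided CLIP Distillation (CCD), which uses multiple views near target objects, guided by the classifier's Class Activation Mapping (CAM), and debiases the pseudo-labels derived from CLIP predictions. This allows local views to be selected and prediction bias to be corrected without extra labels, improving multi-label classification. Experiments show that the method outperforms existing techniques on diverse datasets. The code is publicly available.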
Link: https://arxiv.org/abs/2503.16873
Authors: Dongseob Kim, Hyunjung Shim
Institutions: Samsung Electronics (Republic of Korea); KAIST (Republic of Korea)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025 Accepted
Abstract:Multi-label classification is crucial for comprehensive image understanding, yet acquiring accurate annotations is challenging and costly. To address this, a recent study suggests exploiting unsupervised multi-label classification leveraging CLIP, a powerful vision-language model. Despite CLIP’s proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. Our Classifier-guided CLIP Distillation (CCD) enables selecting multiple local views without extra labels and debiasing predictions to enhance classification performance. Experimental results validate our method’s superiority over existing techniques across diverse datasets. The code is available at this https URL.
[CV-76] Lie Detector: Unified Backdoor Detection via Cross-Examination Framework
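[Quick read]: This paper addresses the security risk that arises when institutions with limited data and computing resources outsource model training to third-party providers in a semi-honest setting. Under a pre-defined learning paradigm (e.g., supervised or semi-supervised learning), an adversary may poison the training data to embed a backdoor in the resulting model. Existing detection methods rely mainly on statistical analysis, and their accuracy does not hold up consistently across learning paradigms. The paper proposes a unified backdoor detection framework that cross-examines model inconsistencies between two independent service providers. The key is central kernel alignment, which gives robust feature-similarity measurements across model architectures and learning paradigms and so supports precise recovery and identification of backdoor triggers; a backdoor fine-tuning sensitivity analysis then separates backdoor triggers from adversarial perturbations, substantially reducing false positives. Experiments show accuracy gains of 5.4%, 1.6%, and 11.9% over state-of-the-art baselines on supervised, semi-supervised, and autoregressive learning tasks, respectively, and the method is described as the first to effectively detect backdoors in multimodal large language models.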
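Link: https://arxiv.org/abs/2503.16872
Authors: Xuan Wang, Siyuan Liang, Dongping Liao, Han Fang, Aishan Liu, Xiaochun Cao, Yu-liang Lu, Ee-Chien Chang, Xitong Gao
Institutions: National University of Defense Technology (NUDT); National University of Singapore (NUS); University of Macau; Beihang University; Sun Yat-sen University (SYSU); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (SIAT)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: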
Abstract:Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming adherence to prescribed training protocols with a pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to maintain consistent detection accuracy across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate central kernel alignment to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines across supervised, semi-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.
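The feature-similarity measurement mentioned in this abstract can be illustrated with a minimal sketch of linear CKA, assuming that "central kernel alignment" refers to what is usually called centered kernel alignment. The function names are hypothetical, and the paper's actual kernel choice and pipeline are not given in the abstract. Rows are samples and columns are features, so two models with different feature widths can still be compared.

```python
import math

def center_columns(X):
    """Subtract each column's mean from a row-major matrix (rows = samples)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with the same number of rows."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is one reason it suits comparisons across architectures.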
[CV-77] ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
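[Quick read]: This paper addresses the challenge of precisely evaluating semantic alignment between text prompts and generated videos in Text-to-Video (T2V) generation. Existing metrics such as CLIPScore give only coarse-grained scores without fine-grained alignment details and fail to reflect human preference. The paper proposes ETVA (Evaluation method of Text-to-Video Alignment), which evaluates alignment through fine-grained question generation and answering. A multi-agent system parses prompts into semantic scene graphs to generate atomic questions. A knowledge-augmented multi-stage reasoning framework then answers them: an auxiliary large language model (LLM) first retrieves relevant common-sense knowledge (e.g., physical laws), and a video LLM answers the generated questions through multi-stage reasoning. Experiments show that ETVA reaches a Spearman's correlation coefficient of 58.47 with human judgment, versus 31.0 for existing metrics. The authors also build a benchmark of 2k diverse prompts and 12k atomic questions and use it to evaluate 15 existing T2V models, identifying their key capabilities and limitations.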
Link: https://arxiv.org/abs/2503.16867
Authors: Kaisi Guan, Zhengfeng Lai, Yuchong Sun, Peng Zhang, Wei Liu, Kieran Liu, Meng Cao, Ruihua Song
Institutions: Renmin University of China; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and then video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman’s correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation.
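The agreement with human judgment reported in this abstract is a Spearman rank correlation. As a rough illustration of that statistic, here is a minimal standard-library sketch in which ties receive average ranks. The function names are hypothetical, and nothing here reproduces ETVA's own scoring pipeline.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

The function returns a value in [-1, 1]. The figures quoted in the abstract (58.47 and 31.0) appear to be reported on a 0 to 100 scale, which is an inference rather than something the abstract states.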
[CV-78] City2Scene: Improving Acoustic Scene Classification with City Features
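[Quick read]: This paper addresses the tendency of conventional Acoustic Scene Classification (ASC) methods to focus on patterns common across cities while ignoring city-specific features. The authors hypothesize that city-specific environmental and cultural differences in acoustic features are valuable for ASC. The key is the City2Scene framework, which uses knowledge distillation to transfer city-specific knowledge learned by city classification models into a scene classification model. Experiments show that this transfer improves accuracy for a range of state-of-the-art ASC backbone models, including both CNNs and Transformers.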
Link: https://arxiv.org/abs/2503.16862
Authors: Yiqiang Cai, Yizhou Tan, Peihong Zhang, Yuxuan Liu, Shengchen Li, Xi Shao, Mark D. Plumbley
Institutions: Xi’an Jiaotong-Liverpool University; Nanjing University of Posts and Telecommunications; University of Surrey
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. In contrast, we hypothesize that city-specific environmental and cultural differences in acoustic features are beneficial for the ASC task. In this paper, we introduce City2Scene, a novel framework that leverages city features to improve ASC. City2Scene transfers the city-specific knowledge from city classification models to a scene classification model using knowledge distillation. We evaluated City2Scene on the DCASE Challenge Task 1 datasets, where each audio clip is annotated with both scene and city labels. Experimental results demonstrate that city features provide valuable information for classifying scenes. By distilling the city-specific knowledge, City2Scene effectively improves accuracy for various state-of-the-art ASC backbone models, including both CNNs and Transformers.
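For readers unfamiliar with knowledge distillation, the sketch below shows the generic temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It is an illustration under assumptions: the abstract does not say how City2Scene connects the city teacher to the scene student, so the idea that both produce logits over the same label set, and the names `softmax` and `distillation_loss`, are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this term is usually added to the ordinary classification loss with a weighting factor, so the student learns from both the ground-truth labels and the teacher.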
[CV-79] Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Multi-Culture Sign Language Recognition
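[Quick read]: This paper targets multi-cultural sign language (McSL) recognition. Existing sign language recognition systems perform well on the sign language of a single culture but may struggle with multi-cultural sign languages. The paper proposes a Stack Spatial-Temporal Transformer Network that uses multi-head attention, together with the Stack Transfer concept, to capture spatial and temporal dependencies in sign sequences and extract hierarchical features. The raw data are first embedded into a vector with high expressive power, which is fed to the proposed stacked transformer to obtain hierarchical features with short-range and long-range dependencies. The architecture has several stages that process spatial and temporal relationships in sequence. A Spatial Multi-Head Attention Transformer captures spatial dependencies between joints, a Temporal Multi-Head Attention Transformer captures long-range temporal dependencies, and skip connections combine their outputs with earlier features. A Feed-Forward Network (FFN) and a final normalization layer produce the output feature tensor, and the whole process is repeated for 10 transformer blocks. Experiments report good accuracy on the Japanese (JSL), Korean (KSL), and American (ASL) sign language datasets and improved performance on multi-cultural sign language recognition.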
Link: https://arxiv.org/abs/2503.16855
Authors: Koki Hirooka, Abu Saleh Musa Miah, Tatsuya Murakami, Yuto Akiba, Yong Seok Hwang, Jungpil Shin
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hand gesture-based Sign Language Recognition (SLR) serves as a crucial communication bridge between deaf and non-deaf individuals. Existing SLR systems perform well for their cultural SL but may struggle with multi-cultural sign languages (McSL). To address these challenges, this paper proposes a Stack Spatial-Temporal Transformer Network that leverages multi-head attention mechanisms to capture both spatial and temporal dependencies with hierarchical features using the Stack Transfer concept. In the process, we first apply a fully connected layer to produce an embedding vector with high expressive power from the original dataset, then feed it to a stack of the newly proposed transformer to obtain hierarchical features with short-range and long-range dependencies. The network architecture is composed of several stages that process spatial and temporal relationships sequentially, ensuring effective feature extraction. After the fully connected layer, the embedding vector is processed by the Spatial Multi-Head Attention Transformer, which captures spatial dependencies between joints. In the next stage, the Temporal Multi-Head Attention Transformer captures long-range temporal dependencies, and again, the features are concatenated with the output using another skip connection. The processed features are then passed to the Feed-Forward Network (FFN), which refines the feature representations further. After the FFN, additional skip connections are applied to combine the output with earlier layers, followed by a final normalization layer to produce the final output feature tensor. This process is repeated for 10 transformer blocks. Extensive experiments show that the method achieves good accuracy on the JSL, KSL, and ASL datasets. Our approach demonstrates improved performance in McSL, and it will be considered a novel work in this domain.
[CV-80] Generative Compositor for Few-Shot Visual Information Extraction
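[Quick read]: This paper addresses the scarcity of training data in Visual Information Extraction (VIE), especially the performance bottleneck in few-shot settings. VIE must handle varied document layouts, semantic scopes, and languages, and many of the resulting types lack enough annotated samples. The paper proposes a generative model named Generative Compositor, a hybrid pointer-generator network that emulates a compositor (typesetter) by retrieving words from the source text and assembling them according to the given prompts. The key components are a prompt-based retrieval mechanism, three pre-training strategies that strengthen the model's perception of spatial context, and a prompt-aware resampler that enables efficient matching using the entity-semantic prior contained in prompts. Together these let the model pick up more effective spatial and semantic clues from limited training samples. The method is highly competitive with full-sample training and notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings.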
Link: https://arxiv.org/abs/2503.16854
Authors: Zhibo Yang, Wei Hua, Sibo Song, Cong Yao, Yingying Zhu, Wenqing Cheng, Xiang Bai
Institutions: Huazhong University of Science and Technology, Wuhan, Hubei, China; Alibaba Group, Hangzhou, Zhejiang, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual Information Extraction (VIE), aiming at extracting structured information from visually rich document images, plays a pivotal role in document processing. Considering various layouts, semantic scopes, and languages, VIE encompasses an extensive range of types, potentially numbering in the thousands. However, many of these types suffer from a lack of training data, which poses significant challenges. In this paper, we propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text and assembling them based on the provided prompts. Furthermore, three pre-training strategies are employed to enhance the model’s perception of spatial context information. Besides, a prompt-aware resampler is specially designed to enable efficient matching by leveraging the entity-semantic prior contained in prompts. The introduction of the prompt-based retrieval mechanism and the pre-training strategies enables the model to acquire more effective spatial and semantic clues with limited training samples. Experiments demonstrate that the proposed method achieves highly competitive results in full-sample training, while notably outperforming the baseline in the 1-shot, 5-shot, and 10-shot settings.
[CV-81] Casual Inference via Style Bias Deconfounding for Domain Generalization
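[Quick read]: This paper addresses the unreliability of deep neural networks (DNNs) on out-of-distribution data. In domain generalization, existing methods usually overlook the effect of style frequency within the training set, so models capture spurious visual correlations rather than causal representations, which undermines inference reliability. The paper proposes Style Deconfounding Causal Learning (SDCL). Its key step is to build a structural causal model (SCM) tailored to domain generalization and apply a backdoor adjustment strategy to account for the influence of style. A style-guided expert module (SGEM) adaptively clusters style distributions during training, and a back-door causal learning module (BDCL) performs causal interventions during feature extraction so that global confounding styles are integrated fairly into sample predictions, reducing style bias. SDCL can be combined with state-of-the-art data augmentation techniques, and experiments on diverse natural and medical image recognition tasks validate its effectiveness.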
Link: https://arxiv.org/abs/2503.16852
Authors: Jiaxi Li, Di Lin, Hao Chen, Hongying Liu, Liang Wan, Wei Feng
Institutions: Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China; College of Intelligence and Computing, Tianjin University, Tianjin 300072, China; Department of Computer Science and Engineering and the Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: under review
Abstract:Deep neural networks (DNNs) often struggle with out-of-distribution data, limiting their reliability in diverse real-world applications. To address this issue, domain generalization methods have been developed to learn domain-invariant features from single or multiple training domains, enabling generalization to unseen testing domains. However, existing approaches usually overlook the impact of style frequency within the training set. This oversight predisposes models to capture spurious visual correlations caused by style confounding factors, rather than learning truly causal representations, thereby undermining inference reliability. In this work, we introduce Style Deconfounding Causal Learning (SDCL), a novel causal inference-based framework designed to explicitly address style as a confounding factor. Our approach begins with constructing a structural causal model (SCM) tailored to the domain generalization problem and applies a backdoor adjustment strategy to account for style influence. Building on this foundation, we design a style-guided expert module (SGEM) to adaptively cluster style distributions during training, capturing the global confounding style. Additionally, a back-door causal learning module (BDCL) performs causal interventions during feature extraction, ensuring fair integration of global confounding styles into sample predictions, effectively reducing style bias. The SDCL framework is highly versatile and can be seamlessly integrated with state-of-the-art data augmentation techniques. Extensive experiments across diverse natural and medical image recognition tasks validate its efficacy, demonstrating superior performance in both multi-domain and the more challenging single-domain generalization scenarios.
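For reference, the backdoor adjustment this abstract relies on has a standard textbook form. In the sketch below the symbol names are assumptions chosen for illustration: X is the input image (or its features), Y the class label, and S the style confounder. The abstract does not give the paper's own notation or its approximation of the sum.

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{s} P\bigl(Y \mid X,\, S = s\bigr)\, P(S = s)
```

The prediction is averaged over styles weighted by their marginal frequency P(S = s) rather than by P(S = s | X), which is what removes the spurious dependence of the prediction on frequently seen styles.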
[CV-82] HSM: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation
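[Quick read]: This paper addresses the difficulty of synthesizing scenes with dense object arrangements in indoor 3D scene layout generation. Existing methods focus mainly on large furniture and neglect small objects, producing unrealistically empty scenes; methods that do place small objects typically ignore the specified arrangement, so placement is largely random and does not follow the text description. The key contribution is HSM (Hierarchical Scene Motifs), a hierarchical framework for dense object arrangements across spatial scales. HSM exploits the inherent hierarchy of indoor scenes, in which surfaces support objects at different scales (from large furniture on floors to small items on tables and shelves), and uses recurring cross-scale spatial patterns to generate complex, realistic indoor scenes in a unified manner. Experiments show that HSM produces scenes that are more realistic and conform better to user input than existing methods across room types and spatial configurations.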
Link: https://arxiv.org/abs/2503.16848
Authors: Hou In Derek Pun, Hou In Ivan Tam, Austin T. Wang, Xiaoliang Huo, Angel X. Chang, Manolis Savva
Institutions: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 7 figures
Abstract:Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods primarily focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Those that place small objects typically do not honor arrangement specifications, resulting in largely random placement not following the text description. We present HSM, a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic indoor scenes in a unified manner. Our experiments show that HSM outperforms existing methods by generating scenes that are more realistic and better conform to user input across room types and spatial configurations.
[CV-83] LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models CVPR2025
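[Quick read]: This paper addresses how to retain both general and specialized knowledge when adapting Multimodal Large Language Models (MLLMs) to specific downstream tasks. Low-Rank Adaptation (LoRA) is widely used to acquire specialized knowledge in MLLMs efficiently, but during visual instruction tuning it introduces substantial harmful redundancy, which worsens the forgetting of general knowledge and degrades downstream task performance. The paper proposes LoRASculpt, which removes the harmful redundant parameters to balance general and specialized knowledge.
The key to the solution is twofold. First, sparse updates are introduced into LoRA, with theoretical guarantees, to discard redundant parameters effectively. Second, a Conflict Mitigation Regularizer refines LoRA's update trajectory and mitigates knowledge conflicts with the pretrained weights. Experiments show that even at a very high degree of sparsity (≤5%), the method improves both generalization and downstream task performance, indicating that it mitigates catastrophic forgetting and promotes knowledge harmonization in MLLMs.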
Link: https://arxiv.org/abs/2503.16843
Authors: Jian Liang, Wenke Huang, Guancheng Wan, Qu Yang, Mang Ye
Institutions: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025
Abstract:While Multimodal Large Language Models (MLLMs) excel at generalizing across modalities and tasks, effectively adapting them to specific downstream tasks while simultaneously retaining both general and specialized knowledge remains challenging. Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. Specifically, under theoretical guarantees, we introduce sparse updates into LoRA to discard redundant parameters effectively. Furthermore, we propose a Conflict Mitigation Regularizer to refine the update trajectory of LoRA, mitigating knowledge conflicts with the pretrained weights. Extensive experimental results demonstrate that even at a very high degree of sparsity (≤ 5%), our method simultaneously enhances generalization and downstream task performance. This confirms that our approach effectively mitigates the catastrophic forgetting issue and further promotes knowledge harmonization in MLLMs.
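To make "sparse updates in LoRA" concrete, the sketch below forms the low-rank update ΔW = B·A and keeps only its largest-magnitude entries. This is a hypothetical illustration: the abstract does not state how LoRASculpt selects which entries to keep, so magnitude-based selection and the names `matmul` and `sparsify_top_fraction` are assumptions.

```python
def matmul(B, A):
    """Plain matrix product of row-major lists: (d x r) times (r x k)."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))]
            for i in range(len(B))]

def sparsify_top_fraction(delta, keep_fraction):
    """Zero every entry except (roughly) the top keep_fraction by magnitude.

    Entries tied with the threshold are all kept, so slightly more than the
    requested fraction may survive.
    """
    magnitudes = sorted((abs(v) for row in delta for v in row), reverse=True)
    k = max(1, int(round(keep_fraction * len(magnitudes))))
    threshold = magnitudes[k - 1]
    return [[v if abs(v) >= threshold else 0.0 for v in row] for row in delta]

# Rank-1 toy update: B is 2 x 1, A is 1 x 3, so delta_w is 2 x 3.
B = [[1.0], [2.0]]
A = [[3.0, -1.0, 0.5]]
delta_w = matmul(B, A)
sparse_delta_w = sparsify_top_fraction(delta_w, keep_fraction=1 / 3)
```

The sparse update would then be added to the frozen pretrained weight, leaving most pretrained entries untouched.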
[CV-84] Safe and Reliable Diffusion Models via Subspace Projection
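[Quick read]: This paper addresses the risk that large-scale text-to-image (T2I) diffusion models inadvertently generate inappropriate content, such as copyrighted works or offensive images. Existing methods try to eliminate specific unwanted concepts but often fail to remove them completely, so the concepts reappear in subtle forms. The paper proposes SAFER, an efficient method for thoroughly removing target concepts from diffusion models. SAFER exploits the low-dimensional structure of the text embedding space: it first identifies the subspace S_c associated with the target concept, then projects prompt embeddings onto the complementary subspace of S_c, erasing the concept from generated images. To estimate the subspace more precisely and improve removal, it uses textual inversion to learn an optimized embedding of the target concept from a reference image. A subspace expansion strategy further ensures comprehensive and robust erasure. Experiments show that SAFER consistently and effectively removes unwanted concepts while preserving generation quality.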
Link: https://arxiv.org/abs/2503.16835
Authors: Huiqiang Chen, Tianqing Zhu, Linlin Wang, Xin Yu, Longxiang Gao, Wanlei Zhou
Institutions: Faculty of Data Science, City University of Macau; School of Computer Science, University of Queensland; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences); Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Large-scale text-to-image (T2I) diffusion models have revolutionized image generation, enabling the synthesis of highly detailed visuals from textual descriptions. However, these models may inadvertently generate inappropriate content, such as copyrighted works or offensive images. While existing methods attempt to eliminate specific unwanted concepts, they often fail to ensure complete removal, allowing the concept to reappear in subtle forms. For instance, a model may successfully avoid generating images in Van Gogh’s style when explicitly prompted with ‘Van Gogh’, yet still reproduce his signature artwork when given the prompt ‘Starry Night’. In this paper, we propose SAFER, a novel and efficient approach for thoroughly removing target concepts from diffusion models. At a high level, SAFER is inspired by the observed low-dimensional structure of the text embedding space. The method first identifies a concept-specific subspace S_c associated with the target concept c. It then projects the prompt embeddings onto the complementary subspace of S_c , effectively erasing the concept from the generated images. Since concepts can be abstract and difficult to fully capture using natural language alone, we employ textual inversion to learn an optimized embedding of the target concept from a reference image. This enables more precise subspace estimation and enhances removal performance. Furthermore, we introduce a subspace expansion strategy to ensure comprehensive and robust concept erasure. Extensive experiments demonstrate that SAFER consistently and effectively erases unwanted concepts from diffusion models while preserving generation quality.
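The projection step described in this abstract can be sketched as follows: given a few vectors assumed to span the concept subspace S_c, orthonormalize them and subtract the embedding's component inside that span. This is a minimal illustration only. How SAFER estimates S_c (textual inversion, subspace expansion) is not reproduced, and `orthonormal_basis` and `project_out` are hypothetical names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(embedding, concept_vectors):
    """Project an embedding onto the orthogonal complement of the concept span."""
    result = list(embedding)
    for u in orthonormal_basis(concept_vectors):
        c = dot(result, u)
        result = [ri - c * ui for ri, ui in zip(result, u)]
    return result
```

After projection the embedding has zero inner product with every concept vector, so any direction of the prompt that lies in the estimated subspace is removed before it reaches the generator.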
[CV-85] Joint Self-Supervised Video Alignment and Action Segmentation
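[Quick read]: This paper addresses simultaneous self-supervised video alignment and action segmentation within a unified optimal transport framework. The key is a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations to solve the optimal transport problem. The resulting single-task method achieves state-of-the-art performance on multiple video alignment benchmarks and outperforms an approach that relies on the traditional Kantorovich optimal transport formulation. The authors then extend it to a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which trains and stores a single model and saves time and memory compared with two separate single-task models. Experiments show that the multi-task method is comparable on video alignment and superior on action segmentation relative to previous methods. To the authors' knowledge, this is the first work to unify video alignment and action segmentation in a single model.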
Link: https://arxiv.org/abs/2503.16832
Authors: Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran
Institutions: Retrocausal, Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce a novel approach for simultaneous self-supervised video alignment and action segmentation based on a unified optimal transport framework. In particular, we first tackle self-supervised video alignment by developing a fused Gromov-Wasserstein optimal transport formulation with a structural prior, which trains efficiently on GPUs and needs only a few iterations for solving the optimal transport problem. Our single-task method achieves the state-of-the-art performance on multiple video alignment benchmarks and outperforms VAVA, which relies on a traditional Kantorovich optimal transport formulation with an optimality prior. Furthermore, we extend our approach by proposing a unified optimal transport framework for joint self-supervised video alignment and action segmentation, which requires training and storing a single model and saves both time and memory consumption as compared to two different single-task models. Extensive evaluations on several video alignment and action segmentation datasets demonstrate that our multi-task method achieves comparable video alignment yet superior action segmentation results over previous methods in video alignment and action segmentation respectively. Finally, to the best of our knowledge, this is the first work to unify video alignment and action segmentation into a single model.
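As background for the optimal transport terminology in this abstract, the sketch below solves the classic entropy-regularized Kantorovich problem with Sinkhorn iterations. It is explicitly not the paper's fused Gromov-Wasserstein formulation. The cost matrix is assumed to hold frame-to-frame distances between two videos, and `sinkhorn` with its parameters is a hypothetical name.

```python
import math

def sinkhorn(cost, a, b, reg=0.1, iters=200):
    """Entropic OT plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two frames per video; matching frames (the diagonal) are cheap to align.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], a=[0.5, 0.5], b=[0.5, 0.5])
```

Each entry of the plan says how much mass of a frame in the first video is assigned to a frame in the second. A Gromov-Wasserstein term would additionally compare within-video distance structure, which this sketch omits.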
[CV-86] SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion
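[Quick read]: This paper addresses incomplete scene semantics caused by visual occlusion in scene semantic completion (SSC). Existing camera-based methods do well in visible areas but struggle to capture complete scene semantics under frequent occlusions. The paper proposes SGFormer, the first satellite-ground cooperative SSC framework, which explores the potential of satellite-ground image pairs for SSC. The key is a dual-branch architecture that encodes the orthogonal satellite and ground views in parallel and unifies them into a common domain. A ground-view guidance strategy corrects satellite image biases during feature encoding to address misalignment between the two views, and an adaptive weighting strategy balances their contributions. Experiments show that SGFormer outperforms the state of the art on the SemanticKITTI and SSCBench-KITTI-360 datasets.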
Link: https://arxiv.org/abs/2503.16825
Authors: Xiyue Guo, Jiarui Hu, Junjie Hu, Hujun Bao, Guofeng Zhang
Institutions: State Key Lab of CAD&CG, Zhejiang University; Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Recently, camera-based solutions have been extensively explored for scene semantic completion (SSC). Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. Specifically, we propose a dual-branch architecture that encodes orthogonal satellite and ground views in parallel, unifying them into a common domain. Additionally, we design a ground-view guidance strategy that corrects satellite image biases during feature encoding, addressing misalignment between satellite and ground views. Moreover, we develop an adaptive weighting strategy that balances contributions from satellite and ground views. Experiments demonstrate that SGFormer outperforms the state of the art on SemanticKITTI and SSCBench-KITTI-360 datasets. Our code is available on this https URL.
[CV-87] RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos
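[Quick read]: This paper addresses modeling articulated objects captured in 2D videos to enable novel view synthesis while keeping the model easy to edit, drive, and re-pose. It proposes RigGS, a paradigm that combines a 3D Gaussian representation with a skeleton-based motion representation and needs no additional template priors. First, skeleton-aware node-controlled deformation deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are simplified into a sparse 3D skeleton according to their motion and semantic information. Based on this skeleton, learnable skin deformations and pose-dependent detailed deformations allow the 3D Gaussian representation to be deformed to generate new actions and to render high-quality images from novel views. Experiments show that the method generates realistic new actions easily and achieves high-quality rendering.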
Link: https://arxiv.org/abs/2503.16822
Authors: Yuxin Yao, Zhi Deng, Junhui Hou
Institutions: City University of Hong Kong; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This paper considers the problem of modeling articulated objects captured in 2D videos to enable novel view synthesis, while also being easily editable, drivable, and re-posable. To tackle this challenging problem, we propose RigGS, a new paradigm that leverages 3D Gaussian representation and skeleton-based motion representation to model dynamic objects without utilizing additional template priors. Specifically, we first propose skeleton-aware node-controlled deformation, which deforms a canonical 3D Gaussian representation over time to initialize the modeling process, producing candidate skeleton nodes that are further simplified into a sparse 3D skeleton according to their motion and semantic information. Subsequently, based on the resulting skeleton, we design learnable skin deformations and pose-dependent detailed deformations, thereby easily deforming the 3D Gaussian representation to generate new actions and render further high-quality images from novel views. Extensive experiments demonstrate that our method can generate realistic new actions easily for objects and achieve high-quality rendering.
[CV-88] ST-Prompt Guided Histological Hypergraph Learning for Spatial Gene Expression Prediction
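[Quick read]: This paper addresses predicting the Spatial Transcriptomics (ST) landscape across whole tissue sections from H&E-stained histology slides. Predicting ST directly from H&E images is hard because the relationship between histomorphology and gene expression is heterogeneous, owing to substantial variability across patients and tissue sections. The key contribution is PHG2ST, an ST-prompt guided histological hypergraph learning framework. It uses sparse ST signals from a few local regions as prompts to guide histological hypergraph learning, and fuses hypergraph representations at multiple scales through a masked ST-prompt encoding mechanism, improving robustness and generalizability. On two public ST datasets the method outperforms existing state-of-the-art methods and closely matches the ground truth, which points to the potential of sparse local ST data for scalable, cost-effective spatial gene expression mapping in real-world biomedical applications.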
Link: https://arxiv.org/abs/2503.16816
Authors: Yi Niu, Jiashuai Liu, Yingkang Zhan, Jiangbo Shi, Di Zhang, Ines Machado, Mireia Crispin-Ortuzar, Chen Li, Zeyu Gao
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spatial Transcriptomics (ST) reveals the spatial distribution of gene expression in tissues, offering critical insights into biological processes and disease mechanisms. However, predicting ST from H&E-stained histology images is challenging due to the heterogeneous relationship between histomorphology and gene expression, which arises from substantial variability across different patients and tissue sections. A more practical and valuable approach is to utilize ST data from a few local regions to predict the spatial transcriptomic landscape across the remaining regions in H&E slides. In response, we propose PHG2ST, an ST-prompt guided histological hypergraph learning framework, which leverages sparse ST signals as prompts to guide histological hypergraph learning for global spatial gene expression prediction. Our framework fuses histological hypergraph representations at multiple scales through a masked ST-prompt encoding mechanism, improving robustness and generalizability. Benchmark evaluations on two public ST datasets demonstrate that PHG2ST outperforms the existing state-of-the-art methods and closely aligns with the ground truth. These results underscore the potential of leveraging sparse local ST data for scalable and cost-effective spatial gene expression mapping in real-world biomedical applications.
[CV-89] Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision
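[Quick read]: This paper addresses label redundancy in LiDAR-based 3D scene understanding. Conventional detection and segmentation methods are supervised with bounding box labels and semantic mask labels respectively, and these two independent label types are substantially redundant. To remove the redundancy, the paper supervises 3D object detection with semantic labels only. The difficulty is that the incomplete geometry and ambiguous boundaries of point-cloud instances lead to inaccurate pseudo labels and poor detection results.
To meet this challenge the paper proposes Seg2Box. Its key component is a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which uses the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. A Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module further improves performance by progressively refining the pseudo-labels and mining instances that have no pseudo-label. Experiments show that the method outperforms competing methods by 23.7% and 10.3% mAP on the Waymo Open Dataset and the nuScenes dataset, respectively.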
Link: https://arxiv.org/abs/2503.16811
Authors: Maoji Zheng, Ziyu Xu, Qiming Xia, Hai Wu, Chenglu Wen, Cheng Wang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures
Abstract:LiDAR-based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent labels inherently contain significant redundancy. This paper aims to eliminate the redundancy by supervising 3D object detection using only semantic labels. However, the challenge arises due to the incomplete geometry structure and boundary ambiguity of point-cloud instances, leading to inaccurate pseudo labels and poor detection results. To address these challenges, we propose a novel method, named Seg2Box. We first introduce a Multi-Frame Multi-Scale Clustering (MFMS-C) module, which leverages the spatio-temporal consistency of point clouds to generate accurate box-level pseudo-labels. Additionally, the Semantic-Guiding Iterative-Mining Self-Training (SGIM-ST) module is proposed to enhance the performance by progressively refining the pseudo-labels and mining the instances without generating pseudo-labels. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the great label-efficient potential and advancement of our method.
[CV-90] Auto-Regressive Diffusion for Generating 3D Human-Object Interactions
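[Quick read]: This paper addresses interaction consistency over long sequences in text-driven human-object interaction (Text-to-HOI) generation. Existing Text-to-Motion approaches, such as discrete motion tokenization, cannot be applied directly because HOI data is scarce and the modality is complex. The paper proposes an autoregressive diffusion model (ARDHOI) whose key idea is to predict the next continuous token. Specifically, a Contrastive Variational Autoencoder (cVAE) learns a physically plausible space of continuous HOI tokens so that generated motions are realistic and natural; a Mamba-based context encoder captures and maintains consistency across the action sequence; and an MLP-based denoiser generates the next token conditioned on the encoded context. On the OMOMO and BEHAVE datasets, ARDHOI outperforms existing state-of-the-art methods in both quality and inference speed.
Link: https://arxiv.org/abs/2503.16801
Authors: Zichen Geng,Zeeshan Hayder,Wei Liu,Ajmal Saeed Mian
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: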
Abstract:Text-driven Human-Object Interaction (Text-to-HOI) generation is an emerging field with applications in animation, video games, virtual reality, and robotics. A key challenge in HOI generation is maintaining interaction consistency in long sequences. Existing Text-to-Motion-based approaches, such as discrete motion tokenization, cannot be directly applied to HOI generation due to limited data in this domain and the complexity of the modality. To address the problem of interaction consistency in long sequences, we propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token. Specifically, we introduce a Contrastive Variational Autoencoder (cVAE) to learn a physically plausible space of continuous HOI tokens, thereby ensuring that generated human-object motions are realistic and natural. For generating sequences autoregressively, we develop a Mamba-based context encoder to capture and maintain consistent sequential actions. Additionally, we implement an MLP-based denoiser to generate the subsequent token conditioned on the encoded context. Our model has been evaluated on the OMOMO and BEHAVE datasets, where it outperforms existing state-of-the-art methods in terms of both performance and inference speed. This makes ARDHOI a robust and efficient solution for text-driven HOI tasks.
[CV-91] DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics
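[Quick read]: This paper tackles the challenge of precisely locating and editing the target semantics in text-guided image editing, where previous methods fall short. Its key contributions are a Precise Semantic Localization strategy, which uses visual and textual self-attention to enhance the cross-attention map so that it can serve as a regional cue for editing, and a Dual-Level Control mechanism, which injects these regional cues at both the feature level and the latent level for finer-grained control. For evaluation, the authors build the RW-800 benchmark, which features high-resolution images, long descriptive texts, real-world images, and a new text editing task. Experiments show that the method preserves backgrounds well and produces accurate edits.
Link: https://arxiv.org/abs/2503.16795
Authors: Yihan Hu,Jianing Peng,Yiheng Lin,Ting Liu,Xiaochao Qu,Luoqi Liu,Yao Zhao,Yunchao Wei
Affiliations: Institute of Information Science, Beijing Jiaotong University (北京交通大学信息科学研究所); MT Lab, Meitu Inc (美图公司MT Lab); Pengcheng Laboratory, Shenzhen, China (鹏城实验室,中国深圳); Visual Intelligence + X International Joint Laboratory of the Ministry of Education (教育部视觉智能+X国际联合实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: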
Abstract:This paper presents a novel approach to improving text-guided image editing using diffusion-based models. The text-guided image editing task poses the key challenge of precisely locating and editing the target semantics, and previous methods fall short in this respect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.
[CV-92] Restoring Forgotten Knowledge in Non-Exemplar Class Incremental Learning through Test-Time Semantic Evolution
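[Quick read]: This paper addresses catastrophic forgetting in Non-exemplar Class Incremental Learning (NECIL), where old classes are inaccessible during incremental optimization and prior knowledge is therefore hard to retain. Previous methods struggle to balance stability and plasticity during training, whereas the paper observes that the testing stage has rarely been considered and is a promising place to counter forgetting. It proposes RoSE, a simple yet effective method that restores forgotten knowledge through test-time Semantic Evolution. RoSE is a test-time semantic drift compensation framework that estimates drift more accurately in a self-supervised manner; to avoid incomplete optimization during online testing, it uses an analytical solution in place of gradient descent. On CIFAR-100, TinyImageNet, and ImageNet100, RoSE outperforms most state-of-the-art methods, supporting the feasibility of test-time evolution in NECIL.
Link: https://arxiv.org/abs/2503.16793
Authors: Haori Lu,Xusheng Cao,Linlan Huang,Enguang Wang,Fei Yang,Xialei Liu
Affiliations: VCIP, CS, Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳福田)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: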
Abstract:Continual learning aims to accumulate knowledge over a data stream while mitigating catastrophic forgetting. In Non-exemplar Class Incremental Learning (NECIL), forgetting arises during incremental optimization because old classes are inaccessible, hindering the retention of prior knowledge. To solve this, previous methods struggle in achieving the stability-plasticity balance in the training stages. However, we note that the testing stage is rarely considered among them, but is promising to be a solution to forgetting. Therefore, we propose RoSE, which is a simple yet effective method that Restores forgotten knowledge through test-time Semantic Evolution. Specifically designed for minimizing forgetting, RoSE is a test-time semantic drift compensation framework that enables more accurate drift estimation in a self-supervised manner. Moreover, to avoid incomplete optimization during online testing, we derive an analytical solution as an alternative to gradient descent. We evaluate RoSE on CIFAR-100, TinyImageNet, and ImageNet100 datasets, under both cold-start and warm-start settings. Our method consistently outperforms most state-of-the-art (SOTA) methods across various scenarios, validating the potential and feasibility of test-time evolution in NECIL.
[CV-93] Learning Part Knowledge to Facilitate Category Understanding for Fine-Grained Generalized Category Discovery
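[Quick read]: This paper addresses Generalized Category Discovery (GCD) in fine-grained scenarios. Existing methods rely on contrastive learning over global image features to capture discriminative cues, which misses the subtle local differences needed to separate fine-grained categories. The key idea is to incorporate part knowledge: Adaptive Part Decomposition automatically extracts class-specific semantic parts with Gaussian Mixture Models, and Part Discrepancy Regularization explicitly separates part features to amplify fine-grained local distinctions. Experiments show state-of-the-art performance on multiple fine-grained benchmarks while remaining competitive on generic datasets.
Link: https://arxiv.org/abs/2503.16782
Authors: Enguang Wang,Zhimao Peng,Zhengyuan Xie,Haori Lu,Fei Yang,Xialei Liu
Affiliations: VCIP, CS, Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳福田)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: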
Abstract:Generalized Category Discovery (GCD) aims to classify unlabeled data containing both seen and novel categories. Although existing methods perform well on generic datasets, they struggle in fine-grained scenarios. We attribute this difficulty to their reliance on contrastive learning over global image features to automatically capture discriminative cues, which fails to capture the subtle local differences essential for distinguishing fine-grained categories. Therefore, in this paper, we propose incorporating part knowledge to address fine-grained GCD, which introduces two key challenges: the absence of annotations for novel classes complicates the extraction of the part features, and global contrastive learning prioritizes holistic feature invariance, inadvertently suppressing discriminative local part patterns. To address these challenges, we propose PartGCD, including 1) Adaptive Part Decomposition, which automatically extracts class-specific semantic parts via Gaussian Mixture Models, and 2) Part Discrepancy Regularization, enforcing explicit separation between part features to amplify fine-grained local part distinctions. Experiments demonstrate state-of-the-art performance across multiple fine-grained benchmarks while maintaining competitiveness on generic datasets, validating the effectiveness and robustness of our approach.
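[CV-94] A-IDE: Agent-Integrated Denoising Experts
[Quick read]: This paper addresses the difficulty a single model has in generalizing across multiple anatomies when denoising Low-Dose CT (LDCT) images. The proposed Agent-Integrated Denoising Experts (A-IDE) framework places three RED-CNN models, each specialized for a different anatomical region, under the management of a decision-making large language model (LLM) agent. The agent uses semantic cues extracted by BiomedCLIP to route each incoming LDCT scan to the most suitable expert. According to the authors, this improves performance, is robust in heterogeneous and data-scarce settings, prevents overfitting by distributing tasks among experts, and requires no manual intervention.
Link: https://arxiv.org/abs/2503.16780
Authors: Uihyun Cho,Namhun Kim
Affiliations: Yonsei University (延世大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 11 figures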
Abstract:Recent advances in deep-learning based denoising methods have improved Low-Dose CT image quality. However, due to distinct HU distributions and diverse anatomical characteristics, a single model often struggles to generalize across multiple anatomies. To address this limitation, we introduce the Agent-Integrated Denoising Experts (A-IDE) framework, which integrates three anatomical region-specialized RED-CNN models under the management of a decision-making LLM agent. The agent analyzes semantic cues from BiomedCLIP to dynamically route incoming LDCT scans to the most appropriate expert model. We highlight three major advantages of our approach. A-IDE excels in heterogeneous, data-scarce environments. The framework automatically prevents overfitting by distributing tasks among multiple experts. Finally, our LLM-driven agentic pipeline eliminates the need for manual interventions. Experimental evaluations on the Mayo-2016 dataset confirm that A-IDE achieves superior performance in RMSE, PSNR, and SSIM compared to a single unified denoiser.
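The routing idea can be pictured with a small sketch. It is not the paper's agent: a nearest-prototype rule under cosine similarity stands in for the LLM's decision, the embeddings are plain lists rather than BiomedCLIP outputs, and the region names and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(scan_embedding, region_prototypes):
    """Pick the region whose prototype embedding is most similar to the scan."""
    return max(region_prototypes,
               key=lambda name: cosine(scan_embedding, region_prototypes[name]))


def denoise(scan, scan_embedding, region_prototypes, experts):
    """Send the scan to the expert registered for the routed region."""
    region = route(scan_embedding, region_prototypes)
    return region, experts[region](scan)
```

In this sketch each entry of `experts` is any callable that maps a scan to a denoised scan, so one region-specific model could be registered per key.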
[CV-95] OpenCity3D: What do Vision-Language Models know about Urban Environments? WACV2025
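[Quick read]: This paper asks how Vision-Language Models (VLMs), which are mainly applied to indoor scenes or autonomous driving and to low-level tasks, can be extended to high-level tasks in urban-scale environments. The proposed OpenCity3D leverages 3D reconstructions from multi-view aerial imagery to address tasks such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. The study reports strong zero-shot and few-shot capabilities and adaptability to new contexts, pointing toward language-driven urban analytics with applications in planning, policy, and environmental monitoring.
Link: https://arxiv.org/abs/2503.16776
Authors: Valentin Bieri,Marco Zamboni,Nicolas S. Blumer,Qingxuan Chen,Francis Engelmann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at WACV 2025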
Abstract:Vision-language models (VLMs) show great promise for 3D scene understanding but are mainly applied to indoor spaces or autonomous driving, focusing on low-level tasks like segmentation. This work expands their use to urban-scale environments by leveraging 3D reconstructions from multi-view aerial imagery. We propose OpenCity3D, an approach that addresses high-level tasks, such as population density estimation, building age classification, property price prediction, crime rate assessment, and noise pollution evaluation. Our findings highlight OpenCity3D’s impressive zero-shot and few-shot capabilities, showcasing adaptability to new contexts. This research establishes a new paradigm for language-driven urban analytics, enabling applications in planning, policy, and environmental monitoring. See our project page: this http URL
[CV-96] Region Masking to Accelerate Video Processing on Neuromorphic Hardware
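[Quick read]: This paper targets the energy and latency cost of running deep learning models on resource-constrained devices, and in particular the large amount of redundant computation that remains in event-based processing with Spiking Neural Networks (SNNs). The key idea is a region masking strategy that identifies regions of interest at the SNN input and masks out unimportant regions, eliminating the computation and data movement triggered by events from those regions. This reduces the network's overall spiking activity and improves throughput and latency. For video object detection on Loihi 2, masking roughly 60% of the input regions reduces the energy-delay product by 1.65x over a baseline sigma-delta network, at the cost of a 1.09% drop in mAP@0.5.
Link: https://arxiv.org/abs/2503.16775
Authors: Sreetama Sarkar,Sumit Bam Shrestha,Yue Che,Leobardo Campos-Macias,Gourav Datta,Peter A. Beerel
Affiliations: University of Southern California(南加州大学), Los Angeles, USA; Intel Labs(英特尔实验室), Santa Clara, USA; Case Western Reserve University(凯斯西储大学), USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Comments: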
Abstract:The rapidly growing demand for on-chip edge intelligence on resource-constrained devices has motivated approaches to reduce energy and latency of deep learning models. Spiking neural networks (SNNs) have gained particular interest due to their promise to reduce energy consumption using event-based processing. We assert that while sigma-delta encoding in SNNs can take advantage of the temporal redundancy across video frames, they still involve a significant amount of redundant computations due to processing insignificant events. In this paper, we propose a region masking strategy that identifies regions of interest at the input of the SNN, thereby eliminating computation and data movement for events arising from unimportant regions. Our approach demonstrates that masking regions at the input not only significantly reduces the overall spiking activity of the network, but also provides significant improvement in throughput and latency. We apply region masking during video object detection on Loihi 2, demonstrating that masking approximately 60% of input regions can reduce energy-delay product by 1.65x over a baseline sigma-delta network, with a degradation in mAP@0.5 by 1.09%.
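To make the mechanism concrete, here is a minimal sketch of why input masking removes work in a change-driven pipeline. It uses thresholded frame differencing as a simplified stand-in for sigma-delta encoding and a hand-written boolean mask in place of a learned region-of-interest predictor; the function names and the `threshold` parameter are hypothetical.

```python
def change_events(prev_frame, frame, threshold):
    """Return {(row, col): delta} for pixels whose change from the previous frame exceeds threshold."""
    events = {}
    for r, (prev_row, row) in enumerate(zip(prev_frame, frame)):
        for c, (p, v) in enumerate(zip(prev_row, row)):
            delta = v - p
            if abs(delta) > threshold:
                events[(r, c)] = delta
    return events


def mask_events(events, roi_mask):
    """Keep only the events that fall inside the region of interest."""
    return {pos: d for pos, d in events.items() if roi_mask[pos[0]][pos[1]]}


def event_reduction(events, masked):
    """Fraction of events removed by masking (0.0 when there were no events)."""
    return 1.0 - len(masked) / len(events) if events else 0.0
```

Every event dropped at the input is one that no downstream layer has to process or transmit, which is the source of the throughput and latency gains the abstract describes.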
[CV-97] Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking
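[Quick read]: This paper addresses the performance drop of mainstream visual object tracking frameworks in complex scenes (target deformation, occlusion, background clutter), where the quality of template features degrades. Existing spatiotemporal memory-based trackers emphasize larger memory capacity but lack effective mechanisms for dynamic feature selection and adaptive fusion. The paper proposes a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) a differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between template and memory features; and 2) a lightweight gating network that allocates computational resources according to the target's motion state, prioritizing highly discriminative features in challenging scenes. The authors report improved accuracy, robustness, and real-time efficiency.
Link: https://arxiv.org/abs/2503.16768
Authors: Meng Zhou,Jiadong Xie,Mingsheng Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: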
Abstract:Mainstream visual object tracking frameworks predominantly rely on template matching paradigms. Their performance heavily depends on the quality of template features, which becomes increasingly challenging to maintain in complex scenarios involving target deformation, occlusion, and background clutter. While existing spatiotemporal memory-based trackers emphasize memory capacity expansion, they lack effective mechanisms for dynamic feature selection and adaptive fusion. To address this gap, we propose a Dynamic Attention Mechanism in Spatiotemporal Memory Network (DASTM) with two key innovations: 1) A differentiable dynamic attention mechanism that adaptively adjusts channel-spatial attention weights by analyzing spatiotemporal correlations between the templates and memory features; 2) A lightweight gating network that autonomously allocates computational resources based on target motion states, prioritizing high-discriminability features in challenging scenarios. Extensive evaluations on OTB-2015, VOT 2018, LaSOT, and GOT-10K benchmarks demonstrate our DASTM’s superiority, achieving state-of-the-art performance in success rate, robustness, and real-time efficiency, thereby offering a novel solution for real-time tracking in complex environments.
[CV-98] Rethinking the Role of Spatial Mixing
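[Quick read]: This paper examines the role of spatial mixing operations in modern computer vision models and asks whether classification performance can be maintained without training the spatial mixing modules. It also studies whether random, fixed spatial mixing improves robustness. The key finding is that, for both classical (ResNet) and recent (ConvMixer) models, classification performance is nearly unchanged when the spatial mixers are left at their random initialization. Such models are also naturally more robust to adversarial perturbations, and the phenomenon extends beyond classification to decoding pixel-shuffled images.
Link: https://arxiv.org/abs/2503.16760
Authors: George Cazenavette,Joel Julin,Simon Lucey
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: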
Abstract:Until quite recently, the backbone of nearly every state-of-the-art computer vision model has been the 2D convolution. At its core, a 2D convolution simultaneously mixes information across both the spatial and channel dimensions of a representation. Many recent computer vision architectures consist of sequences of isotropic blocks that disentangle the spatial and channel-mixing components. This separation of the operations allows us to more closely juxtapose the effects of spatial and channel mixing in deep learning. In this paper, we take an initial step towards garnering a deeper understanding of the roles of these mixing operations. Through our experiments and analysis, we discover that on both classical (ResNet) and cutting-edge (ConvMixer) models, we can reach nearly the same level of classification performance by leaving the spatial mixers at their random initializations. Furthermore, we show that models with random, fixed spatial mixing are naturally more robust to adversarial perturbations. Lastly, we show that this phenomenon extends past the classification regime, as such models can also decode pixel-shuffled images.
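[CV-99] elaTCSF: A Temporal Contrast Sensitivity Function for Flicker Detection and Modeling Variable Refresh Rate Flicker SIGGRAPH
[Quick read]: This paper addresses two limitations in flicker perception modeling: traditional measures such as the Critical Flicker Frequency (CFF) are suited mainly to high-contrast flicker, and the Temporal Contrast Sensitivity Function in the Information Display Measurements Standard (TCSF_IDMS) ignores luminance, eccentricity, and area, while existing models that include these parameters are inadequate for flicker detection, especially at low spatial frequencies. The key contribution is elaTCSF, which extends TCSF_IDMS and combines it with a new spatial probability summation model to account for luminance, eccentricity, and area. The model is trained on several flicker detection datasets and verified on the first variable refresh rate (VRR) flicker detection dataset. It can predict flicker caused by low persistence in VR headsets, identify flicker-free VRR operating ranges, and determine flicker sensitivity in lighting design.
Link: https://arxiv.org/abs/2503.16759
Authors: Yancheng Cai,Ali Bozorgian,Maliha Ashraf,Robert Wanat,Rafał K. Mantiuk
Affiliations: University of Cambridge (剑桥大学); Norwegian University of Science and Technology (挪威科技大学); LG Electronics North America (LG电子北美); University of Cambridge (剑桥大学)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at SIGGRAPH Asia 2024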
Abstract:The perception of flicker has been a prominent concern in illumination and electronic display fields for over a century. Traditional approaches often rely on Critical Flicker Frequency (CFF), primarily suited for high-contrast (full-on, full-off) flicker. To tackle varying contrast flicker, the International Committee for Display Metrology (ICDM) introduced a Temporal Contrast Sensitivity Function TCSF_IDMS within the Information Display Measurements Standard (IDMS). Nevertheless, this standard overlooks crucial parameters: luminance, eccentricity, and area. Existing models incorporating these parameters are inadequate for flicker detection, especially at low spatial frequencies. To address these limitations, we extend the TCSF_IDMS and combine it with a new spatial probability summation model to incorporate the effects of luminance, eccentricity, and area (elaTCSF). We train the elaTCSF on various flicker detection datasets and establish the first variable refresh rate flicker detection dataset for further verification. Additionally, we contribute to resolving a longstanding debate on whether the flicker is more visible in peripheral vision. We demonstrate how elaTCSF can be used to predict flicker due to low-persistence in VR headsets, identify flicker-free VRR operational ranges, and determine flicker sensitivity in lighting design.
[CV-100] SAGE: Semantic-Driven Adaptive Gaussian Splatting in Extended Reality
[Quick read]: This paper addresses the high memory footprint and computational overhead of 3D Gaussian Splatting (3DGS) in eXtended Reality (XR) applications caused by a fixed Level of Detail (LOD). The proposed SAGE (Semantic-Driven Adaptive Gaussian Splatting in Extended Reality) uses semantic segmentation to dynamically adapt the LOD of individual 3DGS objects, reducing memory use and computation while maintaining a target visual quality.
Link: https://arxiv.org/abs/2503.16747
Authors: Chiara Schiavo,Elena Camuffo,Leonardo Badia,Simone Milani
Affiliations: Dept. of Information Engineering, University of Padova (帕多瓦大学)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:
Abstract:3D Gaussian Splatting (3DGS) has significantly improved the efficiency and realism of three-dimensional scene visualization in several applications, ranging from robotics to eXtended Reality (XR). This work presents SAGE (Semantic-Driven Adaptive Gaussian Splatting in Extended Reality), a novel framework designed to enhance the user experience by dynamically adapting the Level of Detail (LOD) of different 3DGS objects identified via a semantic segmentation. Experimental results demonstrate how SAGE effectively reduces memory and computational overhead while keeping a desired target visual quality, thus providing a powerful optimization for interactive XR applications.
[CV-101] Digitally Prototype Your Eye Tracker: Simulating Hardware Performance using 3D Synthetic Data
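[Quick read]: This paper addresses the difficulty of assessing how eye tracking (ET) hardware design choices affect machine learning-based ET performance, given that data from many real hardware variants is expensive and hard to collect at scale. The key idea is an end-to-end evaluation method that uses only synthetic data. Starting from a dataset of real 3D eyes reconstructed from light dome data with Neural Radiance Fields (NeRF), the method synthesizes eye images from novel viewpoints and camera parameters, which allows ET performance to be simulated across hardware configurations that vary in sensor noise, illumination brightness, and optical blur. The simulator correlates strongly with a publicly available eye tracking dataset, and the paper presents a first analysis of how ET camera position affects performance, enabling faster prototyping of ET hardware.
Link: https://arxiv.org/abs/2503.16742
Authors: Esther Y. H. Lin,Yimin Ding,Jogendra Kundu,Yatong An,Mohamed T. El-Haddad,Alexander Fix
Affiliations: Meta (Meta); University of Toronto (多伦多大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 12 figures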
Abstract:Eye tracking (ET) is a key enabler for Augmented and Virtual Reality (AR/VR). Prototyping new ET hardware requires assessing the impact of hardware choices on eye tracking performance. This task is compounded by the high cost of obtaining data from sufficiently many variations of real hardware, especially for machine learning, which requires large training datasets. We propose a method for end-to-end evaluation of how hardware changes impact machine learning-based ET performance using only synthetic data. We utilize a dataset of real 3D eyes, reconstructed from light dome data using neural radiance fields (NeRF), to synthesize captured eyes from novel viewpoints and camera parameters. Using this framework, we demonstrate that we can predict the relative performance across various hardware configurations, accounting for variations in sensor noise, illumination brightness, and optical blur. We also compare our simulator with the publicly available eye tracking dataset from the Project Aria glasses, demonstrating a strong correlation with real-world performance. Finally, we present a first-of-its-kind analysis in which we vary ET camera positions, evaluating ET performance ranging from on-axis direct views of the eye to peripheral views on the frame. Such an analysis would have previously required manufacturing physical devices to capture evaluation data. In short, our method enables faster prototyping of ET hardware.
[CV-102] EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
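[Quick read]: This paper addresses the efficiency bottleneck of conventional Diffusion Transformers (DiTs) and Multimodal DiTs (MM-DiTs) when generating high-resolution images or running on resource-limited devices. The proposed Efficient Diffusion Transformer (EDiT) has two core components: 1) a linear compressed attention method that modulates queries with local information through a multi-layer convolutional network while spatially aggregating keys and values; and 2) a hybrid attention scheme for multimodal inputs that combines linear attention for image-to-image interactions with standard scaled dot-product attention for interactions involving prompts. Merging the two yields an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT) that achieves up to 2.2x speedup with comparable image quality.
Link: https://arxiv.org/abs/2503.16726
Authors: Philipp Becker,Abhinav Mehrotra,Ruchika Chavhan,Malcolm Chadwick,Luca Morreale,Mehdi Noroozi,Alberto Gil Ramos,Sourav Bhattacharya
Affiliations: Samsung AI Center, Cambridge (三星人工智能中心, 剑桥); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: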
Abstract:Diffusion Transformers (DiTs) have emerged as a leading architecture for text-to-image synthesis, producing high-quality and photorealistic images. However, the quadratic scaling properties of the attention in DiTs hinder image generation with higher resolution or on devices with limited resources. This work introduces an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks in conventional DiTs and Multimodal DiTs (MM-DiTs). First, we present a novel linear compressed attention method that uses a multi-layer convolutional network to modulate queries with local information while keys and values are spatially aggregated. Second, we formulate a hybrid attention scheme for multi-modal inputs that combines linear attention for image-to-image interactions and standard scaled dot-product attention for interactions involving prompts. Merging these two approaches leads to an expressive, linear-time Multimodal Efficient Diffusion Transformer (MM-EDiT). We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (conventional DiT) and Stable Diffusion 3.5-Medium (MM-DiT), achieving up to 2.2x speedup with comparable image quality after distillation.
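For readers unfamiliar with linear attention in general, the sketch below shows the standard kernel trick that makes the cost linear in sequence length: keys and values are summarized once, and each query reuses the summary. It does not reproduce EDiT's convolutional query modulation or spatial aggregation; the feature map `elu(x) + 1` is a common choice assumed here, and all function names are hypothetical.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Kernel attention in O(N * d * d_v): summarize keys and values once, then reuse."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k_j) v_j^T
    k_sum = [0.0] * d                     # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(d_v):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        denom = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / denom for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """The same kernel attention computed pairwise in O(N^2), for comparison."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        weights = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point error; only the order of the multiplications differs, which is what removes the quadratic term.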
[CV-103] Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents IROS2025
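[Quick read]: This paper addresses the efficiency and robustness of autonomous agents that make real-time control decisions from perception alone. The key finding is that fusing depth information with RGB images significantly improves the prediction of steering commands compared with RGB alone. The paper benchmarks lightweight recurrent controllers that use the fused RGB-D features for sequential decision-making, trains them on high-quality collected data, and deploys them on real hardware under several configurations. Early fusion of depth data yields a highly robust controller that remains effective under frame drops and increased noise without losing focus on the task.
Link: https://arxiv.org/abs/2503.16711
Authors: Mihaela-Larisa Clement,Mónika Farsang,Felix Resch,Radu Grosu
Affiliations: CPS, Technische Universität Wien (TU Wien)(维也纳技术大学)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IROS 2025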
Abstract:Autonomous agents that rely purely on perception to make real-time control decisions require efficient and robust architectures. In this work, we demonstrate that augmenting RGB input with depth information significantly enhances our agents’ ability to predict steering commands compared to using RGB alone. We benchmark lightweight recurrent controllers that leverage the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data using a small-scale autonomous car controlled by an expert driver via a physical steering wheel, capturing varying levels of steering difficulty. Our models, trained under diverse configurations, were successfully deployed on real hardware. Specifically, our findings reveal that the early fusion of depth data results in a highly robust controller, which remains effective even with frame drops and increased noise levels, without compromising the network’s focus on the task.
[CV-104] 4D Gaussian Splatting SLAM
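[Quick read]: This paper addresses simultaneous camera pose localization and Gaussian radiance field construction in dynamic scenes, a bridge between 2D images and the 4D real world. Whereas traditional methods remove dynamic objects as distractors and reconstruct only the static environment, the paper proposes an efficient architecture that incrementally tracks camera poses and builds 4D Gaussian radiance fields from a sequence of RGB-D images. It first generates motion masks to obtain static and dynamic priors for each pixel, then classifies the Gaussian primitives into static and dynamic sets, modeling the transformation fields of the dynamic Gaussians with sparse control points and an MLP (multilayer perceptron). A novel 2D optical flow map reconstruction algorithm renders the optical flow of dynamic objects between neighboring images, and this flow supervises the 4D Gaussian radiance fields together with conventional photometric and geometric constraints. Experiments show robust tracking and high-quality view synthesis in real-world environments.
Link: https://arxiv.org/abs/2503.16710
Authors: Yanyan Li,Youxu Fang,Zunjie Zhu,Kunyi Li,Yong Ding,Federico Tombari
Affiliations: Technical University of Munich (慕尼黑工业大学); Hangzhou Dianzi University (杭州电子科技大学); Zhejiang University (浙江大学); Google(谷歌)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: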
Abstract:Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establishes a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while sparse control points along with an MLP are used to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighboring images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.
[CV-105] QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge CVPR2025
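[Quick read]: This paper addresses the challenge of deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), given the models' high computational and memory demands. Recent foundation models for depth estimation deliver impressive results but make deployment on ASICs even harder. The proposed QuartDepth applies post-training quantization to Monocular Depth Estimation (MDE) models, quantizing both weights and activations to 4-bit precision to reduce model size and computation cost. To limit the accuracy loss, it introduces an activation polishing and compensation algorithm and a weight reconstruction method that minimizes weight quantization error. It also includes a flexible, programmable hardware accelerator that supports kernel fusion and customized instruction programmability to improve throughput and efficiency. Experiments show competitive accuracy with fast inference and higher energy efficiency.
Link: https://arxiv.org/abs/2503.16709
Authors: Xuan Shen,Weize Ma,Jing Liu,Changdi Yang,Rui Ding,Quanyi Wang,Henghui Ding,Wei Niu,Yanzhi Wang,Pu Zhao,Jun Lin,Jiuxiang Gu
Affiliations: Northeastern University (东北大学); Nanjing University (南京大学); Monash University (蒙纳士大学); Nanjing University of Information Science and Technology (南京信息工程大学); Fudan University (复旦大学); University of Georgia (乔治亚大学); Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR 2025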
Abstract:Monocular Depth Estimation (MDE) has emerged as a pivotal task in computer vision, supporting numerous real-world applications. However, deploying accurate depth estimation models on resource-limited edge devices, especially Application-Specific Integrated Circuits (ASICs), is challenging due to the high computational and memory demands. Recent advancements in foundational depth estimation deliver impressive results but further amplify the difficulty of deployment on ASICs. To address this, we propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. To mitigate the performance degradation, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. Furthermore, we design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability, enhancing throughput and efficiency. Experimental results demonstrate that our framework achieves competitive accuracy while enabling fast inference and higher energy efficiency on ASICs, bridging the gap between high-performance depth estimation and practical edge-device applicability. Code: this https URL
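To give a sense of what 4-bit precision means in practice, the following sketch applies plain symmetric per-tensor quantization to a list of values. It is the textbook baseline only: the paper's activation polishing, compensation, and weight reconstruction steps are not modeled, and the function names are hypothetical.

```python
def quantize_4bit(values):
    """Symmetric per-tensor quantization to signed 4-bit integer codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to code 7; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 7 if max_abs else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from the integer codes."""
    return [c * scale for c in codes]


def max_abs_error(values, codes, scale):
    """Largest absolute difference between the original and dequantized values."""
    return max(abs(v - r) for v, r in zip(values, dequantize(codes, scale)))
```

With only 16 representable levels, each value can be off by up to half a quantization step, which is why post-training methods add error-compensation stages on top of this baseline.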
[CV-106] Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding CVPR2025
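[Quick read]: This paper addresses how open-vocabulary knowledge is used in 3D scene understanding. Because no large-scale 3D-text corpus exists, existing methods typically rely on a single Vision-Language Model (VLM) to align the feature space of a 3D model with a common language space, which prevents the 3D model from exploiting the diverse spatial and semantic capabilities of different foundation models. The paper proposes CUA-O3D (Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding), described as the first model to integrate multiple foundation models, including CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. Its key component is a deterministic uncertainty estimation that adaptively distills and harmonizes the heterogeneous 2D feature embeddings from these models, combining the geometric knowledge of spatially aware vision foundation models with the semantic priors of VLMs and capturing model-specific uncertainties to reconcile heterogeneous representations during training.
Link: https://arxiv.org/abs/2503.16707
Authors: Jinlong Li,Cristiano Saltori,Fabio Poiesi,Nicu Sebe
Affiliations: University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺凯勒基金会)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025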
Abstract:The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature spaces of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities. The code will be available at this https URL (CUA_O3D).
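One common way to balance several distillation targets by uncertainty is to scale each teacher's error by a predicted log-variance and add that log-variance as a penalty. The sketch below shows this generic formulation only; whether CUA-O3D uses this exact form is an assumption, and the teacher names, dictionary layout, and function name are hypothetical.

```python
import math


def uncertainty_weighted_loss(student, teachers, log_vars):
    """Sum over teachers of exp(-s) * mean squared error + s, where s is that teacher's log-variance.

    A large s down-weights a teacher whose features are hard to match,
    while the + s term stops s from growing without bound.
    """
    total = 0.0
    for name, target in teachers.items():
        err = sum((p - t) ** 2 for p, t in zip(student[name], target)) / len(target)
        s = log_vars[name]
        total += math.exp(-s) * err + s
    return total
```

In a trained model the entries of `log_vars` would be predicted by the network alongside its features rather than fixed by hand.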
[CV-107] GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations
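【TL;DR】: This paper addresses the fact that existing geo-foundation models (GeoFMs) rely mainly on overhead remote sensing data and neglect other modalities such as street view imagery. It proposes a method that explicitly models the geospatial relationships between multimodal data to improve generalization across tasks, spatial scales, and temporal contexts. The key contribution is GAIR, a new multimodal GeoFM architecture that integrates overhead remote sensing data, street view images, and their geolocation metadata, and uses three factorized neural encoders to project a street view image, its geolocation, and a remote sensing image into the embedding space. To spatially align street view and remote sensing images, it introduces a novel implicit neural representation (INR) module that learns a continuous remote sensing image representation and queries the embedding at the street view image's geolocation. The geographically aligned embeddings are then trained on unlabeled data with contrastive learning objectives.
Link: https://arxiv.org/abs/2503.16683
Authors: Zeping Liu, Fan Zhang, Junfeng Jiao, Ni Lao, Gengchen Mai
Affiliations: The University of Texas at Austin; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 10 figures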
Abstract:Advancements in vision and language foundation models have inspired the development of geo-foundation models (GeoFMs), enhancing performance across diverse geospatial tasks. However, many existing GeoFMs primarily focus on overhead remote sensing (RS) data while neglecting other data modalities such as ground-level imagery. A key challenge in multimodal GeoFM development is to explicitly model geospatial relationships across modalities, which enables generalizability across tasks, spatial scales, and temporal contexts. To address these limitations, we propose GAIR, a novel multimodal GeoFM architecture integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. We utilize three factorized neural encoders to project an SV image, its geolocation, and an RS image into the embedding space. The SV image needs to be located within the RS image’s spatial footprint but does not need to be at its geographic center. In order to geographically align the SV image and RS image, we propose a novel implicit neural representation (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image’s geolocation. Next, the geographically aligned SV, RS, and location embeddings are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and other strong baselines, highlighting its effectiveness in learning generalizable and transferable geospatial representations.
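The step of looking up an RS embedding at the SV image's geolocation can be illustrated with a much simpler stand-in. The sketch below uses plain bilinear interpolation over a grid of RS patch embeddings; the paper's INR module is learned, so this only shows the idea of a continuous query at an off-centre location. The function name `bilinear_lookup` and the normalized coordinate convention are assumptions.

```python
def bilinear_lookup(grid, x, y):
    """Sample a feature grid at a continuous location.

    grid: H x W nested list of feature vectors (lists of floats)
    x, y: query location in [0, 1] x [0, 1], relative to the RS footprint
    """
    height, width = len(grid), len(grid[0])
    # Map normalized coordinates to continuous grid indices.
    fx, fy = x * (width - 1), y * (height - 1)
    x0, y0 = int(fx), int(fy)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = fx - x0, fy - y0
    out = []
    for c in range(len(grid[0][0])):
        # Interpolate along x on the two neighbouring rows, then along y.
        top = (1 - tx) * grid[y0][x0][c] + tx * grid[y0][x1][c]
        bottom = (1 - tx) * grid[y1][x0][c] + tx * grid[y1][x1][c]
        out.append((1 - ty) * top + ty * bottom)
    return out
```

A learned INR would replace the fixed interpolation weights with a small network conditioned on the query coordinate, but the interface (continuous location in, embedding out) stays the same.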
[CV-108] TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
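【TL;DR】: This paper tackles logical page segmentation in document analysis. By using only information in the image domain, it avoids the need for optical character recognition (OCR), with the goal of enabling better semantic text representation, information retrieval, and understanding. The key is to define the task purely as image-domain segmentation and to propose an evaluation metric that considers only foreground text pixels and ignores background pixels, so that the evaluation is unaffected by geometric variations. The paper also introduces the TextBite dataset and a set of baseline methods combining text region detection with relation prediction, supporting research on logical segmentation of historical Czech documents.
Link: https://arxiv.org/abs/2503.16664
Authors: Martin Kostelník, Karel Beneš, Michal Hradiš
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: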
Abstract:Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at this https URL.
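The foreground-only evaluation idea can be sketched as follows. The abstract does not name the exact metric, so this example assumes a simple intersection-over-union (IoU) between one predicted and one ground-truth segment, counted only over pixels marked as foreground text; `foreground_iou` is a hypothetical helper.

```python
def foreground_iou(pred, truth, foreground):
    """IoU of two segment masks, counted over foreground text pixels only.

    pred, truth, foreground: H x W nested lists of 0/1 values
    """
    intersection = 0
    union = 0
    for pred_row, truth_row, fg_row in zip(pred, truth, foreground):
        for p, t, fg in zip(pred_row, truth_row, fg_row):
            if not fg:
                continue  # background pixels never influence the score
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two segments that both contain no text pixels are treated as a match.
    return intersection / union if union else 1.0
```

Because background pixels are skipped, two polygons that enclose the same text but differ in how much empty margin they include receive the same score.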
[CV-109] When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
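【TL;DR】: This paper addresses the substantial computational cost caused by the large number of visual tokens that vision encoders generate, and asks whether all of these tokens are equally valuable or whether the key features can be kept by filtering, cutting computation without hurting performance. The key is a new method built on the idea that less valuable features can be reconstructed from more valuable ones; it combines an autoencoder with a Gumbel-Softmax selection mechanism to keep only the most informative visual tokens. Experiments show that on OCR tasks more than 50% of the visual context can be removed with minimal performance loss, while on general-domain tasks even randomly retaining only 30% of the tokens achieves performance comparable to the full token set. The results point to a promising direction for adaptive, efficient multimodal pruning that supports scalable, low-overhead inference while maintaining performance.
Link: https://arxiv.org/abs/2503.16660
Authors: Eduard Allakhverdov, Elizaveta Goncharova, Andrey Kuznetsov
Affiliations: AIRI; MIPT
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures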
Abstract:Vision encoders typically generate a large number of visual tokens, providing information-rich representations but significantly increasing computational demands. This raises the question of whether all generated tokens are equally valuable or if some of them can be discarded to reduce computational costs without compromising quality. In this paper, we introduce a new method for determining feature utility based on the idea that less valuable features can be reconstructed from more valuable ones. We implement this concept by integrating an autoencoder with a Gumbel-Softmax selection mechanism, which makes it possible to identify and retain only the most informative visual tokens. To validate our approach, we compared the performance of the LLaVA-NeXT model using features selected by our method against its performance with randomly selected features. We found that on OCR-based tasks, more than 50% of the visual context can be removed with minimal performance loss, whereas randomly discarding the same proportion of features significantly affects the model capabilities. Furthermore, in general-domain tasks, even randomly retaining only 30% of tokens achieves performance comparable to using the full set of visual tokens. Our results highlight a promising direction towards adaptive and efficient multimodal pruning that facilitates scalable and low-overhead inference without compromising performance.
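As a rough illustration of a Gumbel-Softmax selection mechanism, the sketch below draws a relaxed keep/drop decision per token. It assumes a two-class (keep vs. drop) relaxation with one learned logit per token and omits the autoencoder and the straight-through gradient trick used in training; `select_tokens` and `keep_logits` are hypothetical names, not taken from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample over the given logits."""
    # Gumbel(0, 1) noise is -log(-log(U)) with U uniform in (0, 1).
    noisy = [
        (logit - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / temperature
        for logit in logits
    ]
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(tokens, keep_logits, temperature=0.5, rng=random):
    """Keep tokens whose relaxed 'keep' weight exceeds the 'drop' weight."""
    kept = []
    for token, logit in zip(tokens, keep_logits):
        keep_weight, _drop_weight = gumbel_softmax([logit, 0.0], temperature, rng)
        if keep_weight > 0.5:
            kept.append(token)
    return kept
```

Lower temperatures push the relaxed sample closer to a hard one-hot choice; in training, the soft weights keep the selection differentiable.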
[CV-110] iFlame: Interleaving Full and Linear Attention for Efficient Mesh Generation
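【TL;DR】: This paper addresses the scalability of attention-based mesh generation models on high-resolution 3D data. Conventional attention-based models perform very well, but their quadratic computational complexity limits their use on large-scale data; linear attention lowers the computational cost but struggles to capture long-range dependencies, which leads to suboptimal generation results. To resolve this trade-off, the paper proposes an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention. By embedding this framework in an hourglass architecture and adding a caching algorithm that further improves inference efficiency, the method matches the generation performance of pure attention models while significantly reducing training time and KV cache size. The key lies in the design of the interleaving autoregressive framework and its integration with an efficient architecture, which together balance computational efficiency and generation performance.
Link: https://arxiv.org/abs/2503.16653
Authors: Hanxiao Wang, Biao Zhang, Weize Quan, Dong-Ming Yan, Peter Wonka
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL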
Abstract:This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational costs but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention mechanisms. To further enhance efficiency and leverage the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture, which significantly boosts efficiency. Our approach reduces training time while achieving performance comparable to pure attention-based models. To improve inference efficiency, we implemented a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. The training takes only 2 days with 4 GPUs on 39k data with a maximum of 4k faces on Objaverse.
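The efficiency argument for linear attention can be made concrete with a small sketch. It shows that causal linear attention can be computed with a fixed-size running state instead of a cache that grows with sequence length, and checks the result against a pairwise O(n^2) reference. The feature map elu(x) + 1 is a common choice assumed here for illustration; the abstract does not specify which linear attention variant or caching algorithm iFlame uses.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(qs, ks, vs):
    """O(n) causal attention using a running key-value state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for i in range(d_k):
            norm[i] += fk[i]
            for j in range(d_v):
                state[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(fq[i] * state[i][j] for i in range(d_k)) / denom for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(qs, ks, vs):
    """Same attention computed pairwise in O(n^2), for comparison only."""
    outputs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(ks[s]))) for s in range(t + 1)]
        total = sum(weights)
        outputs.append(
            [sum(w * vs[s][j] for s, w in enumerate(weights)) / total
             for j in range(len(vs[0]))]
        )
    return outputs
```

The linear layers only need `state` and `norm`, whose size is independent of sequence length, whereas full attention layers must keep every past key and value. Interleaving the two kinds of layer therefore reduces the total cache, although the exact seven-eighths figure depends on architectural details not given in the abstract.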
[CV-111] TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features
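【TL;DR】: This paper addresses semantic texture transfer from one 3D mesh to another in computer graphics, where existing methods struggle to achieve high-quality transfer while preserving the appearance of the source texture. The key is a new method, TriTex, which learns a volumetric texture field by mapping semantic features to surface colors. Using an efficient triplane architecture, it performs semantic-aware texture transfer to a target mesh, and even when trained on a single example it generalizes effectively to diverse shapes within the same category. Experiments show that, compared with existing methods, TriTex delivers superior texture transfer quality and fast inference on a newly created dataset, providing a practical solution for maintaining visual coherence across related 3D models in applications such as game development and simulation.
Link: https://arxiv.org/abs/2503.16630
Authors: Dana Cohen-Bar, Daniel Cohen-Or, Gal Chechik, Yoni Kasten
Affiliations: Tel Aviv University; NVIDIA
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL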
Abstract:As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh. Despite training on just one example, it generalizes effectively to diverse shapes within the same category. Extensive evaluation on our newly created benchmark dataset shows that TriTex achieves superior texture transfer quality and fast inference times compared to existing methods. Our approach advances single-example texture transfer, providing a practical solution for maintaining visual coherence across related 3D models in applications like game development and simulation.
[CV-112] Utilizing Reinforcement Learning for Bottom-Up part-wise Reconstruction of 2D Wire-Frame Projections
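【TL;DR】: This paper focuses on reconstructing all edges of an arbitrary 3D wire-frame model after projection onto the image plane. To this end, it proposes a bottom-up procedure in which a reinforcement learning (RL) agent segments and reconstructs 2D multipart objects part by part. The environment state is a four-colour image whose colours correspond to the background, the target edge, the reconstruction line, and the overlap of the two. At each step, the agent can transform the reconstruction line within a four-dimensional action space or end the episode with a dedicated termination action. To study the effect of the reward formulation, the authors test episodic rewards, incremental rewards, and a combination of both; the combined approach gives the most effective training performance. To further improve efficiency and stability, they introduce curriculum learning strategies. The first is an action-based curriculum, in which the agent is initially restricted to a reduced action space with only three of the five possible actions before progressing to the full action space. The second is a task-based curriculum, in which the agent first solves a simplified version of the problem before facing the full, more complex task. The second approach produced promising results: the agent not only transitioned successfully from the simplified task to mastering the full task, but also gained significant performance. The study demonstrates the potential of iterative RL-based 2D wire-frame reconstruction; combining optimized reward formulations with curriculum learning significantly improved training success. The proposed framework offers an effective approach to similar tasks and a promising direction for future research.
Link: https://arxiv.org/abs/2503.16629
Authors: Julian Ziegler, Patrick Frenzel, Mirco Fuchs
Affiliations: Leipzig University of Applied Sciences
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to RLDM 2025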
Abstract:This work concerns itself with the task of reconstructing all edges of an arbitrary 3D wire-frame model projected to an image plane. We explore a bottom-up part-wise procedure undertaken by an RL agent to segment and reconstruct these 2D multipart objects. The environment’s state is represented as a four-colour image, where different colours correspond to background, a target edge, a reconstruction line, and the overlap of both. At each step, the agent can transform the reconstruction line within a four-dimensional action space or terminate the episode using a specific termination action. To investigate the impact of reward function formulations, we tested episodic and incremental rewards, as well as combined approaches. Empirical results demonstrated that the latter yielded the most effective training performance. To further enhance efficiency and stability, we introduce curriculum learning strategies. First, an action-based curriculum was implemented, where the agent was initially restricted to a reduced action space, being able to only perform three of the five possible actions, before progressing to the full action space. Second, we test a task-based curriculum, where the agent first solves a simplified version of the problem before being presented with the full, more complex task. This second approach produced promising results, as the agent not only successfully transitioned from learning the simplified task to mastering the full task, but in doing so gained significant performance. This study demonstrates the potential of an iterative RL wire-frame reconstruction in two dimensions. By combining optimized reward function formulations with curriculum learning strategies, we achieved significant improvements in training success. The proposed methodology provides an effective framework for solving similar tasks and represents a promising direction for future research in the field.
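The four-colour state and the combined reward described above can be sketched in a few lines. The integer encoding, the overlap-based reward, and the `bonus` parameter are assumptions made for illustration; the abstract only states that episodic, incremental, and combined rewards were tested.

```python
BACKGROUND, TARGET, LINE, OVERLAP = 0, 1, 2, 3


def render_state(target_mask, line_mask):
    """Encode the target edge and the reconstruction line as a four-valued image."""
    # target only -> 1, line only -> 2, both -> 3, neither -> 0
    return [
        [t + 2 * l for t, l in zip(target_row, line_row)]
        for target_row, line_row in zip(target_mask, line_mask)
    ]


def overlap_ratio(state):
    """Fraction of target pixels currently covered by the reconstruction line."""
    flat = [value for row in state for value in row]
    target_pixels = sum(1 for value in flat if value in (TARGET, OVERLAP))
    covered = sum(1 for value in flat if value == OVERLAP)
    return covered / target_pixels if target_pixels else 0.0


def combined_reward(prev_state, state, done, bonus=1.0):
    """Incremental reward at every step plus an episodic bonus at termination."""
    reward = overlap_ratio(state) - overlap_ratio(prev_state)
    if done:
        reward += bonus * overlap_ratio(state)
    return reward
```

The incremental term gives dense feedback after every transformation of the line, while the episodic term rewards the final quality only when the agent chooses the termination action.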
[CV-113] MobilePlantViT: A Mobile-friendly Hybrid ViT for Generalized Plant Disease Image Classification
【TL;DR】: This paper addresses the high computational demands and resource constraints of deploying deep learning models for plant disease classification on mobile and edge devices, with the aim of making smart agriculture systems more widely accessible. The key solution is MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture that optimizes resource efficiency while maintaining high performance, enabling lightweight and accurate plant disease classification. Experiments show that the model performs well on datasets of varying scales, with test accuracies ranging from 80% to over 99%, and that with only 0.69 million parameters it outperforms the smallest versions of MobileViTv1 and MobileViTv2, which have more parameters. The results highlight the method's potential for real-world use in sustainable, resource-efficient smart agriculture systems.
Link: https://arxiv.org/abs/2503.16628
Authors: Moshiur Rahman Tonmoy, Md. Mithun Hossain, Nilanjan Dey, M. F. Mridha
Affiliations: Department of Computer Science and Engineering, University of Asia Pacific, Dhaka 1205, Bangladesh; Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh; Department of Computer Science and Engineering, Techno International New Town, Kolkata 700156, India; Department of Computer Science, American International University-Bangladesh, Dhaka 1229, Bangladesh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to a journal for peer-review under IEEE Transactions series
Abstract:Plant diseases significantly threaten global food security by reducing crop yields and undermining agricultural sustainability. AI-driven automated classification has emerged as a promising solution, with deep learning models demonstrating impressive performance in plant disease identification. However, deploying these models on mobile and edge devices remains challenging due to high computational demands and resource constraints, highlighting the need for lightweight, accurate solutions for accessible smart agriculture systems. To address this, we propose MobilePlantViT, a novel hybrid Vision Transformer (ViT) architecture designed for generalized plant disease classification, which optimizes resource efficiency while maintaining high performance. Extensive experiments across diverse plant disease datasets of varying scales show our model’s effectiveness and strong generalizability, achieving test accuracies ranging from 80% to over 99%. Notably, with only 0.69 million parameters, our architecture outperforms the smallest versions of MobileViTv1 and MobileViTv2, despite their higher parameter counts. These results underscore the potential of our approach for real-world, AI-powered automated plant disease classification in sustainable and resource-efficient smart agriculture systems. All code will be available in the GitHub repository: this https URL
[CV-114] Progressive Test Time Energy Adaptation for Medical Image Segmentation
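【TL;DR】: This paper addresses distribution shift in medical image segmentation caused by inconsistent imaging protocols and patient variation, with the goal of maintaining model performance across diverse medical datasets. The key is a model-agnostic, progressive test-time energy adaptation method. Unlike conventional domain adaptation methods, which require multiple passes over the target data, it lets a pretrained model adapt progressively as it processes test data. At its core is a shape energy model trained on source data that assigns patch-level energy scores to segmentation maps; minimizing the energy score at test time adjusts the segmentation model to match the target distribution. This mechanism effectively improves performance on cross-dataset tasks.
Link: https://arxiv.org/abs/2503.16616
Authors: Xiaoran Zhang, Byung-Woo Hong, Hyoungseob Park, Daniel H. Pak, Anne-Marie Rickmann, Lawrence H. Staib, James S. Duncan, Alex Wong
Affiliations: Yale University; Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: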
Abstract:We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. Maintaining model performance across diverse medical datasets is challenging, as distribution shifts arise from inconsistent imaging protocols and patient variations. Unlike domain adaptation methods that require multiple passes through target data - impractical in clinical settings - our approach adapts pretrained models progressively as they process test data. Our method leverages a shape energy model trained on source data, which assigns an energy score at the patch level to segmentation maps: low energy represents in-distribution (accurate) shapes, while high energy signals out-of-distribution (erroneous) predictions. By minimizing this energy score at test time, we refine the segmentation model to align with the target distribution. To validate the effectiveness and adaptability, we evaluated our framework on eight public MRI (bSSFP, T1- and T2-weighted) and X-ray datasets spanning cardiac, spinal cord, and lung segmentation. We consistently outperform baselines both quantitatively and qualitatively.
[CV-115] A Recipe for Generating 3D Worlds From a Single Image
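【TL;DR】: This paper addresses the generation of immersive 3D worlds from a single image. The core challenge is building high-quality 3D environments from existing 2D generative models without extensive training. The key is to treat the task as an in-context learning problem and solve it in two steps: first, a pretrained diffusion model generates coherent panoramas; second, a metric depth estimator lifts the panoramas into 3D. In addition, an inpainting model, fine-tuned with conditioning on rendered point clouds, fills in unobserved regions to further improve the result. Experiments show that on both synthetic and real images the method produces high-quality 3D environments suitable for virtual reality (VR) display, and that it consistently outperforms existing video synthesis-based techniques on several quantitative image quality metrics.
Link: https://arxiv.org/abs/2503.16611
Authors: Katja Schwarz, Denys Rozumnyi, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder
Affiliations: Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: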
Abstract:We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics. Project Page: this https URL
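The second step, lifting a panorama into 3D with metric depth, comes down to turning each pixel into a viewing ray and scaling it by the estimated distance. The sketch below assumes an equirectangular panorama and a particular axis convention (+z forward, +x right, +y up), and treats depth as distance along the ray; these conventions and the function name are illustrative assumptions, not details from the paper.

```python
import math


def panorama_pixel_to_point(u, v, width, height, depth):
    """Lift an equirectangular pixel with metric depth to a 3D point.

    u grows to the right and v grows downward; depth is measured along the ray.
    """
    # Longitude spans [-pi, pi) across the width, latitude [-pi/2, pi/2] over the height.
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    # Unit viewing direction on the sphere.
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (depth * x, depth * y, depth * z)
```

Applying this to every pixel yields a point cloud that can be rendered from new viewpoints; the holes that appear in such renderings are the unobserved regions the inpainting model is asked to fill.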
[CV-116] UniK3D: Universal Camera Monocular 3D Estimation
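【TL;DR】: This paper addresses the limitations of monocular 3D estimation under non-ideal imaging conditions. Existing methods rely on simplifying assumptions, such as pinhole camera models or rectified images, which leads to poor performance and substantial context loss in real-world scenarios, particularly with fisheye or panoramic cameras. To address this, the paper proposes UniK3D, a generalizable method for monocular 3D estimation with arbitrary camera models. Its key innovation is a spherical 3D representation that disentangles camera geometry from scene geometry, together with a model-independent representation of the pencil of rays obtained through a learned superposition of spherical harmonics. The paper also designs an angular loss that prevents the 3D outputs of wide-view cameras from contracting. Together, these components give UniK3D strong performance across many challenging scenarios, including large-field-of-view and panoramic settings, while retaining high accuracy in the conventional pinhole, small-field-of-view domain.
Link: https://arxiv.org/abs/2503.16591
Authors: Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, Luc Van Gool
Affiliations: ETH Zürich; Toyota Motor Europe; INSAIT, Sofia University St. Kliment Ohridski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: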
Abstract:Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at this http URL .
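A short sketch can show why a spherical representation suits wide-view cameras. It assumes points are parameterized as (azimuth, polar angle from the optical axis, radius), which is one natural reading of "spherical 3D representation" but is not confirmed by the abstract. Unlike a depth-along-z parameterization, this one handles rays that point sideways or backwards, as with fisheye lenses beyond 180 degrees or panoramas.

```python
import math


def to_spherical(x, y, z):
    """Cartesian point -> (azimuth, polar angle from the +z optical axis, radius)."""
    radius = math.sqrt(x * x + y * y + z * z)
    polar = math.acos(z / radius)
    azimuth = math.atan2(y, x)
    return azimuth, polar, radius


def from_spherical(azimuth, polar, radius):
    """Inverse of to_spherical: the ray direction scaled by the radial distance."""
    sin_polar = math.sin(polar)
    return (
        radius * sin_polar * math.cos(azimuth),
        radius * sin_polar * math.sin(azimuth),
        radius * math.cos(polar),
    )
```

In this form the two angles describe the camera ray and the radius describes the scene, so the two can be predicted by separate components and recombined for any camera model.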
[CV-117] World Knowledge from AI Image Generation for Robot Control
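【TL;DR】: This paper addresses how robots can make reasonable decisions when given under-specified tasks, that is, how to choose a suitable course of action when there is no clearly right or wrong answer. The key idea is to use the knowledge about the real world that modern generative AI systems implicitly hold, as shown by their ability to generate realistic images: object configurations in generated real-world scenes serve as guidance that helps the robot understand which arrangements are meaningful or match human habits, and thus cope with under-specified tasks.
Link: https://arxiv.org/abs/2503.16579
Authors: Jonas Krumme, Christoph Zetzsche
Affiliations: University of Bremen
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 10 figures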
Abstract:When interacting with the world robots face a number of difficult questions, having to make decisions when given under-specified tasks where they need to make choices, often without clearly defined right and wrong answers. Humans, on the other hand, can often rely on their knowledge and experience to fill in the gaps. For example, the simple task of organizing newly bought produce into the fridge involves deciding where to put each thing individually, how to arrange them together meaningfully, e.g. putting related things together, all while there is no clear right and wrong way to accomplish this task. We could encode all this information on how to do such things explicitly into the robots’ knowledge base, but this can quickly become overwhelming, considering the number of potential tasks and circumstances the robot could encounter. However, images of the real world often implicitly encode answers to such questions and can show which configurations of objects are meaningful or are usually used by humans. An image of a full fridge can give a lot of information about how things are usually arranged in relation to each other and the full fridge at large. Modern generative systems are capable of generating plausible images of the real world and can be conditioned on the environment in which the robot operates. Here we investigate the idea of using the implicit knowledge about the world of modern generative AI systems given by their ability to generate convincing images of the real world to solve under-specified tasks.
[CV-118] REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
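【TL;DR】: This paper addresses the lack of breadth and depth in existing benchmarks for evaluating Large Vision-Language Models (LVLMs). Existing benchmarks usually focus on specific aspects, such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, but cannot provide a holistic understanding of the strengths and limitations of LVLMs. To fill this gap, the paper proposes REVAL, a comprehensive benchmark for evaluating the reliability and values of LVLMs. REVAL contains over 144K image-text Visual Question Answering (VQA) samples in two main sections. The Reliability section covers truthfulness (such as perceptual accuracy and hallucination tendencies) and robustness (such as resilience to adversarial attacks, typographic attacks, and image corruption). The Values section covers ethical concerns (such as bias and moral understanding), safety issues (such as toxicity and jailbreak vulnerabilities), and privacy problems (such as privacy awareness and privacy leakage). The key is that REVAL covers these dimensions systematically, giving researchers a solid framework for assessing and comparing LVLMs and advancing the field.
Link: https://arxiv.org/abs/2503.16566
Authors: Jie Zhang, Zheng Yuan, Zhongqi Wang, Bei Yan, Sibo Wang, Xiangkui Cao, Zonghui Guo, Shiguang Shan, Xilin Chen
Affiliations: ICT (Institute of Computing Technology, Chinese Academy of Sciences); VIPL (Visual Information Processing and Learning Lab, Institute of Computing Technology, Chinese Academy of Sciences)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 45 pages, 5 figures, 18 tables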
Abstract:The rapid evolution of Large Vision-Language Models (LVLMs) has highlighted the necessity for comprehensive evaluation frameworks that assess these models across diverse dimensions. While existing benchmarks focus on specific aspects such as perceptual abilities, cognitive capabilities, and safety against adversarial attacks, they often lack the breadth and depth required to provide a holistic understanding of LVLMs’ strengths and limitations. To address this gap, we introduce REVAL, a comprehensive benchmark designed to evaluate the REliability and VALue of LVLMs. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability, which assesses truthfulness (e.g., perceptual accuracy and hallucination tendencies) and robustness (e.g., resilience to adversarial attacks, typographic attacks, and image corruption), and Values, which evaluates ethical concerns (e.g., bias and moral understanding), safety issues (e.g., toxicity and jailbreak vulnerabilities), and privacy problems (e.g., privacy awareness and privacy leakage). We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro. Our findings reveal that while current LVLMs excel in perceptual tasks and toxicity avoidance, they exhibit significant vulnerabilities in adversarial scenarios, privacy preservation, and ethical reasoning. These insights underscore critical areas for future improvements, guiding the development of more secure, reliable, and ethically aligned LVLMs. REVAL provides a robust framework for researchers to systematically assess and compare LVLMs, fostering advancements in the field.
[CV-119] MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems
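【TL;DR】: This paper addresses the shortcomings of Multimodal Large Language Models (MLLMs) in visual mathematical problem solving, in particular their limited ability to accurately extract and interpret information from diagrams. The paper hypothesizes that the perceptual ability to extract meaningful information from diagrams is crucial because it directly affects the subsequent inference process. To test this hypothesis, the authors build FlowVerse, a comprehensive benchmark that divides all information used during problem solving into four components and combines them into six problem versions for evaluation. Preliminary results show that existing MLLMs have substantial limitations in extracting essential information from diagrams, deriving reasoned properties, and performing complex reasoning based on these visual inputs.
In response, the paper proposes MathFlow, a modular problem-solving pipeline that separates perception and inference into distinct stages so that each can be optimized independently. Given the perceptual limitations of current MLLMs, the authors train MathFlow-P-7B as a dedicated perception model. Experiments show that MathFlow-P-7B brings substantial performance gains when combined with a range of closed-source and open-source inference models, demonstrating the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks.
The key to the solution is decoupling perception from inference through the MathFlow pipeline and developing MathFlow-P-7B as a dedicated perception model to compensate for the weaknesses of existing MLLMs in this area.
Link: https://arxiv.org/abs/2503.16549
Authors: Felix Chen, Hangjie Yuan, Yunqiu Xu, Tao Feng, Jun Cen, Pengwei Liu, Zeying Huang, Yi Yang
Affiliations: Zhejiang University; Tsinghua University; Intelligent Learning; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL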
Abstract:Despite impressive performance across diverse tasks, Multimodal Large Language Models (MLLMs) have yet to fully demonstrate their potential in visual mathematical problem-solving, particularly in accurately perceiving and interpreting diagrams. Inspired by typical processes of humans, we hypothesize that the perceptual capability to extract meaningful information from diagrams is crucial, as it directly impacts subsequent inference processes. To validate this hypothesis, we developed FlowVerse, a comprehensive benchmark that categorizes all information used during problem-solving into four components, which are then combined into six problem versions for evaluation. Our preliminary results on FlowVerse reveal that existing MLLMs exhibit substantial limitations when extracting essential information and reasoned properties from diagrams and performing complex reasoning based on these visual inputs. In response, we introduce MathFlow, a modular problem-solving pipeline that decouples perception and inference into distinct stages, thereby optimizing each independently. Given the perceptual limitations observed in current MLLMs, we trained MathFlow-P-7B as a dedicated perception model. Experimental results indicate that MathFlow-P-7B yields substantial performance gains when integrated with various closed-source and open-source inference models. This demonstrates the effectiveness of the MathFlow pipeline and its compatibility with diverse inference frameworks. The FlowVerse benchmark and code are available at this https URL.
[CV-120] A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions
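【TL;DR】: This paper systematically reviews the evolution of deep Convolutional Neural Networks (CNNs) from 2015 to 2025, summarizing architectural innovations, expanding applications, and future directions. Through a systematic taxonomy and a comprehensive review, it consolidates CNN results across domains such as computer vision, natural language processing, and medical diagnosis, and identifies the challenges and opportunities ahead. The key contribution is a unified taxonomy that classifies CNN architectures by spatial exploitation, multi-path structures, depth, width, dimensionality expansion, channel boosting, and attention mechanisms. The survey also highlights efficient preprocessing strategies (such as Fourier transforms, low-precision computation, and weight compression) that optimize inference speed, and discusses emerging learning paradigms (such as few-shot, zero-shot, weakly supervised, and federated learning frameworks). Finally, it looks ahead to future research directions including hybrid CNN-transformer models, cross-modal fusion (such as vision-language integration), and generative learning.
Link: https://arxiv.org/abs/2503.16546
Authors: Saddam Hussain Khan, Rashid Iqbal (Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan)
Affiliations: Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 100 Pages, 44 Figures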
Abstract:Deep Convolutional Neural Networks (CNNs) have significantly advanced deep learning, driving breakthroughs in computer vision, natural language processing, medical diagnosis, object detection, and speech recognition. Architectural innovations including 1D, 2D, and 3D convolutional models, dilated and grouped convolutions, depthwise separable convolutions, and attention mechanisms address domain-specific challenges and enhance feature representation and computational efficiency. Structural refinements such as spatial-channel exploitation, multi-path design, and feature-map enhancement contribute to robust hierarchical feature extraction and improved generalization, particularly through transfer learning. Efficient preprocessing strategies, including Fourier transforms, structured transforms, low-precision computation, and weight compression, optimize inference speed and facilitate deployment in resource-constrained environments. This survey presents a unified taxonomy that classifies CNN architectures based on spatial exploitation, multi-path structures, depth, width, dimensionality expansion, channel boosting, and attention mechanisms. It systematically reviews CNN applications in face recognition, pose estimation, action recognition, text classification, statistical language modeling, disease diagnosis, radiological analysis, cryptocurrency sentiment prediction, 1D data processing, video analysis, and speech recognition. In addition to consolidating architectural advancements, the review highlights emerging learning paradigms such as few-shot, zero-shot, weakly supervised, and federated learning frameworks. Future research directions include hybrid CNN-transformer models, vision-language integration, and generative learning. This review provides a comprehensive perspective on the evolution of CNNs from 2015 to 2025, outlining key innovations, challenges, and opportunities.
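One of the efficiency techniques mentioned above, the depthwise separable convolution, is easy to quantify. The sketch below compares weight counts for a standard convolution and its depthwise separable counterpart, ignoring biases; the helper names and the example layer sizes are chosen here for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """One kernel x kernel filter per input channel, then 1 x 1 pointwise mixing."""
    return kernel * kernel * c_in + c_in * c_out


# Example: a 3 x 3 layer mapping 64 channels to 128 channels.
standard = standard_conv_params(3, 64, 128)         # 73728 weights
separable = depthwise_separable_params(3, 64, 128)  # 8768 weights
```

For this layer the separable version needs roughly 8.4 times fewer weights, which is why the technique appears so often in mobile-oriented CNN designs.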
[CV-121] Defending Against Gradient Inversion Attacks for Biomedical Images via Learnable Data Perturbation
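【TL;DR】: This paper addresses privacy leakage caused by gradient inversion attacks in federated learning (FL), especially in healthcare data sharing, where existing defenses lack generalizability and have not been adequately validated for real-world healthcare systems. The key is a new defense based on latent data perturbation and minimax optimization, validated on both general and medical image datasets. Compared with two baselines, the method reduces the attacker's accuracy in classifying reconstructed images by 12.5% and increases the Mean Squared Error (MSE) between original and reconstructed images by over 12.4%, while maintaining about 90% client classification accuracy, which suggests that the defense is generalizable and practical.
Link: https://arxiv.org/abs/2503.16542
Authors: Shiyi Jiang, Farshad Firouzi, Krishnendu Chakrabarty
Affiliations: Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA; School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: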
Abstract:The increasing need for sharing healthcare data and collaborating on clinical research has raised privacy concerns. Health information leakage due to malicious attacks can lead to serious problems such as misdiagnoses and patient identification issues. Privacy-preserving machine learning (PPML) and privacy-enhancing technologies, particularly federated learning (FL), have emerged in recent years as innovative solutions to balance privacy protection with data utility; however, they also suffer from inherent privacy vulnerabilities. Gradient inversion attacks constitute major threats to data sharing in federated learning. Researchers have proposed many defenses against gradient inversion attacks. However, current defense methods for healthcare data lack generalizability, i.e., existing solutions may not be applicable to data from a broader range of populations. In addition, most existing defense methods are tested using non-healthcare data, which raises concerns about their applicability to real-world healthcare systems. In this study, we present a defense against gradient inversion attacks in federated learning. We achieve this using latent data perturbation and minimax optimization, utilizing both general and medical image datasets. Our method is compared to two baselines, and the results show that our approach can outperform the baselines with a reduction of 12.5% in the attacker’s accuracy in classifying reconstructed images. The proposed method also yields an increase of over 12.4% in Mean Squared Error (MSE) between the original and reconstructed images at the same level of model utility of around 90% client classification accuracy. The results suggest the potential of a generalizable defense for healthcare data.
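[CV-122] Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking
[Quick read]: This paper addresses how to make effective use of vision-language models (VLMs) in open-vocabulary detection (OVD), instance segmentation, and tracking. The key idea is to combine VLM-generated structured descriptions with established OVD, instance segmentation, and tracking methods. The descriptions are used to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector that extracts the corresponding bounding boxes. These boxes are passed to a video segmentation model that provides precise segmentation masks and tracking. Once initialized, the video segmentation model extracts masks directly, so image streams can be processed in real time with minimal computational overhead, and tracks can be updated online by generating new structured descriptions and detections. The approach combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation.
Link: https://arxiv.org/abs/2503.16538
Authors: Bastian Pätzold,Jan Nogga,Sven Behnke
Affiliations: Autonomous Intelligent Systems group, University of Bonn, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)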
Abstract:This paper introduces a novel approach that leverages the capabilities of vision-language models (VLMs) by integrating them with established approaches for open-vocabulary detection (OVD), instance segmentation, and tracking. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing precise segmentation masks and tracking capabilities. Once initialized, this model can then directly extract segmentation masks, allowing processing of image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and corresponding open-vocabulary detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments.
[CV-123] Vision-Language Embodiment for Monocular Depth Estimation
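[Quick read]: This paper addresses the inherent uncertainty of 3D reconstruction from a single image, focusing on depth estimation, a core problem in robotic perception and vision. Existing depth estimation models rely mainly on inter-image relationships for supervised training and tend to overlook the intrinsic information provided by the camera itself. The key contribution is to embody the camera model and its physical characteristics in a deep learning model and to compute embodied scene depth through real-time interaction with road environments. Using only the intrinsic properties of the camera and no additional equipment, the model computes embodied scene depth in real time and combines it with RGB image features to capture both geometric and visual detail. Text descriptions containing environmental content and depth information are added as priors for scene understanding, enriching the model's perception of objects. Combining the image and language modalities exploits their complementary strengths for monocular depth estimation. Experiments show that the embodied depth estimation method improves model performance across different scenes.
Link: https://arxiv.org/abs/2503.16535
Authors: Jinchang Zhang,Guoyu Lu
Affiliations: Intelligent Vision and Sensing Lab; University of Georgia; Binghamton University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: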
Abstract:Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image presents inherent uncertainties. Current depth estimation models primarily rely on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics into a deep learning model, computing embodied scene depth through real-time interactions with road environments. The model can calculate embodied scene depth in real-time based on immediate environmental changes using only the intrinsic properties of the camera, without any additional equipment. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model’s perception of objects. This integration of image and language - two inherently ambiguous modalities - leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that the model can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.
[CV-124] Adams Bashforth Moulton Solver for Inversion and Editing in Rectified Flow
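[Quick read]: This paper addresses the trade-off that existing numerical solvers for rectified flow models face between fast sampling and high-accuracy solutions, which limits their effectiveness in downstream applications such as image reconstruction and editing. The key contribution is ABM-Solver, based on the Adams-Bashforth-Moulton (ABM) predictor-corrector method. It uses a multi-step predictor-corrector scheme to reduce local truncation error and Adaptive Step Size Adjustment to improve sampling speed. To preserve non-edited regions while still allowing semantic modifications, the paper also introduces a Mask Guided Feature Injection module, which estimates self-similarity to generate a spatial mask separating preserved regions from editable ones. Experiments on multiple high-resolution image datasets show that ABM-Solver improves inversion precision and editing quality over existing solvers without additional training or optimization.
Link: https://arxiv.org/abs/2503.16522
Authors: Yongjia Ma,Donglin Di,Xuan Liu,Xiaokai Chen,Lei Fan,Wei Chen,Tonghua Su
Affiliations: Li Auto; Harbin Institute of Technology; University of New South Wales
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: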
Abstract:Rectified flow models have achieved remarkable performance in image and video generation tasks. However, existing numerical solvers face a trade-off between fast sampling and high-accuracy solutions, limiting their effectiveness in downstream applications such as reconstruction and editing. To address this challenge, we propose leveraging the Adams-Bashforth-Moulton (ABM) predictor-corrector method to enhance the accuracy of ODE solving in rectified flow models. Specifically, we introduce ABM-Solver, which integrates a multi step predictor corrector approach to reduce local truncation errors and employs Adaptive Step Size Adjustment to improve sampling speed. Furthermore, to effectively preserve non edited regions while facilitating semantic modifications, we introduce a Mask Guided Feature Injection module. We estimate self-similarity to generate a spatial mask that differentiates preserved regions from those available for editing. Extensive experiments on multiple high-resolution image datasets validate that ABM-Solver significantly improves inversion precision and editing quality, outperforming existing solvers without requiring additional training or optimization.
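The predictor-corrector idea behind ABM-Solver can be illustrated on a scalar ODE. The sketch below is a generic second-order Adams-Bashforth-Moulton scheme with a fixed step size, not the authors' solver: the function name `abm2_solve`, the Heun bootstrap for the first step, and the test equation dy/dt = -y are assumptions chosen for illustration, and the adaptive step size and feature injection described in the abstract are omitted.

```python
import math


def abm2_solve(f, t0, y0, t_end, n_steps):
    """Integrate dy/dt = f(t, y) from t0 to t_end with a 2-step ABM scheme.

    Hypothetical helper for illustration; requires n_steps >= 1.
    """
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    f_prev = f(t, y)

    # A multi-step method needs one earlier slope, so the first step is
    # bootstrapped with Heun's method (explicit Euler predictor, trapezoid corrector).
    y_pred = y + h * f_prev
    y = y + 0.5 * h * (f_prev + f(t + h, y_pred))
    t += h

    for _ in range(n_steps - 1):
        f_curr = f(t, y)
        # Predictor: 2-step Adams-Bashforth extrapolation from the two latest slopes.
        y_pred = y + 0.5 * h * (3.0 * f_curr - f_prev)
        # Corrector: Adams-Moulton (trapezoidal) update using the predicted slope.
        y = y + 0.5 * h * (f_curr + f(t + h, y_pred))
        f_prev = f_curr
        t += h
    return y


# Usage: decay equation dy/dt = -y, whose exact solution at t = 1 is exp(-1).
approx = abm2_solve(lambda t, y: -y, 0.0, 1.0, 1.0, 100)
print(abs(approx - math.exp(-1.0)))
```

The corrector reuses the slope evaluated at the predicted point, so each step costs two function evaluations while keeping second-order accuracy.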
[CV-125] VocalEyes: Enhancing Environmental Perception for the Visually Impaired through Vision-Language Models and Distance-Aware Object Detection
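[Quick read]: This paper addresses the independent mobility and environmental perception needs of visually impaired people with a real-time system that provides audio descriptions of the user's surroundings to improve situational awareness. The key component is a quantized and fine-tuned Florence-2 model, reduced to 4-bit precision so that it runs efficiently on low-power edge devices such as the NVIDIA Jetson Orin Nano. The system converts the video signal into frames and, with a 5-frame latency, produces contextually relevant descriptions of objects, pedestrians, and barriers together with their estimated distances. Audio feedback is generated by Parler TTS Mini, a lightweight Text-to-Speech (TTS) module that supports 34 speaker types and customization of tone, pace, and style. The paper argues that combining a compact model architecture with a flexible TTS component improves real-time performance and user experience.
Link: https://arxiv.org/abs/2503.16488
Authors: Kunal Chavan,Keertan Balaji,Spoorti Barigidad,Samba Raju Chiluveru
Affiliations: School of Electronics Engineering, Vellore Institute of Technology; Electrical Electronics and Communication Engineering, IIT Dharwad; Department of Computer Science and Engineering, S.R.M Institute of Technology
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: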
Abstract:With an increasing demand for assistive technologies that promote the independence and mobility of visually impaired people, this study suggests an innovative real-time system that gives audio descriptions of a user’s surroundings to improve situational awareness. The system acquires live video input and processes it with a quantized and fine-tuned Florence-2 big model, adjusted to 4-bit accuracy for efficient operation on low-power edge devices such as the NVIDIA Jetson Orin Nano. By transforming the video signal into frames with a 5-frame latency, the model provides rapid and contextually pertinent descriptions of objects, pedestrians, and barriers, together with their estimated distances. The system employs Parler TTS Mini, a lightweight and adaptable Text-to-Speech (TTS) solution, for efficient audio feedback. It accommodates 34 distinct speaker types and enables customization of speech tone, pace, and style to suit user requirements. This study examines the quantization and fine-tuning techniques utilized to modify the Florence-2 model for this application, illustrating how the integration of a compact model architecture with a versatile TTS component improves real-time performance and user experience. The proposed system is assessed based on its accuracy, efficiency, and usefulness, providing a viable option to aid vision-impaired users in navigating their surroundings securely and successfully.
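The abstract does not state which 4-bit scheme is used. As a rough illustration of what 4-bit precision means for model weights, the sketch below applies symmetric per-tensor uniform quantization to a list of floats. The function names and the choice of scheme are assumptions made for this example, not details from the paper.

```python
def quantize_4bit(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] plus one shared scale.

    Hypothetical helper: symmetric, per-tensor, round-to-nearest.
    """
    max_abs = max(abs(w) for w in weights)
    # Use 7 (not 8) so that +max_abs and -max_abs are both representable.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


# Usage with made-up weights: each value is stored in 4 bits instead of 32.
weights = [0.62, -0.31, 0.05, -0.7, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
print(codes, scale, restored)
```

With round-to-nearest, each restored weight differs from the original by at most half the scale, which is the accuracy cost traded for the smaller memory footprint.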
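[CV-126] Inclusive STEAM Education: A Framework for Teaching Coding and Robotics to Students with Visually Impairment Using Advanced Computer Vision
[Quick read]: This paper addresses the significant challenges that students with visual impairments (VI) face in programming and robotics, particularly in tracking robot movements and developing spatial awareness. The proposed framework uses pre-constructed robots and algorithms, such as maze-solving techniques, within an accessible learning environment. Its core is the combination of Contrastive Language-Image Pre-training (CLIP) with an Audio Virtual Reality (AVR) system: maze layouts captured by a global camera are converted into textual descriptions that generate spatial audio prompts. Students issue verbal commands, which are refined through CLIP, while robot-mounted stereo cameras provide real-time data processed via Simultaneous Localization and Mapping (SLAM) for continuous feedback. Beyond maze solving, the authors present the approach as evidence of the broader potential of computer vision in special education and of improved accessibility in STEAM disciplines.
Link: https://arxiv.org/abs/2503.16482
Authors: Mahmoud Hamash,Md Raqib Khan,Peter Tiernan
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 14 pages, 2 figures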
Abstract:STEAM education integrates Science, Technology, Engineering, Arts, and Mathematics to foster creativity and problem-solving. However, students with visual impairments (VI) encounter significant challenges in programming and robotics, particularly in tracking robot movements and developing spatial awareness. This paper presents a framework that leverages pre-constructed robots and algorithms, such as maze-solving techniques, within an accessible learning environment. The proposed system employs Contrastive Language-Image Pre-training (CLIP) to process global camera-captured maze layouts, converting visual data into textual descriptions that generate spatial audio prompts in an Audio Virtual Reality (AVR) system. Students issue verbal commands, which are refined through CLIP, while robot-mounted stereo cameras provide real-time data processed via Simultaneous Localization and Mapping (SLAM) for continuous feedback. By integrating these technologies, the framework empowers VI students to develop coding skills and engage in complex problem-solving tasks. Beyond maze-solving applications, this approach demonstrates the broader potential of computer vision in special education, contributing to improved accessibility and learning experiences in STEAM disciplines.
[CV-127] CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance
[Quick read]: This paper addresses the performance bottleneck of lightweight vision-language models in resource-constrained scenarios, in particular their suboptimal results when trained with only a single image-text contrastive objective. It proposes CLIP-PING (Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance), a training paradigm that bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance from nearest-neighbor (NN) and cross nearest-neighbor (XNN) samples. The extra contrastive supervision from these neighbors improves cross-modal feature alignment. With low computational overhead and lower data demands, lightweight models learn more generic, semantically rich features and perform better on zero-shot generalization and cross-modal retrieval tasks.
Link: https://arxiv.org/abs/2412.03871
Authors: Chu Myaet Thwal,Ye Lin Tun,Minh N. H. Nguyen,Eui-Nam Huh,Choong Seon Hong
Affiliations: Department of Computer Science and Engineering, Kyung Hee University; Digital Science and Technology Institute, The University of Danang—Vietnam-Korea University of Information and Communication Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 14 pages, 5 figures, 24 tables
Abstract:Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a novel yet simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, a 5.5% gain on zero-shot ImageNet1K classification with 10.7% (I2T) and 5.7% (T2I) on Flickr30K retrieval, compared to the original CLIP when using ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases a strong transferability under the linear evaluation protocol across several downstream tasks.
[CV-128] OnDev-LCT: On-Device Lightweight Convolutional Transformers towards federated learning
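[Quick read]: This paper addresses the difficulty of deploying vision models on resource-constrained edge devices in federated learning (FL). Vision Transformer (ViT) variants are promising alternatives to convolutional neural networks (CNNs) in centralized training, but their size and computational demands hinder deployment on client devices with limited compute and communication bandwidth, where models must also cope with non-IID data. The proposed OnDev-LCT (Lightweight Convolutional Transformers for On-Device vision tasks) introduces image-specific inductive biases through the LCT tokenizer, which uses efficient depthwise separable convolutions in residual linear bottleneck blocks to extract local features, while multi-head self-attention (MHSA) in the LCT encoder implicitly captures global representations. According to the abstract, the models outperform existing lightweight vision models on benchmark image datasets with fewer parameters and lower computational demands.
Link: https://arxiv.org/abs/2401.11652
Authors: Chu Myaet Thwal,Minh N.H. Nguyen,Ye Lin Tun,Seong Tae Kim,My T. Thai,Choong Seon Hong
Affiliations: Kangwon National University; Vietnam National University, Danang; University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Published in Neural Networks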
Abstract:Federated learning (FL) has emerged as a promising approach to collaboratively train machine learning models across multiple edge devices while preserving privacy. The success of FL hinges on the efficiency of participating models and their ability to handle the unique challenges of distributed learning. While several variants of Vision Transformer (ViT) have shown great potential as alternatives to modern convolutional neural networks (CNNs) for centralized training, the unprecedented size and higher computational demands hinder their deployment on resource-constrained edge devices, challenging their widespread application in FL. Since client devices in FL typically have limited computing resources and communication bandwidth, models intended for such devices must strike a balance between model size, computational efficiency, and the ability to adapt to the diverse and non-IID data distributions encountered in FL. To address these challenges, we propose OnDev-LCT: Lightweight Convolutional Transformers for On-Device vision tasks with limited training data and resources. Our models incorporate image-specific inductive biases through the LCT tokenizer by leveraging efficient depthwise separable convolutions in residual linear bottleneck blocks to extract local features, while the multi-head self-attention (MHSA) mechanism in the LCT encoder implicitly facilitates capturing global representations of images. Extensive experiments on benchmark image datasets indicate that our models outperform existing lightweight vision models while having fewer parameters and lower computational demands, making them suitable for FL scenarios with data heterogeneity and communication bottlenecks.
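The parameter savings from depthwise separable convolutions, which the abstract credits for efficient local feature extraction, follow from simple counting. The sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. Bias terms are ignored, and the layer sizes are arbitrary illustrative values rather than the OnDev-LCT configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out            # 1x1 conv that mixes channels
    return depthwise + pointwise


# Usage: a 3x3 layer mapping 64 channels to 128 channels.
print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
```

For this layer the separable form uses roughly 8.4 times fewer weights, which is why the design is attractive when model size and communication cost both matter.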
[CV-129] A Topological Data Analysis Framework for Quantifying Necrosis in Glioblastomas
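[Quick read]: This paper addresses how to characterize and distinguish types of glioblastoma (GB) by quantifying the geometry of necrotic regions in tumors. The key contribution is a shape descriptor based on Topological Data Analysis (TDA), called the "interior function", from which the authors define a new index, subcomplex lacunarity, that quantifies geometric characteristics of necrosis such as conglomeration. On this basis they propose a set of indices for analyzing necrotic morphology and construct a diagram that captures the distinct structural and geometric properties of necrotic regions. Applied to MRIs of glioblastomas, the framework combined with cluster analysis identifies four distinct subtypes that reflect geometric properties of the necrotic regions.
Link: https://arxiv.org/abs/2503.17331
Authors: Francisco Tellez,Enrique Torres-Giese
Affiliations: Universidad de Los Andes, Bogotá, Colombia; Trinity Western University, Langley, Canada
Categories: Algebraic Topology (math.AT); Computer Vision and Pattern Recognition (cs.CV)
Comments: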
Abstract:In this paper, we introduce a shape descriptor that we call “interior function”. This is a Topological Data Analysis (TDA) based descriptor that refines previous descriptors for image analysis. Using this concept, we define subcomplex lacunarity, a new index that quantifies geometric characteristics of necrosis in tumors such as conglomeration. Building on this framework, we propose a set of indices to analyze necrotic morphology and construct a diagram that captures the distinct structural and geometric properties of necrotic regions in tumors. We present an application of this framework in the study of MRIs of Glioblastomas (GB). Using cluster analysis, we identify four distinct subtypes of Glioblastomas that reflect geometric properties of necrotic regions.
[CV-130] Vision Transformer Based Semantic Communications for Next Generation Wireless Networks
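[Quick read]: This paper addresses semantic communication in 6G networks, where transmitting semantic meaning is prioritized over raw data accuracy. It proposes a semantic communication framework based on the Vision Transformer (ViT), using ViT as the encoder-decoder architecture to encode images into semantically rich representations at the transmitter and to reconstruct them at the receiver under real-world fading and noise, while minimizing bandwidth demand. Building on the attention mechanisms of ViTs, the model is reported to outperform Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The proposed architecture achieves a Peak Signal-to-Noise Ratio (PSNR) of 38 dB, which the authors report is higher than other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments.
Link: https://arxiv.org/abs/2503.17275
Authors: Muhammad Ahmed Mohsin,Muhammad Jazib,Zeeshan Alam,Muhmmad Farhan Khan,Muhammad Saad,Muhammad Ali Jamshed
Affiliations: Stanford University; Pakistan Engineering Council, Pakistan Institute of Engineering & Applied Sciences (PIEAS), Pakistan; University of New Brunswick (UNB), Canada; University College Cork (UCC), Ireland; National University of Sciences and Technology (NUST), Pakistan; University of Glasgow, UK
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: Accepted @ ICC 2025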
Abstract:In the evolving landscape of 6G networks, semantic communications are poised to revolutionize data transmission by prioritizing the transmission of semantic meaning over raw data accuracy. This paper presents a Vision Transformer (ViT)-based semantic communication framework that has been deliberately designed to achieve high semantic similarity during image transmission while simultaneously minimizing the demand for bandwidth. By equipping ViT as the encoder-decoder framework, the proposed architecture can proficiently encode images into a high semantic content at the transmitter and precisely reconstruct the images, considering real-world fading and noise consideration at the receiver. Building on the attention mechanisms inherent to ViTs, our model outperforms Convolution Neural Network (CNNs) and Generative Adversarial Networks (GANs) tailored for generating such images. The architecture based on the proposed ViT network achieves the Peak Signal-to-noise Ratio (PSNR) of 38 dB, which is higher than other Deep Learning (DL) approaches in maintaining semantic similarity across different communication environments. These findings establish our ViT-based approach as a significant breakthrough in semantic communications.
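To put the reported 38 dB in context, PSNR is a logarithmic function of the mean squared error between the original and reconstructed images. The sketch below computes it for flat lists of pixel values. The peak value of 255 assumes 8-bit images, which the abstract does not state, and the pixel values in the usage lines are made up.

```python
import math


def mean_squared_error(original, reconstructed):
    """Average squared difference between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; infinite for identical inputs."""
    err = mean_squared_error(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Usage: every pixel is off by 10 grey levels, so MSE = 100.
print(psnr([100, 100], [110, 90]))
```

Under the 8-bit assumption, inverting the formula shows that 38 dB corresponds to an MSE of about 10.3, that is, a root-mean-square error of a little over 3 grey levels.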
[CV-131] Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images CVPR2025
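[Quick read]: This paper addresses lung tumor segmentation in PET-CT imaging, which is hampered by poor image quality, motion artifacts, and complex tumor morphology, and by the small, private datasets that limit current deep learning methods. It introduces PCLT20K, a large-scale PET-CT lung tumor segmentation dataset with 21,930 pairs of PET-CT images from 605 patients, and proposes a cross-modal interactive perception network with Mamba (CIPA). CIPA includes a channel-wise rectification module (CRM), which applies a channel state space block across multi-modal features to learn correlated representations and filter out modality-specific noise, and a dynamic cross-modality interaction module (DCIM), which integrates position and context information by using PET images to learn regional position information that serves as a bridge for modeling relationships between local features of CT images. Experiments on a comprehensive benchmark show that CIPA compares favorably with current state-of-the-art segmentation methods. The dataset and code are available via the link given in the abstract.
Link: https://arxiv.org/abs/2503.17261
Authors: Jie Mei,Chenyu Lin,Yu Qiu,Yaonan Wang,Hui Zhang,Ziyang Wang,Dong Dai
Affiliations: Hunan University; Nankai University; Hunan Normal University; Tianjin Medical University Cancer Institute and Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025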
Abstract:Lung cancer is a leading cause of cancer-related deaths globally. PET-CT is crucial for imaging lung tumors, providing essential metabolic and anatomical information, while it faces challenges such as poor image quality, motion artifacts, and complex tumor morphology. Deep learning-based models are expected to address these problems, however, existing small-scale and private datasets limit significant performance improvements for these methods. Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. Furthermore, we propose a cross-modal interactive perception network with Mamba (CIPA) for lung tumor segmentation in PET-CT images. Specifically, we design a channel-wise rectification module (CRM) that implements a channel state space block across multi-modal features to learn correlated representations and helps filter out modality-specific noise. A dynamic cross-modality interaction module (DCIM) is designed to effectively integrate position and context information, which employs PET images to learn regional position information and serves as a bridge to assist in modeling the relationships between local features of CT images. Extensive experiments on a comprehensive benchmark demonstrate the effectiveness of our CIPA compared to the current state-of-the-art segmentation methods. We hope our research can provide more exploration opportunities for medical image segmentation. The dataset and code are available at this https URL.
[CV-132] Deep End-to-End Posterior ENergy (DEEPEN) for image recovery
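[Quick read]: This paper addresses a limitation of current end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms: they approximate the maximum a posteriori (MAP) estimate but cannot sample from the posterior distribution as diffusion models can, while diffusion models are hard to train in an E2E fashion. The proposed Deep End-to-End Posterior ENergy (DEEPEN) framework supports both MAP estimation and posterior sampling. Its key step is to learn the parameters of the posterior, expressed as the sum of the data consistency error and the negative log-prior, by maximum likelihood optimization in an E2E fashion. DEEPEN does not require algorithm unrolling, so its computational and memory footprint is smaller than that of current E2E methods, and it does not need the contraction constraints typically required by current PnP methods. Results show better performance than current E2E and PnP models in the MAP setting and faster sampling than diffusion models, and the learned energy-based model is observed to be more robust to changes in image acquisition settings.
Link: https://arxiv.org/abs/2503.17244
Authors: Jyothi Rikhab Chand,Mathews Jacob
Affiliations: Department of Electrical and Computer Engineering, University of Virginia
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: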
Abstract:Current end-to-end (E2E) and plug-and-play (PnP) image reconstruction algorithms approximate the maximum a posteriori (MAP) estimate but cannot offer sampling from the posterior distribution, like diffusion models. By contrast, it is challenging for diffusion models to be trained in an E2E fashion. This paper introduces a Deep End-to-End Posterior ENergy (DEEPEN) framework, which enables MAP estimation as well as sampling. We learn the parameters of the posterior, which is the sum of the data consistency error and the negative log-prior distribution, using maximum likelihood optimization in an E2E fashion. The proposed approach does not require algorithm unrolling, and hence has a smaller computational and memory footprint than current E2E methods, while it does not require contraction constraints typically needed by current PnP methods. Our results demonstrate that DEEPEN offers improved performance than current E2E and PnP models in the MAP setting, while it also offers faster sampling compared to diffusion models. In addition, the learned energy-based model is observed to be more robust to changes in image acquisition settings.
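One way to write the posterior energy described in the abstract is shown below. All notation is assumed for illustration and is not taken from the paper: a linear forward operator $A$, measurements $b$, Gaussian noise with variance $\sigma^2$, and a learned network $\phi_\theta$ standing in for the negative log-prior.

```latex
\begin{aligned}
E_\theta(x; b) &= \underbrace{\tfrac{1}{2\sigma^2}\,\lVert A x - b \rVert_2^2}_{\text{data consistency}}
  + \underbrace{\phi_\theta(x)}_{-\log p_\theta(x)\ \text{(up to a constant)}}, \\
p_\theta(x \mid b) &\propto \exp\!\bigl(-E_\theta(x; b)\bigr), \\
\hat{x}_{\mathrm{MAP}} &= \arg\min_x \; E_\theta(x; b).
\end{aligned}
```

Under this reading, MAP estimation is the minimization of $E_\theta$, while sampling draws from the distribution proportional to $\exp(-E_\theta)$, which is how a single learned energy can serve both purposes.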
[CV-133] A New Statistical Model of Star Speckles for Learning to Detect and Characterize Exoplanets in Direct Imaging Observations CVPR2025
[Quick read]: This paper addresses the main challenge of detecting exoplanets by direct imaging: separating the faint exoplanet signal from the much stronger residual starlight of the host star. The key contribution is a new statistical model that captures these nuisance fluctuations with a multi-scale approach, exploiting problem symmetries and a joint spectral channel representation grounded in physical principles. The model is integrated into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. Evaluated on datasets from the SPHERE instrument at the Very Large Telescope (VLT), it significantly improves the precision-recall trade-off, notably on challenging datasets.
Link: https://arxiv.org/abs/2503.17117
Authors: Théo Bodrito,Olivier Flasseur,Julien Mairal,Jean Ponce,Maud Langlois,Anne-Marie Lagrange
Affiliations: Département d’Informatique de l’École normale supérieure (ENS-PSL, CNRS, Inria); Universite Claude Bernard Lyon 1, Centre de Recherche Astrophysique de Lyon UMR 5574, ENS de Lyon, CNRS, Villeurbanne, F-69622, France; Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Courant Institute and Center for Data Science, New York University; Laboratoire d’Études Spatiales et d’Instrumentation en Astrophysique, Observatoire de Paris, Université PSL, Sorbonne Université, Université Paris Diderot; Université Grenoble Alpes, Institut de Planétologie et d’Astrophysique de Grenoble
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
Comments: Accepted to CVPR 2025
Abstract:The search for exoplanets is an active field in astronomy, with direct imaging as one of the most challenging methods due to faint exoplanet signals buried within stronger residual starlight. Successful detection requires advanced image processing to separate the exoplanet signal from this nuisance component. This paper presents a novel statistical model that captures nuisance fluctuations using a multi-scale approach, leveraging problem symmetries and a joint spectral channel representation grounded in physical principles. Our model integrates into an interpretable, end-to-end learnable framework for simultaneous exoplanet detection and flux estimation. The proposed algorithm is evaluated against the state of the art using datasets from the SPHERE instrument operating at the Very Large Telescope (VLT). It significantly improves the precision-recall trade-off, notably on challenging datasets that are otherwise unusable by astronomers. The proposed approach is computationally efficient, robust to varying data quality, and well suited for large-scale observational surveys.
[CV-134] Exploring Few-Shot Object Detection on Blood Smear Images: A Case Study of Leukocytes and Schistocytes
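[Quick read]: This paper addresses the accurate counting of specific blood cell types, which underpins the detection of blood disorders. It investigates DE-ViT in a Few-Shot setting, where training relies on a limited number of images, and compares it against two baselines, Faster R-CNN 50 and Faster R-CNN X 101, on the Raabin-WBC dataset (leukocyte detection) and a local dataset for schistocyte identification. Although DE-ViT achieves state-of-the-art performance on COCO and LVIS, both baselines surpassed it on Raabin-WBC, and only Faster R-CNN X 101 yielded satisfactory results on SC-IDB. The authors suggest that the gap may be due to domain shift.
Link: https://arxiv.org/abs/2503.17107
Authors: Davide Antonio Mura,Michela Pinna,Lorenzo Putzu,Andrea Loddo,Alessandra Perniciano,Olga Mulas,Cecilia Di Ruberto
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: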
Abstract:The detection of blood disorders often hinges upon the quantification of specific blood cell types. Variations in cell counts may indicate the presence of pathological conditions. Thus, the significance of developing precise automatic systems for blood cell enumeration is underscored. The investigation focuses on a novel approach termed DE-ViT. This methodology is employed in a Few-Shot paradigm, wherein training relies on a limited number of images. Two distinct datasets are utilised for experimental purposes: the Raabin-WBC dataset for Leukocyte detection and a local dataset for Schistocyte identification. In addition to the DE-ViT model, two baseline models, Faster R-CNN 50 and Faster R-CNN X 101, are employed, with their outcomes being compared against those of the proposed model. While DE-ViT has demonstrated state-of-the-art performance on the COCO and LVIS datasets, both baseline models surpassed its performance on the Raabin-WBC dataset. Moreover, only Faster R-CNN X 101 yielded satisfactory results on the SC-IDB. The observed disparities in performance may possibly be attributed to domain shift phenomena.
[CV-135] A Comparative Analysis of Image Descriptors for Histopathological Classification of Gastric Cancer
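[Quick read]: This paper addresses the inadequate prognostic predictability in gastric cancer pathology, which is compounded by pathologists' high workload and potential diagnostic errors. It applies Machine Learning and Deep Learning techniques to classify histopathological images as healthy or cancerous. Using handcrafted and deep features with shallow learning classifiers on the GasHisSDB dataset, and without fine-tuning strategies, the study compares feature and classifier combinations to find the most robust and best-performing ones. With a Random Forest (RF) classifier, the approach reaches an F1 score of 93.4%.
Link: https://arxiv.org/abs/2503.17105
Authors: Marco Usai,Andrea Loddo,Alessandra Perniciano,Maurizio Atzori,Cecilia Di Ruberto
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: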
Abstract:Gastric cancer ranks as the fifth most common and fourth most lethal cancer globally, with a dismal 5-year survival rate of approximately 20%. Despite extensive research on its pathobiology, the prognostic predictability remains inadequate, compounded by pathologists’ high workload and potential diagnostic errors. Thus, automated, accurate histopathological diagnosis tools are crucial. This study employs Machine Learning and Deep Learning techniques to classify histopathological images into healthy and cancerous categories. Using handcrafted and deep features with shallow learning classifiers on the GasHisSDB dataset, we offer a comparative analysis and insights into the most robust and high-performing combinations of features and classifiers for distinguishing between normal and abnormal histopathological images without fine-tuning strategies. With the RF classifier, our approach can reach F1 of 93.4%, demonstrating its validity.
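For reference, the F1 score reported above is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name is hypothetical and the counts in the usage line are made-up numbers, not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2 * TP / (2 * TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0  # no positives predicted or present; convention chosen for this sketch
    return 2 * true_pos / denominator


# Usage: 90 cancerous images found, 10 false alarms, 10 missed.
print(f1_score(true_pos=90, false_pos=10, false_neg=10))
```

Because F1 ignores true negatives, it is a common choice for healthy-versus-cancerous classification, where missed positives and false alarms are the errors of interest.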
[CV-136] Does a Rising Tide Lift All Boats? Bias Mitigation for AI-based CMR Segmentation
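[Quick read]: This paper addresses race bias in AI-based cardiac magnetic resonance (CMR) image segmentation models, particularly when training datasets are imbalanced. It evaluates several common bias mitigation methods (oversampling, importance reweighing, Group DRO, and combinations of these) for reducing bias between Black and White subjects, and repeats the evaluation on models trained and evaluated on cropped CMR images. Oversampling significantly improves performance for the underrepresented Black subjects without significantly reducing performance for the majority White subjects. Group DRO improves performance for Black subjects, but not significantly, while reweighing decreases it. Combining oversampling with Group DRO also improves performance for Black subjects, but not significantly. Using cropped images increases performance for both groups and reduces the bias, and adding oversampling on top of cropping reduces it further.
Link: https://arxiv.org/abs/2503.17089
Authors: Tiarna Lee,Esther Puyol-Antón,Bram Ruijsink,Miaojing Shi,Andrew P. King
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: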
Abstract:Artificial intelligence (AI) is increasingly being used for medical imaging tasks. However, there can be biases in the resulting models, particularly when they were trained using imbalanced training datasets. One such example has been the strong race bias effect in cardiac magnetic resonance (CMR) image segmentation models. Although this phenomenon has been reported in a number of publications, little is known about the effectiveness of bias mitigation algorithms in this domain. We aim to investigate the impact of common bias mitigation methods to address bias between Black and White subjects in AI-based CMR segmentation models. Specifically, we use oversampling, importance reweighing and Group DRO as well as combinations of these techniques to mitigate the race bias. Furthermore, motivated by recent findings on the root causes of AI-based CMR segmentation bias, we evaluate the same methods using models trained and evaluated on cropped CMR images. We find that bias can be mitigated using oversampling, significantly improving performance for the underrepresented Black subjects whilst not significantly reducing the majority White subjects’ performance. Group DRO also improves performance for Black subjects but not significantly, while reweighing decreases performance for Black subjects. Using a combination of oversampling and Group DRO also improves performance for Black subjects but not significantly. Using cropped images increases performance for both races and reduces the bias, whilst adding oversampling as a bias mitigation technique with cropped images reduces the bias further.
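Oversampling, the method that worked best here, can be sketched as resampling each smaller group with replacement until every group matches the largest one. The code below is a generic illustration under that assumption. The function name, the `group_of` callback, and the group labels are hypothetical, and the paper's exact sampling procedure is not described in the abstract.

```python
import random
from collections import defaultdict


def oversample_by_group(samples, group_of, seed=0):
    """Return a training list in which every group has as many items as the largest group.

    Smaller groups are padded with randomly repeated members (sampling with replacement).
    Expects a non-empty list of samples.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[group_of(sample)].append(sample)

    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw the shortfall from the group's own members; k may be 0 for the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Usage with placeholder records: 8 subjects in group "A", 2 in group "B".
subjects = [("A", i) for i in range(8)] + [("B", i) for i in range(2)]
balanced = oversample_by_group(subjects, group_of=lambda s: s[0])
print(len(balanced))  # 16
```

The original samples are always kept, so the majority group is unchanged and only the underrepresented group is seen more often during training.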
[CV-137] Semi-supervised Cervical Segmentation on Ultrasound by A Dual Framework for Neural Networks
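Quick read: This paper tackles accurate segmentation of the cervical muscles in ultrasound (US) images, and in particular how to build automatic computer-assisted methods when labeled data is scarce. The key contribution is a novel semi-supervised learning (SSL) framework that integrates dual neural networks, which generate pseudo-labels for each other and cross-supervise at the pixel level. A self-supervised contrastive learning strategy is also introduced, using a pair of deep representations to strengthen feature learning, especially on unlabeled data. The framework shows competitive performance on cervical segmentation tasks.
Link: https://arxiv.org/abs/2503.17057
Authors: Fangyijie Wang,Kathleen M. Curran,Guénolé Silvestre
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for an oral presentation at ISBI 2025 Fetal Ultrasound Grand Challenge: Semi-Supervised Cervical Segmentation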
Abstract:Accurate segmentation of ultrasound (US) images of the cervical muscles is crucial for precision healthcare. The demand for automatic computer-assisted methods is high. However, the scarcity of labeled data hinders the development of these methods. Advanced semi-supervised learning approaches have displayed promise in overcoming this challenge by utilizing labeled and unlabeled data. This study introduces a novel semi-supervised learning (SSL) framework that integrates dual neural networks. This SSL framework utilizes both networks to generate pseudo-labels and cross-supervise each other at the pixel level. Additionally, a self-supervised contrastive learning strategy is introduced, which employs a pair of deep representations to enhance feature learning capabilities, particularly on unlabeled data. Our framework demonstrates competitive performance in cervical segmentation tasks. Our codes are publicly available on this https URL_Cervical_Segmentation.
[CV-138] Exploring the Efficacy of Partial Denoising Using Bit Plane Slicing for Enhanced Fracture Identification: A Comparative Study of Deep Learning-Based Approaches and Handcrafted Feature Extraction Techniques
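Quick read: This paper addresses the difficulty of accurate fracture classification caused by complex patterns and image noise. The key idea is partial denoising: several image representations (the original image, different bit-plane combinations, the fully denoised image, and others) are compared in terms of signal-to-noise ratio (SNR) and classification accuracy to identify the bit planes that carry the most informative features. The study stresses the role of partial denoising in preserving crucial features and thereby improving classification. With a Random Forest classifier, the partially denoised representation reaches a testing accuracy of 95.61%, surpassing the other representations. This suggests that combining partial denoising with a suitable classifier is central to improving fracture detection.
Link: https://arxiv.org/abs/2503.17030
Authors: Snigdha Paul,Sambit Mallick,Anindya Sen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: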
Abstract:Computer vision has transformed medical diagnosis, treatment, and research through advanced image processing and machine learning techniques. Fracture classification, a critical area in healthcare, has greatly benefited from these advancements, yet accurate detection is challenged by complex patterns and image noise. Bit plane slicing enhances medical images by reducing noise interference and extracting informative features. This research explores partial denoising techniques to provide practical solutions for improved fracture analysis, ultimately enhancing patient care. The study explores deep learning model DenseNet and handcrafted feature extraction. Decision Tree and Random Forest, were employed to train and evaluate distinct image representations. These include the original image, the concatenation of the four bit planes from the LSB as well as MSB, the fully denoised image, and an image consisting of 6 bit planes from MSB and 2 denoised bit planes from LSB. The purpose of forming these diverse image representations is to analyze SNR as well as classification accuracy and identify the bit planes that contain the most informative features. Moreover, the study delves into the significance of partial denoising techniques in preserving crucial features, leading to improvements in classification results. Notably, this study shows that employing the Random Forest classifier, the partially denoised image representation exhibited a testing accuracy of 95.61% surpassing the performance of other image representations. The outcomes of this research provide valuable insights into the development of efficient preprocessing, feature extraction and classification approaches for fracture identification. By enhancing diagnostic accuracy, these advancements hold the potential to positively impact patient care and overall medical outcomes.
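Bit plane slicing itself is simple to demonstrate. The sketch below extracts a single bit plane from an 8-bit image and rebuilds images that keep only the four planes on the MSB side or the four on the LSB side, two of the representations the abstract lists. The helper names are hypothetical, the image is a toy nested list, and the denoising step applied to the low planes in the paper is not reproduced here.

```python
def bit_plane(image, k):
    """Extract bit plane k (0 = LSB, 7 = MSB) of an 8-bit image as 0/1 values."""
    return [[(pixel >> k) & 1 for pixel in row] for row in image]


def keep_planes(image, planes):
    """Rebuild an image that keeps only the listed bit planes."""
    mask = sum(1 << k for k in planes)
    return [[pixel & mask for pixel in row] for row in image]


image = [[200, 13], [255, 64]]  # toy 2x2 8-bit image

msb = bit_plane(image, 7)                       # [[1, 0], [1, 0]]
top4 = keep_planes(image, planes=[4, 5, 6, 7])  # [[192, 0], [240, 64]]
low4 = keep_planes(image, planes=[0, 1, 2, 3])  # [[8, 13], [15, 0]]
print(msb, top4, low4)
```

Because the planes partition the bits of each pixel, `top4` and `low4` add back up to the original image.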
[CV-139] High Accuracy Pulmonary Vessel Segmentation for Contrast and Non-contrast CT Images and Its Clinical Evaluation
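Quick read: This paper addresses accurate segmentation of pulmonary vessels in CT pulmonary angiography (CTPA) and non-contrast CT (NCCT) images, a critical step in diagnosing and assessing many lung diseases. High-precision pulmonary vessel segmentation algorithms for CTPA are lacking, and segmentation in NCCT is harder still. The paper proposes a 3D image segmentation algorithm for automated pulmonary vessel segmentation in both kinds of CT image. The key component is a Vessel Lumen Structure Optimization Module (VLSOM), which extracts vessel centerlines, adjusts weights based on positional information, and adds a Cl-Dice-Loss to supervise the stability of the vessel structure. The authors also develop a method for generating NCCT vessel ground truth from CTPA, so that a single model can be trained for both CTPA and NCCT. Experiments show strong segmentation performance on both modalities, supporting the method's potential for clinical use.
Link: https://arxiv.org/abs/2503.16988
Authors: Ying Ming (1),Shaoze Luo (2),Longfei Zhao (2),Qiqi Xu (2),Wei Song (1)
Affiliations: (1) Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College; (2) Research and Development Center, Canon Medical Systems (China)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: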
Abstract:Accurate segmentation of pulmonary vessels plays a very critical role in diagnosing and assessing various lung diseases. In clinical practice, diagnosis is typically carried out using CTPA images. However, there is a lack of high-precision pulmonary vessel segmentation algorithms for CTPA, and pulmonary vessel segmentation for NCCT poses an even greater challenge. In this study, we propose a 3D image segmentation algorithm for automated pulmonary vessel segmentation from both contrast and non-contrast CT images. In the network, we designed a Vessel Lumen Structure Optimization Module (VLSOM), which extracts the centerline of vessels and adjusts the weights based on the positional information and adds a Cl-Dice-Loss to supervise the stability of the vessels structure. In addition, we designed a method for generating vessel GT from CTPA to NCCT for training models that support both CTPA and NCCT. In this work, we used 427 sets of high-precision annotated CT data from multiple vendors and countries. Finally, our experimental model achieved Cl-Recall, Cl-DICE and Recall values of 0.879, 0.909, 0.934 (CTPA) and 0.928, 0.936, 0.955 (NCCT) respectively. This shows that our model has achieved good performance in both accuracy and completeness of pulmonary vessel segmentation. In clinical visual evaluation, our model also had good segmentation performance on various disease types and can assist doctors in medical diagnosis, verifying the great potential of this method in clinical application.
[CV-140] From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech CVPR2025 KR
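Quick read: This paper addresses generating high-quality speech from silent talking-face videos, a task known as video-to-speech synthesis. The main challenge is the substantial modality gap between silent video and multi-faceted speech. To bridge this gap and markedly improve synthesized speech quality, the paper proposes a novel video-to-speech system. The key is learning hierarchical representations from video to speech: silent video is gradually transformed into acoustic feature spaces through three sequential stages (content, timbre, and prosody modeling), and in each stage the visual factors (lip movements, face identity, and facial expressions) are aligned with their acoustic counterparts. To generate realistic and coherent speech, the system uses a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Experiments show generation quality comparable to real utterances, outperforming existing methods by a significant margin.
Link: https://arxiv.org/abs/2503.16956
Authors: Ji-Hoon Kim,Jeongsoo Choi,Jaehun Kim,Chaeyoung Jung,Joon Son Chung
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: CVPR 2025, demo page: this https URL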
Abstract:The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning of hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages – content, timbre, and prosody modeling. In each stage, we align visual factors – lip movements, face identity, and facial expressions – with corresponding acoustic counterparts to ensure the seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves exceptional generation quality comparable to real utterances, outperforming existing methods by a significant margin.
[CV-141] Downstream Analysis of Foundational Medical Vision Models for Disease Progression
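Quick read: This paper evaluates how well medical vision foundation models can predict disease progression, and examines how intermediate-layer features differ between image segmentation and registration models. The central hypothesis is that intermediate-layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. The authors test this with a simple linear probe and find that the features are useful for disease progression prediction. They further show that registration model features work without spatially aligned input images, whereas segmentation models need spatial alignment for optimal performance. The findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.
Link: https://arxiv.org/abs/2503.16842
Authors: Basar Demir,Soumitri Chattopadhyay,Thomas Hastings Greer,Boqi Chen,Marc Niethammer
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: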
Abstract:Medical vision foundational models are used for a wide variety of tasks, including medical image segmentation and registration. This work evaluates the ability of these models to predict disease progression using a simple linear probe. We hypothesize that intermediate layer features of segmentation models capture structural information, while those of registration models encode knowledge of change over time. Beyond demonstrating that these features are useful for disease progression prediction, we also show that registration model features do not require spatially aligned input images. However, for segmentation models, spatial alignment is essential for optimal performance. Our findings highlight the importance of spatial alignment and the utility of foundation model features for image registration.
[CV-142] Depth-Aided Color Image Inpainting in Quaternion Domain
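Quick read: This paper addresses the failure of conventional quaternion-domain color image inpainting methods to exploit depth information. Conventional quaternion-based inpainting uses only the three imaginary parts for the color channels, while the real part is set to zero and carries no information. The key idea of the proposed depth-aided low-rank quaternion matrix completion (D-LRQMC) is to place depth information in the real part of the quaternion representation, exploiting the correlation between color and depth to improve inpainting. Specifically, the method first restores the observed image with conventional LRQMC and estimates the depth of the restored result, then incorporates the estimated depth into the real part of the observed image and runs LRQMC again. Simulation results show that D-LRQMC improves restoration accuracy and visual quality over conventional LRQMC on a variety of images, indicating that depth information is effective for color image processing in the quaternion domain.
Link: https://arxiv.org/abs/2503.16818
Authors: Shunki Tatsumi,Ryo Hayakawa,Youji Iiguni
Affiliations: Graduate School of Engineering Science, Osaka University; Institute of Engineering, Tokyo University of Agriculture and Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to IEEE Signal Processing Letters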
Abstract:In this paper, we propose a depth-aided color image inpainting method in the quaternion domain, called depth-aided low-rank quaternion matrix completion (D-LRQMC). In conventional quaternion-based inpainting techniques, the color image is expressed as a quaternion matrix by using the three imaginary parts as the color channels, whereas the real part is set to zero and has no information. Our approach incorporates depth information as the real part of the quaternion representations, leveraging the correlation between color and depth to improve the result of inpainting. In the proposed method, we first restore the observed image with the conventional LRQMC and estimate the depth of the restored result. We then incorporate the estimated depth into the real part of the observed image and perform LRQMC again. Simulation results demonstrate that the proposed D-LRQMC can improve restoration accuracy and visual quality for various images compared to the conventional LRQMC. These results suggest the effectiveness of the depth information for color image processing in quaternion domain.
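The encoding difference described in the abstract can be shown in a few lines. In this minimal sketch each pixel becomes a tuple `(real, i, j, k)`; the function name `to_quaternion_image` and the tuple layout are assumptions for illustration, and the low-rank completion step (LRQMC) and the depth estimator are deliberately left out.

```python
def to_quaternion_image(rgb, depth=None):
    """Map an RGB image to per-pixel quaternions (real, i, j, k).

    With depth=None the real part is zero (the conventional encoding);
    otherwise the depth value fills the real part.
    """
    out = []
    for y, row in enumerate(rgb):
        out_row = []
        for x, (r, g, b) in enumerate(row):
            real = 0.0 if depth is None else depth[y][x]
            out_row.append((real, r, g, b))
        out.append(out_row)
    return out


rgb = [[(10, 20, 30), (40, 50, 60)]]  # toy 1x2 color image
depth = [[0.5, 0.9]]                  # hypothetical estimated depth map

conventional = to_quaternion_image(rgb)
depth_aided = to_quaternion_image(rgb, depth)
print(conventional[0][0], depth_aided[0][0])
```

In the two-pass procedure the abstract describes, the `depth` argument would come from a depth estimate of the first LRQMC restoration.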
[CV-143] Fed-NDIF: A Noise-Embedded Federated Diffusion Model For Low-Count Whole-Body PET Denoising
Quick read: This paper addresses the increased image noise and reduced lesion detectability that low-count positron emission tomography (LCPET) images suffer as a result of lower radiation dose, and proposes an effective denoising method. Training diffusion models requires large and diverse datasets, which are hard to obtain in the medical domain, and privacy is also a major concern. The paper therefore combines federated learning with diffusion models, using decentralized training to ease data scarcity and privacy problems. However, variation in scanner types and image noise levels across institutions makes federated learning for LCPET denoising more difficult.
To address these challenges, the study proposes a noise-embedded federated learning diffusion model (Fed-NDIF). The core of the method is to leverage a multicenter dataset with varying count levels, embed liver normalized standard deviation (NSTD) noise into a 2.5D diffusion model, and use the Federated Averaging (FedAvg) algorithm to aggregate locally trained models into a global model. The global model is then fine-tuned on each local dataset to obtain personalized models. Experiments show that Fed-NDIF significantly outperforms local diffusion models and federated UNet-based models in PSNR, SSIM, and NMSE over the entire 3D volume, and also improves lesion detectability and quantification.
Link: https://arxiv.org/abs/2503.16635
Authors: Yinchi Zhou,Huidong Xie,Menghua Xia,Qiong Liu,Bo Zhou,Tianqi Chen,Jun Hou,Liang Guo,Xinyuan Zheng,Hanzhong Wang,Biao Li,Axel Rominger,Kuangyu Shi,Nicha C. Dvorneka,Chi Liu
Affiliations: Department of Biomedical Engineering, Yale University, New Haven, CT, USA; Department of Radiology, Northwestern University, Chicago, IL, USA; Department of Nuclear Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland; Computer Aided Medical Procedures and Augmented Reality, Institute of Informatics I16, Technical University of Munich, Munich, Germany; Department of Radiology and Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
备注:
Abstract:Low-count positron emission tomography (LCPET) imaging can reduce patients’ exposure to radiation but often suffers from increased image noise and reduced lesion detectability, necessitating effective denoising techniques. Diffusion models have shown promise in LCPET denoising for recovering degraded image quality. However, training such models requires large and diverse datasets, which are challenging to obtain in the medical domain. To address data scarcity and privacy concerns, we combine diffusion models with federated learning – a decentralized training approach where models are trained individually at different sites, and their parameters are aggregated on a central server over multiple iterations. The variation in scanner types and image noise levels within and across institutions poses additional challenges for federated learning in LCPET denoising. In this study, we propose a novel noise-embedded federated learning diffusion model (Fed-NDIF) to address these challenges, leveraging a multicenter dataset and varying count levels. Our approach incorporates liver normalized standard deviation (NSTD) noise embedding into a 2.5D diffusion model and utilizes the Federated Averaging (FedAvg) algorithm to aggregate locally trained models into a global model, which is subsequently fine-tuned on local datasets to optimize performance and obtain personalized models. Extensive validation on datasets from the University of Bern, Ruijin Hospital in Shanghai, and Yale-New Haven Hospital demonstrates the superior performance of our method in enhancing image quality and improving lesion quantification. The Fed-NDIF model shows significant improvements in PSNR, SSIM, and NMSE of the entire 3D volume, as well as enhanced lesion detectability and quantification, compared to local diffusion models and federated UNet-based models.
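The aggregation step named in the abstract, Federated Averaging, reduces to a weighted mean of the sites' parameters. The sketch below assumes each site reports its parameters as a dict of flat float lists and that weights are proportional to local sample counts, as in the standard FedAvg formulation; whether the paper weights sites this way is not stated, and the function and variable names are hypothetical.

```python
def fed_avg(local_params, num_samples):
    """Average per-site parameter dicts (name -> list of floats),
    weighting each site by its number of training samples."""
    total = sum(num_samples)
    global_params = {}
    for name in local_params[0]:
        size = len(local_params[0][name])
        global_params[name] = [
            sum(p[name][i] * n for p, n in zip(local_params, num_samples)) / total
            for i in range(size)
        ]
    return global_params


# Two toy sites; the second holds three times as much data.
sites = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
global_model = fed_avg(sites, num_samples=[1, 3])
print(global_model)  # {'w': [2.5, 5.0]}
```

In a full round, each site would then continue training from `global_model`; the per-site fine-tuning the abstract mentions would follow the final aggregation.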
[CV-144] Reliable Radiologic Skeletal Muscle Area Assessment – A Biomarker for Cancer Cachexia Diagnosis
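Quick read: This paper addresses key obstacles to the early identification and monitoring of cancer cachexia: existing tools often lack full automation and show inconsistent accuracy, which limits their integration into clinical workflows. The proposed solution is SMAART-AI (Skeletal Muscle Assessment-Automated and Reliable Tool-based on AI), an end-to-end automated pipeline. It uses deep learning models (nnU-Net 2D) trained on CT images at the mid-third lumbar level, with 5-fold cross-validation to support generalizability and robustness. SMAART-AI also includes an uncertainty-based mechanism that flags high-error predictions for expert review, improving reliability. By combining skeletal muscle area (SMA), skeletal muscle index, BMI, and clinical data, the authors train a multi-layer perceptron (MLP) to predict cachexia at the time of cancer diagnosis, reaching 79% precision. By combining automation, accuracy, and uncertainty awareness, SMAART-AI aims to bridge the gap between research and clinical application in managing cancer cachexia.
Link: https://arxiv.org/abs/2503.16556
Authors: Sabeen Ahmed,Nathan Parker,Margaret Park,Daniel Jeong,Lauren Peres,Evan W. Davis,Jennifer B. Permuth,Erin Siegel,Matthew B. Schabath,Yasin Yilmaz,Ghulam Rasool
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 47 pages, 19 figures, 9 Tables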
Abstract:Cancer cachexia is a common metabolic disorder characterized by severe muscle atrophy which is associated with poor prognosis and quality of life. Monitoring skeletal muscle area (SMA) longitudinally through computed tomography (CT) scans, an imaging modality routinely acquired in cancer care, is an effective way to identify and track this condition. However, existing tools often lack full automation and exhibit inconsistent accuracy, limiting their potential for integration into clinical workflows. To address these challenges, we developed SMAART-AI (Skeletal Muscle Assessment-Automated and Reliable Tool-based on AI), an end-to-end automated pipeline powered by deep learning models (nnU-Net 2D) trained on mid-third lumbar level CT images with 5-fold cross-validation, ensuring generalizability and robustness. SMAART-AI incorporates an uncertainty-based mechanism to flag high-error SMA predictions for expert review, enhancing reliability. We combined the SMA, skeletal muscle index, BMI, and clinical data to train a multi-layer perceptron (MLP) model designed to predict cachexia at the time of cancer diagnosis. Tested on the gastroesophageal cancer dataset, SMAART-AI achieved a Dice score of 97.80% +/- 0.93%, with SMA estimated across all four datasets in this study at a median absolute error of 2.48% compared to manual annotations with SliceOmatic. Uncertainty metrics-variance, entropy, and coefficient of variation-strongly correlated with SMA prediction errors (0.83, 0.76, and 0.73 respectively). The MLP model predicts cachexia with 79% precision, providing clinicians with a reliable tool for early diagnosis and intervention. By combining automation, accuracy, and uncertainty awareness, SMAART-AI bridges the gap between research and clinical application, offering a transformative approach to managing cancer cachexia.
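One way to picture the uncertainty-based flagging is to compare SMA estimates from several models and send cases with large disagreement to an expert. The sketch below computes two of the metrics the abstract names, variance and coefficient of variation, across a list of estimates. That the estimates come from an ensemble, the function names, and the 5% threshold are all assumptions for illustration; the abstract does not say how the metrics are computed or thresholded.

```python
import statistics


def sma_uncertainty(sma_estimates):
    """Summarise agreement between several SMA estimates (e.g. in cm^2)."""
    mean = statistics.fmean(sma_estimates)
    variance = statistics.pvariance(sma_estimates)
    cv = (variance ** 0.5) / mean  # coefficient of variation
    return {"mean": mean, "variance": variance, "cv": cv}


def needs_review(sma_estimates, cv_threshold=0.05):
    """Flag a case for expert review when the estimates disagree too much.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return sma_uncertainty(sma_estimates)["cv"] > cv_threshold


print(needs_review([100.0, 100.0, 100.0]))  # False: models agree
print(needs_review([80.0, 100.0, 120.0]))   # True: large spread
```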
[CV-145] Comprehensive Review of Reinforcement Learning for Medical Ultrasound Imaging
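Quick read: This paper addresses the factors that limit the applicability of medical ultrasound (US) imaging, such as operator dependency, variability in interpretation, and limited resolution, which are amplified by a shortage of trained experts. Existing surveys focus mainly on partially autonomous solutions and do not explore the intersection between the stages of the US process and recent advances in reinforcement learning (RL). To fill this gap, the review proposes a comprehensive taxonomy that integrates the stages of the US process with the RL development pipeline. The taxonomy highlights recent RL advances in the US domain and identifies unresolved challenges on the way to fully autonomous US systems. The review's key aim is to show the potential of RL for building autonomous US solutions while identifying limitations and opportunities for further work.
Link: https://arxiv.org/abs/2503.16543
Authors: Hanae Elmekki,Saidul Islam,Ahmed Alagha,Hani Sami,Amanda Spilkin,Ehsan Zakeri,Antonela Mariel Zanuttini,Jamal Bentahar,Lyes Kadem,Wen-Fang Xie,Philippe Pibarot,Rabeb Mizouni,Hadi Otrok,Shakti Singh,Azzam Mourad
Affiliations: Concordia University; Université Laval; Khalifa University; Lebanese American University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 89 pages, 23 figures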
Abstract:Medical Ultrasound (US) imaging has seen increasing demands over the past years, becoming one of the most preferred imaging modalities in clinical practice due to its affordability, portability, and real-time capabilities. However, it faces several challenges that limit its applicability, such as operator dependency, variability in interpretation, and limited resolution, which are amplified by the low availability of trained experts. This calls for the need of autonomous systems that are capable of reducing the dependency on humans for increased efficiency and throughput. Reinforcement Learning (RL) comes as a rapidly advancing field under Artificial Intelligence (AI) that allows the development of autonomous and intelligent agents that are capable of executing complex tasks through rewarded interactions with their environments. Existing surveys on advancements in the US scanning domain predominantly focus on partially autonomous solutions leveraging AI for scanning guidance, organ identification, plane recognition, and diagnosis. However, none of these surveys explore the intersection between the stages of the US process and the recent advancements in RL solutions. To bridge this gap, this review proposes a comprehensive taxonomy that integrates the stages of the US process with the RL development pipeline. This taxonomy not only highlights recent RL advancements in the US domain but also identifies unresolved challenges crucial for achieving fully autonomous US systems. This work aims to offer a thorough review of current research efforts, highlighting the potential of RL in building autonomous US solutions while identifying limitations and opportunities for further advancements in this field.
Artificial Intelligence
[AI-0] HCAST: Human-Calibrated Autonomy Software Tasks
Link: https://arxiv.org/abs/2503.17354
Authors: David Rein,Joel Becker,Amy Deng,Seraphina Nix,Chris Canal,Daniel O’Connel,Pip Arnott,Ryan Bloom,Thomas Broadley,Katharyn Garcia,Brian Goodrich,Max Hasin,Sami Jawhar,Megan Kinniment,Thomas Kwa,Aron Lajko,Nate Rush,Lucas Jun Koba Sato,Sydney Von Arx,Ben West,Lawrence Chan,Elizabeth Barnes
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 10 figures, 5 tables
Abstract:To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect AI performance to real-world effects we care about. We present HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark of 189 machine learning engineering, cybersecurity, software engineering, and general reasoning tasks. We collect 563 human baselines (totaling over 1500 hours) from people skilled in these domains, working under identical conditions as AI agents, which lets us estimate that HCAST tasks take humans between one minute and 8+ hours. Measuring the time tasks take for humans provides an intuitive metric for evaluating AI capabilities, helping answer the question “can an agent be trusted to complete a task that would take a human X hours?” We evaluate the success rates of AI agents built on frontier foundation models, and we find that current agents succeed 70-80% of the time on tasks that take humans less than one hour, and less than 20% of the time on tasks that take humans more than 4 hours.
[AI-1] NdLinear Is All You Need for Representation Learning
Link: https://arxiv.org/abs/2503.17353
Authors: Alex Reneau,Jerry Yao-Chieh Hu,Zhongfang Zhuang,Ting-Chun Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL
Abstract:Many high-impact machine learning tasks involve multi-dimensional data (e.g., images, volumetric medical scans, multivariate time-series). Yet, most neural architectures flatten inputs, discarding critical cross-dimension information. We introduce NdLinear, a novel linear transformation that preserves these structures without extra overhead. By operating separately along each dimension, NdLinear captures dependencies that standard fully connected layers overlook. Extensive experiments across convolutional, recurrent, and transformer-based networks show significant improvements in representational power and parameter efficiency. Crucially, NdLinear serves as a foundational building block for large-scale foundation models by operating on any unimodal or multimodal data in its native form. This removes the need for flattening or modality-specific preprocessing. NdLinear rethinks core architectural priorities beyond attention, enabling more expressive, context-aware models at scale. We propose NdLinear as a drop-in replacement for standard linear layers – marking an important step toward next-generation neural architectures.
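To make "operating separately along each dimension" concrete, the sketch below applies one weight matrix per axis to a 2D input instead of flattening it. This is an illustrative reading of the abstract, not the released NdLinear code: the function name `per_axis_linear_2d`, the absence of bias terms, and the restriction to two dimensions are all assumptions.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def per_axis_linear_2d(x, w_rows, w_cols):
    """Transform a (d1 x d2) input along each axis in turn.

    w_rows is (h1 x d1) and w_cols is (h2 x d2); the result is
    w_rows @ x @ w_cols^T with shape (h1 x h2).
    """
    return matmul(matmul(w_rows, x), transpose(w_cols))


x = [[1, 2, 3],
     [4, 5, 6]]                   # 2 x 3 input
w_rows = [[1, 1]]                 # 1 x 2: mixes the two rows
w_cols = [[1, 0, 0],
          [0, 1, 1]]              # 2 x 3: mixes the three columns

y = per_axis_linear_2d(x, w_rows, w_cols)
print(y)  # [[5, 16]]
```

In this toy case the two per-axis matrices hold 2 + 6 = 8 weights, whereas a dense layer on the flattened 6-element input producing 2 outputs would hold 12, which hints at the parameter-efficiency argument in the abstract.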
[AI-2] Can AI expose tax loopholes? Towards a new generation of legal policy assistants
Link: https://arxiv.org/abs/2503.17339
Authors: Peter Fratrič,Nils Holzenberger,David Restrepo Amariles
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures
Abstract:The legislative process is the backbone of a state built on solid institutions. Yet, due to the complexity of laws – particularly tax law – policies may lead to inequality and social tensions. In this study, we introduce a novel prototype system designed to address the issues of tax loopholes and tax avoidance. Our hybrid solution integrates a natural language interface with a domain-specific language tailored for planning. We demonstrate on a case study how tax loopholes and avoidance schemes can be exposed. We conclude that our prototype can help enhance social welfare by systematically identifying and addressing tax gaps stemming from loopholes.
[AI-3] Capturing Individual Human Preferences with Reward Features
Link: https://arxiv.org/abs/2503.17338
Authors: André Barreto,Vincent Dumoulin,Yiran Mao,Nicolas Perez-Nieves,Bobak Shahriari,Yann Dauphin,Doina Precup,Hugo Larochelle
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Reinforcement learning from human feedback usually models preferences using a reward model that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We propose a method to specialise a reward model to a person or group of people. Our approach builds on the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models comparing the proposed architecture with a non-adaptive reward model and also adaptive counterparts, including models that do in-context personalisation. Depending on how much disagreement there is in the training data, our model either significantly outperforms the baselines or matches their performance with a simpler architecture and more stable training.
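The idea of expressing one person's reward as a linear combination of shared reward features can be sketched as follows. The feature vectors are taken as given, and only the per-user weights are fitted from a handful of preference pairs. The Bradley-Terry likelihood, the plain gradient ascent loop, and all names here are assumptions for illustration; the abstract does not specify the training objective.

```python
import math


def reward(weights, features):
    """User-specific reward: a linear combination of shared reward features."""
    return sum(w * f for w, f in zip(weights, features))


def adapt_weights(pairs, dim, lr=0.5, steps=200):
    """Fit per-user weights from (preferred, rejected) feature pairs by
    gradient ascent on a Bradley-Terry log-likelihood (an assumed objective)."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in pairs:
            diff = [a - b for a, b in zip(preferred, rejected)]
            p = 1.0 / (1.0 + math.exp(-reward(w, diff)))  # P(preferred beats rejected)
            for i in range(dim):
                grad[i] += (1.0 - p) * diff[i]
        w = [wi + lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w


# Toy user who favours feature 0 (say, "concise") over feature 1 ("formal").
pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [0.0, 1.0])]
user_weights = adapt_weights(pairs, dim=2)
print(user_weights[0] > user_weights[1])  # True
```

Because only a small weight vector is fitted per user, adaptation needs few comparisons, which matches the abstract's claim of quick adaptation to individuals.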
[AI-4] CVE-Bench: A Benchmark for AI Agents Ability to Exploit Real-World Web Application Vulnerabilities
Link: https://arxiv.org/abs/2503.17332
Authors: Yuxuan Zhu,Antony Kellermann,Dylan Bowman,Philip Li,Akul Gupta,Adarsh Danda,Richard Fang,Conner Jensen,Eric Ihli,Jason Benn,Jet Geronimo,Avi Dhir,Sudhit Rao,Kaicheng Yu,Twm Stone,Daniel Kang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 4 figures, 5 tables
Abstract:Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.
[AI-5] LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language
Link: https://arxiv.org/abs/2503.17309
Authors: Kun Chu,Xufeng Zhao,Cornelius Weber,Stefan Wermter
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Code and video are available at this https URL
Abstract:Bimanual robotic manipulation provides significant versatility, but also presents an inherent challenge due to the complexity involved in the spatial and temporal coordination between two hands. Existing works predominantly focus on attaining human-level manipulation skills for robotic hands, yet little attention has been paid to task planning on long-horizon timescales. With their outstanding in-context learning and zero-shot generation abilities, Large Language Models (LLMs) have been applied and grounded in diverse robotic embodiments to facilitate task planning. However, LLMs still suffer from errors in long-horizon reasoning and from hallucinations in complex robotic tasks, lacking a guarantee of logical correctness when generating the plan. Previous works, such as LLM+P, extended LLMs with symbolic planners. However, none have been successfully applied to bimanual robots. New challenges inevitably arise in bimanual manipulation, necessitating not only effective task decomposition but also efficient task allocation. To address these challenges, this paper introduces LLM+MAP, a bimanual planning framework that integrates LLM reasoning and multi-agent planning, automating effective and efficient bimanual task planning. We conduct simulated experiments on various long-horizon manipulation tasks of differing complexity. Our method is built using GPT-4o as the backend, and we compare its performance against plans generated directly by LLMs, including GPT-4o, V3 and also recent strong reasoning models o1 and R1. By analyzing metrics such as planning time, success rate, group debits, and planning-step reduction rate, we demonstrate the superior performance of LLM+MAP, while also providing insights into robotic reasoning. Code is available at this https URL.
[AI-6] Preference-Guided Diffusion for Multi-Objective Offline Optimization
Link: https://arxiv.org/abs/2503.17299
Authors: Yashas Annadani,Syrine Belakaria,Stefano Ermon,Stefan Bauer,Barbara E Engelhardt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Offline multi-objective optimization aims to identify Pareto-optimal solutions given a dataset of designs and their objective values. In this work, we propose a preference-guided diffusion model that generates Pareto-optimal designs by leveraging a classifier-based guidance mechanism. Our guidance classifier is a preference model trained to predict the probability that one design dominates another, directing the diffusion model toward optimal regions of the design space. Crucially, this preference model generalizes beyond the training distribution, enabling the discovery of Pareto-optimal solutions outside the observed dataset. We introduce a novel diversity-aware preference guidance, augmenting Pareto dominance preference with diversity criteria. This ensures that generated solutions are optimal and well-distributed across the objective space, a capability absent in prior generative methods for offline multi-objective optimization. We evaluate our approach on various continuous offline multi-objective optimization tasks and find that it consistently outperforms other inverse/generative approaches while remaining competitive with forward/surrogate-based optimization methods. Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions that approximate the Pareto front well.
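The preference model in the abstract is trained to predict whether one design dominates another, so it helps to see the dominance relation itself. The sketch below checks Pareto dominance and filters a set of objective vectors down to its non-dominated front. It assumes every objective is minimised, which is a convention chosen here rather than something the abstract states, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimised):
    a is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


designs = [(1, 4), (2, 2), (3, 3), (4, 1)]
print(pareto_front(designs))  # [(1, 4), (2, 2), (4, 1)]
```

Here `(3, 3)` drops out because `(2, 2)` is better on both objectives; labelled pairs of this kind are what a dominance classifier would be trained on.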
[AI-7] Breaking the Symmetries of Indistinguishable Objects
Link: https://arxiv.org/abs/2503.17251
Authors: Ozgur Akgun,Mun See Chang,Ian P. Gent,Christopher Jefferson
Categories: Artificial Intelligence (cs.AI)
Comments:
点击查看摘要
Abstract:Indistinguishable objects often occur when modelling problems in constraint programming, as well as in other related paradigms. They occur when objects can be viewed as being drawn from a set of unlabelled objects, and the only operation allowed on them is equality testing. For example, the golfers in the social golfer problem are indistinguishable. If we do label the golfers, then any relabelling of the golfers in one solution gives another valid solution. Therefore, we can regard the symmetric group of size n as acting on a set of n indistinguishable objects. In this paper, we show how we can break the symmetries resulting from indistinguishable objects. We show how symmetries on indistinguishable objects can be defined properly in complex types, for example in a matrix indexed by indistinguishable objects. We then show how the resulting symmetries can be broken correctly. In Essence, a high-level modelling language, indistinguishable objects are encapsulated in “unnamed types”. We provide an implementation of complete symmetry breaking for unnamed types in Essence.
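To make the relabelling symmetry concrete, here is a minimal sketch that maps a sequence of indistinguishable objects to a canonical form by renaming each object in order of first appearance. Two assignments that differ only by a relabelling then collapse to the same representative. This is a simple illustration for flat sequences, not the Essence implementation described in the abstract; the function name is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def canonical_form(assignment: Sequence[Hashable]) -> List[int]:
    """Relabel objects as 0, 1, 2, ... in order of first appearance."""
    relabel: Dict[Hashable, int] = {}
    result: List[int] = []
    for value in assignment:
        if value not in relabel:
            # First time we see this object: give it the next unused label.
            relabel[value] = len(relabel)
        result.append(relabel[value])
    return result
```

Keeping only assignments that equal their own canonical form leaves exactly one representative per symmetry class for this flat case; nested types such as matrices indexed by indistinguishable objects need the more careful treatment the paper addresses.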
[AI-8] TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning
Link: https://arxiv.org/abs/2503.17195
Authors: Sheng Wang,Pengan Chen,Jingqi Zhou,Qintong Li,Jingwei Dong,Jiahui Gao,Boyang Xue,Jiyue Jiang,Lingpeng Kong,Chuan Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Model customization requires high-quality and diverse datasets, but acquiring such data remains challenging and costly. Although large language models (LLMs) can synthesize training data, current approaches are constrained by limited seed data, model bias and insufficient control over the generation process, resulting in limited diversity and biased distributions as data scales increase. To tackle this challenge, we present TreeSynth, a tree-guided subspace-based data synthesis framework that recursively partitions the entire data space into hierarchical subspaces, enabling comprehensive and diverse scaling of data synthesis. Briefly, given a task-specific description, we construct a data space partitioning tree by iteratively executing criteria determination and subspace coverage steps. This hierarchically divides the whole space (i.e., root node) into mutually exclusive and complementary atomic subspaces (i.e., leaf nodes). By collecting synthesized data according to the attributes of each leaf node, we obtain a diverse dataset that fully covers the data space. Empirically, our extensive experiments demonstrate that TreeSynth surpasses both human-designed datasets and the state-of-the-art data synthesis baselines, achieving maximum improvements of 45.2% in data diversity and 17.6% in downstream task performance across various models and tasks. We hope that TreeSynth provides a scalable solution to synthesize diverse and comprehensive datasets from scratch without human intervention.
[AI-9] LLMs Love Python: A Study of LLMs' Bias for Programming Languages and Libraries
Link: https://arxiv.org/abs/2503.17181
Authors: Lukas Twist,Jie M. Zhang,Mark Harman,Don Syme,Joost Noppen,Detlef Nauck
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure
Abstract:Programming language and library choices are crucial to software reliability and security. Poor or inconsistent choices can lead to increased technical debt, security vulnerabilities, and even catastrophic failures in safety-critical systems. As Large Language Models (LLMs) play an increasing role in code generation, it is essential to understand how they make these decisions. However, little is known about their preferences when selecting programming languages and libraries for different coding tasks. To fill this gap, this study provides the first in-depth investigation into LLM preferences for programming languages and libraries used when generating code. We assess the preferences of eight diverse LLMs by prompting them to complete various coding tasks, including widely-studied benchmarks and the more practical task of generating the initial structural code for new projects (a crucial step that often determines a project’s language or library choices). Our findings reveal that LLMs heavily favour Python when solving language-agnostic problems, using it in 90%-97% of cases for benchmark tasks. Even when generating initial project code where Python is not a suitable language, it remains the most-used language in 58% of instances. Moreover, LLMs contradict their own language recommendations in 83% of project initialisation tasks, raising concerns about their reliability in guiding language selection. Similar biases toward well-established libraries further create serious discoverability challenges for newer open-source projects. These results highlight the need to improve LLMs’ adaptability to diverse programming contexts and to develop mechanisms for mitigating programming language and library bias. 
[AI-10] DiTEC-WDN: A Large-Scale Dataset of Water Distribution Network Scenarios under Diverse Hydraulic Conditions
Link: https://arxiv.org/abs/2503.17167
Authors: Huy Truong,Andrés Tello,Alexander Lazovik,Victoria Degeler
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to Nature Scientific Data. Huy Truong and Andrés Tello contributed equally to this work. For the dataset, see this https URL
Abstract:Privacy restrictions hinder the sharing of real-world Water Distribution Network (WDN) models, limiting the application of emerging data-driven machine learning, which typically requires extensive observations. To address this challenge, we propose the dataset DiTEC-WDN that comprises 36,000 unique scenarios simulated over either short-term (24 hours) or long-term (1 year) periods. We constructed this dataset using an automated pipeline that optimizes crucial parameters (e.g., pressure, flow rate, and demand patterns), facilitates large-scale simulations, and records discrete, synthetic but hydraulically realistic states under standard conditions via rule validation and post-hoc analysis. With a total of 228 million generated graph-based states, DiTEC-WDN can support a variety of machine-learning tasks, including graph-level, node-level, and link-level regression, as well as time-series forecasting. This contribution, released under a public license, encourages open scientific research in the critical water sector, eliminates the risk of exposing sensitive data, and fulfills the need for a large-scale water distribution network benchmark for study comparisons and scenario analysis.
[AI-11] Leveraging Language Models for Out-of-Distribution Recovery in Reinforcement Learning
Link: https://arxiv.org/abs/2503.17125
Authors: Chan Kim,Seung-Woo Seo,Seong-Woo Kim
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 17 figures
Abstract:Deep Reinforcement Learning (DRL) has demonstrated strong performance in robotic control but remains susceptible to out-of-distribution (OOD) states, often resulting in unreliable actions and task failure. While previous methods have focused on minimizing or preventing OOD occurrences, they largely neglect recovery once an agent encounters such states. Although the latest research has attempted to address this by guiding agents back to in-distribution states, their reliance on uncertainty estimation hinders scalability in complex environments. To overcome this limitation, we introduce Language Models for Out-of-Distribution Recovery (LaMOuR), which enables recovery learning without relying on uncertainty estimation. LaMOuR generates dense reward codes that guide the agent back to a state where it can successfully perform its original task, leveraging the capabilities of LVLMs in image description, logical reasoning, and code generation. Experimental results show that LaMOuR substantially enhances recovery efficiency across diverse locomotion tasks and even generalizes effectively to complex environments, including humanoid locomotion and mobile manipulation, where existing methods struggle. The code and supplementary materials are available at this https URL.
[AI-12] Deterministic AI Agent Personality Expression through Standard Psychological Diagnostics
Link: https://arxiv.org/abs/2503.17085
Authors: J. M. Diederik Kruijssen,Nicholas Emmons(Allora Foundation)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 25 pages, 8 figures, 4 tables; appeared in ADI (March 2025)
Abstract:Artificial intelligence (AI) systems powered by large language models have become increasingly prevalent in modern society, enabling a wide range of applications through natural language interaction. As AI agents proliferate in our daily lives, their generic and uniform expressiveness presents a significant limitation to their appeal and adoption. Personality expression represents a key prerequisite for creating more human-like and distinctive AI systems. We show that AI models can express deterministic and consistent personalities when instructed using established psychological frameworks, with varying degrees of accuracy depending on model capabilities. We find that more advanced models like GPT-4o and o1 demonstrate the highest accuracy in expressing specified personalities across both Big Five and Myers-Briggs assessments, and further analysis suggests that personality expression emerges from a combination of intelligence and reasoning capabilities. Our results reveal that personality expression operates through holistic reasoning rather than question-by-question optimization, with response-scale metrics showing higher variance than test-scale metrics. Furthermore, we find that model fine-tuning affects communication style independently of personality expression accuracy. These findings establish a foundation for creating AI agents with diverse and consistent personalities, which could significantly enhance human-AI interaction across applications from education to healthcare, while additionally enabling a broader range of more unique AI agents. The ability to quantitatively assess and implement personality expression in AI systems opens new avenues for research into more relatable, trustworthy, and ethically designed AI.
[AI-13] A Thorough Assessment of the Non-IID Data Impact in Federated Learning
Link: https://arxiv.org/abs/2503.17070
Authors: Daniel M. Jimenez-Gutierrez,Mehrdad Hassanzadeh,Aris Anagnostopoulos,Ioannis Chatzigiannakis,Andrea Vitaletti
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Abstract:Federated learning (FL) allows collaborative machine learning (ML) model training among decentralized clients’ information, ensuring data privacy. The decentralized nature of FL deals with non-independent and identically distributed (non-IID) data. This open problem has notable consequences, such as decreased model performance and more significant convergence times. Despite its importance, experimental studies systematically addressing all types of data heterogeneity (a.k.a. non-IIDness) remain scarce. We aim to fill this gap by assessing and quantifying the non-IID effect through a thorough empirical analysis. We use the Hellinger Distance (HD) to measure differences in distribution among clients. Our study benchmarks four state-of-the-art strategies for handling non-IID data, including label, feature, quantity, and spatiotemporal skewness, under realistic and controlled conditions. This is the first comprehensive analysis of the spatiotemporal skew effect in FL. Our findings highlight the significant impact of label and spatiotemporal skew non-IID types on FL model performance, with notable performance drops occurring at specific HD thresholds. Additionally, the FL performance is heavily affected mainly when the non-IIDness is extreme. Thus, we provide recommendations for FL research to tackle data heterogeneity effectively. Our work represents the most extensive examination of non-IIDness in FL, offering a robust foundation for future research.
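The abstract uses the Hellinger Distance (HD) to quantify how much client distributions differ. A minimal sketch of the standard discrete form, H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), is given below; how the paper builds the per-client distributions is not stated here, so the label-count helper and both function names are assumptions for illustration.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn non-negative label counts into a probability distribution."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)
```

For example, comparing the label histograms of two clients via `hellinger(normalize(counts_a), normalize(counts_b))` gives a value in [0, 1] that could serve as a label-skew indicator.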
[AI-14] Replay4NCL: An Efficient Memory Replay-based Methodology for Neuromorphic Continual Learning in Embedded AI Systems
Link: https://arxiv.org/abs/2503.17061
Authors: Mishal Fatima Minhas,Rachmad Vidya Wicaksana Putra,Falah Awwad,Osman Hasan,Muhammad Shafique
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the 62nd Design Automation Conference (DAC) 2025, June 2025, San Francisco, CA, USA
Abstract:The Neuromorphic Continual Learning (NCL) paradigm leverages Spiking Neural Networks (SNNs) to enable continual learning (CL) capabilities for AI systems to adapt to dynamically changing environments. Currently, the state of the art employs a memory replay-based method to maintain the old knowledge. However, this technique relies on long timesteps and compression-decompression steps, thereby incurring significant latency and energy overheads, which are not suitable for tightly-constrained embedded AI systems (e.g., mobile agents/robotics). To address this, we propose Replay4NCL, a novel efficient memory replay-based methodology for enabling NCL in embedded AI systems. Specifically, Replay4NCL compresses the latent data (old knowledge), then replays them during the NCL training phase with small timesteps, to minimize the processing latency and energy consumption. To compensate for the information loss from reduced spikes, we adjust the neuron threshold potential and learning rate settings. Experimental results on the class-incremental scenario with the Spiking Heidelberg Digits (SHD) dataset show that Replay4NCL can preserve old knowledge with Top-1 accuracy of 90.43% compared to 86.22% from the state of the art, while effectively learning new tasks, achieving 4.88x latency speed-up, 20% latent memory saving, and 36.43% energy saving. These results highlight the potential of our Replay4NCL methodology to further advance NCL capabilities for embedded AI systems.
[AI-15] Data-Driven Optimization of EV Charging Station Placement Using Causal Discovery
Link: https://arxiv.org/abs/2503.17055
Authors: Julius Stephan Junker,Rong Hu,Ziyue Li,Wolfgang Ketter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review of IEEE CASE 2025; This is also the master thesis project from Julius supervised by Dr. Ziyue Li
Abstract:This paper addresses the critical challenge of optimizing electric vehicle charging station placement through a novel data-driven methodology employing causal discovery techniques. While traditional approaches prioritize economic factors or power grid constraints, they often neglect empirical charging patterns that ultimately determine station utilization. We analyze extensive charging data from Palo Alto and Boulder (337,344 events across 100 stations) to uncover latent relationships between station characteristics and utilization. Applying structural learning algorithms (NOTEARS and DAGMA) to this data reveals that charging demand is primarily determined by three factors: proximity to amenities, EV registration density, and adjacency to high-traffic routes. These findings, consistent across multiple algorithms and urban contexts, challenge conventional infrastructure distribution strategies. We develop an optimization framework that translates these insights into actionable placement recommendations, identifying locations likely to experience high utilization based on the discovered dependency structures. The resulting site selection model prioritizes strategic clustering in high-amenity areas with substantial EV populations rather than uniform spatial distribution. Our approach contributes a framework that integrates empirical charging behavior into infrastructure planning, potentially enhancing both station utilization and user convenience. By focusing on data-driven insights instead of theoretical distribution models, we provide a more effective strategy for expanding charging networks that can adjust to various stages of EV market development.
[AI-16] A Guide to Bayesian Networks Software Packages for Structure and Parameter Learning – 2025 Edition
Link: https://arxiv.org/abs/2503.17025
Authors: Joverlyn Gaudillo,Nicole Astrologo,Fabio Stella,Enzo Acerbi,Francesco Canonaco
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure
Abstract:A representation of the cause-effect mechanism is needed to enable artificial intelligence to represent how the world works. Bayesian Networks (BNs) have proven to be an effective and versatile tool for this task. BNs require constructing a structure of dependencies among variables and learning the parameters that govern these relationships. These tasks, referred to as structural learning and parameter learning, are actively investigated by the research community, with several algorithms proposed and no single method having established itself as standard. A wide range of software, tools, and packages have been developed for BN analysis and made available to academic researchers and industry practitioners. Because there is no one-size-fits-all solution, taking the first practical steps and getting oriented in this field is proving challenging for outsiders and beginners. In this paper, we review the most relevant tools and software for BN structural and parameter learning to date, providing our subjective recommendations directed to an audience of beginners. In addition, we provide an extensive easy-to-consult overview table summarizing all software packages and their main features. By improving the reader's understanding of which available software might best suit their needs, we improve accessibility to the field and make it easier for beginners to take their first step into it.
[AI-17] Symbolic Audio Classification via Modal Decision Tree Learning
Link: https://arxiv.org/abs/2503.17018
Authors: Enrico Marzano,Giovanni Pagliarini,Riccardo Pasini,Guido Sciavicco,Ionel Eduard Stan
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:The range of potential applications of acoustic analysis is wide. Classification of sounds, in particular, is a typical machine learning task that has received a lot of attention in recent years. The most common approaches to sound classification are sub-symbolic, typically based on neural networks, and result in black-box models with high performance but very low transparency. In this work, we consider several audio tasks, namely, age and gender recognition, emotion classification, and respiratory disease diagnosis, and we approach them with a symbolic technique, that is, (modal) decision tree learning. We prove that such tasks can be solved using the same symbolic pipeline, which allows the extraction of simple rules with very high accuracy and low complexity. In principle, all such tasks could be associated with an autonomous conversation system, which could be useful in different contexts, such as an automatic reservation agent for a hospital or a clinic.
[AI-18] Developing Critical Thinking in Second Language Learners: Exploring Generative AI like ChatGPT as a Tool for Argumentative Essay Writing
Link: https://arxiv.org/abs/2503.17013
Authors: Simon Suh,Jihyuk Bang,Ji Woo Han
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures. Uses Paul-Elder Critical Thinking Model and Tan’s argumentative writing framework. Includes an experimental study with 10 participants
Abstract:This study employs the Paul-Elder Critical Thinking Model and Tan’s argumentative writing framework to create a structured methodology. This methodology, ChatGPT Guideline for Critical Argumentative Writing (CGCAW) framework, integrates the models with ChatGPT’s capabilities to guide L2 learners in utilizing ChatGPT to enhance their critical thinking skills. A quantitative experiment was conducted with 10 participants from a state university, divided into experimental and control groups. The experimental group utilized the CGCAW framework, while the control group used ChatGPT without specific guidelines. Participants wrote an argumentative essay within a 40-minute timeframe, and essays were evaluated by three assessors: ChatGPT, Grammarly, and a course instructor. Results indicated that the experimental group showed improvements in clarity, logical coherence, and use of evidence, demonstrating ChatGPT’s potential to enhance specific aspects of argumentative writing. However, the control group performed better in overall language mechanics and articulation of main arguments, indicating areas where the CGCAW framework could be further refined. This study highlights the need for further research to optimize the use of AI tools like ChatGPT in L2 learning environments to enhance critical thinking and writing skills.
[AI-19] Targetless 6DoF Calibration of LiDAR and 2D Scanning Radar Based on Cylindrical Occupancy
Link: https://arxiv.org/abs/2503.17002
Authors: Weimin Wang,Yu Du,Ting Yang,Yu Liu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Owing to its capability for reliable, all-weather, long-range sensing, the fusion of LiDAR and Radar has been widely applied to autonomous vehicles for robust perception. In practical operation, even carefully hand-calibrated extrinsic parameters, which are crucial for the fusion of multi-modal sensors, may drift due to vibration. To address this issue, we present a novel targetless calibration approach, termed LiRaCo, for the extrinsic 6DoF calibration of LiDAR and Radar sensors. Although both types of sensors can obtain geometric information, bridging the geometric correspondences between multi-modal data without explicit artificial markers as cues is nontrivial, mainly due to the low vertical resolution of scanning Radar. To achieve the targetless calibration, LiRaCo leverages a spatial occupancy consistency between LiDAR point clouds and Radar scans in a common cylindrical representation, considering the increasing data sparsity with distance for both sensors. Specifically, LiRaCo expands the valid Radar scanned pixels into 3D occupancy grids to constrain LiDAR point clouds based on spatial consistency. Consequently, a cost function involving extrinsic calibration parameters is formulated based on the spatial overlap of 3D grids and LiDAR points. Extrinsic parameters are finally estimated by optimizing the cost function. Comprehensive quantitative and qualitative experiments on two real outdoor datasets with different LiDAR sensors demonstrate the feasibility and accuracy of the proposed method. The source code will be publicly available.
[AI-20] Real-Time Diffusion Policies for Games: Enhancing Consistency Policies with Q-Ensembles
Link: https://arxiv.org/abs/2503.16978
Authors: Ruoqi Zhang,Ziwei Luo,Jens Sjölund,Per Mattsson,Linus Gisslén,Alessandro Sestini
Categories: Artificial Intelligence (cs.AI)
Abstract:Diffusion models have shown impressive performance in capturing complex and multi-modal action distributions for game agents, but their slow inference speed prevents practical deployment in real-time game environments. While consistency models offer a promising approach for one-step generation, they often suffer from training instability and performance degradation when applied to policy learning. In this paper, we present CPQE (Consistency Policy with Q-Ensembles), which combines consistency models with Q-ensembles to address these challenges. CPQE leverages uncertainty estimation through Q-ensembles to provide more reliable value function approximations, resulting in better training stability and improved performance compared to classic double Q-network methods. Our extensive experiments across multiple game scenarios demonstrate that CPQE achieves inference speeds of up to 60 Hz – a significant improvement over state-of-the-art diffusion policies that operate at only 20 Hz – while maintaining comparable performance to multi-step diffusion approaches. CPQE consistently outperforms state-of-the-art consistency model approaches, showing both higher rewards and enhanced training stability throughout the learning process. These results indicate that CPQE offers a practical solution for deploying diffusion-based policies in games and other real-time applications where both multi-modal behavior modeling and rapid inference are critical requirements.
[AI-21] Neural-Guided Equation Discovery
Link: https://arxiv.org/abs/2503.16953
Authors: Jannis Brugger,Mattia Cerrato,David Richter,Cedric Derstroff,Daniel Maninger,Mira Mezini,Stefan Kramer
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages + 4 pages appendix, 9 figures, book chapter
Abstract:Deep learning approaches are becoming increasingly attractive for equation discovery. We show the advantages and disadvantages of using neural-guided equation discovery by giving an overview of recent papers and the results of experiments using our modular equation discovery system MGMT (Multi-Task Grammar-Guided Monte-Carlo Tree Search for Equation Discovery). The system uses neural-guided Monte-Carlo Tree Search (MCTS) and supports both supervised and reinforcement learning, with a search space defined by a context-free grammar. We summarize seven desirable properties of equation discovery systems, emphasizing the importance of embedding tabular data sets for such learning approaches. Using the modular structure of MGMT, we compare seven architectures (among them, RNNs, CNNs, and Transformers) for embedding tabular datasets on the auxiliary task of contrastive learning for tabular data sets on an equation discovery task. For almost all combinations of modules, supervised learning outperforms reinforcement learning. Moreover, our experiments indicate an advantage of using grammar rules as action space instead of tokens. Two adaptations of MCTS – risk-seeking MCTS and AmEx-MCTS – can improve equation discovery with that kind of search.
[AI-22] On-Sensor Convolutional Neural Networks with Early-Exits
Link: https://arxiv.org/abs/2503.16939
Authors: Hazem Hesham Yousef Shalby,Arianna De Vecchi,Alice Scandelli,Pietro Bartoli,Diana Trojaniello,Manuel Roveri,Federica Villa
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at IEEE SSCI
Abstract:Tiny Machine Learning (TinyML) is a novel research field aiming at integrating Machine Learning (ML) within embedded devices with limited memory, computation, and energy. Recently, a new branch of TinyML has emerged, focusing on integrating ML directly into the sensors to further reduce the power consumption of embedded devices. Interestingly, despite their state-of-the-art performance in many tasks, none of the current solutions in the literature aims to optimize the implementation of Convolutional Neural Networks (CNNs) operating directly into sensors. In this paper, we introduce for the first time in the literature the optimized design and implementation of Depth-First CNNs operating on the Intelligent Sensor Processing Unit (ISPU) within an Inertial Measurement Unit (IMU) by STMicroelectronics. Our approach partitions the CNN between the ISPU and the microcontroller (MCU) and employs an Early-Exit mechanism to stop the computations on the IMU when enough confidence about the results is achieved, hence significantly reducing power consumption. When using a NUCLEO-F411RE board, this solution achieved an average current consumption of 4.8 mA, marking an 11% reduction compared to the regular inference pipeline on the MCU, while having equal accuracy.
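The Early-Exit idea in the abstract, stopping computation once the intermediate prediction is confident enough, can be sketched generically as follows. This is a plain-Python illustration, not the authors' ISPU/MCU implementation: the stage interface (each stage returns updated features plus class probabilities), the function name, and the 0.9 default threshold are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

# A stage takes features and returns (new_features, class_probabilities).
Stage = Callable[[object], Tuple[object, Sequence[float]]]


def early_exit_predict(stages: List[Stage], x: object,
                       threshold: float = 0.9) -> Tuple[int, int]:
    """Run stages in order; return (predicted_class, index_of_exit_stage)."""
    if not stages:
        raise ValueError("at least one stage is required")
    features = x
    for index, stage in enumerate(stages):
        features, probs = stage(features)
        best = max(range(len(probs)), key=lambda i: probs[i])
        is_last = index == len(stages) - 1
        # Exit early when confident; the last stage always produces the answer.
        if probs[best] >= threshold or is_last:
            return best, index
    raise AssertionError("unreachable: the last stage always returns")
```

In a partitioned deployment like the one described, the first stage would correspond to the part running on the sensor and later stages to the part on the microcontroller, so an early exit avoids waking the more power-hungry component.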
[AI-23] Interpretable Machine Learning for Oral Lesion Diagnosis through Prototypical Instances Identification
Link: https://arxiv.org/abs/2503.16938
Authors: Alessio Cascione,Mattia Setzu,Federico A. Galatolo,Mario G.C.A. Cimino,Riccardo Guidotti
Categories: Artificial Intelligence (cs.AI)
Abstract:Decision-making processes in healthcare can be highly complex and challenging. Machine Learning tools offer significant potential to assist in these processes. However, many current methodologies rely on complex models that are not easily interpretable by experts. This underscores the need to develop interpretable models that can provide meaningful support in clinical decision-making. When approaching such tasks, humans typically compare the situation at hand to a few key examples and representative cases imprinted in their memory. Using an approach which selects such exemplary cases and grounds its predictions on them could contribute to obtaining high-performing interpretable solutions to such problems. To this end, we evaluate PivotTree, an interpretable prototype selection model, on an oral lesion detection problem, specifically trying to detect the presence of neoplastic, aphthous and traumatic ulcerated lesions from oral cavity images. We demonstrate the efficacy of using such method in terms of performance and offer a qualitative and quantitative comparison between exemplary cases and ground-truth prototypes selected by experts.
[AI-24] Rude Humans and Vengeful Robots: Examining Human Perceptions of Robot Retaliatory Intentions in Professional Settings
Link: https://arxiv.org/abs/2503.16932
Authors: Kate Letheren,Nicole Robinson
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This is the author version of the manuscript submitted to ACM Transactions on Human-Robot Interaction. The final version, if accepted, will be published by ACM and available via the ACM Digital Library. 12 pages, 1 figure, 2 tables
Abstract:Humans and robots are increasingly working in personal and professional settings. In workplace settings, humans and robots may work together as colleagues, potentially leading to social expectations, or violation thereof. Extant research has primarily sought to understand social interactions and expectations in personal rather than professional settings, and none of these studies have examined negative outcomes arising from violations of social expectations. This paper reports the results of a 2x3 online experiment that used a unique first-person perspective video to immerse participants in a collaborative workplace setting. The results are nuanced and reveal that while robots are expected to act in accordance with social expectations despite human behavior, there are benefits for robots perceived as being the bigger person in the face of human rudeness. Theoretical and practical implications are provided which discuss the import of these findings for the design of social robots.
[AI-25] RustEvo2: An Evolving Benchmark for API Evolution in LLM-based Rust Code Generation
Link: https://arxiv.org/abs/2503.16922
Authors: Linxi Liang,Jing Gong,Mingwei Liu,Chong Wang,Guangsheng Ou,Yanlin Wang,Xin Peng,Zibin Zheng
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Abstract:Large Language Models (LLMs) have become pivotal tools for automating code generation in software development. However, these models face significant challenges in producing version-aware code for rapidly evolving languages like Rust, where frequent Application Programming Interfaces (API) changes across versions lead to compatibility issues and correctness errors. Existing benchmarks lack systematic evaluation of how models navigate API transitions, relying on labor-intensive manual curation and offering limited version-specific insights. To address this gap, we present RustEvo, a novel framework for constructing dynamic benchmarks that evaluate the ability of LLMs to adapt to evolving Rust APIs. RustEvo automates dataset creation by synthesizing 588 API changes (380 from Rust standard libraries, 208 from 15 third-party crates) into programming tasks mirroring real-world challenges. These tasks cover four API evolution categories: Stabilizations, Signature Changes, Behavioral Changes, and Deprecations, reflecting their actual distribution in the Rust ecosystem. Experiments on state-of-the-art (SOTA) LLMs reveal significant performance variations: models achieve a 65.8% average success rate on stabilized APIs but only 38.0% on behavioral changes, highlighting difficulties in detecting semantic shifts without signature alterations. Knowledge cutoff dates strongly influence performance, with models scoring 56.1% on before-cutoff APIs versus 32.5% on after-cutoff tasks. Retrieval-Augmented Generation (RAG) mitigates this gap, improving success rates by 13.5% on average for APIs released after model training. Our findings underscore the necessity of our evolution-aware benchmarks to advance the adaptability of LLMs in fast-paced software ecosystems. The framework and the benchmarks are publicly released at this https URL. 
Cite as: arXiv:2503.16922 [cs.SE] (or arXiv:2503.16922v1 [cs.SE] for this version), https://doi.org/10.48550/arXiv.2503.16922
[AI-26] A New Segment Routing method with Swap Node Selection Strategy Based on Deep Reinforcement Learning for Software Defined Network
链接: https://arxiv.org/abs/2503.16914
作者: Miao Ye,Jihao Zheng,Qiuxiang Jiang,Yuan Huang,Ziheng Wang,Yong Wang
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Existing segment routing (SR) methods first determine the route and then apply path segmentation to select swap nodes that form a segment routing path (SRP), so the path must be re-segmented whenever the route changes. Furthermore, they do not consider flow table issuance time and therefore cannot maximize the speed of flow table issuance. To address these issues, this paper establishes an optimization model that jointly forms routing and path segmentation strategies, selecting appropriate swap nodes to reduce flow table issuance time. It also designs an intelligent segment routing algorithm based on deep reinforcement learning (DRL-SR) to solve the proposed model. First, a traffic matrix is designed as the state space for the deep reinforcement learning agent; this matrix includes multiple QoS performance indicators, flow table issuance time overhead and SR label stack depth. Second, the action selection strategy and corresponding reward function are designed: the agent selects the next node with routing in mind and then decides whether the newly added node becomes a swap node, with a reward that accounts for the time the controller needs to issue the flow table to that swap node. Finally, experiments show that, compared with existing methods, the proposed segment routing optimization model and the DRL-SR algorithm reduce the time overhead of establishing segment routes while optimizing performance metrics such as throughput, delay and packet loss.
[AI-27] MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
链接: https://arxiv.org/abs/2503.16905
作者: Jian Zhang,Zhiyuan Wang,Zhangqi Wang,Xinyu Zhang,Fangzhi Xu,Qika Lin,Rui Mao,Erik Cambria,Jun Liu
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal scientific problems (MSPs) involve complex issues that require the integration of multiple modalities, such as text and diagrams, presenting a significant challenge in artificial intelligence. While progress has been made in addressing traditional scientific problems, MSPs still face two primary issues: the challenge of multi-modal comprehensive reasoning in scientific problem-solving and the lack of reflective and rethinking capabilities. To address these issues, we introduce a Multi-Agent framework based on the Big Seven Personality and Socratic guidance (MAPS). This framework employs seven distinct agents that leverage feedback mechanisms and the Socratic method to guide the resolution of MSPs. To tackle the first issue, we propose a progressive four-agent solving strategy, where each agent focuses on a specific stage of the problem-solving process. For the second issue, we introduce a Critic agent, inspired by Socratic questioning, which prompts critical thinking and stimulates autonomous learning. We conduct extensive experiments on the EMMA, Olympiad, and MathVista datasets, achieving promising results that outperform the current SOTA model by 15.84% across all tasks. Meanwhile, the additional analytical experiments also verify the model’s progress as well as generalization ability.
[AI-28] Deep Learning for Human Locomotion Analysis in Lower-Limb Exoskeletons: A Comparative Study
链接: https://arxiv.org/abs/2503.16904
作者: Omar Coser,Christian Tamantini,Matteo Tortora,Leonardo Furia,Rosa Sicilia,Loredana Zollo,Paolo Soda
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 26 pages, 6 figures
点击查看摘要
Abstract:Wearable robotics for lower-limb assistance have become a pivotal area of research, aiming to enhance mobility for individuals with physical impairments or augment the performance of able-bodied users. Accurate and adaptive control systems are essential to ensure seamless interaction between the wearer and the robotic device, particularly when navigating diverse and dynamic terrains. Despite the recent advances in neural networks for time series analysis, no attempts have been made to classify ground conditions into five classes and subsequently determine the ramp’s slope and the stair’s height. In this respect, this paper presents an experimental comparison between eight deep neural network backbones to predict high-level locomotion parameters across diverse terrains. All the models are trained on the publicly available CAMARGO 2021 dataset. IMU-only data matched or outperformed IMU+EMG inputs, promoting a cost-effective and efficient design. Indeed, using three IMU sensors, the LSTM achieved high terrain classification accuracy (0.94 ± 0.04) and precise ramp slope estimation (1.95 ± 0.58°), and the CNN-LSTM achieved precise stair height estimation (15.65 ± 7.40 mm). As a further contribution, SHAP analysis justified sensor reduction without performance loss, ensuring a lightweight setup. The system operates with ~2 ms inference time, supporting real-time applications. The code is available at this https URL.
ACM classes: F.2.2, I.2.7
[AI-29] MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
链接: https://arxiv.org/abs/2503.16874
作者: Jian Zhang,Zhangqi Wang,Haiping Zhu,Jun Liu,Qika Lin,Erik Cambria
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The basic question-answering format of large language models involves inputting a prompt and receiving a response, and the quality of the prompt directly impacts the effectiveness of the response. Automated Prompt Optimization (APO) aims to break free from the cognitive biases of manually designed prompts and explores a broader design space for prompts. However, existing APO methods suffer from two key issues: the limited flexibility of fixed templates and inefficient search in prompt spaces. To this end, we propose a Multi-Agent framework Incorporating Socratic guidance (MARS), which utilizes multi-agent fusion technology for automatic planning, with gradual continuous optimization and evaluation. Specifically, MARS comprises seven agents, each with distinct functionalities, which autonomously use the Planner to devise an optimization path that ensures flexibility. Additionally, it employs a Teacher-Critic-Student Socratic dialogue pattern to iteratively optimize the prompts while conducting effective search. We conduct extensive experiments on various datasets to validate the effectiveness of our method, and perform additional analytical experiments to assess the model’s advancement as well as the interpretability.
[AI-30] In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
链接: https://arxiv.org/abs/2503.16861
作者: Shayne Longpre,Kevin Klyman,Ruth E. Appel,Sayash Kapoor,Rishi Bommasani,Michelle Sahar,Sean McGregor,Avijit Ghosh,Borhane Blili-Hamelin,Nathan Butters,Alondra Nelson,Amit Elazari,Andrew Sellars,Casey John Ellis,Dane Sherrets,Dawn Song,Harley Geiger,Ilona Cohen,Lauren McIlvenny,Madhulika Srikumar,Mark M. Jaycox,Markus Anderljung,Nadine Farid Johnson,Nicholas Carlini,Nicolas Miailhe,Nik Marda,Peter Henderson,Rebecca S. Portnoff,Rebecca Weiss,Victoria Westerhoff,Yacine Jernite,Rumman Chowdhury,Percy Liang,Arvind Narayanan
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers’ GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
[AI-31] Physics-Informed Neural Network Surrogate Models for River Stage Prediction
链接: https://arxiv.org/abs/2503.16850
作者: Maximilian Zoch,Edward Holmberg,Pujan Pokhrel,Ken Pathak,Steven Sloan,Kendall Niles,Jay Ratcliff,Maik Flanagin,Elias Ioup,Christian Guetl,Mahdi Abdelguerfi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:This work investigates the feasibility of using Physics-Informed Neural Networks (PINNs) as surrogate models for river stage prediction, aiming to reduce computational cost while maintaining predictive accuracy. Our primary contribution demonstrates that PINNs can successfully approximate HEC-RAS numerical solutions when trained on a single river, achieving strong predictive accuracy with generally low relative errors, though some river segments exhibit higher deviations. By integrating the governing Saint-Venant equations into the learning process, the proposed PINN-based surrogate model enforces physical consistency and significantly improves computational efficiency compared to HEC-RAS. We evaluate the model’s performance in terms of accuracy and computational speed, demonstrating that it closely approximates HEC-RAS predictions while enabling real-time inference. These results highlight the potential of PINNs as effective surrogate models for single-river hydrodynamics, offering a promising alternative for computationally efficient river stage forecasting. Future work will explore techniques to enhance PINN training stability and robustness across a more generalized multi-river model.
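Editor's sketch for the abstract above: the abstract does not give its equations, so the block below shows a commonly used one-dimensional form of the Saint-Venant equations (continuity and momentum, assuming no lateral inflow) together with a generic PINN-style loss. Here A is the flow cross-sectional area, Q the discharge, h the flow depth, g gravitational acceleration, S_0 the bed slope and S_f the friction slope. The split into data and physics terms and the weight \lambda are illustrative assumptions, not the paper's stated formulation.

```latex
\begin{align}
\frac{\partial A}{\partial t} + \frac{\partial Q}{\partial x} &= 0, \\
\frac{\partial Q}{\partial t}
  + \frac{\partial}{\partial x}\!\left(\frac{Q^{2}}{A}\right)
  + g A \,\frac{\partial h}{\partial x} &= g A \,(S_0 - S_f), \\
\mathcal{L}(\theta) &= \mathcal{L}_{\text{data}}(\theta)
  + \lambda \, \mathcal{L}_{\text{physics}}(\theta).
\end{align}
```

In a typical PINN setup, the physics term would be the mean squared residual of the first two equations evaluated on the network's outputs at sampled points (x, t), while the data term would measure mismatch against reference values such as HEC-RAS outputs.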
[AI-32] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
链接: https://arxiv.org/abs/2503.16806
作者: Jiangran Lyu,Ziming Li,Xuesong Shi,Chaoyi Xu,Yizhou Wang,He Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
点击查看摘要
Abstract:Nonprehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they heavily rely on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% in simulation using only single-view point cloud observations. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.
[AI-33] Causally Aligned Curriculum Learning ICLR2024
链接: https://arxiv.org/abs/2503.16799
作者: Mingxuan Li,Junzhe Zhang,Elias Bareinboim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as Posters in ICLR 2024
点击查看摘要
Abstract:A pervasive challenge in Reinforcement Learning (RL) is the “curse of dimensionality” which is the exponential growth in the state-action space when optimizing a high-dimensional target task. The framework of curriculum learning trains the agent in a curriculum composed of a sequence of related and more manageable source tasks. The expectation is that when some optimal decision rules are shared across source tasks and the target task, the agent could more quickly pick up the necessary skills to behave optimally in the environment, thus accelerating the learning process. However, this critical assumption of invariant optimal decision rules does not necessarily hold in many practical applications, specifically when the underlying environment contains unobserved confounders. This paper studies the problem of curriculum RL through causal lenses. We derive a sufficient graphical condition characterizing causally aligned source tasks, i.e., the invariance of optimal decision rules holds. We further develop an efficient algorithm to generate a causally aligned curriculum, provided with qualitative causal knowledge of the target task. Finally, we validate our proposed methodology through experiments in discrete and continuous confounded tasks with pixel observations.
[AI-34] A Learnability Analysis on Neuro-Symbolic Learning
链接: https://arxiv.org/abs/2503.16797
作者: Hao-Yuan He,Ming Li
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper analyzes the learnability of neuro-symbolic (NeSy) tasks within hybrid systems. We show that the learnability of NeSy tasks can be characterized by their derived constraint satisfaction problems (DCSPs). Specifically, a task is learnable if the corresponding DCSP has a unique solution; otherwise, it is unlearnable. For learnable tasks, we establish error bounds by exploiting the clustering property of the hypothesis space. Additionally, we analyze the asymptotic error for general NeSy tasks, showing that the expected error scales with the disagreement among solutions. Our results offer a principled approach to determining learnability and provide insights into the design of new algorithms.
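Editor's sketch for the abstract above: a toy illustration of the "unique solution" criterion. The abstract does not say how the derived constraint satisfaction problem (DCSP) is built, so everything here is an assumption: the helper `dcsp_solutions`, the two-symbol task, and the addition rule are hypothetical. The idea shown is only that a task is ambiguous when several symbol-to-concept mappings satisfy all observed constraints, and becomes identifiable when exactly one does.

```python
from itertools import product


def dcsp_solutions(symbols, values, observations, rule):
    """Enumerate every symbol-to-value mapping consistent with all observations.

    Each observation is (input_symbols, label); a mapping is consistent when
    applying `rule` to the mapped inputs reproduces the label.
    (Hypothetical helper for illustration only.)
    """
    solutions = []
    for assignment in product(values, repeat=len(symbols)):
        mapping = dict(zip(symbols, assignment))
        if all(rule(*(mapping[s] for s in inputs)) == label
               for inputs, label in observations):
            solutions.append(mapping)
    return solutions


def add(x, y):
    """Toy symbolic rule: the label is the sum of the two hidden concepts."""
    return x + y


# Only one observation: "a + b = 1" admits two mappings, so it is ambiguous.
ambiguous = dcsp_solutions(["a", "b"], [0, 1], [(("a", "b"), 1)], add)

# Adding "a + a = 0" pins the mapping down to a single solution.
unique = dcsp_solutions(
    ["a", "b"], [0, 1], [(("a", "b"), 1), (("a", "a"), 0)], add
)

print(len(ambiguous), unique)
```

Brute-force enumeration is used only because the toy search space has four candidate mappings; it is not meant as a practical learnability test.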
[AI-35] “The Diagram is like Guardrails”: Structuring GenAI-assisted Hypotheses Exploration with an Interactive Shared Representation
链接: https://arxiv.org/abs/2503.16791
作者: Zijian Ding,Michelle Brachman,Joel Chan,Werner Geyer
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Data analysis encompasses a spectrum of tasks, from high-level conceptual reasoning to lower-level execution. While AI-powered tools increasingly support execution tasks, there remains a need for intelligent assistance in conceptual tasks. This paper investigates the design of an ordered node-link tree interface augmented with AI-generated information hints and visualizations, as a potential shared representation for hypothesis exploration. Through a design probe (n=22), participants generated diagrams averaging 21.82 hypotheses. Our findings showed that the node-link diagram acts as “guardrails” for hypothesis exploration, facilitating structured workflows, providing comprehensive overviews, and enabling efficient backtracking. The AI-generated information hints, particularly visualizations, aided users in transforming abstract ideas into data-backed concepts while reducing cognitive load. We further discuss how node-link diagrams can support both parallel exploration and iterative refinement in hypothesis formulation, potentially enhancing the breadth and depth of human-AI collaborative data analysis.
[AI-36] Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study
链接: https://arxiv.org/abs/2503.16788
作者: Li Zhang,Longxi Gao,Mengwei Xu
类目: Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reasoning capabilities have significantly improved the performance of vision-language models (VLMs) in domains such as mathematical problem-solving, coding, and visual question-answering. However, their impact on real-world applications remains unclear. This paper presents the first empirical study on the effectiveness of reasoning-enabled VLMs in mobile GUI agents, a domain that requires interpreting complex screen layouts, understanding user instructions, and executing multi-turn interactions. We evaluate two pairs of commercial models (Gemini 2.0 Flash and Claude 3.7 Sonnet), comparing their base and reasoning-enhanced versions across two static benchmarks (ScreenSpot and AndroidControl) and one interactive environment (AndroidWorld). Surprisingly, we find that the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld. However, reasoning VLMs generally offer marginal improvements over non-reasoning models on static benchmarks and even degrade performance in some agent setups. Notably, reasoning and non-reasoning VLMs fail on different sets of tasks, suggesting that reasoning does have an impact, but its benefits and drawbacks counterbalance each other. We attribute these inconsistencies to the limitations of benchmarks and VLMs. Based on the findings, we provide insights for further enhancing mobile GUI agents in terms of benchmarks, VLMs, and their adaptability in dynamically invoking reasoning VLMs. The experimental data are publicly available at this https URL.
[AI-37] SuperARC: A Test for General and Super Intelligence Based on First Principles of Recursion Theory and Algorithmic Probability
链接: https://arxiv.org/abs/2503.16743
作者: Alberto Hernández-Espinosa,Luan Ozelim,Felipe S. Abrahão,Hector Zenil
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 45 pages + Technical Supplementary Information, 71 pages total
点击查看摘要
Abstract:We introduce an open-ended test grounded in algorithmic probability that can avoid benchmark contamination in the quantitative evaluation of frontier models in the context of their Artificial General Intelligence (AGI) and Superintelligence (ASI) claims. Unlike other tests, this test does not rely on statistical compression methods (such as GZIP or LZW), which are more closely related to Shannon entropy than to Kolmogorov complexity. The test challenges aspects related to features of intelligence of fundamental nature such as synthesis and model creation in the context of inverse problems (generating new knowledge from observation). We argue that metrics based on model abstraction and optimal Bayesian inference for planning can provide a robust framework for testing intelligence, including natural intelligence (human and animal), narrow AI, AGI, and ASI. Our results show no clear evidence of LLM convergence towards a defined level of intelligence, particularly AGI or ASI. We found that LLM model versions tend to be fragile and incremental, as new versions may perform worse than older ones, with progress largely driven by the size of training data. The results were compared with a hybrid neurosymbolic approach that theoretically guarantees model convergence from optimal inference based on the principles of algorithmic probability and Kolmogorov complexity. The method outperforms LLMs in a proof-of-concept on short binary sequences. Our findings confirm suspicions regarding the fundamental limitations of LLMs, exposing them as systems optimised for the perception of mastery over human language. Progress among different LLM versions from the same developers was found to be inconsistent and limited, particularly in the absence of a solid symbolic counterpart.
[AI-38] Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models
链接: https://arxiv.org/abs/2503.16734
作者: Chengkai Huang,Junda Wu,Yu Xia,Zixu Yu,Ruhan Wang,Tong Yu,Ruiyi Zhang,Ryan A. Rossi,Branislav Kveton,Dongruo Zhou,Julian McAuley,Lina Yao
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Recent breakthroughs in Large Language Models (LLMs) have led to the emergence of agentic AI systems that extend beyond the capabilities of standalone models. By empowering LLMs to perceive external environments, integrate multimodal information, and interact with various tools, these agentic systems exhibit greater autonomy and adaptability across complex tasks. This evolution brings new opportunities to recommender systems (RS): LLM-based Agentic RS (LLM-ARS) can offer more interactive, context-aware, and proactive recommendations, potentially reshaping the user experience and broadening the application scope of RS. Despite promising early results, fundamental challenges remain, including how to effectively incorporate external knowledge, balance autonomy with controllability, and evaluate performance in dynamic, multimodal settings. In this perspective paper, we first present a systematic analysis of LLM-ARS: (1) clarifying core concepts and architectures; (2) highlighting how agentic capabilities – such as planning, memory, and multimodal reasoning – can enhance recommendation quality; and (3) outlining key research questions in areas such as safety, efficiency, and lifelong personalization. We also discuss open problems and future directions, arguing that LLM-ARS will drive the next wave of RS innovation. Ultimately, we foresee a paradigm shift toward intelligent, autonomous, and collaborative recommendation experiences that more closely align with users’ evolving needs and complex decision-making processes.
[AI-39] Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
链接: https://arxiv.org/abs/2503.16724
作者: Zhaoxin Li,Zhang Xi-Jia,Batuhan Altundas,Letian Chen,Rohan Paleja,Matthew Gombolay
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Semantic Interpretability in Reinforcement Learning (RL) enables transparency, accountability, and safer deployment by making the agent’s decisions understandable and verifiable. Achieving this, however, requires a feature space composed of human-understandable concepts, which traditionally rely on human specification and fail to generalize to unseen environments. In this work, we introduce Semantically Interpretable Reinforcement Learning with Vision-Language Models Empowered Automation (SILVA), an automated framework that leverages pre-trained vision-language models (VLM) for semantic feature extraction and interpretable tree-based models for policy optimization. SILVA first queries a VLM to identify relevant semantic features for an unseen environment, then extracts these features from the environment. Finally, it trains an Interpretable Control Tree via RL, mapping the extracted features to actions in a transparent and interpretable manner. To address the computational inefficiency of extracting features directly with VLMs, we develop a feature extraction pipeline that generates a dataset for training a lightweight convolutional network, which is subsequently used during RL. By leveraging VLMs to automate tree-based RL, SILVA removes the reliance on human annotation previously required by interpretable models while also overcoming the inability of VLMs alone to generate valid robot policies, enabling semantically interpretable reinforcement learning without human-in-the-loop.
[AI-40] Limits of trust in medical AI
链接: https://arxiv.org/abs/2503.16692
作者: Joshua Hatherley
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages
点击查看摘要
Abstract:Artificial intelligence (AI) is expected to revolutionize the practice of medicine. Recent advancements in the field of deep learning have demonstrated success in a variety of clinical tasks: detecting diabetic retinopathy from images, predicting hospital readmissions, aiding in the discovery of new drugs, etc. AI’s progress in medicine, however, has led to concerns regarding the potential effects of this technology upon relationships of trust in clinical practice. In this paper, I will argue that there is merit to these concerns, since AI systems can be relied upon, and are capable of reliability, but cannot be trusted, and are not capable of trustworthiness. Insofar as patients are required to rely upon AI systems for their medical decision-making, there is potential for this to produce a deficit of trust in relationships in clinical practice.
[AI-41] GauRast: Enhancing GPU Triangle Rasterizers to Accelerate 3D Gaussian Splatting
链接: https://arxiv.org/abs/2503.16681
作者: Sixu Li,Ben Keller,Yingyan Celine Lin,Brucek Khailany
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: DAC 2025
点击查看摘要
Abstract:3D intelligence leverages rich 3D features and stands as a promising frontier in AI, with 3D rendering fundamental to many downstream applications. 3D Gaussian Splatting (3DGS), an emerging high-quality 3D rendering method, requires significant computation, making real-time execution on existing GPU-equipped edge devices infeasible. Previous efforts to accelerate 3DGS rely on dedicated accelerators that require substantial integration overhead and hardware costs. This work proposes an acceleration strategy that leverages the similarities between the 3DGS pipeline and the highly optimized conventional graphics pipeline in modern GPUs. Instead of developing a dedicated accelerator, we enhance existing GPU rasterizer hardware to efficiently support 3DGS operations. Our results demonstrate a 23× increase in processing speed and a 24× reduction in energy consumption, with improvements yielding 6× faster end-to-end runtime for the original 3DGS algorithm and 4× for the latest efficiency-improved pipeline, achieving 24 FPS and 46 FPS respectively. These enhancements incur only a minimal area overhead of 0.2% relative to the entire SoC chip area, underscoring the practicality and efficiency of our approach for enabling 3DGS rendering on resource-constrained platforms.
[AI-42] Echoes of Power: Investigating Geopolitical Bias in US and China Large Language Models
链接: https://arxiv.org/abs/2503.16679
作者: Andre G. C. Pacheco,Athus Cavalini,Giovanni Comarela
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have emerged as powerful tools for generating human-like text, transforming human-machine interactions. However, their widespread adoption has raised concerns about their potential to influence public opinion and shape political narratives. In this work, we investigate the geopolitical biases in US and Chinese LLMs, focusing on how these models respond to questions related to geopolitics and international relations. We collected responses from ChatGPT and DeepSeek to a set of geopolitical questions and evaluated their outputs through both qualitative and quantitative analyses. Our findings show notable biases in both models, reflecting distinct ideological perspectives and cultural influences. However, despite these biases, for a set of questions, the models’ responses are more aligned than expected, indicating that they can address sensitive topics without necessarily presenting directly opposing viewpoints. This study highlights the potential of LLMs to shape public discourse and underscores the importance of critically assessing AI-generated content, particularly in politically sensitive contexts.
[AI-43] Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
链接: https://arxiv.org/abs/2503.16672
作者: Daniel Haziza,Timothy Chou,Dhruv Choudhary,Luca Wehrstedt,Francisco Massa,Jiecao Yu,Geonhwa Jeong,Supriya Rao,Patrick Labatut,Jesse Cai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
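Editor's sketch for the abstract above: a minimal pure-Python illustration of the 2:4 pattern, in which at most two of every four consecutive values are nonzero. The function names are hypothetical, and keeping the two largest-magnitude entries per group is an assumed selection rule; the abstract does not say how groups with more than two nonzeros are handled. The example also shows why Squared-ReLU outputs are a natural fit: negative pre-activations become exact zeros, so many groups already satisfy the pattern.

```python
def squared_relu(x):
    """Squared-ReLU: max(x, 0) squared, so every negative input maps to exactly 0."""
    return max(x, 0.0) ** 2


def sparsify_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the others.

    Assumed selection rule for illustration; not taken from the paper.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def is_2_4_sparse(row):
    """True when every group of 4 contains at most 2 nonzero entries."""
    return all(
        sum(1 for v in row[s:s + 4] if v != 0.0) <= 2
        for s in range(0, len(row), 4)
    )


pre_activations = [-1.0, 0.5, -2.0, 3.0, 0.2, -0.1, -0.3, -4.0]
activations = [squared_relu(x) for x in pre_activations]

# These activations are already 2:4 sparse, so pruning changes nothing.
print(is_2_4_sparse(activations), sparsify_2_4(activations) == activations)

# A dense group loses its two smallest-magnitude entries.
print(sparsify_2_4([1.0, 2.0, 3.0, 4.0]))
```

On hardware with 2:4 support, the speedup comes from storing and multiplying only the kept values; this sketch models the pattern itself, not the accelerated kernels.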
[AI-44] Aligning Text-to-Music Evaluation with Human Preferences
链接: https://arxiv.org/abs/2503.16669
作者: Yichen Huang,Zachary Novack,Koichi Saito,Jiatong Shi,Shinji Watanabe,Yuki Mitsufuji,John Thickstun,Chris Donahue
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find that not only is the standard FAD setup inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture desiderata, and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
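Editor's sketch for the abstract above: the reported numbers are rank correlations, which measure how well a metric's ordering of systems matches a reference ordering. The abstract does not name the coefficient, so Spearman's rho is an assumption here, and the degradation levels and metric scores below are made-up illustrative values, not data from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical example: a divergence metric should grow with degradation level.
degradation_level = [0, 1, 2, 3, 4]
metric_score = [0.10, 0.15, 0.30, 0.28, 0.50]  # one adjacent pair is out of order

rho = spearman(degradation_level, metric_score)
print(round(rho, 3))
```

A value near 1 means the metric orders the degraded samples almost exactly as intended, while a value near 0 means its ordering is close to unrelated.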
[AI-45] Code Evolution Graphs: Understanding Large Language Model Driven Design of Algorithms GECCO2025
链接: https://arxiv.org/abs/2503.16668
作者: Niki van Stein,Anna V. Kononova,Lars Kotthoff,Thomas Bäck
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Accepted at GECCO 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated great promise in generating code, especially when used inside an evolutionary computation framework to iteratively optimize the generated algorithms. However, in some cases they fail to generate competitive algorithms or the code optimization stalls, and we are left with no recourse because of a lack of understanding of the generation process and generated codes. We present a novel approach to mitigate this problem by enabling users to analyze the generated codes inside the evolutionary process and how they evolve over repeated prompting of the LLM. We show results for three benchmark problem classes and demonstrate novel insights. In particular, LLMs tend to generate more complex code with repeated prompting, but additional complexity can hurt algorithmic performance in some cases. Different LLMs have different coding “styles” and generated code tends to be dissimilar to other LLMs. These two findings suggest that using different LLMs inside the code evolution frameworks might produce higher performing code than using only one LLM.
[AI-46] Explainable AI-Guided Efficient Approximate DNN Generation for Multi-Pod Systolic Arrays
Link: https://arxiv.org/abs/2503.16583
Authors: Ayesha Siddique, Khurram Khalil, Khaza Anuarul Hoque
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: This paper has been accepted in the ISQED 2025 conference
Abstract:Approximate deep neural networks (AxDNNs) are promising for enhancing energy efficiency in real-world devices. One of the key contributors behind this enhanced energy efficiency in AxDNNs is the use of approximate multipliers. Unfortunately, the simulation of approximate multipliers does not usually scale well on CPUs and GPUs. As a consequence, this slows down the overall simulation of AxDNNs aimed at identifying the appropriate approximate multipliers to achieve high energy efficiency with a minimum accuracy loss. To address this problem, we present a novel XAI-Gen methodology, which leverages the analytical model of the emerging hardware accelerator (e.g., Google TPU v4) and explainable artificial intelligence (XAI) to precisely identify the non-critical layers for approximation and quickly discover the appropriate approximate multipliers for AxDNN layers. Our results show that XAI-Gen achieves up to 7x lower energy consumption with only 1-2% accuracy loss. We also showcase the effectiveness of the XAI-Gen approach through a neural architecture search (XAI-NAS) case study. Interestingly, XAI-NAS achieves 40% higher energy efficiency with up to 5x less execution time when compared to the state-of-the-art NAS methods for generating AxDNNs.
[AI-47] Machine Learning-Based Genomic Linguistic Analysis (Gene Sequence Feature Learning): A Case Study on Predicting Heavy Metal Response Genes in Rice
Link: https://arxiv.org/abs/2503.16582
Authors: Ruiqi Yang, Jianxu Wang, Wei Yuan, Xun Wang, Mei Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Abstract:This study explores the application of machine learning-based genetic linguistics for identifying heavy metal response genes in rice (Oryza sativa). By integrating convolutional neural networks and random forest algorithms, we developed a hybrid model capable of extracting and learning meaningful features from gene sequences, such as k-mer frequencies and physicochemical properties. The model was trained and tested on datasets of genes, achieving high predictive performance (precision: 0.89, F1-score: 0.82). RNA-seq and qRT-PCR experiments conducted on rice leaves exposed to Hg0 revealed differential expression of genes associated with heavy metal responses, which validated the model’s predictions. Co-expression network analysis identified 103 related genes, and a literature review indicated that these genes are highly likely to be involved in heavy metal-related biological processes. By integrating and comparing the analysis results with those of differentially expressed genes (DEGs), the validity of the new machine learning method was further demonstrated. This study highlights the efficacy of combining machine learning with genetic linguistics for large-scale gene prediction. It demonstrates a cost-effective and efficient approach for uncovering molecular mechanisms underlying heavy metal responses, with potential applications in developing stress-tolerant crop varieties.
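As a rough illustration of the k-mer frequency features mentioned in the abstract above, the sketch below counts overlapping k-mers in a nucleotide sequence. It is not the authors' pipeline: the function name `kmer_frequencies`, the choice of k=3, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Return the relative frequency of each overlapping k-mer in a DNA sequence."""
    seq = sequence.upper()
    # Slide a window of width k over the sequence and skip any window
    # that contains a character other than A, C, G, or T.
    kmers = [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    ]
    counts = Counter(kmers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequence, not taken from the paper's data.
freqs = kmer_frequencies("ATGCGATGA", k=3)
```

A vector of such frequencies (one entry per possible k-mer) is one common way to turn a variable-length sequence into fixed-length input for a classifier.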
[AI-48] Feature selection strategies for optimized heart disease diagnosis using ML and DL models
Link: https://arxiv.org/abs/2503.16577
Authors: Bilal Ahmad, Jinfu Chen, Haibao Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Heart disease remains one of the leading causes of morbidity and mortality worldwide, necessitating the development of effective diagnostic tools to enable early diagnosis and clinical decision-making. This study evaluates the impact of feature selection techniques Mutual Information (MI), Analysis of Variance (ANOVA), and Chi-Square on the predictive performance of various machine learning (ML) and deep learning (DL) models using a dataset of clinical indicators for heart disease. Eleven ML/DL models were assessed using metrics such as precision, recall, AUC score, F1-score, and accuracy. Results indicate that MI outperformed other methods, particularly for advanced models like neural networks, achieving the highest accuracy of 82.3% and recall score of 0.94. Logistic regression (accuracy 82.1%) and random forest (accuracy 80.99%) also demonstrated improved performance with MI. Simpler models such as Naive Bayes and decision trees achieved comparable results with ANOVA and Chi-Square, yielding accuracies of 76.45% and 75.99%, respectively, making them computationally efficient alternatives. Conversely, k Nearest Neighbors (KNN) and Support Vector Machines (SVM) exhibited lower performance, with accuracies ranging between 51.52% and 54.43%, regardless of the feature selection method. This study provides a comprehensive comparison of feature selection methods for heart disease prediction, demonstrating the critical role of feature selection in optimizing model performance. The results offer practical guidance for selecting appropriate feature selection techniques based on the chosen classification algorithm, contributing to the development of more accurate and efficient diagnostic tools for enhanced clinical decision-making in cardiology.
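To make the idea of mutual-information feature selection concrete, here is a minimal sketch that scores discrete features against a binary label and ranks them. It is an illustration only, not the study's code: the feature names, the toy values, and the helper names `mutual_information` and `rank_features` are hypothetical, and real clinical features would usually need discretization first.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    p_x = Counter(feature)
    p_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log(c * n / (p_x[x] * p_y[y]))
    return mi


def rank_features(columns, labels):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two binary indicators and a binary diagnosis label.
columns = {
    "chest_pain": [1, 1, 0, 0, 1, 0],
    "fasting_sugar": [1, 0, 1, 0, 1, 0],
}
labels = [1, 1, 0, 0, 1, 0]
ranked = rank_features(columns, labels)
```

In this toy example `chest_pain` matches the label exactly, so it receives the maximum possible score and is ranked first; a selector would then keep the top-scoring features before training a model.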
[AI-49] AUV Acceleration Prediction Using DVL and Deep Learning
Link: https://arxiv.org/abs/2503.16573
Authors: Yair Stolero, Itzik Klein
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Abstract:Autonomous underwater vehicles (AUVs) are essential for various applications, including oceanographic surveys, underwater mapping, and infrastructure inspections. Accurate and robust navigation is critical to completing these tasks. To this end, a Doppler velocity log (DVL) and inertial sensors are fused together. Recently, a model-based approach demonstrated the ability to extract the vehicle acceleration vector from DVL velocity measurements. Motivated by this advancement, in this paper we present an end-to-end deep learning approach to estimate the AUV acceleration vector based on past DVL velocity measurements. Based on recorded data from sea experiments, we demonstrate that the proposed method improves acceleration vector estimation by more than 65% compared to the model-based approach by using data-driven techniques. As a result, our data-driven approach can enhance navigation accuracy and reliability in AUV applications, contributing to more efficient and effective underwater missions.
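To show what "estimating acceleration from past DVL velocity measurements" means at its simplest, the sketch below applies a naive finite difference to consecutive velocity samples. This is neither the model-based approach nor the deep learning method from the paper; the function name, the 1-second sampling interval, and the sample values are assumptions for illustration.

```python
def finite_difference_acceleration(velocities, dt):
    """Estimate acceleration vectors from consecutive 3-D velocity samples.

    velocities: list of (vx, vy, vz) tuples in m/s, ordered in time.
    dt: sampling interval in seconds (assumed constant).
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    # Pair each sample with its successor and difference component-wise.
    return [
        tuple((later - earlier) / dt for earlier, later in zip(v0, v1))
        for v0, v1 in zip(velocities, velocities[1:])
    ]


# Hypothetical velocity samples, assumed to arrive once per second.
dvl_velocities = [(1.0, 0.0, 0.0), (1.2, 0.1, 0.0), (1.5, 0.1, -0.1)]
accelerations = finite_difference_acceleration(dvl_velocities, dt=1.0)
```

Plain differencing amplifies measurement noise, which is one reason more elaborate estimators, including learned ones, are of interest.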
[AI-50] Efficient ANN-Guided Distillation: Aligning Rate-based Features of Spiking Neural Networks through Hybrid Block-wise Replacement
Link: https://arxiv.org/abs/2503.16572
Authors: Shu Yang, Chengting Yu, Lei Liu, Hanzhi Ma, Aili Wang, Erping Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Spiking Neural Networks (SNNs) have garnered considerable attention as a potential alternative to Artificial Neural Networks (ANNs). Recent studies have highlighted SNNs’ potential on large-scale datasets. For SNN training, two main approaches exist: direct training and ANN-to-SNN (ANN2SNN) conversion. To fully leverage existing ANN models in guiding SNN learning, either direct ANN-to-SNN conversion or ANN-SNN distillation training can be employed. In this paper, we propose an ANN-SNN distillation framework from the ANN-to-SNN perspective, designed with a block-wise replacement strategy for ANN-guided learning. By generating intermediate hybrid models that progressively align SNN feature spaces to those of ANN through rate-based features, our framework naturally incorporates rate-based backpropagation as a training method. Our approach achieves results comparable to or better than state-of-the-art SNN distillation methods, showing both training and learning efficiency.
[AI-51] Advancing Problem-Based Learning in Biomedical Engineering in the Era of Generative AI
Link: https://arxiv.org/abs/2503.16558
Authors: Micky C. Nnamdi, J. Ben Tamo, Wenqi Shi, May D. Wang
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Abstract:Problem-Based Learning (PBL) has significantly impacted biomedical engineering (BME) education since its introduction in the early 2000s, effectively enhancing critical thinking and real-world knowledge application among students. With biomedical engineering rapidly converging with artificial intelligence (AI), integrating effective AI education into established curricula has become challenging yet increasingly necessary. Recent advancements, including AI’s recognition by the 2024 Nobel Prize, have highlighted the importance of training students comprehensively in biomedical AI. However, effective biomedical AI education faces substantial obstacles, such as diverse student backgrounds, limited personalized mentoring, constrained computational resources, and difficulties in safely scaling hands-on practical experiments due to privacy and ethical concerns associated with biomedical data. To overcome these issues, we conducted a three-year (2021-2023) case study implementing an advanced PBL framework tailored specifically for biomedical AI education, involving 92 undergraduate and 156 graduate students from the joint Biomedical Engineering program of Georgia Institute of Technology and Emory University. Our approach emphasizes collaborative, interdisciplinary problem-solving through authentic biomedical AI challenges. The implementation led to measurable improvements in learning outcomes, evidenced by high research productivity (16 student-authored publications), consistently positive peer evaluations, and successful development of innovative computational methods addressing real biomedical challenges. Additionally, we examined the role of generative AI both as a teaching subject and an educational support tool within the PBL framework. Our study presents a practical and scalable roadmap for biomedical engineering departments aiming to integrate robust AI education into their curricula.
[AI-52] Empowering Medical Multi-Agents with Clinical Consultation Flow for Dynamic Diagnosis
Link: https://arxiv.org/abs/2503.16547
Authors: Sihan Wang, Suiyang Jiang, Yibo Gao, Boming Wang, Shangqi Gao, Xiahai Zhuang
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Abstract:Traditional AI-based healthcare systems often rely on single-modal data, limiting diagnostic accuracy due to incomplete information. However, recent advancements in foundation models show promising potential for enhancing diagnosis by combining multi-modal information. While these models excel in static tasks, they struggle with dynamic diagnosis, failing to manage multi-turn interactions and often making premature diagnostic decisions due to insufficient persistence in information collection. To address this, we propose a multi-agent framework inspired by consultation flow and reinforcement learning (RL) to simulate the entire consultation process, integrating multiple clinical information for effective diagnosis. Our approach incorporates a hierarchical action set, structured from clinic consultation flow and medical textbook, to effectively guide the decision-making process. This strategy improves agent interactions, enabling them to adapt and optimize actions based on the dynamic state. We evaluated our framework on a public dynamic diagnosis benchmark. The proposed framework demonstrably improves on the baseline methods and achieves state-of-the-art performance compared to existing foundation model-based methods.
[AI-53] Modelling Emotions in Face-to-Face Setting: The Interplay of Eye-Tracking, Personality, and Temporal Dynamics
Link: https://arxiv.org/abs/2503.16532
Authors: Meisam Jamshidi Seikavandi, Jostein Fimland, Maria Barrett, Paolo Burelli
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we showcase how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset, while being recorded for eye-tracking signals (pupil size, fixation patterns), Big Five personality assessments, and self-reported emotional states. Our neural network models combined these diverse inputs including stimulus emotion labels for contextual cues and yielded marked performance gains compared to the state-of-the-art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information demonstrated significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
[AI-54] Advancing Human-Machine Teaming: Concepts, Challenges, and Applications
Link: https://arxiv.org/abs/2503.16518
Authors: Dian Chen, Han Jun Yoon, Zelin Wan, Nithin Alluru, Sang Won Lee, Richard He, Terrence J. Moore, Frederica F. Nelson, Sunghyun Yoon, Hyuk Lim, Dan Dongseong Kim, Jin-Hee Cho
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Human-Machine Teaming (HMT) is revolutionizing collaboration across domains such as defense, healthcare, and autonomous systems by integrating AI-driven decision-making, trust calibration, and adaptive teaming. This survey presents a comprehensive taxonomy of HMT, analyzing theoretical models, including reinforcement learning, instance-based learning, and interdependence theory, alongside interdisciplinary methodologies. Unlike prior reviews, we examine team cognition, ethical AI, multi-modal interactions, and real-world evaluation frameworks. Key challenges include explainability, role allocation, and scalable benchmarking. We propose future research in cross-domain adaptation, trust-aware AI, and standardized testbeds. By bridging computational and social sciences, this work lays a foundation for resilient, ethical, and scalable HMT systems.
[AI-55] From G-Factor to A-Factor: Establishing a Psychometric Framework for AI Literacy
Link: https://arxiv.org/abs/2503.16517
Authors: Ning Li, Wenming Deng, Jiatan Chen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Abstract:This research addresses the growing need to measure and understand AI literacy in the context of generative AI technologies. Through three sequential studies involving a total of 517 participants, we establish AI literacy as a coherent, measurable construct with significant implications for education, workforce development, and social equity. Study 1 (N=85) revealed a dominant latent factor - termed the “A-factor” - that accounts for 44.16% of variance across diverse AI interaction tasks. Study 2 (N=286) refined the measurement tool by examining four key dimensions of AI literacy: communication effectiveness, creative idea generation, content evaluation, and step-by-step collaboration, resulting in an 18-item assessment battery. Study 3 (N=146) validated this instrument in a controlled laboratory setting, demonstrating its predictive validity for real-world task performance. Results indicate that AI literacy significantly predicts performance on complex, language-based creative tasks but shows domain specificity in its predictive power. Additionally, regression analyses identified several significant predictors of AI literacy, including cognitive abilities (IQ), educational background, prior AI experience, and training history. The multidimensional nature of AI literacy and its distinct factor structure provide evidence that effective human-AI collaboration requires a combination of general and specialized abilities. These findings contribute to theoretical frameworks of human-AI collaboration while offering practical guidance for developing targeted educational interventions to promote equitable access to the benefits of generative AI technologies.
[AI-56] VeriMind: Agentic LLM for Automated Verilog Generation with a Novel Evaluation Metric
Link: https://arxiv.org/abs/2503.16514
Authors: Bardia Nadimi, Ghali Omar Boutaib, Hao Zheng
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Abstract:Designing Verilog modules requires meticulous attention to correctness, efficiency, and adherence to design specifications. However, manually writing Verilog code remains a complex and time-consuming task that demands both expert knowledge and iterative refinement. Leveraging recent advancements in large language models (LLMs) and their structured text generation capabilities, we propose VeriMind, an agentic LLM framework for Verilog code generation that significantly automates and optimizes the synthesis process. Unlike traditional LLM-based code generators, VeriMind employs a structured reasoning approach: given a user-provided prompt describing design requirements, the system first formulates a detailed train of thought before the final Verilog code is generated. This multi-step methodology enhances interpretability, accuracy, and adaptability in hardware design. In addition, we introduce a novel evaluation metric, pass@ARC, which combines the conventional pass@k measure with Average Refinement Cycles (ARC) to capture both success rate and the efficiency of iterative refinement. Experimental results on diverse hardware design tasks demonstrated that our approach achieved an improvement of up to 8.3% on the pass@k metric and 8.1% on the pass@ARC metric. These findings underscore the transformative potential of agentic LLMs in automated hardware design, RTL development, and digital system synthesis.
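For readers unfamiliar with the pass@k measure that the abstract above builds on, the sketch below implements the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass. The abstract does not state how pass@ARC combines pass@k with Average Refinement Cycles, so the second helper only averages refinement counts and is a hypothetical stand-in, not the paper's metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb() returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_refinement_cycles(cycles_per_task):
    """Mean number of refinement cycles per task (hypothetical helper)."""
    if not cycles_per_task:
        raise ValueError("need at least one task")
    return sum(cycles_per_task) / len(cycles_per_task)


# Hypothetical numbers: 10 samples per task, 3 of them functionally correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
arc = average_refinement_cycles([1, 3, 2])
```

Reporting the two quantities side by side already conveys the trade-off the abstract describes: how often generation succeeds and how many refinement rounds it takes to get there.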
[AI-57] Conversational AI as a Coding Assistant: Understanding Programmers' Interactions with and Expectations from Large Language Models for Coding
Link: https://arxiv.org/abs/2503.16508
Authors: Mehmet Akhoroz, Caglar Yildirim
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 20 pages
Abstract:Conversational AI interfaces powered by large language models (LLMs) are increasingly used as coding assistants. However, questions remain about how programmers interact with LLM-based conversational agents, the challenges they encounter, and the factors influencing adoption. This study investigates programmers’ usage patterns, perceptions, and interaction strategies when engaging with LLM-driven coding assistants. Through a survey, participants reported both the benefits, such as efficiency and clarity of explanations, and the limitations, including inaccuracies, lack of contextual awareness, and concerns about over-reliance. Notably, some programmers actively avoid LLMs due to a preference for independent learning, distrust in AI-generated code, and ethical considerations. Based on our findings, we propose design guidelines for improving conversational coding assistants, emphasizing context retention, transparency, multimodal support, and adaptability to user preferences. These insights contribute to the broader understanding of how LLM-based conversational agents can be effectively integrated into software development workflows while addressing adoption barriers and enhancing usability.
[AI-58] Fewer Than 1% of Explainable AI Papers Validate Explainability with Humans
Link: https://arxiv.org/abs/2503.16507
Authors: Ashley Suh, Isabelle Hurley, Nora Smith, Ho Chit Siu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA '25)
Abstract:This late-breaking work presents a large-scale analysis of explainable AI (XAI) literature to evaluate claims of human explainability. We collaborated with a professional librarian to identify 18,254 papers containing keywords related to explainability and interpretability. Of these, we find that only 253 papers included terms suggesting human involvement in evaluating an XAI technique, and just 128 of those conducted some form of a human study. In other words, fewer than 1% of XAI papers (0.7%) provide empirical evidence of human explainability when compared to the broader body of XAI literature. Our findings underscore a critical gap between claims of human explainability and evidence-based validation, raising concerns about the rigor of XAI research. We call for increased emphasis on human evaluations in XAI studies and provide our literature search methodology to enable both reproducibility and further investigation into this widespread issue.
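The headline figure can be checked directly from the counts given in the abstract above. The short calculation below uses only those three numbers; the variable names are illustrative.

```python
# Counts as reported in the abstract.
total_papers = 18_254   # papers matching explainability/interpretability keywords
human_terms = 253       # papers with terms suggesting human involvement
human_studies = 128     # papers that actually conducted a human study

# Shares relative to the full corpus, as percentages.
pct_human_terms = 100 * human_terms / total_papers
pct_human_studies = 100 * human_studies / total_papers

# Share of the "human terms" papers that ran a study.
pct_studies_among_terms = 100 * human_studies / human_terms

print(f"{pct_human_terms:.2f}% mention human involvement")
print(f"{pct_human_studies:.2f}% conducted a human study")
print(f"{pct_studies_among_terms:.1f}% of the former conducted a study")
```

The second value rounds to 0.7%, which matches the abstract's "fewer than 1%" claim; roughly half of the papers that mention human involvement actually ran a study.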
[AI-59] Stakeholder Perspectives on Whether and How Social Robots Can Support Mediation and Advocacy for Higher Education Students with Disabilities
Link: https://arxiv.org/abs/2503.16499
Authors: Alva Markelius, Julie Bailey, Jenny L. Gibson, Hatice Gunes
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
Comments: This is a pre-print
Abstract:This paper presents an iterative, participatory, empirical study that examines the potential of using artificial intelligence, such as social robots and large language models, to support mediation and advocacy for students with disabilities in higher education. Drawing on qualitative data from interviews and focus groups conducted with various stakeholders, including disabled students, disabled student representatives, and disability practitioners at the University of Cambridge, this study reports findings relating to understanding the problem space, ideating robotic support and participatory co-design of advocacy support robots. The findings highlight the potential of these technologies in providing signposting and acting as a sounding board or study companion, while also addressing limitations in empathic understanding, trust, equity, and accessibility. We discuss ethical considerations, including intersectional biases, the double empathy problem, and the implications of deploying social robots in contexts shaped by structural inequalities. Finally, we offer a set of recommendations and suggestions for future research, rethinking the notion of corrective technological interventions to tools that empower and amplify self-advocacy.
[AI-60] Effective Yet Ephemeral Propaganda Defense: There Needs to Be More than One-Shot Inoculation to Enhance Critical Thinking
Link: https://arxiv.org/abs/2503.16497
Authors: Nicolas Hoferer, Kilian Sprenkamp, Dorian Christoph Quelle, Daniel Gordon Jones, Zoya Katashinskaya, Alexandre Bovet, Liudmila Zavolokina
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract:In today’s media landscape, propaganda distribution has a significant impact on society. It sows confusion, undermines democratic processes, and leads to increasingly difficult decision-making for news readers. We investigate the lasting effect of a propaganda detection and contextualization tool on news readers’ critical thinking and propaganda awareness. Building on inoculation theory, which suggests that preemptively exposing individuals to weakened forms of propaganda can improve their resilience against it, we integrate Kahneman’s dual-system theory to measure the tool’s impact on critical thinking. Through a two-phase online experiment, we measure the effect of several inoculation doses. Our findings show that while the tool increases critical thinking during its use, this increase vanishes without access to the tool. This indicates that a single use of the tool does not create a lasting impact. We discuss the implications and propose possible approaches to improve resilience against propaganda in the long term.
[AI-61] The Impact of Generative AI Coding Assistants on Developers Who Are Visually Impaired
Link: https://arxiv.org/abs/2503.16491
Authors: Claudia Flores-Saviaga, Benjamin V. Hanrahan, Kashif Imteyaz, Steven Clarke, Saiph Savage
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 21 pages, 3 figures, published in the ACM Conference on Human Factors in Computing Systems 2025 (CHI’25)
Abstract:The rapid adoption of generative AI in software development has impacted the industry, yet its effects on developers with visual impairments remain largely unexplored. To address this gap, we used an Activity Theory framework to examine how developers with visual impairments interact with AI coding assistants. For this purpose, we conducted a study where developers who are visually impaired completed a series of programming tasks using a generative AI coding assistant. We found that, while participants found the AI assistant beneficial and reported significant advantages, they also highlighted accessibility challenges. Specifically, the AI coding assistant often exacerbated existing accessibility barriers and introduced new challenges. For example, it overwhelmed users with an excessive number of suggestions, leading developers who are visually impaired to express a desire for “AI timeouts.” Additionally, the generative AI coding assistant made it more difficult for developers to switch contexts between the AI-generated content and their own code. Despite these challenges, participants were optimistic about the potential of AI coding assistants to transform the coding experience for developers with visual impairments. Our findings emphasize the need to apply activity-centered design principles to generative AI assistants, ensuring they better align with user behaviors and address specific accessibility needs. This approach can enable the assistants to provide more intuitive, inclusive, and effective experiences, while also contributing to the broader goal of enhancing accessibility in software development.
[AI-62] PythonPal: Enhancing Online Programming Education through Chatbot-Driven Personalized Feedback
Link: https://arxiv.org/abs/2503.16487
Authors: Sirinda Palahan
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:The rise of online programming education has necessitated more effective, personalized interactions, a gap that PythonPal aims to fill through its innovative learning system integrated with a chatbot. This research delves into PythonPal’s potential to enhance the online learning experience, especially in contexts with high student-to-teacher ratios where there is a need for personalized feedback. PythonPal’s design, featuring modules for conversation, tutorials, and exercises, was evaluated through student interactions and feedback. Key findings reveal PythonPal’s proficiency in syntax error recognition and user query comprehension, with its intent classification model showing high accuracy. The system’s performance in error feedback, though varied, demonstrates both strengths and areas for enhancement. Student feedback indicated satisfactory query understanding and feedback accuracy but also pointed out the need for faster responses and improved interaction quality. PythonPal’s deployment promises to significantly enhance online programming education by providing immediate, personalized feedback and interactive learning experiences, fostering a deeper understanding of programming concepts among students. These benefits mark a step forward in addressing the challenges of distance learning, making programming education more accessible and effective.
[AI-63] Accodemy: AI Powered Code Learning Platform to Assist Novice Programmers in Overcoming the Fear of Coding
Link: https://arxiv.org/abs/2503.16486
Authors: M.A.F. Aamina, V. Kavishcan, W.M.P.B.B. Jayaratne, K.K.D.S.N. Kannangara, A.A. Aamil, Achini Adikari
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Computer programming represents a rapidly evolving and sought-after career path in the 21st century. Nevertheless, novice learners may find the process intimidating for several reasons, such as limited and highly competitive career opportunities, peer and parental pressure for academic success, and course difficulties. These factors frequently contribute to anxiety and eventual dropout as a result of fear. Furthermore, research has demonstrated that beginners are significantly deterred by the fear of failure, which results in programming anxiety and a sense of being overwhelmed by intricate topics, ultimately leading to dropping out. This project undertakes an exploration beyond the scope of conventional code learning platforms by identifying and utilising effective and personalised strategies of learning. The proposed solution incorporates features such as AI-generated challenging questions, mindfulness quotes, and tips to motivate users, along with an AI chatbot that functions as a motivational aid. In addition, the suggested solution integrates personalized roadmaps and gamification elements to maintain user involvement. The project aims to systematically monitor the progress of novice programmers and enhance their knowledge of coding with a personalised, revised curriculum to help mitigate the fear of coding and boost confidence.
[AI-64] Optimizing Generative AI's Accuracy and Transparency in Inductive Thematic Analysis: A Human-AI Comparison
Link: https://arxiv.org/abs/2503.16485
Authors: Matthew Nyaaba, Min SungEun, Mary Abiswin Apam, Kwame Owoahene Acheampong, Emmanuel Dwamena
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:This study explores the use of OpenAI’s API for inductive thematic analysis, employing a stepwise strategy to enhance transparency and traceability in GenAI-generated coding. A five-phase analysis and evaluation process was followed. Using the stepwise prompt, GenAI effectively generated codes with supporting statements and references, categorized themes, and developed broader interpretations by linking them to real-world contexts. While GenAI performed at a comparable level to human coders in coding and theming, it exhibited a more generalized and conceptual approach to interpretation, whereas human coders provided more specific, theme-based interpretations. Mapping these processes onto Naeem et al.’s (2023) six-step thematic analysis framework, GenAI covered four out of the six steps, while human coders followed three steps. Although GenAI’s coding, theming, and interpretation align with keywording, coding, theming, and interpretation in Naeem et al.’s framework, human coders’ interpretations were more closely tied to themes rather than broader conceptualization. This study positions GenAI as a viable tool for conducting inductive thematic analysis with minimal human intervention, offering an efficient and structured approach to qualitative data analysis. Future research should explore the development of specialized prompts that align GenAI’s inductive thematic analysis with established qualitative research frameworks.
[AI-65] AI-Powered Episodic Future Thinking
Link: https://arxiv.org/abs/2503.16484
Authors: Sareh Ahmadi, Michelle Rockwell, Megan Stuart, Allison Tegge, Xuan Wang, Jeffrey Stein, Edward A. Fox
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Episodic Future Thinking (EFT) is an intervention that involves vividly imagining personal future events and experiences in detail. It has shown promise as an intervention to reduce delay discounting - the tendency to devalue delayed rewards in favor of immediate gratification - and to promote behavior change in a range of maladaptive health behaviors. We present EFTeacher, an AI chatbot powered by the GPT-4-Turbo large language model, designed to generate EFT cues for users with lifestyle-related conditions. To evaluate the chatbot, we conducted a user study that included usability assessments and user evaluations based on content characteristics questionnaires, followed by semi-structured interviews. The study provides qualitative insights into participants’ experiences and interactions with the chatbot and its usability. Our findings highlight the potential application of AI chatbots based on Large Language Models (LLMs) in EFT interventions, and offer design guidelines for future behavior-oriented applications.
[AI-66] LeRAAT: LLM-Enabled Real-Time Aviation Advisory Tool
Link: https://arxiv.org/abs/2503.16477
Authors: Marc R. Schlichting, Vale Rasmussen, Heba Alazzeh, Houjun Liu, Kiana Jafari, Amelia F. Hardy, Dylan M. Asmar, Mykel J. Kochenderfer
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments: 4 pages, 3 figures, code: this https URL, demo video: this https URL
Abstract:In aviation emergencies, high-stakes decisions must be made in an instant. Pilots rely on quick access to precise, context-specific information – an area where emerging tools like large language models (LLMs) show promise in providing critical support. This paper introduces LeRAAT, a framework that integrates LLMs with the X-Plane flight simulator to deliver real-time, context-aware pilot assistance. The system uses live flight data, weather conditions, and aircraft documentation to generate recommendations aligned with aviation best practices and tailored to the particular situation. It employs a Retrieval-Augmented Generation (RAG) pipeline that extracts and synthesizes information from aircraft type-specific manuals, including performance specifications and emergency procedures, as well as aviation regulatory materials, such as FAA directives and standard operating procedures. We showcase the framework in both a virtual reality and traditional on-screen simulation, supporting a wide range of research applications such as pilot training, human factors research, and operational decision support.
[AI-67] From Voices to Worlds: Developing an AI-Powered Framework for 3D Object Generation in Augmented Reality
Link: https://arxiv.org/abs/2503.16474
Authors: Majid Behravan, Denis Gracanin
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2502.15869
Abstract:This paper presents Matrix, an advanced AI-powered framework designed for real-time 3D object generation in Augmented Reality (AR) environments. By integrating a cutting-edge text-to-3D generative AI model, multilingual speech-to-text translation, and large language models (LLMs), the system enables seamless user interactions through spoken commands. The framework processes speech inputs, generates 3D objects, and provides object recommendations based on contextual understanding, enhancing AR experiences. A key feature of this framework is its ability to optimize 3D models by reducing mesh complexity, resulting in significantly smaller file sizes and faster processing on resource-constrained AR devices. Our approach addresses the challenges of high GPU usage, large model output sizes, and real-time system responsiveness, ensuring a smoother user experience. Moreover, the system is equipped with a pre-generated object repository, further reducing GPU load and improving efficiency. We demonstrate the practical applications of this framework in various fields such as education, design, and accessibility, and discuss future enhancements including image-to-3D conversion, environmental object detection, and multimodal support. The open-source nature of the framework promotes ongoing innovation and its utility across diverse industries.
[AI-68] Human-AI Interaction Design Standards
Link: https://arxiv.org/abs/2503.16472
Authors: Chaoyi Zhao, Wei Xu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:The rapid development of artificial intelligence (AI) has significantly transformed human-computer interactions, making it essential to establish robust design standards to ensure effective, ethical, and human-centered AI (HCAI) solutions. Standards serve as the foundation for the adoption of new technologies, and human-AI interaction (HAII) standards are critical to supporting the industrialization of AI technology by following an HCAI approach. These design standards aim to provide clear principles, requirements, and guidelines for designing, developing, deploying, and using AI systems, enhancing the user experience and performance of AI systems. Despite their importance, the creation and adoption of HCAI-based interaction design standards face challenges, including the absence of universal frameworks, the inherent complexity of HAII, and the ethical dilemmas that arise in such systems. This chapter provides a comparative analysis of HAII versus traditional human-computer interaction (HCI) and outlines guiding principles for HCAI-based design. It explores international, regional, national, and industry standards related to HAII design from an HCAI perspective and reviews design guidelines released by leading companies such as Microsoft, Google, and Apple. Additionally, the chapter highlights tools available for implementing HAII standards and presents case studies of human-centered interaction design for AI systems in diverse fields, including healthcare, autonomous vehicles, and customer service. It further examines key challenges in developing HAII standards and suggests future directions for the field. Emphasizing the importance of ongoing collaboration between AI designers, developers, and experts in human factors and HCI, this chapter stresses the need to advance HCAI-based interaction design standards to ensure human-centered AI solutions across various domains.
[AI-69] A Review of Brain-Computer Interface Technologies: Signal Acquisition Methods and Interaction Paradigms
Link: https://arxiv.org/abs/2503.16471
Authors: Yifan Wang, Cheng Jiang, Chenzhong Li
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 figures, 20 pages
Abstract:Brain-Computer Interface (BCI) technology facilitates direct communication between the human brain and external devices, representing a substantial advancement in human-machine interaction. This review provides an in-depth analysis of various BCI paradigms, including classic paradigms, current classifications, and hybrid paradigms, each with distinct characteristics and applications. Additionally, we explore a range of signal acquisition methods, classified into non-implantation, intervention, and implantation techniques, elaborating on their principles and recent advancements. By examining the interdependence between paradigms and signal acquisition technologies, this review offers a comprehensive perspective on how innovations in one domain propel progress in the other. The goal is to present insights into the future development of more efficient, user-friendly, and versatile BCI systems, emphasizing the synergy between paradigm design and signal acquisition techniques and their potential to transform the field.
[AI-70] Towards properly implementing Theory of Mind in AI systems: An account of four misconceptions
Link: https://arxiv.org/abs/2503.16468
Authors: Ramira van der Meulen, Rineke Verbrugge, Max van Duijn
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, draft version
Abstract:The search for effective collaboration between humans and computer systems is one of the biggest challenges in Artificial Intelligence. One of the more effective mechanisms that humans use to coordinate with one another is theory of mind (ToM). ToM can be described as the ability to ‘take someone else’s perspective and make estimations of their beliefs, desires and intentions, in order to make sense of their behaviour and attitudes towards the world’. If leveraged properly, this skill can be very useful in Human-AI collaboration. This raises the question of how to implement ToM when building an AI system. Humans and AI systems work quite differently, and ToM is a multifaceted concept, each facet rooted in different research traditions across the cognitive and developmental sciences. We observe that researchers from artificial intelligence and the computing sciences, ourselves included, often have difficulties finding their way in the ToM literature. In this paper, we identify four common misconceptions around ToM that we believe should be taken into account when developing an AI system. We have hyperbolised these misconceptions for the sake of the argument, but add nuance in their discussion. The misconceptions we discuss are: (1) “Humans Use a ToM Module, So AI Systems Should As Well”. (2) “Every Social Interaction Requires (Advanced) ToM”. (3) “All ToM is the Same”. (4) “Current Systems Already Have ToM”. After discussing each misconception, we end the section by providing tentative guidelines on how it can be overcome.
[AI-71] Enhancing Explainability with Multimodal Context Representations for Smarter Robots
Link: https://arxiv.org/abs/2503.16467
Authors: Anargh Viswanath, Lokesh Veeramacheneni, Hendrik Buschmeier
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Presented at the 3rd Workshop on Explainability in Human-Robot Collaboration at HRI 2025
Abstract:Artificial Intelligence (AI) has significantly advanced in recent years, driving innovation across various fields, especially in robotics. Even though robots can perform complex tasks with increasing autonomy, challenges remain in ensuring explainability and user-centered design for effective interaction. A key issue in Human-Robot Interaction (HRI) is enabling robots to effectively perceive and reason over multimodal inputs, such as audio and vision, to foster trust and seamless collaboration. In this paper, we propose a generalized and explainable multimodal framework for context representation, designed to improve the fusion of speech and vision modalities. We introduce a use case on assessing ‘Relevance’ between verbal utterances from the user and visual scene perception of the robot. We present our methodology with a Multimodal Joint Representation module and a Temporal Alignment module, which can allow robots to evaluate relevance by temporally aligning multimodal inputs. Finally, we discuss how the proposed framework for context representation can help with various aspects of explainability in HRI.
[AI-72] ACE Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants
Link: https://arxiv.org/abs/2503.16466
Authors: Elizabeth Anne Watkins, Emanuel Moss, Ramesh Manuvinakurike, Meng Shi, Richard Beckwith, Giuseppe Raffa
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted at Human-Centered Explainable AI workshop at CHI 2024
Abstract:In this short paper we address issues related to building multimodal AI systems for human performance support in manufacturing domains. We make two contributions: first, we identify challenges of participatory design and training of such systems; second, to address these challenges, we propose the ACE paradigm: “Action and Control via Explanations”. Specifically, we suggest that LLMs can be used to produce explanations in the form of human-interpretable “semantic frames”, which in turn enable end users to provide the data the AI system needs to align its multimodal models and representations, including computer vision, automatic speech recognition, and document inputs. ACE, by using LLMs to “explain” using semantic frames, will help the human and the AI system to collaborate, together building a more accurate model of human activities and behaviors, and ultimately more accurate predictive outputs for better task support and better outcomes for human users performing manual tasks.
[AI-73] OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents
Link: https://arxiv.org/abs/2503.16465
Authors: Pengzhou Cheng, Zheng Wu, Zongru Wu, Aston Zhang, Zhuosheng Zhang, Gongshen Liu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 25 pages, 24 figures, 11 tables
Abstract:Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks fully autonomously without adequately assessing its confidence in each action, which compromises adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit the ability of adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with improvements of 24.59% to 87.29% in task success rate. OS-Kairos facilitates an adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and code are available at this https URL.
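The act-or-ask decision described in the abstract can be pictured as a per-step gate on the predicted confidence. The sketch below is only an illustration under stated assumptions: the `Step` type, the `ask_human` callback, and the fixed threshold of 0.8 are hypothetical and are not taken from OS-Kairos.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str        # proposed GUI action, e.g. "tap(search_box)"
    confidence: float  # agent's self-predicted confidence in [0, 1]


def run_episode(
    steps: List[Step],
    ask_human: Callable[[Step], str],
    threshold: float = 0.8,  # hypothetical cut-off, not from the paper
) -> List[Tuple[str, str]]:
    """Execute confident steps autonomously; hand the rest to a human."""
    log: List[Tuple[str, str]] = []
    for step in steps:
        if step.confidence >= threshold:
            log.append(("agent", step.action))
        else:
            # Low confidence: defer instead of over-executing.
            log.append(("human", ask_human(step)))
    return log


# Toy episode: the second step is uncertain, so the human is consulted.
steps = [Step("open(app)", 0.95), Step("tap(pay_now)", 0.40)]
log = run_episode(steps, ask_human=lambda s: f"confirmed:{s.action}")
```

A fixed threshold is the simplest possible policy; the abstract suggests the model itself learns when to defer, which a real system would replace this gate with.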
[AI-74] Rank-O-ToM: Unlocking Emotional Nuance Ranking to Enhance Affective Theory-of-Mind AAAI2025
Link: https://arxiv.org/abs/2503.16461
Authors: JiHyun Kim, JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, YoungBin Kim
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2025 Theory of Mind for AI (ToM4AI) Workshop (Spotlight). JiHyun Kim, JuneHyoung Kwon, MiHyeon Kim, and Eunju Lee contributed equally as co-first authors. YoungBin Kim is the corresponding author.
Abstract:Facial Expression Recognition (FER) plays a foundational role in enabling AI systems to interpret emotional nuances, a critical aspect of affective Theory of Mind (ToM). However, existing models often struggle with poor calibration and a limited capacity to capture emotional intensity and complexity. To address this, we propose Ranking the Emotional Nuance for Theory of Mind (Rank-O-ToM), a framework that leverages ordinal ranking to align confidence levels with the emotional spectrum. By incorporating synthetic samples reflecting diverse affective complexities, Rank-O-ToM enhances the nuanced understanding of emotions, advancing AI’s ability to reason about affective states.
[AI-75] Beyond Final Answers: Evaluating Large Language Models for Math Tutoring
Link: https://arxiv.org/abs/2503.16460
Authors: Adit Gupta, Jennifer Reddig, Tommaso Calo, Daniel Weitekamp, Christopher J. MacLellan
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Researchers have made notable progress in applying Large Language Models (LLMs) to solve math problems, as demonstrated through efforts like GSM8k, ProofNet, AlphaGeometry, and MathOdyssey. This progress has sparked interest in their potential use for tutoring students in mathematics. However, the reliability of LLMs in tutoring contexts – where correctness and instructional quality are crucial – remains underexplored. Moreover, LLM problem-solving capabilities may not necessarily translate into effective tutoring support for students. In this work, we present two novel approaches to evaluate the correctness and quality of LLMs in math tutoring contexts. The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities. We generate benchmark problems using the tutor, prompt a diverse set of LLMs to solve them, and compare the solutions to those generated by the tutor. The second approach evaluates LLMs as tutors rather than problem solvers. We employ human evaluators, who act as students seeking tutoring support from each LLM. We then assess the quality and correctness of the support provided by the LLMs via a qualitative coding process. We applied these methods to evaluate several ChatGPT models, including 3.5 Turbo, 4, 4o, o1-mini, and o1-preview. Our findings show that when used as problem solvers, LLMs generate correct final answers for 85.5% of the college algebra problems tested. When employed interactively as tutors, 90% of LLM dialogues show high-quality instructional support; however, many contain errors – only 56.6% are entirely correct. We conclude that, despite their potential, LLMs are not yet suitable as intelligent tutors for math without human oversight or additional mechanisms to ensure correctness and quality.
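The first evaluation approach, comparing LLM final answers with the tutor's solutions, amounts to a per-problem match followed by an accuracy count. A minimal sketch, assuming answers are plain numbers given as strings (real college algebra answers are often expressions and would need a symbolic comparison; the function names and sample data are hypothetical):

```python
from fractions import Fraction
from typing import Dict


def normalize(answer: str) -> Fraction:
    """Parse a numeric answer such as '3/4', '0.75', or ' 2 ' into an exact fraction."""
    return Fraction(answer.strip())


def final_answer_accuracy(llm: Dict[str, str], tutor: Dict[str, str]) -> float:
    """Fraction of tutor problems for which the LLM's final answer matches."""
    if not tutor:
        raise ValueError("no reference problems")
    correct = 0
    for problem_id, reference in tutor.items():
        candidate = llm.get(problem_id)
        if candidate is None:
            continue  # an unanswered problem counts as incorrect
        try:
            if normalize(candidate) == normalize(reference):
                correct += 1
        except (ValueError, ZeroDivisionError):
            pass  # an unparseable answer counts as incorrect
    return correct / len(tutor)


# Made-up data: one match, one wrong value, one answer that is not a bare number.
tutor_answers = {"p1": "3/4", "p2": "2", "p3": "5"}
llm_answers = {"p1": "0.75", "p2": "3", "p3": "x = 5"}
accuracy = final_answer_accuracy(llm_answers, tutor_answers)
```

Exact fractions avoid false mismatches between equivalent forms such as `3/4` and `0.75`. The `p3` case shows why a strict parser undercounts, and why answer extraction is a design decision in its own right.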
[AI-76] Position: Beyond Assistance – Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care
Link: https://arxiv.org/abs/2503.16456
Authors: Abeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang, Shaina Raza, Elham Dolatabadi
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centered assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.
[AI-77] Bridging Structural Dynamics and Biomechanics: Human Motion Estimation through Footstep-Induced Floor Vibrations
Link: https://arxiv.org/abs/2503.16455
Authors: Yiwen Dong, Jessica Rose, Hae Young Noh
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Abstract:Quantitative estimation of human joint motion in daily living spaces is essential for early detection and rehabilitation tracking of neuromusculoskeletal disorders (e.g., Parkinson’s) and mitigating trip and fall risks for older adults. Existing approaches involve monitoring devices such as cameras, wearables, and pressure mats, but have operational constraints such as direct line-of-sight, carrying devices, and dense deployment. To overcome these limitations, we leverage gait-induced floor vibration to estimate lower-limb joint motion (e.g., ankle, knee, and hip flexion angles), allowing non-intrusive and contactless gait health monitoring in people’s living spaces. To overcome the high uncertainty in lower-limb movement given the limited information provided by the gait-induced floor vibrations, we formulate a physics-informed graph to integrate domain knowledge of gait biomechanics and structural dynamics into the model. Specifically, different types of nodes represent heterogeneous information from joint motions and floor vibrations; their connecting edges represent the physiological relationships between joints and forces governed by gait biomechanics, as well as the relationships between forces and floor responses governed by the structural dynamics. As a result, our model imposes physical constraints to reduce uncertainty while allowing information sharing between the body and the floor to make more accurate predictions. We evaluate our approach with 20 participants through a real-world walking experiment. We achieved an average of 3.7 degrees of mean absolute error in estimating 12 joint flexion angles (38% error reduction from baseline), which is comparable to the performance of cameras and wearables in current medical practices.
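The reported metric is a mean absolute error (MAE) over joint flexion angles. The sketch below shows how such a figure and a relative error reduction are computed; the sample angles are made up, and reading the 38% as a relative reduction of the MAE is an assumption about the abstract's wording rather than a confirmed detail.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Average absolute difference between predicted and measured angles (degrees)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Share of the baseline error removed by the model, e.g. 0.38 for 38%."""
    return (baseline_error - model_error) / baseline_error


# Made-up knee flexion angles in degrees, for illustration only.
predicted = [10.0, 20.0, 30.0]
measured = [12.0, 17.0, 30.0]
mae = mean_absolute_error(predicted, measured)  # (2 + 3 + 0) / 3

# If the 38% is a relative reduction of the 3.7-degree MAE, the implied
# baseline MAE is 3.7 / (1 - 0.38), roughly 5.97 degrees.
implied_baseline = 3.7 / (1 - 0.38)
```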
[AI-78] An Audio-Visual Fusion Emotion Generation Model Based on Neuroanatomical Alignment
Link: https://arxiv.org/abs/2503.16454
Authors: Haidong Wang, Qia Shan, JianHua Zhang, PengFei Xiao, Ao Liu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:In the field of affective computing, traditional methods for generating emotions predominantly rely on deep learning techniques and large-scale emotion datasets. However, deep learning techniques are often complex and difficult to interpret, and standardized large-scale emotion datasets are difficult and costly to establish. To tackle these challenges, we introduce a novel framework named Audio-Visual Fusion for Brain-like Emotion Learning (AVF-BEL). In contrast to conventional brain-inspired emotion learning methods, this approach improves the audio-visual emotion fusion and generation model through the integration of modular components, thereby enabling more lightweight and interpretable emotion learning and generation processes. The framework simulates the integration of the visual, auditory, and emotional pathways of the brain, optimizes the fusion of emotional features across visual and auditory modalities, and improves upon the traditional Brain Emotional Learning (BEL) model. The experimental results indicate a significant improvement in the similarity of the audio-visual fusion emotion learning and generation model compared with single-modality visual and auditory emotion learning and generation models. Ultimately, this aligns with the fundamental phenomenon of heightened emotion generation facilitated by the integrated impact of visual and auditory stimuli. This contribution not only enhances the interpretability and efficiency of affective intelligence but also provides new insights and pathways for advancing affective computing technology. Our source code can be accessed here: this https URL.
[AI-79] Towards Biomarker Discovery for Early Cerebral Palsy Detection: Evaluating Explanations Through Kinematic Perturbations
Link: https://arxiv.org/abs/2503.16452
Authors: Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Pål Haugen, Espen Alexander F. Ihlen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 19 pages, 14 figures
Abstract:Cerebral Palsy (CP) is a prevalent motor disability in children, for which early detection can significantly improve treatment outcomes. While skeleton-based Graph Convolutional Network (GCN) models have shown promise in automatically predicting CP risk from infant videos, their “black-box” nature raises concerns about clinical explainability. To address this, we introduce a perturbation framework tailored for infant movement features and use it to compare two explainable AI (XAI) methods: Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM). First, we identify significant and non-significant body keypoints in very low- and very high-risk infant video snippets based on the XAI attribution scores. We then conduct targeted velocity and angular perturbations, both individually and in combination, on these keypoints to assess how the GCN model’s risk predictions change. Our results indicate that velocity-driven features of the arms, hips, and legs have a dominant influence on CP risk predictions, while angular perturbations have a more modest impact. Furthermore, CAM and Grad-CAM show partial convergence in their explanations for both low- and high-risk CP groups. Our findings demonstrate the use of XAI-driven movement analysis for early CP prediction and offer insights into potential movement-based biomarker discovery that warrant further clinical validation.
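The velocity perturbation test can be pictured as rescaling the frame-to-frame displacement of selected keypoints and measuring how the model's output moves. The sketch below is a simplified illustration, not the paper's framework: the 2D keypoint layout, the function names, and the toy model standing in for the GCN are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one (x, y) position per body keypoint


def perturb_velocity(frames: List[Frame], keypoints: Sequence[int], scale: float) -> List[Frame]:
    """Scale the frame-to-frame displacement of the chosen keypoints.

    scale > 1 speeds the keypoints up, scale < 1 slows them down;
    all other keypoints are left untouched.
    """
    if not frames:
        return []
    out = [list(frames[0])]
    for t in range(1, len(frames)):
        frame = list(frames[t])
        for k in keypoints:
            dx = frames[t][k][0] - frames[t - 1][k][0]
            dy = frames[t][k][1] - frames[t - 1][k][1]
            px, py = out[t - 1][k]  # previous perturbed position
            frame[k] = (px + scale * dx, py + scale * dy)
        out.append(frame)
    return out


def prediction_shift(
    model: Callable[[List[Frame]], float],
    frames: List[Frame],
    keypoints: Sequence[int],
    scale: float,
) -> float:
    """Change in the model output caused by the velocity perturbation."""
    return model(perturb_velocity(frames, keypoints, scale)) - model(frames)


def toy_model(frames: List[Frame]) -> float:
    """Stand-in for a risk model: mean x-speed of keypoint 0 (not a real GCN)."""
    steps = [abs(frames[t][0][0] - frames[t - 1][0][0]) for t in range(1, len(frames))]
    return sum(steps) / len(steps)


clip = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 0.0), (1.0, 1.0)], [(2.0, 0.0), (1.0, 1.0)]]
moving = prediction_shift(toy_model, clip, keypoints=[0], scale=2.0)
still = prediction_shift(toy_model, clip, keypoints=[1], scale=2.0)
```

Comparing the shift for keypoints an attribution method marks as significant against the shift for those it marks as non-significant is one way to check whether the attribution reflects what the model actually depends on.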
[AI-80] Think-Then-React: Towards Unconstrained Human Action-to-Reaction Generation ICLR2025
Link: https://arxiv.org/abs/2503.16451
Authors: Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang, Ruihua Song
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by ICLR 2025
Abstract:Modeling human-like action-to-reaction generation has significant real-world applications, like human-robot interaction and games. Despite recent advancements in single-person motion generation, it is still challenging to handle action-to-reaction generation well, due to the difficulty of directly predicting a reaction from an action sequence without prompts, and the absence of a unified representation that effectively encodes multi-person motion. To address these challenges, we introduce Think-Then-React (TTR), a large language-model-based framework designed to generate human-like reactions. First, with our fine-grained multimodal training strategy, TTR is able to unify two processes during inference: a thinking process that explicitly infers action intentions and reasons out a corresponding reaction description, which serves as a semantic prompt, and a reacting process that predicts reactions based on the input action and the inferred semantic prompts. Second, to effectively represent multi-person motion in language models, we propose a unified motion tokenizer that decouples egocentric pose and absolute space features, which effectively represents action and reaction motion with the same encoding. Extensive experiments demonstrate that TTR outperforms existing baselines, achieving significant improvements in evaluation metrics, such as reducing FID from 3.988 to 1.942.
[AI-81] Mitigating the Uncanny Valley Effect in Hyper-Realistic Robots: A Student-Centered Study on LLM-Driven Conversations
Link: https://arxiv.org/abs/2503.16449
Authors: Hangyeol Kang, Thiago Freitas dos Santos, Maher Ben Moussa, Nadia Magnenat-Thalmann
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract:The uncanny valley effect poses a significant challenge in the development and acceptance of hyper-realistic social robots. This study investigates whether advanced conversational capabilities powered by large language models (LLMs) can mitigate this effect in highly anthropomorphic robots. We conducted a user study with 80 participants interacting with Nadine, a hyper-realistic humanoid robot equipped with LLM-driven communication skills. Through pre- and post-interaction surveys, we assessed changes in perceptions of uncanniness, conversational quality, and overall user experience. Our findings reveal that LLM-enhanced interactions significantly reduce feelings of eeriness while fostering more natural and engaging conversations. Additionally, we identify key factors influencing user acceptance, including conversational naturalness, human-likeness, and interestingness. Based on these insights, we propose design recommendations to enhance the appeal and acceptability of hyper-realistic robots in social contexts. This research contributes to the growing field of human-robot interaction by offering empirical evidence on the potential of LLMs to bridge the uncanny valley, with implications for the future development of social robots.
[AI-82] FINCH: Locally Visualizing Higher-Order Feature Interactions in Black Box Models
Link: https://arxiv.org/abs/2503.16445
Authors: Anna Kleinau, Bernhard Preim, Monique Meuschke
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 18 figures
Abstract:In an era where black-box AI models are integral to decision-making across industries, robust methods for explaining these models are more critical than ever. While these models leverage complex feature interplay for accurate predictions, most explanation methods only assign relevance to individual features. There is a research gap in methods that effectively illustrate interactions between features, especially in visualizing higher-order interactions involving multiple features, which challenge conventional representation methods. To address this challenge in local explanations focused on individual instances, we employ a visual, subset-based approach to reveal relevant feature interactions. Our visual analytics tool FINCH uses coloring and highlighting techniques to create intuitive, human-centered visualizations, and provides additional views that enable users to calibrate their trust in the model and explanations. We demonstrate FINCH in multiple case studies to show its generalizability, and we conducted an extensive human study with machine learning experts to assess its helpfulness and usability. With this approach, FINCH allows users to visualize feature interactions involving any number of features locally.
[AI-83] Conversational Explanations: Discussing Explainable AI with Non-AI Experts
Link: https://arxiv.org/abs/2503.16444
Authors: Tong Zhang, Mengao Zhang, Wei Yan Low, X. Jessie Yang, Boyang Li
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to IUI 2025
Abstract:Explainable AI (XAI) aims to provide insights into the decisions made by AI models. To date, most XAI approaches provide only one-time, static explanations, which cannot cater to users’ diverse knowledge levels and information needs. Conversational explanations have been proposed as an effective method to customize XAI explanations. However, building conversational explanation systems is hindered by the scarcity of training data. Training with synthetic data faces two main challenges: lack of data diversity and hallucination in the generated data. To alleviate these issues, we introduce a repetition penalty to promote data diversity and exploit a hallucination detector to filter out untruthful synthetic conversation turns. We conducted both automatic and human evaluations on the proposed system, fEw-shot Multi-round ConvErsational Explanation (EMCEE). For automatic evaluation, EMCEE achieves relative improvements of 81.6% in BLEU and 80.5% in ROUGE compared to the baselines. EMCEE also mitigates the degeneration of data quality caused by training on synthetic data. In human evaluations (N=60), EMCEE outperforms baseline models and the control group in improving users’ comprehension, acceptance, trust, and collaboration with static explanations by large margins. Through a fine-grained analysis of model responses, we further demonstrate that training on self-generated synthetic data improves the model’s ability to generate more truthful and understandable answers, leading to better user interactions. To the best of our knowledge, this is the first conversational explanation method that can answer free-form user questions following static explanations.
[AI-84] Situational Agency: The Framework for Designing Behavior in Agent-based art
Link: https://arxiv.org/abs/2503.16442
Authors: Ary-Yue Huang, Varvara Guljajeva
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 8 pages, 5 figures, accepted by 30th International Symposium on Electronic Art (ISEA)
Abstract:In the context of artificial life art and agent-based art, this paper draws on Simon Penny’s “Aesthetic of Behavior” theory and Sofian Audry’s discussions on behavior computation to examine how artists design agent behaviors and the ensuing aesthetic experiences. We advocate for integrating the environment in which agents operate as the context for behavioral design, positing that the environment emerges through continuous interactions among agents, audiences, and other entities, forming an evolving network of meanings generated by these interactions. Artists create contexts by deploying and guiding these computational systems, audience participation, and agent behaviors through artist strategies. This framework is developed by analysing two categories of agent-based artworks, exploring the intersection of computational systems, audience participation, and artistic strategies in creating aesthetic experiences. This paper seeks to provide a contextual foundation and framework for designing agents’ behaviors by conducting a comparative study focused on behavioural design strategies by the artists.
[AI-85] Safe and Efficient Social Navigation through Explainable Safety Regions Based on Topological Features
Link: https://arxiv.org/abs/2503.16441
Authors: Victor Toscano-Duran, Sara Narteni, Alberto Carlevaro, Rocio Gonzalez-Diaz, Maurizio Mongelli, Jerome Guzzi
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); General Topology (math.GN)
Abstract:The recent adoption of artificial intelligence (AI) in robotics has driven the development of algorithms that enable autonomous systems to adapt to complex social environments. In particular, safe and efficient social navigation is a key challenge, requiring AI not only to avoid collisions and deadlocks but also to interact intuitively and predictably with its surroundings. To date, methods based on probabilistic models and the generation of conformal safety regions have shown promising results in defining safety regions with a controlled margin of error, primarily relying on classification approaches and explicit rules to describe collision-free navigation conditions. This work explores how topological features contribute to explainable safety regions in social navigation. Instead of using behavioral parameters, we leverage topological data analysis to classify and characterize different simulation behaviors. First, we apply global rule-based classification to distinguish between safe (collision-free) and unsafe scenarios based on topological properties. Then, we define safety regions, $S_\varepsilon$, in the topological feature space, ensuring a maximum classification error of $\varepsilon$. These regions are built with adjustable SVM classifiers and order statistics, providing robust decision boundaries. Local rules extracted from these regions enhance interpretability, keeping the decision-making process transparent. Our approach initially separates simulations with and without collisions, outperforming methods that do not incorporate topological features. It offers a deeper understanding of robot interactions within a navigable space. We further refine safety regions to ensure deadlock-free simulations and integrate both aspects to define a compliant simulation space that guarantees safe and efficient navigation.
[AI-86] Cause-effect perception in an object place task
Link: https://arxiv.org/abs/2503.16440
Authors: Nikolai Bahr, Christoph Zetzsche, Jaime Maldonado, Kerstin Schill
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures, submitted to: Frontiers in Cognition
Abstract:Algorithmic causal discovery is based on formal reasoning and provably converges toward the optimal solution. However, since some of the underlying assumptions are often not met in practice, no applications for autonomous everyday life competence are yet available. Humans, on the other hand, possess full everyday competence and develop cognitive models in a data-efficient manner with the ability to transfer knowledge between and to new situations. Here we investigate the causal discovery capabilities of humans in an object place task in virtual reality (VR) with haptic feedback and compare the results to the state-of-the-art causal discovery algorithms FGES, PC and FCI. In addition, we use the algorithms to analyze causal relations between sensory information and the kinematic parameters of human behavior. Our findings show that the majority of participants were able to determine which variables are causally related. This is in line with causal discovery algorithms like PC, which recover causal dependencies in the first step. However, unlike such algorithms, which can identify causes and effects in our test configuration, humans are unsure in determining a causal direction. Regarding the relation between the sensory information provided to the participants and their placing actions (i.e. their kinematic parameters), the data yields a surprising dissociation of the subjects’ knowledge and the sensorimotor level. Knowledge of the cause-effect pairs, though undirected, should suffice to improve subjects’ movements. Yet a detailed causal analysis provides little evidence for any such influence. This, together with the reports of the participants, implies that instead of exploiting their consciously perceived information they leave it to the sensorimotor level to control the movement.
[AI-87] DreamLLM-3D: Affective Dream Reliving using Large Language Model and 3D Generative AI NEURIPS
Link: https://arxiv.org/abs/2503.16439
Authors: Pinyao Liu, Keon Ju Lee, Alexander Steinmaurer, Claudia Picard-Deland, Michelle Carr, Alexandra Kitson
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 8 pages, 3 figures, Accepted by NeurIPS creative AI track 2024
Abstract:We present DreamLLM-3D, a composite multimodal AI system behind an immersive art installation for dream re-experiencing. It enables automated dream content analysis for immersive dream-reliving, by integrating a Large Language Model (LLM) with text-to-3D Generative AI. The LLM processes voiced dream reports to identify key dream entities (characters and objects), social interaction, and dream sentiment. The extracted entities are visualized as dynamic 3D point clouds, with emotional data influencing the color and soundscapes of the virtual dream environment. Additionally, we propose an experiential AI-Dreamworker Hybrid paradigm. Our system and paradigm could potentially facilitate a more emotionally engaging dream-reliving experience, enhancing personal insights and creativity.
[AI-88] Haunted House: A text-based game for comparing the flexibility of mental models in humans and LLMs
Link: https://arxiv.org/abs/2503.16437
Authors: Brett Puppart, Paul-Henry Paltmann, Jaan Aru
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Abstract:This study introduces “Haunted House”, a novel text-based game designed to compare the performance of humans and large language models (LLMs) in model-based reasoning. Players must escape from a house containing nine rooms in a 3x3 grid layout while avoiding the ghost. They are guided by verbal clues that they get each time they move. In Study 1, the results from 98 human participants revealed a success rate of 31.6%, significantly outperforming seven state-of-the-art LLMs tested. Out of 140 attempts across seven LLMs, only one attempt resulted in a pass by Claude 3 Opus. Preliminary results suggested that GPT o3-mini-high performance might be higher, but not at the human level. Further analysis of 29 human participants’ moves in Study 2 indicated that LLMs frequently struggled with random and illogical moves, while humans exhibited such errors less frequently. Our findings suggest that current LLMs encounter difficulties in tasks that demand active model-based reasoning, offering inspiration for future benchmarks.
[AI-89] Enhancing Human-Robot Collaboration through Existing Guidelines: A Case Study Approach
Link: https://arxiv.org/abs/2503.16436
Authors: Yutaka Matsubara, Akihisa Morikawa, Daichi Mizuguchi, Kiyoshi Fujiwara
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Abstract:As AI systems become more prevalent, concerns about their development, operation, and societal impact intensify. Establishing ethical, social, and safety standards amidst evolving AI capabilities poses significant challenges. Global initiatives are underway to establish guidelines for AI system development and operation. With the increasing use of collaborative human-AI task execution, it’s vital to continuously adapt AI systems to meet user and environmental needs. Failure to synchronize AI evolution with changes in users and the environment could result in ethical and safety issues. This paper evaluates the applicability of existing guidelines in human-robot collaborative systems, assesses their effectiveness, and discusses limitations. Through a case study, we examine whether our target system meets requirements outlined in existing guidelines and propose improvements to enhance human-robot interactions. Our contributions provide insights into interpreting and applying guidelines, offer concrete examples of system enhancement, and highlight their applicability and limitations. We believe these contributions will stimulate discussions and influence system assurance and certification in future AI-infused critical systems.
[AI-90] Interactive Sketchpad: An Interactive Multimodal System for Collaborative Visual Problem-Solving
Link: https://arxiv.org/abs/2503.16434
Authors: Steven-Shine Chen, Jimin Lee, Paul Pu Liang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Abstract:Humans have long relied on visual aids like sketches and diagrams to support reasoning and problem-solving. Visual tools, like auxiliary lines in geometry or graphs in calculus, are essential for understanding complex ideas. However, many tutoring systems remain text-based, providing feedback only through natural language. Leveraging recent advances in Large Multimodal Models (LMMs), this paper introduces Interactive Sketchpad, a tutoring system that combines language-based explanations with interactive visualizations to enhance learning. Built on a pre-trained LMM, Interactive Sketchpad is fine-tuned to provide step-by-step guidance in both text and visuals, enabling natural multimodal interaction with the student. Accurate and robust diagrams are generated by incorporating code execution into the reasoning process. User studies conducted on math problems such as geometry, calculus, and trigonometry demonstrate that Interactive Sketchpad leads to improved task comprehension, problem-solving accuracy, and engagement levels, highlighting its potential for transforming educational technologies.
[AI-91] OpenAI's Approach to External Red Teaming for AI Models and Systems
Link: https://arxiv.org/abs/2503.16431
Authors: Lama Ahmad, Sandhini Agarwal, Michael Lampe, Pamela Mishkin
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
Abstract:Red teaming has emerged as a critical practice in assessing the possible risks of AI models and systems. It aids in the discovery of novel risks, stress testing possible gaps in existing mitigations, enriching existing quantitative safety metrics, facilitating the creation of new safety measurements, and enhancing public trust and the legitimacy of AI risk assessments. This white paper describes OpenAI’s work to date in external red teaming and draws some more general conclusions from this work. We describe the design considerations underpinning external red teaming, which include: selecting composition of red team, deciding on access levels, and providing guidance required to conduct red teaming. Additionally, we show outcomes red teaming can enable such as input into risk assessment and automated evaluations. We also describe the limitations of external red teaming, and how it can fit into a broader range of AI model and system evaluations. Through these contributions, we hope that AI developers and deployers, evaluation creators, and policymakers will be able to better design red teaming campaigns and get a deeper look into how external red teaming can fit into model deployment and evaluation processes. These methods are evolving and the value of different methods continues to shift as the ecosystem around red teaming matures and models themselves improve as tools for red teaming.
[AI-92] Attention on Personalized Clinical Decision Support System: Federated Learning Approach
Link: https://arxiv.org/abs/2401.11736
Authors: Chu Myaet Thwal, Kyi Thar, Ye Lin Tun, Choong Seon Hong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: Published in IEEE BigComp 2021
Abstract:Health management has become a primary problem as new kinds of diseases and complex symptoms are introduced to a rapidly growing modern society. Building a better and smarter healthcare infrastructure is one of the ultimate goals of a smart city. To the best of our knowledge, neural network models are already employed to assist healthcare professionals in achieving this goal. Typically, training a neural network requires a rich amount of data but heterogeneous and vulnerable properties of clinical data introduce a challenge for the traditional centralized network. Moreover, adding new inputs to a medical database requires re-training an existing model from scratch. To tackle these challenges, we proposed a deep learning-based clinical decision support system trained and managed under a federated learning paradigm. We focused on a novel strategy to guarantee the safety of patient privacy and overcome the risk of cyberattacks while enabling large-scale clinical data mining. As a result, we can leverage rich clinical data for training each local neural network without the need for exchanging the confidential data of patients. Moreover, we implemented the proposed scheme as a sequence-to-sequence model architecture integrating the attention mechanism. Thus, our objective is to provide a personalized clinical decision support system with evolvable characteristics that can deliver accurate solutions and assist healthcare professionals in medical diagnosing.
[AI-93] Transformers with Attentive Federated Aggregation for Time Series Stock Forecasting
Link: https://arxiv.org/abs/2402.06638
Authors: Chu Myaet Thwal, Ye Lin Tun, Kitae Kim, Seong-Bae Park, Choong Seon Hong
Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Published in IEEE ICOIN 2023
Abstract:Recent innovations in transformers have shown their superior performance in natural language processing (NLP) and computer vision (CV). The ability to capture long-range dependencies and interactions in sequential data has also triggered a great interest in time series modeling, leading to the widespread use of transformers in many time series applications. However, being the most common and crucial application, the adaptation of transformers to time series forecasting has remained limited, with both promising and inconsistent results. In contrast to the challenges in NLP and CV, time series problems not only add the complexity of order or temporal dependence among input sequences but also consider trend, level, and seasonality information that much of this data is valuable for decision making. The conventional training scheme has shown deficiencies regarding model overfitting, data scarcity, and privacy issues when working with transformers for a forecasting task. In this work, we propose attentive federated transformers for time series stock forecasting with better performance while preserving the privacy of participating enterprises. Empirical results on various stock data from the Yahoo! Finance website indicate the superiority of our proposed scheme in dealing with the above challenges and data heterogeneity in federated learning.
Machine Learning
[LG-0] Gumbel-Softmax Flow Matching with Straight-Through Guidance for Controllable Biological Sequence Generation
Link: https://arxiv.org/abs/2503.17361
Authors: Sophia Tang, Yinuo Zhang, Alexander Tong, Pranam Chatterjee
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Abstract:Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel-Softmax Flow and Score Matching, a generative framework on the simplex based on a novel Gumbel-Softmax interpolant with a time-dependent temperature. Using this interpolant, we introduce Gumbel-Softmax Flow Matching by deriving a parameterized velocity field that transports from smooth categorical distributions to distributions concentrated at a single vertex of the simplex. We alternatively present Gumbel-Softmax Score Matching which learns to regress the gradient of the probability density. Our framework enables high-quality, diverse generation and scales efficiently to higher-dimensional simplices. To enable training-free guidance, we propose Straight-Through Guided Flows (STGFlow), a classifier-based guidance method that leverages straight-through estimators to steer the unconditional velocity field toward optimal vertices of the simplex. STGFlow enables efficient inference-time guidance using classifiers pre-trained on clean sequences, and can be used with any discrete flow method. Together, these components form a robust framework for controllable de novo sequence generation. We demonstrate state-of-the-art performance in conditional DNA promoter design, sequence-only protein generation, and target-binding peptide design for rare disease treatment.
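For readers unfamiliar with the relaxation this abstract builds on, here is a minimal sketch of the standard Gumbel-Softmax trick. It is not the paper's time-dependent interpolant or its velocity field. The function names, the four-letter alphabet, and the temperature values are illustrative assumptions. The sketch only shows how lowering the temperature moves a sample from the interior of the simplex toward a single vertex.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, noise):
    """Relaxed one-hot sample: softmax((logits + noise) / tau)."""
    scores = [(l + g) / tau for l, g in zip(logits, noise)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical setup: uniform logits over a 4-letter alphabet (e.g. DNA).
rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
noise = [sample_gumbel(rng) for _ in logits]

# The same noise is reused so that only the temperature changes.
for tau in (10.0, 1.0, 0.1):
    sample = gumbel_softmax(logits, tau, noise)
    print(tau, [round(p, 3) for p in sample])
```

The noise is passed in explicitly so that the effect of the temperature can be inspected in isolation. A schedule that changes `tau` over time, as the abstract describes, would call the same function with a time-dependent value.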
[LG-1] Predicting Potential Customer Support Needs and Optimizing Search Ranking in a Two-Sided Marketplace KDD2024
Link: https://arxiv.org/abs/2503.17329
Authors: Do-kyum Kim, Han Zhao, Huiji Gao, Liwei He, Malay Haldar, Sanjeev Katariya
Subjects: Machine Learning (cs.LG)
Comments: TSMO Workshop in conjunction with KDD 2024
Abstract:Airbnb is an online marketplace that connects hosts and guests to unique stays and experiences. When guests stay at homes booked on Airbnb, there are a small fraction of stays that lead to support needed from Airbnb’s Customer Support (CS), which may cause inconvenience to guests and hosts and require Airbnb resources to resolve. In this work, we show that instances where CS support is needed may be predicted based on hosts and guests behavior. We build a model to predict the likelihood of CS support needs for each match of guest and host. The model score is incorporated into Airbnb’s search ranking algorithm as one of the many factors. The change promotes more reliable matches in search results and significantly reduces bookings that require CS support.
[LG-2] 3D Neural Operator-Based Flow Surrogates around 3D geometries: Signed Distance Functions and Derivative Constraints
Link: https://arxiv.org/abs/2503.17289
Authors: Ali Rabeh, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Subjects: Machine Learning (cs.LG)
Abstract:Accurate modeling of fluid dynamics around complex geometries is critical for applications such as aerodynamic optimization and biomedical device design. While advancements in numerical methods and high-performance computing have improved simulation capabilities, the computational cost of high-fidelity 3D flow simulations remains a significant challenge. Scientific machine learning (SciML) offers an efficient alternative, enabling rapid and reliable flow predictions. In this study, we evaluate Deep Operator Networks (DeepONet) and Geometric-DeepONet, a variant that incorporates geometry information via signed distance functions (SDFs), on steady-state 3D flow over complex objects. Our dataset consists of 1,000 high-fidelity simulations spanning Reynolds numbers from 10 to 1,000, enabling comprehensive training and evaluation across a range of flow regimes. To assess model generalization, we test our models on a random and extrapolatory train-test splitting. Additionally, we explore a derivative-informed training strategy that augments standard loss functions with velocity gradient penalties and incompressibility constraints, improving physics consistency in 3D flow prediction. Our results show that Geometric-DeepONet improves boundary-layer accuracy by up to 32% compared to standard DeepONet. Moreover, incorporating derivative constraints enhances gradient accuracy by 25% in interpolation tasks and up to 45% in extrapolatory test scenarios, suggesting significant improvement in generalization capabilities to unseen 3D Reynolds numbers.
[LG-3] Offline Model-Based Optimization: Comprehensive Review
Link: https://arxiv.org/abs/2503.17286
Authors: Minsu Kim, Jiayao Gu, Ye Yuan, Taeyoung Yun, Zixuan Liu, Yoshua Bengio, Can Chen
Subjects: Machine Learning (cs.LG)
Comments: 29 pages
Abstract:Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking (reward hacking), exploiting model inaccuracies in unseen regions, or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization (MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field, including safe control of superintelligent systems.
[LG-4] Revisiting End To End Sparse Autoencoder Training – A Short Finetune is All You Need
Link: https://arxiv.org/abs/2503.17272
Authors: Adam Karvonen
Subjects: Machine Learning (cs.LG)
Abstract:Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss when replacing model activations with SAE reconstructions. Typically, SAEs are trained solely on mean squared error (MSE) using precomputed, shuffled activations. Recent work introduced training SAEs directly with a combination of KL divergence and MSE (“end-to-end” SAEs), significantly improving reconstruction accuracy at the cost of substantially increased computation, which has limited their widespread adoption. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens (just a few percent of typical training budgets) that achieves comparable improvements, reducing the cross-entropy loss gap by 20-50%, while incurring minimal additional computational cost. We further find that multiple fine-tuning methods (KL fine-tuning, LoRA adapters, linear adapters) yield similar, non-additive cross-entropy improvements, suggesting a common, easily correctable error source in MSE-trained SAEs. We demonstrate a straightforward method for effectively transferring hyperparameters and sparsity penalties despite scale differences between KL and MSE losses. While both ReLU and TopK SAEs see significant cross-entropy loss improvements, evaluations on supervised SAEBench metrics yield mixed results, suggesting practical benefits depend on both SAE architecture and the specific downstream task. Nonetheless, our method offers meaningful improvements in interpretability applications such as circuit analysis with minor additional cost.
[LG-5] On Privately Estimating a Single Parameter
Link: https://arxiv.org/abs/2503.17252
Authors: Hilal Asi, John C. Duchi, Kunal Talwar
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST)
Comments: 53 pages, 7 figures
Abstract:We investigate differentially private estimators for individual parameters within larger parametric models. While generic private estimators exist, the estimators we provide repose on new local notions of estimand stability, and these notions allow procedures that provide private certificates of their own stability. By leveraging these private certificates, we provide computationally and statistically efficient mechanisms that release private statistics that are, at least asymptotically in the sample size, essentially unimprovable: they achieve instance optimal bounds. Additionally, we investigate the practicality of the algorithms both in simulated data and in real-world data from the American Community Survey and US Census, highlighting scenarios in which the new procedures are successful and identifying areas for future work.
[LG-6] LoGoFair: Post-Processing for Local and Global Fairness in Federated Learning AAAI2025
Link: https://arxiv.org/abs/2503.17231
Authors: Li Zhang, Chaochao Chen, Zhongxuan Han, Qiyong Zhong, Xiaolin Zheng
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by AAAI2025
Abstract:Federated learning (FL) has garnered considerable interest for its capability to learn from decentralized data sources. Given the increasing application of FL in decision-making scenarios, addressing fairness issues across different sensitive groups (e.g., female, male) in FL is crucial. Current research often focuses on facilitating fairness at each client’s data (local fairness) or within the entire dataset across all clients (global fairness). However, existing approaches that focus exclusively on either local or global fairness fail to address two key challenges: (CH1) Under statistical heterogeneity, global fairness does not imply local fairness, and vice versa. (CH2) Achieving fairness under a model-agnostic setting. To tackle the aforementioned challenges, this paper proposes a novel post-processing framework for achieving both Local and Global Fairness in the FL context, namely LoGoFair. To address CH1, LoGoFair endeavors to seek the Bayes optimal classifier under local and global fairness constraints, which strikes the optimal accuracy-fairness balance in the probabilistic sense. To address CH2, LoGoFair employs a model-agnostic federated post-processing procedure that enables clients to collaboratively optimize global fairness while ensuring local fairness, thereby achieving the optimal fair classifier within FL. Experimental results on three real-world datasets further illustrate the effectiveness of the proposed LoGoFair framework.
[LG-7] ML-Based Bidding Price Prediction for Pay-As-Bid Ancillary Services Markets: A Use Case in the German Control Reserve Market
Link: https://arxiv.org/abs/2503.17214
Authors: Vincent Bezold, Lukas Baur, Alexander Sauer
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Applications (stat.AP); Machine Learning (stat.ML)
Abstract:The increasing integration of renewable energy sources has led to greater volatility and unpredictability in electricity generation, posing challenges to grid stability. Ancillary service markets, such as the German control reserve market, allow industrial consumers and producers to offer flexibility in their power consumption or generation, contributing to grid stability while earning additional income. However, many participants use simple bidding strategies that may not maximize their revenues. This paper presents a methodology for forecasting bidding prices in pay-as-bid ancillary service markets, focusing on the German control reserve market. We evaluate various machine learning models, including Support Vector Regression, Decision Trees, and k-Nearest Neighbors, and compare their performance against benchmark models. To address the asymmetry in the revenue function of pay-as-bid markets, we introduce an offset adjustment technique that enhances the practical applicability of the forecasting models. Our analysis demonstrates that the proposed approach improves potential revenues by 27.43 % to 37.31 % compared to baseline models. When analyzing the relationship between the model forecasting errors and the revenue, a negative correlation is measured for three markets; according to the results, a reduction of 1 EUR/MW model price forecasting error (MAE) statistically leads to a yearly revenue increase between 483 EUR/MW and 3,631 EUR/MW. The proposed methodology enables industrial participants to optimize their bidding strategies, leading to increased earnings and contributing to the efficiency and stability of the electrical grid.
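The asymmetry mentioned in this abstract can be illustrated with a toy example. In a pay-as-bid auction an accepted bid is paid its own price, while a bid above the highest accepted price earns nothing, so overshooting costs far more than undershooting. The sketch below is an assumption about how an offset adjustment might be applied, not the paper's procedure. The function names, the forecast values, and the candidate offsets are all hypothetical.

```python
def pay_as_bid_revenue(bid, highest_accepted_price):
    """Revenue of one bid: paid as bid if accepted, nothing otherwise."""
    return bid if bid <= highest_accepted_price else 0.0


def total_revenue(forecasts, highest_accepted_prices, offset):
    """Total revenue when every forecast is shifted by a constant offset."""
    return sum(
        pay_as_bid_revenue(forecast + offset, price)
        for forecast, price in zip(forecasts, highest_accepted_prices)
    )


def best_offset(forecasts, highest_accepted_prices, candidates):
    """Pick the candidate offset with the highest historical revenue."""
    return max(
        candidates,
        key=lambda o: total_revenue(forecasts, highest_accepted_prices, o),
    )


# Hypothetical model forecasts and realised highest accepted prices (EUR/MW).
forecasts = [10.0, 12.0, 11.0]
prices = [9.5, 12.5, 11.0]
candidates = [-2.0, -1.0, -0.5, 0.0, 0.5]

chosen = best_offset(forecasts, prices, candidates)
print("offset:", chosen)
print("revenue without offset:", total_revenue(forecasts, prices, 0.0))
print("revenue with offset:", total_revenue(forecasts, prices, chosen))
```

In this toy data a slightly negative offset wins because it turns one rejected bid into an accepted one while giving up only a little on the others. In practice the offset would be tuned on held-out data rather than on the same periods used for evaluation.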
[LG-8] Curriculum RL meets Monte Carlo Planning: Optimization of a Real World Container Management Problem
Link: https://arxiv.org/abs/2503.17194
Authors: Abhijeet Pendyala, Tobias Glasmachers
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we augment reinforcement learning with an inference-time collision model to ensure safe and efficient container management in a waste-sorting facility with limited processing capacity. Each container has two optimal emptying volumes that trade off higher throughput against overflow risk. Conventional reinforcement learning (RL) approaches struggle under delayed rewards, sparse critical events, and high-dimensional uncertainty – failing to consistently balance higher-volume empties with the risk of safety-limit violations. To address these challenges, we propose a hybrid method comprising: (1) a curriculum-learning pipeline that incrementally trains a PPO agent to handle delayed rewards and class imbalance, and (2) an offline pairwise collision model used at inference time to proactively avert collisions with minimal online cost. Experimental results show that our targeted inference-time collision checks significantly improve collision avoidance, reduce safety-limit violations, maintain high throughput, and scale effectively across varying container-to-PU ratios. These findings offer actionable guidelines for designing safe and efficient container-management systems in real-world facilities.
[LG-9] Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability
Link: https://arxiv.org/abs/2503.17173
Authors: Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Vijay Ganesh, Oscar Hernandez, Ada Sedova
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Under review at EuroPar 2025
Abstract:The ability of machine learning (ML) classification models to resist small, targeted input perturbations - known as adversarial attacks - is a key measure of their safety and reliability. We show that floating-point non-associativity (FPNA) coupled with asynchronous parallel programming on GPUs is sufficient to result in misclassification, without any perturbation to the input. Additionally, we show this misclassification is particularly significant for inputs close to the decision boundary and that standard adversarial robustness results may be overestimated by up to 4.6% when not considering machine-level details. We first study a linear classifier, before focusing on standard Graph Neural Network (GNN) architectures and datasets. We present a novel black-box attack using Bayesian optimization to determine external workloads that bias the output of reductions on GPUs and reliably lead to misclassification. Motivated by these results, we present a new learnable permutation (LP) gradient-based approach to learn floating-point operation orderings that lead to misclassifications, making the assumption that any reduction or permutation ordering is possible. This LP approach provides a worst-case estimate in a computationally efficient manner, avoiding the need to run identical experiments tens of thousands of times over a potentially large set of possible GPU states or architectures. Finally, we investigate parallel reduction ordering across different GPU architectures for a reduction under three conditions: (1) executing external background workloads, (2) utilizing multi-GPU virtualization, and (3) applying power capping. Our results demonstrate that parallel reduction ordering varies significantly across architectures under the first two conditions. The results and methods developed here can help to include machine-level considerations into adversarial robustness assessments.
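The core effect is easy to reproduce on a CPU. The sketch below is only an illustration, not the paper's GPU experiment: it sums the same three products in two different orders, as an asynchronous reduction might, and shows that a hypothetical linear classifier with a threshold at 0.6 assigns different labels to the same input.

```python
def linear_score(weights, features, order):
    """Accumulate w_i * x_i in the given order, mimicking one reduction ordering."""
    total = 0.0
    for i in order:
        total += weights[i] * features[i]
    return total


# Hypothetical input that sits right at the decision boundary.
weights = [1.0, 1.0, 1.0]
features = [0.1, 0.2, 0.3]
threshold = 0.6

score_a = linear_score(weights, features, [0, 1, 2])  # (0.1 + 0.2) + 0.3
score_b = linear_score(weights, features, [1, 2, 0])  # (0.2 + 0.3) + 0.1

label_a = int(score_a > threshold)
label_b = int(score_b > threshold)
print(score_a, score_b, label_a, label_b)
```

Both orderings are mathematically identical, yet the rounded results differ in the last bit, which is enough to flip the label for an input this close to the boundary.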
[LG-10] Principal Eigenvalue Regularization for Improved Worst-Class Certified Robustness of Smoothed Classifiers
Link: https://arxiv.org/abs/2503.17172
Authors: Gaojie Jin, Tianjin Huang, Ronghui Mu, Xiaowei Huang
Subjects: Machine Learning (cs.LG)
Comments: Under Review
Abstract:Recent studies have identified a critical challenge in deep neural networks (DNNs) known as "robust fairness", where models exhibit significant disparities in robust accuracy across different classes. While prior work has attempted to address this issue in adversarial robustness, the study of worst-class certified robustness for smoothed classifiers remains unexplored. Our work bridges this gap by developing a PAC-Bayesian bound for the worst-class error of smoothed classifiers. Through theoretical analysis, we demonstrate that the largest eigenvalue of the smoothed confusion matrix fundamentally influences the worst-class error of smoothed classifiers. Based on this insight, we introduce a regularization method that optimizes the largest eigenvalue of the smoothed confusion matrix to enhance worst-class accuracy of the smoothed classifier and further improve its worst-class certified robustness. We provide extensive experimental validation across multiple datasets and model architectures to demonstrate the effectiveness of our approach.
[LG-11] HiFi-Stream: Streaming Speech Enhancement with Generative Adversarial Networks
Link: https://arxiv.org/abs/2503.17141
Authors: Ekaterina Dmitrieva, Maksim Kaledin
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: 5 pages (4 content pages + 1 page of references)
Abstract:Speech Enhancement techniques have become core technologies in mobile devices and voice software, simplifying downstream speech tasks. Still, modern Deep Learning (DL) solutions often require a large amount of computational resources, which makes their usage on low-resource devices challenging. We present HiFi-Stream, an optimized version of the recently published HiFi++ model. Our experiments demonstrate that HiFi-Stream retains most of the qualities of the original model despite its reduced size and computational complexity: the lightest version has only around 490k parameters, which is a 3.5x reduction in comparison to the original HiFi++, making it one of the smallest and fastest models available. The model is evaluated in a streaming setting, where it demonstrates superior performance in comparison to modern baselines.
[LG-12] Structure Is Not Enough: Leveraging Behavior for Neural Network Weight Reconstruction ICLR
Link: https://arxiv.org/abs/2503.17138
Authors: Léo Meynent, Ivan Melev, Konstantin Schürholt, Göran Kauermann, Damian Borth
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025
Abstract:The weights of neural networks (NNs) have recently gained prominence as a new data modality in machine learning, with applications ranging from accuracy and hyperparameter prediction to representation learning or weight generation. One approach to leverage NN weights involves training autoencoders (AEs), using contrastive and reconstruction losses. This allows such models to be applied to a wide variety of downstream tasks, and they demonstrate strong predictive performance and low reconstruction error. However, despite the low reconstruction error, these AEs reconstruct NN models with deteriorated performance compared to the original ones, limiting their usability with regard to model weight generation. In this paper, we identify a limitation of weight-space AEs, specifically highlighting that a structural loss, that uses the Euclidean distance between original and reconstructed weights, fails to capture some features critical for reconstructing high-performing models. We analyze the addition of a behavioral loss for training AEs in weight space, where we compare the output of the reconstructed model with that of the original one, given some common input. We show a strong synergy between structural and behavioral signals, leading to increased performance in all downstream tasks evaluated, in particular NN weights reconstruction and generation.
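The difference between the two signals can be shown with a toy linear model. In the sketch below, which is a minimal illustration and not the authors' training code, two hypothetical reconstructions are equally far from the original weights in Euclidean distance, yet one of them changes the model's outputs far more on a shared set of inputs.

```python
def model_output(weights, x):
    """Toy linear model standing in for a neural network."""
    return sum(w * xi for w, xi in zip(weights, x))


def structural_loss(original, reconstructed):
    """Squared Euclidean distance between weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed))


def behavioral_loss(original, reconstructed, inputs):
    """Mean squared difference between the two models' outputs on common inputs."""
    diffs = [
        (model_output(original, x) - model_output(reconstructed, x)) ** 2
        for x in inputs
    ]
    return sum(diffs) / len(diffs)


original = [1.0, 0.0]
recon_a = [1.1, 0.0]  # error on the weight that sees small inputs
recon_b = [1.0, 0.1]  # same-size error on the weight that sees large inputs
inputs = [[1.0, 10.0], [2.0, 20.0]]  # hypothetical probe inputs

for name, recon in (("a", recon_a), ("b", recon_b)):
    print(name, structural_loss(original, recon), behavioral_loss(original, recon, inputs))
```

A purely structural objective cannot tell these two reconstructions apart, which is the kind of gap a behavioral term is meant to close.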
[LG-13] Large Language Model Compression via the Nested Activation-Aware Decomposition
Link: https://arxiv.org/abs/2503.17101
Authors: Jun Lu, Tianyi Xu, Bill Ding, David Li, Yu Kang
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In this paper, we tackle the critical challenge of compressing large language models (LLMs) to facilitate their practical deployment and broader adoption. We introduce a novel post-training compression paradigm that focuses on low-rank decomposition of LLM weights. Our analysis identifies two main challenges in this task: the variability in LLM activation distributions and handling unseen activations from different datasets and models. To address these challenges, we propose a nested activation-aware framework (NSVD) for LLMs, a training-free approach designed to enhance the accuracy of low-rank decompositions by managing activation outliers through transforming the weight matrix based on activation distribution and the original weight matrix. This method allows for the absorption of outliers into the transformed weight matrix, improving decomposition accuracy. Our comprehensive evaluation across eight datasets and six models from three distinct LLM families demonstrates the superiority of NSVD over current state-of-the-art methods, especially at medium to large compression ratios or in multilingual and multitask settings.
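For readers unfamiliar with activation-aware decomposition, the general idea can be written down in a few lines. The derivation below is a standard formulation from the low-rank compression literature, given here as background under stated assumptions; it is not a reproduction of the nested NSVD procedure, whose details are not in the abstract.

```latex
% Assumptions: W is a weight matrix, X holds calibration activations as columns,
% X X^T is positive definite, and S is an invertible factor with S S^T = X X^T
% (for example a Cholesky factor).
\| W X - W' X \|_F^2
  = \operatorname{tr}\!\big( (W - W')\, X X^\top (W - W')^\top \big)
  = \| (W - W')\, S \|_F^2

% By the Eckart-Young theorem, the rank-k minimizer is obtained by truncating
% the SVD of the transformed matrix W S and mapping back:
W'_k = \big[ W S \big]_k \, S^{-1}
% where [ . ]_k denotes the rank-k truncated SVD.
```

Truncating the transformed matrix rather than the raw weights minimizes the output error on the calibration activations instead of the plain weight error.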
[LG-14] Multi-Span Optical Power Spectrum Evolution Modeling using ML-based Multi-Decoder Attention Framework
Link: https://arxiv.org/abs/2503.17072
Authors: Agastya Raj, Zehao Wang, Frank Slyne, Tingjun Chen, Dan Kilper, Marco Ruffini
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This paper is a preprint of a paper accepted in ECOC 2024 and is subject to Institution of Engineering and Technology Copyright. A copy of record will be available at IET Digital Library
Abstract:We implement a ML-based attention framework with component-specific decoders, improving optical power spectrum prediction in multi-span networks. By reducing the need for in-depth training on each component, the framework can be scaled to multi-span topologies with minimal data collection, making it suitable for brown-field scenarios.
[LG-15] Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery
Link: https://arxiv.org/abs/2503.17037
Authors: Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge
Subjects: Machine Learning (cs.LG)
Comments: 4th Conference on Causal Learning and Reasoning
Abstract:Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables’ variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.
[LG-16] Do regularization methods for shortcut mitigation work as intended?
Link: https://arxiv.org/abs/2503.17015
Authors: Haoyang Hong, Ioanna Papanikolaou, Sonali Parbhoo
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Mitigating shortcuts, where models exploit spurious correlations in training data, remains a significant challenge for improving generalization. Regularization methods have been proposed to address this issue by enhancing model generalizability. However, we demonstrate that these methods can sometimes overregularize, inadvertently suppressing causal features along with spurious ones. In this work, we analyze the theoretical mechanisms by which regularization mitigates shortcuts and explore the limits of its effectiveness. Additionally, we identify the conditions under which regularization can successfully eliminate shortcuts without compromising causal features. Through experiments on synthetic and real-world datasets, our comprehensive analysis provides valuable insights into the strengths and limitations of regularization techniques for addressing shortcuts, offering guidance for developing more robust models.
[LG-17] TRACE: Time SeRies PArameter EffiCient FinE-tuning
Link: https://arxiv.org/abs/2503.16991
Authors: Yuze Li, Wei Zhu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We propose an efficient fine-tuning method for time series foundation models, termed TRACE: Time Series Parameter Efficient Fine-tuning. While pretrained time series foundation models are gaining popularity, they face the following challenges: (1) Unlike natural language tasks, time series data vary in frequency, channel numbers, historical/prediction lengths. For long-term forecasting tasks in particular, tailored fine-tuning can significantly enhance performance. (2) Existing parameter-efficient tuning methods like LoRA remain applicable but require adaptation to temporal characteristics. To address these challenges, our TRACE framework introduces two key innovations: (1) Gated DSIC (Gated Dynamic Simulation Importance Calculation), an unbiased LoRA module importance selection mechanism that ensures conditional parameter consistency before and after masking. Experiments demonstrate that Gated DSIC outperforms common fine-tuning. (2) Reconstructed prediction heads for long-term forecasting tasks, which achieve comparable or superior performance to linear probing heads while drastically reducing parameter counts. Extensive experiments on long-/short-term forecasting and anomaly detection tasks across diverse datasets, coupled with ablation studies, validate the effectiveness of our method.
[LG-18] Model-free front-to-end training of a large high performance laser neural network
Link: https://arxiv.org/abs/2503.16943
Authors: Anas Skalli, Satoshi Sunada, Mirko Goldmann, Marcin Gebski, Stephan Reitzenstein, James A. Lott, Tomasz Czyszanowski, Daniel Brunner
Subjects: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments:
Abstract:Artificial neural networks (ANNs) have become ubiquitous and revolutionized many applications ranging from computer vision to medical diagnoses. However, they offer a fundamentally connectionist and distributed approach to computing, in stark contrast to classical computers that use the von Neumann architecture. This distinction has sparked renewed interest in developing unconventional hardware to support more efficient implementations of ANNs, rather than merely emulating them on traditional systems. Photonics stands out as a particularly promising platform, providing scalability, high speed, energy efficiency, and the ability for parallel information processing. However, fully realized autonomous optical neural networks (ONNs) with in-situ learning capabilities are still rare. In this work, we demonstrate a fully autonomous and parallel ONN using a multimode vertical cavity surface emitting laser (VCSEL) using off-the-shelf components. Our ONN is highly efficient and is scalable both in network size and inference bandwidth towards the GHz range. High performance hardware-compatible optimization algorithms are necessary in order to minimize reliance on external von Neumann computers to fully exploit the potential of ONNs. As such we present and extensively study several algorithms which are broadly compatible with a wide range of systems. We then apply these algorithms to optimize our ONN, and benchmark them using the MNIST dataset. We show that our ONN can achieve high accuracy and convergence efficiency, even under limited hardware resources. Crucially, we compare these different algorithms in terms of scaling and optimization efficiency in terms of convergence time, which is crucial when working with limited external resources. Our work provides some guidance for the design of future ONNs as well as a simple and flexible way to train them.
[LG-19] MerGen: Micro-electrode recording synthesis using a generative data-driven approach
Link: https://arxiv.org/abs/2503.16928
Authors: Thibault Martin, Paul Sauleau, Claire Haegelen, Pierre Jannin, John S. H. Baxter
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 7 figures, 2 tables
Abstract:The analysis of electrophysiological data is crucial for certain surgical procedures such as deep brain stimulation, which has been adopted for the treatment of a variety of neurological disorders. During the procedure, auditory analysis of these signals helps the clinical team to infer the neuroanatomical location of the stimulation electrode and thus optimize clinical outcomes. This task is complex, and requires an expert who in turn requires significant training. In this paper, we propose a generative neural network, called MerGen, capable of simulating de novo electrophysiological recordings, with a view to providing a realistic learning tool for clinician trainees for identifying these signals. We demonstrate that the generated signals are perceptually indistinguishable from real signals by experts in the field, and that it is even possible to condition the generation efficiently to provide a didactic simulator adapted to a particular surgical scenario. The efficacy of this conditioning is demonstrated, comparing it to intra-observer and inter-observer variability amongst experts. We also demonstrate the use of this network for data augmentation for automatic signal classification which can play a role in decision-making support in the operating theatre.
[LG-20] Malliavin-Bismut Score-based Diffusion Models
Link: https://arxiv.org/abs/2503.16917
Authors: Ehsan Mirafzali, Utkarsh Gupta, Patrick Wyrod, Frank Proske, Daniele Venturi, Razvan Marinescu
Subjects: Machine Learning (cs.LG); Probability (math.PR)
Comments:
Abstract:We introduce a new framework that employs Malliavin calculus to derive explicit expressions for the score function – i.e., the gradient of the log-density – associated with solutions to stochastic differential equations (SDEs). Our approach integrates classical integration-by-parts techniques with modern tools, such as Bismut’s formula and Malliavin calculus, to address linear and nonlinear SDEs. In doing so, we establish a rigorous connection between the Malliavin derivative, its adjoint (the Malliavin divergence or the Skorokhod integral), Bismut’s formula, and diffusion generative models, thus providing a systematic method for computing $\nabla \log p_t(x)$. For the linear case, we present a detailed study proving that our formula is equivalent to the actual score function derived from the solution of the Fokker–Planck equation for linear SDEs. Additionally, we derive a closed-form expression for $\nabla \log p_t(x)$ for nonlinear SDEs with state-independent diffusion coefficients. These advancements provide fresh theoretical insights into the smoothness and structure of probability densities and practical implications for score-based generative modelling, including the design and analysis of new diffusion models. Moreover, our findings promote the adoption of the robust Malliavin calculus framework in machine learning research. These results directly apply to various pure and applied mathematics fields, such as generative modelling, the study of SDEs driven by fractional Brownian motion, and the Fokker–Planck equations associated with nonlinear SDEs.
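As a concrete reference point for the linear case, the score of a one-dimensional Ornstein-Uhlenbeck process is available in closed form. The worked example below is a textbook result chosen for illustration (the process and its constants are assumptions, not taken from the paper), and it is the kind of expression any score formula for linear SDEs has to reproduce.

```latex
% Assumed example: Ornstein-Uhlenbeck process with constants \theta > 0, \sigma > 0
% and a deterministic start X_0 = x_0.
dX_t = -\theta X_t \, dt + \sigma \, dW_t

% The solution is Gaussian at every time t > 0:
X_t \sim \mathcal{N}(m_t, v_t), \qquad
m_t = x_0 e^{-\theta t}, \qquad
v_t = \frac{\sigma^2}{2\theta} \big( 1 - e^{-2\theta t} \big)

% so the score is linear in x:
\nabla \log p_t(x) = -\frac{x - m_t}{v_t}
```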
[LG-21] TeMP-TraG: Edge-based Temporal Message Passing in Transaction Graphs
Link: https://arxiv.org/abs/2503.16901
Authors: Steve Gounoue, Ashutosh Sao, Simon Gottschalk
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transaction graphs, which represent financial and trade transactions between entities such as bank accounts and companies, can reveal patterns indicative of financial crimes like money laundering and fraud. However, effective detection of such cases requires node and edge classification methods capable of addressing the unique challenges of transaction graphs, including rich edge features, multigraph structures and temporal dynamics. To tackle these challenges, we propose TeMP-TraG, a novel graph neural network mechanism that incorporates temporal dynamics into message passing. TeMP-TraG prioritises more recent transactions when aggregating node messages, enabling better detection of time-sensitive patterns. We demonstrate that TeMP-TraG improves four state-of-the-art graph neural networks by 6.19% on average. Our results highlight TeMP-TraG as an advancement in leveraging transaction graphs to combat financial crime.
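One simple way to prioritise recent transactions during aggregation is to weight each incoming edge message by how recent it is. The sketch below uses an exponential decay for this; the decay form, the function name, and the numbers are assumptions made for illustration, since the abstract does not specify the exact weighting TeMP-TraG uses.

```python
import math


def recency_weighted_aggregate(edge_messages, now, decay):
    """Weighted mean of edge messages, with weights exp(-decay * age).

    edge_messages: list of (timestamp, value) pairs for one target node.
    decay = 0 reduces to the ordinary mean used by time-agnostic aggregation.
    """
    weights = [math.exp(-decay * (now - timestamp)) for timestamp, _ in edge_messages]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, edge_messages))
    return weighted_sum / sum(weights)


# Hypothetical transactions arriving at one account: (time, scalar message).
edge_messages = [(1.0, 10.0), (5.0, 20.0), (9.0, 60.0)]

plain_mean = recency_weighted_aggregate(edge_messages, now=10.0, decay=0.0)
recent_heavy = recency_weighted_aggregate(edge_messages, now=10.0, decay=1.0)
print(plain_mean, recent_heavy)
```

With a positive decay the aggregate moves toward the most recent message, so a sudden change in transaction behaviour is not averaged away by older edges.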
[LG-22] Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation
Link: https://arxiv.org/abs/2503.16893
Authors: Jingzhi Fang, Yanyan Shen, Yue Wang, Lei Chen
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:As large language models (LLMs) have shown great success in many tasks, they are used in various applications. While a lot of works have focused on the efficiency of single-LLM applications (e.g., offloading, request scheduling, parallelism strategy selection), multi-LLM applications receive less attention, particularly in offline inference scenarios. In this work, we aim to improve the offline end-to-end inference efficiency of multi-LLM applications in the single-node multi-GPU environment. The problem involves two key decisions: (1) determining which LLMs to run concurrently each time (we may not run all the models at the same time), and (2) selecting a parallelism strategy to use for each LLM. This problem is NP-hard. Naive solutions may not work well because the running time for a model to complete a set of requests depends on the request workload and the selected parallelism strategy, and they lack an accurate model of the running time. As the LLM output lengths are unknown before running, to estimate the model running time, we propose a sampling-then-simulation method which first estimates the output lengths by sampling from an empirical cumulative function we obtained from a large dataset in advance, and then simulates the LLM inference process accordingly. Based on the simulation, we estimate the per-iteration latencies to get the total latency. A greedy method is proposed to optimize the scheduling of the LLMs in the application across the GPUs. We then propose a framework SamuLLM which contains two phases: planning, which calls the greedy method for an application, and running, which runs the application and dynamically adjusts the model scheduling based on the runtime information. Experiments on 3 applications and a mixed application show that SamuLLM can achieve 1.0-2.4$\times$ end-to-end speedups compared to the competitors.
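The sampling-then-simulation idea can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not SamuLLM itself: output lengths are drawn uniformly from previously observed lengths (which is sampling from their empirical distribution), and per-iteration latency follows a hypothetical linear cost model in the number of running sequences.

```python
import random


def sample_output_lengths(observed_lengths, n_requests, rng):
    """Draw output lengths from the empirical distribution of observed lengths."""
    return [rng.choice(observed_lengths) for _ in range(n_requests)]


def simulate_total_latency(output_lengths, max_batch, base_ms, per_seq_ms):
    """Simulate iteration-level decoding; every length must be at least 1.

    Each iteration produces one token for every running sequence and costs
    base_ms + per_seq_ms * batch_size (an assumed cost model).
    """
    pending = list(output_lengths)
    running = []
    total_ms = 0.0
    while pending or running:
        # Fill free batch slots with waiting requests.
        while pending and len(running) < max_batch:
            running.append(pending.pop())
        total_ms += base_ms + per_seq_ms * len(running)
        # Sequences with one token left have just finished.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return total_ms


rng = random.Random(0)
observed = [12, 40, 7, 128, 64, 33]  # hypothetical historical output lengths
lengths = sample_output_lengths(observed, n_requests=8, rng=rng)
print(simulate_total_latency(lengths, max_batch=4, base_ms=10.0, per_seq_ms=1.0))
```

Repeating the simulation for each candidate parallelism strategy, with a cost model fitted to that strategy, gives the running-time estimates a scheduler needs before any model is actually launched.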
[LG-23] Nonparametric Factor Analysis and Beyond AISTATS2025
Link: https://arxiv.org/abs/2503.16865
Authors: Yujia Zheng, Yang Liu, Jiaxiong Yao, Yingyao Hu, Kun Zhang
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: AISTATS 2025
Abstract:Nearly all identifiability results in unsupervised representation learning inspired by, e.g., independent component analysis, factor analysis, and causal representation learning, rely on assumptions of additive independent noise or noiseless regimes. In contrast, we study the more general case where noise can take arbitrary forms, depend on latent variables, and be non-invertibly entangled within a nonlinear function. We propose a general framework for identifying latent variables in the nonparametric noisy settings. We first show that, under suitable conditions, the generative model is identifiable up to certain submanifold indeterminacies even in the presence of non-negligible noise. Furthermore, under the structural or distributional variability conditions, we prove that latent variables of the general nonlinear models are identifiable up to trivial indeterminacies. Based on the proposed theoretical framework, we have also developed corresponding estimation methods and validated them in various synthetic and real-world settings. Interestingly, our estimate of the true GDP growth from alternative measurements suggests more insightful information on the economies than official reports. We expect our framework to provide new insight into how both researchers and practitioners deal with latent variables in real-world scenarios.
[LG-24] PRIOT: Pruning-Based Integer-Only Transfer Learning for Embedded Systems
Link: https://arxiv.org/abs/2503.16860
Authors: Honoka Anada, Sefutsu Ryu, Masayuki Usui, Tatsuya Kaneko, Shinya Takamaeda-Yamazaki
Subjects: Machine Learning (cs.LG)
Comments: Accepted for publication in IEEE Embedded Systems Letters
Abstract:On-device transfer learning is crucial for adapting a common backbone model to the unique environment of each edge device. Tiny microcontrollers, such as the Raspberry Pi Pico, are key targets for on-device learning but often lack floating-point units, necessitating integer-only training. Dynamic computation of quantization scale factors, which is adopted in former studies, incurs high computational costs. Therefore, this study focuses on integer-only training with static scale factors, which is challenging with existing training methods. We propose a new training method named PRIOT, which optimizes the network by pruning selected edges rather than updating weights, allowing effective training with static scale factors. The pruning pattern is determined by the edge-popup algorithm, which trains a parameter named score assigned to each edge instead of the original parameters and prunes the edges with low scores before inference. Additionally, we introduce a memory-efficient variant, PRIOT-S, which only assigns scores to a small fraction of edges. We implement PRIOT and PRIOT-S on the Raspberry Pi Pico and evaluate their accuracy and computational costs using a tiny CNN model on the rotated MNIST dataset and the VGG11 model on the rotated CIFAR-10 dataset. Our results demonstrate that PRIOT improves accuracy by 8.08 to 33.75 percentage points over existing methods, while PRIOT-S reduces memory footprint with minimal accuracy loss.
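The inference side of score-based pruning is straightforward to sketch. In the example below, which is an illustration with hypothetical names and values rather than the PRIOT implementation, each edge carries a score, only the highest-scoring fraction of edges is kept, and the frozen integer weights are never modified. How the scores themselves are trained on-device is omitted.

```python
def top_k_mask(scores, keep_fraction):
    """Return a 0/1 mask that keeps the highest-scoring fraction of edges."""
    k = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:k])
    return [1 if i in kept else 0 for i in range(len(scores))]


def masked_dot(weights, mask, inputs):
    """Integer-only dot product over the edges that survive pruning."""
    return sum(w * m * x for w, m, x in zip(weights, mask, inputs))


# Hypothetical frozen integer weights, learned edge scores, and integer inputs.
weights = [3, -1, 2, 4]
scores = [5, -2, 3, 0]
inputs = [1, 2, 3, 4]

mask = top_k_mask(scores, keep_fraction=0.5)
print(mask, masked_dot(weights, mask, inputs))
```

Because adaptation happens only through the mask, the weights and their quantization scale factors can stay fixed, which is what makes static scale factors workable in this setting.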
[LG-25] An Accelerated Bregman Algorithm for ReLU-based Symmetric Matrix Decomposition
Link: https://arxiv.org/abs/2503.16846
Authors: Qingsong Wang
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures
Abstract:Symmetric matrix decomposition is an active research area in machine learning. This paper focuses on exploiting the low-rank structure of non-negative and sparse symmetric matrices via the rectified linear unit (ReLU) activation function. We propose the ReLU-based nonlinear symmetric matrix decomposition (ReLU-NSMD) model, introduce an accelerated alternating partial Bregman (AAPB) method for its solution, and present the algorithm’s convergence results. Our algorithm leverages the Bregman proximal gradient framework to overcome the challenge of estimating the global $L$-smooth constant in the classic proximal gradient algorithm. Numerical experiments on synthetic and real datasets validate the effectiveness of our model and algorithm.
[LG-26] Preferential Multi-Objective Bayesian Optimization for Drug Discovery
Link: https://arxiv.org/abs/2503.16841
Authors: Tai Dang, Long-Hung Pham, Sang T. Truong, Ari Glenn, Wendy Nguyen, Edward A. Pham, Jeffrey S. Glenn, Sanmi Koyejo, Thang Luong
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Biomolecules (q-bio.BM)
Comments:
Abstract:Despite decades of advancements in automated ligand screening, large-scale drug discovery remains resource-intensive and requires post-processing hit selection, a step where chemists manually select a few promising molecules based on their chemical intuition. This creates a major bottleneck in the virtual screening process for drug discovery, demanding experts to repeatedly balance complex trade-offs among drug properties across a vast pool of candidates. To improve the efficiency and reliability of this process, we propose a novel human-centered framework named CheapVS that allows chemists to guide the ligand selection process by providing preferences regarding the trade-offs between drug properties via pairwise comparison. Our framework combines preferential multi-objective Bayesian optimization with a docking model for measuring binding affinity to capture human chemical intuition for improving hit identification. Specifically, on a library of 100K chemical candidates targeting EGFR and DRD2, CheapVS outperforms state-of-the-art screening methods in identifying drugs within a limited computational budget. Notably, our method can recover up to 16/37 EGFR and 37/58 DRD2 known drugs while screening only 6% of the library, showcasing its potential to significantly advance drug discovery.
[LG-27] A Flexible Fairness Framework with Surrogate Loss Reweighting for Addressing Sociodemographic Disparities
Link: https://arxiv.org/abs/2503.16836
Authors: Wen Xu, Elham Dolatabadi
Subjects: Machine Learning (cs.LG)
Comments: Under review
Abstract:This paper presents a new algorithmic fairness framework called $\boldsymbol{\alpha}$-$\boldsymbol{\beta}$ Fair Machine Learning ($\boldsymbol{\alpha}$-$\boldsymbol{\beta}$ FML), designed to optimize fairness levels across sociodemographic attributes. Our framework employs a new family of surrogate loss functions, paired with loss reweighting techniques, allowing precise control over fairness-accuracy trade-offs through tunable hyperparameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. To efficiently solve the learning objective, we propose Parallel Stochastic Gradient Descent with Surrogate Loss (P-SGD-S) and establish convergence guarantees for both convex and nonconvex loss functions. Experimental results demonstrate that our framework improves overall accuracy while reducing fairness violations, offering a smooth trade-off between standard empirical risk minimization and strict minimax fairness. Results across multiple datasets confirm its adaptability, ensuring fairness improvements without excessive performance degradation.
[LG-28] When Debate Fails: Bias Reinforcement in Large Language Models ICLR2025
Link: https://arxiv.org/abs/2503.16814
Authors: Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun
Subjects: Machine Learning (cs.LG)
Comments: Published at ICLR 2025 Workshop on Reasoning and Planning for LLMs. First two authors contributed equally
Abstract:Large Language Models (LLMs) solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate (MAD) has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce MetaNIM Arena, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt), a novel framework that (1) refines LLM’s strategic prior knowledge to improve reasoning quality and (2) promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that DReaMAD significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
[LG-29] BEAC: Imitating Complex Exploration and Task-oriented Behaviors for Invisible Object Nonprehensile Manipulation
链接: https://arxiv.org/abs/2503.16803
作者: Hirotaka Tahara,Takamitsu Matsubara
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 27 pages
点击查看摘要
Abstract:Applying imitation learning (IL) to nonprehensile manipulation of invisible objects under partial observation, such as excavating buried rocks, is challenging. The demonstrator must make complex action decisions, both exploring to find the object and taking task-oriented actions to complete the task, while estimating the object's hidden state. This can lead to inconsistent action demonstrations and a high cognitive load. Work in human cognitive science suggests that having the demonstrator follow pre-designed, simple exploration rules may alleviate both problems. Therefore, when performing imitation learning from demonstrations that use such exploration rules, it is important to accurately imitate not only the demonstrator’s task-oriented behavior but also his/her mode-switching behavior (exploratory or task-oriented) under partial observation. Based on these considerations, this paper proposes a novel imitation learning framework called Belief Exploration-Action Cloning (BEAC), which has a switching policy structure between a pre-designed exploration policy and a task-oriented action policy trained on belief states estimated from past history. In simulation and real robot experiments, we confirmed that our proposed method achieved the best task performance and higher mode and action prediction accuracies, while a user study indicated that it reduced the cognitive load of demonstration.
[LG-30] Physics-Informed Deep B-Spline Networks for Dynamical Systems
链接: https://arxiv.org/abs/2503.16777
作者: Zhuoyuan Wang,Raffaele Romagnoli,Jasmine Ratchford,Yorie Nakahira
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Physics-informed machine learning provides an approach to combining data and governing physics laws for solving complex partial differential equations (PDEs). However, efficiently solving PDEs with varying parameters and changing initial conditions and boundary conditions (ICBCs) with theoretical guarantees remains an open challenge. We propose a hybrid framework that uses a neural network to learn B-spline control points to approximate solutions to PDEs with varying system and ICBC parameters. The proposed network can be trained efficiently as one can directly specify ICBCs without imposing losses, calculate physics-informed loss functions through analytical formulas, and requires only learning the weights of B-spline functions as opposed to both weights and basis as in traditional neural operator learning methods. We provide theoretical guarantees that the proposed B-spline networks serve as universal approximators for the set of solutions of PDEs with varying ICBCs under mild conditions and establish bounds on the generalization errors in physics-informed learning. We also demonstrate in experiments that the proposed B-spline network can solve problems with discontinuous ICBCs and outperforms existing methods, and is able to learn solutions of 3D dynamics with diverse initial conditions.
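As a rough illustration of the representation described in this abstract, the sketch below evaluates a clamped B-spline from a list of control points using the standard Cox-de Boor recursion. In the paper's setting the control points would come from a neural network; here they are plain numbers. The function names, the uniform clamped knot vector, and the cubic degree are assumptions made for this example and are not taken from the paper.

```python
def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: i-th B-spline basis function of degree k at t."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:  # skip terms with a zero-width support (repeated knots)
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def clamped_knots(n_ctrl, degree):
    """Uniform clamped knot vector on [0, 1] (an assumed choice for this sketch)."""
    n_inner = n_ctrl - degree - 1
    inner = [(j + 1) / (n_inner + 1) for j in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)


def bspline_eval(ctrl, t, degree=3):
    """Evaluate the spline sum_i ctrl[i] * B_i(t) for t in [0, 1)."""
    knots = clamped_knots(len(ctrl), degree)
    return sum(c * bspline_basis(i, degree, t, knots) for i, c in enumerate(ctrl))


# Hypothetical control points; a network would output these in the paper's setting.
ctrl = [2.0, 0.5, -1.0, 3.0, 1.0]
print(bspline_eval(ctrl, 0.0))  # a clamped spline starts at the first control point
print(bspline_eval(ctrl, 0.3))
```

Because a clamped spline passes through its first control point, fixing that control point fixes the value at t = 0. This is one plausible reading of how boundary or initial values can be specified directly rather than through a loss term.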
[LG-31] On Explaining (Large) Language Models For Code Using Global Code-Based Explanations
链接: https://arxiv.org/abs/2503.16771
作者: David N. Palacio,Dipin Khati,Daniel Rodriguez-Cardenas,Alejandro Velasco,Denys Poshyvanyk
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 12 pages, under revision
点击查看摘要
Abstract:In recent years, Language Models for Code (LLM4Code) have significantly changed the landscape of software engineering (SE) on downstream tasks, such as code generation, by making software development more efficient. Therefore, a growing interest has emerged in further evaluating these Language Models to homogenize the quality assessment of generated code. As the current evaluation process can significantly overreact on accuracy-based metrics, practitioners often seek methods to interpret LLM4Code outputs beyond canonical benchmarks. While the majority of research reports on code generation effectiveness in terms of expected ground truth, scant attention has been paid to LLMs’ explanations. In essence, the decision-making process to generate code is hard to interpret. To bridge this evaluation gap, we introduce code rationales (Code$Q$), a technique with rigorous mathematical underpinning, to identify subsets of tokens that can explain individual code predictions. We conducted a thorough Exploratory Analysis to demonstrate the method’s applicability and a User Study to understand the usability of code-based explanations. Our evaluation demonstrates that Code$Q$ is a powerful interpretability method to explain how (less) meaningful input concepts (i.e., the natural language particle 'at') highly impact output generation. Moreover, participants of this study highlighted Code$Q$'s ability to show a causal relationship between the input and output of the model with readable and informative explanations on code completion and test generation tasks. Additionally, Code$Q$ also helps to uncover model rationale, facilitating comparison with a human rationale to promote a fair level of trust and distrust in the model.
[LG-32] Fast online node labeling with graph subsampling
链接: https://arxiv.org/abs/2503.16755
作者: Yushen Huang,Ertai Luo,Reza Babenezhad,Yifan Sun
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Large data applications rely on storing data in massive, sparse graphs with millions to trillions of nodes. Graph-based methods, such as node prediction, aim for computational efficiency regardless of graph size. Techniques like localized approximate personalized page rank (APPR) solve sparse linear systems with a complexity that is independent of graph size but depends on the maximum node degree, which in practice can be much larger than the average node degree for real-world large graphs. In this paper, we consider an online subsampled APPR method, where messages are intentionally dropped at random. We use tools from graph sparsifiers and matrix linear algebra to give approximation bounds on the graph’s spectral properties ($O(1/\epsilon^2)$ edges), and node classification performance (added $O(n\epsilon)$ overhead).
[LG-33] Ordered Topological Deep Learning: a Network Modeling Case Study
链接: https://arxiv.org/abs/2503.16746
作者: Guillermo Bernárdez,Miquel Ferriol-Galmés,Carlos Güemes-Palau,Mathilde Papillon,Pere Barlet-Ros,Albert Cabellos-Aparicio,Nina Miolane
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Computer networks are the foundation of modern digital infrastructure, facilitating global communication and data exchange. As demand for reliable high-bandwidth connectivity grows, advanced network modeling techniques become increasingly essential to optimize performance and predict network behavior. Traditional modeling methods, such as packet-level simulators and queueing theory, have notable limitations, either being computationally expensive or relying on restrictive assumptions that reduce accuracy. In this context, the deep learning-based RouteNet family of models has recently redefined network modeling by showing an unprecedented cost-performance trade-off. In this work, we revisit RouteNet’s sophisticated design and uncover its hidden connection to Topological Deep Learning (TDL), an emerging field that models higher-order interactions beyond standard graph-based methods. We demonstrate that, although originally formulated as a heterogeneous Graph Neural Network, RouteNet serves as the first instantiation of a new form of TDL. More specifically, this paper presents OrdGCCN, a novel TDL framework that introduces the notion of ordered neighbors in arbitrary discrete topological spaces, and shows that RouteNet’s architecture can be naturally described as an ordered topological neural network. To the best of our knowledge, this marks the first successful real-world application of state-of-the-art TDL principles, which we confirm through extensive testbed experiments, laying the foundation for the next generation of ordered TDL-driven applications.
[LG-34] NeuroSep-CP-LCB: A Deep Learning-based Contextual Multi-armed Bandit Algorithm with Uncertainty Quantification for Early Sepsis Prediction
链接: https://arxiv.org/abs/2503.16708
作者: Anni Zhou,Raheem Beyah,Rishikesan Kamaleswaran
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In critical care settings, timely and accurate predictions can significantly impact patient outcomes, especially for conditions like sepsis, where early intervention is crucial. We aim to model patient-specific reward functions in a contextual multi-armed bandit setting. The goal is to leverage patient-specific clinical features to optimize decision-making under uncertainty. This paper proposes NeuroSep-CP-LCB, a novel integration of neural networks with contextual bandits and conformal prediction tailored for early sepsis detection. Unlike the algorithm pool selection problem in the previous paper, where the primary focus was identifying the most suitable pre-trained model for prediction tasks, this work directly models the reward function using a neural network, allowing for personalized and adaptive decision-making. Combining the representational power of neural networks with the robustness of conformal prediction intervals, this framework explicitly accounts for uncertainty in offline data distributions and provides actionable confidence bounds on predictions.
[LG-35] Deep Q-Learning with Gradient Target Tracking
链接: https://arxiv.org/abs/2503.16700
作者: Donghwan Lee,Bum Geun Park,Taeho Lee
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:This paper introduces Q-learning with gradient target tracking, a novel reinforcement learning framework that provides a learned continuous target update mechanism as an alternative to the conventional hard update paradigm. In the standard deep Q-network (DQN), the target network is a copy of the online network’s weights, held fixed for a number of iterations before being periodically replaced via a hard update. While this stabilizes training by providing consistent targets, it introduces a new challenge: the hard update period must be carefully tuned to achieve optimal performance. To address this issue, we propose two gradient-based target update methods: DQN with asymmetric gradient target tracking (AGT2-DQN) and DQN with symmetric gradient target tracking (SGT2-DQN). These methods replace the conventional hard target updates with continuous and structured updates using gradient descent, which effectively eliminates the need for manual tuning. We provide a theoretical analysis proving the convergence of these methods in tabular settings. Additionally, empirical evaluations demonstrate their advantages over standard DQN baselines, which suggest that gradient-based target updates can serve as an effective alternative to conventional target update mechanisms in Q-learning.
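To make the contrast between a hard target update and a continuous, gradient-based one concrete, here is a minimal sketch. It assumes the target parameters follow the online parameters by gradient descent on a squared-distance loss; the abstract does not state the exact loss used by AGT2-DQN or SGT2-DQN, so this is a generic illustration rather than the paper's algorithm. The function names, the step size `lr`, and the parameter values are hypothetical.

```python
def hard_update(target, online):
    """Conventional DQN update: copy the online weights into the target network."""
    return list(online)


def gradient_tracking_step(target, online, lr=0.1):
    """One gradient-descent step on 0.5 * ||target - online||^2 with respect to target.

    The gradient is (target - online), so the target moves a fraction lr of the
    way toward the online weights. The squared-distance loss is an assumption.
    """
    return [t - lr * (t - o) for t, o in zip(target, online)]


online = [1.0, 2.0]   # hypothetical online-network weights
target = [0.0, 0.0]   # hypothetical target-network weights

# The target is nudged at every iteration instead of being replaced every N iterations.
for _ in range(50):
    target = gradient_tracking_step(target, online, lr=0.1)
print(target)
```

With this particular loss the step coincides with a soft (Polyak-style) update, which shows how a gradient rule can remove the hard-update period as a tuning knob while introducing a step size in its place.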
[LG-36] ATOM: A Framework of Detecting Query-Based Model Extraction Attacks for Graph Neural Networks
链接: https://arxiv.org/abs/2503.16693
作者: Zhan Cheng,Bolin Shen,Tianming Sha,Yuan Gao,Shibo Li,Yushun Dong
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have gained traction in Graph-based Machine Learning as a Service (GMLaaS) platforms, yet they remain vulnerable to graph-based model extraction attacks (MEAs), where adversaries reconstruct surrogate models by querying the victim model. Existing defense mechanisms, such as watermarking and fingerprinting, suffer from poor real-time performance, susceptibility to evasion, or reliance on post-attack verification, making them inadequate for handling the dynamic characteristics of graph-based MEA variants. To address these limitations, we propose ATOM, a novel real-time MEA detection framework tailored for GNNs. ATOM integrates sequential modeling and reinforcement learning to dynamically detect evolving attack patterns, while leveraging $k$-core embedding to capture structural properties, enhancing detection precision. Furthermore, we provide theoretical analysis to characterize query behaviors and optimize detection strategies. Extensive experiments on multiple real-world datasets demonstrate that ATOM outperforms existing approaches in detection performance and remains stable across different time steps, thereby offering a more effective defense mechanism for GMLaaS environments.
[LG-37] A preliminary data fusion study to assess the feasibility of Foundation Process-Property Models in Laser Powder Bed Fusion
链接: https://arxiv.org/abs/2503.16667
作者: Oriol Vendrell-Gallart,Nima Negarandeh,Zahra Zanjani Foumani,Mahsa Amiri,Lorenzo Valdevit,Ramin Bostanabad
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Foundation models are at the forefront of an increasing number of critical applications. In regards to technologies such as additive manufacturing (AM), these models have the potential to dramatically accelerate process optimization and, in turn, design of next generation materials. A major challenge that impedes the construction of foundation process-property models is data scarcity. To understand the impact of this challenge, and since foundation models rely on data fusion, in this work we conduct controlled experiments where we focus on the transferability of information across different material systems and properties. More specifically, we generate experimental datasets from 17-4 PH and 316L stainless steels (SSs) in Laser Powder Bed Fusion (LPBF) where we measure the effect of five process parameters on porosity and hardness. We then leverage Gaussian processes (GPs) for process-property modeling in various configurations to test if knowledge about one material system or property can be leveraged to build more accurate machine learning models for other material systems or properties. Through extensive cross-validation studies and probing the GPs’ interpretable hyperparameters, we study the intricate relation among data size and dimensionality, complexity of the process-property relations, noise, and characteristics of machine learning models. Our findings highlight the need for structured learning approaches that incorporate domain knowledge in building foundation process-property models rather than relying on uninformed data fusion in data-limited applications.
[LG-38] Efficient Training of Neural Fractional-Order Differential Equation via Adjoint Backpropagation AAAI
链接: https://arxiv.org/abs/2503.16666
作者: Qiyu Kang,Xuhao Li,Kai Zhao,Wenjun Cui,Yanan Zhao,Weihua Deng,Wee Peng Tay
类目: Machine Learning (cs.LG)
*备注: AAAI Conference on Artificial Intelligence 2025
点击查看摘要
Abstract:Fractional-order differential equations (FDEs) enhance traditional differential equations by extending the order of differential operators from integers to real numbers, offering greater flexibility in modeling complex dynamical systems with nonlocal characteristics. Recent progress at the intersection of FDEs and deep learning has catalyzed a new wave of innovative models, demonstrating the potential to address challenges such as graph representation learning. However, training neural FDEs has primarily relied on direct differentiation through forward-pass operations in FDE numerical solvers, leading to increased memory usage and computational complexity, particularly in large-scale applications. To address these challenges, we propose a scalable adjoint backpropagation method for training neural FDEs by solving an augmented FDE backward in time, which substantially reduces memory requirements. This approach provides a practical neural FDE toolbox and holds considerable promise for diverse applications. We demonstrate the effectiveness of our method in several tasks, achieving performance comparable to baseline models while significantly reducing computational overhead.
[LG-39] ContextGNN goes to Elliot: Towards Benchmarking Relational Deep Learning for Static Link Prediction (aka Personalized Item Recommendation)
链接: https://arxiv.org/abs/2503.16661
作者: Alejandro Ariza-Casabona,Nikos Kanakaris,Daniele Malitesta
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Relational deep learning (RDL) ranks among the most exciting advances in machine learning for relational databases, leveraging the representational power of message passing graph neural networks (GNNs) to derive useful knowledge and run prediction tasks on tables connected through primary-to-foreign key links. The RDL paradigm has recently been applied successfully to recommendation through its most recent representative deep learning architecture, namely ContextGNN. While acknowledging ContextGNN’s improved performance on real-world recommendation datasets and tasks, preliminary tests for the more traditional static link prediction task (aka personalized item recommendation) on the popular Amazon Book dataset have demonstrated that ContextGNN still has room for improvement compared to other state-of-the-art GNN-based recommender systems. To this end, with this paper, we integrate ContextGNN within Elliot, a popular framework for reproducibility and benchmarking analyses, counting around 50 state-of-the-art recommendation models from the literature to date. On such basis, we run preliminary experiments on three standard recommendation datasets and against six state-of-the-art GNN-based recommender systems, confirming similar trends to those observed by the authors in their original paper. The code is publicly available on GitHub: this https URL.
[LG-40] Advances in Protein Representation Learning: Methods, Applications, and Future Directions
链接: https://arxiv.org/abs/2503.16659
作者: Viet Thanh Duy Nguyen,Truong-Son Hy
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Proteins are complex biomolecules that play a central role in various biological processes, making them critical targets for breakthroughs in molecular biology, medical research, and drug discovery. Deciphering their intricate, hierarchical structures, and diverse functions is essential for advancing our understanding of life at the molecular level. Protein Representation Learning (PRL) has emerged as a transformative approach, enabling the extraction of meaningful computational representations from protein data to address these challenges. In this paper, we provide a comprehensive review of PRL research, categorizing methodologies into five key areas: feature-based, sequence-based, structure-based, multimodal, and complex-based approaches. To support researchers in this rapidly evolving field, we introduce widely used databases for protein sequences, structures, and functions, which serve as essential resources for model development and evaluation. We also explore the diverse applications of these approaches in multiple domains, demonstrating their broad impact. Finally, we discuss pressing technical challenges and outline future directions to advance PRL, offering insights to inspire continued innovation in this foundational field.
[LG-41] To impute or not to impute: How machine learning modelers treat missing data
链接: https://arxiv.org/abs/2503.16644
作者: Wanyi Chen,Mary Cummings
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Missing data is prevalent in tabular machine learning (ML) models, and different missing data treatment methods can significantly affect ML model training results. However, little is known about how ML researchers and engineers choose missing data treatment methods and what factors affect their choices. To this end, we conducted a survey of 70 ML researchers and engineers. Our results revealed that most participants were not making informed decisions regarding missing data treatment, which could significantly affect the validity of the ML models trained by these researchers. We advocate for better education on missing data, more standardized missing data reporting, and better missing data analysis tools.
[LG-42] Whenever, Wherever: Towards Orchestrating Crowd Simulations with Spatio-Temporal Spawn Dynamics
链接: https://arxiv.org/abs/2503.16639
作者: Thomas Kreutz,Max Mühlhäuser,Alejandro Sanchez Guinea
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Realistic crowd simulations are essential for immersive virtual environments, relying on both individual behaviors (microscopic dynamics) and overall crowd patterns (macroscopic characteristics). While recent data-driven methods like deep reinforcement learning improve microscopic realism, they often overlook critical macroscopic features such as crowd density and flow, which are governed by spatio-temporal spawn dynamics, namely, when and where agents enter a scene. Traditional methods, like random spawn rates, stochastic processes, or fixed schedules, are not guaranteed to capture the underlying complexity or lack diversity and realism. To address this issue, we propose a novel approach called nTPP-GMM that models spatio-temporal spawn dynamics using Neural Temporal Point Processes (nTPPs) that are coupled with a spawn-conditional Gaussian Mixture Model (GMM) for agent spawn and goal positions. We evaluate our approach by orchestrating crowd simulations of three diverse real-world datasets with nTPP-GMM. Our experiments demonstrate the orchestration with nTPP-GMM leads to realistic simulations that reflect real-world crowd scenarios and allow crowd analysis.
[LG-43] Informative Path Planning to Explore and Map Unknown Planetary Surfaces with Gaussian Processes
链接: https://arxiv.org/abs/2503.16613
作者: Ashten Akemoto,Frances Zhu
类目: Robotics (cs.RO); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Many environments, such as unvisited planetary surfaces and oceanic regions, remain unexplored due to a lack of prior knowledge. Autonomous vehicles must sample upon arrival, process data, and either transmit findings to a teleoperator or decide where to explore next. Teleoperation is suboptimal, as human intuition lacks mathematical guarantees for optimality. This study evaluates an informative path planning algorithm for mapping a scalar variable distribution while minimizing travel distance and ensuring model convergence. We compare traditional open loop coverage methods (e.g., Boustrophedon, Spiral) with information-theoretic approaches using Gaussian processes, which update models iteratively with confidence metrics. The algorithm’s performance is tested on three surfaces, a parabola, Townsend function, and lunar crater hydration map, to assess noise, convexity, and function behavior. Results demonstrate that information-driven methods significantly outperform naive exploration in reducing model error and travel distance while improving convergence potential.
[LG-44] Transformer-based Wireless Symbol Detection Over Fading Channels
链接: https://arxiv.org/abs/2503.16594
作者: Li Fan,Jing Yang,Cong Shen
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2411.07600
点击查看摘要
Abstract:Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high detection accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose the DEcision Feedback INcontExt Detection (DEFINED) solution as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts as pseudo-labels to improve the detection for subsequent symbols. Furthermore, we propose another detection method, IC-SSL, which combines ICL with Semi-Supervised Learning (SSL) to extract information from both labeled and unlabeled data during inference, thus avoiding the errors propagated during the decision feedback process of the original DEFINED. Extensive experiments across a broad range of wireless communication settings demonstrate that a small Transformer trained with DEFINED or IC-SSL achieves significant performance improvements over conventional methods, in some cases needing only a single pilot pair to match the performance of the latter with more than 4 pilot pairs.
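The decision feedback loop described in this abstract can be sketched as follows: each detected symbol is appended to the prompt as a pseudo-labeled pair before the next symbol is detected. The `stand_in_detector` below is a placeholder for the pre-trained Transformer and uses a crude least-squares fit purely so that the example runs. It is not the paper's model, and the real-valued BPSK alphabet, the function names, and the sample values are all assumptions.

```python
SYMBOLS = (-1.0, 1.0)  # BPSK alphabet, chosen only for illustration


def stand_in_detector(prompt, y):
    """Placeholder for the in-context Transformer (assumption, not the paper's model).

    Fits a scalar gain to the (transmitted, received) pairs in the prompt and
    returns the alphabet symbol whose predicted observation is closest to y.
    """
    num = sum(x * r for x, r in prompt)
    den = sum(x * x for x, _ in prompt)
    gain = num / den
    return min(SYMBOLS, key=lambda s: abs(y - gain * s))


def decision_feedback_detect(pilots, received, detector=stand_in_detector):
    """Detect symbols one by one, feeding each decision back into the prompt."""
    prompt = list(pilots)              # start from the (possibly single) pilot pair
    decisions = []
    for y in received:
        x_hat = detector(prompt, y)    # detect using the current prompt
        decisions.append(x_hat)
        prompt.append((x_hat, y))      # pseudo-label joins the prompt for later symbols
    return decisions, prompt


pilots = [(1.0, 0.8)]                  # one hypothetical pilot pair (x, y)
received = [0.7, -0.9, 0.85, -0.75]    # hypothetical noisy observations
decisions, prompt = decision_feedback_detect(pilots, received)
print(decisions)
```

A wrong decision would enter the prompt as a wrong pseudo-label and could bias later detections, which is the error propagation that the abstract's IC-SSL variant is meant to avoid.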
[LG-45] A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: How Many Repeats Are Enough?
链接: https://arxiv.org/abs/2503.16589
作者: Moslem Noori,Elisabetta Valiante,Thomas Van Vaerenbergh,Masoud Mohseni,Ignacio Rozada
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:A key trait of stochastic optimizers is that multiple runs of the same optimizer in attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics, and develop guidelines for experiment design to measure the optimizer’s performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence interval of the metrics and how they are related to the number of runs of an experiment. We then derive a lower bound on the number of repeats in order to guarantee achieving a given accuracy in the metrics. Using this bound, we propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis and how it allows us to conduct reliable benchmarking as well as hyperparameter tuning and prevent us from drawing premature conclusions regarding the performance of stochastic optimizers.
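As a simple worked instance of relating the number of repeats to accuracy and confidence, the sketch below uses Hoeffding's inequality for a metric bounded in [0, 1], such as the success probability of an optimizer. This is a textbook bound chosen for illustration; the paper derives its own bounds, which may be tighter or apply to other metrics. The function name and the example tolerances are assumptions.

```python
import math


def min_repeats_hoeffding(eps, delta):
    """Smallest n with P(|estimate - true| >= eps) <= delta for a [0, 1]-valued metric.

    Hoeffding's inequality gives P(|mean_n - mu| >= eps) <= 2 * exp(-2 * n * eps**2),
    so it suffices to take n >= ln(2 / delta) / (2 * eps**2).
    """
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


# Estimate a success probability to within 0.05 with 95% confidence.
print(min_repeats_hoeffding(eps=0.05, delta=0.05))  # 738
# Halving the required accuracy to 0.10 cuts the repeats by roughly a factor of four.
print(min_repeats_hoeffding(eps=0.10, delta=0.05))  # 185
```

The inverse-square dependence on `eps` is why a handful of runs rarely supports fine-grained comparisons between stochastic optimizers.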
[LG-46] Exploring Deep Learning Models for EEG Neural Decoding
链接: https://arxiv.org/abs/2503.16567
作者: Laurits Dixen,Stefan Heinrich,Paolo Burelli
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Neural decoding is an important method in cognitive neuroscience that aims to decode brain representations from recorded neural activity using a multivariate machine learning model. The THINGS initiative provides a large EEG dataset of 46 subjects watching rapidly shown images. Here, we test the feasibility of using this method for decoding high-level object features using recent deep learning models. We create a derivative dataset of living vs. non-living entities from it, test 15 different deep learning models spanning 5 different architectures, and compare them to a SOTA linear model. We show that the linear model is not able to solve the decoding task, while almost all the deep learning models are successful, suggesting that in some cases non-linear models are needed to decode neural representations. We also run a comparative study of the models’ performance on individual object categories, and suggest how artificial neural networks can be used to study brain activity.
[LG-47] Bezier Distillation
链接: https://arxiv.org/abs/2503.16562
作者: Ling Feng,SK Yang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In Rectified Flow, by obtaining the rectified flow several times, the mapping relationship between distributions can be distilled into a neural network, and the target distribution can be directly predicted by the straight lines of the flow. However, during the pairing process of the mapping relationship, a large amount of error accumulation will occur, resulting in a decrease in performance after multiple rectifications. In the field of flow models, knowledge distillation of multi-teacher diffusion models is also a problem worthy of discussion in accelerating sampling. I intend to combine multi-teacher knowledge distillation with Bezier curves to solve the problem of error accumulation. Currently, the related paper is being written by myself.
[LG-48] Investigating Cultural Dimensions and Technological Acceptance: The Adoption of Electronic Performance and Tracking Systems in Qatar's Football Sector
链接: https://arxiv.org/abs/2503.16557
作者: Abdulaziz Al Mannai
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Qatar’s football sector has undergone a substantial technological transformation with the implementation of Electronic Performance and Tracking Systems (EPTS). This study examines the impact of cultural and technological factors on EPTS adoption, using Hofstede’s Cultural Dimensions Theory and the Technology Acceptance Model (TAM) as theoretical frameworks. An initial exploratory study involved ten participants, followed by an expanded dataset comprising thirty stakeholders, including players, coaches, and staff from Qatari football organizations. Multiple regression analysis was conducted to evaluate the relationships between perceived usefulness, perceived ease of use, power distance, innovation receptiveness, integration complexity, and overall adoption. The results indicate that perceived usefulness, innovation receptiveness, and lower power distance significantly drive EPTS adoption, while ease of use is marginally significant and integration complexity is non-significant in this sample. These findings provide practical insights for sports technology stakeholders in Qatar and emphasize the importance of aligning cultural considerations with technological readiness for successful EPTS integration.
[LG-49] AIDetection: A Generative AI Detection Tool for Educators Using Syntactic Matching of Common ASCII Characters As Potential AI Traces Within User's Internet Browser WWW
链接: https://arxiv.org/abs/2503.16503
作者: Andy Buschmann
类目: Human-Computer Interaction (cs.HC); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); General Economics (econ.GN)
*备注: 10 pages, 3 figures, online version of the script: this https URL , source code available upon request
点击查看摘要
Abstract:This paper introduces a simple JavaScript-based web application designed to assist educators in detecting AI-generated content in student essays and written assignments. Unlike existing AI detection tools that rely on obfuscated machine learning models, this http URL employs a heuristic-based approach to identify common syntactic traces left by generative AI models, such as ChatGPT, Claude, Grok, DeepSeek, Gemini, Llama/Meta, Microsoft Copilot, Grammarly AI, and other text-generating models and wrapper applications. The tool scans documents in bulk for potential AI artifacts, as well as AI citations and acknowledgments, and provides a visual summary with downloadable Excel and CSV reports. This article details its methodology, functionalities, limitations, and applications within educational settings.
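A heuristic, pattern-based scan of the kind this abstract describes could look roughly like the sketch below. The tool itself is written in JavaScript and its actual rule set is not given in the abstract, so every pattern here (leftover Markdown markup and a self-referential phrase) is a hypothetical example, as are the names `TRACE_PATTERNS` and `scan_for_traces`.

```python
import re

# Hypothetical trace patterns; the tool's real rules are not listed in the abstract.
TRACE_PATTERNS = {
    "markdown_bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "markdown_heading": re.compile(r"^#{1,6} ", re.MULTILINE),
    "ai_self_reference": re.compile(r"\bas an ai language model\b", re.IGNORECASE),
}


def scan_for_traces(text):
    """Count how often each (assumed) trace pattern occurs in a document."""
    return {name: len(pattern.findall(text)) for name, pattern in TRACE_PATTERNS.items()}


essay = "## Introduction\nThis point is **important**. As an AI language model, I cannot cite sources."
print(scan_for_traces(essay))
```

A non-zero count is only a prompt for closer reading, since students may type such characters themselves; the abstract likewise frames matches as potential traces.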
[LG-50] Users Favor LLM-Generated Content – Until They Know It's AI
链接: https://arxiv.org/abs/2503.16458
作者: Petr Parshakov,Iuliia Naidenova,Sofia Paklina,Nikita Matkin,Cornel Nesseler
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); General Economics (econ.GN)
*备注:
点击查看摘要
Abstract:In this paper, we investigate how individuals evaluate human-written and large language model-generated responses to popular questions when the source of the content is either concealed or disclosed. Through a controlled field experiment, participants were presented with a set of questions, each accompanied by a response generated by either a human or an AI. In a randomized design, half of the participants were informed of the response’s origin while the other half remained unaware. Our findings indicate that, overall, participants tend to prefer AI-generated responses. However, when the AI origin is revealed, this preference diminishes significantly, suggesting that evaluative judgments are influenced by the disclosure of the response’s provenance rather than solely by its quality. These results underscore a bias against AI-generated content, highlighting the societal challenge of improving the perception of AI work in contexts where quality assessments should be paramount.
[LG-51] Optimizing Facial Expressions of an Android Robot Effectively: a Bayesian Optimization Approach
Link: https://arxiv.org/abs/2301.05620
Authors: Dongsheng Yang,Wataru Sato,Qianying Liu,Takashi Minato,Shushi Namba,Shin’ya Nishida
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 8 figures, accepted by Humanoids2022
Abstract:Expressing various facial emotions is an important social ability for efficient communication between humans. A key challenge in human-robot interaction research is providing androids with the ability to make various human-like facial expressions for efficient communication with humans. The android Nikola, which we have developed, is equipped with many actuators for facial muscle control. While this enables Nikola to simulate various human expressions, it also complicates identification of the optimal parameters for producing desired expressions. Here, we propose a novel method that automatically optimizes the facial expressions of our android. We use a machine vision algorithm to evaluate the magnitudes of seven basic emotions, and employ the Bayesian Optimization algorithm to identify the parameters that produce the most convincing facial expressions. Evaluations by naive human participants demonstrate that our method improves the rated strength of the android’s facial expressions of anger, disgust, sadness, and surprise compared with the previous method that relied on Ekman’s theory and parameter adjustments by a human expert.
[LG-52] Glivenko-Cantelli for f-divergence
Link: https://arxiv.org/abs/2503.17355
Authors: Haoming Wang,Lek-Heng Lim
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG)
Comments: 26 pages, 1 figure
Abstract:We extend the celebrated Glivenko-Cantelli theorem, sometimes called the fundamental theorem of statistics, from its standard setting of total variation distance to all $f$-divergences. A key obstacle in this endeavor is to define $f$-divergence on a subcollection of a $\sigma$-algebra that forms a $\pi$-system but not a $\sigma$-subalgebra. This is a side contribution of our work. We will show that this notion of $f$-divergence on the $\pi$-system of rays preserves nearly all known properties of standard $f$-divergence, yields a novel integral representation of the Kolmogorov-Smirnov distance, and has a Glivenko-Cantelli theorem.
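For orientation, the classical result that this abstract generalizes can be written out as below. The notation (X_i, F, F_n, P, Q) is assumed here for illustration and is not taken from the paper, and the f-divergence shown is the standard textbook definition for P absolutely continuous with respect to Q, not the paper's variant on a π-system.

```latex
% Classical Glivenko-Cantelli theorem for i.i.d. X_1, ..., X_n with CDF F
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},
\qquad
\sup_{x \in \mathbb{R}} \lvert F_n(x) - F(x) \rvert
\xrightarrow{\text{a.s.}} 0
\quad (n \to \infty).

% Standard f-divergence for convex f with f(1) = 0 and P << Q
D_f(P \,\|\, Q) = \int f\!\left( \frac{dP}{dQ} \right) dQ .
```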
[LG-53] On Quantum Perceptron Learning via Quantum Search
Link: https://arxiv.org/abs/2503.17308
Authors: Xiaoyu Sun(1),Mathieu Roget(1),Giuseppe Di Molfetta(1),Hachem Kadri(1) ((1) Aix-Marseille Université, CNRS, LIS, Marseille, France.)
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:With the growing interest in quantum machine learning, the perceptron – a fundamental building block in traditional machine learning – has emerged as a valuable model for exploring quantum advantages. Two quantum perceptron algorithms based on Grover’s search were developed in arXiv:1602.04799 to accelerate training and improve statistical efficiency in perceptron learning. This paper points out and corrects a mistake in the proof of Theorem 2 in arXiv:1602.04799. Specifically, we show that the probability of sampling from a normal distribution for a $D$-dimensional hyperplane that perfectly classifies the data scales as $\Omega(\gamma^D)$ instead of $\Theta(\gamma)$, where $\gamma$ is the margin. We then revisit two well-established linear programming algorithms – the ellipsoid method and the cutting plane random walk algorithm – in the context of perceptron learning, and show how quantum search algorithms can be leveraged to enhance the overall complexity. Specifically, both algorithms gain a sub-linear speed-up $O(\sqrt{N})$ in the number of data points $N$ as a result of Grover’s algorithm, and an additional $O(D^{1.5})$ speed-up is possible for the cutting plane random walk algorithm employing quantum walk search.
[LG-54] Calibration Strategies for Robust Causal Estimation: Theoretical and Empirical Insights on Propensity Score Based Estimators
Link: https://arxiv.org/abs/2503.17290
Authors: Jan Rabenseifner,Sven Klaassen,Jannis Kueck,Philipp Bach
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Abstract:The partitioning of data for estimation and calibration critically impacts the performance of propensity score based estimators like inverse probability weighting (IPW) and double/debiased machine learning (DML) frameworks. We extend recent advances in calibration techniques for propensity score estimation, improving the robustness of propensity scores in challenging settings such as limited overlap, small sample sizes, or unbalanced data. Our contributions are twofold: First, we provide a theoretical analysis of the properties of calibrated estimators in the context of DML. To this end, we refine existing calibration frameworks for propensity score models, with a particular emphasis on the role of sample-splitting schemes in ensuring valid causal inference. Second, through extensive simulations, we show that calibration reduces variance of inverse-based propensity score estimators while also mitigating bias in IPW, even in small-sample regimes. Notably, calibration improves stability for flexible learners (e.g., gradient boosting) while preserving the doubly robust properties of DML. A key insight is that, even when methods perform well without calibration, incorporating a calibration step does not degrade performance, provided that an appropriate sample-splitting approach is chosen.
[LG-55] Learning to Solve Related Linear Systems
Link: https://arxiv.org/abs/2503.17265
Authors: Disha Hegde,Jon Cockayne
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Abstract:Solving multiple parametrised related systems is an essential component of many numerical tasks. Borrowing strength from the solved systems and learning will make this process faster. In this work, we propose a novel probabilistic linear solver over the parameter space. This leverages information from the solved linear systems in a regression setting to provide an efficient posterior mean and covariance. We advocate using this as a companion regression model for the preconditioned conjugate gradient method, and discuss the favourable properties of the posterior mean and covariance as the initial guess and preconditioner. We also provide several design choices for this companion solver. Numerical experiments showcase the benefits of using our novel solver in a hyperparameter optimisation problem.
[LG-56] Generative adversarial framework to calibrate excursion set models for the 3D morphology of all-solid-state battery cathodes
Link: https://arxiv.org/abs/2503.17171
Authors: Orkun Furat,Sabrina Weber,Johannes Schubert,René Rekers,Maximilian Luczak,Erik Glatt,Andreas Wiegmann,Jürgen Janek,Anja Bielefeld,Volker Schmidt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 25 pages, 8 Figures
Abstract:This paper presents a computational method for generating virtual 3D morphologies of functional materials using low-parametric stochastic geometry models, i.e., digital twins, calibrated with 2D microscopy images. These digital twins allow systematic parameter variations to simulate various morphologies that can be deployed for virtual materials testing by means of spatially resolved numerical simulations of macroscopic properties. Generative adversarial networks (GANs) have gained popularity for calibrating models to generate realistic 3D morphologies. However, GANs often comprise numerous uninterpretable parameters, which makes systematic variation of morphologies for virtual materials testing challenging. In contrast, low-parametric stochastic geometry models (e.g., based on Gaussian random fields) enable targeted variation but may struggle to mimic complex morphologies. Combining GANs with advanced stochastic geometry models (e.g., excursion sets of more general random fields) addresses these limitations, allowing model calibration solely from 2D image data. This approach is demonstrated by generating a digital twin of all-solid-state battery (ASSB) cathodes. Since the digital twins are parametric, they support systematic exploration of structural scenarios and their macroscopic properties. The proposed method facilitates simulation studies for optimizing 3D morphologies, benefiting not only ASSB cathodes but also other materials with similar structures.
[LG-57] Adiabatic Fine-Tuning of Neural Quantum States Enables Detection of Phase Transitions in Weight Space ICLR
Link: https://arxiv.org/abs/2503.17140
Authors: Vinicius Hernandes,Thomas Spriggs,Saqar Khaleefah,Eliska Greplova
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025
Abstract:Neural quantum states (NQS) have emerged as a powerful tool for approximating quantum wavefunctions using deep learning. While these models achieve remarkable accuracy, understanding how they encode physical information remains an open challenge. In this work, we introduce adiabatic fine-tuning, a scheme that trains NQS across a phase diagram, leading to strongly correlated weight representations across different models. This correlation in weight space enables the detection of phase transitions in quantum systems by analyzing the trained network weights alone. We validate our approach on the transverse field Ising model and the J1-J2 Heisenberg model, demonstrating that phase transitions manifest as distinct structures in weight space. Our results establish a connection between physical phase transitions and the geometry of neural network parameters, opening new directions for the interpretability of machine learning models in physics.
[LG-58] Benign Overfitting with Quantum Kernels
Link: https://arxiv.org/abs/2503.17020
Authors: Joachim Tomasi,Sandrine Anthoine,Hachem Kadri
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:Quantum kernels quantify similarity between data points by measuring the inner product between quantum states, computed through quantum circuit measurements. By embedding data into quantum systems, quantum kernel feature maps, which may be classically intractable to compute, could efficiently exploit high-dimensional Hilbert spaces to capture complex patterns. However, designing effective quantum feature maps remains a major challenge. Many quantum kernels, such as the fidelity kernel, suffer from exponential concentration, leading to near-identity kernel matrices that fail to capture meaningful data correlations and lead to overfitting and poor generalization. In this paper, we propose a novel strategy for constructing quantum kernels that achieve good generalization performance, drawing inspiration from benign overfitting in classical machine learning. Our approach introduces the concept of local-global quantum kernels, which combine two complementary components: a local quantum kernel based on measurements of small subsystems and a global quantum kernel derived from full-system measurements. Through numerical experiments, we demonstrate that local-global quantum kernels exhibit benign overfitting, supporting the effectiveness of our approach in enhancing quantum kernel methods.
[LG-59] Uncertainty-Driven Modeling of Microporosity and Permeability in Clastic Reservoirs Using Random Forest
Link: https://arxiv.org/abs/2503.16957
Authors: Muhammad Risha,Mohamed Elsaadany,Paul Liu
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures
Abstract:Predicting microporosity and permeability in clastic reservoirs is a challenge in reservoir quality assessment, especially in formations where direct measurements are difficult or expensive. These reservoir properties are fundamental in determining a reservoir’s capacity for fluid storage and transmission, yet conventional methods for evaluating them, such as Mercury Injection Capillary Pressure (MICP) and Scanning Electron Microscopy (SEM), are resource-intensive. The aim of this study is to develop a cost-effective machine learning model to predict complex reservoir properties using readily available field data and basic laboratory analyses. A Random Forest classifier was employed, utilizing key geological parameters such as porosity, grain size distribution, and spectral gamma-ray (SGR) measurements. An uncertainty analysis was applied to account for natural variability, expanding the dataset, and enhancing the model’s robustness. The model achieved a high level of accuracy in predicting microporosity (93%) and permeability levels (88%). By using easily obtainable data, this model reduces the reliance on expensive laboratory methods, making it a valuable tool for early-stage exploration, especially in remote or offshore environments. The integration of machine learning with uncertainty analysis provides a reliable and cost-effective approach for evaluating key reservoir properties in siliciclastic formations. This model offers a practical solution to improve reservoir quality assessments, enabling more informed decision-making and optimizing exploration efforts.
[LG-60] Sparse Additive Contextual Bandits: A Nonparametric Approach for Online Decision-making with High-dimensional Covariates
Link: https://arxiv.org/abs/2503.16941
Authors: Wenjia Wang,Qingwen Zhang,Xiaowei Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:Personalized services are central to today’s digital landscape, where online decision-making is commonly formulated as contextual bandit problems. Two key challenges emerge in modern applications: high-dimensional covariates and the need for nonparametric models to capture complex reward-covariate relationships. We address these challenges by developing a contextual bandit algorithm based on sparse additive reward models in reproducing kernel Hilbert spaces. We establish statistical properties of the doubly penalized method applied to random regions, introducing novel analyses under bandit feedback. Our algorithm achieves sublinear cumulative regret over the time horizon $T$ while scaling logarithmically with covariate dimensionality $d$. Notably, we provide the first regret upper bound with logarithmic growth in $d$ for nonparametric contextual bandits with high-dimensional covariates. We also establish a lower bound, with the gap to the upper bound vanishing as smoothness increases. Extensive numerical experiments demonstrate our algorithm’s superior performance in high-dimensional settings compared to existing approaches.
[LG-61] Online Selective Conformal Prediction: Errors and Solutions
Link: https://arxiv.org/abs/2503.16809
Authors: Yusuf Sale,Aaditya Ramdas
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 25 pages, 8 figures
Abstract:In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining tradeoffs between valid methods.
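As background for the abstract above, here is a minimal sketch of ordinary split conformal prediction, the building block that selective variants start from. It is not the paper's method: the function names are hypothetical, the scores are assumed to be absolute residuals on held-out calibration data, and the calibration set is taken as given, whereas the paper is about which calibration points may be used after an online selection.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold gives the target coverage,
    so infinity is returned.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_interval(point_prediction, calibration_scores, alpha=0.1):
    """Symmetric interval around a point prediction.

    Assumes the scores are absolute residuals |y - prediction| computed on
    calibration data that is exchangeable with the test point.
    """
    q = conformal_quantile(calibration_scores, alpha)
    return point_prediction - q, point_prediction + q
```

With ten calibration residuals and `alpha=0.1`, the rank is `ceil(11 * 0.9) = 10`, so the interval half-width is the largest residual. Exchangeability between the calibration scores and the test point is exactly the assumption that, according to the abstract, online selection can break.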
[LG-62] EarlyStopping: Implicit Regularization for Iterative Learning Procedures in Python
Link: https://arxiv.org/abs/2503.16753
Authors: Eric Ziebell,Ratmir Miftachov,Bernhard Stankewitz,Laura Hucker
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Abstract:Iterative learning procedures are ubiquitous in machine learning and modern statistics. Regularisation is typically required to prevent inflating the expected loss of a procedure in later iterations via the propagation of noise inherent in the data. Significant emphasis has been placed on achieving this regularisation implicitly by stopping procedures early. The EarlyStopping-package provides a toolbox of (in-sample) sequential early stopping rules for several well-known iterative estimation procedures, such as truncated SVD, Landweber (gradient descent), conjugate gradient descent, L2-boosting and regression trees. One of the central features of the package is that the algorithms allow the specification of the true data-generating process and keep track of relevant theoretical quantities. In this paper, we detail the principles governing the implementation of the EarlyStopping-package and provide a survey of recent foundational advances in the theoretical literature. We demonstrate how to use the EarlyStopping-package to explore core features of implicit regularisation and replicate results from the literature.
[LG-63] Optimal Nonlinear Online Learning under Sequential Price Competition via s-Concavity
Link: https://arxiv.org/abs/2503.16737
Authors: Daniele Bracale,Moulinath Banerjee,Cong Shi,Yuekai Sun
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Abstract:We consider price competition among multiple sellers over a selling horizon of $T$ periods. In each period, sellers simultaneously offer their prices and subsequently observe their respective demand that is unobservable to competitors. The demand function for each seller depends on all sellers’ prices through a private, unknown, and nonlinear relationship. To address this challenge, we propose a semi-parametric least-squares estimation of the nonlinear mean function, which does not require sellers to communicate demand information. We show that when all sellers employ our policy, their prices converge at a rate of $O(T^{-1/7})$ to the Nash equilibrium prices that sellers would reach if they were fully informed. Each seller incurs a regret of $O(T^{5/7})$ relative to a dynamic benchmark policy. A theoretical contribution of our work is proving the existence of equilibrium under shape-constrained demand functions via the concept of $s$-concavity and establishing regret bounds of our proposed policy. Technically, we also establish new concentration results for the least squares estimator under shape constraints. Our findings offer significant insights into dynamic competition-aware pricing and contribute to the broader study of non-parametric learning in strategic decision-making.
[LG-64] Universal approximation property of neural stochastic differential equations
Link: https://arxiv.org/abs/2503.16696
Authors: Anna P. Kwossek,David J. Prömel,Josef Teichmann
Subjects: Probability (math.PR); Machine Learning (cs.LG); Functional Analysis (math.FA); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
Comments: 20 pages
Abstract:We identify various classes of neural networks that are able to approximate continuous functions locally uniformly subject to fixed global linear growth constraints. For such neural networks the associated neural stochastic differential equations can approximate general stochastic differential equations, both of Itô diffusion type, arbitrarily well. Moreover, quantitative error estimates are derived for stochastic differential equations with sufficiently regular coefficients.
[LG-65] Making the unmodulated pyramid wavefront sensor smart II. First on-sky demonstration of extreme adaptive optics with deep learning
Link: https://arxiv.org/abs/2503.16690
Authors: R. Landman,S.Y. Haffert,J.D. Long,J.R. Males,L.M. Close,W.B. Foster,K. Van Gorkom,O. Guyon,A.D. Hedglen,P.T. Johnson,M.Y. Kautz,J.K. Kueny,J. Li,J. Liberman,J. Lumbres,E.A. McEwen,A. McLeod,L. Schatz,E. Tonucci,K. Twitchell
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG); Optics (physics.optics)
Comments: Accepted for publication in A&A
Abstract:Pyramid wavefront sensors (PWFSs) are the preferred choice for current and future extreme adaptive optics (XAO) systems. Almost all instruments use the PWFS in its modulated form to mitigate its limited linearity range. However, this modulation comes at the cost of a reduction in sensitivity, a blindness to petal-piston modes, and a limit to the sensor’s ability to operate at high speeds. Therefore, there is strong interest to use the PWFS without modulation, which can be enabled with nonlinear reconstructors. Here, we present the first on-sky demonstration of XAO with an unmodulated PWFS using a nonlinear reconstructor based on convolutional neural networks. We discuss the real-time implementation on the Magellan Adaptive Optics eXtreme (MagAO-X) instrument using the optimized TensorRT framework and show that inference is fast enough to run the control loop at 2 kHz frequencies. Our on-sky results demonstrate a successful closed-loop operation using a model calibrated with internal source data that delivers stable and robust correction under varying conditions. Performance analysis reveals that our smart PWFS achieves nearly the same Strehl ratio as the highly optimized modulated PWFS under favorable conditions on bright stars. Notably, we observe an improvement in performance on a fainter star under the influence of strong winds. These findings confirm the feasibility of using the PWFS in its unmodulated form and highlight its potential for next-generation instruments. Future efforts will focus on achieving even higher control loop frequencies (3 kHz), optimizing the calibration procedures, and testing its performance on fainter stars, where more gain is expected for the unmodulated PWFS compared to its modulated counterpart.
[LG-66] QCPINN: Quantum Classical Physics-Informed Neural Networks for Solving PDEs
Link: https://arxiv.org/abs/2503.16678
Authors: Afrah Farea,Saiful Khan,Mustafa Serdar Celebi
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Hybrid quantum-classical neural network methods represent an emerging approach to solving computational challenges by leveraging advantages from both paradigms. As physics-informed neural networks (PINNs) have been successfully applied to solve partial differential equations (PDEs) by incorporating physical constraints into neural architectures, this work investigates whether quantum-classical physics-informed neural networks (QCPINNs) can efficiently solve PDEs with reduced parameter counts compared to classical approaches. We evaluate two quantum circuit paradigms: continuous-variable (CV) and qubit-based discrete-variable (DV) across multiple circuit ansatze (Alternate, Cascade, Cross mesh, and Layered). Benchmarking across five challenging PDEs (Helmholtz, Cavity, Wave, Klein-Gordon, and Convection-Diffusion equations) demonstrates that our hybrid approaches achieve comparable accuracy to classical PINNs while requiring up to 89% fewer trainable parameters. DV-based implementations, particularly those with angle encoding and cascade circuit configurations, exhibit better stability and convergence properties across all problem types. For the Convection-Diffusion equation, our angle-cascade QCPINN achieves parameter efficiency and a 37% reduction in relative L2 error compared to classical counterparts. Our findings highlight the potential of quantum-enhanced architectures for physics-informed learning, establishing parameter efficiency as a quantifiable quantum advantage while providing a foundation for future quantum-classical hybrid systems solving complex physical models.
[LG-67] Subgradient Method for System Identification with Non-Smooth Objectives
Link: https://arxiv.org/abs/2503.16673
Authors: Baturalp Yalcin,Javad Lavaei
Subjects: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures
Abstract:This paper investigates a subgradient-based algorithm to solve the system identification problem for linear time-invariant systems with non-smooth objectives. This is essential for robust system identification in safety-critical applications. While existing work provides theoretical exact recovery guarantees using optimization solvers, the design of fast learning algorithms with convergence guarantees for practical use remains unexplored. We analyze the subgradient method in this setting where the optimization problems to be solved change over time as new measurements are taken, and we establish linear convergence results for both the best and Polyak step sizes after a burn-in period. Additionally, we characterize the asymptotic convergence of the best average sub-optimality gap under diminishing and constant step sizes. Finally, we compare the time complexity of standard solvers with the subgradient algorithm and support our findings with experimental results. This is the first work to analyze subgradient algorithms for system identification with non-smooth objectives.
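To make the Polyak step size mentioned in the abstract concrete, the following is a toy sketch on a scalar system x[t+1] = a * x[t] with an absolute-error objective. Everything here is an assumption for illustration: the function names are hypothetical, the system is scalar and noise-free, and the optimal value `f_star` is taken as known (0.0), which does not hold in general. It does not reproduce the paper's algorithm or its time-varying setting.

```python
def l1_loss(a, xs):
    """Non-smooth objective: sum of |x[t+1] - a * x[t]| along one trajectory."""
    return sum(abs(xs[t + 1] - a * xs[t]) for t in range(len(xs) - 1))


def l1_subgradient(a, xs):
    """One element of the subdifferential of l1_loss at a."""
    g = 0.0
    for t in range(len(xs) - 1):
        residual = xs[t + 1] - a * xs[t]
        sign = (residual > 0) - (residual < 0)  # 0 at a kink is a valid choice
        g -= sign * xs[t]
    return g


def polyak_subgradient(xs, a0=0.0, f_star=0.0, iters=100):
    """Subgradient descent with the Polyak step (f(a) - f_star) / g**2.

    f_star is the optimal objective value; 0.0 is only correct for
    noise-free data and is an assumption of this sketch.
    """
    a = a0
    best_a, best_f = a, l1_loss(a, xs)
    for _ in range(iters):
        g = l1_subgradient(a, xs)
        if g == 0.0:  # zero subgradient: a is already a minimiser
            break
        a -= (l1_loss(a, xs) - f_star) / (g * g) * g
        f = l1_loss(a, xs)
        if f < best_f:  # subgradient steps need not decrease the objective
            best_a, best_f = a, f
    return best_a


def simulate(a_true, x0, steps):
    """Noise-free scalar trajectory x[t+1] = a_true * x[t]."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a_true * xs[-1])
    return xs
```

Calling `polyak_subgradient(simulate(0.8, 1.0, 10))` recovers a value very close to 0.8. The best iterate is tracked separately because the subgradient method is not a descent method.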
[LG-68] Procrustes Wasserstein Metric: A Modified Benamou-Brenier Approach with Applications to Latent Gaussian Distributions
Link: https://arxiv.org/abs/2503.16580
Authors: Kevine Meugang Toukam
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Applications (stat.AP)
Abstract:We introduce a modified Benamou-Brenier type approach leading to a Wasserstein type distance that allows global invariance, specifically under isometries, and we show that the problem can be reduced to orthogonal transformations. This distance is defined by penalizing the action with a costless movement of the particle that does not change the direction and speed of its trajectory. We show that, for Gaussian distributions, the distance reduces to measuring the Euclidean distance between their ordered vectors of eigenvalues, and we show a direct application in recovering latent Gaussian distributions.
[LG-69] Early Prediction of Alzheimer's and Related Dementias: A Machine Learning Approach Utilizing Social Determinants of Health Data
Link: https://arxiv.org/abs/2503.16560
Authors: Bereket Kindo,Arjee Restar,Anh Tran
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Abstract:Alzheimer’s disease and related dementias (AD/ADRD) represent a growing healthcare crisis affecting over 6 million Americans. While genetic factors play a crucial role, emerging research reveals that social determinants of health (SDOH) significantly influence both the risk and progression of cognitive functioning, such as cognitive scores and cognitive decline. This report examines how these social, environmental, and structural factors impact cognitive health trajectories, with a particular focus on Hispanic populations, who face disproportionate risk for AD/ADRD. Using data from the Mexican Health and Aging Study (MHAS) and its cognitive assessment sub study (Mex-Cog), we employed an ensemble of regression tree models to predict 4-year and 9-year cognitive scores and cognitive decline based on SDOH. This approach identified key predictive SDOH factors to inform potential multilevel interventions to address cognitive health disparities in this population.
Information Retrieval
[IR-0] Towards Carbon Footprint-Aware Recommender Systems for Greener Item Recommendation
Link: https://arxiv.org/abs/2503.17201
Authors: Raoul Kalisvaart,Masoud Mansoury,Alan Hanjalic,Elvin Isufi
Subjects: Information Retrieval (cs.IR)
Abstract:The commodity and widespread use of online shopping are having an unprecedented impact on climate, with emission figures from key actors that are easily comparable to those of a large-scale metropolis. Despite online shopping being fueled by recommender systems (RecSys) algorithms, the role and potential of the latter in promoting more sustainable choices is little studied. One of the main reasons for this could be attributed to the lack of a dataset containing carbon footprint emissions for the items. While building such a dataset is a rather challenging task, its presence is pivotal for opening the doors to novel perspectives, evaluations, and methods for RecSys research. In this paper, we target this bottleneck and study the environmental role of RecSys algorithms. First, we mine a dataset that includes carbon footprint emissions for its items. Then, we benchmark conventional RecSys algorithms in terms of accuracy and sustainability as two faces of the same coin. We find that RecSys algorithms optimized for accuracy overlook greenness and that longer recommendation lists are greener but less accurate. Then, we show that a simple reranking approach that accounts for the item’s carbon footprint can establish a better trade-off between accuracy and greenness. This reranking approach is modular, ready to use, and can be applied to any RecSys algorithm without the need to alter the underlying mechanisms or retrain models. Our results show that a small sacrifice of accuracy can lead to significant improvements of recommendation greenness across all algorithms and list lengths. Arguably, this accuracy-greenness trade-off could even be seen as an enhancement of user satisfaction, particularly for purpose-driven users who prioritize the environmental impact of their choices. We anticipate this work will serve as the starting point for studying RecSys for more sustainable recommendations.
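The reranking idea described in the abstract can be sketched roughly as follows. This is an illustrative guess at what such a post-processing step might look like, not the authors' formulation: the function name, the min-max normalisation, the linear scoring rule, and the `trade_off` parameter are all assumptions.

```python
def rerank_by_footprint(candidates, trade_off=0.3, top_k=3):
    """Rerank (item_id, relevance, footprint) tuples from any base recommender.

    Score = (1 - trade_off) * normalised relevance
            - trade_off * normalised carbon footprint.
    trade_off = 0 reproduces the relevance-only ranking.
    """
    if not candidates:
        return []

    def min_max(values):
        # Scale to [0, 1]; a constant list maps to all zeros.
        lo, hi = min(values), max(values)
        span = hi - lo
        return [0.0 if span == 0 else (v - lo) / span for v in values]

    relevance = min_max([c[1] for c in candidates])
    footprint = min_max([c[2] for c in candidates])

    scored = [
        (item_id, (1 - trade_off) * r - trade_off * f)
        for (item_id, _, _), r, f in zip(candidates, relevance, footprint)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored[:top_k]]
```

Because the function only consumes scores produced elsewhere, it can sit behind any recommender without retraining, which matches the modularity the abstract emphasises. For a made-up candidate list such as `[("a", 0.9, 50.0), ("b", 0.85, 5.0), ("c", 0.4, 1.0), ("d", 0.8, 20.0)]`, raising `trade_off` from 0 to 0.3 moves the high-footprint item "a" from first to third place.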
[IR-1] Rankformer: A Graph Transformer for Recommendation based on Ranking Objective WWW2025
Link: https://arxiv.org/abs/2503.16927
Authors: Sirui Chen,Shen Han,Jiawei Chen,Binbin Hu,Sheng Zhou,Gang Wang,Yan Feng,Chun Chen,Can Wang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by WWW2025
Abstract:Recommender Systems (RS) aim to generate personalized ranked lists for each user and are evaluated using ranking metrics. Although personalized ranking is a fundamental aspect of RS, this critical property is often overlooked in the design of model architectures. To address this issue, we propose Rankformer, a ranking-inspired recommendation model. The architecture of Rankformer is inspired by the gradient of the ranking objective, embodying a unique (graph) transformer architecture – it leverages global information from all users and items to produce more informative representations and employs specific attention weights to guide the evolution of embeddings towards improved ranking performance. We further develop an acceleration algorithm for Rankformer, reducing its complexity to a linear level with respect to the number of positive instances. Extensive experimental results demonstrate that Rankformer outperforms state-of-the-art methods. The code is available at this https URL.
Attachment Download
Click to download the full list of today's papers