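Today's arXiv Papers | 2025-03-21

This post lists the latest papers fetched from arXiv.org on 2025-03-21. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each morning.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.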

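[Quick Read]: This paper addresses the high computational cost that Long-Context Transformer Models (LCTMs) incur in practice because attention scales quadratically with sequence length. Existing block-sparse attention methods mitigate the problem but struggle to balance accuracy and efficiency, mainly because measuring block importance is expensive. The proposed solution is XAttention, a framework that significantly accelerates long-context Transformer inference through a new block-importance estimate. Its core insight is that the sum of antidiagonal values in the attention matrix (running from lower-left to upper-right) is a strong proxy for block importance, which allows non-essential blocks to be identified precisely and pruned, yielding high sparsity and much faster inference. Experiments show that XAttention matches the accuracy of full attention while speeding up attention computation by up to 13.5x.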

Link: https://arxiv.org/abs/2503.16428
Authors: Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work

Abstract: Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention’s quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention’s key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention’s ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at this https URL.
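
To make the antidiagonal idea in the abstract concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions: it scores each square block by the sum of its single main antidiagonal and keeps a fixed fraction of blocks per block row. The function names, the `keep_ratio` parameter, and the top-k selection rule are illustrative choices, not the paper's API or its exact selection procedure.

```python
import numpy as np

def antidiagonal_block_scores(attn, block):
    """Score every (block x block) tile by the sum of its main antidiagonal."""
    n = attn.shape[0]
    assert attn.shape == (n, n) and n % block == 0
    nb = n // block
    # tiles[i, j] is the tile in block row i, block column j.
    tiles = attn.reshape(nb, block, nb, block).transpose(0, 2, 1, 3)
    # Reversing the columns of a tile moves its antidiagonal onto the main diagonal.
    flipped = tiles[..., ::-1]
    return np.trace(flipped, axis1=-2, axis2=-1)

def keep_top_blocks(scores, keep_ratio):
    """Boolean mask keeping the highest-scoring blocks in each block row.

    The fixed per-row ratio is an assumption made for illustration only.
    """
    k = max(1, int(round(keep_ratio * scores.shape[1])))
    order = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, order, True, axis=1)
    return mask
```

Blocks whose mask entry is `False` would be skipped by a block-sparse attention kernel; the sketch stops at producing the mask.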

[NLP-1] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

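[Quick Read]: This paper addresses the “overthinking phenomenon” in Large Language Models (LLMs): on complex reasoning tasks, longer Chain-of-Thought (CoT) sequences improve performance but add significant computational overhead. The survey organizes efficient-reasoning methods into three directions: (1) model-based efficient reasoning, which optimizes full-length reasoning models into more concise ones or directly trains efficient reasoning models; (2) reasoning output-based efficient reasoning, which dynamically reduces reasoning steps and length during inference; and (3) input prompt-based efficient reasoning, which uses properties of the input prompt, such as difficulty or length control, to improve efficiency. It also covers the use of efficient data for training reasoning models, the reasoning capabilities of small language models, and evaluation methods and benchmarks. Together these directions frame the main approaches to LLM reasoning efficiency.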

Link: https://arxiv.org/abs/2503.16419
Authors: Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen (Henry) Zhong, Hanjie Chen, Xia Hu
Affiliations: Department of Computer Science; Rice University
Categories: Computation and Language (cs.CL)
Comments: Project Website: this https URL

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the “overthinking phenomenon”. In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.

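[NLP-2] Survey on Evaluation of LLM-based Agents

[Quick Read]: This paper systematically surveys how LLM-based agents are evaluated. It analyzes existing benchmarks and frameworks along four dimensions: fundamental agent capabilities (planning, tool use, self-reflection, and memory); application-specific benchmarks (web, software engineering, scientific, and conversational agents); benchmarks for generalist agents; and frameworks for evaluating agents. The analysis shows evaluation shifting toward more realistic and challenging settings, and it highlights gaps that future work should address: assessing cost-efficiency, safety, and robustness, and developing fine-grained, scalable evaluation methods.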

Link: https://arxiv.org/abs/2503.16416
Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
Affiliations: The Hebrew University of Jerusalem; IBM Research; Yale University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.

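[NLP-3] The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination

[Quick Read]: This paper examines Benchmark Data Contamination (BDC), where benchmark test samples end up in the training set, inflating LLM evaluation results and undermining their reliability. Researchers have proposed mitigation strategies that update existing benchmarks by modifying the original questions or generating new ones from them, but the effectiveness of these strategies has not been rigorously tested. The paper designs a systematic, controlled pipeline and two new metrics, fidelity and contamination resistance, to assess existing BDC mitigation strategies in a fine-grained way. Unlike earlier assessments that look only at aggregate accuracy (accuracy drop and accuracy matching), the two metrics emphasize matching of evaluation results at the question level. Experiments with 10 LLMs, 5 benchmarks, 20 mitigation strategies, and 2 contamination scenarios find that no existing strategy significantly improves contamination resistance over the vanilla case (no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. This underscores the urgent need for more effective mitigation strategies. The code repository is available at the link given in the paper.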

Link: https://arxiv.org/abs/2503.16402
Authors: Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 23 pages

Abstract: Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at this https URL.
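
The abstract contrasts aggregate accuracy matching with question-level matching but does not give the formulas for fidelity or contamination resistance. The sketch below therefore does not implement either metric; it only illustrates, with hypothetical helper names, why two result vectors can agree perfectly on accuracy while disagreeing on every question.

```python
def accuracy(results):
    """Fraction of questions answered correctly (results are 0/1 flags)."""
    return sum(results) / len(results)

def question_level_match(results_a, results_b):
    """Fraction of questions on which two runs have the same outcome.

    Hypothetical illustration of question-level matching; the paper's
    actual metric definitions are not reproduced here.
    """
    assert len(results_a) == len(results_b)
    same = sum(1 for a, b in zip(results_a, results_b) if a == b)
    return same / len(results_a)
```

For example, the runs `[1, 0, 1, 0]` and `[0, 1, 0, 1]` have identical accuracy, yet they match on none of the individual questions, which an accuracy-only comparison cannot detect.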

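[NLP-4] Do Visual Imaginations Improve Vision-and-Language Navigation Agents?

[Quick Read]: This paper studies whether visual representations of the sub-goals implied by instructions ("imaginations") can serve as navigational cues and improve Vision-and-Language Navigation (VLN). A text-to-image diffusion model synthesizes these images from landmark references in segmented instructions, and they are given to VLN agents as an additional modality that acts as landmark cues. An auxiliary loss explicitly encourages the agent to relate each image to its corresponding referring expression. Experiments show a gain of about 1 point in success rate (SR) and up to 0.5 points in success scaled by inverse path length (SPL), suggesting that the approach reinforces visual understanding compared with relying on language instructions alone.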

Link: https://arxiv.org/abs/2503.16394
Authors: Akhil Perincherry, Jacob Krantz, Stefan Lee
Affiliations: Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:Vision-and-Language Navigation (VLN) agents are tasked with navigating an unseen environment using natural language instructions. In this work, we study if visual representations of sub-goals implied by the instructions can serve as navigational cues and lead to increased navigation performance. To synthesize these visual representations or imaginations, we leverage a text-to-image diffusion model on landmark references contained in segmented instructions. These imaginations are provided to VLN agents as an added modality to act as landmark cues and an auxiliary loss is added to explicitly encourage relating these with their corresponding referring expressions. Our findings reveal an increase in success rate (SR) of around 1 point and up to 0.5 points in success scaled by inverse path length (SPL) across agents. These results suggest that the proposed approach reinforces visual understanding compared to relying on language instructions alone. Code and data for our work can be found at this https URL.
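
For readers unfamiliar with the SPL metric mentioned above, here is a small sketch of the commonly used definition: each episode contributes its success flag scaled by the ratio of the shortest-path length to the longer of the agent's path and the shortest path, and the result is averaged over episodes. The function and argument names are illustrative, and the paper's evaluation code may differ in detail.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by (normalized inverse) path length, averaged over episodes.

    successes: 0/1 flag per episode
    shortest_lengths: shortest-path distance from start to goal (must be > 0)
    path_lengths: length of the path the agent actually took
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        total += success * shortest / max(taken, shortest)
    return total / len(successes)
```

A successful episode that takes twice the shortest path scores 0.5, while a failed episode scores 0 regardless of path length.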

[NLP-5] CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners

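[Quick Read]: This paper addresses a limitation of existing Knowledge Editing (KE) methods for Large Language Models (LLMs): they can update isolated facts but fail to generalize those updates to multi-hop reasoning tasks that depend on the modified knowledge. By analyzing reasoning circuits, the neural pathways LLMs use for knowledge-based inference, the authors find that layer-localized KE methods such as MEMIT and WISE, which edit only one or a few layers, struggle to incorporate updated information into these pathways. They propose CaKE (Circuit-aware Knowledge Editing), which uses strategically curated data, guided by the circuit analysis, to force the model to use the updated knowledge and thereby develop appropriate reasoning circuits for it. On the MQuAKE dataset, CaKE improves multi-hop reasoning accuracy by 20% on average over existing KE methods.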

Link: https://arxiv.org/abs/2503.16356
Authors: Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Affiliations: Zhejiang University; National University of Singapore; University of California, Los Angeles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they struggle to generalize these updates to multi-hop reasoning tasks that depend on the modified knowledge. Through an analysis of reasoning circuits – the neural pathways LLMs use for knowledge-based inference, we observe that current layer-localized KE approaches, such as MEMIT and WISE, which edit only single or a few model layers, struggle to effectively incorporate updated information into these reasoning pathways. To address this limitation, we propose CaKE (Circuit-aware Knowledge Editing), a novel method that enables more effective integration of updated knowledge in LLMs. CaKE leverages strategically curated data, guided by our circuits-based analysis, that enforces the model to utilize the modified knowledge, stimulating the model to develop appropriate reasoning circuits for newly integrated knowledge. Experimental results show that CaKE enables more accurate and consistent use of updated knowledge across related reasoning tasks, leading to an average of 20% improvement in multi-hop reasoning accuracy on MQuAKE dataset compared to existing KE methods. We release the code and data in this https URL.

[NLP-6] LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates

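[Quick Read]: This paper asks how to further improve the performance of Transformer-based Large Language Models (LLMs) and control their behavior precisely. It proposes LLMBRACES, which computes relevance scores for the value vectors in feed-forward network (FFN) layers and uses those scores to dynamically adjust the contribution of each sub-update, refining the prediction process. The method improves output accuracy and reliability and also supports conditional control over generation characteristics such as sentiment, giving finer-grained steering of LLM outputs. Experiments show that LLMBRACES outperforms baselines in both fine-tuning and zero-shot settings while needing far fewer tunable parameters (up to 75% fewer than LoRA), and that it performs well on sentiment-controlled generation and toxicity reduction.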

Link: https://arxiv.org/abs/2503.16334
Authors: Ying Shen, Lifu Huang
Affiliations: University of Illinois Urbana-Champaign; UC Davis
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 2 figures

Abstract: Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN’s value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBRACES, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBRACES refines the prediction process, leading to more accurate and reliable outputs, much like a ‘brace’ providing support and stability. Moreover, LLMBRACES can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs, including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B, demonstrate that LLMBRACES outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBRACES excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
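
The view of an FFN layer as a sum of sub-updates can be written out in a few lines of NumPy. This is a sketch under assumptions, not the LLMBRACES implementation: it uses a ReLU activation, stores value vectors as rows rather than columns, and takes the relevance scores as a given input because the abstract does not say how they are computed. All names are hypothetical.

```python
import numpy as np

def ffn_sub_updates(x, w_key, w_value):
    """Return one sub-update per hidden unit: coefficient_i * value_vector_i.

    x: input vector of shape (d,)
    w_key: key matrix of shape (h, d)
    w_value: value matrix of shape (h, d), one value vector per row (assumed layout)
    """
    coefficients = np.maximum(w_key @ x, 0.0)  # ReLU is an assumption
    return coefficients[:, None] * w_value

def modulated_ffn(x, w_key, w_value, relevance):
    """Sum the sub-updates after rescaling each one by a relevance score.

    relevance: shape (h,), supplied by the caller in this sketch.
    """
    return (relevance[:, None] * ffn_sub_updates(x, w_key, w_value)).sum(axis=0)
```

With every relevance score set to 1 this reduces to the ordinary FFN output, which makes the modulation easy to reason about as a reweighting of an existing sum.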

[NLP-7] Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

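[Quick Read]: This paper addresses the limited ability of large language models (LLMs) to handle complex financial tasks. It introduces Fin-R1, a reasoning LLM designed for the financial sector. Fin-R1 is built with a two-stage architecture on a financial reasoning dataset distilled and processed based on DeepSeek-R1. After supervised fine-tuning (SFT) and reinforcement learning (RL), the 7-billion-parameter model performs close to DeepSeek-R1 on financial reasoning tasks, reaches state-of-the-art (SOTA) results on FinQA and ConvFinQA among the evaluated LLMs, and surpasses larger models on other tasks. The results indicate strong reasoning and decision-making capabilities for a range of problems in the financial domain.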

Link: https://arxiv.org/abs/2503.16252
Authors: Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang
Affiliations: Shanghai University of Finance and Economics; Fudan University; FinStep
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves state-of-the-art (SOTA) results on the FinQA and ConvFinQA tasks among the LLMs in our evaluation, surpassing larger models in other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at this https URL.

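[NLP-8] Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

[Quick Read]: Improving the reasoning of large language models (LLMs) usually requires massive compute and large datasets, which limits access in resource-constrained settings. This paper explores whether reinforcement learning (RL) can improve reasoning in small LLMs. It adapts the Group Relative Policy Optimization (GRPO) algorithm, curates a compact, high-quality mathematical reasoning dataset, and runs experiments under strict constraints (4 NVIDIA A40 GPUs, 24 hours of training). RL-based fine-tuning yields substantial gains at very low cost: AMC23 accuracy rises from 63% to 80% and AIME24 reaches 46.7%, using only 7,000 samples and about $42 of training cost, compared with thousands of dollars for baseline models. Optimization instability and length constraints still emerge with prolonged training. The code and datasets are released as open source, offering a practical path and insights for building reasoning-capable LLMs in resource-limited environments.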

Link: https://arxiv.org/abs/2503.16219
Authors: Quy-Anh Dang, Chris Ngo
Affiliations: VNU University of Science; Knovel Engineering Lab, Singapore
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains (e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview) using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at this https URL.
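
The "group relative" part of GRPO can be illustrated with its advantage computation: several responses are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group, so no separate value network is needed. The sketch below shows only that step, following the commonly described GRPO formulation; the adaptation used in this paper may differ, and the function name and `eps` parameter are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sampled responses within the group.

    rewards: scalar rewards for responses sampled from the same prompt
    eps: small constant guarding against a zero standard deviation (assumption)
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, which is what the policy update then reinforces or discourages.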

[NLP-9] MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion

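[Quick Read]: Existing data augmentation methods for mathematical reasoning are mostly limited to instance-level modifications, such as rephrasing or generating syntactic variations, and fail to capture the relational structure inherent in mathematical knowledge. This paper proposes MathFusion, a framework built on three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to improve reasoning flexibility. These strategies produce a new dataset, MathFusionQA, on which several LLMs are fine-tuned. MathFusion improves accuracy by 18.0 points across diverse benchmarks while maintaining high data efficiency, requiring only 45K additional synthetic instructions, a substantial improvement over traditional single-instruction approaches.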

Link: https://arxiv.org/abs/2503.16212
Authors: Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Shanghai AI Laboratory; Tsinghua University; Shanghai Jiao Tong University; School of Computer Science, Wuhan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract: Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications, such as rephrasing or generating syntactic variations, which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, MathFusionQA, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at this https URL.

[NLP-10] Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

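[Quick Read]: This paper aims to improve Scene Text Recognition (STR) and to separate the individual contributions of scaling the vision encoder and the text decoder. Contrary to previous observations, it finds that scaling the decoder yields significant gains that exceed those from scaling the encoder alone. It also identifies label noise as a key challenge in STR, particularly in real-world data, where it can limit model effectiveness. To mitigate it, the paper proposes Cloze Self-Distillation (CSD), which trains a student model on context-aware soft predictions and pseudolabels generated by a teacher model. The decoder is further enhanced with differential cross-attention. Using only real data, the method achieves state-of-the-art performance on 10 of 11 benchmarks while significantly reducing parameter count and computational cost.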

Link: https://arxiv.org/abs/2503.16184
Authors: Andrea Maracani, Savas Ozkan, Sijun Cho, Hyowon Kim, Eunchung Noh, Jeongwon Min, Cho Jung Min, Dookun Park, Mete Ozay
Affiliations: Samsung R&D Institute UK; Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: Scaling architectures has been proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.

[NLP-11] CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models

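[Quick Read]: This paper addresses how Large Language Models (LLMs) handle code review comments in real-world software engineering tasks. State-of-the-art LLMs generate code well but still struggle to turn code review feedback into concrete revisions. Review comments are often implicit, ambiguous, and colloquial, so a model must grasp both the code and the human intent. Current evaluations rely mainly on text-matching metrics, which give limited insight into why models fail and are susceptible to training data contamination.

To address these problems, the paper introduces a new evaluation benchmark, CodeReviewQA, which supports fine-grained capability assessment and reduces contamination risk. The key idea is to decompose the generative code refinement task into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions at varied difficulty levels. The evaluation covers 900 manually curated, high-quality examples across nine programming languages, and the results show that CodeReviewQA can expose specific weaknesses in code review comprehension independently of a model's generative code refinement results.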

Link: https://arxiv.org/abs/2503.16167
Authors: Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude
Affiliations: The University of Melbourne; Singapore Management University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract: State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models’ ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, CodeReviewQA, that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities, while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.

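[NLP-12] SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

[Quick Read]: This paper addresses a bottleneck for Transformer-based large language models (LLMs) on long sequences: limited GPU memory (VRAM) cannot hold a key-value (KV) cache that grows linearly with sequence length. Existing KV cache compression methods (eviction, merging, or quantization) shrink the cache but cause irreversible information loss that can hurt the accuracy of later decoding.

The proposed method, SpeCache, takes advantage of large and easily expandable CPU memory: it offloads the complete KV cache to CPU memory and, at each decoding step, dynamically fetches KV pairs back according to their importance as measured on a low-bit copy of the KV cache kept in VRAM. To avoid the inference latency of CPU-GPU communication, SpeCache speculatively predicts the KV pairs the next token is likely to attend to and prefetches them before the next decoding step, so that prefetching and computation run in parallel. Experiments show that SpeCache reduces VRAM usage without retraining and avoids information forgetting on long sequences, even at a KV cache compression ratio as high as 10x.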

Link: https://arxiv.org/abs/2503.16163
Authors: Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of the large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back in each decoding step based on their importance measured by low-bit KV cache copy in VRAM. To avoid inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even with a 10x high KV cache compression ratio.
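
The selection step described above, scoring KV pairs on a low-bit copy and fetching only the most important ones at full precision, can be sketched as follows. Everything concrete here is an assumption made for illustration: the uniform per-row 2-bit quantizer, the dot-product importance score, the top-k rule, and all function names. The speculative prefetch and its overlap with computation are not modeled.

```python
import numpy as np

def quantize(keys, bits=2):
    """Uniform per-row quantization; returns integer codes, scale, and offset."""
    levels = 2 ** bits - 1
    lo = keys.min(axis=1, keepdims=True)
    scale = (keys.max(axis=1, keepdims=True) - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # constant rows: avoid dividing by zero
    codes = np.round((keys - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Approximate reconstruction of the keys from their low-bit codes."""
    return codes * scale + lo

def select_kv_to_fetch(query, codes, scale, lo, k):
    """Indices of the k cache entries with the highest approximate scores.

    The low-bit copy stands in for the copy kept in VRAM; the returned
    indices say which full-precision pairs to fetch from CPU memory.
    """
    approx_scores = dequantize(codes, scale, lo) @ query
    return np.argsort(-approx_scores)[:k]
```

In a real system the returned indices would drive an asynchronous host-to-device copy; the sketch only shows how an inexpensive low-bit copy can decide what is worth transferring.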

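[NLP-13] Towards Lighter and Robust Evaluation for Retrieval Augmented Generation ICLR25

[Quick Read]: This paper addresses the difficulty of evaluating hallucination in answers that Large Language Models generate within the RAG framework. The standard approach relies on commercial closed models such as GPT-4 as evaluators, which is expensive and not very transparent. The paper proposes a lightweight approach based on open-weight models: smaller, quantized LLMs provide an accessible and interpretable metric that assigns continuous scores to generated answers for correctness and faithfulness. The score makes it possible to question the reliability of decisions and to explore thresholds, leading to a new AUC metric as an alternative to correlation with human judgment.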

Link: https://arxiv.org/abs/2503.16161
Authors: Alex-Razvan Ispas, Charles-Elie Simon, Fabien Caspani, Vincent Guigue
Affiliations: Paris-Saclay University, BNP Paribas CIB; BNP Paribas CIB; AgroParisTech
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures, published at 1st workshop of Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI at ICLR 25

Abstract:Large Language Models are prompting us to view more NLP tasks from a generative perspective. At the same time, they offer a new way of accessing information, mainly through the RAG framework. While there have been notable improvements for the autoregressive models, overcoming hallucination in the generated answers remains a continuous problem. A standard solution is to use commercial LLMs, such as GPT4, to evaluate these algorithms. However, such frameworks are expensive and not very transparent. Therefore, we propose a study which demonstrates the interest of open-weight models for evaluating RAG hallucination. We develop a lightweight approach using smaller, quantized LLMs to provide an accessible and interpretable metric that gives continuous scores for the generated answer with respect to their correctness and faithfulness. This score allows us to question decisions’ reliability and explore thresholds to develop a new AUC metric as an alternative to correlation with human judgment.
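
The abstract does not define its AUC metric, so the sketch below assumes the standard ROC AUC: given a continuous evaluator score per answer and a binary reference label (for instance, whether the answer is correct), it measures how often a positive answer is scored above a negative one across all thresholds. The function name and the label convention are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative examples.

    scores: continuous evaluator scores, one per generated answer
    labels: 1 for a positive (e.g. correct) answer, 0 otherwise (assumed convention)
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Because the value summarizes every possible threshold at once, it avoids committing to a single cut-off when comparing evaluators, which fits the threshold exploration the abstract describes.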

[NLP-14] Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems NAACL2025

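[Quick Read]: This paper addresses a challenge in evaluating machine translation (MT) of user-generated content (UGC): checking whether the emotional nuance of the source is preserved in the target text. Recent work has proposed emotion-related datasets, frameworks, and models for automatically evaluating MT quality of Chinese UGC, but whether these models are robust with respect to preserving emotional nuance has been largely unexplored. The paper proposes a method inspired by information theory that uses self-information to generate challenging emotion-related Chinese homophone words, in particular ones observed to cause translation errors in emotion preservation. The method exposes vulnerabilities in MT systems and their evaluation methods on emotional UGC. Human evaluation of the generated homophones shows higher correlation with human judgments than an existing method. The generated homophones, together with their manual translations, are then used as perturbations to probe the robustness of existing quality evaluation models, and the results indicate that larger large language models (LLMs) are more stable and robust to such perturbations.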

Link: https://arxiv.org/abs/2503.16158
Authors: Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo
Affiliations: University of Surrey
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 10th Workshop on Noisy and User-generated Text at NAACL 2025

Abstract:Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges such as checking whether the nuance of emotions from the source are preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To address this gap, we introduce a novel method inspired by information theory which generates challenging Chinese homophone words related to emotions, by leveraging the concept of self-information. Our approach generates homophones that were observed to cause translation errors in emotion preservation, and exposes vulnerabilities in MT systems and their evaluation methods when tackling emotional UGC. We evaluate the efficacy of our method using human evaluation for the quality of these generated homophones, and compare it with an existing one, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are utilized to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained using multi-task learning, fine-tuned variants of multilingual language models, as well as large language models (LLMs). Our results indicate that LLMs with larger size exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.
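
Self-information, the quantity the abstract says the method leverages, is defined as I(x) = -log2 p(x): the less probable an item, the more bits of surprise it carries. The sketch below shows that definition and one hypothetical way to rank candidates by it. How the paper estimates probabilities and selects homophones is not stated in the abstract, so the ranking helper is an illustrative assumption.

```python
import math

def self_information(probability):
    """Self-information in bits: I(x) = -log2 p(x)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

def rank_by_surprise(candidates):
    """Sort (word, probability) pairs from most to least surprising.

    Hypothetical helper: the probabilities are assumed to come from some
    language model or frequency table chosen by the caller.
    """
    return sorted(candidates, key=lambda item: self_information(item[1]), reverse=True)
```

An item with probability 0.5 carries 1 bit and one with probability 0.125 carries 3 bits, so rarer candidates rise to the top of the ranking.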

[NLP-15] Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models

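[Quick read]: This paper addresses political bias in language models and how its measurement varies across settings. The key contribution is a measure grounded in political science theory: it follows survey design principles to test a wide variety of input prompts while accounting for prompt sensitivity. The authors prompt 11 open and commercial models, distinguishing instruction-tuned from non-instruction-tuned models, and automatically classify the political stances in 88,110 responses. Analysis of this dataset shows that the Political Compass Test (PCT) exaggerates bias in certain models, and that measures of political bias are often unstable but generally more left-leaning for instruction-tuned models.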

Link: https://arxiv.org/abs/2503.16148
Authors: Mats Faulborn, Indira Sen, Max Pellert, Andreas Spitz, David Garcia
Affiliations: University of Konstanz; University of Mannheim; Barcelona Super Computing Center
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract: Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, but generally more left-leaning for instruction-tuned models.

[NLP-16] MKG-Rank: Enhancing Large Language Models with Knowledge Graph for Multilingual Medical Question Answering

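[Quick read]: This paper targets the limited effectiveness of large language models (LLMs) in non-English medical question answering (QA), which stems from imbalanced multilingual training data and scarce medical resources for low-resource languages. The key solution is Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank), a framework that uses word-level translation to integrate comprehensive English-centric medical knowledge graphs into LLM reasoning at low cost, mitigating cross-lingual semantic distortion and enabling precise medical QA across language barriers. For efficiency, MKG-Rank adds caching and multi-angle ranking strategies that optimize retrieval, significantly reducing response time and prioritizing relevant medical knowledge.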

Link: https://arxiv.org/abs/2503.16131
Authors: Feiyang Li, Yingjian Chen, Haoran Liu, Rui Yang, Han Yuan, Yuang Jiang, Tianxiao Li, Edison Marrese Taylor, Hossein Rouhizadeh, Yusuke Iwasawa, Douglas Teodoro, Yutaka Matsuo, Irene Li
Affiliations: University of Tokyo; Texas A&M University; Duke-NUS Medical School; Smartor LLC; NEC Laboratories America; University of Geneva, Switzerland
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Large Language Models (LLMs) have shown remarkable progress in medical question answering (QA), yet their effectiveness remains predominantly limited to English due to imbalanced multilingual training data and scarce medical resources for low-resource languages. To address this critical language gap in medical QA, we propose Multilingual Knowledge Graph-based Retrieval Ranking (MKG-Rank), a knowledge graph-enhanced framework that enables English-centric LLMs to perform multilingual medical QA. Through a word-level translation mechanism, our framework efficiently integrates comprehensive English-centric medical knowledge graphs into LLM reasoning at a low cost, mitigating cross-lingual semantic distortion and achieving precise medical QA across language barriers. To enhance efficiency, we introduce caching and multi-angle ranking strategies to optimize the retrieval process, significantly reducing response times and prioritizing relevant medical knowledge. Extensive evaluations on multilingual medical QA benchmarks across Chinese, Japanese, Korean, and Swahili demonstrate that MKG-Rank consistently outperforms zero-shot LLMs, achieving a maximum 33.89% increase in accuracy, while maintaining an average retrieval time of only 0.0009 seconds.

[NLP-17] Cultural Alignment in Large Language Models Using Soft Prompt Tuning

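[Quick read]: This paper addresses the alignment of large language models (LLMs) with cultural dimensions. Conventional alignment relies on supervised fine-tuning or reinforcement-learning-based frameworks, which require labeled or preference data and update the model weights. In social sciences such as cross-cultural studies, however, factor analysis is used to uncover the latent dimensions that explain patterns in survey data, and because these measurements are non-differentiable, conventional alignment methods are infeasible for cultural dimensions. The paper proposes a parameter-efficient strategy that combines soft prompt tuning (freezing the model parameters while modifying the input prompt embeddings) with Differential Evolution (DE), a black-box optimization method suited to non-differentiable objectives. The key point is that alignment consistency is achieved without preference data or parameter updates, which significantly improves efficiency and mitigates overfitting. Experiments show significant improvements in the cultural dimensions of LLama-3-8B-Instruct across multiple regions, outperforming both the naive LLM and an In-context Learning (ICL) baseline.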

Link: https://arxiv.org/abs/2503.16094
Authors: Reem I. Masoud, Martin Ferianc, Philip Treleaven, Miguel Rodrigues
Affiliations: Department of Electronic and Electrical Engineering, University College London; Department of Computer Science, University College London; Department of Electrical Engineering, King Abdulaziz University; AI Centre, University College London
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Model (LLM) alignment conventionally relies on supervised fine-tuning or reinforcement learning based alignment frameworks. These methods typically require labeled or preference datasets and involve updating model weights to align the LLM with the training objective or reward model. Meanwhile, in social sciences such as cross-cultural studies, factor analysis is widely used to uncover underlying dimensions or latent variables that explain observed patterns in survey data. The non-differentiable nature of these measurements deriving from survey data renders the former alignment methods infeasible for alignment with cultural dimensions. To overcome this, we propose a parameter efficient strategy that combines soft prompt tuning, which freezes the model parameters while modifying the input prompt embeddings, with Differential Evolution (DE), a black-box optimization method for cases where a differentiable objective is unattainable. This strategy ensures alignment consistency without the need for preference data or model parameter updates, significantly enhancing efficiency and mitigating overfitting. Our method demonstrates significant improvements in LLama-3-8B-Instruct’s cultural dimensions across multiple regions, outperforming both the Naive LLM and the In-context Learning (ICL) baseline, and effectively bridges computational models with human cultural nuances.
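
To make the black-box idea concrete, here is a minimal sketch of Differential Evolution (the common DE/rand/1/bin variant) minimizing an objective that has no useful gradient. It is an illustration under stated assumptions, not the authors' setup: the "soft prompt" is just a short list of floats, and `toy_objective` with its `TARGET` profile is a hypothetical stand-in for a survey-derived cultural score that would, in practice, require running the frozen LLM.

```python
import random

def differential_evolution(objective, dim, pop_size=20, generations=100,
                           f=0.5, cr=0.9, bounds=(-5.0, 5.0), seed=0):
    """Minimal DE/rand/1/bin minimizer; it needs objective values only, no gradients."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    initial_best = min(scores)
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated coordinate
            trial = []
            for k in range(dim):
                if rng.random() < cr or k == j_rand:
                    v = pop[a][k] + f * (pop[b][k] - pop[c][k])
                    v = min(max(v, lo), hi)  # clip to the search bounds
                else:
                    v = pop[i][k]
                trial.append(v)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], initial_best

# Hypothetical target profile; rounding makes the objective piecewise constant,
# so gradient-based tuning would receive no signal.
TARGET = [1, -2, 3, 0]

def toy_objective(soft_prompt):
    return sum(abs(round(v) - t) for v, t in zip(soft_prompt, TARGET))

best_prompt, best_score, initial_best = differential_evolution(
    toy_objective, dim=len(TARGET)
)
```

Because selection is greedy, the best score can never get worse across generations, which is the property that makes DE usable when only scores, not gradients, are available.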

[NLP-18] Redefining Toxicity: An Objective and Context-Aware Approach for Stress-Level-Based Detection

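[Quick read]: This paper addresses the fundamental problem that the term "toxicity" is ill-defined in toxicity detection. This uncertainty leads researchers to train models on subjective and vague data, producing non-robust and inaccurate results in line with the "garbage in, garbage out" paradigm. The key solution is a novel, objective, context-aware framework for toxicity detection that uses stress levels as the key determinant of toxicity, together with a new definition, metric, and training approach. The framework's effectiveness is demonstrated on a dataset collected by the authors.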

Link: https://arxiv.org/abs/2503.16072
Authors: Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Affiliations: SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: The fundamental problem of toxicity detection lies in the fact that the term “toxicity” is ill-defined. Such uncertainty causes researchers to rely on subjective and vague data during model training, which leads to non-robust and inaccurate results, following the ‘garbage in - garbage out’ paradigm. This study introduces a novel, objective, and context-aware framework for toxicity detection, leveraging stress levels as a key determinant of toxicity. We propose a new definition, metric, and training approach as parts of our framework and demonstrate its effectiveness using a dataset we collected.

[NLP-19] Tuning LLMs by RAG Principles: Towards LLM-native Memory

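[Quick read]: This paper studies how to incorporate memory into the generation process of large language models (LLMs) to support applications such as personal assistants. It compares the two mainstream solutions, long-context LLMs and retrieval-augmented generation (RAG), and finds that long-context solutions, although more expensive, capture the big picture more easily and better answer queries that require considering the memory as a whole, whereas RAG is more competitive when queries concern specific information, especially when keywords can be matched explicitly. The paper therefore proposes RAG-Tuned-LLM, which fine-tunes a relatively small LLM (e.g., 7B parameters) on data generated following RAG principles so as to combine the advantages of both. Experiments show that it outperforms long-context LLMs and RAG methods across a wide range of query types.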

Link: https://arxiv.org/abs/2503.16071
Authors: Jiale Wei, Shuchi Wu, Ruochen Liu, Xiang Ying, Jingbo Shang, Fangbo Tao
Affiliations: Mindverse.ai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract: Memory, additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions to incorporate memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, shall be easier to capture the big picture and better answer queries which require considering the memory as a whole; and (2) when the queries concern specific information, RAG solutions shall be more competitive especially when the keywords can be explicitly matched. Therefore, we propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using the data generated following the RAG principles, so it can combine the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.

[NLP-20] Two-stage Incomplete Utterance Rewriting on Editing Operation

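[Quick read]: This paper addresses a limitation of prior work on Incomplete Utterance Rewriting (IUR), which generates rewrites based solely on the dialogue context and ignores the widespread coreference and ellipsis in dialogues. It proposes TEO (Two-stage approach on Editing Operation), a framework that splits the task into two stages: the first generates editing operations, and the second rewrites the incomplete utterance using those operations together with the dialogue context. To mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage, the paper also proposes an adversarial perturbation strategy. Experiments on three IUR datasets show that TEO significantly outperforms state-of-the-art (SOTA) models.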

Link: https://arxiv.org/abs/2503.16063
Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu, Yaxin Fan
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Previous work on Incomplete Utterance Rewriting (IUR) has primarily focused on generating rewritten utterances based solely on dialogue context, ignoring the widespread phenomenon of coreference and ellipsis in dialogues. To address this issue, we propose a novel framework called TEO (Two-stage approach on Editing Operation) for IUR, in which the first stage generates editing operations and the second stage rewrites incomplete utterances utilizing the generated editing operations and the dialogue context. Furthermore, an adversarial perturbation strategy is proposed to mitigate cascading errors and exposure bias caused by the inconsistency between training and inference in the second stage. Experimental results on three IUR datasets show that our TEO outperforms the SOTA models significantly.

[NLP-21] Meta-Learning Neural Mechanisms Rather Than Bayesian Priors

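[Quick read]: This paper investigates what meta-learning on formal languages actually gives a model, and in particular whether meta-training makes models acquire simplicity-based priors. Contrary to previous claims, models meta-trained on datasets organised around simplicity do not learn simplicity-based priors; instead, meta-training imprints neural mechanisms such as counters into the model, which act as cognitive primitives for the network on downstream tasks. The study also finds that meta-training on a single formal language can yield as much improvement as meta-training on 5000 different formal languages, provided that the language incentivizes the learning of useful neural mechanisms. The findings offer practical guidance for efficient meta-learning paradigms and new theoretical insight into linking symbolic theories and neural mechanisms.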

Link: https://arxiv.org/abs/2503.16048
Authors: Michael Goodale, Salvador Mascarenhas, Yair Lakretz
Affiliations: Institut Jean Nicod; Laboratoire de Sciences Cognitives et Psycholinguistique; Département d’études cognitives, ENS, EHESS, CNRS, PSL University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Children acquire language despite being exposed to several orders of magnitude less data than large language models require. Meta-learning has been proposed as a way to integrate human-like learning biases into neural-network architectures, combining both the structured generalizations of symbolic models with the scalability of neural-network models. But what does meta-learning exactly imbue the model with? We investigate the meta-learning of formal languages and find that, contrary to previous claims, meta-trained models are not learning simplicity-based priors when meta-trained on datasets organised around simplicity. Rather, we find evidence that meta-training imprints neural mechanisms (such as counters) into the model, which function like cognitive primitives for the network on downstream tasks. Most surprisingly, we find that meta-training on a single formal language can provide as much improvement to a model as meta-training on 5000 different formal languages, provided that the formal language incentivizes the learning of useful neural mechanisms. Taken together, our findings provide practical implications for efficient meta-learning paradigms and new theoretical insights into linking symbolic theories and neural mechanisms.
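
As a point of reference for what a "counter" mechanism buys on a formal language, the sketch below recognizes the textbook language a^n b^n with a single integer counter. The choice of a^n b^n is an assumption made for illustration; the abstract does not say which formal languages the paper uses, and this symbolic recognizer is not a model of the neural mechanism itself.

```python
def accepts_anbn(s: str) -> bool:
    """Recognize a^n b^n (n >= 1) with a single integer counter."""
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:        # an 'a' after a 'b' breaks the pattern
                return False
            count += 1        # count up on every 'a'
        elif ch == "b":
            seen_b = True
            count -= 1        # count down on every 'b'
            if count < 0:     # more b's than a's so far
                return False
        else:
            return False      # symbol outside the alphabet
    return seen_b and count == 0
```

A finite-state device cannot do this for unbounded n, which is why a counter is a natural candidate for a reusable primitive on downstream tasks.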

[NLP-22] Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation

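[Quick read]: This paper addresses two problems in Incomplete Utterance Rewriting (IUR): existing generation methods often include irrelevant and redundant tokens in the rewritten utterance because they cannot focus on the critical tokens in the dialogue context, and the limited size of the training datasets leaves IUR models insufficiently trained. For the first problem, it proposes EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting), a multi-task learning framework in which editing operation labels produced by a sequence labeling module guide the generation model toward critical tokens, and dialogues are represented as a token-level heterogeneous graph. For the second, it proposes a two-dimensional utterance augmentation strategy: editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. Experiments show that the method outperforms previous state-of-the-art baselines in both open-domain and task-oriented dialogue.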

Link: https://arxiv.org/abs/2503.16043
Authors: Zhiyu Cao, Peifeng Li, Yaxin Fan, Qiaoming Zhu
Affiliations: School of Computer Science and Technology, Soochow University, Suzhou, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although existing fashionable generation methods on Incomplete Utterance Rewriting (IUR) can generate coherent utterances, they often result in the inclusion of irrelevant and redundant tokens in rewritten utterances due to their inability to focus on critical tokens in dialogue context. Furthermore, the limited size of the training datasets also contributes to the insufficient training of the IUR model. To address the first issue, we propose a multi-task learning framework EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting) that introduces the editing operation labels generated by sequence labeling module to guide generation model to focus on critical tokens. Furthermore, we introduce a token-level heterogeneous graph to represent dialogues. To address the second issue, we propose a two-dimensional utterance augmentation strategy, namely editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. The experimental results on three datasets demonstrate that our EO-IUR outperforms previous state-of-the-art (SOTA) baselines in both open-domain and task-oriented dialogue. The code will be available at this https URL.

[NLP-23] Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

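[Quick read]: This paper evaluates the legal reasoning capabilities of large language models (LLMs). It systematically analyzes 9 LLMs on 17 Chinese and English legal tasks, with a focus on complex challenges such as multi-defendant legal judgments and legal argument reasoning. Although DeepSeek-R1 and OpenAI o1 are among the strongest models, they score below 80% on seven Chinese and two English legal reasoning tasks, which indicates that the legal reasoning of even the most advanced LLMs remains underdeveloped.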

Link: https://arxiv.org/abs/2503.16040
Authors: Yaoyao Yu, Leilei Gan, Yinghao Hu, Bin Wei, Kun Kuang, Fei Wu
Affiliations: Guanghua Law School, Zhejiang University; College of Software Technology, Zhejiang University; College of Computer Science and Technology, Zhejiang University; Law & AI Lab, Zhejiang University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields like legal remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80% on seven Chinese legal reasoning tasks and below 80% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.

[NLP-24] Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models CVPR2025

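[Quick read]: This paper addresses the computational overhead that multi-modal large language models (MLLMs) face when processing massive numbers of video frames. Existing compression strategies such as average pooling reduce the cost but inevitably lose visual information relevant to the user instruction. The paper proposes HICom, a hybrid-level instruction injection strategy for conditional token compression in MLLMs. Its key idea is to use the instruction as a condition: the instruction is injected into grouped visual tokens at the local level and into learnable tokens at the global level, and an attention mechanism completes the conditional compression. This reduces the number of visual tokens while retaining as much user-focused information as possible, highlighting instruction-relevant visual parts and preserving the temporal-spatial structure so that the LLM can understand the video more easily. A new conditional pre-training stage and the proposed dataset HICom-248K further unlock HICom's potential. Experiments show an average performance gain of 2.43% on three multiple-choice QA benchmarks while saving 78.8% of tokens compared with the SOTA method.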

Link: https://arxiv.org/abs/2503.16036
Authors: Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
Affiliations: University of Science and Technology of China; Tongyi Lab, Alibaba Group; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to CVPR2025

Abstract: Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, because visual content does not contribute equally to user instructions, existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and the learnable tokens at the global level, and we conduct the attention mechanism to complete the conditional compression. From the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is also preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that our HICom can obtain distinguished video understanding ability with fewer tokens, increasing the performance by 2.43% on average on three multiple-choice QA benchmarks and saving 78.8% tokens compared with the SOTA method. The code is available at this https URL.

[NLP-25] Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

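[Quick read]: This paper studies humor derived from fabricated claims and misinformation (deceptive humor) using a generated dataset. The key contribution is the Deceptive Humor Dataset (DHD), which consists of humor-infused comments generated from false narratives and incorporating fabricated claims and manipulated information. Each instance is labeled with a Satire Level and classified into Humor Categories, and the dataset covers multiple languages and their code-mixed variants. DHD provides a structured foundation for analyzing humor in deceptive contexts, opens a research direction on how humor influences the perception and spread of misinformation, and comes with strong baselines for advancing deceptive humor detection models.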

Link: https://arxiv.org/abs/2503.16031
Authors: Sai Kartheek Reddy Kasu, Shankar Biradar, Sunil Saumya
Affiliations: IIIT Dharwad; MIT Manipal
Categories: Computation and Language (cs.CL)
Comments: 15 Pages, 4 figures, 8 tables

Abstract:This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated from false narratives, incorporating fabricated claims and manipulated information using the ChatGPT-4o model. Each instance is labeled with a Satire Level, ranging from 1 for subtle satire to 3 for high-level satire and classified into five distinct Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset spans multiple languages including English, Telugu, Hindi, Kannada, Tamil, and their code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, paving the way for a new research direction that explores how humor not only interacts with misinformation but also influences its perception and spread. We establish strong baselines for the proposed dataset, providing a foundation for future research to benchmark and advance deceptive humor detection models.

[NLP-26] The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

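[Quick read]: This paper addresses a limitation that arises as large language models (LLMs) move from text-based assistants to autonomous agents: numerical reward signals and verifiers provide limited contextual guidance, which constrains the exploration of alternative strategies. The key solution is Critique-Guided Improvement (CGI), a novel two-player framework comprising an actor model that explores an environment and a critic model that generates natural language feedback. Training the critic to produce fine-grained assessments and actionable revisions, and the actor to use this feedback, promotes more robust exploration of strategies while avoiding local optima. Experiments show that even a small critic model surpasses GPT-4 in feedback quality and that the resulting actor achieves state-of-the-art performance.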

Link: https://arxiv.org/abs/2503.16024
Authors: Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
Affiliations: Fudan University; Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models (LLMs) have recently transformed from text-based assistants to autonomous agents capable of planning, reasoning, and iteratively improving their actions. While numerical reward signals and verifiers can effectively rank candidate actions, they often provide limited contextual guidance. In contrast, natural language feedback better aligns with the generative capabilities of LLMs, providing richer and more actionable suggestions. However, parsing and implementing this feedback effectively can be challenging for LLM-based agents. In this work, we introduce Critique-Guided Improvement (CGI), a novel two-player framework, comprising an actor model that explores an environment and a critic model that generates detailed natural language feedback. By training the critic to produce fine-grained assessments and actionable revisions, and the actor to utilize these critiques, our approach promotes more robust exploration of alternative strategies while avoiding local optima. Experiments in three interactive environments show that CGI outperforms existing baselines by a substantial margin. Notably, even a small critic model surpasses GPT-4 in feedback quality. The resulting actor achieves state-of-the-art performance, demonstrating the power of explicit iterative guidance to enhance decision-making in LLM-based agents.

[NLP-27] Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models NAACL2025

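[Quick read]: This paper targets the error-proneness of in-context learning (ICL) on challenging natural language processing (NLP) examples. To improve ICL, it proposes corrective in-context learning (CICL), which places the model's incorrect predictions alongside the corresponding ground-truth corrections in the prompt, with the aim of improving classification accuracy through self-correction. Experiments show, however, that CICL consistently underperforms standard ICL, and that performance degrades further as the proportion of corrections in the prompt increases. Rather than refining predictions, CICL introduces confusion by disrupting the model's task understanding. The paper also finds that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable selection criterion. These negative results reveal limitations of self-corrective mechanisms in large language models (LLMs) and suggest directions for future research.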

Link: https://arxiv.org/abs/2503.16022
Authors: Mario Sanz-Guerrero, Katharina von der Wense
Affiliations: Johannes Gutenberg University Mainz, Germany; University of Colorado Boulder, USA
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 6th Workshop on Insights from Negative Results in NLP at NAACL 2025

Abstract:In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model’s incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model’s task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
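
The difference between a standard ICL prompt and a corrective one can be sketched as plain string construction. This is a minimal illustration only: the field names (`Text`, `Predicted`, `Correct label`) and the layout are hypothetical, since the abstract does not give the paper's actual prompt template.

```python
def build_icl_prompt(examples, query):
    """Standard ICL: labeled demonstrations followed by the unlabeled query."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for text, label in examples]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)

def build_cicl_prompt(examples, query):
    """Corrective variant: each demo shows an earlier wrong prediction and the gold label."""
    blocks = [
        f"Text: {text}\nPredicted: {predicted}\nCorrect label: {gold}\n"
        for text, predicted, gold in examples
    ]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)

# Toy sentiment demonstrations (invented for illustration).
icl_prompt = build_icl_prompt([("great movie", "positive")], "dull plot")
cicl_prompt = build_cicl_prompt([("great movie", "negative", "positive")], "dull plot")
```

The corrective prompt carries two labels per demonstration, one of them wrong, which is one plausible reading of why the abstract reports that corrections can confuse the model's understanding of the task.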

[NLP-28] Autonomous AI imitators increase diversity in homogeneous information ecosystems

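[Quick read]: This paper examines how autonomous AI agents driven by large language models (LLMs) that imitate human-generated content may affect the diversity and democratic value of information ecosystems. It introduces a large-scale simulation framework that systematically tests two distinct imitation strategies across news environments with varying initial diversity. The results show that AI-generated articles do not uniformly homogenize content; the effect is strongly context-dependent. In originally homogeneous environments, AI-generated content can introduce valuable diversity, whereas in initially highly heterogeneous environments it may diminish diversity. The baseline diversity of an information space is therefore a core factor in AI's impact, which challenges the assumption that AI-driven imitation uniformly threatens information diversity. When information is initially homogeneous, AI-driven imitation can expand perspectives, styles, and topics, fostering richer public debate, which is essential for a resilient democracy.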

Link: https://arxiv.org/abs/2503.16021
Authors: Emil Bakkensen Johansen, Oliver Baumann
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 35 pages, 10 figures, 4 tables

Abstract:Recent breakthroughs in large language models (LLMs) have facilitated autonomous AI agents capable of imitating human-generated content. This technological advancement raises fundamental questions about AI’s potential impact on the diversity and democratic value of information ecosystems. Here, we introduce a large-scale simulation framework to examine AI-based imitation in news, a context critically influential for public discourse. By systematically testing two distinct imitation strategies across a range of information environments varying in initial diversity, we demonstrate that AI-generated articles do not uniformly homogenize content. Instead, AI’s influence is strongly context-dependent: AI-generated articles can introduce valuable diversity in originally homogeneous news environments, while potentially diminishing diversity in contexts that initially display high heterogeneity. These results illustrate that the baseline diversity of an information space critically shapes AI’s impact, challenging assumptions that AI-driven imitation uniformly threatens information diversity. Instead, when information is initially homogeneous, AI-driven imitation can expand perspectives, styles, and topics. This is especially important in news contexts, where information diversity fosters richer public debate by exposing citizens to alternative viewpoints, challenging biases, and preventing narrative monopolies, which is essential for a resilient democracy.

[NLP-29] ECKGBench: Benchmarking Large Language Models in E-commerce Leveraging Knowledge Graph

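[Quick read]: This paper addresses the evaluation of the factuality of large language models (LLMs) in e-commerce. Existing evaluation methods suffer from insufficient reliability, high cost, and a lack of domain expertise, leaving a gap in effective assessment for e-commerce scenarios. To bridge it, the paper proposes ECKGBench, a dataset designed to evaluate LLMs' e-commerce knowledge. The key elements are a standardized workflow that automatically generates questions from a large-scale knowledge graph to ensure reliability; a simple question-answering paradigm that improves evaluation efficiency by minimizing input and output tokens; and abundant e-commerce expertise injected at every evaluation stage, including human annotation, prompt design, negative sampling, and verification. Together these allow a more effective measurement of LLMs' knowledge boundaries and potential in e-commerce.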

Link: https://arxiv.org/abs/2503.15990
Authors: Langming Liu, Haibin Chen, Yuhao Wang, Yujin Yuan, Shilei Liu, Wenbo Su, Xiangyu Zhao, Bo Zheng
Affiliations: Taobao & Tmall Group of Alibaba; City University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated their capabilities across various NLP tasks. Their potential in e-commerce is also substantial, evidenced by practical implementations such as platform search, personalized recommendations, and customer service. One primary concern associated with LLMs is their factuality (e.g., hallucination), which is urgent in e-commerce due to its significant impact on user experience and revenue. Despite some methods proposed to evaluate LLMs’ factuality, issues such as lack of reliability, high consumption, and lack of domain expertise leave a gap between effective assessment in e-commerce. To bridge the evaluation gap, we propose ECKGBench, a dataset specifically designed to evaluate the capacities of LLMs in e-commerce knowledge. Specifically, we adopt a standardized workflow to automatically generate questions based on a large-scale knowledge graph, guaranteeing sufficient reliability. We employ the simple question-answering paradigm, substantially improving the evaluation efficiency by the least input and output tokens. Furthermore, we inject abundant e-commerce expertise in each evaluation stage, including human annotation, prompt design, negative sampling, and verification. Besides, we explore the LLMs’ knowledge boundaries in e-commerce from a novel perspective. Through comprehensive evaluations of several advanced LLMs on ECKGBench, we provide meticulous analysis and insights into leveraging LLMs for e-commerce.

[NLP-30] InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer

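[Quick read]: This paper explores optimizing transformer-based language models by combining model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention uses Manhattan distances and ReLU activations in place of the matrix multiplications and softmax activation of conventional scaled dot-product attention, offering potential computational and energy savings while maintaining model effectiveness. The key steps are further adjustments that improve the inhibitor mechanism's training efficiency and an evaluation on the DistilBERT architecture; knowledge distillation experiments indicate that the modified inhibitor transformer achieves competitive performance on standard NLP benchmarks, including GLUE and sentiment analysis tasks.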

Link: https://arxiv.org/abs/2503.15983
Authors: Tony Zhang, Rickard Brännvall
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 2 tables

Abstract:This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism’s training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
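
To show what "Manhattan distances and ReLU instead of matrix multiplications and softmax" can look like, here is a toy single-head sketch in plain Python. The exact form used below (a scaled Manhattan distance score that is subtracted from each value before a ReLU and a sum) is an assumption about the inhibitor formulation; the abstract does not state the equations, and the paper's proposed adjustments are not modeled here.

```python
def inhibitor_attention(Q, K, V, gamma=1.0):
    """Toy attention built from absolute differences, subtraction, and ReLU only."""
    out = []
    for q in Q:
        # Z[j]: scaled Manhattan distance between this query and key j.
        Z = [sum(abs(qk - kk) for qk, kk in zip(q, k)) / gamma for k in K]
        # Each value is inhibited by its distance score, clipped by ReLU, then summed.
        row = [
            sum(max(v[d] - z, 0.0) for v, z in zip(V, Z))
            for d in range(len(V[0]))
        ]
        out.append(row)
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 2.0]]
V = [[2.0, 1.0], [5.0, 5.0]]
H = inhibitor_attention(Q, K, V)  # → [[4.0, 3.0]]
```

In this example the first key matches the query exactly (distance 0), so its value passes through unchanged, while the second key is at distance 3 and its value is reduced by 3 in every dimension before being added.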

[NLP-31] Exploratory Study into Relations between Cognitive Distortions and Emotional Appraisals

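[Quick read]: This paper explores the relationship between cognitive distortions and emotional appraisal dimensions, and its relevance for future interdisciplinary studies. Using a computational approach, it shows that the patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across distortion categories, giving rise to distinct appraisal profiles for individual distortion classes. It also analyzes the impact of cognitive restructuring on appraisal dimensions, illustrating the role of cognitive restructuring in emotion regulation.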

链接: https://arxiv.org/abs/2503.15979
作者: Navneet Agarwal,Kairit Sirts
机构: University of Tartu (塔尔图大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In recent years, there has been growing interest in studying cognitive distortions and emotional appraisals from both computational and psychological perspectives. Despite considerable similarities between emotional reappraisal and cognitive reframing as emotion regulation techniques, these concepts have largely been examined in isolation. This research explores the relationship between cognitive distortions and emotional appraisal dimensions, examining their potential connections and relevance for future interdisciplinary studies. Under this pretext, we conduct an exploratory computational study, aimed at investigating the relationship between cognitive distortion and emotional appraisals. We show that the patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across different distortion categories, giving rise to distinct appraisal profiles for individual distortion classes. Additionally, we analyze the impact of cognitive restructuring on appraisal dimensions, exemplifying the emotion regulation aspect of cognitive restructuring.

[NLP-32] Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning DATE

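[Quick Read]: This paper addresses the insufficient stability and low inference efficiency of Group Relative Policy Optimization (GRPO) when training reasoning LLMs. The key to the solution is Adaptive Group Policy Optimization (AGPO), which makes two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Experiments show that the method achieves more stable training and uses significantly fewer tokens in reasoning steps while reaching comparable or superior performance.

Link: https://arxiv.org/abs/2503.15952
Authors: Chen Li, Nazhou Liu, Kai Yang
Affiliations: HPC-AI Tech
Categories: Computation and Language (cs.CL)
Comments: This is an unfinished version and will be updated. We aim to share some findings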

Abstract: Since DeepSeek-R1 popularized it, Group Relative Policy Optimization (GRPO) has become a core part of training reasoning LLMs. However, we find some deficiencies that affect RL stability and inference efficiency. We therefore propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. The experiments demonstrate that our method achieves more stable training and comparable or superior performance with significantly fewer tokens in reasoning steps.
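
The zero-variance problem can be illustrated with a small sketch. GRPO-style training normalizes each reward against its group's mean and standard deviation, so a group whose rewards are all equal has zero standard deviation. The abstract does not give AGPO's actual formulas; the `fallback` value, the shape of the length bonus, and all names below are hypothetical choices made only to show the idea.

```python
import statistics


def group_advantages(rewards, fallback=0.0, eps=1e-8):
    """Group-normalized advantages with an explicit zero-variance branch.

    `fallback` is a hypothetical constant used when every reward in the
    group is identical and (r - mean) / std would be undefined.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        return [fallback] * len(rewards)
    return [(r - mean) / std for r in rewards]


def length_shaped_reward(correct, num_tokens, max_tokens, weight=0.1):
    """Hypothetical length-based reward: correct answers earn a small bonus
    that grows as the response gets shorter; wrong answers earn nothing."""
    if not correct:
        return 0.0
    used = min(num_tokens, max_tokens) / max_tokens
    return 1.0 + weight * (1.0 - used)
```

With `fallback=0.0` a uniform group contributes no gradient signal, which is the situation the paper sets out to mitigate; a non-zero fallback is one way such groups could still inform training.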

[NLP-33] Don't Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts AAAI-2025

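[Quick Read]: This paper addresses the challenge of quantifying image realism, in particular identifying images that violate common sense. The key solution uses Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI): an LVLM extracts atomic facts from the image, yielding a mix of accurate facts and erroneous hallucinations. Pairwise entailment scores among the facts are then computed and aggregated into a single reality score, which exposes contradictions between genuine facts and hallucinated elements and thereby signals whether the image violates common sense. The method achieves new state-of-the-art zero-shot performance on the WHOOPS! dataset.

Link: https://arxiv.org/abs/2503.15948
Authors: Elisei Rykov, Kseniia Petrushina, Kseniia Titova, Alexander Panchenko, Vasily Konovalov
Affiliations: Skoltech; Moscow Institute of Physics and Technology; MTS AI; AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Proceedings of De-Factify 4: 4th Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI-2025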

Abstract: Quantifying the realism of images remains a challenging problem in the field of artificial intelligence. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein's death. We introduce a novel method for assessing image realism using Large Vision-Language Models (LVLMs) and Natural Language Inference (NLI). Our approach is based on the premise that LVLMs may generate hallucinations when confronted with images that defy common sense. Using an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts and erroneous hallucinations. We proceed by calculating pairwise entailment scores among these facts, subsequently aggregating these values to yield a single reality score. This process serves to identify contradictions between genuine facts and hallucinatory elements, signaling the presence of images that violate common sense. Our approach has achieved new state-of-the-art performance in zero-shot mode on the WHOOPS! dataset.
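
The scoring step described above can be sketched as follows. This is only an outline under stated assumptions: `nli_score` is a hypothetical callable standing in for an NLI model (returning, say, entailment minus contradiction probability), and the abstract does not specify the aggregation function, so `min` is used here as one possible choice.

```python
from itertools import permutations


def reality_score(facts, nli_score, aggregate=min):
    """Aggregate pairwise NLI scores over all ordered pairs of atomic facts.

    facts:     list of atomic fact strings extracted from one image
    nli_score: hypothetical callable (premise, hypothesis) -> float
    aggregate: assumed reduction over the pairwise scores
    """
    pair_scores = [nli_score(p, h) for p, h in permutations(facts, 2)]
    if not pair_scores:
        raise ValueError("need at least two facts to compare")
    return aggregate(pair_scores)
```

A strongly negative score under `min` means at least one pair of facts contradicts each other, which is the signal the method associates with an unrealistic image.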

[NLP-34] From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models

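[Quick Read]: This paper addresses the shortcomings of large language models (LLMs) in logical "slow-thinking" reasoning. Current inference scaling paradigms face two fundamental limitations: fragmented thought flows that compromise logical coherence, and computational complexity that escalates sharply as the dimensionality of the search space grows. To overcome them, the paper proposes Atomic Reasoner (AR), a cognitive inference strategy that achieves fine-grained reasoning through systematic atomic-level operations. The key idea is to decompose the reasoning process into atomic cognitive units and to use a cognitive routing mechanism that dynamically constructs reasoning representations and orchestrates inference pathways. This yields a stepwise, structured cognitive process that preserves logical coherence while substantially reducing cognitive load, effectively simulating the patterns of human deep thinking. Experiments show that AR delivers strong reasoning capabilities without relying on exhaustive search, particularly on linguistic logic puzzles, supporting its effectiveness in improving LLMs' robust long-sequence logical reasoning and deliberation.

Link: https://arxiv.org/abs/2503.15944
Authors: Jinyi Liu, Yan Zheng, Rong Cheng, Qiyu Wu, Wei Guo, Fei Ni, Hebin Liang, Yifu Yuan, Hangyu Mao, Fuzheng Zhang, Jianye Hao
Affiliations: Tianjin University; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: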

Abstract: Recent advances in large language models (LLMs) have shown remarkable progress, yet their capacity for logical "slow-thinking" reasoning persists as a critical research frontier. Current inference scaling paradigms suffer from two fundamental constraints: fragmented thought flows compromising logical coherence, and intensive computational complexity that escalates with search space dimensions. To overcome these limitations, we present Atomic Reasoner (AR), a cognitive inference strategy that enables fine-grained reasoning through systematic atomic-level operations. AR decomposes the reasoning process into atomic cognitive units, employing a cognitive routing mechanism to dynamically construct reasoning representations and orchestrate inference pathways. This systematic methodology implements stepwise, structured cognition, which ensures logical coherence while significantly reducing cognitive load, effectively simulating the cognitive patterns observed in human deep thinking processes. Extensive experimental results demonstrate AR's superior reasoning capabilities without the computational burden of exhaustive solution searches, particularly excelling in linguistic logic puzzles. These findings substantiate AR's effectiveness in enhancing LLMs' capacity for robust, long-sequence logical reasoning and deliberation.

[NLP-35] Towards Automatic Continual Learning: A Self-Adaptive Framework for Continual Instruction Tuning

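[Quick Read]: This paper addresses the key challenges of maintaining data quality and managing system constraints in continual instruction tuning for domain-specific settings, in particular how to dynamically select new knowledge rather than focusing only on retaining old knowledge. It also tackles incremental data and distribution shift, and aims to make model updates more efficient in real deployments. The key to the solution is an automated continual instruction tuning framework that uses perplexity-based filtering with a small proxy model to dynamically screen out redundant data, and that updates the proxy model so the filtering criteria stay aligned with the state of the deployed model. The approach reduces computational cost by 66.7%, improves model performance, and enables autonomous updates.

Link: https://arxiv.org/abs/2503.15924
Authors: Peiyi Lin, Fukai Zhang, Kai Niu, Hao Fu
Affiliations: National Supercomputer Center in Tianjin; School of Science, Tianjin University of Technology and Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: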

Abstract: Continual instruction tuning enables large language models (LLMs) to learn incrementally while retaining past knowledge, yet existing methods primarily focus on how to retain old knowledge rather than on selecting which new knowledge to learn. In domain-specific contexts, maintaining data quality and managing system constraints remain key challenges. To address these issues, we propose an automated continual instruction tuning framework that dynamically filters incoming data, identifying and reducing redundant data across successive updates. Our approach utilizes a small proxy model for efficient perplexity-based filtering, and updates the proxy to ensure that the filtering criteria remain aligned with the evolving state of the deployed model. Compared to existing static data selection methods, our framework can effectively handle incrementally acquired data and shifting distributions. Additionally, it addresses practical deployment challenges by enabling seamless model updates, supporting version rollback, and incorporating automatic checkpoint evaluation. We evaluated the system in real-world medical scenarios. It reduced computational costs by 66.7%, improved model performance, and achieved autonomous updates, thus demonstrating its effectiveness for automatic continual instruction tuning.
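
Perplexity-based filtering with a proxy model can be sketched in a few lines. Perplexity is the exponential of the mean negative token log-likelihood; the rest is an assumption for illustration: `logprob_fn` is a hypothetical stand-in for the proxy model, and treating low-perplexity samples as redundant with a fixed `min_ppl` threshold is one plausible reading of the abstract, not the authors' stated rule.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over a sample's tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_novel(samples, logprob_fn, min_ppl):
    """Keep samples the proxy finds surprising (perplexity >= min_ppl).

    logprob_fn: hypothetical callable, sample -> list of token log-probs
    min_ppl:    assumed threshold below which a sample counts as redundant
    """
    return [s for s in samples if perplexity(logprob_fn(s)) >= min_ppl]
```

Re-running `filter_novel` after each proxy update is what would keep the criterion aligned with the evolving model, since data already absorbed in earlier rounds would then score a lower perplexity.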

[NLP-36] From Structured Prompts to Open Narratives: Measuring Gender Bias in LLMs Through Open-Ended Storytelling

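[Quick Read]: This paper addresses the tendency of large language models (LLMs) to reflect or amplify social biases present in their training data, focusing on gender bias in occupational narratives. The key solution is a new evaluation framework that uses free-form storytelling, rather than structured scenarios or carefully crafted prompts, to reveal the biases embedded in the models. This approach exposes gender bias more naturally. Systematic analysis finds an overrepresentation of female characters across occupations in six widely used LLMs, and shows that the occupational gender rankings produced by the models align more closely with human stereotypes than with actual labor statistics. These findings underline the need for balanced mitigation strategies that ensure fairness while avoiding the reinforcement of new stereotypes.

Link: https://arxiv.org/abs/2503.15904
Authors: Evan Chen, Run-Jun Zhan, Yan-Bai Lin, Hung-Hsuan Chen
Affiliations: National Central University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: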

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases present in their training data. This study introduces a novel evaluation framework to uncover gender biases in LLMs, focusing on their occupational narratives. Unlike previous methods relying on structured scenarios or carefully crafted prompts, our approach leverages free-form storytelling to reveal biases embedded in the models. Systematic analyses show an overrepresentation of female characters across occupations in six widely used LLMs. Additionally, our findings reveal that LLM-generated occupational gender rankings align more closely with human stereotypes than actual labor statistics. These insights underscore the need for balanced mitigation strategies to ensure fairness while avoiding the reinforcement of new stereotypes.

[NLP-37] Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

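[Quick Read]: This paper addresses the loss of reliability in Retrieval-Augmented Generation (RAG) caused by conflicts between an LLM's parametric knowledge and the retrieved context, especially when the retrieved information is unreliable or the model's internal knowledge is outdated. The key contribution is CK-PLUG, a plug-and-play method for controlling how much an LLM relies on parametric versus contextual knowledge. Its core innovation is a new knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring the entropy shift in the token probability distribution after context insertion. CK-PLUG then makes fine-grained adjustments to the probability distributions of tokens with negative Confidence Gain, giving precise control over knowledge preference through a single tuning parameter while preserving generation fluency and knowledge accuracy. Experiments show that CK-PLUG significantly regulates knowledge reliance in counterfactual RAG scenarios and yields consistent performance improvements across a variety of general RAG tasks.

Link: https://arxiv.org/abs/2503.15888
Authors: Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng
Affiliations: AI Safety of Chinese Academy of Sciences, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; University of California, Merced; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: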

Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at this https URL.
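
The entropy-shift idea behind Confidence Gain can be sketched for a single decoding step. The sign convention (entropy without context minus entropy with context) and the linear interpolation controlled by `alpha` are assumptions made for this illustration; the abstract does not give the paper's exact definition or adjustment rule, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_gain(p_parametric, p_contextual):
    """Assumed definition: how much inserting the context reduces entropy.

    Negative values mean the context made the model less certain,
    which is read here as a sign of knowledge conflict.
    """
    return entropy(p_parametric) - entropy(p_contextual)


def blend_on_conflict(p_parametric, p_contextual, alpha):
    """Only for conflicting steps, mix the two distributions.

    alpha = 1.0 trusts parametric knowledge, alpha = 0.0 trusts the context.
    """
    if confidence_gain(p_parametric, p_contextual) >= 0.0:
        return list(p_contextual)
    mixed = [alpha * pp + (1.0 - alpha) * pc
             for pp, pc in zip(p_parametric, p_contextual)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

Because non-conflicting steps are left untouched, a single `alpha` only shifts behavior where the two knowledge sources disagree, which is the sense in which the control is fine-grained.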

[NLP-38] InCo-DPO: Balancing Distribution Shift and Data Quality for Enhanced Preference Optimization

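[Quick Read]: This paper addresses the limitations of Direct Preference Optimization (DPO) with respect to the quality of candidate preference samples and distribution shift. Conventional approaches rely mainly on on-policy data, whose distribution matches the policy model, and overlook the potential of off-policy data from diverse sources to improve data quality, even though such data can suffer from distribution shift. The key contribution is InCo-DPO, an efficient method that integrates on-policy and off-policy data and adjusts dynamically to balance distribution shift against data quality, reaching an optimal trade-off. It overcomes both the distribution shift of off-policy data and the quality ceiling of on-policy data, and achieves significant gains on the Alpaca-Eval 2.0 and Arena-Hard benchmarks.

Link: https://arxiv.org/abs/2503.15880
Authors: Yunan Wang, Jijie Li, Bo-Wen Zhang, Liangdong Wang, Guang Liu
Affiliations: Beijing Academy of Artificial Intelligence; Beihang University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: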

Abstract:Direct Preference Optimization (DPO) optimizes language models to align with human preferences. Utilizing on-policy samples, generated directly by the policy model, typically results in better performance due to its distribution consistency with the model compared to off-policy samples. This paper identifies the quality of candidate preference samples as another critical factor. While the quality of on-policy data is inherently constrained by the capabilities of the policy model, off-policy data, which can be derived from diverse sources, offers greater potential for quality despite experiencing distribution shifts. However, current research mostly relies on on-policy data and neglects the value of off-policy data in terms of data quality, due to the challenge posed by distribution shift. In this paper, we propose InCo-DPO, an efficient method for synthesizing preference data by integrating on-policy and off-policy data, allowing dynamic adjustments to balance distribution shifts and data quality, thus finding an optimal trade-off. Consequently, InCo-DPO overcomes the limitations of distribution shifts in off-policy data and the quality constraints of on-policy data. We evaluated InCo-DPO with the Alpaca-Eval 2.0 and Arena-Hard benchmarks. Experimental results demonstrate that our approach not only outperforms both on-policy and off-policy data but also achieves a state-of-the-art win rate of 60.8 on Arena-Hard with the vanilla DPO using Gemma-2 model.

[NLP-39] Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering NAACL2025

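[Quick Read]: This paper addresses the open-ended nature, diverse intents, and multi-aspect reasoning requirements of non-factoid question answering (NFQA), which conventional retrieval-augmented generation (RAG) approaches cannot handle effectively. Non-factoid questions (NFQs) lack definitive answers and require synthesizing information from multiple sources across different reasoning dimensions. To overcome these limitations, the paper proposes Typed-RAG, a type-aware multi-aspect decomposition framework. Its key idea is to classify NFQs into distinct types (such as debate, experience, and comparison) and to apply aspect-based decomposition to refine retrieval and generation strategies. By decomposing multi-aspect questions into single-aspect sub-queries and aggregating the results, Typed-RAG generates more informative and contextually relevant responses. Experiments show that Typed-RAG outperforms baselines on the Wiki-NFQA dataset, highlighting the importance of type-aware decomposition for effective retrieval and generation in NFQA.

Link: https://arxiv.org/abs/2503.15879
Authors: DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng
Affiliations: Pohang University of Science and Technology (POSTECH); Sookmyung Women's University; Ewha Womans University; KT; Independent Researcher; LLM Experimental Lab, MODULABS
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to NAACL 2025 SRW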

Abstract: Non-factoid question-answering (NFQA) poses a significant challenge due to its open-ended nature, diverse intents, and the need for multi-aspect reasoning, which renders conventional factoid QA approaches, including retrieval-augmented generation (RAG), inadequate. Unlike factoid questions, non-factoid questions (NFQs) lack definitive answers and require synthesizing information from multiple sources across various reasoning dimensions. To address these limitations, we introduce Typed-RAG, a type-aware multi-aspect decomposition framework within the RAG paradigm for NFQA. Typed-RAG classifies NFQs into distinct types (such as debate, experience, and comparison) and applies aspect-based decomposition to refine retrieval and generation strategies. By decomposing multi-aspect NFQs into single-aspect sub-queries and aggregating the results, Typed-RAG generates more informative and contextually relevant responses. To evaluate Typed-RAG, we introduce Wiki-NFQA, a benchmark dataset covering diverse NFQ types. Experimental results demonstrate that Typed-RAG outperforms baselines, thereby highlighting the importance of type-aware decomposition for effective retrieval and generation in NFQA. Our code and dataset are available at this https URL.

[NLP-40] Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey

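[Quick Read]: This paper addresses the limited reliability of large language models (LLMs) in high-stakes domains, in particular their tendency to produce plausible but incorrect responses. To improve trustworthiness, it focuses on uncertainty quantification (UQ) and proposes a new taxonomy that categorizes UQ techniques by computational efficiency and by uncertainty dimension (input, reasoning, parameter, and prediction uncertainty). The key point is the need for UQ approaches that are scalable, interpretable, and robust, so that they can cope with the computational constraints and decoding inconsistencies that hamper traditional methods and can handle the uncertainty sources specific to LLMs.

Link: https://arxiv.org/abs/2503.15850
Authors: Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, Hua Wei
Affiliations: Arizona State University; University of Chicago; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: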

Abstract:Large Language Models (LLMs) excel in text generation, reasoning, and decision-making, enabling their adoption in high-stakes domains such as healthcare, law, and transportation. However, their reliability is a major concern, as they often produce plausible but incorrect responses. Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction. However, traditional UQ methods struggle with LLMs due to computational constraints and decoding inconsistencies. Moreover, LLMs introduce unique uncertainty sources, such as input ambiguity, reasoning path divergence, and decoding stochasticity, that extend beyond classical aleatoric and epistemic uncertainty. To address this, we introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions (input, reasoning, parameter, and prediction uncertainty). We evaluate existing techniques, assess their real-world applicability, and identify open challenges, emphasizing the need for scalable, interpretable, and robust UQ approaches to enhance LLM reliability.

[NLP-41] Entropy-based Exploration Conduction for Multi-step Reasoning

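[Quick Read]: This paper addresses the high cost and inflexibility of existing methods for automatically deciding the reasoning depth of large language models (LLMs), and the resulting loss of reasoning accuracy. The key contribution is a new method called Entropy-based Exploration Depth Conduction (Entro-duction), which dynamically adjusts the exploration depth during multi-step reasoning by monitoring the LLM's output entropy and variance entropy. Entropy captures the model's current uncertainty, while variance entropy measures how that uncertainty fluctuates across consecutive reasoning steps. Based on changes in these observations, the LLM probabilistically chooses to deepen, expand, or stop exploration, balancing reasoning accuracy against exploration efficiency. Experiments demonstrate the effectiveness of Entro-duction on four benchmark datasets, and further analysis examines how each component contributes to reasoning performance.

Link: https://arxiv.org/abs/2503.15848
Authors: Jinghan Zhang, Xiting Wang, Fengran Mo, Yeyang Zhou, Wanfu Gao, Kunpeng Liu
Affiliations: Portland State University; Renmin University of China; University of Montreal; Uber; Jilin University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: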

Abstract:In large language model (LLM) reasoning, multi-step processes have proven effective for solving complex tasks. However, the depth of exploration can significantly affect the reasoning performance. Existing methods to automatically decide the depth often bring high costs and lack flexibility, and thus undermine the model’s reasoning accuracy. To address these issues, we propose Entropy-based Exploration Depth Conduction (Entro-duction), a novel method that dynamically adjusts the exploration depth during multi-step reasoning by monitoring LLM’s output entropy and variance entropy. We employ these two metrics to capture the model’s current uncertainty and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed changes, the LLM selects whether to deepen, expand or stop exploration according to the probability. In this way, we balance the reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro-duction. We further conduct experiments and analysis on the components of Entro-duction to discuss their contributions to reasoning performance.
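
The two monitored signals can be illustrated with a small sketch. It computes the entropy of each step's output distribution and the variance of those entropies across steps, then applies a simplified, deterministic threshold rule. This is not the paper's procedure: the abstract says the choice is made probabilistically, and the thresholds `low` and `stable`, the rule itself, and all names are assumptions for illustration.

```python
import math
import statistics


def entropy(probs):
    """Shannon entropy (nats) of one step's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(step_entropies, low=0.5, stable=0.05):
    """Pick 'stop', 'deepen', or 'expand' from the entropy history.

    step_entropies: entropies of consecutive reasoning steps (latest last)
    low, stable:    hypothetical thresholds on entropy and its variance
    """
    current = step_entropies[-1]
    spread = (statistics.pvariance(step_entropies)
              if len(step_entropies) > 1 else 0.0)
    if current < low and spread < stable:
        return "stop"      # confident and no longer fluctuating
    if len(step_entropies) > 1 and current < step_entropies[-2]:
        return "deepen"    # uncertainty is falling, keep following this path
    return "expand"        # uncertainty is flat or rising, try alternatives
```

A probabilistic variant would map the same two quantities to a distribution over the three actions and sample from it instead of thresholding.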

[NLP-42] Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

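[Quick Read]: This paper addresses the unique challenges that ancient Chinese poses for large language models (LLMs), in particular the lack of evaluation of their generative capabilities. Existing benchmarks mainly test comprehension through multiple-choice questions, leaving a significant gap in evaluating generation tasks. The paper therefore introduces Fùxì, a new benchmark covering 21 diverse tasks that evaluates both understanding and generation.

Fùxì's key contributions are: (1) balanced coverage of comprehension and generation tasks, including novel tasks such as poetry composition and couplet completion; (2) evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators; and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. A comprehensive evaluation of state-of-the-art LLMs reveals a significant performance gap between understanding and generation, with current models performing poorly on generation tasks that require deep cultural knowledge and adherence to classical formats. These findings expose current limitations in ancient Chinese text processing and offer direction for future model development. The Fùxì benchmark and its toolkit are publicly released to support further research in this area.

Link: https://arxiv.org/abs/2503.15837
Authors: Shangqing Zhao, Yuhao Zhou, Yupei Ren, Zhe Chen, Chenghao Jia, Fang Zhe, Zhaogaung Long, Shu Liu, Man Lan
Affiliations: School of Computer Science and Technology, East China Normal University; Lab of Artificial Intelligence for Education, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: work in progress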

Abstract:Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models’ generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

[NLP-43] ChatGPT and U(X): A Rapid Review on Measuring the User Experience

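[Quick Read]: This paper asks how the user experience (UX) offered by ChatGPT can be evaluated systematically, filling a gap in current research. Through a review of the literature (N = 58), it analyzes the independent variables (IVs), dependent variables (DVs), and measurement methods used in existing quantitative studies, revealing trends, gaps, and emerging consensus. The key contribution is a set of preliminary frameworks intended to guide future research and tool development, advance the standardization and breadth of ChatGPT UX measurement, and support better user interaction with ChatGPT and similar large language model (LLM) systems.

Link: https://arxiv.org/abs/2503.15808
Authors: Katie Seaborn
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: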

Abstract:ChatGPT, powered by a large language model (LLM), has revolutionized everyday human-computer interaction (HCI) since its 2022 release. While now used by millions around the world, a coherent pathway for evaluating the user experience (UX) ChatGPT offers remains missing. In this rapid review (N = 58), I explored how ChatGPT UX has been approached quantitatively so far. I focused on the independent variables (IVs) manipulated, the dependent variables (DVs) measured, and the methods used for measurement. Findings reveal trends, gaps, and emerging consensus in UX assessments. This work offers a first step towards synthesizing existing approaches to measuring ChatGPT UX, urgent trajectories to advance standardization and breadth, and two preliminary frameworks aimed at guiding future research and tool development. I seek to elevate the field of ChatGPT UX by empowering researchers and practitioners in optimizing user interactions with ChatGPT and similar LLM-based systems.

[NLP-44] Mixture of Lookup Experts

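[Quick Read]: This paper addresses the limited inference efficiency of Mixture-of-Experts (MoE) models at large parameter counts. Although MoE activates only a subset of experts at inference time to keep FLOPs and latency low, its dynamic expert selection means that all experts must still be loaded into VRAM, which results in a large memory footprint and restricts deployment. Expert offloading reduces VRAM usage but significantly increases communication overhead and inference latency.

To solve this, the paper proposes a new MoE architecture, Mixture of Lookup Experts (MoLE). Its key innovation is to re-parameterize the Feed-Forward Network (FFN) experts used during training as lookup tables (LUTs) that are stored on external storage devices and, at inference time, return expert outputs directly from the input ids. This avoids performing expert computation during inference, greatly reducing communication overhead and VRAM requirements. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.

Link: https://arxiv.org/abs/2503.15798
Authors: Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: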

Abstract: Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert's computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.
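
The re-parameterization works because each expert's input is the embedding of a token id, so its output depends only on that id and can be precomputed once per vocabulary entry. The sketch below shows this for a single toy expert; the two-layer ReLU FFN, the weight layout, and the function names are assumptions for illustration and do not reflect the authors' code.

```python
def ffn_expert(x, w_in, w_out):
    """Toy FFN expert: linear layer, ReLU, linear layer.

    w_in:  one weight vector per hidden unit (over the input dimensions)
    w_out: one weight vector per output dimension (over the hidden units)
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


def build_lookup_table(embeddings, w_in, w_out):
    """Precompute the expert output for every token id in the vocabulary.

    embeddings[token_id] is that token's embedding vector; the returned
    table can live on slow storage because only single rows are fetched.
    """
    return [ffn_expert(e, w_in, w_out) for e in embeddings]


def lookup_expert(table, token_id):
    """Inference-time replacement for the expert: a plain index lookup."""
    return table[token_id]
```

At inference time `lookup_expert(table, token_id)` returns exactly what `ffn_expert` would compute for that token, so only one precomputed row per token has to be moved into VRAM rather than the expert's weights.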

[NLP-45] Grammar and Gameplay-aligned RL for Game Description Generation with LLMs

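[Quick Read]: This paper addresses Game Description Generation (GDG): producing, from natural language text, a description that conforms to a Game Description Language (GDL) and accurately reproduces the game's features. Existing methods rely mainly on the contextual understanding of large language models (LLMs) but struggle to guarantee both grammatical correctness and fidelity to game concepts. The key solution is reinforcement learning-based fine-tuning for GDG (RLGDG), which introduces grammar rewards and concept rewards and applies reinforcement learning (RL) after supervised fine-tuning (SFT), optimizing grammar and game concepts together in a two-stage training strategy. Experiments show that the method significantly outperforms baselines that use SFT alone.

Link: https://arxiv.org/abs/2503.15783
Authors: Tsunehiko Tanaka, Edgar Simo-Serra
Affiliations: Waseda University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: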

Abstract:Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone.

[NLP-46] Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

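[Quick Read]: This paper studies how multi-document summarization (MDS) models perform under zero-shot domain transfer and why they may fail. It evaluates models across training approaches (direct, chunk-then-summarize, extract-then-summarize, and inference with GPT-style models), domains (news, science, and conversation), and summary dimensions (reference similarity, quality, and factuality) to analyze why a model trained on one domain may fail to summarize documents from another. Domain-transfer "failure" is defined as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. The paper also examines potential issues with applying popular summarization metrics to MDS out of the box. The key contribution is to expose the limitations of MDS models under different training strategies in cross-domain use and to explain the reasons behind them.

Link: https://arxiv.org/abs/2503.15768
Authors: Alexandra DeLucia, Mark Dredze
Affiliations: Center for Language and Speech Processing, Johns Hopkins University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: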

Abstract:Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four approaches: end-to-end with special pre-training (“direct”), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer “failure” as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

[NLP-47] KoGNER: A Novel Framework for Knowledge Graph Distillation on Biomedical Named Entity Recognition

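[Quick Read]: This paper addresses the weak domain-specific generalization and data sparsity problems of traditional deep learning-based Named Entity Recognition (NER) models. The key solution is Knowledge Graph distilled for Named Entity Recognition (KoGNER), a new approach that integrates Knowledge Graph (KG) distillation into NER models to improve entity recognition. KoGNER uses the structured knowledge representations of KGs to enrich contextual embeddings, improving entity classification and reducing ambiguity in entity detection. It has two steps: (1) knowledge distillation, in which external knowledge sources are distilled into a lightweight representation that integrates seamlessly with NER models; and (2) entity-aware augmentation, in which contextual embeddings enriched with KG information are fed directly into a Graph Neural Network (GNN) to improve the model's ability to understand and represent entity relationships. Experiments show that KoGNER achieves state-of-the-art performance on benchmark datasets, significantly outperforming fine-tuned NER models and LLMs, which suggests that knowledge graphs as auxiliary information can substantially improve NER accuracy and makes KoGNER a promising direction for knowledge-aware NLP.

Link: https://arxiv.org/abs/2503.15737
Authors: Heming Zhang, Wenyu Li, Di Huang, Yinjie Tang, Yixin Chen, Philip Payne, Fuhai Li
Affiliations: Institute for Informatics, Data Science and Biostatistics, Washington University in St. Louis; Department of Computer Science and Engineering, Washington University in St. Louis; Energy, Environmental and Chemical Engineering, Washington University in St. Louis
Categories: Computation and Language (cs.CL)
Comments: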

Abstract:Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that plays a crucial role in information extraction, question answering, and knowledge-based systems. Traditional deep learning-based NER models often struggle with domain-specific generalization and suffer from data sparsity issues. In this work, we introduce Knowledge Graph distilled for Named Entity Recognition (KoGNER), a novel approach that integrates Knowledge Graph (KG) distillation into NER models to enhance entity recognition performance. Our framework leverages structured knowledge representations from KGs to enrich contextual embeddings, thereby improving entity classification and reducing ambiguity in entity detection. KoGNER employs a two-step process: (1) Knowledge Distillation, where external knowledge sources are distilled into a lightweight representation for seamless integration with NER models, and (2) Entity-Aware Augmentation, which integrates contextual embeddings that have been enriched with knowledge graph information directly into GNN, thereby improving the model’s ability to understand and represent entity relationships. Experimental results on benchmark datasets demonstrate that KoGNER achieves state-of-the-art performance, outperforming finetuned NER models and LLMs by a significant margin. These findings suggest that leveraging knowledge graphs as auxiliary information can significantly improve NER accuracy, making KoGNER a promising direction for future research in knowledge-aware NLP.

[NLP-48] Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient's Point of View

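[Quick read]: This paper asks how a patient's own description of their medical profile can be used to decide whether they meet the eligibility criteria of a given clinical trial, with the aim of making trial recruitment more efficient. Participation is usually initiated by a healthcare professional and proposed to the patient; the paper instead explores recruitment initiated by the patient. Patient language can differ from formal medical language, which may make patient-trial matching harder.

The key contribution is a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient-language profiles must be matched to clinical trials. The authors adapt the TREC 2022 Clinical Trial Track dataset by manually rephrasing its patient medical profiles in patient language, and pair them with the associated clinical trial reports, for which each patient is either eligible or excluded. Prompted open-source LLMs reach F1 scores of 56.5 to 71.8 with patient language, against 64.7 to 73.1 for the same task with medical language. The best model shows only a small loss in performance with patient language, which suggests that starting from the patient could help recruit patients for clinical trials.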

Link: https://arxiv.org/abs/2503.15718
Authors: Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi
Affiliations: Université Paris-Saclay; CNRS; Laboratoire Interdisciplinaire des Sciences du Numérique
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients’ medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve from 56.5 to 71.8 of F1 score using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our Github and HuggingFace repositories.

[NLP-49] Enhancing Pancreatic Cancer Staging with Large Language Models: The Role of Retrieval-Augmented Generation

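[Quick read]: This paper examines whether retrieval-augmented generation (RAG) improves the reliability of large language models (LLMs) in a medical task. Its key design is a controlled comparison between NotebookLM, an LLM with RAG, and its internal model, Gemini 2.0 Flash, so that the effect of RAG can be separated from differences between models. Pancreatic cancer staging serves as the test task, a summary of Japan's pancreatic cancer staging guidelines serves as reliable external knowledge (REK), and RAG is assessed through retrieval accuracy and overall staging accuracy. NotebookLM reached 70% staging accuracy, against 38% for Gemini 2.0 Flash with REK and 35% without it, which suggests that RAG may improve staging accuracy. Because NotebookLM presents the retrieved REK excerpts, it also offers transparency for physicians.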

Link: https://arxiv.org/abs/2503.15664
Authors: Hisashi Johno, Yuki Johno, Akitomo Amakawa, Junichi Sato, Ryota Tozuka, Atsushi Komaba, Hiroaki Watanabe, Hiroki Watanabe, Chihiro Goto, Hiroyuki Morisaka, Hiroshi Onishi, Kazunori Nakamoto
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures, 2 tables, 6 supplementary files

Abstract:Purpose: Retrieval-augmented generation (RAG) is a technology to enhance the functionality and reliability of large language models (LLMs) by retrieving relevant information from reliable external knowledge (REK). RAG has gained interest in radiology, and we previously reported the utility of NotebookLM, an LLM with RAG (RAG-LLM), for lung cancer staging. However, since the comparator LLM differed from NotebookLM’s internal model, it remained unclear whether its advantage stemmed from RAG or inherent model differences. To better isolate RAG’s impact and assess its utility across different cancers, we compared NotebookLM with its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment. Materials and Methods: A summary of Japan’s pancreatic cancer staging guidelines was used as REK. We compared three groups - REK+/RAG+ (NotebookLM with REK), REK+/RAG- (Gemini 2.0 Flash with REK), and REK-/RAG- (Gemini 2.0 Flash without REK) - in staging 100 fictional pancreatic cancer cases based on CT findings. Staging criteria included TNM classification, local invasion factors, and resectability classification. In REK+/RAG+, retrieval accuracy was quantified based on the sufficiency of retrieved REK excerpts. Results: REK+/RAG+ achieved a staging accuracy of 70%, outperforming REK+/RAG- (38%) and REK-/RAG- (35%). For TNM classification, REK+/RAG+ attained 80% accuracy, exceeding REK+/RAG- (55%) and REK-/RAG- (50%). Additionally, REK+/RAG+ explicitly presented retrieved REK excerpts, achieving a retrieval accuracy of 92%. Conclusion: NotebookLM, a RAG-LLM, outperformed its internal LLM, Gemini 2.0 Flash, in a pancreatic cancer staging experiment, suggesting that RAG may improve LLM’s staging accuracy. Furthermore, its ability to retrieve and present REK excerpts provides transparency for physicians, highlighting its applicability for clinical diagnosis and classification. 
Cite as: arXiv:2503.15664 [cs.CL], https://doi.org/10.48550/arXiv.2503.15664. Submitted (v1): Wed, 19 Mar 2025 19:29:47 UTC.

[NLP-50] UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

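[Quick read]: This paper addresses the lack of data and benchmarks for offline, fine-grained evaluation of autonomous agents in desktop environments. Existing research focuses on online settings, while desktop environments, which are central to many professional and everyday tasks, remain underexplored because of data collection challenges and licensing issues. The paper introduces UI-Vision, a comprehensive, license-permissive benchmark for evaluating computer use agents in real-world desktop environments. It provides dense, high-quality annotations of human demonstrations (bounding boxes, UI labels, and action trajectories) and three well-defined tasks (Element Grounding, Layout Grounding, and Action Prediction) for rigorous evaluation. By releasing UI-Vision as open source, the authors aim to advance more capable agents for real-world desktop tasks.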

Link: https://arxiv.org/abs/2503.15661
Authors: Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks-Element Grounding, Layout Grounding, and Action Prediction-with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

[NLP-51] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

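[Quick read]: This paper addresses open questions in the design and evaluation of Multimodal Large Language Models (MLLMs): the trade-offs between model size, architecture, and performance, and the difficulty of comparing prior work directly because training data and evaluation protocols are inconsistent. It introduces LLaVA-MORE, a new family of MLLMs trained under a unified protocol to ensure fair comparison, and systematically studies small- and medium-scale language models (including Phi-4, LLaMA-3.1, and Gemma-2) combined with a range of visual encoders (CLIP-based architectures, DINOv2, SigLIP, and SigLIP2). It also examines the effects of higher image resolution and different pre-training datasets, giving insights into more effective MLLM design and a reproducible evaluation framework to guide future model development.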

Link: https://arxiv.org/abs/2503.15621
Authors: Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Affiliations: University of Modena and Reggio Emilia (Italy); University of Pisa (Italy); IIT-CNR (Italy)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs – including Phi-4, LLaMA-3.1, and Gemma-2 – to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: this https URL.

[NLP-52] Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings

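[Quick read]: This paper addresses a gap in the LLM-as-judge paradigm used during AI system development and post-deployment monitoring: judges are rarely tested on evaluations that involve context. Judge models (LLMs finetuned to assess and critique model outputs) are promoted as general-purpose evaluators, yet they are typically evaluated only on non-contextual scenarios such as instruction following, leaving out settings where external information is used as context to generate an output. That omission matters given the growing use of retrieval-augmented generation (RAG) and summarization. Contextual assessment is hard because it depends on practitioner priorities, which leads to conditional criteria (for example, compare responses on factuality first, then on completeness if they are equally factual). The paper proposes ContextualJudgeBench, a judge benchmark of 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios, built with a multi-pronged pipeline that combines existing human annotations with model-based perturbations. A study of 11 judge models and 9 general-purpose models shows that contextual information and its assessment criteria challenge even state-of-the-art models: OpenAI's o1, the best performer, barely reaches 55% consistent accuracy.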

Link: https://arxiv.org/abs/2503.15620
Authors: Austin Xu, Srijan Bansal, Yifei Ming, Semih Yavuz, Shafiq Joty
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 23 pages, 13 figures, 6 tables

Abstract:The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models – LLMs finetuned to specialize in assessing and critiquing model outputs – have been touted as general purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings – those where external information is used as context to generate an output – is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness if they are equally factual). To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 9 general purpose models, reveals that the contextual information and its assessment criteria present a significant challenge to even state-of-the-art models. For example, OpenAI’s o1, the best-performing model, barely reaches 55% consistent accuracy.
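The conditional criterion mentioned in the abstract (factuality first, completeness only as a tie-breaker) can be sketched as a lexicographic comparison. This is a minimal illustration, not the benchmark's code: the `Scores` class, the `prefer` function, the numeric score scale, and the tolerance `tol` are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-response scores; the [0, 1] scale is an assumption."""
    factuality: float
    completeness: float


def prefer(a: Scores, b: Scores, tol: float = 1e-9) -> str:
    """Return "A", "B", or "tie" under a factuality-first rule.

    Completeness is consulted only when the two responses are
    equally factual (within tol).
    """
    if abs(a.factuality - b.factuality) > tol:
        return "A" if a.factuality > b.factuality else "B"
    if abs(a.completeness - b.completeness) > tol:
        return "A" if a.completeness > b.completeness else "B"
    return "tie"


# A is more factual, so it wins even though B is more complete.
print(prefer(Scores(0.9, 0.2), Scores(0.8, 1.0)))
# Equal factuality, so completeness decides in favour of B.
print(prefer(Scores(0.9, 0.2), Scores(0.9, 0.5)))
```

The point of the ordering is that a more complete but less factual response never wins, which is the kind of priority a flat, single-score judge can miss.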

[NLP-53] Personalized Attacks of Social Engineering in Multi-turn Conversations – LLM Agents for Simulation and Detection

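[Quick read]: This paper addresses the detection of social engineering (SE) attacks enabled by LLM-powered chatbots on social media platforms, in particular in multi-turn conversations. Because these conversations are dynamic, multi-turn SE detection is considerably more complex than single-instance detection. The paper's premise is that mitigating the threat requires understanding how attackers exploit vulnerabilities and how victims' personality traits affect their susceptibility. The authors propose SE-VSim, an LLM-agentic framework that simulates SE attack mechanisms by generating multi-turn conversations, with victim agents of varying personality traits. Using a dataset of over 1000 simulated conversations, they examine scenarios in which attackers posing as recruiters, funding agencies, and journalists try to extract sensitive information. Based on this analysis they present a proof of concept, SE-OmniGuard, which offers personalized protection by using prior knowledge of the victim's personality, evaluating attack strategies, and monitoring information exchanges in the conversation to identify potential SE attempts.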

Link: https://arxiv.org/abs/2503.15552
Authors: Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu
Affiliations: Arizona State University; Kitware, Inc; Columbia University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims’ personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victims personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

[NLP-54] From Divergence to Consensus: Evaluating the Role of Large Language Models in Facilitating Agreement through Adaptive Strategies

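[Quick read]: This paper addresses the difficulty of reaching consensus in group decision-making, in particular reconciling diverse perspectives and mitigating biases that hinder agreement. Human facilitators are limited in scalability and efficiency, especially in large-scale, fast-paced discussions. The paper proposes a framework that embeds large language models (LLMs) as automated facilitators in a custom-built multi-user chat system. Using cosine similarity as the core metric, it evaluates how well three LLMs (ChatGPT 4.0, Mistral Large 2, and AI21 Jamba Instruct) synthesize consensus proposals that align with participants' viewpoints. Adaptive facilitation strategies (clarifying misunderstandings, summarizing discussions, proposing compromises) let the models refine proposals iteratively from user feedback. In the experiments ChatGPT 4.0 aligned more closely with participant opinions and needed fewer iterations to reach consensus than the other two models. The analysis also shows performance differences across sustainability-focused topics such as climate action, quality education, good health and well-being, and access to clean water and sanitation. The authors point to better evaluation metrics and cross-cultural adaptability as directions for future work.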

Link: https://arxiv.org/abs/2503.15521
Authors: Loukas Triantafyllopoulos, Dimitris Kalles
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 32 pages, 5 figures, 4 tables

Abstract:Achieving consensus in group decision-making often involves overcoming significant challenges, particularly in reconciling diverse perspectives and mitigating biases that hinder agreement. Traditional methods relying on human facilitators are often constrained by scalability and efficiency, especially in large-scale, fast-paced discussions. To address these challenges, this study proposes a novel framework employing large language models (LLMs) as automated facilitators within a custom-built multi-user chat system. Leveraging cosine similarity as a core metric, this approach evaluates the ability of three state-of-the-art LLMs- ChatGPT 4.0, Mistral Large 2, and AI21 Jamba Instruct- to synthesize consensus proposals that align with participants’ viewpoints. Unlike conventional techniques, the system integrates adaptive facilitation strategies, including clarifying misunderstandings, summarizing discussions, and proposing compromises, enabling the LLMs to iteratively refine consensus proposals based on user feedback. Experimental results demonstrate the superiority of ChatGPT 4.0, which achieves higher alignment with participant opinions, requiring fewer iterations to reach consensus compared to its counterparts. Moreover, analysis reveals the nuanced performance of the models across various sustainability-focused discussion topics, such as climate action, quality education, good health and well-being, and access to clean water and sanitation. These findings highlight the transformative potential of LLM-driven facilitation for improving collective decision-making processes and underscore the importance of advancing evaluation metrics and cross-cultural adaptability in future research.
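The abstract names cosine similarity as the core alignment metric but does not say how texts are turned into vectors. The sketch below therefore assumes that a consensus proposal and each participant opinion have already been embedded as numeric vectors by some unspecified model; `cosine_similarity` and `mean_alignment` are hypothetical helper names, and averaging over participants is an assumption rather than the paper's stated procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_alignment(proposal, opinions):
    """Average similarity between one proposal and every participant opinion."""
    return sum(cosine_similarity(proposal, o) for o in opinions) / len(opinions)


# Toy 2-D embeddings: one participant agrees fully, one is orthogonal.
print(mean_alignment([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```

Under this reading, a facilitator loop would regenerate the proposal after each round of feedback and stop once the alignment score passes a chosen threshold.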

[NLP-55] Superhuman AI Disclosure: Impacts on Toxicity Fairness and Trust Vary by Expertise and Persona Attributes

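[Quick read]: This paper asks how disclosing superhuman AI capabilities affects fairness, accountability, and trust. It introduces a grounded and validated set of synthetic personas that reflect diverse fairness concerns and technology acceptance levels, and uses them to compare non-disclosure with explicit superhuman labelling in two settings: a competitive StarCraft II player and a cooperative personal assistant. The effects are domain-specific. In StarCraft II, an explicit superhuman label can reduce suspicion of cheating; in the personal-assistant setting, disclosure raises perceived trustworthiness but carries a risk of overreliance. Transparency is therefore not a cure-all, and disclosure needs to be designed for the specific application.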

Link: https://arxiv.org/abs/2503.15514
Authors: Jaymari Chua, Chen Wang, Lina Yao
Affiliations: CSIRO’s Data61 (Sydney, NSW, Australia); UNSW (Sydney, NSW, Australia)
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments:

Abstract:As artificial intelligence demonstrates surpassing human performance across real-world tasks, disclosing superhuman capabilities poses challenges for fairness, accountability, and trust. To investigate how transparency impacts attitudes and perceptions, we introduce a grounded and validated set of synthetic personas reflecting diverse fairness concerns and technology acceptance levels. Then we evaluate responses in two contrasting domains: (1) a competitive player in StarCraft II, where strategy and high-skill gameplay often elicit toxic interactions, and (2) a cooperative personal-assistant in providing information. Across numerous interactions spanning persona profiles, we test non-disclosure versus explicit superhuman labelling under controlled game outcomes and usage contexts. Our findings reveal sharp domain-specific effects: in StarCraft II, explicitly labelling AI as superhuman, novice personas who learned of it reported lower toxicity and higher fairness-attributing defeat to advanced skill rather than hidden cheating-whereas expert personas found the disclosure statements irksome but still less deceptive than non-disclosure. Conversely, in the LLM as personal-assistant setting, disclosure of superhuman capabilities improved perceived trustworthiness, though it risked AI overreliance among certain persona segments. We release Dataset X-containing persona cards-including profile attributes, disclosure prompts, and detailed interaction logs, accompanied by reproducible protocols and disclaimers for adapting them to diverse tasks. Our results demonstrate that transparency is not a cure-all: while it reduces suspicion and enhances trust in cooperative contexts, it may inflame resistance or disappointment in competitive domains.

[NLP-56] Representing data in words

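[Quick read]: This paper asks how data insights can be presented as easy-to-digest natural-language descriptions, which the authors call wordalisations, rather than through visualisations. Wordalisations are generated with large language models (LLMs) through prompt templates that follow a task-agnostic structure and can produce prompts automatically from data. The templates turn numerical values into plain language. Using the model cards framework, the authors stress stating clearly the model imposed on the data, how numerical values are translated into words, how background information is incorporated into prompts, and what the limitations of the wordalisations are. They argue that this model cards approach is a better basis for best practices than performance tests on benchmark datasets.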

Link: https://arxiv.org/abs/2503.15509
Authors: Amandine M. Caut, Amy Rouillard, Beimnet Zenebe, Matthias Green, Ágúst Pálmason Morthens, David J. T. Sumpter
Affiliations: Uppsala University; Stellenbosch University; Addis Ababa University; Twelve Football
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:An important part of data science is the use of visualisations to display data in a way that is easy to digest. Visualisations often rely on underlying statistical or machine learning models – ranging from basic calculations like category means to advanced methods such as principal component analysis of multidimensional datasets – to convey insights. We introduce an analogous concept for word descriptions of data, which we call wordalisations. Wordalisations describe data in easy to digest words, without necessarily reporting numerical values from the data. We show how to create wordalisations using large language models, through prompt templates engineered according to a task-agnostic structure which can be used to automatically generate prompts from data. We show how to produce reliable and engaging texts on three application areas: scouting football players, personality tests, and international survey data. Using the model cards framework, we emphasise the importance of clearly stating the model we are imposing on the data when creating the wordalisation, detailing how numerical values are translated into words, incorporating background information into prompts for the large language model, and documenting the limitations of the wordalisations. We argue that our model cards approach is a more appropriate framework for setting best practices in wordalisation of data than performance tests on benchmark datasets.

[NLP-57] Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models

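[Quick read]: This paper addresses the complexity of how interactions begin in human-robot interaction (HRI): whether a robot should communicate with a human depends on several situational factors, such as the human's current activity and the urgency of the interaction. It tests whether large language models (LLMs) and vision language models (VLMs) can handle this problem, comparing four system-design patterns on a test set of 84 human-robot situations. Results with GPT-4o and Phi-3 Vision indicate that LLMs and VLMs can handle interaction beginnings when the desired action is clear, but open-ended situations, where the model must balance the human's and the robot's situation, remain a challenge.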

Link: https://arxiv.org/abs/2503.15491
Authors: Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Katsushi Ikeuchi
Affiliations: Applied Robotics Research, Microsoft, Redmond, WA, USA
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human is dependent on several situational factors (e.g., the current human’s activity, urgency of the interaction, etc.). We test whether large language models (LLM) and vision language models (VLM) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and test on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision model indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear, however, challenge remains in the open-ended situations where the model must balance between the human and robot situation.

Computer Vision

[CV-0] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

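[Quick read]: This paper addresses a dilemma between discrete and continuous tokens in autoregressive visual generation. Discrete tokens can be modeled directly with a standard cross-entropy loss but suffer from information loss and tokenizer training instability; continuous tokens preserve visual detail better but require complex distribution modeling, which complicates the generation pipeline. The proposed method, TokenBridge, decouples discretization from tokenizer training through post-training quantization, obtaining discrete tokens directly from continuous representations. This keeps the representation capacity of continuous tokens and the modeling simplicity of discrete ones. A dimension-wise quantization strategy discretizes each feature dimension independently, and a lightweight autoregressive prediction mechanism handles the resulting large token space efficiently. In experiments the approach matches continuous methods in reconstruction and generation quality while using standard categorical prediction.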

Link: https://arxiv.org/abs/2503.16430
Authors: Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Affiliations: University of Hong Kong; ByteDance Seed; École Polytechnique; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that directly obtains discrete tokens from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently model the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: this https URL.
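Dimension-wise post-training quantization, as described in the abstract, can be illustrated with a uniform quantizer applied to each feature dimension on its own. This is a sketch under stated assumptions, not the paper's method: the uniform bins, the clipping range `[lo, hi]`, the bin count, and the function names `quantize` and `dequantize` are all chosen here for illustration.

```python
def quantize(vec, num_bins=8, lo=-1.0, hi=1.0):
    """Map each feature dimension independently to a bin index in [0, num_bins)."""
    step = (hi - lo) / num_bins
    indices = []
    for x in vec:
        x = min(max(x, lo), hi)  # clip to the assumed range
        indices.append(min(int((x - lo) / step), num_bins - 1))
    return indices


def dequantize(indices, num_bins=8, lo=-1.0, hi=1.0):
    """Reconstruct each dimension as the centre of its bin."""
    step = (hi - lo) / num_bins
    return [lo + (i + 0.5) * step for i in indices]


features = [-1.0, 0.0, 1.0]
tokens = quantize(features, num_bins=4)
print(tokens)                          # [0, 2, 3]
print(dequantize(tokens, num_bins=4))  # [-0.75, 0.25, 0.75]
```

With `num_bins` bins per dimension and `d` dimensions, the joint token space has `num_bins ** d` entries, which is too large for a single softmax once `d` grows. That is consistent with the abstract's remark that a lightweight autoregressive mechanism is needed to model the resulting large token space.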

[CV-1] Sonata: Self-Supervised Learning of Reliable Point Representations CVPR2025

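[Quick read]: This paper asks whether there is a reliable self-supervised point cloud model that can serve diverse 3D tasks through simple linear probing, even with limited data and minimal computation. The authors find that existing 3D self-supervised methods fall short when representation quality is measured by linear probing, and hypothesize that the cause is a "geometric shortcut": representations collapse to low-level spatial features. The problem is specific to 3D and arises from the sparsity of point cloud data. Two strategies address it, obscuring spatial information and strengthening reliance on input features, and the resulting model, Sonata, is trained on 140k point clouds through self-distillation. Sonata triples linear probing accuracy on ScanNet (from 21.8% to 72.5%) and nearly doubles the performance of previous approaches when only 1% of the data is used. Full fine-tuning further advances the state of the art on 3D indoor and outdoor perception tasks.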

Link: https://arxiv.org/abs/2503.16429
Authors: Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard Newcombe, Hengshuang Zhao, Julian Straub
Affiliations: The University of Hong Kong; Meta Reality Labs Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, produced by Pointcept x Meta, project page: this https URL

Abstract:In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the “geometric shortcut”, which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.

[CV-2] DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

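[Quick read]: This paper addresses the limited cross-task generalization of existing remote sensing image understanding methods, which do not fully exploit high-resolution data or large-scene semantics. In remote sensing imagery, key foreground targets (for example, maritime objects or artificial structures) often occupy only about 1% of the image and are sparsely distributed, so modeling cross-task generalizable knowledge efficiently from long 2D token sequences (about 100,000 tokens) is difficult. Some contemporary foundation models show potential, but they adapt poorly across tasks and mainly process low-resolution imagery of restricted size.

The proposed model, DynamicVis, is a dynamic visual perception foundation model motivated by the selective attention of the human visual system. Its main component is a dynamic region perception backbone based on the selective state space model, which balances localized detail extraction with global context integration and encodes large-scale data efficiently while keeping the architecture scalable. To improve cross-task knowledge transfer, DynamicVis uses a multi-instance learning paradigm with meta-embedding representations, trained on million-scale region-level annotations. It is evaluated on nine downstream tasks and processes 2048x2048 pixels with 97 ms latency (6% of ViT's) and 833 MB of GPU memory (3% of ViT's).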

Link: https://arxiv.org/abs/2503.16426
Authors: Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
Affiliations: Beihang University; the University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (eg., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transferring, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model’s versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing (2048x2048) pixels with 97 ms latency (6% of ViT’s) and 833 MB GPU memory (3% of ViT’s).

[CV-3] Tokenize Image as a Set

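[Quick read]: This paper proposes a new paradigm for image generation based on set-based tokenization and distribution modeling, in place of conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio. It introduces an unordered token set representation (TokenSet) that allocates coding capacity dynamically according to regional semantic complexity, which improves global context aggregation and robustness to local perturbations. To model discrete sets, the authors devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. They also propose Fixed-Sum Discrete Diffusion, described as the first framework to handle discrete values, fixed sequence length, and summation invariance simultaneously. Experiments show advantages in semantic-aware representation and generation quality.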

Link: https://arxiv.org/abs/2503.16425
Authors: Zigang Geng, Mengde Xu, Han Hu, Shuyang Gu
Affiliations: University of Science and Technology of China; Tencent Hunyuan Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper proposes a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Unlike conventional methods that serialize images into fixed-position latent codes with a uniform compression ratio, we introduce an unordered token set representation to dynamically allocate coding capacity based on regional semantic complexity. This TokenSet enhances global context aggregation and improves robustness against local perturbations. To address the critical challenge of modeling discrete sets, we devise a dual transformation mechanism that bijectively converts sets into fixed-length integer sequences with summation constraints. Further, we propose Fixed-Sum Discrete Diffusion, the first framework to simultaneously handle discrete values, fixed sequence length, and summation invariance, which enables effective set distribution modeling. Experiments demonstrate our method’s superiority in semantic-aware representation and generation quality. Our innovations, spanning novel representation and modeling strategies, advance visual generation beyond traditional sequential token paradigms. Our code and models are publicly available at this https URL.
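
The set-to-sequence conversion described in this abstract can be pictured with a small sketch. One natural reading, assumed here rather than taken from the authors' code, is that an unordered collection of token indices is replaced by a count vector with one entry per codebook index: its length is fixed by the codebook size and its entries always sum to the number of tokens. The function names and the toy codebook size are illustrative.

```python
def set_to_counts(tokens, codebook_size):
    """Map an unordered token collection to a fixed-length count vector."""
    counts = [0] * codebook_size
    for token_id in tokens:
        counts[token_id] += 1
    return counts


def counts_to_set(counts):
    """Inverse map: expand counts into a sorted token list (a canonical order)."""
    tokens = []
    for token_id, count in enumerate(counts):
        tokens.extend([token_id] * count)
    return tokens


tokens = [3, 0, 3, 1]                      # order carries no meaning
counts = set_to_counts(tokens, codebook_size=4)
print(counts)                              # [1, 1, 0, 2]
print(sum(counts) == len(tokens))          # True: the summation constraint
print(counts_to_set(counts))               # [0, 1, 3, 3]
```

Any permutation of the input gives the same count vector, which is what makes the representation order-free, and the fixed sum is the invariant a generative model over such vectors would have to respect.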

[CV-4] Bézier Splatting for Fast and Differentiable Vector Graphics

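[Quick Read]: This paper addresses the problem that existing differentiable vector graphics (VG) representations are costly to optimize and struggle to render high-resolution images at high quality. Its key solution is a new differentiable VG representation, Bézier splatting, which samples 2D Gaussians along Bézier curves and thereby naturally provides positional gradients at object boundaries, enabling fast yet high-fidelity VG rasterization. The paper also introduces an adaptive pruning and densification strategy that dynamically adjusts the spatial distribution of curves to escape local minima, further improving VG quality. Experiments show that Bézier splatting significantly outperforms existing methods in visual fidelity and optimization speed, with 10x faster optimization.

Link: https://arxiv.org/abs/2503.16424
Authors: Xi Liu, Chaoyi Zhou, Nanxuan Zhao, Siyu Huang
Affiliations: Clemson University; Adobe Research
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: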

Abstract:Differentiable vector graphics (VGs) are widely used in image vectorization and vector synthesis, but existing representations are costly to optimize and struggle to achieve high-quality rendering results for high-resolution images. This work introduces a new differentiable VG representation, dubbed Bézier splatting, that enables fast yet high-fidelity VG rasterization. Bézier splatting samples 2D Gaussians along Bézier curves, which naturally provide positional gradients at object boundaries. Thanks to the efficient splatting-based differentiable rasterizer, Bézier splatting achieves over 20x faster forward and 150x faster backward rasterization steps for open curves compared to DiffVG. Additionally, we introduce an adaptive pruning and densification strategy that dynamically adjusts the spatial distribution of curves to escape local minima, further improving VG quality. Experimental results show that Bézier splatting significantly outperforms existing methods with better visual fidelity and 10x faster optimization speed.
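
To make "sampling 2D Gaussians along Bézier curves" concrete, the sketch below places Gaussian centers at evenly spaced parameter values on a cubic Bézier curve. It is a minimal illustration under stated assumptions, not the paper's rasterizer: the curve degree, the uniform spacing in the parameter, and the function names are all choices made for this example, and the Gaussian covariances and colors are left out.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    # Bernstein basis weights for the four control points.
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def gaussian_centers(control_points, n):
    """Return n Gaussian centers at evenly spaced parameter values (n >= 2)."""
    p0, p1, p2, p3 = control_points
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]


centers = gaussian_centers([(0, 0), (1, 2), (3, 2), (4, 0)], n=5)
for center in centers:
    print(center)
```

Because every center is a smooth function of the control points, a loss computed on the splatted Gaussians can propagate gradients back to the curve itself, which is the property the abstract highlights.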

[CV-5] GAEA: A Geolocation Aware Conversational Model

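[Quick Read]: This paper tackles the limitation that conventional image geolocalization models output only precise GPS coordinates, without understanding the location or being able to interact meaningfully with the user. It also targets the difficulty that large multimodal models (LMMs) have with more specialized downstream tasks such as geolocalization. The key is GAEA, a conversational model that provides information about an image's location as required by the user, trained on a comprehensive dataset, also named GAEA, of 800K images and about 1.6M question-answer pairs built from OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, the authors also propose a diverse benchmark of 4K image-text pairs covering multiple question types. In a comparison against open-source and proprietary LMMs, GAEA outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%.

Link: https://arxiv.org/abs/2503.16423
Authors: Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, Mubarak Shah
Affiliations: University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: The dataset and code used in this submission are available at: this https URL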

Abstract:Image geolocalization, in which an AI model traditionally predicts the precise GPS coordinates of an image, is a challenging task with many downstream applications. However, the user cannot utilize the model to further their knowledge beyond the GPS coordinates; the model lacks an understanding of the location and the conversational ability to communicate with the user. Recently, with the tremendous progress of large multimodal models (LMMs), both proprietary and open-source, researchers have attempted to geolocalize images via LMMs. However, the issues remain unaddressed; beyond general tasks, LMMs struggle with more specialized downstream tasks, one of which is geolocalization. In this work, we propose to solve this problem by introducing a conversational model GAEA that can provide information regarding the location of an image, as required by a user. No large-scale dataset enabling the training of such a model exists. Thus, we propose a comprehensive dataset GAEA with 800K images and around 1.6M question-answer pairs constructed by leveraging OpenStreetMap (OSM) attributes and geographical context clues. For quantitative evaluation, we propose a diverse benchmark comprising 4K image-text pairs to evaluate conversational capabilities equipped with diverse question types. We consider 11 state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA significantly outperforms the best open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model, GPT-4o, by 8.28%. Our dataset, model and codes are available

[CV-6] 1000 FPS 4D Gaussian Splatting for Dynamic Scene Rendering

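[Quick Read]: This paper addresses the large storage requirements and slow rendering speed of 4D Gaussian Splatting (4DGS) in dynamic scene reconstruction. It identifies two key sources of temporal redundancy: (Q1) Short-Lifespan Gaussians: 4DGS uses a large number of Gaussians with short temporal spans to represent scene dynamics, leading to an excessive number of Gaussians; (Q2) Inactive Gaussians: only a small subset of Gaussians contributes to each frame during rendering, yet all Gaussians are processed during rasterization, causing redundant computation. To address these, the paper proposes 4DGS-1K, which runs at over 1000 FPS on modern GPUs. For Q1, it introduces the Spatial-Temporal Variation Score, a new pruning criterion that removes short-lifespan Gaussians while encouraging 4DGS to capture scene dynamics with Gaussians of longer temporal span. For Q2, it stores a mask of active Gaussians across consecutive frames, significantly reducing redundant computation in rendering. Compared with vanilla 4DGS, the method reduces storage by 41x and speeds up rasterization by 9x on complex dynamic scenes while maintaining comparable visual quality.

Link: https://arxiv.org/abs/2503.16422
Authors: Yuheng Yuan, Qiuhong Shen, Xingyi Yang, Xinchao Wang
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: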

Abstract:4D Gaussian Splatting (4DGS) has recently gained considerable attention as a method for reconstructing dynamic scenes. Despite achieving superior quality, 4DGS typically requires substantial storage and suffers from slow rendering speed. In this work, we delve into these issues and identify two key sources of temporal redundancy. (Q1) Short-Lifespan Gaussians: 4DGS uses a large portion of Gaussians with short temporal span to represent scene dynamics, leading to an excessive number of Gaussians. (Q2) Inactive Gaussians: When rendering, only a small subset of Gaussians contributes to each frame. Despite this, all Gaussians are processed during rasterization, resulting in redundant computation overhead. To address these redundancies, we present 4DGS-1K, which runs at over 1000 FPS on modern GPUs. For Q1, we introduce the Spatial-Temporal Variation Score, a new pruning criterion that effectively removes short-lifespan Gaussians while encouraging 4DGS to capture scene dynamics using Gaussians with longer temporal spans. For Q2, we store a mask for active Gaussians across consecutive frames, significantly reducing redundant computations in rendering. Compared to vanilla 4DGS, our method achieves a 41x reduction in storage and 9x faster rasterization speed on complex dynamic scenes, while maintaining comparable visual quality. Please see our project page at this https URL.
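
The idea behind Q2, a stored mask of active Gaussians shared by consecutive frames, can be sketched as follows. This is an illustration under assumptions: how the paper decides which Gaussians contribute to a frame is not stated in the abstract, so the per-frame index sets below are hypothetical inputs, and the function names are invented for this example.

```python
def active_mask(contributing_per_frame, num_gaussians):
    """Mark every Gaussian that contributes to at least one frame in a window."""
    mask = [False] * num_gaussians
    for contributing in contributing_per_frame:
        for index in contributing:
            mask[index] = True
    return mask


def gaussians_to_rasterize(mask):
    """Indices that still need to be processed for frames in this window."""
    return [index for index, keep in enumerate(mask) if keep]


# Hypothetical window of three consecutive frames over six Gaussians.
window = [{0, 2}, {2, 5}, {0}]
mask = active_mask(window, num_gaussians=6)
print(gaussians_to_rasterize(mask))  # [0, 2, 5]
```

In this toy window only three of six Gaussians reach the rasterizer; the saving grows with the share of Gaussians that are inactive over the window.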

[CV-7] MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

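[Quick Read]: This paper addresses difficulties in trajectory-controllable video generation, namely weak control over complex object motion, poor multi-object motion control, and degraded visual quality, as well as the limitation that existing methods support only a single trajectory format. It also addresses the lack of public datasets and benchmarks tailored to trajectory-controllable video generation. The key proposal is the MagicMotion framework, which provides trajectory control through three levels of conditions from dense to sparse (masks, bounding boxes, and sparse boxes), animating objects along defined trajectories while maintaining object consistency and visual quality. The paper also presents the MagicData dataset with an automated annotation and filtering pipeline, and the MagicBench benchmark for evaluating video quality and trajectory control accuracy. Experiments show that MagicMotion outperforms existing methods across multiple metrics.

Link: https://arxiv.org/abs/2503.16421
Authors: Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, Zuxuan Wu
Affiliations: Fudan University; Microsoft Research Asia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: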

Abstract:Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods only support trajectory control in a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions from dense to sparse: masks, bounding boxes, and sparse boxes. Given an input image and trajectories, MagicMotion seamlessly animates objects along defined trajectories while maintaining object consistency and visual quality. Furthermore, we present MagicData, a large-scale trajectory-controlled video dataset, along with an automated pipeline for annotation and filtering. We also introduce MagicBench, a comprehensive benchmark that assesses both video quality and trajectory control accuracy across different numbers of objects. Extensive experiments demonstrate that MagicMotion outperforms previous methods across various metrics. Our project page is publicly available at this https URL.

[CV-8] SynCity: Training-Free Generation of 3D Worlds

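[Quick Read]: This paper addresses the problem of generating 3D worlds from textual descriptions. Most existing 3D generative models focus on object-level generation and cannot directly build large-scale scenes. To meet this challenge, the paper proposes SynCity, an approach that requires neither training nor optimization. The key is to combine the geometric precision of pre-trained 3D generative models with the artistic versatility of 2D image generators, using a tile-based approach that gives fine-grained control over scene layout and appearance and progressively generates expansive, high-quality 3D spaces. The world is generated tile by tile, with each new tile generated within its world context and then fused with the scene, yielding immersive scenes rich in detail and diversity.

Link: https://arxiv.org/abs/2503.16420
Authors: Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Affiliations: Visual Geometry Group, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL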

Abstract:We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
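
The tile-by-tile loop described in this abstract can be sketched as a grid traversal in which each new tile sees the tiles already placed around it. Everything concrete here is an assumption made for illustration: the raster visiting order, the choice of four neighbours as "world context", and the `generate_tile` callback, which stands in for the 2D and 3D generators and is replaced by a stub.

```python
def generate_world(rows, cols, generate_tile):
    """Fill a rows x cols grid tile by tile in raster order."""
    world = {}
    for r in range(rows):
        for c in range(cols):
            # Tiles generated earlier that touch the new tile form its context.
            neighbours = [(r - 1, c - 1), (r - 1, c), (r - 1, c + 1), (r, c - 1)]
            context = {pos: world[pos] for pos in neighbours if pos in world}
            world[(r, c)] = generate_tile((r, c), context)
    return world


def stub_tile(position, context):
    """Placeholder generator: records which neighbours the tile was conditioned on."""
    return {"position": position, "conditioned_on": sorted(context)}


world = generate_world(2, 2, stub_tile)
print(world[(0, 0)]["conditioned_on"])  # []
print(world[(1, 1)]["conditioned_on"])  # [(0, 0), (0, 1), (1, 0)]
```

Conditioning each tile only on neighbours that already exist is what lets the world keep growing: the loop never needs a global view of the scene.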

[CV-9] InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

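[Quick Read]: This paper addresses the difficulty of achieving both flexibility and high fidelity in identity-preserved image generation, particularly with advanced Diffusion Transformers (DiTs) such as FLUX. Existing methods suffer from insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. The key proposal is the InfiniteYou (InfU) framework, whose core component, InfuseNet, injects identity features into the DiT base model via residual connections, improving identity similarity while preserving generation capabilities. A multi-stage training strategy, comprising pretraining and supervised fine-tuning with synthetic single-person-multiple-sample data, further improves text-image alignment, raises image quality, and alleviates face copy-pasting. Experiments show that InfU outperforms existing baselines, and its plug-and-play design is compatible with various existing methods, making it a useful contribution to the community.

Link: https://arxiv.org/abs/2503.16418
Authors: Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Affiliations: ByteDance Intelligent Creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL Code and model: this https URL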

Abstract:Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.

[CV-10] M3: 3D-Spatial MultiModal Memory ICLR2025

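[Quick Read]: This paper addresses the core compression challenges in 3D feature distillation, with the specific goal of retaining visual perception information about medium-sized static scenes from video sources. The core of the solution is a multimodal memory system named 3D Spatial MultiModal Memory (M3), whose key components are principal scene components and Gaussian memory attention. By combining 3D Gaussian Splatting with foundation models, these components build feature representations across granularities efficiently and address two key problems of earlier methods: (1) computational constraints in storing high-dimensional features; (2) misalignment or information loss between distilled features and foundation model features. The paper validates M3 with comprehensive quantitative evaluations of feature similarity and downstream tasks, and with visualizations of the pixel trace of Gaussian memory attention.

Link: https://arxiv.org/abs/2503.16413
Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
Affiliations: UC San Diego; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: ICLR2025 homepage: this https URL code: this https URL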

Abstract:We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3’s feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.

[CV-11] DreamTexture: Shape from Virtual Texture with Analysis by Augmentation

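[Quick Read]: This paper addresses the problem that existing unsupervised 3D reconstruction methods such as DreamFusion are computationally expensive and under-constrained because of multi-view rendering and supervision from large-scale generative models. It proposes DreamTexture, which reconstructs 3D objects without multi-view supervision by aligning a virtual texture with the monocular depth cues in the input image, exploiting the understanding of monocular geometry encoded in modern diffusion models. The key is a Shape-from-Virtual-Texture approach that reconstructs depth from the deformation of the virtual texture through a new conformal map optimization, avoiding memory-intensive volumetric representations. The paper also proposes a new monocular reconstruction paradigm, Analysis by Augmentation, in which shape is extracted by augmenting and aligning texture cues.

Link: https://arxiv.org/abs/2503.16412
Authors: Ananta R. Bhattarai, Xingzhe He, Alla Sheffer, Helge Rhodin
Affiliations: Bielefeld University; University of British Columbia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL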

Abstract:DreamFusion established a new paradigm for unsupervised 3D reconstruction from virtual views by combining advances in generative models and differentiable rendering. However, the underlying multi-view rendering, along with supervision from large-scale generative models, is computationally expensive and under-constrained. We propose DreamTexture, a novel Shape-from-Virtual-Texture approach that leverages monocular depth cues to reconstruct 3D objects. Our method textures an input image by aligning a virtual texture with the real depth cues in the input, exploiting the inherent understanding of monocular geometry encoded in modern diffusion models. We then reconstruct depth from the virtual texture deformation with a new conformal map optimization, which alleviates memory-intensive volumetric representations. Our experiments reveal that generative models possess an understanding of monocular shape cues, which can be extracted by augmenting and aligning texture cues – a novel monocular reconstruction paradigm that we call Analysis by Augmentation.

[CV-12] RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

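[Quick Read]: This paper addresses the design of efficient and safe embodied multi-agent systems for complex real-world tasks across domains, where existing methods fail to automatically generate training data for such systems. The key is the concept of compositional constraints for embodied multi-agent systems, together with interfaces tailored to different constraint types that enable seamless interaction with the physical world. On this basis, the paper develops an automated data collection framework and introduces RoboFactory, the first benchmark for embodied multi-agent manipulation. Using RoboFactory, it evaluates imitation learning methods and explores architectures and training strategies for multi-agent imitation learning, with the aim of building safe and efficient embodied multi-agent systems.

Link: https://arxiv.org/abs/2503.16408
Authors: Yiran Qin, Li Kang, Xiufeng Song, Zhenfei Yin, Xiaohong Liu, Xihui Liu, Ruimao Zhang, Lei Bai
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL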

Abstract:Designing effective embodied multi-agent systems is critical for solving complex real-world tasks across domains. Due to the complexity of multi-agent embodied systems, existing methods fail to automatically generate safe and efficient training data for such systems. To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. We design various interfaces tailored to different types of constraints, enabling seamless interaction with the physical world. Leveraging compositional constraints and specifically designed interfaces, we develop an automated data collection framework for embodied multi-agent systems and introduce the first benchmark for embodied multi-agent manipulation, RoboFactory. Based on the RoboFactory benchmark, we adapt and evaluate imitation learning methods and analyze their performance on agent tasks of varying difficulty. Furthermore, we explore the architectures and training strategies for multi-agent imitation learning, aiming to build safe and efficient embodied multi-agent systems.

[CV-13] VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness CVPR2025

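[Quick Read]: This paper addresses the difficulty that existing large-scale text-to-image diffusion models have in accurately depicting human-object interactions when generating photorealistic images. The core challenge is the models' limited ability to distinguish between different interaction words. The paper proposes VerbDiff, which weakens the bias between interaction words and objects to strengthen the model's understanding of interaction semantics. Its key technique is to disentangle interaction words from frequency-based anchor words and to use localized interaction regions from generated images to help the model capture the semantics of specific interaction words more precisely, without extra conditions. This lets the model understand the human-object interaction a given word expresses and generate high-quality images with accurate interactions. Experiments on the HICO-DET dataset demonstrate the method's effectiveness compared with previous approaches.

Link: https://arxiv.org/abs/2503.16406
Authors: SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim
Affiliations: Hanyang University, South Korea
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted at CVPR 2025, code: this https URL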

Abstract:Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.

[CV-14] SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

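[Quick Read]: This paper addresses the limited accuracy of existing vision-based 3D occupancy prediction methods, which rely exclusively on street-view imagery and overlook the potential benefits of satellite views. It proposes SA-Occ, a satellite-assisted 3D occupancy prediction model that uses GPS and IMU to bring historical yet readily available satellite imagery into real-time applications, mitigating the limitations of ego-vehicle perception such as occlusions and degraded performance in distant regions. The solution rests on three core techniques: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Together these improve the accuracy and efficiency of cross-view perception.

Link: https://arxiv.org/abs/2503.16399
Authors: Chen Chen, Zhirui Wang, Taowei Sheng, Yi Jiang, Yundu Li, Peirui Cheng, Luning Zhang, Kaiqiang Chen, Yanfeng Hu, Xue Yang, Xian Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages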

Abstract:Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS and IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception, including occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset are available at this https URL.
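
For readers unfamiliar with the mIoU figure quoted in this abstract, the sketch below computes mean intersection-over-union over per-voxel class labels. It is a simplified illustration: the labels are flat toy lists, classes absent from both prediction and ground truth are skipped, and the actual Occ3D-nuScenes protocol (class list, visibility masks) is not reproduced here.

```python
def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes that appear nowhere
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(mean_iou(pred, target, num_classes=3))  # approximately 0.5
```

Because every class counts equally in the average, a rare class that is predicted poorly pulls mIoU down as much as a common one.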

[CV-15] Scale-wise Distillation of Diffusion Models

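[Quick Read]: This paper addresses the computational cost of inference with diffusion models (DMs) and proposes SwD, a scale-wise distillation framework. SwD's key innovation is to bring in the idea of next-scale prediction: generation starts at a lower resolution and the samples are progressively upscaled at each denoising step, significantly reducing computation without loss in performance. SwD integrates this idea into existing distillation methods based on distribution matching and adds a novel patch loss that enforces finer-grained similarity to the target distribution. Experiments show that SwD approaches the inference time of two full-resolution steps and outperforms its counterparts under the same computation budget on automated metrics and in human preference studies.

Link: https://arxiv.org/abs/2503.16397
Authors: Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, Dmitry Baranchuk
Affiliations: Yandex Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: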

Abstract:We present SwD, a scale-wise distillation framework for diffusion models (DMs), which effectively employs next-scale prediction ideas for diffusion-based few-step generators. In more detail, SwD is inspired by the recent insights relating diffusion processes to the implicit spectral autoregression. We suppose that DMs can initiate generation at lower data resolutions and gradually upscale the samples at each denoising step without loss in performance while significantly reducing computational costs. SwD naturally integrates this idea into existing diffusion distillation methods based on distribution matching. Also, we enrich the family of distribution matching approaches by introducing a novel patch loss enforcing finer-grained similarity to the target distribution. When applied to state-of-the-art text-to-image diffusion models, SwD approaches the inference times of two full resolution steps and significantly outperforms the counterparts under the same computation budget, as evidenced by automated metrics and human preference studies.
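
A rough back-of-the-envelope sketch shows why starting at low resolution and upscaling step by step saves compute. Two things here are assumptions and not taken from the paper: the six-step resolution schedule is hypothetical, and per-step cost is modelled as proportional to pixel count, which understates the saving for attention layers whose cost grows faster than linearly in the number of tokens.

```python
def relative_cost(resolutions, full_res):
    """Total schedule cost in units of one full-resolution step.

    Assumes the cost of a step is proportional to its pixel count.
    """
    return sum((res / full_res) ** 2 for res in resolutions)


# Hypothetical schedule: one denoising step per scale, ending at 1024 px.
schedule = [256, 384, 512, 640, 768, 1024]
print(relative_cost(schedule, full_res=1024))         # 2.40625
print(relative_cost([1024] * len(schedule), 1024))    # 6.0
```

Under these assumptions six scale-wise steps cost about as much as 2.4 full-resolution steps, the same order as the "two full resolution steps" quoted in the abstract, although the paper's actual schedule and cost model may differ.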

[CV-16] SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

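[Quick Read]: This paper addresses the shortcomings of multi-view video diffusion models for dynamic 3D asset generation in handling occlusions and large motion and in generalizing to real-world videos, while also improving detail sharpness and spatio-temporal consistency. The key lies in improvements in four areas: network architecture (removing the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention), data (improving the quality and quantity of training data), training strategy (progressive 3D-to-4D training for better generalization), and 4D optimization (handling 3D inconsistency and large motion through two-stage refinement and progressive frame sampling). Compared with SV4D, these changes improve detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis, and likewise in 4D optimization (-12% LPIPS and -24% FV4D).

Link: https://arxiv.org/abs/2503.16396
Authors: Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani
Affiliations: Stability AI; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: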

Abstract:We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency on reference multi-views and designing a blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D. Project page: this https URL.

[CV-17] Panoptic-CUDAL Technical Report: Rural Australia Point Cloud Dataset in Rainy Conditions

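[Quick Read]: This paper addresses the shortcomings of existing autonomous driving datasets with respect to rural environments and adverse weather such as rain. Existing datasets focus mainly on well-structured urban settings and favorable weather, and pay little attention to complex rural scenes or to sensor degradation in bad weather (for example, noise and reflections in LiDAR and camera data). The key contribution is the Panoptic-CUDAL dataset, purpose-built for panoptic segmentation in rural areas subject to rain; by recording high-resolution LiDAR, camera, and pose data, it offers an information-rich and challenging dataset. The paper also provides baseline results for panoptic and semantic segmentation methods on the dataset.

Link: https://arxiv.org/abs/2503.16378
Authors: Tzu-Yun Tseng, Alexey Nekrasov, Malcolm Burdorf, Bastian Leibe, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: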

Abstract:Existing autonomous driving datasets are predominantly oriented towards well-structured urban settings and favorable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system’s capabilities for reliable environmental perception and safe navigation. We introduce the Panoptic-CUDAL dataset, a novel dataset purpose-built for panoptic segmentation in rural areas subject to rain. By recording high-resolution LiDAR, camera, and pose data, Panoptic-CUDAL offers a diverse, information-rich dataset in a challenging scenario. We present analysis of the recorded data and provide baseline results for panoptic and semantic segmentation methods on LiDAR point clouds. The dataset can be found here: this https URL

[CV-18] LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

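[Quick Read]: This paper addresses the difficulty and cost of acquiring high-quality paired data, which modern machine learning tasks such as facial translation networks depend on. It proposes a new framework, LLM-assisted Paired Image Generation (LaPIG), whose key is to use Large Language Models (LLMs) to generate descriptive captions, combined with visible image synthesis using ArcFace embeddings and thermal image translation using Latent Diffusion Models (LDMs), to generate diverse, high-quality multi-view paired visible and thermal images while preserving identity information. Comparison with existing methods on public datasets demonstrates the superiority of LaPIG.

Link: https://arxiv.org/abs/2503.16376
Authors: Leyang Wang, Joice Lin
Affiliations: University College London; Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: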

Abstract:The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.

[CV-19] NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

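[Quick Read]: This paper addresses the generation of expansive outdoor scenes, ranging from castles to high-rises. Compared with indoor scene generation, outdoor scene generation poses unique challenges, including wide variations in scene height and the need to produce large landscapes quickly. The paper proposes an efficient approach whose key is to encode scene chunks as uniform vector sets, which offers better compression and performance than the spatially structured latents used in prior methods. It also trains an explicit outpainting model for unbounded generation, which improves coherence over prior resampling-based inpainting schemes and speeds up generation by eliminating extra diffusion steps.

Link: https://arxiv.org/abs/2503.16375
Authors: Han-Hung Lee, Qinghong Han, Angel X. Chang
Affiliations: Simon Fraser University; Canada CIFAR AI Chair, Amii
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: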

Abstract:In this paper, we explore the task of generating expansive outdoor scenes, ranging from castles to high-rises. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including wide variations in scene heights and the need for a method capable of rapidly producing large landscapes. To address this, we propose an efficient approach that encodes scene chunks as uniform vector sets, offering better compression and performance than the spatially structured latents used in prior methods. Furthermore, we train an explicit outpainting model for unbounded generation, which improves coherence compared to prior resampling-based inpainting schemes while also speeding up generation by eliminating extra diffusion steps. To facilitate this task, we curate NuiScene43, a small but high-quality set of scenes, preprocessed for joint training. Notably, when trained on scenes of varying styles, our model can blend different environments, such as rural houses and city skyscrapers, within the same scene, highlighting the potential of our curation process to leverage heterogeneous scenes for joint training.

[CV-20] JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

【速读】：该论文旨在解决视觉语言动作（Visual Language Action, VLA）模型在开放世界环境中基于动作的决策能力不足的问题。尽管现有的VLA模型通过大规模预训练在决策任务中表现出潜力，但以往研究主要集中在动作后训练（action post-training），而忽视了对基础模型本身的改进。为了解决这一局限，论文提出了一种创新的方法——Act from Visual Language Post-Training，通过视觉和语言引导以自监督的方式优化视觉语言模型（Visual Language Models, VLMs）。这种方法的关键在于通过增强模型的世界知识、视觉识别能力和空间定位能力，显著提升了其在开放世界环境中的表现。论文通过在Minecraft环境中验证，证明了非轨迹任务的后训练范式使模型在超过1k个不同原子任务上的性能相比最佳基线提升了40%，并且超越了传统的模仿学习策略，达到了最先进的性能。

Link: https://arxiv.org/abs/2503.16365
Authors: Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
Affiliations: Peking University; BIGAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures

Abstract:Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundational model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This enhancement improves the models’ capabilities in world knowledge, visual recognition, and spatial grounding in open-world environments. Following the above post-training paradigms, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at this https URL.

[CV-21] UniSync: A Unified Framework for Audio-Visual Synchronization ICME2025

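[Quick read]: This paper addresses the limitations of audio-visual synchronization methods for speech videos in complex scenarios, where existing methods typically rely on limited audio-visual representations and suboptimal learning strategies. The key solution is UniSync, a method that evaluates synchronization using embedding similarities. It is compatible with a range of audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM) and handles their large dimensional differences. UniSync also strengthens the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, which improves discriminative ability. It outperforms existing methods on standard datasets and improves synchronization quality for both natural and AI-generated content.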

Link: https://arxiv.org/abs/2503.16357
Authors: Tao Feng, Yifan Xie, Xun Guan, Jiyuan Song, Zhou Liu, Fei Ma, Fei Yu
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 7 pages, 3 figures, accepted by ICME 2025

Abstract:Precise audio-visual synchronization in speech videos is crucial for content quality and viewer comprehension. Existing methods have made significant strides in addressing this challenge through rule-based approaches and end-to-end learning techniques. However, these methods often rely on limited audio-visual representations and suboptimal learning strategies, potentially constraining their effectiveness in more complex scenarios. To address these limitations, we present UniSync, a novel approach for evaluating audio-visual synchronization using embedding similarities. UniSync offers broad compatibility with various audio representations (e.g., Mel spectrograms, HuBERT) and visual representations (e.g., RGB images, face parsing maps, facial landmarks, 3DMM), effectively handling their significant dimensional differences. We enhance the contrastive learning framework with a margin-based loss component and cross-speaker unsynchronized pairs, improving discriminative capabilities. UniSync outperforms existing methods on standard datasets and demonstrates versatility across diverse audio-visual representations. Its integration into talking face generation frameworks enhances synchronization quality in both natural and AI-generated content.
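
To make the embedding-similarity idea concrete, here is a minimal sketch that scores synchronization as the cosine similarity between an audio embedding and a visual embedding, and applies a margin-based hinge penalty against an unsynchronized pair. The function names, the margin value, and the exact loss form are assumptions for illustration only; the abstract does not specify UniSync's actual loss.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def margin_sync_loss(audio, visual_sync, visual_unsync, margin=0.5):
    """Hypothetical hinge loss: the synchronized pair should score
    at least `margin` higher than the unsynchronized pair."""
    pos = cosine_similarity(audio, visual_sync)
    neg = cosine_similarity(audio, visual_unsync)
    return max(0.0, margin - (pos - neg))

# Toy 2-D embeddings (made up for the example).
audio = [1.0, 0.0]
in_sync = [1.0, 0.0]
out_of_sync = [0.0, 1.0]

loss_good = margin_sync_loss(audio, in_sync, out_of_sync)  # margin satisfied
loss_bad = margin_sync_loss(audio, out_of_sync, in_sync)   # margin violated
```

The loss is zero once the synchronized pair already beats the unsynchronized one by the margin, so training effort concentrates on hard pairs such as cross-speaker negatives.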

[CV-22] Gaussian Graph Network: Learning Efficient and Generalizable Gaussian Representations from Multi-view Images NEURIPS2024

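[Quick read]: This paper addresses two problems in existing 3D Gaussian Splatting (3DGS) methods. First, conventional methods require per-scene optimization, and while recent feed-forward methods produce generalizable Gaussian representations, simply combining pixel-aligned Gaussians from multiple views causes artifacts and extra memory cost. Second, existing methods do not fully capture the relations among Gaussians from different views. The paper proposes the Gaussian Graph Network (GGN). Its key idea is to construct Gaussian Graphs that model the relations between Gaussian groups from different views, and to reformulate the basic graph operations over Gaussian representations so that message passing works at the Gaussian level; each Gaussian then benefits from its connected Gaussian groups through Gaussian feature fusion. A Gaussian pooling layer aggregates the Gaussian groups efficiently, yielding more efficient representations and better image quality.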

Link: https://arxiv.org/abs/2503.16338
Authors: Shengjun Zhang, Xin Fei, Fangfu Liu, Haixu Song, Yueqi Duan
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2024

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis performance. While conventional methods require per-scene optimization, more recently several feed-forward methods have been proposed to generate pixel-aligned Gaussian representations with a learnable network, which are generalizable to different scenes. However, these methods simply combine pixel-aligned Gaussians from multiple views as scene representations, thereby leading to artifacts and extra memory cost without fully capturing the relations of Gaussians from different images. In this paper, we propose Gaussian Graph Network (GGN) to generate efficient and generalizable Gaussian representations. Specifically, we construct Gaussian Graphs to model the relations of Gaussian groups from different views. To support message passing at Gaussian level, we reformulate the basic graph operations over Gaussian representations, enabling each Gaussian to benefit from its connected Gaussian groups with Gaussian feature fusion. Furthermore, we design a Gaussian pooling layer to aggregate various Gaussian groups for efficient representations. We conduct experiments on the large-scale RealEstate10K and ACID datasets to demonstrate the efficiency and generalization of our method. Compared to the state-of-the-art methods, our model uses fewer Gaussians and achieves better image quality with higher rendering speed.

[CV-23] Ultra-Resolution Adaptation with Ease

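[Quick read]: This paper addresses high-resolution image generation with text-to-image diffusion models when training data and compute are limited. It approaches the problem from two perspectives, data efficiency and parameter efficiency, and proposes a set of guidelines termed URAE. On the data side, theoretical analysis and experiments show that synthetic data generated by teacher models can significantly speed up training convergence. On the parameter side, when synthetic data are unavailable, tuning minor components of the weight matrices outperforms the widely used low-rank adapters while remaining efficient. In addition, for guidance-distilled models such as FLUX, disabling classifier-free guidance (setting the guidance scale to 1 during adaptation) is crucial for satisfactory performance. Experiments show that URAE, with only 3K samples and 2K iterations, achieves 2K-resolution generation comparable to state-of-the-art closed-source models such as FLUX1.1 [Pro] Ultra, and sets new benchmarks for 4K-resolution generation.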

Link: https://arxiv.org/abs/2503.16322
Authors: Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report. Code is available at this https URL

Abstract:Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Code is available at this https URL.
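
To see why a guidance scale of 1 amounts to disabling classifier-free guidance, the sketch below uses the common two-pass formulation, in which the guided prediction is the unconditional prediction plus the scale times the conditional-minus-unconditional difference. This illustrates the general formula only; how a guidance-distilled model such as FLUX consumes the scale internally is not described in the abstract, and the function name and numbers are made up.

```python
def guided_prediction(eps_uncond, eps_cond, scale):
    """Classifier-free guidance in its common two-pass form:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for a 3-element latent.
eps_uncond = [0.2, -0.1, 0.4]
eps_cond = [0.5, 0.3, -0.2]

no_guidance = guided_prediction(eps_uncond, eps_cond, 1.0)  # equals eps_cond
amplified = guided_prediction(eps_uncond, eps_cond, 3.5)    # pushed past eps_cond
```

At scale 1 the unconditional term cancels and the output is the plain conditional prediction; larger scales extrapolate beyond it.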

[CV-24] Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

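[Quick read]: This paper addresses the inability of point-map formulations of multi-view geometry to handle dynamic scenes. Prior work reduces many tasks (estimating camera intrinsics and extrinsics, 3D scene reconstruction, and image correspondence) to predicting a pair of viewpoint-invariant point maps, but this formulation cannot handle dynamic scenes. The key contribution is the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. The core idea is that once time is introduced, there are several possible spatial and temporal reference frames for defining point maps; the paper identifies a minimal subset of these combinations that a network can regress to solve the sub-tasks above. A DPM predictor trained on a mixture of synthetic and real data achieves state-of-the-art performance across multiple benchmarks.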

Link: https://arxiv.org/abs/2503.16318
Authors: Edgar Sucar, Zihang Lai, Eldar Insafutdinov, Andrea Vedaldi
Affiliations: Visual Geometry Group (VGG), University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Web page: this https URL

Abstract:DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the sub tasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance. Code, models and additional results are available at this https URL.

[CV-25] Unleashing Vecset Diffusion Model for Fast Shape Generation

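[Quick read]: This paper addresses fast generation of high-resolution 3D shapes with the Vecset Diffusion Model (VDM), focusing on speeding up both diffusion sampling and VAE decoding, two areas under-explored in previous work. The key solution is FlashVDM, a systematic framework that accelerates both the diffusion transformer (DiT) and the VAE in VDM. For the DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps at comparable quality, made possible by stabilizing consistency distillation with the newly introduced Progressive Flow Distillation. For the VAE, it introduces a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of the shape surface in the volume, the decoder drastically lowers FLOPs and the overall decoding overhead. Applying FlashVDM to Hunyuan3D-2 yields Hunyuan3D-2 Turbo, which significantly outperforms existing fast 3D generation methods and reaches performance comparable to the state of the art while reducing inference time by over 45x for reconstruction and 32x for generation.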

Link: https://arxiv.org/abs/2503.16302
Authors: Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Fuyun Wang, Huiwen Shi, Xianghui Yang, Qinxiang Lin, Jinwei Huang, Yuhong Liu, Jie Jiang, Chunchao Guo, Xiangyu Yue
Affiliations: MMLab, CUHK; Tencent Hunyuan; ShanghaiTech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Technical report

Abstract:3D shape generation has greatly flourished through the development of so-called “native” 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps and comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vecset and the sparsity of shape surface in the volume, our decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to Hunyuan3D-2 to obtain Hunyuan3D-2 Turbo. Through systematic evaluation, we show that our model significantly outperforms existing fast 3D generation methods, achieving comparable performance to the state-of-the-art while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models are available at this https URL.

[CV-26] SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

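[Quick read]: This paper addresses the limited controllability and flexibility of human-scene interaction (HSI) modeling in practical applications. Generative approaches have made progress but adapt poorly to real-world use. The paper reformulates HSI modeling as Scene-aware Motion In-betweening, a more tractable and practical task. The key is the SceneMI framework, which uses dual scene descriptors to encode global and local scene context comprehensively and leverages the denoising nature of diffusion models to handle noisy keyframes. Experiments demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and its generalization to the real-world GIMO dataset, as well as its applicability to HSI reconstruction from monocular videos.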

Link: https://arxiv.org/abs/2503.16289
Authors: Inwoo Hwang, Bing Zhou, Young Min Kim, Jian Wang, Chuan Guo
Affiliations: Seoul National University; Snap Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, Project page: this http URL

Abstract:Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening – a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI’s effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI’s applicability in HSI reconstruction from monocular videos.

[CV-27] PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification

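[Quick read]: This paper addresses the failure of existing attention-based Multiple Instance Learning (MIL) methods to fully exploit spatial relationships among tiles when classifying Whole Slide Images (WSIs), which can cause intricate tissue structures crucial for diagnosis to be overlooked. It proposes Probabilistic Spatial Attention MIL (PSA-MIL). The key is integrating spatial context into the attention mechanism through learnable distance-decayed priors, formulated within a probabilistic interpretation of self-attention as a posterior distribution. This allows spatial relationships to be inferred dynamically during training, without predefined assumptions. The paper also proposes a spatial pruning strategy for the posterior to reduce computational complexity, and a diversity loss that encourages different attention heads to capture distinct spatial representations. Together these give a more data-driven and adaptive integration of spatial context.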

Link: https://arxiv.org/abs/2503.16284
Authors: Sharon Peled, Yosef E. Maruvka, Moti Freiman
Affiliations: Technion – Israel Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures

Abstract:Whole Slide Images (WSIs) are high-resolution digital scans widely used in medical diagnostics. WSI classification is typically approached using Multiple Instance Learning (MIL), where the slide is partitioned into tiles treated as interconnected instances. While attention-based MIL methods aim to identify the most informative tiles, they often fail to fully exploit the spatial relationships among them, potentially overlooking intricate tissue structures crucial for accurate diagnosis. To address this limitation, we propose Probabilistic Spatial Attention MIL (PSA-MIL), a novel attention-based MIL framework that integrates spatial context into the attention mechanism through learnable distance-decayed priors, formulated within a probabilistic interpretation of self-attention as a posterior distribution. This formulation enables a dynamic inference of spatial relationships during training, eliminating the need for predefined assumptions often imposed by previous approaches. Additionally, we suggest a spatial pruning strategy for the posterior, effectively reducing self-attention’s quadratic complexity. To further enhance spatial modeling, we introduce a diversity loss that encourages variation among attention heads, ensuring each captures distinct spatial representations. Together, PSA-MIL enables a more data-driven and adaptive integration of spatial context, moving beyond predefined constraints. We achieve state-of-the-art performance across both contextual and non-contextual baselines, while significantly reducing computational costs.
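
One way to picture a distance-decayed prior inside attention is to multiply each tile's content-based attention weight by a prior that shrinks with spatial distance, then renormalize; tiles whose prior is negligible can be skipped, which is the intuition behind spatial pruning. The Gaussian decay, the `sigma` parameter (standing in for a learnable quantity), and the pruning threshold are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def spatial_attention(scores, distances, sigma, prune_below=1e-3):
    """Attention of one query tile over all tiles.

    scores:    content logits (query-key similarity) per tile
    distances: spatial distance from the query tile to each tile
    sigma:     decay width of the assumed Gaussian prior
    """
    top = max(scores)  # subtract the max logit for numerical stability
    weights = []
    for s, d in zip(scores, distances):
        prior = math.exp(-(d * d) / (2.0 * sigma * sigma))
        if prior < prune_below:
            weights.append(0.0)  # pruned: too far away to matter
        else:
            weights.append(math.exp(s - top) * prior)
    total = sum(weights)  # non-zero as long as the query tile itself is included
    return [w / total for w in weights]

# Three tiles with identical content scores at increasing distance.
scores = [0.0, 0.0, 0.0]
distances = [0.0, 1.0, 10.0]
attn = spatial_attention(scores, distances, sigma=1.0)
```

With equal content scores, the result is driven entirely by the prior: the nearest tile gets the most weight and the distant tile is pruned to zero.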

[CV-28] Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model CVPR2025

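[Quick read]: This paper addresses Generalized Few-Shot 3D Point Cloud Segmentation (GFS-PCS): adapting a model to novel classes from only a few support samples while retaining segmentation performance on base classes. Existing methods enhance prototypes by interacting with support or query features, but remain limited by the sparse knowledge available from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs) generalize across open-world novel classes and contain rich but noisy novel-class knowledge. The paper proposes a framework named GFS-VL that combines dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to exploit the strengths of both.

The key components are as follows. First, a prototype-guided pseudo-label selection filters out low-quality regions. Second, an adaptive infilling strategy combines knowledge from pseudo-label contexts and few-shot samples to label the filtered, unlabeled regions. Third, a novel-base mix strategy embeds few-shot samples into training scenes, preserving essential context and improving novel-class learning. In addition, because current benchmarks lack diversity, the paper introduces two challenging benchmarks with diverse novel classes for comprehensive evaluation of generalization. Experiments validate the framework across models and datasets.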

Link: https://arxiv.org/abs/2503.16282
Authors: Zhaochong An, Guolei Sun, Yun Liu, Runjia Li, Junlin Han, Ender Konukoglu, Serge Belongie
Affiliations: Department of Computer Science, University of Copenhagen; Computer Vision Laboratory, ETH Zurich; College of Computer Science, Nankai University; Department of Engineering Science, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Generalized few-shot 3D point cloud segmentation (GFS-PCS) adapts models to new classes with few support samples while retaining base class segmentation. Existing GFS-PCS methods enhance prototypes via interacting with support or query features but remain limited by sparse knowledge from few-shot samples. Meanwhile, 3D vision-language models (3D VLMs), generalizing across open-world novel classes, contain rich but noisy novel class knowledge. In this work, we introduce a GFS-PCS framework that synergizes dense but noisy pseudo-labels from 3D VLMs with precise yet sparse few-shot samples to maximize the strengths of both, named GFS-VL. Specifically, we present a prototype-guided pseudo-label selection to filter low-quality regions, followed by an adaptive infilling strategy that combines knowledge from pseudo-label contexts and few-shot samples to adaptively label the filtered, unlabeled areas. Additionally, we design a novel-base mix strategy to embed few-shot samples into training scenes, preserving essential context for improved novel class learning. Moreover, recognizing the limited diversity in current GFS-PCS benchmarks, we introduce two challenging benchmarks with diverse novel classes for comprehensive generalization evaluation. Experiments validate the effectiveness of our framework across models and datasets. Our approach and benchmarks provide a solid foundation for advancing GFS-PCS in the real world. The code is at this https URL
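
As a rough illustration of prototype-guided pseudo-label selection, the sketch below keeps a region's pseudo-label only when the region's feature is close, in cosine similarity, to the few-shot prototype of the predicted class; rejected regions are marked unlabeled so that a later infilling step could handle them. The data layout, the threshold, and the function names are hypothetical, since the abstract does not give the actual selection rule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_pseudo_labels(regions, prototypes, threshold=0.8):
    """Keep a pseudo-label only if the region feature agrees with the
    class prototype; otherwise mark the region as unlabeled (None)."""
    selected = {}
    for region_id, (feature, label) in regions.items():
        proto = prototypes.get(label)
        if proto is not None and cosine(feature, proto) >= threshold:
            selected[region_id] = label
        else:
            selected[region_id] = None
    return selected

# Prototypes built from few-shot support samples (toy 2-D features).
prototypes = {"chair": [1.0, 0.0], "lamp": [0.0, 1.0]}

# Regions with a feature and a noisy pseudo-label from a 3D VLM.
regions = {
    "r1": ([0.9, 0.1], "chair"),  # consistent with the chair prototype
    "r2": ([0.2, 0.9], "chair"),  # looks like a lamp: filtered out
    "r3": ([0.1, 1.0], "lamp"),   # consistent with the lamp prototype
}
selected = select_pseudo_labels(regions, prototypes)
```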

[CV-29] From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction IROS

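[Quick read]: This paper addresses automated guidance in minimally invasive surgery, specifically how to achieve precise anatomical understanding and scene reconstruction in space-limited clinical settings. Current methods rely on bulky depth (RGB-D) cameras to map the anatomy, which do not suit space-limited applications. Monocular cameras are small and suit minimally invasive surgery, but need additional processing to produce 3D scene understanding. The key solution is a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To obtain the most precise reconstruction, the study compares several structure from motion (SfM) algorithms on mapping central airway obstructions, and tests the pipeline on a downstream tumor resection task. On several metrics, including post-procedure tissue model evaluation, the pipeline performs comparably to RGB-D cameras and in some cases surpasses them, showing the potential of monocular cameras for automated guidance in minimally invasive procedures.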

Link: https://arxiv.org/abs/2503.16263
Authors: Ayberk Acar, Mariana Smith, Lidia Al-Zogbi, Tanner Watts, Fangjie Li, Hao Li, Nural Yilmaz, Paul Maria Scheikl, Jesse F. d’Almeida, Susheela Sharma, Lauren Branscombe, Tayfun Efe Ertop, Robert J. Webster III, Ipek Oguz, Alan Kuntz, Axel Krieger, Jie Ying Wu
Affiliations: Department of Computer Science, Vanderbilt University, Nashville, TN 37235, USA; Department of Mechanical Engineering, Johns Hopkins University, Baltimore, MD 21211, USA; Robotics Center and Kahlert School of Computing, University of Utah, Salt Lake City, UT 84112, USA; Department of Mechanical Engineering, Vanderbilt University, Nashville, TN 37235, USA; Virtuoso Surgical, Nashville, TN 37205, USA; Department of Mechanical, Aerospace and Biomedical Engineering, University of Tennessee, Knoxville, TN 37996, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 8 figures, 1 table. This work has been submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) for possible publication

Abstract:Surgical automation requires precise guidance and understanding of the scene. Current methods in the literature rely on bulky depth cameras to create maps of the anatomy; however, this does not translate well to space-limited clinical applications. Monocular cameras are small and allow minimally invasive surgeries in tight spaces, but additional processing is required to generate 3D scene understanding. We propose a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To ensure the most precise reconstruction, we compare the performance of different structure from motion algorithms on mapping central airway obstructions, and test the pipeline on a downstream task of tumor resection. In several metrics, including post-procedure tissue model evaluation, our pipeline performs comparably to RGB-D cameras and, in some cases, even surpasses their performance. These promising results demonstrate that automation guidance can be achieved in minimally invasive procedures with monocular cameras. This study is a step toward the complete autonomy of surgical robots.

[CV-30] Chain of Functions: A Programmatic Pipeline for Fine-Grained Chart Reasoning Data

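[Quick read]: This paper addresses the limited visual reasoning ability of multimodal large language models (MLLMs) on complex chart queries, and in particular the scarcity of high-quality rationale data. Existing methods generate data by prompting (M)LLMs directly, which often yields limited precision and diversity. The key solution is Chain of Functions (CoF), a programmatic pipeline for generating reasoning data that uses freely explored reasoning paths as supervision to ensure precision and diversity. Concretely, it starts with human-free exploration among atomic functions (e.g., maximum extraction and arithmetic operations) to generate diverse function chains, which a moderate-sized open-source LLM then translates into linguistic rationales and questions. Function-governed generation reduces hallucinations, increases data diversity, and provides explainability, while avoiding reliance on extremely large models. Using CoF, the paper builds the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement.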

Link: https://arxiv.org/abs/2503.16260
Authors: Zijian Li, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Affiliations: Microsoft Research; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Visual reasoning is crucial for multimodal large language models (MLLMs) to address complex chart queries, yet high-quality rationale data remains scarce. Existing methods leveraged (M)LLMs for data generation, but direct prompting often yields limited precision and diversity. In this paper, we propose Chain of Functions (CoF), a novel programmatic reasoning data generation pipeline that utilizes freely-explored reasoning paths as supervision to ensure data precision and diversity. Specifically, it starts with human-free exploration among the atomic functions (e.g., maximum data and arithmetic operations) to generate diverse function chains, which are then translated into linguistic rationales and questions with only a moderate open-source LLM. CoF provides multiple benefits: 1) Precision: function-governed generation reduces hallucinations compared to freeform generation; 2) Diversity: enumerating function chains enables varied question taxonomies; 3) Explainability: function chains serve as built-in rationales, allowing fine-grained evaluation beyond overall accuracy; 4) Practicality: eliminating reliance on extremely large models. Employing CoF, we construct the ChartCoF dataset, with 1.4k complex reasoning Q&A for fine-grained analysis and 50k Q&A for reasoning enhancement. The fine-grained evaluation on ChartCoF reveals varying performance across question taxonomies for each MLLM, and the experiments also show that finetuning with ChartCoF achieves state-of-the-art performance among same-scale MLLMs on widely used benchmarks. Furthermore, the novel paradigm of function-governed rationale generation in CoF could inspire broader applications beyond charts.
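
The function-chain idea can be pictured with a toy example: a few atomic functions over chart data are combined exhaustively into chains, and each chain is executed to obtain an exact answer that a language model could later verbalize as a question and rationale. The atomic functions, the chain shape (two selectors followed by one operator), and the chart values are all assumptions made for this sketch; the actual CoF function set is not listed in the abstract.

```python
from itertools import product

# Toy chart data: one value per year.
chart = {"2021": 12.0, "2022": 18.0, "2023": 15.0}

# Atomic functions: selectors pick a value, operators combine two values.
SELECTORS = {
    "max": lambda data: max(data.values()),
    "min": lambda data: min(data.values()),
}
OPERATORS = {
    "difference": lambda a, b: a - b,
    "ratio": lambda a, b: a / b,
}

def enumerate_chains():
    """Enumerate every (selector, selector, operator) chain with distinct selectors."""
    chains = []
    for (s1, s2), op in product(product(SELECTORS, repeat=2), OPERATORS):
        if s1 != s2:
            chains.append((s1, s2, op))
    return chains

def execute(chain, data):
    """Run a chain on chart data; the result is exact by construction."""
    s1, s2, op = chain
    return OPERATORS[op](SELECTORS[s1](data), SELECTORS[s2](data))

chains = enumerate_chains()
answers = {chain: execute(chain, chart) for chain in chains}
```

Because every answer comes from executing the chain rather than from free-form generation, the chain itself doubles as a step-by-step rationale, for example "take the maximum, take the minimum, subtract".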

[CV-31] Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

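[Quick read]: This paper addresses the sharp growth in key-value (KV) cache memory that video large language models (VideoLLMs) incur at inference time from processing thousands of visual tokens, a bottleneck for both inference speed and memory usage. KV cache quantization is widely used against this problem, but its limits below 2 bits have not been investigated. The paper proposes VidKV, a plug-and-play KV cache quantization method that compresses the KV cache to lower than 2 bits with almost no loss in model performance.

The key ideas of VidKV are: (1) for keys, a mixed-precision quantization strategy in the channel dimension, with 2-bit quantization for anomalous channels and 1-bit quantization combined with the fast Fourier transform (FFT) for normal channels; (2) for values, 1.58-bit quantization with selective filtering of semantically salient visual tokens for targeted preservation, giving a better trade-off between precision and model performance. The study also finds that the value cache of VideoLLMs should be quantized per channel rather than per token as proposed in prior KV cache quantization work for LLMs. On six benchmarks with LLaVA-OV-7B and Qwen2.5-VL-7B, VidKV compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop relative to the FP16 counterparts.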

Link: https://arxiv.org/abs/2503.16257
Authors: Keda Tao, Haoxuan You, Yang Sui, Can Qin, Huan Wang
Affiliations: Westlake University; Xidian University; Columbia University; Rice University; Salesforce AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages

Abstract:Video large language models (VideoLLMs) have demonstrated the capability to process longer video inputs and enable complex reasoning and analysis. However, due to the thousands of visual tokens from the video frames, key-value (KV) cache can significantly increase memory requirements, becoming a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to address this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs hardly hurts the model performance, while the limit of KV cache quantization at even lower bit widths has not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method to compress the KV cache to lower than 2 bits. Specifically, (1) for key, we propose a mixed-precision quantization strategy in the channel dimension, where we perform 2-bit quantization for anomalous channels and 1-bit quantization combined with FFT for normal channels; (2) for value, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, for a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized in a per-channel fashion instead of the per-token fashion proposed by prior KV cache quantization works for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
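
The term "1.58-bit" refers to ternary codes: three levels carry log2(3) ≈ 1.58 bits per value. The sketch below shows one simple ternary scheme applied per channel, so that each channel of the value cache gets its own scale computed across tokens. The absolute-mean scale and the rounding rule are assumptions borrowed from common ternary quantization practice, not the quantizer described in the paper, and the token-filtering step is omitted.

```python
def ternary_quantize_per_channel(values):
    """Quantize a [tokens][channels] matrix to codes in {-1, 0, +1},
    using one scale per channel (assumed: mean absolute value)."""
    num_channels = len(values[0])
    scales = []
    codes = [[0] * num_channels for _ in values]
    for c in range(num_channels):
        column = [row[c] for row in values]
        scale = sum(abs(v) for v in column) / len(column) or 1.0
        scales.append(scale)
        for t, v in enumerate(column):
            # Round to the nearest integer, then clamp to the ternary range.
            codes[t][c] = max(-1, min(1, round(v / scale)))
    return codes, scales

def dequantize(codes, scales):
    """Reconstruct approximate values from ternary codes and channel scales."""
    return [[q * s for q, s in zip(row, scales)] for row in codes]

# Toy value cache: 3 tokens x 2 channels with very different channel ranges.
value_cache = [[0.9, -10.0], [-1.1, 0.5], [0.05, 9.5]]
codes, scales = ternary_quantize_per_channel(value_cache)
restored = dequantize(codes, scales)
```

Per-channel scales matter here because the second channel is roughly ten times larger than the first; a single shared scale would collapse the small channel to zeros.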

[CV-32] M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation

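[Quick read]: This paper addresses the limited performance of unsupervised, training-free point-prompt-based interactive segmentation. Its key contributions are depth-guided Markov maps and an adaptive score function. Depth is integrated as an additional modality to build depth-guided Markov-maps that exploit depth cues. To mitigate the segment size fluctuations that occur during interaction, the paper models prompting as a sequential process and proposes an adaptive score function that considers the previous segmentation and the current prompt point, preventing unreasonable changes in segment size. These changes significantly improve the Number of Clicks (NoC) and mean Intersection over Union (mIoU) relative to M2N2; on the more challenging DAVIS and HQSeg44K datasets, the unsupervised method is competitive in NoC with supervised methods such as SAM and SimpleClick.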

Link: https://arxiv.org/abs/2503.16254
Authors: Markus Karmann, Peng-Tao Jiang, Bo Li, Onay Urfalioglu
Affiliations: Vivo Tech Research GmbH; vivo Mobile Communication Co., Ltd, Shanghai, China.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Markov Map Nearest Neighbor V2 (M2N2V2), a novel and simple, yet effective approach which leverages depth guidance and attention maps for unsupervised and training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov-maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model the prompting as a sequential process and propose a novel adaptive score function which considers the previous segmentation and the current prompt point in order to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 in all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves competitive results compared to supervised methods like SAM and SimpleClick in the more challenging DAVIS and HQSeg44K datasets in the NoC metric, reducing the gap between supervised and unsupervised methods.

[CV-33] RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy Fairness and Utility in Autonomous Vehicles

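[Quick read]: This paper addresses the balance between privacy and fairness when Federated Learning (FL) is used to improve perception models for autonomous vehicles (AVs). Existing FL frameworks struggle to satisfy privacy, fairness, and robustness together, leading to performance disparities across demographic groups. Privacy-preserving techniques such as differential privacy reduce data leakage risk but worsen fairness by restricting access to the sensitive attributes needed for bias correction. The paper introduces RESFL, which combines adversarial privacy disentanglement with uncertainty-guided fairness-aware aggregation to optimize privacy and fairness jointly. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risk while maintaining fairness. The uncertainty-aware aggregation uses an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence. In evaluation, RESFL improves detection accuracy, reduces fairness disparities, lowers privacy attack success rates, and is more robust than other approaches.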

Link: https://arxiv.org/abs/2503.16251
Authors: Dawood Wasif, Terrence J. Moore, Jin-Hee Cho
Affiliations: Virginia Tech; U.S. Army Research Laboratory
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: Submitted to PETS 2025 (under review)

Abstract:Autonomous vehicles (AVs) increasingly rely on Federated Learning (FL) to enhance perception models while preserving privacy. However, existing FL frameworks struggle to balance privacy, fairness, and robustness, leading to performance disparities across demographic groups. Privacy-preserving techniques like differential privacy mitigate data leakage risks but worsen fairness by restricting access to sensitive attributes needed for bias correction. This work explores the trade-off between privacy and fairness in FL-based object detection for AVs and introduces RESFL, an integrated solution optimizing both. RESFL incorporates adversarial privacy disentanglement and uncertainty-guided fairness-aware aggregation. The adversarial component uses a gradient reversal layer to remove sensitive attributes, reducing privacy risks while maintaining fairness. The uncertainty-aware aggregation employs an evidential neural network to weight client updates adaptively, prioritizing contributions with lower fairness disparities and higher confidence. This ensures robust and equitable FL model updates. We evaluate RESFL on the FACET dataset and CARLA simulator, assessing accuracy, fairness, privacy resilience, and robustness under varying conditions. RESFL improves detection accuracy, reduces fairness disparities, and lowers privacy attack success rates while demonstrating superior robustness to adversarial conditions compared to other approaches.
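
A minimal way to picture fairness- and uncertainty-aware aggregation is a weighted average of client updates in which a client's weight falls as its fairness disparity or predictive uncertainty rises. The exponential weighting, the `temperature` parameter, and the way the two signals are summed are hypothetical choices for this sketch; the abstract states only that an evidential neural network is used to weight client updates adaptively.

```python
import math

def aggregate(updates, disparities, uncertainties, temperature=1.0):
    """Weighted average of client updates.

    Clients with lower fairness disparity and lower uncertainty
    (higher confidence) receive larger weights (assumed softmax form).
    """
    raw = [math.exp(-(d + u) / temperature)
           for d, u in zip(disparities, uncertainties)]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(updates[0])
    merged = [sum(w * upd[i] for w, upd in zip(weights, updates))
              for i in range(dim)]
    return merged, weights

# Two clients: the first is fairer and more confident than the second.
updates = [[1.0, 1.0], [3.0, 3.0]]
merged, weights = aggregate(
    updates,
    disparities=[0.1, 0.9],
    uncertainties=[0.1, 0.7],
)
```

The merged update lands closer to the first client's update, reflecting its lower disparity and uncertainty.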

[CV-34] OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection

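[Quick read]: This paper addresses how to ensure the trustworthiness of AI systems in healthcare when they face unexpected or anomalous inputs. Its key contribution is the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods in medical imaging. OpenMIBOOD includes three benchmarks from diverse medical domains, covering 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. Evaluating 24 post-hoc methods on these benchmarks provides a standardized reference for developing and fairly comparing OOD detection methods. The results show that findings from large-scale OOD benchmarks on natural images do not carry over to medical applications, which underscores the need for such benchmarks in the medical field. The repository is available at the link given in the abstract.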

Link: https://arxiv.org/abs/2503.16247
Authors: Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
Affiliations: Regensburg Medical Image Computing (ReMIC), OTH Regensburg; Regensburg Center of Health Sciences and Technology (RCHST), OTH Regensburg
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at this https URL.
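To make "post-hoc OOD detection" concrete, here is a minimal sketch of one classic post-hoc score, maximum softmax probability (MSP), together with a pairwise AUROC. It is illustrative only: the abstract does not name the 24 methods OpenMIBOOD evaluates, and the toy logits and function names below are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: higher means 'looks in-distribution'."""
    return max(softmax(logits))

def auroc(id_scores, ood_scores):
    """Chance that a random ID sample outscores a random OOD sample (ties count half)."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Toy logits (made up): confident predictions for ID, nearly flat ones for OOD.
id_logits = [[6.0, 0.5, 0.2], [0.1, 5.0, 0.3]]
ood_logits = [[1.0, 1.1, 0.9], [0.4, 0.5, 0.6]]
id_scores = [msp_score(row) for row in id_logits]
ood_scores = [msp_score(row) for row in ood_logits]
print(auroc(id_scores, ood_scores))
```

A post-hoc score like this needs only the outputs of an already trained classifier, which is why a benchmark can compare many such methods on the same frozen model.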

[CV-35] Temporal Score Analysis for Understanding and Correcting Diffusion Artifacts

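Quick read: This paper addresses visual artifacts in diffusion models, a long-standing problem that persists even after training on massive datasets. Current mainstream methods rely on supervised detectors and do not explain why the artifacts arise. The paper's key step is an analysis that splits the diffusion generative process into three phases, Profiling, Mutation, and Refinement, and finds that artifacts typically emerge in the Mutation phase: certain regions show anomalous score dynamics over time, which abruptly disrupts the normal evolution pattern. Building on this insight, the paper proposes ASCED (Abnormal Score Correction for Enhancing Diffusion), which detects artifacts by monitoring abnormal score dynamics during diffusion and applies a trajectory-aware, on-the-fly mitigation strategy that generates appropriate noise in the detected regions. Unlike most existing methods, which apply post hoc corrections such as a noising-denoising pass after generation, ASCED's mitigation is integrated into the existing diffusion process, giving more effective artifact localization and removal. Experiments show that the method reduces artifacts across multiple domains and matches or surpasses existing supervised methods without additional training.

Link: https://arxiv.org/abs/2503.16218
Authors: Yu Cao, Zengqun Zhao, Ioannis Patras, Shaogang Gong
Affiliations: Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: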

Abstract:Visual artifacts remain a persistent challenge in diffusion models, even with training on massive datasets. Current solutions primarily rely on supervised detectors, yet lack understanding of why these artifacts occur in the first place. In our analysis, we identify three distinct phases in the diffusion generative process: Profiling, Mutation, and Refinement. Artifacts typically emerge during the Mutation phase, where certain regions exhibit anomalous score dynamics over time, causing abrupt disruptions in the normal evolution pattern. This temporal nature explains why existing methods focusing only on spatial uncertainty of the final output fail at effective artifact localization. Based on these insights, we propose ASCED (Abnormal Score Correction for Enhancing Diffusion), which detects artifacts by monitoring abnormal score dynamics during the diffusion process and applies a trajectory-aware on-the-fly mitigation strategy that generates appropriate noise in the detected areas. Unlike most existing methods that apply post hoc corrections, e.g., by applying a noising-denoising scheme after generation, our mitigation strategy operates seamlessly within the existing diffusion process. Extensive experiments demonstrate that our proposed approach effectively reduces artifacts across diverse domains, matching or surpassing existing supervised methods without additional training.

[CV-36] VP-NTK: Exploring the Benefits of Visual Prompting in Differentially Private Data Synthesis ICASSP2025

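Quick read: This paper addresses the low utility of differentially private (DP) synthetic data on high-resolution image tasks. Its key innovation is combining visual prompting (VP) with DP-NTK, a DP generator that uses the neural tangent kernel (NTK) to train DP generative models. The combination yields a significant gain: accuracy on high-resolution image datasets rises from 0.644 ± 0.044 to 0.769, improving the practical utility of DP synthetic data.

Link: https://arxiv.org/abs/2503.16195
Authors: Chia-Yi Hsu, Jia-You Chen, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, Chun-Ying Huang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by ICASSP 2025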

Abstract:Differentially private (DP) synthetic data has become the de facto standard for releasing sensitive data. However, many DP generative models suffer from the low utility of synthetic data, especially for high-resolution images. On the other hand, one of the emerging techniques in parameter efficient fine-tuning (PEFT) is visual prompting (VP), which allows well-trained existing models to be reused for the purpose of adapting to subsequent downstream tasks. In this work, we explore such a phenomenon in constructing captivating generative models with DP constraints. We show that VP in conjunction with DP-NTK, a DP generator that exploits the power of the neural tangent kernel (NTK) in training DP generative models, achieves a significant performance boost, particularly for high-resolution image datasets, with accuracy improving from 0.644 ± 0.044 to 0.769. Lastly, we perform ablation studies on the effect of different parameters that influence the overall performance of VP-NTK. Our work demonstrates a promising step forward in improving the utility of DP synthetic data, particularly for high-resolution images.

[CV-37] Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction

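Quick read: This paper addresses the quantization error introduced by vector-quantization-based discretization in autoregressive image generation, while avoiding the added modeling difficulty that comes with the larger vocabulary of an enlarged codebook. The key observation is that large codebooks are highly redundant: tokens with similar codeword representations have similar effects on the final generated image. Based on this, the paper proposes coarse-to-fine (CTF) prediction, which reduces redundancy by assigning the same coarse label to similar tokens. The framework has two stages: first, an autoregressive model sequentially predicts the coarse label of each token; second, an auxiliary model predicts the fine-grained labels of all tokens simultaneously, conditioned on the coarse labels. Experiments on ImageNet show that the method outperforms baselines, with an average Inception Score improvement of 59 points, and that it samples faster despite the extra inference step.

Link: https://arxiv.org/abs/2503.16194
Authors: Ziyao Guo, Kaipeng Zhang, Michael Qizhe Shieh
Affiliations: National University of Singapore; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress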

Abstract:Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods like VQ-VAE. To alleviate the quantization errors that exist in VQ-VAE, recent works tend to use larger codebooks. However, this expands the vocabulary size accordingly, complicating the autoregressive modeling task. This paper aims to find a way to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Through empirical investigation, we discover that tokens with similar codeword representations produce similar effects on the final generated image, revealing significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF), realized by assigning the same coarse label to similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Experiments on ImageNet demonstrate our method’s superior performance, achieving an average improvement of 59 points in Inception Score compared to baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
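The coarse-label idea can be sketched in a few lines. The example below groups a toy codebook with a small k-means and then maps a token sequence to coarse labels. Using k-means for the grouping is an assumption made for illustration (the abstract only says similar tokens share a coarse label), and the codebook values and function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two codewords."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the closest center (first one wins on ties)."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))

def kmeans_labels(codebook, k, iters=10):
    """Assign each codeword a coarse label via k-means with a deterministic init."""
    centers = [list(v) for v in codebook[:k]]
    labels = [0] * len(codebook)
    for _ in range(iters):
        labels = [nearest(v, centers) for v in codebook]
        for c in range(k):
            members = [codebook[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-D codebook with two obvious groups of similar codewords.
codebook = [[0.0, 0.0], [5.0, 5.0], [0.1, -0.1], [5.2, 4.9]]
coarse_of_code = kmeans_labels(codebook, k=2)

# Stage 1 would model this shorter-vocabulary sequence autoregressively.
fine_tokens = [2, 0, 3, 1]
coarse_seq = [coarse_of_code[t] for t in fine_tokens]

# Stage 2 only has to choose among the fine tokens inside each coarse group.
candidates = {}
for code, lab in enumerate(coarse_of_code):
    candidates.setdefault(lab, []).append(code)

print(coarse_seq, candidates)
```

In this toy setup the sequential model predicts over 2 coarse labels instead of 4 fine tokens, and the second stage resolves each position within its group.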

[CV-38] CLS-RL: Image Classification with Rule-Based Reinforcement Learning

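Quick read: This paper addresses overfitting and limited gains when fine-tuning Multimodal Large Language Models (MLLMs) for few-shot classification. Supervised fine-tuning (SFT) can improve MLLM classification but tends to overfit severely and may even push performance below the zero-shot approach. To address this, the paper proposes CLS-RL (Classification Reinforcement Learning), a method inspired by rule-based reinforcement learning that fine-tunes MLLMs using verifiable signals as rewards. CLS-RL outperforms SFT on most datasets and achieves higher average accuracy in both base-to-new and few-shot settings. The authors also observe a "free-lunch" phenomenon: a model fine-tuned on one dataset can beat the zero-shot model on other datasets that differ in distribution and class names. Finally, inspired by work on inference-time thinking, the paper proposes No-Thinking-CLS-RL, which minimizes thinking during training by using an equality accuracy reward and achieves better in-domain performance and generalization with much less fine-tuning time.

Link: https://arxiv.org/abs/2503.16188
Authors: Ming Li, Shitian Zhao, Jike Zhong, Yuxiang Lai, Kaipeng Zhang
Affiliations: Shanghai AI Laboratory; University of Southern California; Emory University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint, work in progress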

Abstract:Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We found that SFT can cause severe overfitting issues and may even degrade performance over the zero-shot approach. To address this challenge, inspired by the recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as reward to fine-tune MLLMs. We discovered that CLS-RL outperforms SFT in most datasets and has a much higher average accuracy in both base-to-new and few-shot learning settings. Moreover, we observed a free-lunch phenomenon for CLS-RL; when models are fine-tuned on a particular dataset, their performance on other distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent works in inference time thinking, we re-examine the ‘thinking process’ during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning, proposing that this may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes thinking processes during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, the No-Thinking-CLS-RL method achieves better in-domain performance and generalization capabilities than CLS-RL.
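A rule-based, verifiable reward for classification can be as simple as a string check. The sketch below contrasts a reward that expects a thinking section followed by an answer with an "equality" reward that expects the bare class name. The `<think>`/`<answer>` tag names, the normalization, and the function names are assumptions for illustration; the abstract does not specify the exact reward format.

```python
import re

# Hypothetical output format: a thinking section followed by an answer section.
FORMAT_RE = re.compile(r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def format_reward(response):
    """1.0 if the response follows the think-then-answer template, else 0.0."""
    return 1.0 if FORMAT_RE.match(response) else 0.0

def accuracy_reward(response, label):
    """1.0 if the extracted answer matches the ground-truth class name."""
    match = ANSWER_RE.search(response)
    return 1.0 if match and normalize(match.group(1)) == normalize(label) else 0.0

def equality_reward(response, label):
    """No-thinking variant: the whole response must equal the class name."""
    return 1.0 if normalize(response) == normalize(label) else 0.0

with_thinking = "<think>stripes, whiskers</think><answer>Tabby cat</answer>"
print(format_reward(with_thinking), accuracy_reward(with_thinking, "tabby cat"))
print(equality_reward(with_thinking, "tabby cat"), equality_reward("Tabby Cat", "tabby cat"))
```

Under the equality reward, any extra reasoning text earns zero, so the policy is pushed toward emitting only the label, which is one plausible reading of "minimizes thinking processes during training".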

[CV-39] MapGlue: Multimodal Remote Sensing Image Matching

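Quick read: This paper addresses the severe challenges that geometric, radiometric, and viewpoint differences pose for multimodal remote sensing image (MRSI) matching, where the limited scale and diversity of existing unimodal datasets hold back deep learning methods. The solution has two parts. The first is MapData, a large-scale multimodal dataset covering 233 sampling points with original images from 7,000x5,000 to 20,000x15,000 pixels; after rigorous cleaning it provides 121,781 aligned electronic map-visible image pairs of 512x512 pixels with hybrid manual-automated ground truth. The second is MapGlue, a universal MRSI matching framework that integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features, enabling global-to-local interaction and making descriptors more robust to modality-specific distortions. Experiments show that MapGlue achieves higher matching accuracy under complex conditions and generalizes to unseen modalities without retraining.

Link: https://arxiv.org/abs/2503.16185
Authors: Peihao Wu, Yongxiang Yao, Wenfei Zhang, Dong Wei, Yi Wan, Yansheng Li, Yongjun Zhang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The dataset and code are available at this https URL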

Abstract:Multimodal remote sensing image (MRSI) matching is pivotal for cross-modal fusion, localization, and object detection, but it faces severe challenges due to geometric, radiometric, and viewpoint discrepancies across imaging modalities. Existing unimodal datasets lack scale and diversity, limiting deep learning solutions. This paper proposes MapGlue, a universal MRSI matching framework, and MapData, a large-scale multimodal dataset addressing these gaps. Our contributions are twofold. MapData, a globally diverse dataset spanning 233 sampling points, offers original images (7,000x5,000 to 20,000x15,000 pixels). After rigorous cleaning, it provides 121,781 aligned electronic map-visible image pairs (512x512 pixels) with hybrid manual-automated ground truth, addressing the scarcity of scalable multimodal benchmarks. MapGlue integrates semantic context with a dual graph-guided mechanism to extract cross-modal invariant features. This structure enables global-to-local interaction, enhancing descriptor robustness against modality-specific distortions. Extensive evaluations on MapData and five public datasets demonstrate MapGlue’s superiority in matching accuracy under complex conditions, outperforming state-of-the-art methods. Notably, MapGlue generalizes effectively to unseen modalities without retraining, highlighting its adaptability. This work addresses longstanding challenges in MRSI matching by combining scalable dataset construction with a robust, semantics-driven framework. Furthermore, MapGlue shows strong generalization capabilities on other modality matching tasks for which it was not specifically trained. The dataset and code are available at this https URL.

[CV-40] Narrowing Class-Wise Robustness Gaps in Adversarial Training ICLR2025

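Quick read: This paper addresses the accuracy drop caused by data distribution shift. It focuses on adversarial training, a data-augmentation strategy intended to improve robustness to worst-case distribution shifts caused by adversarial examples. Adversarial training, however, can weaken generalization to clean examples and worsen performance imbalance across classes. The paper's key finding is that enhanced labeling during adversarial training preserves adversarial robustness while easing class imbalance. In the reported experiments, it improves adversarial robustness by 53.50% and reduces class imbalance by 5.73%, giving higher accuracy in both clean and adversarial settings.

Link: https://arxiv.org/abs/2503.16179
Authors: Fatemeh Amerehi, Patrick Healy
Affiliations: University of Limerick
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 figures, ICLR 2025 Workshop on Foundation Models in the Wild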

Abstract:Efforts to address declining accuracy as a result of data shifts often involve various data-augmentation strategies. Adversarial training is one such method, designed to improve robustness to worst-case distribution shifts caused by adversarial examples. While this method can improve robustness, it may also hinder generalization to clean examples and exacerbate performance imbalances across different classes. This paper explores the impact of adversarial training on both overall and class-specific performance, as well as its spill-over effects. We observe that enhanced labeling during training boosts adversarial robustness by 53.50% and mitigates class imbalances by 5.73%, leading to improved accuracy in both clean and adversarial settings compared to standard adversarial training.

[CV-41] OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering CCL

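Quick read: This paper addresses a weakness of existing scene-division methods for large-scene reconstruction with 3D Gaussian splatting: they are occlusion-agnostic, so cameras within a region are weakly correlated and contribute little on average to the overall reconstruction. The key solution is an occlusion-aware division strategy that clusters training cameras by position and co-visibility to form regions, so that cameras within each region are more strongly correlated and contribute more on average, which enables high-quality reconstruction. The paper also proposes a region-based rendering technique that culls Gaussians invisible to the region containing the current viewpoint, which significantly speeds up large-scene rendering without reducing quality.

Link: https://arxiv.org/abs/2503.16177
Authors: Shiyong Liu, Xiao Tang, Zhihao Li, Yingfan He, Chongjie Ye, Jianzhuang Liu, Binxiao Huang, Shunbo Zhou, Xiaofei Wu
Affiliations: Huawei Noah’s Ark Lab; The Chinese University of Hong Kong (Shenzhen); Shenzhen Institute of Advanced Technology; The University of Hong Kong; Huawei Embodied Intelligence Lab
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL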

Abstract:In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speed compared to existing state-of-the-art approaches. Project page: this https URL.

[CV-42] Guardians of Generation: Dynamic Inference-Time Copyright Shielding with Adaptive Guidance for AI Image Generation

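Quick read: This paper addresses the problem that modern text-to-image generative models can inadvertently reproduce copyrighted content memorized from their training data, which creates a risk of copyright infringement. It proposes Guardians of Generation, a model-agnostic inference-time framework for dynamic copyright shielding. The key point is that it requires no retraining or modification of model weights and integrates into existing diffusion pipelines. It relies on an adaptive guidance mechanism with three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps for features that indicate copyrighted content. If such content is detected, the prompt rewriting module transforms the user's prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt's intended semantics. The guidance adjustment module steers generation away from flagged content by modulating the model's sampling trajectory. Together the three components form a shield that maintains copyright compliance while preserving creative fidelity. Experiments on several generative models (Stable Diffusion, SDXL, and Flux) show a substantial reduction in copyrighted content with negligible impact on output fidelity and alignment with user intent.

Link: https://arxiv.org/abs/2503.16171
Authors: Soham Roy, Abhishek Mishra, Shirish Karande, Murari Mandal
Affiliations: RespAI Lab, KIIT Bhubaneswar; TCS Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: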

Abstract:Modern text-to-image generative models can inadvertently reproduce copyrighted content memorized in their training data, raising serious concerns about potential copyright infringement. We introduce Guardians of Generation, a model agnostic inference time framework for dynamic copyright shielding in AI image generation. Our approach requires no retraining or modification of the generative model weights, instead integrating seamlessly with existing diffusion pipelines. It augments the generation process with an adaptive guidance mechanism comprising three components: a detection module, a prompt rewriting module, and a guidance adjustment module. The detection module monitors user prompts and intermediate generation steps to identify features indicative of copyrighted content before they manifest in the final output. If such content is detected, the prompt rewriting mechanism dynamically transforms the user’s prompt by sanitizing or replacing references that could trigger copyrighted material while preserving the prompt’s intended semantics. The adaptive guidance module adaptively steers the diffusion process away from flagged content by modulating the model’s sampling trajectory. Together, these components form a robust shield that enables a tunable balance between preserving creative fidelity and ensuring copyright compliance. We validate our method on a variety of generative models such as Stable Diffusion, SDXL, and Flux, demonstrating substantial reductions in copyrighted content generation with negligible impact on output fidelity or alignment with user intent. This work provides a practical, plug-and-play safeguard for generative image models, enabling more responsible deployment under real-world copyright constraints. Source code is available at: this https URL

[CV-43] Iterative Optimal Attention and Local Model for Single Image Rain Streak Removal

Quick read: This paper addresses the degradation of image quality in vision-based measurement systems (VBMS) under adverse weather, particularly rain, where rain streaks blur images and reduce contrast and so raise the risk of inaccurate evaluations and misinterpretations. The key solution is the Expectation Maximization Reconstruction Transformer (EMResformer) for single image rain streak removal. It retains the key self-attention values for feature aggregation and introduces an Expectation Maximization Block to eliminate superfluous information and restore a cleaner background; a Local Model Residual Block further strengthens local feature extraction. The method improves deraining performance while keeping model complexity in check and outperforms prior methods on synthetic and real-world datasets. The authors also show that higher-quality imaging improves the accuracy and reliability of VBMS tasks.

Link: https://arxiv.org/abs/2503.16165
Authors: Xiangyu Li, Wanshu Fan, Yue Shen, Cong Wang, Wei Wang, Xin Yang, Qiang Zhang, Dongsheng Zhou
Affiliations: National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University, Dalian, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Cyber Science and Technology, Sun Yat-Sen University, Shenzhen, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, China; National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University, School of Computer Science and Technology, Dalian University of Technology, Dalian, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 14 pages, 14 figures, 6 tables

Abstract:High-fidelity imaging is crucial for the successful safety supervision and intelligent deployment of vision-based measurement systems (VBMS). It ensures high-quality imaging in VBMS, which is fundamental for reliable visual measurement and analysis. However, imaging quality can be significantly impaired by adverse weather conditions, particularly rain, leading to blurred images and reduced contrast. Such impairments increase the risk of inaccurate evaluations and misinterpretations in VBMS. To address these limitations, we propose an Expectation Maximization Reconstruction Transformer (EMResformer) for single image rain streak removal. The EMResformer retains the key self-attention values for feature aggregation, enhancing local features to produce superior image reconstruction. Specifically, we propose an Expectation Maximization Block seamlessly integrated into the single image rain streak removal network, enhancing its ability to eliminate superfluous information and restore a cleaner background image. Additionally, to further enhance local information for improved detail rendition, we introduce a Local Model Residual Block, which integrates two local model blocks along with a sequence of convolutions and activation functions. This integration synergistically facilitates the extraction of more pertinent features for enhanced single image rain streak removal. Extensive experiments validate that our proposed EMResformer surpasses current state-of-the-art single image rain streak removal methods on both synthetic and real-world datasets, achieving an improved balance between model complexity and single image deraining performance. Furthermore, we evaluate the effectiveness of our method in VBMS scenarios, demonstrating that high-quality imaging significantly improves the accuracy and reliability of VBMS tasks.

[CV-44] FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

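Quick read: This paper examines the role of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) models for text-to-image generation, in particular how much self-attention layers rely on positional embedding versus query-key similarity during generation. It presents the first mechanistic analysis of RoPE-based MMDiT models such as FLUX, introducing an automated probing strategy that disentangles positional information from content dependencies by strategically manipulating RoPE. The analysis reveals distinct dependency patterns that do not correlate straightforwardly with depth, offering new insight into layer-specific roles in RoPE-based MMDiT. Based on these findings, the authors propose a training-free, task-specific image editing framework that groups editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, a tailored key-value injection strategy matches the characteristics of the task. Extensive qualitative and quantitative evaluations show that the method outperforms existing techniques, particularly in preserving original semantic content and achieving seamless modifications.

Link: https://arxiv.org/abs/2503.16153
Authors: Tianyi Wei, Yifan Zhou, Dongdong Chen, Xingang Pan
Affiliations: S-Lab, Nanyang Technological University; Microsoft GenAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL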

Abstract:The integration of Rotary Position Embedding (RoPE) in Multimodal Diffusion Transformer (MMDiT) has significantly enhanced text-to-image generation quality. However, the fundamental reliance of self-attention layers on positional embedding versus query-key similarity during generation remains an intriguing question. We present the first mechanistic analysis of RoPE-based MMDiT models (e.g., FLUX), introducing an automated probing strategy that disentangles positional information versus content dependencies by strategically manipulating RoPE during generation. Our analysis reveals distinct dependency patterns that do not straightforwardly correlate with depth, offering new insights into the layer-specific roles in RoPE-based MMDiT. Based on these findings, we propose a training-free, task-specific image editing framework that categorizes editing tasks into three types: position-dependent editing (e.g., object addition), content similarity-dependent editing (e.g., non-rigid editing), and region-preserved editing (e.g., background replacement). For each type, we design tailored key-value injection strategies based on the characteristics of the editing task. Extensive qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art approaches, particularly in preserving original semantic content and achieving seamless modifications.
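For readers unfamiliar with RoPE, the sketch below shows the standard rotation and the property that makes "manipulating RoPE" a useful probe: the query-key dot product depends only on the relative position, and giving both tokens the same position removes the positional contribution entirely. This is the textbook 1-D formulation, not the authors' probing code; the vectors and function names are made up for the example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to the position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for the pair starting at index i
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2) at different absolute positions gives the same score.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))

# Assigning both tokens the same position leaves pure content similarity.
score_same_pos = dot(rope(q, 5), rope(k, 5))

print(score_near, score_far, score_same_pos, dot(q, k))
```

Comparing attention behavior with and without the positional rotation is one simple way to separate position dependence from content dependence, in the spirit of the analysis described above.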

[CV-45] Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing CVPR2025

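Quick read: This paper addresses Quad Bayer demosaicing for Hybrid Event-based Vision Sensors (HybridEVS), where existing learning-based methods that model long-range dependencies are too computationally complex to deploy on mobile devices. The key solution is BMTNet, a lightweight Mamba-based binary neural network for efficient, high-performing demosaicing of HybridEVS RAW images. Its core innovations are: (1) a hybrid Binarized Mamba-Transformer (BMT) architecture that combines the strengths of Mamba and Swin Transformer to capture both global and local dependencies; and (2) a binarized Mamba (Bi-Mamba) that binarizes all projections while keeping the Selective Scan in full precision, and that incorporates additional global visual information to enhance context and mitigate precision loss. These designs significantly reduce computational complexity while maintaining performance, providing a lightweight solution for real-world edge devices.

Link: https://arxiv.org/abs/2503.16134
Authors: Shiyang Zhou, Haijin Zeng, Yunfan Lu, Tong Shao, Ke Tang, Yongyong Chen, Jie Liu, Jingyong Su
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025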

Abstract:Quad Bayer demosaicing is the central challenge for enabling the widespread application of Hybrid Event-based Vision Sensors (HybridEVS). Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. First, to effectively capture both global and local dependencies, we introduce a hybrid Binarized Mamba-Transformer architecture that combines the strengths of the Mamba and Swin Transformer architectures. Next, to significantly reduce computational complexity, we propose a binarized Mamba (Bi-Mamba), which binarizes all projections while retaining the core Selective Scan in full precision. Bi-Mamba also incorporates additional global visual information to enhance global context and mitigate precision loss. We conduct quantitative and qualitative experiments to demonstrate the effectiveness of BMTNet in both performance and computational efficiency, providing a lightweight demosaicing solution suited for real-world edge devices. Our codes and models are available at this https URL.
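As a rough illustration of what "binarizing a projection" means, the sketch below uses a common scheme from the binary-network literature: keep only the sign of each weight plus one full-precision scale per output row. Whether Bi-Mamba uses this exact scheme is not stated in the abstract, so treat the scaling rule and the function names as assumptions.

```python
def binarize(weights):
    """Return (alpha, signs): a per-row scale and the +/-1 sign of each weight."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs

def binary_linear(x, weight_rows):
    """Linear projection with binarized weights; the input stays full precision."""
    out = []
    for row in weight_rows:
        alpha, signs = binarize(row)
        # With +/-1 weights the inner product reduces to additions and subtractions.
        out.append(alpha * sum(s * xi for s, xi in zip(signs, x)))
    return out

weights = [[0.5, -1.5, 1.0, -1.0]]
x = [1.0, 2.0, 3.0, 4.0]
print(binarize(weights[0]))
print(binary_linear(x, weights))
```

Each weight is stored in one bit and the multiply-accumulate becomes sign flips and sums, which is where the savings on edge devices come from; keeping a sensitive component such as the Selective Scan in full precision is the trade-off the abstract describes.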

[CV-46] Coupling deep and handcrafted features to assess smile genuineness

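Quick read: This paper addresses the assessment of smile genuineness from video sequences, an important topic in facial expression recognition and its link to emotional states. Existing methods fall into two groups, those based on handcrafted features and those that rely on deep learning to extract useful features, and each has benefits and limitations. The key idea is to combine features learned by a long short-term memory (LSTM) network with handcrafted features that capture the dynamics of facial action units. Experiments indicate that the proposed solution is more effective than the baseline techniques and allows smile genuineness to be assessed in real time.

Link: https://arxiv.org/abs/2503.16128
Authors: Benedykt Pawlus, Bogdan Smolka, Jolanta Kawulok, Michal Kawulok
Affiliations: Faculty of Automatic Control, Electronics and Computer Science, Gliwice, Poland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to SPIE Defense + Commercial Sensing 2024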

Abstract:Assessing smile genuineness from video sequences is a vital topic concerned with recognizing facial expressions and linking them with the underlying emotional states. A number of techniques have been proposed that are underpinned by handcrafted features, as well as techniques that rely on deep learning to elaborate the useful features. As both of these approaches have certain benefits and limitations, in this work we propose to combine the features learned by a long short-term memory network with the features handcrafted to capture the dynamics of facial action units. The results of our experiments indicate that the proposed solution is more effective than the baseline techniques and that it allows the smile genuineness to be assessed from video sequences in real time.

[CV-47] Uncertainty Meets Diversity: A Comprehensive Active Learning Framework for Indoor 3D Object Detection CVPR2025

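Quick read: This paper addresses the heavy annotation burden of indoor 3D object detection, and in particular the fact that active learning for indoor environments remains unexplored. It proposes a novel framework that combines two criteria, uncertainty and diversity, to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, so the most ambiguous samples are prioritized. The diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. Experiments show that the method outperforms baselines by a significant margin on SUN RGB-D and ScanNetV2, reaching over 85% of fully supervised performance with only 10% of the annotation budget.

Link: https://arxiv.org/abs/2503.16125
Authors: Jiangyi Wang, Na Zhao
Affiliations: Singapore University of Technology and Design (SUTD)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025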

Abstract:Active learning has emerged as a promising approach to reduce the substantial annotation burden in 3D object detection tasks, spurring several initiatives in outdoor environments. However, its application in indoor environments remains unexplored. Compared to outdoor 3D datasets, indoor datasets face significant challenges, including fewer training samples per class, a greater number of classes, more severe class imbalance, and more diverse scene types and intra-class variances. This paper presents the first study on active learning for indoor 3D object detection, where we propose a novel framework tailored for this task. Our method incorporates two key criteria - uncertainty and diversity - to actively select the most ambiguous and informative unlabeled samples for annotation. The uncertainty criterion accounts for both inaccurate detections and undetected objects, ensuring that the most ambiguous samples are prioritized. Meanwhile, the diversity criterion is formulated as a joint optimization problem that maximizes the diversity of both object class distributions and scene types, using a new Class-aware Adaptive Prototype (CAP) bank. The CAP bank dynamically allocates representative prototypes to each class, helping to capture varying intra-class diversity across different categories. We evaluate our method on SUN RGB-D and ScanNetV2, where it outperforms baselines by a significant margin, achieving over 85% of fully-supervised performance with just 10% of the annotation budget.
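The interplay of the two criteria can be illustrated with a generic selection loop. The sketch below scores each unlabeled sample by predictive entropy (uncertainty) plus a bonus for being far from already selected samples (diversity), and picks greedily. This is a deliberately simplified stand-in: the paper's uncertainty measure and its CAP-bank joint optimization are not reproduced here, and the toy data, the weight, and the function names are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select(samples, budget, diversity_weight=0.05):
    """Greedily pick samples that are uncertain and far from those already picked."""
    selected = []
    while len(selected) < min(budget, len(samples)):
        best, best_score = None, -math.inf
        for i, (probs, feat) in enumerate(samples):
            if i in selected:
                continue
            # Novelty is the distance to the closest selected sample (0 if none yet).
            novelty = min((distance(feat, samples[j][1]) for j in selected), default=0.0)
            score = entropy(probs) + diversity_weight * novelty
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy pool: (predicted class probabilities, feature vector) per unlabeled sample.
pool = [
    ([0.50, 0.50], [0.0, 0.0]),  # most uncertain
    ([0.55, 0.45], [0.1, 0.0]),  # uncertain but nearly a duplicate of the first
    ([0.70, 0.30], [5.0, 5.0]),  # moderately uncertain, different region
    ([0.99, 0.01], [9.0, 9.0]),  # confident
]
print(select(pool, budget=2))
```

With uncertainty alone the two near-duplicate samples would both be chosen; the diversity term spends the second annotation on a different part of the feature space instead.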

[CV-48] Probabilistic Prompt Distribution Learning for Animal Pose Estimation CVPR2025

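Quick read: This paper addresses cross-species generalization in multi-species animal pose estimation, a task made difficult by substantial visual diversity and uncertainty. The key is efficient prompt learning for pretrained vision-language models such as CLIP, which eases the diversity challenges caused by unbalanced, long-tailed data distributions. The solution centers on prompt design, probabilistic prompt modeling, and cross-modal adaptation, so that prompts can compensate for cross-modal information and cope with large data variance. Specifically, the authors propose a novel probabilistic prompting approach that introduces learnable prompts together with a diversity loss that keeps the prompts distinct from one another, so that they represent diverse image attributes. They also explore three cross-modal fusion strategies at the spatial level to lessen the adverse effects of visual uncertainty. Experiments show state-of-the-art performance in both supervised and zero-shot settings.

Link: https://arxiv.org/abs/2503.16120
Authors: Jiyong Rao, Brian Nlong Zhao, Yu Wang
Affiliations: School of Computer Science and Technology, Tongji University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025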

Abstract:Multi-species animal pose estimation has emerged as a challenging yet critical task, hindered by substantial visual diversity and uncertainty. This paper tackles the problem through efficient prompt learning for Vision-Language Pretrained (VLP) models, e.g., CLIP, aiming to resolve the cross-species generalization problem. The core of the solution lies in prompt design, probabilistic prompt modeling, and cross-modal adaptation, which enable prompts to compensate for cross-modal information and effectively overcome large data variances under unbalanced data distribution. To this end, we propose a novel probabilistic prompting approach to fully explore textual descriptions, which could alleviate the diversity issues caused by the long-tail property and increase the adaptability of prompts to unseen category instances. Specifically, we first introduce a set of learnable prompts and propose a diversity loss to maintain distinctiveness among prompts, thus representing diverse image attributes. Diverse textual probabilistic representations are sampled and used as the guidance for the pose estimation. Subsequently, we explore three different cross-modal fusion strategies at the spatial level to alleviate the adverse impacts of visual uncertainty. Extensive experiments on multi-species animal pose benchmarks show that our method achieves state-of-the-art performance under both supervised and zero-shot settings. The code is available at this https URL.

[CV-49] OSLoPrompt: Bridging Low-Supervision Challenges and Open-Set Domain Generalization in CLIP CVPR2025

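[Quick read]: This paper proposes Low-Shot Open-Set Domain Generalization (LSOSDG), a new paradigm that unifies low-shot learning with open-set domain generalization (ODG). Existing prompt-based methods (for example, those built on CLIP) perform poorly in low-data regimes such as 1-shot and struggle to detect open-set samples that are semantically related to the training classes. To address this, the paper proposes OSLOPROMPT, a prompt-learning framework for CLIP with two core innovations. First, a domain-agnostic prompt-learning mechanism integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, supported by learnable domain- and class-generic visual prompts that improve cross-modal adaptability. Second, to improve outlier rejection at inference, unfamiliar samples are classified as "unknown" and specialized prompts are trained on systematically synthesized pseudo-open samples; these samples keep fine-grained relationships to the known classes and are generated by a targeted query strategy with off-the-shelf foundation models. This strengthens feature learning and lets the model detect open samples of varying granularity more effectively. Extensive benchmark evaluations show that OSLOPROMPT sets a new state of the art for LSOSDG and significantly outperforms existing methods.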

Link: https://arxiv.org/abs/2503.16106
Authors: Mohamad Hassan N C, Divyam Gupta, Mainak Singha, Sai Bhargav Rongali, Ankit Jha, Muhammad Haris Khan, Biplab Banerjee
Affiliations: Indian Institute of Technology Bombay; The LNM Institute of Information Technology (LNMIIT); Mohamed Bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:We introduce Low-Shot Open-Set Domain Generalization (LSOSDG), a novel paradigm unifying low-shot learning with open-set domain generalization (ODG). While prompt-based methods using models like CLIP have advanced DG, they falter in low-data regimes (e.g., 1-shot) and lack precision in detecting open-set samples with fine-grained semantics related to training classes. To address these challenges, we propose OSLOPROMPT, an advanced prompt-learning framework for CLIP with two core innovations. First, to manage limited supervision across source domains and improve DG, we introduce a domain-agnostic prompt-learning mechanism that integrates adaptable domain-specific cues and visually guided semantic attributes through a novel cross-attention module, besides being supported by learnable domain- and class-generic visual prompts to enhance cross-modal adaptability. Second, to improve outlier rejection during inference, we classify unfamiliar samples as “unknown” and train specialized prompts with systematically synthesized pseudo-open samples that maintain fine-grained relationships to known classes, generated through a targeted query strategy with off-the-shelf foundation models. This strategy enhances feature learning, enabling our model to detect open samples with varied granularity more effectively. Extensive evaluations across five benchmarks demonstrate that OSLOPROMPT establishes a new state-of-the-art in LSOSDG, significantly outperforming existing methods.

[CV-50] MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures

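[Quick read]: This paper addresses the automatic recognition of Markush structures in chemical literature, in particular the extraction and parsing of complex multi-modal Markush structures in patent documents. Conventional methods are limited on such structures, and current chemistry-specific and general-purpose vision-language models also fall short. The paper proposes MarkushGrapher, a multi-modal approach that jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder, merges the two representations, and auto-regressively generates a sequential graph representation of the Markush structure together with a table defining its variable groups. To offset the shortage of real-world training data, the authors design a synthetic data generation pipeline and release M2S, the first annotated benchmark of real-world Markush structures. Experiments show that the approach outperforms existing methods in most evaluation settings.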

Link: https://arxiv.org/abs/2503.16096
Authors: Lucas Morin, Valéry Weber, Ahmed Nassar, Gerhard Ingmar Meijer, Luc Van Gool, Yawei Li, Peter Staar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:The automated analysis of chemical literature holds promise to accelerate discovery in fields such as material science and drug development. In particular, search capabilities for chemical structures and Markush structures (chemical structure templates) within patent documents are valuable, e.g., for prior-art search. Advancements have been made in the automatic extraction of chemical structures from text and images, yet the Markush structures remain largely unexplored due to their complex multi-modal nature. In this work, we present MarkushGrapher, a multi-modal approach for recognizing Markush structures in documents. Our method jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. These representations are merged and used to auto-regressively generate a sequential graph representation of the Markush structure along with a table defining its variable groups. To overcome the lack of real-world training data, we propose a synthetic data generation pipeline that produces a wide range of realistic Markush structures. Additionally, we present M2S, the first annotated benchmark of real-world Markush structures, to advance research on this challenging task. Extensive experiments demonstrate that our approach outperforms state-of-the-art chemistry-specific and general-purpose vision-language models in most evaluation settings. Code, models, and datasets will be available.

[CV-51] Hyperspectral Imaging for Identifying Foreign Objects on Pork Belly

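[Quick read]: This paper addresses contaminant detection, a persistent challenge in the food processing industry, and specifically the identification of foreign objects on pork belly. Near-infrared hyperspectral imaging (HSI) over the 900-1700 nm range makes it possible to identify contaminants that traditional visual inspection often misses. The key to the solution is combining pre-processing techniques with a segmentation approach based on a lightweight Vision Transformer (ViT) to distinguish contaminants from meat, fat, and conveyor belt materials. The strategy achieves high detection accuracy and training efficiency while addressing industrial difficulties such as noise, temperature variations, and the spectral similarity between contaminants and pork belly. The experiments validate the effectiveness of HSI for improving food safety and indicate its potential for real-time use in automated quality control.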

Link: https://arxiv.org/abs/2503.16086
Authors: Gabriela Ghimpeteanu, Hayat Rajani, Josep Quintana, Rafael Garcia
Affiliations: Coronis; Universitat de Girona
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Article under review by Computers in Industry, Elsevier

Abstract:Ensuring food safety and quality is critical in the food processing industry, where the detection of contaminants remains a persistent challenge. This study presents an automated solution for detecting foreign objects on pork belly meat using hyperspectral imaging (HSI). A hyperspectral camera was used to capture data across various bands in the near-infrared (NIR) spectrum (900-1700 nm), enabling accurate identification of contaminants that are often undetectable through traditional visual inspection methods. The proposed solution combines pre-processing techniques with a segmentation approach based on a lightweight Vision Transformer (ViT) to distinguish contaminants from meat, fat, and conveyor belt materials. The adopted strategy demonstrates high detection accuracy and training efficiency, while also addressing key industrial challenges such as inherent noise, temperature variations, and spectral similarity between contaminants and pork belly. Experimental results validate the effectiveness of hyperspectral imaging in enhancing food safety, highlighting its potential for broad real-time applications in automated quality control processes.

[CV-52] Disentangled and Interpretable Multimodal Attention Fusion for Cancer Survival Prediction

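[Quick read]: This paper addresses cancer survival prediction from whole-slide images (WSI) and transcriptomics data, where existing multimodal frameworks fail to separate modality-shared from modality-specific information. This entanglement limits interpretability and may suppress discriminative features. The paper proposes Disentangled and Interpretable Multimodal Attention Fusion (DIMAF), which separates intra- and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. A loss based on Distance Correlation promotes disentanglement between these representations, and Shapley additive explanations are used to assess their relative contributions to survival prediction. On four public cancer survival datasets, DIMAF achieves a relative average improvement of 1.85% in performance and 23.7% in disentanglement over current state-of-the-art multimodal models.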

Link: https://arxiv.org/abs/2503.16069
Authors: Aniek Eijpe, Soufyan Lakbir, Melis Erdal Cesur, Sara P. Oliveira, Sanne Abeln, Wilson Silva
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 1 figure, 3 tables

Abstract:To improve the prediction of cancer survival using whole-slide images and transcriptomics data, it is crucial to capture both modality-shared and modality-specific information. However, multimodal frameworks often entangle these representations, limiting interpretability and potentially suppressing discriminative features. To address this, we propose Disentangled and Interpretable Multimodal Attention Fusion (DIMAF), a multimodal framework that separates the intra- and inter-modal interactions within an attention-based fusion mechanism to learn distinct modality-specific and modality-shared representations. We introduce a loss based on Distance Correlation to promote disentanglement between these representations and integrate Shapley additive explanations to assess their relative contributions to survival prediction. We evaluate DIMAF on four public cancer survival datasets, achieving a relative average improvement of 1.85% in performance and 23.7% in disentanglement compared to current state-of-the-art multimodal models. Beyond improved performance, our interpretable framework enables a deeper exploration of the underlying interactions between and within modalities in cancer biology.
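
Distance correlation is a standard dependence measure that is zero only when two random vectors are independent, which is why it suits a disentanglement loss. The sketch below computes the sample statistic in plain Python; how DIMAF weights or applies it is not stated in the abstract, so treat this only as an illustration of the quantity (`distance_correlation` is an illustrative name).

```python
import math


def _centered_distances(points):
    """Double-centered matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]  # d is symmetric, so row means equal column means
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] between paired samples x and y."""
    a, b = _centered_distances(x), _centered_distances(y)
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # one of the samples is constant
    return math.sqrt(max(_mean_product(a, b), 0.0) / denom)
```

Used as a penalty between the modality-specific and modality-shared embeddings of a batch, a lower value would indicate less shared information between the two representations.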

[CV-53] PoseTraj: Pose-Aware Trajectory Control in Video Diffusion

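[Quick read]: This paper addresses a weakness of existing trajectory-guided video generation models: because of limited 3D understanding, they struggle to generate object motion whose 6D pose may change, especially under wide-range rotations. The paper proposes PoseTraj, a pose-aware video dragging model that generates 3D-aligned motion from 2D trajectories. Its key component is a two-stage pose-aware pretraining framework that improves 3D understanding across diverse trajectories. Specifically, the authors build PoseTraj-10K, a large-scale synthetic dataset of 10,000 videos of objects following rotational trajectories, and use 3D bounding boxes as intermediate supervision signals to strengthen the model's perception of pose changes. The trajectory-control module is then fine-tuned on real-world videos, with an additional camera-disentanglement module to further refine motion accuracy. Experiments show that the method excels at 3D pose-aligned dragging for rotational trajectories and also outperforms existing baselines in trajectory accuracy and video quality.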

Link: https://arxiv.org/abs/2503.16068
Authors: Longbin Ji, Lei Zhong, Pengfei Wei, Changjian Li
Affiliations: University of Edinburgh; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code, data and project page: this https URL

Abstract:Recent advancements in trajectory-guided video generation have achieved notable progress. However, existing models still face challenges in generating object motions with potentially changing 6D poses under wide-range rotations, due to limited 3D understanding. To address this problem, we introduce PoseTraj, a pose-aware video dragging model for generating 3D-aligned motion from 2D trajectories. Our method adopts a novel two-stage pose-aware pretraining framework, improving 3D understanding across diverse trajectories. Specifically, we propose a large-scale synthetic dataset PoseTraj-10K, containing 10k videos of objects following rotational trajectories, and enhance the model perception of object pose changes by incorporating 3D bounding boxes as intermediate supervision signals. Following this, we fine-tune the trajectory-controlling module on real-world videos, applying an additional camera-disentanglement module to further refine motion accuracy. Experiments on various benchmark datasets demonstrate that our method not only excels in 3D pose-aligned dragging for rotational trajectories but also outperforms existing baselines in trajectory accuracy and video quality.

[CV-54] Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures

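[Quick read]: This paper addresses the difficulty of rendering realistic Bokeh with variable strength. Existing methods usually rely on synthetic data and require additional inputs, which leads to unrealistic Bokeh reproduction. The paper makes two key contributions: Bokehlicious, a network whose Aperture-Aware Attention mechanism gives intuitive control over Bokeh strength by mimicking the physical lens aperture, and RealBokeh, a dataset of 23,000 high-resolution images captured by professional photographers that covers diverse scenes with varied aperture and focal length settings. Together they improve the realism and efficiency of Bokeh rendering, and the method extends to defocus deblurring, where it achieves competitive results on the RealDOF benchmark.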

Link: https://arxiv.org/abs/2503.16067
Authors: Tim Seizinger, Florin-Alexandru Vasluianu, Marcos V. Conde, Radu Timofte
Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with variable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data can be found at this https URL

[CV-55] Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model

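[Quick read]: This paper addresses the lack of research on virtual try-on for ornaments such as bracelets, rings, earrings, and necklaces. Because ornaments have intricate tiny patterns and repeated geometric sub-structures, it is hard to keep identity consistency and appearance fidelity under large scale and pose variations. The key to the solution is an iterative registration scheme that estimates an accurate wearing mask to improve the alignment between ornament and model, and that runs alongside the denoising process to refine geometry and appearance. In addition, the attention layers are regularized to implicitly map the reference ornament mask to the wearing mask, which preserves structural detail. Experiments confirm that the method transfers ornaments from reference images onto target models, handles substantial differences in scale and pose, preserves identity, and produces realistic results.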

Link: https://arxiv.org/abs/2503.16065
Authors: Yingmao Miao, Zhanpeng Huang, Rui Han, Zibin Wang, Chenhao Lin, Chao Shen
Affiliations: Xi’an Jiaotong University; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:While virtual try-on for clothes and shoes with diffusion models has gained attraction, virtual try-on for ornaments, such as bracelets, rings, earrings, and necklaces, remains largely unexplored. Due to the intricate tiny patterns and repeated geometric sub-structures in most ornaments, it is much more difficult to guarantee identity and appearance consistency under large pose and scale variances between ornaments and models. This paper proposes the task of virtual try-on for ornaments and presents a method to improve the geometric and appearance preservation of ornament virtual try-ons. Specifically, we estimate an accurate wearing mask to improve the alignments between ornaments and models in an iterative scheme alongside the denoising process. To preserve structure details, we further regularize attention layers to map the reference ornament mask to the wearing mask in an implicit way. Experimental results demonstrate that our method successfully wears ornaments from reference images onto target models, handling substantial differences in scale and pose while preserving identity and achieving realistic visual effects.

[CV-56] PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval CVPR2025

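[Quick read]: This paper addresses significant limitations of cross-modal hashing in semantic preservation, contextual integrity, and information redundancy, all of which constrain retrieval efficacy. It proposes PromptHash, a framework that uses affinity prompt-aware collaborative learning for adaptive cross-modal hashing. The core components are: (i) a text affinity prompt learning mechanism that preserves contextual information while remaining parameter-efficient; (ii) an adaptive gated selection fusion architecture that combines a State Space Model with a Transformer network for precise cross-modal feature integration; and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. Together these contributions establish a paradigm for stronger semantic consistency across modalities.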

Link: https://arxiv.org/abs/2503.16064
Authors: Qiang Zou, Shuli Cheng, Jiayi Chen
Affiliations: School of Computer Science and Technology, Xinjiang University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted by CVPR 2025

Abstract:Cross-modal hashing is a promising approach for efficient data retrieval and storage optimization. However, contemporary methods exhibit significant limitations in semantic preservation, contextual integrity, and information redundancy, which constrains retrieval efficacy. We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. We propose an end-to-end framework for affinity-prompted collaborative hashing, with the following fundamental technical contributions: (i) a text affinity prompt learning mechanism that preserves contextual information while maintaining parameter efficiency, (ii) an adaptive gated selection fusion architecture that synthesizes State Space Model with Transformer network for precise cross-modal feature integration, and (iii) a prompt affinity alignment strategy that bridges modal heterogeneity through hierarchical contrastive learning. To the best of our knowledge, this study presents the first investigation into affinity prompt awareness within collaborative cross-modal adaptive hash learning, establishing a paradigm for enhanced semantic consistency across modalities. Through comprehensive evaluation on three benchmark multi-label datasets, PromptHash demonstrates substantial performance improvements over existing approaches. Notably, on the NUS-WIDE dataset, our method achieves significant gains of 18.22% and 18.65% in image-to-text and text-to-image retrieval tasks, respectively. The code is publicly available at this https URL.
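
For readers new to cross-modal hashing, the retrieval step itself is simple: real-valued features from either modality are binarized and ranked by Hamming distance. The sketch below shows only that generic step, with sign thresholding assumed as the binarization rule; it does not reproduce PromptHash's learned encoders, and all function names are illustrative.

```python
def to_hash_code(features):
    """Binarize a real-valued feature vector by sign (assumed rule)."""
    return [1 if f > 0 else 0 for f in features]


def hamming_distance(a, b):
    """Number of bit positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code, database_codes):
    """Database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
```

In text-to-image retrieval the query code would come from the text encoder and the database codes from the image encoder; the learning problem is making semantically matching pairs land on nearby codes.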

[CV-57] Landmarks Are Alike Yet Distinct: Harnessing Similarity and Individuality for One-Shot Medical Landmark Detection

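[Quick read]: This paper addresses two main challenges of multi-landmark detection in medical imaging: the "seesaw phenomenon", in which improving detection of some landmarks degrades detection of others, and the extra memory and computation required to train a separate model for each landmark. Starting from the belief that "landmarks are distinct", the authors train models that each focus on a single landmark, using pseudo-labels and template data that are updated throughout training, to reach high accuracy. Starting from the belief that "landmarks are also alike", they introduce an adapter-based fusion model that combines shared weights with landmark-specific weights, sharing parameters efficiently while adapting flexibly to individual landmarks. The approach significantly reduces memory and compute requirements and mitigates the seesaw phenomenon in multi-landmark training. Experiments show that the single-landmark models clearly outperform traditional multi-point joint training on individual landmarks, while the adapter-based fusion model greatly improves resource efficiency and retains high performance.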

Link: https://arxiv.org/abs/2503.16058
Authors: Xu He, Zhen Huang, Qingsong Yao, Xiaoqian Zhou, S. Kevin Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Landmark detection plays a crucial role in medical imaging applications such as disease diagnosis, bone age estimation, and therapy planning. However, training models for detecting multiple landmarks simultaneously often encounters the “seesaw phenomenon”, where improvements in detecting certain landmarks lead to declines in detecting others. Yet, training a separate model for each landmark increases memory usage and computational overhead. To address these challenges, we propose a novel approach based on the belief that “landmarks are distinct” by training models with pseudo-labels and template data updated continuously during the training process, where each model is dedicated to detecting a single landmark to achieve high accuracy. Furthermore, grounded on the belief that “landmarks are also alike”, we introduce an adapter-based fusion model, combining shared weights with landmark-specific weights, to efficiently share model parameters while allowing flexible adaptation to individual landmarks. This approach not only significantly reduces memory and computational resource requirements but also effectively mitigates the seesaw phenomenon in multi-landmark training. Experimental results on publicly available medical image datasets demonstrate that the single-landmark models significantly outperform traditional multi-point joint training models in detecting individual landmarks. Although our adapter-based fusion model shows slightly lower performance compared to the combined results of all single-landmark models, it still surpasses the current state-of-the-art methods while achieving a notable improvement in resource efficiency.
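
The idea of combining shared weights with landmark-specific weights can be sketched as one shared linear map plus a small per-landmark adapter. The abstract does not say what form the adapters take; the low-rank residual used here, the class name `AdapterFusionLayer`, and the example sizes in the usage note are all assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class AdapterFusionLayer:
    """Shared weights plus a low-rank adapter per landmark (hypothetical design)."""

    def __init__(self, dim, num_landmarks, rank=2, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.shared = rand(dim, dim)                             # used by every landmark
        self.down = [rand(rank, dim) for _ in range(num_landmarks)]
        # Zero-initialized up-projection: each adapter starts as a no-op.
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(num_landmarks)]

    def forward(self, x, landmark):
        base = matvec(self.shared, x)
        delta = matvec(self.up[landmark], matvec(self.down[landmark], x))
        return [b + d for b, d in zip(base, delta)]

    def num_parameters(self):
        dim, rank = len(self.shared), len(self.down[0])
        return dim * dim + len(self.down) * 2 * rank * dim
```

With `dim=8`, 19 landmarks (an arbitrary illustrative count), and `rank=2`, this layer holds 672 parameters, against 1216 for 19 independent 8x8 layers, which illustrates the kind of saving the abstract attributes to parameter sharing.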

[CV-58] Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts

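[Quick read]: This paper addresses the scalability and performance of diffusion models, in particular the difficulties that arise when Mixture of Experts (MoE) methods are introduced into visual generation, such as shallow-layer learning problems and mode collapse. It proposes Race-DiT, a new MoE model for diffusion transformers, with three core components: 1) a flexible routing strategy, Expert Race, in which tokens and experts compete together and the top candidates are selected, so that the model learns to assign experts dynamically to critical tokens; 2) per-layer regularization to ease shallow-layer learning; and 3) a router similarity loss that prevents mode collapse and so ensures better expert utilization. Experiments on ImageNet show significant performance gains and promising scaling properties.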

Link: https://arxiv.org/abs/2503.16057
Authors: Yike Yuan, Ziyu Wang, Zihao Huang, Defa Zhu, Xun Zhou, Jingyi Yu, Qiyang Min
Affiliations: ShanghaiTech University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none

Abstract:Diffusion models have emerged as mainstream framework in visual generation. Building upon this success, the integration of Mixture of Experts (MoE) methods has shown promise in enhancing model scalability and performance. In this paper, we introduce Race-DiT, a novel MoE model for diffusion transformers with a flexible routing strategy, Expert Race. By allowing tokens and experts to compete together and select the top candidates, the model learns to dynamically assign experts to critical tokens. Additionally, we propose per-layer regularization to address challenges in shallow layer learning, and router similarity loss to prevent mode collapse, ensuring better expert utilization. Extensive experiments on ImageNet validate the effectiveness of our approach, showcasing significant performance gains while promising scaling properties.
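
One reading of "tokens and experts compete together" is a single top-k selection over every (token, expert) score, rather than a separate top-k for each token. The sketch below implements that reading only; the score normalization, capacity handling, and training details of Race-DiT are not given in the abstract, and `expert_race` is an illustrative name.

```python
from typing import List, Tuple


def expert_race(scores: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Select the k highest-scoring (token, expert) pairs across the whole batch.

    scores[t][e] is the router score of expert e for token t.
    Returns the winning pairs sorted by (token, expert).
    """
    candidates = [
        (score, token, expert)
        for token, row in enumerate(scores)
        for expert, score in enumerate(row)
    ]
    winners = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return sorted((token, expert) for _, token, expert in winners)
```

For example, with scores `[[0.9, 0.8, 0.7], [0.1, 0.2, 0.6], [0.0, 0.05, 0.3]]` and `k=4`, the first token wins three expert slots, the second wins one, and the third wins none. Per-token top-k routing would instead give every token the same number of experts.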

[CV-59] Semantic-Guided Global-Local Collaborative Networks for Lightweight Image Super-Resolution

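[Quick read]: This paper addresses Single-Image Super-Resolution (SISR) for vision-based measurement systems, which need clear, detailed images. Images captured by visual measurement tools often suffer from blurring and loss of detail, which hurts measurement accuracy. The paper proposes the Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) as a lightweight SISR solution. Its key idea is to guide super-resolution with semantic priors: a Semantic Guidance Module integrates priors extracted from a pre-trained model into the super-resolution network, and a Global-Local Collaborative Module combines three Global and Local Detail Enhancement Modules with a Hybrid Attention Mechanism to exploit both local and non-local interactions for better detail rendition. Experiments show that SGGLC-Net achieves competitive PSNR and SSIM values on multiple benchmark datasets while reducing multi-adds by 12.81G relative to state-of-the-art lightweight super-resolution approaches, indicating its potential to improve the precision and efficiency of visual measurement systems.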

Link: https://arxiv.org/abs/2503.16056
Authors: Wanshu Fan, Yue Wang, Cong Wang, Yunzhe Zhang, Wei Wang, Dongsheng Zhou
Affiliations: National and Local Joint Engineering Laboratory of Computer Aided Design, School of Software Engineering, Dalian University; Centre for Advances in Reliability and Safety (CAiRS), Hong Kong; School of Cyber Science and Technology, Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 13 figures, 9 tables

Abstract:Single-Image Super-Resolution (SISR) plays a pivotal role in enhancing the accuracy and reliability of measurement systems, which are integral to various vision-based instrumentation and measurement applications. These systems often require clear and detailed images for precise object detection and recognition. However, images captured by visual measurement tools frequently suffer from degradation, including blurring and loss of detail, which can impede measurement accuracy. As a potential remedy, we in this paper propose a Semantic-Guided Global-Local Collaborative Network (SGGLC-Net) for lightweight SISR. Our SGGLC-Net leverages semantic priors extracted from a pre-trained model to guide the super-resolution process, enhancing image detail quality effectively. Specifically, we propose a Semantic Guidance Module that seamlessly integrates the semantic priors into the super-resolution network, enabling the network to more adeptly capture and utilize semantic priors, thereby enhancing image details. To further explore both local and non-local interactions for improved detail rendition, we propose a Global-Local Collaborative Module, which features three Global and Local Detail Enhancement Modules, as well as a Hybrid Attention Mechanism to work together to efficiently learn more useful features. Our extensive experiments show that SGGLC-Net achieves competitive PSNR and SSIM values across multiple benchmark datasets, demonstrating higher performance with the multi-adds reduction of 12.81G compared to state-of-the-art lightweight super-resolution approaches. These improvements underscore the potential of our approach to enhance the precision and effectiveness of visual measurement systems. Codes are at this https URL.
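
PSNR, one of the two metrics reported above, follows directly from the mean squared error between a reference image and its reconstruction. A minimal sketch on flattened pixel lists is given below; evaluation conventions such as the color channel used or border cropping vary between papers and are not stated in the abstract, so none are assumed here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

With 8-bit pixels, a uniform error of one gray level gives roughly 48.13 dB, and higher values mean a closer reconstruction.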

[CV-60] Closer to Ground Truth: Realistic Shape and Appearance Labeled Data Generation for Unsupervised Underwater Image Segmentation ECCV

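[Quick read]: This paper addresses fish segmentation in underwater videos, a task of great practical value to the marine and aquaculture industries that is made hard by the filming environment, poor visibility, and the scarcity of annotated underwater fish data. It proposes a two-stage unsupervised segmentation approach that requires no human annotations and combines artificially created and real images. The key step is generating challenging synthetic training data by placing virtual fish in real underwater habitats after applying fish transformations such as Thin Plate Spline shape warping and color Histogram Matching, so that the generated images move closer to real-world data with every stage. On the DeepFish dataset the unsupervised method performs close to a fully supervised state-of-the-art model. The paper also demonstrates the method on salmon segmentation, for which it introduces the DeepSalmon dataset (30 GB), and shows on both datasets that the approach can boost the performance of the fully supervised state-of-the-art model.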

Link: https://arxiv.org/abs/2503.16051
Authors: Andrei Jelea, Ahmed Nabil Belbachir, Marius Leordeanu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of ECCVW 2024

Abstract:Solving fish segmentation in underwater videos, a real-world problem of great practical value in marine and aquaculture industry, is a challenging task due to the difficulty of the filming environment, poor visibility and limited existing annotated underwater fish data. In order to overcome these obstacles, we introduce a novel two stage unsupervised segmentation approach that requires no human annotations and combines artificially created and real images. Our method generates challenging synthetic training data, by placing virtual fish in real-world underwater habitats, after performing fish transformations such as Thin Plate Spline shape warping and color Histogram Matching, which realistically integrate synthetic fish into the backgrounds, making the generated images increasingly closer to the real world data with every stage of our approach. While we validate our unsupervised method on the popular DeepFish dataset, obtaining a performance close to a fully-supervised SoTA model, we further show its effectiveness on the specific case of salmon segmentation in underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature (30 GB). Moreover, on both datasets we prove the capability of our approach to boost the performance of the fully-supervised SoTA model.
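
Histogram matching, one of the fish transformations named above, remaps the intensities of one image so that their cumulative distribution follows that of a reference. The sketch below handles a single flattened channel using discrete quantile mapping without interpolation; applying it per color channel, and how the paper combines it with shape warping, are assumptions left outside the example.

```python
from bisect import bisect_left
from collections import Counter


def _cdf(values):
    """Sorted distinct levels and their cumulative frequencies."""
    counts = Counter(values)
    levels = sorted(counts)
    total, running, cdf = len(values), 0, []
    for level in levels:
        running += counts[level]
        cdf.append(running / total)
    return levels, cdf


def match_histogram(source, reference):
    """Map each source value to the reference value at the same quantile."""
    src_levels, src_cdf = _cdf(source)
    ref_levels, ref_cdf = _cdf(reference)
    mapping = {}
    for level, p in zip(src_levels, src_cdf):
        idx = min(bisect_left(ref_cdf, p), len(ref_levels) - 1)
        mapping[level] = ref_levels[idx]
    return [mapping[v] for v in source]
```

In the setting described, `source` would hold the pixels of a synthetic fish and `reference` the pixels of the underwater background, so the pasted fish takes on the color statistics of the scene.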

[CV-61] Agentic Keyframe Search for Video Question Answering

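[Quick read]: This paper addresses the tension in video question answering (VideoQA) between the need for a thorough understanding of the video and the high computational cost of obtaining it. It proposes Agentic Keyframe Search (AKeyS), an algorithm in which a modern language agent directs a classical search algorithm to separate key information from redundant, irrelevant content. AKeyS first segments the video and organizes it as a tree, then uses the language agent to estimate heuristics and movement costs while dynamically expanding nodes, and finally lets the agent decide, based on termination conditions, whether enough keyframes have been collected before it answers. Experiments show that AKeyS outperforms all previous methods with the highest keyframe search efficiency, at significantly lower computational overhead.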

Link: https://arxiv.org/abs/2503.16032
Authors: Sunqi Fan, Meng-Hao Guo, Shuojin Yang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the demand for a thorough understanding of videos and high computational costs still limit the widespread applications of VideoQA. To address it, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It can effectively distinguish key information from redundant, irrelevant content by leveraging modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize it as a tree structure. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent determines if sufficient keyframes have been collected based on termination conditions and provides answers. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, which means it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy while processing only 43.5% of the frames compared to VideoTree. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at this https URL.
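
The three steps in the abstract (tree of segments, agent-estimated heuristics and movement costs, agent-decided termination) resemble an A*-style best-first search. The sketch below is one hedged reading of that loop: the binary split of segments, the choice of the middle frame as the inspected keyframe, and the three callables standing in for the language agent are all assumptions, not the authors' implementation.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def keyframe_search(
    num_frames: int,
    heuristic: Callable[[Segment], float],       # stand-in for the agent's heuristic estimate
    move_cost: Callable[[Segment], float],       # stand-in for the agent's movement cost
    is_sufficient: Callable[[List[int]], bool],  # stand-in for the agent's termination check
    max_expansions: int = 64,
) -> List[int]:
    """Best-first expansion of a binary segment tree, collecting one frame per visit."""
    root = (0, num_frames)
    path_cost = {root: 0.0}
    frontier = [(heuristic(root), root)]
    keyframes: List[int] = []
    for _ in range(max_expansions):
        if not frontier:
            break
        _, (start, end) = heapq.heappop(frontier)
        mid = (start + end) // 2
        keyframes.append(mid)                    # inspect the middle frame of this segment
        if is_sufficient(keyframes):
            break
        if end - start < 2:
            continue                             # leaf segment, nothing to split
        for child in ((start, mid), (mid, end)):
            path_cost[child] = path_cost[(start, end)] + move_cost(child)
            heapq.heappush(frontier, (path_cost[child] + heuristic(child), child))
    return sorted(set(keyframes))
```

In a real system each callable would wrap a prompt to the language agent. The point of the structure is that only the segments the agent considers promising are ever expanded, which is where the savings in processed frames would come from.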

[CV-62] Single Image Iterative Subject-driven Generation and Editing

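[Quick read]: This paper addresses personalized image generation and editing from very few images of a subject, or even a single image. Existing methods rely either on concept learning, whose output quality deteriorates quickly when few subject images are available, or on a pre-trained encoder, which restricts generation to the training distribution and is time-consuming to train. The paper proposes SISO, a training-free approach that optimizes a similarity score against the input subject image: it iteratively generates images and updates the model based on the similarity loss until a satisfactory level of similarity is reached. The optimization is plug-and-play for any image generator, and it shows significant improvements over existing methods in image quality, subject fidelity, and background preservation.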

Link: https://arxiv.org/abs/2503.16025
Authors: Yair Shpitzer, Gal Chechik, Idan Schwartz
Affiliations: Bar-Ilan University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page is at this https URL

Abstract:Personalizing image generation and editing is particularly challenging when we only have a few images of the subject, or even a single image. A common approach to personalization is concept learning, which can integrate the subject into existing models relatively quickly, but produces images whose quality tends to deteriorate quickly when the number of subject images is small. Quality can be improved by pre-training an encoder, but training restricts generation to the training distribution, and is time consuming. It is still an open hard challenge to personalize image generation and editing from a single image without training. Here, we present SISO, a novel, training-free approach based on optimizing a similarity score with an input subject image. More specifically, SISO iteratively generates images and optimizes the model based on loss of similarity with the given subject image until a satisfactory level of similarity is achieved, allowing plug-and-play optimization to any image generator. We evaluated SISO in two tasks, image editing and image generation, using a diverse data set of personal subjects, and demonstrate significant improvements over existing methods in image quality, subject fidelity, and background preservation.

[CV-63] GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language Instructions

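[Quick read]: This paper addresses flexible instruction-guided 6-DoF grasping. Existing methods use the contextual understanding of large language models (LLMs) to map expressions to targets, but they leave the LLM's knowledge of objects' physical properties largely unused. The paper proposes GraspCoT, a framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties and guided by auxiliary question-answering (QA) tasks. A set of QA templates enables hierarchical reasoning in three stages: target parsing, physical property analysis, and grasp action selection. GraspCoT also uses a unified multimodal LLM architecture that encodes multi-view observations of 3D scenes into 3D-aware visual tokens and jointly embeds them with CoT-derived textual tokens inside the LLM to predict grasp poses. In addition, the authors build IntentGrasp, a benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Experiments confirm the method's effectiveness, and real-world robotic applications confirm its practicality.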

Link: https://arxiv.org/abs/2503.16013
Authors: Xiaomeng Chu, Jiajun Deng, Guoliang You, Wei Liu, Xingchen Li, Jianmin Ji, Yanyong Zhang
Affiliations: University of Science and Technology of China; The University of Adelaide
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users’ intentions in the instructions. However, the LLM’s knowledge about objects’ physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. Codes and data will be released.

[CV-64] GazeSCRNN: Event-based Near-eye Gaze Tracking using a Spiking Neural Network

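[Quick read]: This paper addresses the limitations of traditional near-eye gaze tracking systems in capturing dynamic movements. It proposes GazeSCRNN, an event-based near-eye gaze tracking method built on a Spiking Convolutional Recurrent Neural Network (SCRNN) that takes advantage of the high temporal resolution, energy efficiency, and event-based compatibility of Dynamic Vision Sensor (DVS) cameras. The key to the solution is processing the DVS event streams with Adaptive Leaky-Integrate-and-Fire (ALIF) neurons and a hybrid architecture optimized for spatio-temporal data, with training techniques such as Forward-Propagation-Through-Time further improving performance. The most accurate model achieves a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm, which demonstrates the feasibility of using spiking neural networks (SNNs) for event-based gaze tracking and highlights the factors that matter most for system performance.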

Link: https://arxiv.org/abs/2503.16012
Authors: Stijn Groenen, Marzieh Hassanshahi Varposhti, Mahyar Shahsavari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:This work introduces GazeSCRNN, a novel spiking convolutional recurrent neural network designed for event-based near-eye gaze tracking. Leveraging the high temporal resolution, energy efficiency, and compatibility of Dynamic Vision Sensor (DVS) cameras with event-based systems, GazeSCRNN uses a spiking neural network (SNN) to address the limitations of traditional gaze-tracking systems in capturing dynamic movements. The proposed model processes event streams from DVS cameras using Adaptive Leaky-Integrate-and-Fire (ALIF) neurons and a hybrid architecture optimized for spatio-temporal data. Extensive evaluations on the EV-Eye dataset demonstrate the model’s accuracy in predicting gaze vectors. In addition, we conducted ablation studies to reveal the importance of the ALIF neurons, dynamic event framing, and training techniques, such as Forward-Propagation-Through-Time, in enhancing overall system performance. The most accurate model achieved a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm. Consequently, this work is pioneering in demonstrating the feasibility of using SNNs for event-based gaze tracking, while shedding light on critical challenges and opportunities for further improvement.
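The abstract names ALIF neurons but does not spell out their update rule. The following is a minimal sketch of one common discrete-time formulation, not the authors' implementation: the function name `alif_run`, the decay factors, the soft reset, and the threshold increment are all assumptions chosen for illustration.

```python
def alif_run(inputs, beta=0.9, rho=0.95, base_threshold=1.0, adapt_step=0.5):
    """Simulate one adaptive leaky integrate-and-fire neuron (illustrative only).

    inputs: list of input currents, one per time step.
    beta: membrane leak factor (assumed value).
    rho: decay factor of the adaptation variable (assumed value).
    adapt_step: how much each spike raises the firing threshold (assumed value).
    Returns a list of 0/1 spikes.
    """
    v = 0.0  # membrane potential
    a = 0.0  # adaptation variable added to the base threshold
    spikes = []
    for current in inputs:
        v = beta * v + current          # leaky integration
        threshold = base_threshold + a  # adaptive threshold
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        if fired:
            v -= threshold              # soft reset (an assumption)
            a += adapt_step             # spiking makes the neuron harder to fire
        a *= rho                        # adaptation decays over time
    return spikes


adaptive = alif_run([0.6] * 20)
plain = alif_run([0.6] * 20, adapt_step=0.0)  # ordinary LIF for comparison
print(sum(adaptive), sum(plain))
```

With a constant input, the adaptive neuron fires less often than the plain LIF neuron because each spike temporarily raises its threshold.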

[CV-65] SenseExpo: Efficient Autonomous Exploration with Prediction Information from Lightweight Neural Networks

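【Quick read】: This paper addresses the limitations of traditional autonomous exploration methods in computational overhead and environmental generalization. It proposes the SenseExpo framework, whose core is a lightweight prediction network. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), the model has only 709k parameters yet outperforms U-net (24.5M parameters) and LaMa (51M parameters) on the KTH dataset, with a 38.7% improvement in peak signal-to-noise ratio (PSNR) over LaMa. It also generalizes well across domains, with a Fréchet Inception Distance (FID) score of 161.55 on the HouseExpo dataset. In terms of exploration efficiency, SenseExpo reduces exploration time by 67.9% on KTH and 77.1% on MRPB 1.0. Deployed as a plug-and-play ROS node, the framework integrates with existing navigation systems and offers an efficient solution for resource-constrained devices.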

Link: https://arxiv.org/abs/2503.16000
Authors: Haojia Gao, Haohua Que, Hoiian Au, Weihao Shan, Mingkai Liu, Yusen Qin, Lei Mu, Rong Zhao, Xinghua Yang, Qi Wei, Fei Qiao
Affiliations: Beijing University of Technology; Tsinghua University; Peking University; Beijing Forestry University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper proposes SenseExpo, an efficient autonomous exploration framework based on a lightweight prediction network, which addresses the limitations of traditional methods in computational overhead and environmental generalization. By integrating Generative Adversarial Networks (GANs), Transformer, and Fast Fourier Convolution (FFC), we designed a lightweight prediction model with merely 709k parameters. Our smallest model achieves better performance on the KTH dataset than U-net (24.5M) and LaMa (51M), delivering PSNR 9.026 and SSIM 0.718, which represents a 38.7% PSNR improvement over the 51M-parameter LaMa model. Cross-domain testing demonstrates its strong generalization capability, with an FID score of 161.55 on the HouseExpo dataset, significantly outperforming comparable methods. Regarding exploration efficiency, SenseExpo reduces exploration time by approximately 67.9% compared to MapEx on the KTH dataset and by roughly 77.1% on the MRPB 1.0 dataset. Deployed as a plug-and-play ROS node, the framework seamlessly integrates with existing navigation systems, providing an efficient solution for resource-constrained devices.
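For readers unfamiliar with the PSNR metric quoted above, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE). The function name `psnr`, the flat-list image representation, and the `max_value=1.0` pixel range are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(prediction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - p) ** 2 for r, p in zip(reference, prediction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example with made-up pixel values (not data from the paper).
print(psnr([0.0, 1.0], [0.1, 0.9]))
```

Higher PSNR means the predicted map is closer to the reference, so a relative PSNR gain is computed as (new − old) / old.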

[CV-66] Automating 3D Dataset Generation with Neural Radiance Fields

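【Quick read】: This paper addresses the scarcity of large, high-quality annotated datasets for 3D detection. Existing public 3D datasets are few and cover a limited range of classes, while creating diverse, precisely annotated, large-scale datasets is complex and expensive. The paper proposes an automated pipeline for 3D dataset generation. Its key idea is to use the universal 3D representation and rendering capabilities of Radiance Fields to quickly produce high-quality 3D models of arbitrary objects, which then serve as input to a synthetic dataset generator. The pipeline is fast, easy to use, and highly automated, and the experiments show that 3D pose estimation networks trained on the generated datasets perform strongly in typical application scenarios.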

Link: https://arxiv.org/abs/2503.15997
Authors: P. Schulz, T. Hempel, A. Al-Hamadi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented at ROBOVIS 2025 (5th International Conference on Robotics, Computer Vision and Intelligent Systems)

Abstract:3D detection is a critical task for understanding the spatial characteristics of the environment and is used in a variety of applications including robotics, augmented reality, and image retrieval. Training performant detection models requires diverse, precisely annotated, large-scale datasets that involve complex and expensive creation processes. Hence, there are only a few public 3D datasets, and these are additionally limited in their range of classes. In this work, we propose a pipeline for automatic generation of 3D datasets for arbitrary objects. By utilizing the universal 3D representation and rendering capabilities of Radiance Fields, our pipeline generates high-quality 3D models for arbitrary objects. These 3D models serve as input for a synthetic dataset generator. Our pipeline is fast, easy to use, and has a high degree of automation. Our experiments demonstrate that 3D pose estimation networks trained with our generated datasets achieve strong performance in typical application scenarios.

[CV-67] Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

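【Quick read】: This paper addresses the time and cost of creating realistic humanoid character animation with traditional methods. It proposes synthesizing 4D animated sequences from an input static 3D humanoid mesh by leveraging the strong, generalized motion priors of generative video models. The key is to combine the motion information in generative video models with the SMPL (Skinned Multi-Person Linear) representation: a video is generated conditioned on a rendered image of the 3D mesh, and motion optimization then transfers the video-generated motion to the mesh, enabling cost-effective synthesis of diverse, realistic 4D animations.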

Link: https://arxiv.org/abs/2503.15996
Authors: Marc Benedí San Millán, Angela Dai, Matthias Nießner
Affiliations: Technical University of Munich
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 10 figures

Abstract:Animation of humanoid characters is essential in various graphics applications, but creating realistic animations requires significant time and cost. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models, as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This provides a cost-effective and accessible solution for synthesizing diverse and realistic 4D animations.

[CV-68] SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition

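【Quick read】: This paper addresses the fact that the spiking attention modules of existing Transformer-based Spiking Neural Networks (SNNs) do not adequately handle the over-allocation of attention to irrelevant contexts. Most prior methods adapt the attention mechanism of analog Transformers directly and overlook this limitation. To fix this fundamental yet overlooked issue, the paper proposes a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). The key is to emulate the brain's lateral inhibition mechanism, enhancing attention to relevant tokens while suppressing attention to irrelevant ones, so that attention is allocated more effectively and model performance improves.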

Link: https://arxiv.org/abs/2503.15986
Authors: Zeqi Zheng, Yanchen Huang, Yingchao Yu, Zizheng Zhu, Junfeng Tang, Zhaofei Yu, Yaochu Jin
Affiliations: Zhejiang University; Westlake University; Nanjing University; Donghua University; University of Electronic Science and Technology of China; Peking University
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain’s lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. Our code and training checkpoints will be released upon acceptance.

[CV-69] DIPLI: Deep Image Prior Lucky Imaging for Blind Astronomical Image Restoration

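【Quick read】: This paper addresses the difficulty of applying conventional deep learning to astronomical image restoration and super-resolution, where training data is limited. It proposes an improved Deep Image Prior (DIP) model that combines several techniques: the model is refined to process multiple frames concurrently using the Back Projection method and the TVNet model, and a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique is adopted to obtain unbiased estimates with minimal variance and to counteract overfitting. Together these changes reduce the likelihood of learning noise, dampen loss fluctuations during training, and make the results more stable. The algorithm is validated on multiple image sets of astronomical and celestial objects, where it surpasses traditional Lucky Imaging, the original DIP model, and state-of-the-art transformer- and diffusion-based models.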

Link: https://arxiv.org/abs/2503.15984
Authors: Suraj Singh, Anastasia Batsheva, Oleg Y. Rogov, Ahmed Bouridane
Affiliations: Skolkovo Institute of Science and Technology (Skoltech); College of Computing and Informatics Computer Engineering Department, University of Sharjah, UAE; AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 10 pages, 7 figures, 2 tables

Abstract:Contemporary image restoration and super-resolution techniques effectively harness deep neural networks, markedly outperforming traditional methods. However, astrophotography presents unique challenges for deep learning due to limited training data. This work explores hybrid strategies, such as the Deep Image Prior (DIP) model, which facilitates blind training but is susceptible to overfitting, artifact generation, and instability when handling noisy images. We propose enhancements to the DIP model’s baseline performance through several advanced techniques. First, we refine the model to process multiple frames concurrently, employing the Back Projection method and the TVNet model. Next, we adopt a Markov approach incorporating Monte Carlo estimation, Langevin dynamics, and a variational input technique to achieve unbiased estimates with minimal variance and counteract overfitting effectively. Collectively, these modifications reduce the likelihood of noise learning and mitigate loss function fluctuations during training, enhancing result stability. We validated our algorithm across multiple image sets of astronomical and celestial objects, achieving performance that not only mitigates the limitations of Lucky Imaging, a classical computer vision technique that remains a standard in astronomical image reconstruction, but also surpasses the original DIP model and state-of-the-art transformer- and diffusion-based models, underscoring the significance of our improvements.

[CV-70] A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli

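【Quick read】: This survey addresses the problem of decoding brain signals from functional magnetic resonance imaging (fMRI) to reconstruct external stimuli. The key to current solutions is combining neuroimaging with advanced image generation models, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models, to improve the quality of reconstructed stimuli, and using multimodal pre-trained models to strengthen cross-modal decoding. The paper systematically summarizes the model structures of existing methods, the relevant datasets, and the brain regions involved, evaluates decoding performance, and proposes future research directions, providing a reference for applying fMRI to the study of brain perception and cognition.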

Link: https://arxiv.org/abs/2503.15978
Authors: Pengyu Liu, Guohua Dong, Dan Guo, Kun Li, Fengling Li, Xun Yang, Meng Wang, Xiaomin Ying
Affiliations: Center for Computational Biology, Beijing Institute of Basic Medical Sciences; School of Computer Science and Information Engineering, Hefei University of Technology; Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; ReLER, CCAI, Zhejiang University; Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney; School of Information Science and Technology, University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 6 figures

Abstract:In daily life, we encounter diverse external stimuli, such as images, sounds, and videos. As research in multimodal stimuli and neuroscience advances, fMRI-based brain decoding has become a key tool for understanding brain perception and its complex cognitive processes. Decoding brain signals to reconstruct stimuli not only reveals intricate neural mechanisms but also drives progress in AI, disease treatment, and brain-computer interfaces. Recent advancements in neuroimaging and image generation models have significantly improved fMRI-based decoding. While fMRI offers high spatial resolution for precise brain activity mapping, its low temporal resolution and signal noise pose challenges. Meanwhile, techniques like GANs, VAEs, and Diffusion Models have enhanced reconstructed image quality, and multimodal pre-trained models have boosted cross-modal decoding tasks. This survey systematically reviews recent progress in fMRI-based brain decoding, focusing on stimulus reconstruction from passive brain signals. It summarizes datasets, relevant brain regions, and categorizes existing methods by model structure. Additionally, it evaluates model performance and discusses their effectiveness. Finally, it identifies key challenges and proposes future research directions, offering valuable insights for the field. For more information and resources related to this survey, visit this https URL.

[CV-71] Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation

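【Quick read】: This paper addresses the long runtime of the diffusion process when generating high-quality 3D models from a single image. To obtain high-quality reconstructions in a few inference steps, it highlights the importance of regularizing the learning of the score function in states of random noise. To this end it proposes edge consistency, meaning consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model and enable a distillation-based refinement of the endpoint score function. On top of the distilled diffusion models, it builds an adversarial augmentation strategy that enriches generation detail and raises overall quality. The two modules complement each other. Extensive experiments show that, compared with existing methods, Acc3D improves computational efficiency by more than 20x while also improving generation quality noticeably.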

Link: https://arxiv.org/abs/2503.15975
Authors: Kendong Liu, Zhiyu Zhu, Hui Liu, Junhui Hou
Affiliations: City University of Hong Kong; Saint Francis University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise. To this end, we propose edge consistency, i.e., consistent predictions across the high signal-to-noise ratio region, to enhance a pre-trained diffusion model, enabling a distillation-based refinement of the endpoint score function. Building on those distilled diffusion models, we propose an adversarial augmentation strategy to further enrich the generation detail and boost overall generation quality. The two modules complement each other, mutually reinforcing to elevate generative performance. Extensive experiments demonstrate that our Acc3D not only achieves over a 20× increase in computational efficiency but also yields notable quality improvements compared to state-of-the-art methods.

[CV-72] STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding

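【Quick read】: This paper addresses the challenges of extending pre-trained vision-language models such as CLIP from image tasks to video tasks, chiefly limited labeled video data and high training costs. Current video prompting methods adapt CLIP to video by introducing learnable prompts, but they typically use a single static prompt for all video sequences, which ignores the temporal dynamics and spatial variations across frames and limits the model's ability to capture key temporal information. The paper therefore proposes an integrated Spatial-Temporal dynamic Prompting (STOP) model built on two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. Intra-frame spatial prompts use intra-frame attention and temporal variation to adaptively highlight discriminative regions in each frame, while inter-frame temporal prompts are dynamically inserted between frames with high temporal variance, as measured by frame similarity, to emphasize the differing importance of frames and strengthen the modeling of temporal dependencies. Experiments show that STOP outperforms existing methods on multiple video benchmarks.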

Link: https://arxiv.org/abs/2503.15973
Authors: Zichen Liu, Kunlun Xu, Bing Su, Xu Zou, Yuxin Peng, Jiahuan Zhou
Affiliations: Wangxuan Institute of Computer Technology, Peking University; Gaoling School of Artificial Intelligence, Renmin University of China; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pre-trained on tremendous image-text pairs, vision-language models like CLIP have demonstrated promising zero-shot generalization across numerous image-based tasks. However, extending these capabilities to video tasks remains challenging due to limited labeled video data and high training costs. Recent video prompting methods attempt to adapt CLIP for video tasks by introducing learnable prompts, but they typically rely on a single static prompt for all video sequences, overlooking the diverse temporal dynamics and spatial variations that exist across frames. This limitation significantly hinders the model’s ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. Our intra-frame spatial prompts are designed to adaptively highlight discriminative regions within each frame by leveraging intra-frame attention and temporal variation, allowing the model to focus on areas with substantial temporal dynamics and capture fine-grained spatial details. Additionally, to highlight the varying importance of frames for video understanding, we further introduce inter-frame temporal prompts, dynamically inserting prompts between frames with high temporal variance as measured by frame similarity. This enables the model to prioritize key frames and enhances its capacity to understand temporal dependencies across sequences. Extensive experiments on various video benchmarks demonstrate that STOP consistently achieves superior performance against state-of-the-art methods. The code is available at this https URL.
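The inter-frame temporal prompting idea can be illustrated with a small sketch. This is a simplified reading of the abstract, not the authors' code: the function names, the use of cosine similarity between consecutive frame features, and the fixed prompt budget `num_prompts` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def insert_temporal_prompts(frame_features, num_prompts):
    """Place prompt slots after the frames whose next frame is least similar.

    Low similarity between consecutive frames is treated as high temporal
    variance (an assumption). Returns a list of ("frame", i) / ("prompt", i).
    """
    sims = [cosine(frame_features[i], frame_features[i + 1])
            for i in range(len(frame_features) - 1)]
    # Indices of the gaps with the lowest similarity get a prompt.
    gaps = set(sorted(range(len(sims)), key=lambda i: sims[i])[:num_prompts])
    sequence = []
    for i in range(len(frame_features)):
        sequence.append(("frame", i))
        if i in gaps:
            sequence.append(("prompt", i))
    return sequence


# Hypothetical 2-D frame features: the scene changes between frames 1 and 2.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
print(insert_temporal_prompts(features, num_prompts=1))
```

In a real model the prompt slots would hold learnable tokens; here they are plain markers so the placement logic is easy to inspect.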

[CV-73] V-NAW: Video-based Noise-aware Adaptive Weighting for Facial Expression Recognition

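【Quick read】: This paper addresses label ambiguity and class imbalance in video-based Facial Expression Recognition (FER), both of which degrade performance. Its key solution is Video-based Noise-aware Adaptive Weighting (V-NAW), which adaptively assigns an importance weight to each frame in a clip to handle label ambiguity and capture temporal variations in facial expressions. It also introduces a simple, effective augmentation strategy that reduces redundancy between consecutive frames and thereby mitigates overfitting. Extensive experiments show that the method improves video-based FER performance significantly.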

Link: https://arxiv.org/abs/2503.15970
Authors: JunGyu Lee, Kunyoung Lee, Haesol Park, Ig-Jae Kim, Gi Pyo Nam
Affiliations: Korea Institute of Science and Technology, Seoul, Korea; AI-Robotics, KIST School, University of Science and Technology, Daejeon, Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial Expression Recognition (FER) plays a crucial role in human affective analysis and has been widely applied in computer vision tasks such as human-computer interaction and psychological assessment. The 8th Affective Behavior Analysis in-the-Wild (ABAW) Challenge aims to assess human emotions using the video-based Aff-Wild2 dataset. This challenge includes various tasks, including the video-based EXPR recognition track, which is our primary focus. In this paper, we demonstrate that addressing label ambiguity and class imbalance, which are known to cause performance degradation, can lead to meaningful performance improvements. Specifically, we propose Video-based Noise-aware Adaptive Weighting (V-NAW), which adaptively assigns importance to each frame in a clip to address label ambiguity and effectively capture temporal variations in facial expressions. Furthermore, we introduce a simple and effective augmentation strategy to reduce redundancy between consecutive frames, which is a primary cause of overfitting. Through extensive experiments, we validate the effectiveness of our approach, demonstrating significant improvements in video-based FER performance.

[CV-74] Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation

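【Quick read】: This paper addresses the fact that Earth Observation (EO) vision-language models relying on the visible spectrum fail to exploit the rich spectral information in the multispectral channels recorded by satellites. The key to the solution is Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset, which uses the extended spectral range to improve performance. The paper also builds the largest multispectral image-caption dataset to date, with one million Sentinel-2 samples and corresponding textual descriptions, and develops a scalable captioning pipeline validated by domain experts. Experiments show that Llama3-MS-CLIP significantly outperforms RGB-based approaches on multispectral zero-shot image classification and retrieval, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP over the second-best model. The results underline the relevance of multispectral vision-language learning, and the dataset, code, and model weights are released under an open-source license.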

Link: https://arxiv.org/abs/2503.15969
Authors: Clive Tinashe Marimo, Benedikt Blumenstiel, Maximilian Nitsche, Johannes Jakubik, Thomas Brunschwiler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language models for Earth observation (EO) typically rely on the visual spectrum of data as the only model input, thus failing to leverage the rich spectral information available in the multispectral channels recorded by satellites. Therefore, in this paper, we introduce Llama3-MS-CLIP, the first vision-language model pre-trained with contrastive learning on a large-scale multispectral dataset and report on the performance gains due to the extended spectral range. Furthermore, we present the largest-to-date image-caption dataset for multispectral data, consisting of one million Sentinel-2 samples and corresponding textual descriptions generated with Llama3-LLaVA-Next and Overture Maps data. We develop a scalable captioning pipeline, which is validated by domain experts. We evaluate Llama3-MS-CLIP on multispectral zero-shot image classification and retrieval using three datasets of varying complexity. Our results demonstrate that Llama3-MS-CLIP significantly outperforms other RGB-based approaches, improving classification accuracy by 6.77% on average and retrieval performance by 4.63% mAP compared to the second-best model. Our results emphasize the relevance of multispectral vision-language learning. We release the image-caption dataset, code, and model weights under an open-source license.
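Zero-shot classification with a CLIP-style model, as evaluated above, generally works by comparing an image embedding with the text embeddings of candidate class names. The sketch below shows only that generic comparison step; the embeddings are made-up three-dimensional vectors, and `zero_shot_classify` is a hypothetical helper, not part of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best_label, scores) using cosine similarity to each class text."""
    scores = {name: cosine(image_embedding, emb)
              for name, emb in class_text_embeddings.items()}
    return max(scores, key=scores.get), scores


# Made-up embeddings standing in for encoder outputs (illustrative only).
class_text_embeddings = {
    "forest": [0.9, 0.1, 0.0],
    "water": [0.0, 0.2, 0.9],
    "urban": [0.1, 0.9, 0.1],
}
label, scores = zero_shot_classify([0.1, 0.1, 0.8], class_text_embeddings)
print(label)
```

No class-specific training is involved: adding a new class only requires encoding its text description.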

[CV-75] CausalCLIPSeg: Unlocking CLIP's Potential in Referring Medical Image Segmentation with Causal Intervention MICCAI2024

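【Quick read】: This paper addresses referring medical image segmentation, that is, delineating lesions in medical images according to natural-language descriptions. Aligning visual and textual cues is difficult because the two have very different data properties. The paper proposes CausalCLIPSeg, an end-to-end framework that leverages CLIP through a tailored cross-modal decoding method, so that CLIP's rich semantic space can be adapted to the medical domain for text-to-pixel alignment even though CLIP was not trained on medical data. To mitigate confounding bias, CausalCLIPSeg adds a causal intervention module that self-annotates confounders and extracts causal features from the inputs, preventing the model from learning spurious correlations. The paper also designs an adversarial min-max game that optimizes causal features while penalizing confounding ones. Together these components are the key innovations of CausalCLIPSeg, which achieves state-of-the-art performance in the experiments.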

Link: https://arxiv.org/abs/2503.15949
Authors: Yaxiong Chen, Minghong Wei, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024

Abstract:Referring medical image segmentation targets delineating lesions indicated by textual descriptions. Aligning visual and textual cues is challenging due to their distinct data properties. Inspired by large-scale pre-trained vision-language models, we propose CausalCLIPSeg, an end-to-end framework for referring medical image segmentation that leverages CLIP. Although CLIP is not trained on medical data, we enforce its rich semantic space onto the medical domain by a tailored cross-modal decoding method to achieve text-to-pixel alignment. Furthermore, to mitigate confounding bias that may cause the model to learn spurious correlations instead of meaningful causal relationships, CausalCLIPSeg introduces a causal intervention module which self-annotates confounders and excavates causal features from inputs for segmentation judgments. We also devise an adversarial min-max game to optimize causal features while penalizing confounding ones. Extensive experiments demonstrate the state-of-the-art performance of our proposed method. Code is available at this https URL.

[CV-76] UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation MICCAI2024

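【Quick read】: This paper addresses the challenge of aligning medical images with text in automated radiology report generation. Because labeled medical data is relatively scarce, the task is harder than image captioning in computer vision. The key solution is a transfer learning framework that introduces lightweight adapter modules, UniCrossAdapter, to transfer knowledge from the large-scale pre-trained vision-language model CLIP to the medical domain. Applying CLIP directly is suboptimal because it was trained mainly on natural images, which differ substantially from radiology images. The UniCrossAdapter modules are fine-tuned while CLIP's base parameters stay fixed, and they are distributed across the modalities and their interaction to strengthen vision-language alignment, allowing efficient adaptation to the medical task. Experiments on two public datasets confirm the effectiveness of the method and advance the state of the art in radiology report generation.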

Link: https://arxiv.org/abs/2503.15940
Authors: Yaxiong Chen, Chuang Du, Chunlei Li, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024 Workshop

Abstract:Automated radiology report generation aims to expedite the tedious and error-prone reporting process for radiologists. While recent works have made progress, learning to align medical images and textual findings remains challenging due to the relative scarcity of labeled medical data. For example, datasets for this task are much smaller than those used for image captioning in computer vision. In this work, we propose to transfer representations from CLIP, a large-scale pre-trained vision-language model, to better capture cross-modal semantics between images and texts. However, directly applying CLIP is suboptimal due to the domain gap between natural images and radiology. To enable efficient adaptation, we introduce UniCrossAdapter, lightweight adapter modules that are incorporated into CLIP and fine-tuned on the target task while keeping base parameters fixed. The adapters are distributed across modalities and their interaction to enhance vision-language alignment. Experiments on two public datasets demonstrate the effectiveness of our approach, advancing state-of-the-art in radiology report generation. The proposed transfer learning framework provides a means of harnessing semantic knowledge from large-scale pre-trained models to tackle data-scarce medical vision-language tasks. Code is available at this https URL.

[CV-77] SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer

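【Quick read】: This paper addresses the difficulty of modeling a global effective receptive field efficiently in image style transfer (ST), where existing backbones based on convolutional neural networks (CNNs) and Transformers incur very high computational complexity to achieve global receptive fields. The key solution is to use Mamba, an improved variant of the State Space Model (SSM) that models long-range dependencies with linear complexity. Specifically, the authors propose SaMam, a Mamba-based style transfer framework with an efficient Mamba encoder that extracts content and style information and a style-aware Mamba decoder that adapts flexibly to various styles. To deal with local pixel forgetting, channel redundancy, and spatial discontinuity in existing SSMs, the framework adds local enhancement and a zigzag scan. Experiments show that SaMam outperforms state-of-the-art methods in both accuracy and efficiency.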

Link: https://arxiv.org/abs/2503.15934
Authors: Hongda Liu, Longguang Wang, Ye Zhang, Ziru Yu, Yulan Guo
Affiliations: The Shenzhen Campus of Sun Yat-Sen University, Sun Yat-Sen University; Aviation University of Air Force
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, 2 tables

Abstract:A global effective receptive field plays a crucial role in image style transfer (ST) for obtaining high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) incur huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially the improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolve the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a Mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware Mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy and spatial discontinuity of existing SSMs, we introduce both local enhancement and zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
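The abstract does not define its zigzag scan precisely. One plausible reading, assumed here, is a serpentine (boustrophedon) traversal that reverses direction on every other row, so that consecutive tokens in the 1-D sequence are always spatial neighbors. The function name `zigzag_scan` is hypothetical.

```python
def zigzag_scan(height, width):
    """Return (row, col) positions in serpentine order over a height x width grid.

    Even rows go left to right and odd rows right to left, so each step in the
    flattened sequence moves to an adjacent pixel (an assumed interpretation).
    """
    order = []
    for row in range(height):
        cols = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
        order.extend((row, col) for col in cols)
    return order


print(zigzag_scan(2, 3))
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

By contrast, a plain row-major scan jumps from the end of one row to the start of the next, which is one source of the spatial discontinuity the abstract mentions.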

[CV-78] DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables CVPR2025

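【Quick read】: This paper addresses the difficulty of deploying deep neural networks on edge devices, which stems from their substantial computational and memory requirements. It proposes DnLUT, an ultra-efficient lookup table (LUT) based framework for high-quality color image denoising with minimal resource consumption. The key lies in two complementary components: a Pairwise Channel Mixer (PCM) that captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables after training, DnLUT needs only 500KB of storage and 0.1% of the energy of its CNN counterpart DnCNN, with 20x faster inference. Experiments show that DnLUT exceeds all existing LUT-based methods by more than 1dB in PSNR, setting a new state of the art in resource-efficient color image denoising. The project link is given in the paper.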

Link: https://arxiv.org/abs/2503.15931
Authors: Sidi Yang, Binxiao Huang, Yulun Zhang, Dahai Yu, Yujiu Yang, Ngai Wong
Affiliations: The University of Hong Kong; Shanghai Jiaotong University; TCL Corporate Research; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency, requiring only 500KB of storage and 0.1% of the energy consumption of its CNN counterpart DnCNN, while delivering 20X faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising. The project is available at this https URL.
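The general idea of converting a trained component into a lookup table can be sketched as follows. This is a toy illustration, not DnLUT itself: `tiny_filter` is a hand-written stand-in for a trained network with a two-pixel receptive field, and the 4-bit quantization (`LEVELS = 16`) is an assumption chosen to keep the table small.

```python
LEVELS = 16  # assumed 4-bit quantization of each input pixel


def tiny_filter(center, neighbor):
    """Stand-in for a trained network that sees two quantized pixels."""
    return round(0.75 * center + 0.25 * neighbor)


def build_lut(fn):
    """Evaluate fn once for every possible input pair and store the results."""
    return [fn(c, n) for c in range(LEVELS) for n in range(LEVELS)]


def lut_apply(lut, center, neighbor):
    """Inference becomes a single table read instead of running the network."""
    return lut[center * LEVELS + neighbor]


lut = build_lut(tiny_filter)
print(len(lut), lut_apply(lut, 12, 4))
```

The table grows exponentially with the number of input pixels (here 16² entries), which is why LUT-based methods must keep the receptive field of each table small.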

[CV-79] BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers CVPR2025

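【Quick read】: This paper addresses the low inference speed of Diffusion Transformers (DiTs), which stems mainly from the redundant computation in their iterative denoising process. It proposes BlockDance, a training-free method that accelerates DiTs by exploiting feature similarities at adjacent time steps. Unlike previous feature-reuse methods, which lack reuse strategies tailored to features at different scales, BlockDance focuses on identifying the most structurally similar features, called Structurally Similar Spatio-Temporal (STSS) features, which are located mainly in the structure-focused blocks of the transformer during the later stages of denoising. By caching and reusing these highly similar features, BlockDance reduces redundant computation and speeds up inference while staying consistent with the outputs of the original model. To account for the diversity of generated content and the varying distribution of redundant features, the paper also proposes BlockDance-Ada, a lightweight decision-making network for instance-specific acceleration that allocates resources dynamically to deliver higher-quality content. Experiments show that both BlockDance and BlockDance-Ada achieve accelerations of 25% to 50% across various generation tasks and models while maintaining generation quality.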

Link: https://arxiv.org/abs/2503.15927
Authors: Hui Zhang, Tingwei Gao, Jie Shao, Zuxuan Wu
Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; ByteDance Intelligent Creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
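The caching idea can be sketched as a denoising loop that recomputes an expensive block only occasionally once the later stage begins. This is a schematic under stated assumptions, not the BlockDance algorithm: the two-block split, the fixed `reuse_interval`, and all function names are hypothetical.

```python
def run_denoising(num_steps, reuse_start, reuse_interval, structure_block, detail_block):
    """Run a toy denoising loop with feature caching.

    Before reuse_start every step recomputes the structure features. From
    reuse_start on, they are recomputed only every reuse_interval steps and
    the cached value is reused in between (an assumed schedule).
    Returns (final_state, number_of_structure_block_calls).
    """
    cached = None
    state = 0.0
    calls = 0
    for step in range(num_steps):
        in_reuse_phase = step >= reuse_start
        refresh = (step - reuse_start) % reuse_interval == 0
        if cached is None or not in_reuse_phase or refresh:
            cached = structure_block(state, step)  # expensive part
            calls += 1
        state = detail_block(cached, step)         # cheap part, always runs
    return state, calls


# Toy stand-ins for network blocks (illustrative only).
_, calls_cached = run_denoising(10, 4, 2, lambda s, t: s + 1.0, lambda f, t: 0.5 * f)
_, calls_full = run_denoising(10, 10, 2, lambda s, t: s + 1.0, lambda f, t: 0.5 * f)
print(calls_cached, calls_full)
```

The saving depends on how much of the per-step cost sits in the cached blocks and on how often the cache can be reused without visibly changing the output.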

[CV-80] Learning to Efficiently Adapt Foundation Models for Self-Supervised Endoscopic 3D Scene Reconstruction from Any Cameras

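[Quick read]: This paper addresses endoscopic depth estimation and 3D scene reconstruction based on self-supervised learning (SSL). Ground-truth labels are hard to obtain, and foundation models applied directly to the medical domain often perform poorly, so the paper proposes Endo3DAC, a unified framework that improves endoscopic tasks by efficiently adapting foundation models. Endo3DAC uses an integrated network that simultaneously estimates depth maps, relative poses, and camera intrinsics. By freezing the foundation-model backbone and training only the purpose-built Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) module together with separate decoder heads, it achieves strong depth and pose estimation while keeping training efficient. The paper also proposes a 3D scene reconstruction pipeline that optimizes the scales and shifts of the depth maps plus a few other parameters. Experiments on four endoscopic datasets show that Endo3DAC significantly outperforms other state-of-the-art methods with fewer trainable parameters. According to the authors, it is the first work to perform both self-supervised depth estimation and scene reconstruction with a single network that requires only surgical videos. The code will be released upon acceptance.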

Link: https://arxiv.org/abs/2503.15917
Authors: Beilei Cui, Long Bai, Mobarakol Islam, An Wang, Zhiqi Ma, Yiming Huang, Feng Li, Zhen Chen, Zhongliang Jiang, Nassir Navab, Hongliang Ren
Affiliations: The Chinese University of Hong Kong; University College London; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate 3D scene reconstruction is essential for numerous medical tasks. Given the challenges in obtaining ground truth data, there has been an increasing focus on self-supervised learning (SSL) for endoscopic depth estimation as a basis for scene reconstruction. While foundation models have shown remarkable progress in visual tasks, their direct application to the medical domain often leads to suboptimal results. However, the visual features from these models can still enhance endoscopic tasks, emphasizing the need for efficient adaptation strategies, which still lack exploration currently. In this paper, we introduce Endo3DAC, a unified framework for endoscopic scene reconstruction that efficiently adapts foundation models. We design an integrated network capable of simultaneously estimating depth maps, relative poses, and camera intrinsic parameters. By freezing the backbone foundation model and training only the specially designed Gated Dynamic Vector-Based Low-Rank Adaptation (GDV-LoRA) with separate decoder heads, Endo3DAC achieves superior depth and pose estimation while maintaining training efficiency. Additionally, we propose a 3D scene reconstruction pipeline that optimizes depth maps’ scales, shifts, and a few parameters based on our integrated network. Extensive experiments across four endoscopic datasets demonstrate that Endo3DAC significantly outperforms other state-of-the-art methods while requiring fewer trainable parameters. To our knowledge, we are the first to utilize a single network that only requires surgical videos to perform both SSL depth estimation and scene reconstruction tasks. The code will be released upon acceptance.
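
For readers unfamiliar with low-rank adaptation, the following sketch shows plain LoRA only: a frozen weight matrix plus a trainable low-rank update. It does not reproduce GDV-LoRA, whose gating and dynamic vectors are not described in the abstract. The names `matvec`, `lora_forward`, and `alpha` are illustrative assumptions.

```python
# Plain LoRA forward pass in pure Python (illustrative only).
# GDV-LoRA's gating and dynamic vectors are NOT modeled here.

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w_frozen, a, b, x, alpha=1.0):
    """Compute y = W x + alpha * B (A x).

    w_frozen: d_out x d_in frozen backbone weight.
    a:        r x d_in trainable down-projection.
    b:        d_out x r trainable up-projection.
    Only `a` and `b` would receive gradients during adaptation.
    """
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [u + alpha * v for u, v in zip(base, update)]
```

With `b` initialized to zeros the adapted layer reproduces the frozen backbone exactly, which is the usual reason LoRA-style adapters can be attached without disturbing the pretrained model at the start of training.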

[CV-81] Text-Driven Diffusion Model for Sign Language Production

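[Quick read]: This paper addresses generating semantically aligned sign language pose sequences from text input, and proposes a Text-driven Diffusion Model (TDM) framework for the task. The key elements are a joint loss L_joint, which measures and minimizes the difference in joint positions between generated pose sequences and the ground truth, and a bone orientation loss L_bone, which keeps the orientation of bones in the generated poses aligned with the correct orientations. At inference time, TDM starts from a noisy sequence and, constrained by the text condition, gradually refines it into a semantically consistent sign language pose sequence. The method achieves a BLEU-1 score of 20.17, placing second in the challenge.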

Link: https://arxiv.org/abs/2503.15914
Authors: Jiayi He, Xu Wang, Ruobei Zhang, Shengeng Tang, Yaxiong Wang, Lechao Cheng
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Abstract:We introduce the hfut-lmc team’s solution to the SLRTP Sign Production Challenge. The challenge aims to generate semantically aligned sign language pose sequences from text inputs. To this end, we propose a Text-driven Diffusion Model (TDM) framework. During the training phase, TDM utilizes an encoder to encode text sequences and incorporates them into the diffusion model as conditional input to generate sign pose sequences. To guarantee the high quality and accuracy of the generated pose sequences, we utilize two key loss functions. The joint loss function L_joint is used to precisely measure and minimize the differences between the joint positions of the generated pose sequences and those of the ground truth. Similarly, the bone orientation loss function L_bone is instrumental in ensuring that the orientation of the bones in the generated poses aligns with the actual, correct orientations. In the inference stage, the TDM framework takes on a different yet equally important task. It starts with noisy sequences and, under the strict constraints of the text conditions, gradually refines and generates semantically consistent sign language pose sequences. Our carefully designed framework performs well on the sign language production task, and our solution achieves a BLEU-1 score of 20.17, placing second in the challenge.
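
The abstract names the two losses but does not give their formulas. The sketch below is one plausible reading under explicit assumptions: L_joint as the mean Euclidean distance between corresponding joints, and L_bone as the mean of one minus the cosine similarity between bone direction vectors. The function names, the choice of distance, and the `bones` index pairs are all hypothetical.

```python
import math

# Hypothetical forms of a joint-position loss and a bone-orientation loss.
# The paper's exact definitions are not given in the abstract.

def joint_loss(pred, target):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    dists = [math.dist(p, q) for p, q in zip(pred, target)]
    return sum(dists) / len(dists)


def _unit(v):
    """Normalize a vector to unit length (zero vectors are returned as-is)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0 else list(v)


def bone_loss(pred, target, bones):
    """Mean (1 - cosine similarity) between predicted and true bone directions.

    bones: list of (parent_index, child_index) pairs into the joint lists.
    """
    total = 0.0
    for parent, child in bones:
        d_pred = _unit([c - p for p, c in zip(pred[parent], pred[child])])
        d_true = _unit([c - p for p, c in zip(target[parent], target[child])])
        total += 1.0 - sum(a * b for a, b in zip(d_pred, d_true))
    return total / len(bones)
```

Note how the two terms complement each other in this reading: translating an entire pose changes `joint_loss` but leaves `bone_loss` near zero, because bone directions are unaffected by a global shift.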

[CV-82] No Thing Nothing: Highlighting Safety-Critical Classes for Robust LiDAR Semantic Segmentation in Adverse Weather CVPR2025

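[Quick read]: This paper addresses the problem that existing domain generalization methods for LiDAR semantic segmentation in adverse weather predict “things” (object) categories far less accurately than “stuff” (background) categories. In typical driving scenes, “things” categories are dynamic and closely tied to collision risk, so recognizing them accurately is essential for safe navigation and planning. The paper identifies the performance drop on “things” as a major bottleneck of existing methods and observes that adverse weather both degrades semantic-level features and corrupts local features, causing “things” to be misclassified as “stuff”. To mitigate this, it proposes NTN (segmeNt Things for No-accident). To counter semantic-level degradation, NTN binds each point feature to its superclass, which prevents “things” classes from being misclassified into visually dissimilar categories. To improve robustness to local corruption, it treats each LiDAR beam as a local region and adds a regularization term that aligns clean data with its corrupted counterpart in feature space. NTN gains +2.6 mIoU on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on SemanticPOSS-to-SemanticSTF, and improves “things” classes by +4.8 and +7.9 mIoU, respectively.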

Link: https://arxiv.org/abs/2503.15910
Authors: Junsung Park, Hwijeong Lee, Inha Kang, Hyunjung Shim
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, CVPR 2025

Abstract:Existing domain generalization methods for LiDAR semantic segmentation under adverse weather struggle to accurately predict “things” categories compared to “stuff” categories. In typical driving scenes, “things” categories can be dynamic and associated with higher collision risks, making them crucial for safe navigation and planning. Recognizing the importance of “things” categories, we identify their performance drop as a serious bottleneck in existing approaches. We observed that adverse weather induces both degradation of semantic-level features and corruption of local features, leading to a misprediction of “things” as “stuff”. To mitigate these corruptions, we suggest our method, NTN - segmeNt Things for No-accident. To address semantic-level feature corruption, we bind each point feature to its superclass, preventing the misprediction of things classes into visually dissimilar categories. Additionally, to enhance robustness against local corruption caused by adverse weather, we define each LiDAR beam as a local region and propose a regularization term that aligns the clean data with its corrupted counterpart in feature space. NTN achieves state-of-the-art performance with a +2.6 mIoU gain on the SemanticKITTI-to-SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS-to-SemanticSTF benchmark. Notably, NTN achieves a +4.8 and +7.9 mIoU improvement on “things” classes, respectively, highlighting its effectiveness.

[CV-83] Enhancing Close-up Novel View Synthesis via Pseudo-labeling AAAI2025

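[Quick read]: This paper addresses the weakness of existing methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) at rendering detailed close-up views that deviate substantially from the training viewpoints. The core difficulty is the lack of training data specific to close-up views, which leaves current methods unable to render them accurately. The paper proposes a pseudo-label-based learning strategy that derives pseudo-labels from the existing training data to provide targeted supervision across a wide range of close-up viewpoints. It also introduces a new benchmark dataset for evaluating current and future methods in this setting. Experiments confirm the effectiveness of the approach.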

Link: https://arxiv.org/abs/2503.15908
Authors: Jiatong Xia, Libo Sun, Lingqiao Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025

Abstract:Recent methods, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated remarkable capabilities in novel view synthesis. However, despite their success in producing high-quality images for viewpoints similar to those seen during training, they struggle when generating detailed images from viewpoints that significantly deviate from the training set, particularly in close-up views. The primary challenge stems from the lack of specific training data for close-up views, leading to the inability of current methods to render these views accurately. To address this issue, we introduce a novel pseudo-label-based learning strategy. This approach leverages pseudo-labels derived from existing training data to provide targeted supervision across a wide range of close-up viewpoints. Recognizing the absence of benchmarks for this specific challenge, we also present a new dataset designed to assess the effectiveness of both current and future methods in this area. Our extensive experiments demonstrate the efficacy of our approach.

[CV-84] Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation

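[Quick read]: This paper addresses the blurry, artifact-prone predictions of self-supervised monocular depth estimation caused by occlusions, texture-less regions, and illumination changes. It proposes Jasmine, a self-supervised framework built on Stable Diffusion (SD). The first key component is a surrogate task of hybrid image reconstruction, which preserves the SD model's detail priors by reconstructing the images themselves, without additional supervision and without degrading depth estimation. The second is the Scale-Shift GRU, which bridges the distribution gap between SD's scale-and-shift-invariant output and self-supervised scale-invariant depth estimation, and also shields the fine-grained texture of the SD output from interference by the reprojection loss. Experiments show that Jasmine reaches state-of-the-art (SoTA) performance on the KITTI benchmark and generalizes well zero-shot across multiple datasets.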

Link: https://arxiv.org/abs/2503.15905
Authors: Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
Affiliations: BJTU (Beijing Jiaotong University); HKUST (Hong Kong University of Science and Technology); NTU (Nanyang Technological University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose Jasmine, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD’s visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (e.g., occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD’s latent priors. To resolve this, we construct a novel surrogate task of hybrid image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD’s scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.
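
To make "scale and shift invariant" concrete: a prediction of this kind is only defined up to an affine map, so it has to be aligned before it can be compared with another depth signal. The paper handles this with a learned Scale-Shift GRU; the sketch below is not that module. It shows the classical closed-form least-squares alignment as a swapped-in illustration, and the name `align_scale_shift` is hypothetical.

```python
# Closed-form least-squares scale/shift alignment (illustrative only).
# This is a classical substitute, NOT the paper's Scale-Shift GRU.

def align_scale_shift(pred, target):
    """Return (scale, shift) minimizing sum((scale * p + shift - t) ** 2).

    pred, target: equal-length, non-empty lists of per-pixel values.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p if var_p > 0 else 1.0  # constant prediction: keep scale 1
    shift = mean_t - scale * mean_p
    return scale, shift
```

A scale-invariant prediction, by contrast, is defined up to a scale factor only, which is the mismatch between the two kinds of output that the abstract describes.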

[CV-85] Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions CVPR2025

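[Quick read]: This paper addresses reconstructing human-object interactions (HOI) from a single image, and the limitation that existing methods, for lack of 3D data, are confined to indoor scenes and generalize poorly to real-world scenes with a wide variety of objects. It proposes a pipeline for annotating fine-grained 3D humans, objects, and their interactions from 2D HOI images. Using single-image 3D reconstruction, the authors annotate more than 2,500 3D HOI assets and build Open3DHOI, the first open-vocabulary in-the-wild 3D HOI dataset. The paper also designs a Gaussian-HOI optimizer that efficiently reconstructs the spatial interactions between humans and objects while learning contact regions. In short, the solution uses recent single-image 3D reconstruction to lift existing 2D datasets to 3D and adds a new optimization method that improves the accuracy and scope of interaction understanding.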

Link: https://arxiv.org/abs/2503.15898
Authors: Boran Wen, Dingbang Huang, Zichen Zhang, Jiahong Zhou, Jianbin Deng, Jingyu Gong, Yulong Chen, Lizhuang Ma, Yong-Lu Li
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Reconstructing human-object interactions (HOI) from single images is fundamental in computer vision. Existing methods are primarily trained and tested on indoor scenes due to the lack of 3D data, particularly constrained by the object variety, making it challenging to generalize to real-world scenes with a wide range of objects. The limitations of previous 3D HOI datasets were primarily due to the difficulty in acquiring 3D object assets. However, with the development of 3D reconstruction from single images, recently it has become possible to reconstruct various objects from 2D HOI images. We therefore propose a pipeline for annotating fine-grained 3D humans, objects, and their interactions from single images. We annotated 2.5k+ 3D HOI assets from existing 2D HOI datasets and built the first open-vocabulary in-the-wild 3D HOI dataset Open3DHOI, to serve as a future test set. Moreover, we design a novel Gaussian-HOI optimizer, which efficiently reconstructs the spatial interactions between humans and objects while learning the contact regions. Besides the 3D HOI reconstruction, we also propose several new tasks for 3D HOI understanding to pave the way for future work. Data and code will be publicly available at this https URL.

[CV-86] Learning 3D Scene Analogies with Neural Contextual Scene Maps

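[Quick read]: This paper addresses the difficulty machines have in applying prior knowledge to tasks in unseen or noisy 3D environments. Since data-driven learning cannot comprehensively capture the diversity of layouts and open spaces, the paper proposes teaching machines to understand scene context by identifying relational commonalities in 3D scenes. The key idea is 3D scene analogies: smooth maps between 3D scene regions that align spatial relationships. Unlike maps at the single-instance level, they smoothly link large scene regions. To find them, the paper proposes Neural Contextual Scene Maps, which extract descriptor fields summarizing semantic and geometric context and align these fields holistically in a coarse-to-fine manner to estimate the map. This reduces reliance on individual feature points and makes the method robust to input noise and shape variation.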

Link: https://arxiv.org/abs/2503.15897
Authors: Junho Kim, Gwangtak Bae, Eun Sun Lee, Young Min Kim
Affiliations: Dept. of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence and INMC, Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As data-driven learning is intractable to comprehensively encapsulate diverse ranges of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.

[CV-87] UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis

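[Quick read]: This paper addresses Hierarchical Document Structure Analysis (HDSA), which aims to recover the hierarchical structure of documents created with authoring software that uses hierarchical schemas. Prior work falls into two groups: methods that tackle individual HDSA subtasks in isolation (such as table detection or reading order prediction), and unified frameworks that use multiple branches or modules, each handling a different task. The paper proposes UniHDSA, a unified relation prediction approach that treats the various HDSA subtasks as relation prediction problems and consolidates all relation labels into a single label space. A single relation prediction module can then handle multiple tasks at once, at either the page level or the document level. The approach is validated with an end-to-end multimodal system built on Transformer architectures; it achieves state-of-the-art performance on the Comp-HRDoc benchmark and competitive results on the large-scale DocLayNet dataset.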

Link: https://arxiv.org/abs/2503.15893
Authors: Jiawei Wang, Kai Hu, Qiang Huo
Affiliations: Microsoft Research Asia; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Pattern Recognition. arXiv admin note: substantial text overlap with arXiv:2405.11757

Abstract:Document structure analysis, aka document layout analysis, is crucial for understanding both the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created using authoring software with hierarchical schemas. Previous research has primarily followed two approaches: one focuses on tackling specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework that uses multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats various HDSA sub-tasks as relation prediction problems and consolidates relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, whether at a page-level or document-level structure analysis. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, effectively illustrating the superiority of our method across all sub-tasks.

[CV-88] UMIT: Unifying Medical Imaging Tasks via Vision-Language Models

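[Quick read]: This paper addresses the limited multi-task, multi-modal generality of existing Vision-Language Models (VLMs) in medical image analysis. Prior work concentrates on specific tasks or single modalities, which limits applicability and generalization across diverse medical scenarios. The paper proposes UMIT, a unified multi-modal, multi-task VLM designed for medical imaging. Its key elements are a two-stage training strategy and fine-tuning with designed instruction templates, which improve the model's task-handling capability and adaptability. UMIT supports multiple imaging modalities (e.g., X-ray, CT, and PET), multiple tasks (e.g., visual question answering, disease detection, and medical report generation), and both English and Chinese, with the stated aim of improving diagnostic accuracy and workflow efficiency.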

Link: https://arxiv.org/abs/2503.15892
Authors: Haiyang Yu, Siyang Yi, Ke Niu, Minghan Zhuo, Bin Li
Affiliations: Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid advancement of deep learning, particularly in the field of medical image analysis, an increasing number of Vision-Language Models (VLMs) are being widely applied to solve complex health and biomedical challenges. However, existing research has primarily focused on specific tasks or single modalities, which limits their applicability and generalization across diverse medical scenarios. To address this challenge, we propose UMIT, a unified multi-modal, multi-task VLM designed specifically for medical imaging tasks. UMIT is able to solve various tasks, including visual question answering, disease detection, and medical report generation. In addition, it is applicable to multiple imaging modalities (e.g., X-ray, CT and PET), covering a wide range of applications from basic diagnostics to complex lesion analysis. Moreover, UMIT supports both English and Chinese, expanding its applicability globally and ensuring accessibility to healthcare services in different linguistic contexts. To enhance the model’s adaptability and task-handling capability, we design a unique two-stage training strategy and fine-tune UMIT with designed instruction templates. Through extensive empirical evaluation, UMIT outperforms previous methods in five tasks across multiple datasets. The performance of UMIT indicates that it can significantly enhance diagnostic accuracy and workflow efficiency, thus providing effective solutions for medical imaging applications.

[CV-89] DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering

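[Quick read]: This paper addresses multimodal understanding of document-centric instructional videos, which pack dense, closely related text and images and therefore demand strong cross-modal understanding. The area remains underexplored because of the scarcity of datasets and its inherent complexity. The paper introduces the DocVideoQA task and dataset and proposes DV-LLaMA, a multimodal large language model (MLLM) tailored to document-centric videos. DV-LLaMA strengthens unimodal feature extraction with diverse instruction-tuning data and uses contrastive learning to strengthen modality integration, yielding significant gains in document-centric video understanding.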

Link: https://arxiv.org/abs/2503.15887
Authors: Haochen Wang, Kai Hu, Liangcai Gao
Affiliations: Peking University; University of Science and Technology of China; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote work and online courses have become important methods of knowledge dissemination, leading to a large number of document-based instructional videos. Unlike traditional video datasets, these videos mainly feature rich-text images and audio that are densely packed with information closely tied to the visual content, requiring advanced multimodal understanding capabilities. However, this domain remains underexplored due to dataset availability and its inherent complexity. In this paper, we introduce the DocVideoQA task and dataset for the first time, comprising 1454 videos across 23 categories with a total duration of about 828 hours. The dataset is annotated with 154k question-answer pairs generated manually and via GPT, assessing models’ comprehension, temporal awareness, and modality integration capabilities. Initially, we establish a baseline using open-source MLLMs. Recognizing the challenges in modality comprehension for document-centric videos, we present DV-LLaMA, a robust video MLLM baseline. Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration. Through fine-tuning, the LLM is equipped with audio-visual capabilities, leading to significant improvements in document-centric video understanding. Extensive testing on the DocVideoQA dataset shows that DV-LLaMA significantly outperforms existing models. We’ll release the code and dataset to facilitate future research.

[CV-90] Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance

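[Quick read]: This paper addresses the underperformance of existing Vision-Language Models (VLMs) on zero-shot image recognition in real-world applications, which it attributes to sub-optimal prompt engineering and an inability to adapt effectively to target classes. It proposes the Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts humans use in image recognition as latent variables and formulates recognition as a sum over the concept space, weighted by a prior distribution and a likelihood function. To handle the intractable computation over an infinite concept space, it introduces an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts that emphasize inter-class differences. It further proposes three heuristics, Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Evaluations on fifteen datasets show that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.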

Link: https://arxiv.org/abs/2503.15886
Authors: Hui Liu, Wenya Wang, Kecheng Chen, Jie Liu, Yibing Liu, Tiexin Qin, Peisong He, Xinghao Jiang, Haoliang Li
Affiliations: City University of Hong Kong; Nanyang Technological University; City University of Hong Kong; City University of Hong Kong; unknown; unknown; Sichuan University; Shanghai Jiao Tong University; City University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures, 7 tables

Abstract:In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and the inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes’ theorem, CHBR models the concept used in human image recognition as latent variables and formulates this task by summing across potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts, emphasizing inter-class differences. We further propose three heuristic approaches involving Average Likelihood, Confidence Likelihood, and Test Time Augmentation (TTA) Likelihood, which dynamically refine the combination of concepts based on the test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
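
The "sum across potential concepts, weighted by a prior and a likelihood" can be pictured with a small marginalization over a finite set of sampled concepts. This is a sketch under stated assumptions rather than the paper's formulation: each class is given a handful of concepts, each with a prior weight and a likelihood for the test image, and `class_posterior` is a hypothetical name. How CHBR actually obtains these quantities (importance sampling with an LLM, the three likelihood heuristics) is not modeled.

```python
# Marginalizing over a finite set of sampled concepts (illustrative only).
# The (prior, likelihood) values are assumed to be given; CHBR's sampling
# procedure and likelihood heuristics are not reproduced here.

def class_posterior(concepts_by_class):
    """Score each class by summing prior * likelihood over its concepts.

    concepts_by_class: {label: [(prior, likelihood), ...]}
    Returns {label: normalized score}; scores are zero if everything is zero.
    """
    scores = {
        label: sum(prior * lik for prior, lik in items)
        for label, items in concepts_by_class.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {label: 0.0 for label in scores}
    return {label: s / total for label, s in scores.items()}
```

For example, a class whose two equally weighted concepts match the image with likelihoods 0.8 and 0.6 receives an unnormalized score of 0.7, and the final prediction is the class with the largest normalized score.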

[CV-91] Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

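[Quick read]: This paper addresses the scarcity of high-quality 3D data, which has held back 3D diffusion models and left them less competitive than their 2D counterparts. It proposes repurposing pre-trained 2D diffusion models for 3D object generation. The key is Gaussian Atlas, a new representation based on dense 2D grids that makes it possible to fine-tune 2D diffusion models to generate 3D Gaussians, with transfer learning onto a 2D manifold flattened from 3D structures. To support training, the authors compile GaussianVerse, a large-scale dataset of 205K high-quality 3D Gaussian fittings. Experiments show that text-to-image diffusion models can be adapted effectively for 3D content generation, bridging the gap between 2D and 3D modeling.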

Link: https://arxiv.org/abs/2503.15877
Authors: Tiange Xiang, Kai Li, Chengjiang Long, Christian Häne, Peihong Guo, Scott Delp, Ehsan Adeli, Li Fei-Fei
Affiliations: Stanford University; Meta Reality Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.

[CV-92] MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

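[Quick read]: This paper addresses a challenge for data-driven autonomous driving: the need for rare and diverse training data. World models can synthesize annotated video data for training, but existing methods accumulate errors and struggle to generate long, consistent videos, especially in dynamic scenes. The paper proposes MiLA, a framework whose key is a Coarse-to-Re(fine) approach that both stabilizes video generation and corrects the distortion of dynamic objects. It also introduces a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Experiments on the nuScenes dataset show that MiLA achieves state-of-the-art video generation quality.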

Link: https://arxiv.org/abs/2503.15875
Authors: Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, Bing Wang
Affiliations: Nanjing University; Xiaomi EV; Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: project website: this https URL

Abstract:In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: this https URL.

[CV-93] MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations CVPR2025

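[Quick read]: This paper addresses action-scene hallucination in Video Large Language Models (Video-LLMs), where a model wrongly predicts actions from the scene context or scenes from the observed actions. It identifies two main causes: existing Video-LLMs intermingle spatial and temporal features by applying attention across all tokens, and they use the standard Rotary Position Embedding (RoPE), which makes text tokens overemphasize certain types of tokens depending on their sequential order. The paper proposes MASH-VLM (Mitigating Action-Scene Hallucination in Video-LLMs), which mitigates the problem through disentangled spatial-temporal representations. Its two key innovations are DST-attention, an attention mechanism that uses masked attention to restrict direct interaction between spatial and temporal tokens, and Harmonic-RoPE, which extends the dimensionality of the positional IDs so that spatial and temporal tokens keep balanced positions relative to the text tokens.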

Link: https://arxiv.org/abs/2503.15871
Authors: Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, Jinwoo Choi
Affiliations: Kyung Hee University; LG AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for CVPR 2025

Abstract:In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.
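
The masking idea behind DST-attention can be sketched as a boolean attention mask built from token types. This is a minimal illustration, assuming three token types and assuming that text tokens may attend to everything; the abstract only states that direct spatial-temporal interaction is restricted, so the treatment of text tokens, the absence of a causal mask, and the names `dst_mask` and the type codes are all hypothetical.

```python
# Boolean attention mask that blocks direct spatial <-> temporal attention.
# Token type codes and the handling of text tokens are assumptions.

def dst_mask(token_types):
    """Build mask[i][j] = True if query token i may attend to key token j.

    token_types: list of "S" (spatial), "T" (temporal), or "X" (text).
    Spatial and temporal tokens cannot attend to each other directly;
    every other pair, including self-attention, is allowed.
    """
    n = len(token_types)
    mask = [[True] * n for _ in range(n)]
    for i, q_type in enumerate(token_types):
        for j, k_type in enumerate(token_types):
            if {q_type, k_type} == {"S", "T"}:
                mask[i][j] = False
    return mask
```

Under these assumptions, spatial and temporal information can still be combined, but only indirectly through the text tokens that see both.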

[CV-94] UniCoRN: Latent Diffusion-based Unified Controllable Image Restoration Network across Multiple Degradations

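[Quick read]: This paper addresses the fact that most image restoration methods handle only one type of degradation at a time (such as blur, noise, or haze) and cannot cope with the multiple simultaneous degradations found in practice. It proposes UniCoRN, a unified restoration approach that handles multiple degradation types at once with a multi-head diffusion model. Specifically, it uses low-level visual cues extracted from the image to guide a controllable diffusion model, and designs a multi-head control network that adapts via a mixture-of-experts strategy. Training follows a carefully designed curriculum learning recipe and assumes no specific degradation type. The paper also introduces MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Experiments on several challenging datasets show significant performance gains and robust restoration of severely degraded images.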

Link: https://arxiv.org/abs/2503.15868
Authors: Debabrata Mandal, Soumitri Chattopadhyay, Guansen Tong, Praneeth Chakravarthula
Affiliations: University of North Carolina at Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image restoration is essential for enhancing degraded images across computer vision tasks. However, most existing methods address only a single type of degradation (e.g., blur, noise, or haze) at a time, limiting their real-world applicability where multiple degradations often occur simultaneously. In this paper, we propose UniCoRN, a unified image restoration approach capable of handling multiple degradation types simultaneously using a multi-head diffusion model. Specifically, we uncover the potential of low-level visual cues extracted from images in guiding a controllable diffusion model for real-world image restoration and we design a multi-head control network adaptable via a mixture-of-experts strategy. We train our model without any prior assumption of specific degradations, through a smartly designed curriculum learning recipe. Additionally, we also introduce MetaRestore, a metalens imaging benchmark containing images with multiple degradations and artifacts. Extensive evaluations on several challenging datasets, including our benchmark, demonstrate that our method achieves significant performance gains and can robustly restore images with severe degradations. Project page: this https URL
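
A mixture-of-experts strategy over several control heads usually means a gate produces one weight per head and the head outputs are blended with those weights. The sketch below shows that generic pattern with softmax gating; it is an assumption about the general mechanism, not UniCoRN's architecture, and `softmax`, `mix_heads`, `head_outputs`, and `gate_logits` are hypothetical names.

```python
import math

# Generic softmax-gated mixture over per-head outputs (illustrative only).
# How UniCoRN computes its gate and head features is not modeled here.

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def mix_heads(head_outputs, gate_logits):
    """Blend per-head feature vectors with softmax gate weights.

    head_outputs: list of equal-length feature vectors, one per head.
    gate_logits:  one unnormalized score per head.
    """
    weights = softmax(gate_logits)
    dim = len(head_outputs[0])
    return [
        sum(w * h[d] for w, h in zip(weights, head_outputs))
        for d in range(dim)
    ]
```

With equal gate scores the heads contribute equally; when one score dominates, the blend approaches that single head's output, which is how a gate can favor, say, a deblurring-oriented head for a blurry input.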

[CV-95] TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data

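[Quick read]: This paper addresses two challenges in DeepFake detection: existing methods are mostly limited to binary classification (real vs. fake) and lack interpretability, and they struggle to handle both face-manipulated DeepFakes and fully AI-generated content while supporting fine-grained queries. It proposes TruthLens, a framework that combines the global contextual understanding of multimodal large language models such as PaliGemma2 with the localized feature extraction of vision-only models such as DINOv2. This hybrid design draws on the complementary strengths of both, enabling robust detection of subtle manipulations while keeping predictions interpretable. Experiments show that TruthLens outperforms state-of-the-art methods across diverse datasets in detection accuracy (by 2-14%) and explainability, and generalizes across traditional and emerging manipulation techniques.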

链接: https://arxiv.org/abs/2503.15867
作者: Rohit Kundu,Athula Balachandran,Amit K. Roy-Chowdhury
机构: Google LLC (谷歌有限责任公司); University of California, Riverside (加州大学河滨分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract: Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, yet existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel and highly generalizable framework for DeepFake detection that not only determines whether an image is real or fake but also provides detailed textual reasoning for its predictions. Unlike traditional methods, TruthLens effectively handles both face-manipulated DeepFakes and fully AI-generated content while addressing fine-grained queries such as “Does the eyes/nose/mouth look real or fake?” The architecture of TruthLens combines the global contextual understanding of multimodal large language models like PaliGemma2 with the localized feature extraction capabilities of vision-only models like DINOv2. This hybrid design leverages the complementary strengths of both models, enabling robust detection of subtle manipulations while maintaining interpretability. Extensive experiments on diverse datasets demonstrate that TruthLens outperforms state-of-the-art methods in detection accuracy (by 2-14%) and explainability, in both in-domain and cross-data settings, generalizing effectively across traditional and emerging manipulation techniques.

[CV-96] VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

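[Quick read]: This paper addresses the training instability that arises, because of the modality gap, when existing direct text-to-3D models are extended to jointly model multi-view images and camera poses. The key to the solution is a dual-stream architecture that couples a pre-trained video generation model with a dedicated camera pose generation model through communication blocks, so that multi-view images and camera poses are generated in separate streams and interference between the modalities is reduced. The paper also proposes an asynchronous sampling strategy that denoises camera poses faster and uses the denoised poses to condition multi-view image generation, which improves cross-modal consistency and reduces ambiguity. Trained on multiple large-scale real-world datasets, the method outperforms existing approaches that rely on post-hoc refinement.

Link: https://arxiv.org/abs/2503.15855
Authors: Hyojun Go, Byeongjun Park, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL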

Abstract:We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
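
The asynchronous sampling idea, in which poses are denoised faster than images so that nearly clean poses can condition image generation, can be illustrated with a toy noise schedule. This is a sketch under stated assumptions only: the linear schedule and the `pose_speedup` parameter are hypothetical and are not taken from the paper.

```python
def async_schedule(num_steps, pose_speedup=2.0):
    """Return (image_noise, pose_noise) levels for each sampling step.

    Noise levels run from 1.0 (pure noise) to 0.0 (clean). The pose stream
    is denoised `pose_speedup` times faster and then held at 0.0.
    `pose_speedup` is a hypothetical knob, not a value from the paper.
    """
    schedule = []
    for step in range(num_steps + 1):
        progress = step / num_steps
        image_noise = 1.0 - progress
        pose_noise = max(0.0, 1.0 - pose_speedup * progress)
        schedule.append((image_noise, pose_noise))
    return schedule

for image_noise, pose_noise in async_schedule(num_steps=4):
    print(f"image noise {image_noise:.2f} | pose noise {pose_noise:.2f}")
```

In this toy schedule the pose stream is already clean halfway through sampling, so the second half of image denoising sees fixed, noise-free poses.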

[CV-97] Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion CVPR2025

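[Quick read]: This paper addresses the large training-data requirement of animatable head avatar generation by building on data-free static avatar generation methods. Directly distilling 4D avatars from a video diffusion model often gives over-smooth results because the generated videos lack spatial and temporal consistency. To solve this, the paper proposes Zero-1-to-A, which constructs a spatially and temporally consistent dataset for optimizing animatable avatars. Its key is a progressive learning strategy in two stages: spatial consistency learning, which fixes the expression and learns from front to side views, followed by temporal consistency learning, which fixes the view and learns from relaxed to exaggerated expressions, so that 4D avatars are generated from simple to complex. Experiments show that the method surpasses existing diffusion-based methods in fidelity, animation quality, and rendering speed.

Link: https://arxiv.org/abs/2503.15851
Authors: Zhou Zhenglin, Ma Fan, Fan Hehe, Chua Tat-Seng
Affiliations: State Key Laboratory of Brain-machine Intelligence, Zhejiang University; ReLER, CCAI, Zhejiang University; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025, project page: this https URL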

Abstract:Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: this https URL.

[CV-98] What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?

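[Quick read]: This paper addresses three key problems of existing methods for Dynamic Scene Graph Generation (DSGG): a severe precision-recall trade-off, a lack of awareness of triplet importance, and inappropriate evaluation protocols. Although Large Multimodal Models (LMMs) have shown strong video understanding, they had not been applied to fine-grained, frame-wise tasks such as DSGG. The key finding is that Video LMMs with a simple decoder-only structure can be turned into state-of-the-art scene graph generators that overcome these problems with little fine-tuning (only 5-10% of the training data).

Link: https://arxiv.org/abs/2503.15846
Authors: Xuanming Cui, Jaiminkumar Ashokbhai Bhoi, Chionh Wei Peng, Adriel Kuek, Ser Nam Lim
Affiliations: University of Central Florida; DSO National Laboratories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: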

Abstract:Dynamic Scene Graph Generation (DSGG) for videos is a challenging task in computer vision. While existing approaches often focus on sophisticated architectural design and solely use recall during evaluation, we take a closer look at their predicted scene graphs and discover three critical issues with existing DSGG methods: severe precision-recall trade-off, lack of awareness on triplet importance, and inappropriate evaluation protocols. On the other hand, recent advances of Large Multimodal Models (LMMs) have shown great capabilities in video understanding, yet they have not been tested on fine-grained, frame-wise understanding tasks like DSGG. In this work, we conduct the first systematic analysis of Video LMMs for performing DSGG. Without relying on sophisticated architectural design, we show that LMMs with simple decoder-only structure can be turned into State-of-the-Art scene graph generators that effectively overcome the aforementioned issues, while requiring little finetuning (5-10% training data).
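
To see why evaluating with recall alone can hide a precision-recall trade-off, consider a toy comparison over (subject, predicate, object) triplets. The triplets and the two hypothetical models below are invented for illustration and do not come from the paper.

```python
def precision_recall(predicted, ground_truth):
    """Precision and recall over sets of (subject, predicate, object) triplets."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    hits = len(predicted & ground_truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

truth = {("person", "holding", "cup"), ("person", "sitting_on", "sofa")}

# A model that guesses every predicate for each pair reaches full recall...
predicates = ["holding", "sitting_on", "looking_at", "touching"]
exhaustive = {("person", p, o) for p in predicates for o in ("cup", "sofa")}
# ...while a selective model predicts only what it is confident about.
selective = {("person", "holding", "cup")}

print(precision_recall(exhaustive, truth))
print(precision_recall(selective, truth))
```

The exhaustive model scores perfect recall with only a quarter of its predictions correct, so a recall-only protocol would rank it above the selective model.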

[CV-99] BARD-GS: Blur-Aware Reconstruction of Dynamic Scenes via Gaussian Splatting CVPR2025

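[Quick read]: This paper addresses the drop in dynamic scene reconstruction quality caused by blurry input images and imprecise camera poses. Its key contribution is BARD-GS, a method that explicitly decomposes motion blur into camera motion blur and object motion blur and models the two separately, which improves rendering in dynamic regions. The authors also build a dataset of real-world motion-blurred scenes to evaluate the method. Experiments show that BARD-GS significantly outperforms existing methods under realistic conditions and reconstructs high-quality dynamic scenes.

Link: https://arxiv.org/abs/2503.15835
Authors: Yiren Lu, Yunlai Zhou, Disheng Liu, Tuo Liang, Yu Yin
Affiliations: Case Western Reserve University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025. Project page at this https URL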

Abstract:3D Gaussian Splatting (3DGS) has shown remarkable potential for static scene reconstruction, and recent advancements have extended its application to dynamic scenes. However, the quality of reconstructions depends heavily on high-quality input images and precise camera poses, which are not that trivial to fulfill in real-world scenarios. Capturing dynamic scenes with handheld monocular cameras, for instance, typically involves simultaneous movement of both the camera and objects within a single exposure. This combined motion frequently results in image blur that existing methods cannot adequately handle. To address these challenges, we introduce BARD-GS, a novel approach for robust dynamic scene reconstruction that effectively handles blurry inputs and imprecise camera poses. Our method comprises two main components: 1) camera motion deblurring and 2) object motion deblurring. By explicitly decomposing motion blur into camera motion blur and object motion blur and modeling them separately, we achieve significantly improved rendering results in dynamic regions. In addition, we collect a real-world motion blur dataset of dynamic scenes to evaluate our approach. Extensive experiments demonstrate that BARD-GS effectively reconstructs high-quality dynamic scenes under realistic conditions, significantly outperforming existing methods.

[CV-100] EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation CVPR2025

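[Quick read]: This paper addresses the difficulty of video frame interpolation under complex or nonlinear motion, in particular the trouble diffusion models have in generating sharp, temporally consistent frames in large-motion scenes. The proposed method, EDEN (Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation), uses a transformer-based tokenizer to produce refined latent representations of the intermediate frames for the diffusion model, adds temporal attention to the diffusion transformer throughout the process, and incorporates a start-end frame difference embedding to guide the generation of dynamic motion. Together these changes significantly improve interpolation quality in large-motion scenes.

Link: https://arxiv.org/abs/2503.15831
Authors: Zihao Zhang, Haoran Chen, Haoyu Zhao, Guansong Lu, Yanwei Fu, Hang Xu, Zuxuan Wu
Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Noah’s Ark Lab, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025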

Abstract:Handling complex or nonlinear motion patterns has long posed challenges for video frame interpolation. Although recent advances in diffusion-based methods offer improvements over traditional optical flow-based approaches, they still struggle to generate sharp, temporally consistent frames in scenarios with large motion. To address this limitation, we introduce EDEN, an Enhanced Diffusion for high-quality large-motion vidEo frame iNterpolation. Our approach first utilizes a transformer-based tokenizer to produce refined latent representations of the intermediate frames for diffusion models. We then enhance the diffusion transformer with temporal attention across the process and incorporate a start-end frame difference embedding to guide the generation of dynamic motion. Extensive experiments demonstrate that EDEN achieves state-of-the-art results across popular benchmarks, including nearly a 10% LPIPS reduction on DAVIS and SNU-FILM, and an 8% improvement on DAIN-HD.

[CV-101] Computation-Efficient and Recognition-Friendly 3D Point Cloud Privacy Protection

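[Quick read]: This paper addresses privacy leakage in 3D point clouds, a problem that is becoming prominent in applications such as self-driving cars, robotics, and CAD models but remains understudied. Unlike 2D image privacy, which concerns texture and 2D geometric structure, 3D point clouds relate only to 3D geometric structure, so their protection needs a dedicated design. The key solution is PointFlowGMM, an efficient privacy-preserving framework that supports downstream classification and segmentation without access to the original data. A flow-based generative model projects the point cloud into a latent subspace with a Gaussian mixture distribution, and a novel angular similarity loss obfuscates the original geometric structure while shrinking the model from 767MB to 120MB with no loss in recognition performance. The projected point cloud is then randomly rotated by an orthogonal transform in the latent space to strengthen protection; because class-to-class relationships are preserved, the protected point cloud still supports recognition. Experiments on multiple datasets show recognition results on encrypted point clouds comparable to those on the original point clouds.

Link: https://arxiv.org/abs/2503.15818
Authors: Haotian Ma, Lin Gu, Siyi Wu, Yingying Zhu
Affiliations: University of Texas at Arlington (UTA); RIKEN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: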

Abstract:3D point cloud has been widely used in applications such as self-driving cars, robotics, CAD models, etc. To the best of our knowledge, these applications raised the issue of privacy leakage in 3D point clouds, which has not been studied well. Different from the 2D image privacy, which is related to texture and 2D geometric structure, the 3D point cloud is texture-less and only relevant to 3D geometric structure. In this work, we defined the 3D point cloud privacy problem and proposed an efficient privacy-preserving framework named PointFlowGMM that can support downstream classification and segmentation tasks without seeing the original data. Using a flow-based generative model, the point cloud is projected into a latent Gaussian mixture distributed subspace. We further designed a novel angular similarity loss to obfuscate the original geometric structure and reduce the model size from 767MB to 120MB without a decrease in recognition performance. The projected point cloud in the latent space is orthogonally rotated randomly to further protect the original geometric structure, the class-to-class relationship is preserved after rotation, thus, the protected point cloud can support the recognition task. We evaluated our model on multiple datasets and achieved comparable recognition results on encrypted point clouds compared to the original point clouds.
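
The claim that a random orthogonal rotation hides the original coordinates while preserving class-to-class relationships rests on a standard fact: orthogonal transforms preserve inner products, and therefore angles. The sketch below checks this numerically; the latent vectors, the dimension, and the Gram-Schmidt construction are illustrative assumptions, not the paper's implementation.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_orthogonal(dim, rng):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove components along rows already chosen
            proj = dot(v, r)
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(dot(v, v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def transform(matrix, v):
    """Matrix-vector product."""
    return [dot(row, v) for row in matrix]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

rng = random.Random(0)
Q = random_orthogonal(4, rng)
a = [1.0, 2.0, 0.5, -1.0]   # hypothetical latent codes
b = [0.3, -0.7, 2.0, 1.5]
print(cosine(a, b), cosine(transform(Q, a), transform(Q, b)))
```

The two printed cosine similarities agree up to floating-point error, even though the rotated vectors look nothing like the originals. Note that a matrix built this way may include a reflection; a strict rotation would additionally require determinant +1.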

[CV-102] A Vision Centric Remote Sensing Benchmark CVPR

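[Quick read]: This paper addresses the challenges that current CLIP-based multimodal large language models (MLLMs) face in remote sensing (RS) tasks, such as visual grounding and spatial reasoning, and in particular their difficulty in distinguishing RS images that are visually distinct yet semantically similar. The key contribution is the Remote Sensing Multimodal Visual Patterns (RSMMVP) benchmark, which evaluates MLLMs on RS tasks by identifying CLIP-blind pairs, that is, pairs of visually distinct RS images to which CLIP-based models wrongly assign high similarity scores. The evaluation reveals significant limitations of state-of-the-art MLLMs in RS-specific representation learning and provides a foundation for developing MLLMs better suited to remote sensing.

Link: https://arxiv.org/abs/2503.15816
Authors: Abduljaleel Adejumo, Faegheh Yeganli, Clifford Broni-bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam
Affiliations: AMMI/AIMS; University of British Columbia; RIKEN AIP; the University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 7 figures, CVPR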

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks but their remote sensing (RS) counterparts are relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS-specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and offer a foundation for future research to develop more effective MLLMs tailored for remote sensing applications.
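
One way to make the notion of a CLIP-blind pair concrete is to flag image pairs whose CLIP embeddings are nearly identical while some reference measure of visual similarity is low. The sketch below assumes that the reference measure is a second, vision-only encoder and uses placeholder thresholds; neither assumption is stated in the abstract, and the embeddings are made-up numbers.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def clip_blind_pairs(clip_emb, ref_emb, clip_min=0.95, ref_max=0.6):
    """Return index pairs that CLIP finds similar but the reference encoder does not.

    clip_min and ref_max are placeholder thresholds chosen for illustration.
    """
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            if (cosine(clip_emb[i], clip_emb[j]) > clip_min
                    and cosine(ref_emb[i], ref_emb[j]) < ref_max):
                pairs.append((i, j))
    return pairs

# Toy 2-D embeddings for three images (made-up numbers).
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
ref_emb = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]
print(clip_blind_pairs(clip_emb, ref_emb))
```

Here images 0 and 1 are almost identical to CLIP but far apart for the reference encoder, so they form the only flagged pair.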

[CV-103] Controlling Avatar Diffusion with Learnable Gaussian Embedding

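[Quick read]: This paper addresses the shortcomings of existing diffusion models for digital human generation in 3D consistency, temporal coherence, and motion accuracy. A core cause is the limited representation ability of commonly used control signals (such as landmarks and depth maps); the lack of diversity in identity and pose in public datasets hinders progress further. The key solution is a new control signal representation that is optimizable, dense, expressive, and 3D consistent. Specifically, the authors embed a learnable neural Gaussian onto a parametric head surface, which greatly improves the consistency and expressiveness of diffusion-based head generation models. To overcome the dataset limitation, the paper also synthesizes a large-scale dataset with multiple poses and identities and uses real/synthetic labels to distinguish real from synthetic data, reducing the impact of synthetic-data imperfections on the generated images. Experiments show that the method outperforms existing ones in realism, expressiveness, and 3D consistency.

Link: https://arxiv.org/abs/2503.15809
Authors: Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, Juyong Zhang
Affiliations: University of Science and Technology of China
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL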

Abstract: Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks and depth maps). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released in our project page: this https URL

[CV-104] Frequency Enhancement for Image Demosaicking

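[Quick read]: This paper addresses the challenging problem of recovering high-frequency textures in image demosaicking. Existing methods introduce elaborate spatial learning strategies, yet their gains remain limited. The paper proposes a frequency enhancement approach built around the Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through Fourier-domain frequency selection. It has two parallel paths: one generates missing information through detail refinement in the spatial domain, and the other suppresses undesirable frequency components in the frequency domain under the guidance of the CFA image. Multi-level frequency supervision with a stagewise training strategy further improves reconstruction. With these designs, DFENet surpasses other state-of-the-art algorithms on different datasets and shows clear advantages on hard cases. The paper also contributes a new dataset, LineSet37, for a more targeted evaluation of high-frequency texture reconstruction in difficult cases.

Link: https://arxiv.org/abs/2503.15800
Authors: Jingyun Liu, Daiqin Yang, Zhenzhong Chen
Affiliations: School of Remote Sensing and Information Engineering, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures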

Abstract:Recovering high-frequency textures in image demosaicking remains a challenging issue. While existing methods introduced elaborate spatial learning methods, they still exhibit limited performance. To address this issue, a frequency enhancement approach is proposed. Based on the frequency analysis of color filter array (CFA)/demosaicked/ground truth images, we propose Dual-path Frequency Enhancement Network (DFENet), which reconstructs RGB images in a divide-and-conquer manner through fourier-domain frequency selection. In DFENet, two frequency selectors are employed, each selecting a set of frequency components for processing along separate paths. One path focuses on generating missing information through detail refinement in spatial domain, while the other aims at suppressing undesirable frequencies with the guidance of CFA images in frequency domain. Multi-level frequency supervision with a stagewise training strategy is employed to further improve the reconstruction performance. With these designs, the proposed DFENet outperforms other state-of-the-art algorithms on different datasets and demonstrates significant advantages on hard cases. Moreover, to better assess algorithms’ ability to reconstruct high-frequency textures, a new dataset, LineSet37, is contributed, which consists of 37 artificially designed and generated images. These images feature complex line patterns and are prone to severe visual artifacts like color moiré after demosaicking. Experiments on LineSet37 offer a more targeted evaluation of performance on challenging cases. The code and dataset are available at this https URL.
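
Fourier-domain frequency selection splits a signal into components that can be processed along separate paths and then recombined. The sketch below shows the basic mechanics on a 1-D pixel row with a fixed hard mask; DFENet's selectors operate on images inside a network and are not described in enough detail here to reproduce, so the mask, the cutoff, and the toy data are assumptions.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real 1-D signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def split_by_frequency(signal, cutoff):
    """Split a 1-D signal into low- and high-frequency parts with a hard mask."""
    n = len(signal)
    spectrum = dft(signal)
    # Bins k and n-k carry the same frequency magnitude, so mask on min(k, n-k).
    low_mask = [1.0 if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    low = idft([c * m for c, m in zip(spectrum, low_mask)])
    high = idft([c * (1.0 - m) for c, m in zip(spectrum, low_mask)])
    return low, high

row = [0.0, 1.0, 0.0, 1.0, 4.0, 4.0, 4.0, 4.0]  # toy pixel row: fine stripes, then flat
low, high = split_by_frequency(row, cutoff=1)
recombined = [l + h for l, h in zip(low, high)]
print([round(x, 6) for x in recombined])
```

Because the two masks are complementary, the low and high parts sum back to the original row, so each path can specialize without losing information at the split.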

[CV-105] RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models

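[Quick read]: This paper addresses the limitations of Vision-Language Foundation Models (VLFMs) on fine-grained alignment tasks, especially in medical imaging, where image regions must correspond precisely to textual descriptions. Although existing models have a rich cross-modal semantic understanding, they fall short in accurately localizing and detecting clinical features, which is essential for diagnosis and analysis. The paper proposes a multi-stage architecture: a pre-trained VLFM first provides a coarse semantic understanding, and a reinforcement learning (RL) algorithm then refines the alignment iteratively. The reward signal is designed to align the semantic information of the text with the synthesized images. The key is the combination of the pre-trained model's base capability with RL-driven iterative refinement, which improves the quality of generated images and their alignment with the prompt, and in turn improves disease classifier performance for underrepresented subgroups.

Link: https://arxiv.org/abs/2503.15784
Authors: Parham Saremi, Amar Kumar, Mohammed Mohammed, Zahra TehraniNasab, Tal Arbel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: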

Abstract: Vision-Language Foundation Models (VLFM) have shown a tremendous increase in performance in terms of generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for understanding semantic context. The reward signal is designed to align the semantic information of the text with synthesized images. We demonstrate the effectiveness of our method on a medical imaging skin dataset where the generated images exhibit improved generation quality and alignment with prompt over the fine-tuned Stable Diffusion. We also show that the synthesized samples could be used to improve disease classifier performance for underrepresented subgroups through augmentation.

[CV-106] UAS Visual Navigation in Large and Unseen Environments via a Meta Agent

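[Quick read]: This paper aims to enable an Unmanned Aerial System (UAS) to learn efficiently to navigate in large-scale urban environments and to transfer what it has learned to new environments. It proposes a meta-curriculum training scheme whose key is to combine the Incremental Self-Adaptive Reinforcement learning (ISAR) algorithm with a hierarchical task-guidance strategy. Meta-training lets the agent learn a general master policy that generalizes across tasks, and the model is then fine-tuned on downstream tasks. ISAR combines ideas from incremental learning and meta-reinforcement learning (MRL); it keeps fast transfer ability while converging faster than conventional MRL. Experiments show that this training approach significantly improves convergence for large-scale urban navigation and adaptation in new environments.

Link: https://arxiv.org/abs/2503.15781
Authors: Yuci Han, Charles Toth, Alper Yilmaz
Affiliations: Photogrammetric Computer Vision Laboratory, Satellite Positioning and Inertial Navigation Laboratory, The Ohio State University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: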

Abstract:The aim of this work is to develop an approach that enables Unmanned Aerial System (UAS) to efficiently learn to navigate in large-scale urban environments and transfer their acquired expertise to novel environments. To achieve this, we propose a meta-curriculum training scheme. First, meta-training allows the agent to learn a master policy to generalize across tasks. The resulting model is then fine-tuned on the downstream tasks. We organize the training curriculum in a hierarchical manner such that the agent is guided from coarse to fine towards the target task. In addition, we introduce Incremental Self-Adaptive Reinforcement learning (ISAR), an algorithm that combines the ideas of incremental learning and meta-reinforcement learning (MRL). In contrast to traditional reinforcement learning (RL), which focuses on acquiring a policy for a specific task, MRL aims to learn a policy with fast transfer ability to novel tasks. However, the MRL training process is time consuming, whereas our proposed ISAR algorithm achieves faster convergence than the conventional MRL algorithm. We evaluate the proposed methodologies in simulated environments and demonstrate that using this training philosophy in conjunction with the ISAR algorithm significantly improves the convergence speed for navigation in large-scale cities and the adaptation proficiency in novel environments.

[CV-107] AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models

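[Quick read]: This paper addresses the unreliability of open-ended question answering evaluation in autonomous driving: free-form answers require either complex metrics or subjective human judgment, so current evaluation lacks standardization and objectivity. The paper proposes AutoDrive-QA, an automated pipeline that converts existing driving QA datasets (such as DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. The benchmark systematically assesses perception, prediction, and planning tasks and provides a standardized, objective evaluation framework.

The key to the solution is an automated process that uses large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns common in autonomous driving scenarios. The benchmark is tested on three public datasets, with zero-shot experiments on an unseen dataset to assess generalization. In the zero-shot setting, GPT-4V achieves 69.57% overall accuracy, with 74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning, indicating that models do well on Perception but struggle on Prediction. AutoDrive-QA thus offers a rigorous, unbiased standard for integrating and evaluating vision-language models across autonomous driving datasets. The code is released in the AutoDrive-QA GitHub repository.

Link: https://arxiv.org/abs/2503.15778
Authors: Boshra Khalili, Andrew W. Smyth
Affiliations: Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: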

Abstract:In autonomous driving, open-ended question answering often suffers from unreliable evaluations because freeform responses require either complex metrics or subjective human judgment. To address this challenge, we introduce AutoDrive-QA, an automatic pipeline that converts existing driving QA datasets (including DriveLM, NuScenes-QA, and LingoQA) into a structured multiple-choice question (MCQ) format. This benchmark systematically assesses perception, prediction, and planning tasks, providing a standardized and objective evaluation framework. AutoDrive-QA employs an automated pipeline that leverages large language models (LLMs) to generate high-quality, contextually relevant distractors based on domain-specific error patterns commonly found in autonomous driving scenarios. To evaluate both general capabilities and generalization performance, we test the benchmark on three public datasets and conduct zero-shot experiments on an unseen dataset. The zero-shot evaluations reveal that GPT-4V leads with 69.57% accuracy – achieving 74.94% in Perception, 65.33% in Prediction, and 68.45% in Planning – demonstrating that while all models excel in Perception, they struggle in Prediction. Consequently, AutoDrive-QA establishes a rigorous, unbiased standard for integrating and evaluating different vision-language models across various autonomous driving datasets, thereby improving generalization in this field. We release all the codes in the AutoDrive-QA GitHub Repository.
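
The abstract does not say how the overall GPT-4V figure was aggregated. A quick check shows that the reported 69.57% is consistent with an unweighted mean of the three per-task accuracies; this is an observation about the numbers, not a confirmed description of the authors' procedure.

```python
# Per-task zero-shot accuracies (%) for GPT-4V, as reported in the abstract.
per_task = {"Perception": 74.94, "Prediction": 65.33, "Planning": 68.45}

# Unweighted (macro) average across the three tasks.
macro_average = sum(per_task.values()) / len(per_task)
weakest_task = min(per_task, key=per_task.get)

print(f"{macro_average:.2f}")  # prints 69.57
print(weakest_task)            # prints Prediction
```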

[CV-108] OffsetOPT: Explicit Surface Reconstruction without Normals CVPR2025

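[Quick read]: This paper addresses explicit surface reconstruction from point clouds. Conventional methods typically depend on high-quality normals for accurate reconstruction and perform poorly when normals are missing or unreliable. The proposed OffsetOPT reconstructs explicit surfaces directly from 3D point clouds and removes the need for point normals. Its key is a two-stage process: first, a neural network is trained to predict surface triangles from local point geometry; then, with the network frozen, a per-point offset is optimized to maximize the accuracy of the triangle predictions, which reconstructs surfaces from unseen point clouds. Compared with existing techniques, OffsetOPT reconstructs overall surfaces better and preserves sharp surface features markedly well.

Link: https://arxiv.org/abs/2503.15763
Authors: Huan Lei
Affiliations: AIML, University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025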

Abstract:Neural surface reconstruction has been dominated by implicit representations with marching cubes for explicit surface extraction. However, those methods typically require high-quality normals for accurate reconstruction. We propose OffsetOPT, a method that reconstructs explicit surfaces directly from 3D point clouds and eliminates the need for point normals. The approach comprises two stages: first, we train a neural network to predict surface triangles based on local point geometry, given uniformly distributed training point clouds. Next, we apply the frozen network to reconstruct surfaces from unseen point clouds by optimizing a per-point offset to maximize the accuracy of triangle predictions. Compared to state-of-the-art methods, OffsetOPT not only excels at reconstructing overall surfaces but also significantly preserves sharp surface features. We demonstrate its accuracy on popular benchmarks, including small-scale shapes and large-scale open surfaces.

[CV-109] GraPLUS: Graph-based Placement Using Semantics for Image Composition

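[Quick read]: This paper addresses plausible (reasonable and natural) object placement in images. Existing methods often lack a sufficient understanding of semantic relationships and spatial context when laying out objects in a scene, which leads to inaccurate placement or degraded visual quality. The paper proposes GraPLUS (Graph-based Placement Using Semantics), a framework that combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions.

The key innovations of GraPLUS are: (i) pre-trained scene graph models that transfer knowledge from other domains; (ii) edge-aware graph neural networks that process scene semantics through structured relationships; (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features; and (iv) a multi-objective training strategy that incorporates semantic consistency constraints. Together these let the framework place objects more accurately while maintaining visual quality. GraPLUS reaches a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, and significantly outperforms existing methods in human evaluation.

Link: https://arxiv.org/abs/2503.15761
Authors: Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei
Affiliations: Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 3 figures, 6 tables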

Abstract:We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling nuanced understanding of object relationships and placement patterns. GraPLUS achieves placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.1% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 19 participants, our method was preferred in 52.1% of cases, significantly outperforming previous approaches. The framework’s key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multiobjective training strategy incorporating semantic consistency constraints.

[CV-110] Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes

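[Quick read]: This paper addresses single-image 3D scene reconstruction, an inherently ill-posed problem in which existing methods produce incoherent and blurry renderings of novel views, especially when the regions to reconstruct lie far from the input view. The key solutions are: 1) using a pre-trained latent video diffusion model as a strong generative prior to iteratively refine a coarse scene represented by optimizable Gaussian parameters; 2) applying on-the-fly Fourier-style transfer between the input and generated images so that the style and texture of generated images match the input; and 3) a semantic uncertainty quantification module that computes per-pixel entropy to produce uncertainty maps, which guide refinement from the most confident pixels and discard highly uncertain ones. Experiments show that the method outperforms state-of-the-art methods on RealEstate-10K (in-domain) and KITTI-v2 (out-of-domain), giving more realistic, higher-fidelity novel view synthesis.

Link: https://arxiv.org/abs/2503.15742
Authors: Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury
Affiliations: University of California, Riverside
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures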

Abstract:Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image’s view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
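
A per-pixel entropy map of the kind the abstract describes can be sketched as follows. The class probabilities, the entropy threshold, and the hard keep/discard rule are illustrative assumptions; the paper's module may compute and use its uncertainty maps differently.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confident_mask(prob_map, max_entropy):
    """Mark pixels whose entropy is at or below `max_entropy` as usable."""
    return [[entropy(pixel) <= max_entropy for pixel in row] for row in prob_map]

# 1x3 toy "image" with per-pixel probabilities over three semantic classes.
prob_map = [[
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.30, 0.30],   # ambiguous
    [1.00, 0.00, 0.00],   # fully certain, entropy 0
]]
print(confident_mask(prob_map, max_entropy=0.5))
```

Only the first and third pixels pass the threshold; in a refinement loop, such a mask would restrict the supervision signal to pixels the model is sure about.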

[CV-111] Graph-Weighted Contrastive Learning for Semi-Supervised Hyperspectral Image Classification

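[Quick Read]: This paper targets the pixel misclassification that existing graph-based semi-supervised hyperspectral image classification methods suffer because of inaccurate superpixel boundaries: the errors in the initial superpixel partition cap overall classification performance. The key idea is a novel graph-weighted contrastive learning approach that does not rely on superpixel partitioning and instead uses neural networks to learn hyperspectral image representations directly. In addition, unlike many approaches that need all graph nodes throughout training, it supports mini-batch training on a subset of nodes at a time, which reduces computational complexity and improves generalization to unseen nodes. Experiments on three widely used datasets show that the approach is more effective than baselines that rely on superpixel partitioning.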

Link: https://arxiv.org/abs/2503.15731
Authors: Yuqing Zhang, Qi Han, Ligeng Wang, Kai Cheng, Bo Wang, Kun Zhan
Institutions: School of Information Science and Engineering, Lanzhou University; Key Laboratory of AI and Information Processing, Hechi University; School of Artificial Intelligence and Smart Manufacturing, Hechi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Journal of Electronic Imaging, 2025

Abstract:Most existing graph-based semi-supervised hyperspectral image classification methods rely on superpixel partitioning techniques. However, they suffer from misclassification of certain pixels due to inaccuracies in superpixel boundaries, i.e., the initial inaccuracies in superpixel partitioning limit overall classification performance. In this paper, we propose a novel graph-weighted contrastive learning approach that avoids the use of superpixel partitioning and directly employs neural networks to learn hyperspectral image representation. Furthermore, while many approaches require all graph nodes to be available during training, our approach supports mini-batch training by processing only a subset of nodes at a time, reducing computational complexity and improving generalization to unseen nodes. Experimental results on three widely-used datasets demonstrate the effectiveness of the proposed approach compared to baselines relying on superpixel partitioning.

[CV-112] SPNeRF: Open Vocabulary 3D Neural Scene Segmentation with Superpoints

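[Quick Read]: This paper addresses the challenges of extending open-vocabulary segmentation, powered by large vision-language models such as CLIP, from 2D to 3D scenes. CLIP's image-based embeddings often lack the geometric detail needed for 3D scene segmentation; existing methods respond by adding extra segmentation models or by replacing CLIP with variants trained on segmentation data, which leads to redundancy or degrades CLIP's general language capabilities. To overcome this, the paper proposes SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. The key is to integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features, together with a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, the method further explores CLIP's capability for 3D segmentation and achieves notable improvements over the original LERF.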

Link: https://arxiv.org/abs/2503.15712
Authors: Weiwen Hu, Niccolò Parodi, Marcus Zepp, Ingo Feldmann, Oliver Schreer, Peter Eisert
Institutions: Fraunhofer Heinrich Hertz Institute, Berlin, Germany; Technische Universität Berlin, Germany; Humboldt-Universität zu Berlin, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (2025)

Abstract:Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP’s image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which leads to redundancy or a loss of CLIP’s general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF-based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP’s capability for 3D segmentation and achieves notable improvements over the original LERF.

[CV-113] Sustainable Deep Learning-Based Breast Lesion Segmentation: Impact of Breast Region Segmentation on Performance

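[Quick Read]: This paper studies how breast region segmentation (BRS) affects deep learning-based breast lesion segmentation (BLS) in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI). Using the Stavanger Dataset and UNet++ models, it compares four pipelines: the whole volume without BRS, the whole volume with BRS, BRS with only the selected lesion slices, and an optimal volume with BRS. The key steps are a precise procedure for choosing the optimal volume size so that all lesions remain in the retained slices, combined with a hybrid loss function and 5-fold cross-validation. Results show that BRS considerably improves model performance and validation; the optimal-volume-with-BRS pipeline improves by around 50 percent over the pipeline without BRS while also greatly reducing energy consumption, offering a greener option for future work on large datasets.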

Link: https://arxiv.org/abs/2503.15708
Authors: Sam Narimani, Solveig Roth Hoff, Kathinka Dahli Kurz, Kjell-Inge Gjesdal, Jurgen Geisler, Endre Grovik
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Purpose: Segmentation of the breast lesion in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is an essential step to accurately diagnose, plan treatment, and monitor progress. This study aims to highlight the impact of breast region segmentation (BRS) on deep learning-based breast lesion segmentation (BLS) in breast DCE-MRI. Methods: Using the Stavanger Dataset, containing primarily 59 DCE-MRI scans, and UNet++ as the deep learning model, four different processes were conducted to compare the effect of BRS on BLS. These four approaches were the whole volume without BRS, the whole volume with BRS, BRS with the selected lesion slices, and lastly the optimal volume with BRS. Preprocessing methods such as augmentation and oversampling were used to enhance the small dataset, make data shapes uniform, and improve model performance. The optimal volume size was investigated by a precise process to ensure that all lesions existed in the slices. To evaluate the model, a hybrid loss function including dice, focal, and cross-entropy terms was used along with 5-fold cross-validation, and lastly a randomly split test dataset was used to evaluate model performance on unseen data for each of the four approaches. Results: Using BRS considerably improved model performance and validation. The last approach, optimal volume with BRS, improved by around 50 percent over the approach without BRS, demonstrating how effective BRS is for BLS. Moreover, a large improvement in energy consumption, decreasing up to 450 percent, introduces a green solution toward a more environmentally sustainable approach for future work on large datasets.
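A hybrid loss that combines dice, focal, and cross-entropy terms, as named in the abstract, might look like the following sketch for binary masks flattened into lists. The equal term weights, the focal `gamma`, and the clipping constant `EPS` are assumptions for illustration; the abstract does not state how the terms were combined.

```python
import math

EPS = 1e-7  # clipping constant to keep logarithms finite (assumed value)

def dice_loss(pred, target):
    """1 - Dice coefficient between predicted probabilities and a 0/1 mask."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + EPS) / (denom + EPS)

def cross_entropy_loss(pred, target):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def focal_loss(pred, target, gamma=2.0):
    """Mean binary focal loss: cross-entropy down-weighted on easy pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        pt = p if t == 1 else 1.0 - p  # probability given to the true class
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(pred)

def hybrid_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of dice, focal, and cross-entropy (weights are assumed)."""
    w_dice, w_focal, w_ce = weights
    return (w_dice * dice_loss(pred, target)
            + w_focal * focal_loss(pred, target)
            + w_ce * cross_entropy_loss(pred, target))

target = [1, 0, 1, 0]
good = hybrid_loss([1.0, 0.0, 1.0, 0.0], target)
bad = hybrid_loss([0.1, 0.9, 0.1, 0.9], target)
```

The dice term responds to region overlap, which matters when lesions occupy few pixels, while the focal term concentrates the pixel-wise penalty on hard examples.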

[CV-114] Representational Similarity via Interpretable Visual Concepts ICLR2025

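[Quick Read]: This paper tackles the long-standing open question of how to measure the way two deep neural networks differ in arriving at a decision. Most existing methods give only a single number for the similarity of two networks at a given layer and offer no insight into why they are similar or dissimilar. The key contribution is an interpretable representational similarity method (RSVC) that discovers the visual concepts shared by two models and those unique to each. The study further shows that some model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Extensive evaluation across vision model architectures and training protocols demonstrates the method's effectiveness.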

Link: https://arxiv.org/abs/2503.15699
Authors: Neehar Kondapaneni, Oisin Mac Aodha, Pietro Perona
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 32 pages, 5 Figures, 16 Supplemental Figures, ICLR 2025

Abstract:How do two deep neural networks differ in how they arrive at a decision? Measuring the similarity of deep networks has been a long-standing open question. Most existing methods provide a single number to measure the similarity of two networks at a given layer, but give no insight into what makes them similar or dissimilar. We introduce an interpretable representational similarity method (RSVC) to compare two networks. We use RSVC to discover shared and unique visual concepts between two models. We show that some aspects of model differences can be attributed to unique concepts discovered by one model that are not well represented in the other. Finally, we conduct extensive evaluation across different vision model architectures and training protocols to demonstrate its effectiveness.

[CV-115] Technical Report for the 5th CLVision Challenge at CVPR: Addressing the Class-Incremental with Repetition using Unlabeled Data – 4th Place Solution

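[Quick Read]: This paper presents a solution for the Class-Incremental with Repetition (CIR) scenario of the 5th CLVision challenge at CVPR. Unlike traditional class-incremental learning, CIR brings unique challenges and research opportunities, particularly through the integration of unlabeled data into training. The key is to retain previously learned knowledge with knowledge distillation and pseudo-labeling, exploiting the unlabeled data during training to maintain performance on previously encountered classes and to mitigate catastrophic forgetting. The method reaches an average accuracy of 16.68% in the pre-selection phase and 21.19% in the final evaluation phase, outperforming the baseline accuracy of 9.39%. The implementation code is available at the link given in the abstract.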

Link: https://arxiv.org/abs/2503.15697
Authors: Panagiota Moraiti, Efstathios Karypidis
Institutions: Tech Hive Labs; National Technical University of Athens; Archimedes/Athena Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper outlines our approach to the 5th CLVision challenge at CVPR, which addresses the Class-Incremental with Repetition (CIR) scenario. In contrast to traditional class incremental learning, this novel setting introduces unique challenges and research opportunities, particularly through the integration of unlabeled data into the training process. In the CIR scenario, encountered classes may reappear in later learning experiences, and each experience may involve only a subset of the overall class distribution. Additionally, the unlabeled data provided during training may include instances of unseen classes, or irrelevant classes which should be ignored. Our approach focuses on retaining previously learned knowledge by utilizing knowledge distillation and pseudo-labeling techniques. The key characteristic of our method is the exploitation of unlabeled data during training, in order to maintain optimal performance on instances of previously encountered categories and reduce the detrimental effects of catastrophic forgetting. Our method achieves an average accuracy of 16.68% during the pre-selection phase and 21.19% during the final evaluation phase, outperforming the baseline accuracy of 9.39%. We provide the implementation code at this https URL .
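The two techniques named in the abstract, knowledge distillation and pseudo-labeling, can be sketched generically as follows. This is not the authors' implementation: the temperature, the confidence threshold, and the choice to skip low-confidence unlabeled samples are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0.0)

def pseudo_label(logits, threshold=0.9):
    """Return the predicted class index if confident enough, else None.

    Returning None lets the training loop skip unlabeled samples that may
    belong to unseen or irrelevant classes (an assumed filtering rule).
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None
```

In a class-incremental loop, the previous model would typically play the teacher so that the new model's outputs on old classes stay close to what was learned before.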

[CV-116] Multi-focal Conditioned Latent Diffusion for Person Image Synthesis CVPR2025

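[Quick Read]: This paper addresses the loss of detail caused by the compression process of the Latent Diffusion Model (LDM) in high-resolution image generation and Pose-Guided Person Image Synthesis (PGPIS), particularly in sensitive regions such as facial features and clothing textures. It proposes Multi-focal Conditioned Latent Diffusion (MCLD), whose key idea is to condition the model on disentangled, pose-invariant features from these regions: a multi-focal condition aggregation module integrates facial identity and texture-specific information, making the generated images more realistic in appearance and more identity-consistent. Experiments on the DeepFashion dataset show consistent identity and appearance generation, and the method also enables flexible person image editing.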

Link: https://arxiv.org/abs/2503.15686
Authors: Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe
Institutions: University of Trento; Ocean University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Accepted

Abstract:The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model’s ability to produce appearance realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at this https URL.

[CV-117] The Change You Want To Detect: Semantic Change Detection In Earth Observation With Hybrid Data Generation

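[Quick Read]: This paper addresses large-scale bi-temporal change detection on Very High Resolution (VHR) images, which remains poorly handled: existing methods either require large volumes of annotated data (the semantic case) or are limited to restricted datasets (binary set-ups). Most approaches lack the ingredients needed for temporal and spatial adaptation, namely simple architecture design and pretraining on realistic, comprehensive datasets. The key contribution is HySCDG, a generative pipeline that creates FSC-180k, a large hybrid semantic change detection dataset containing both real and inpainted VHR images together with land cover semantic maps at both dates and the change map. Guided semantically and spatially, HySCDG generates realistic images, yielding a comprehensive, transfer-proof hybrid dataset. Pretraining on it significantly boosts change detection performance across task configurations, including low-data regimes.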

Link: https://arxiv.org/abs/2503.15683
Authors: Benidir Yanis, Gonthier Nicolas, Mallet Clement
Institutions: Univ Gustave Eiffel, ENSG, IGN, LASTIG, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Bi-temporal change detection at scale based on Very High Resolution (VHR) images is crucial for Earth monitoring. This remains poorly addressed so far: methods either require large volumes of annotated data (semantic case), or are limited to restricted datasets (binary set-ups). Most approaches do not exhibit the versatility required for temporal and spatial adaptation: simplicity in architecture design and pretraining on realistic and comprehensive datasets. Synthetic datasets are the key solution but still fail to handle complex and diverse scenes. In this paper, we present HySCDG, a generative pipeline for creating a large hybrid semantic change detection dataset that contains both real VHR images and inpainted ones, along with land cover semantic maps at both dates and the change map. Being semantically and spatially guided, HySCDG generates realistic images, leading to a comprehensive and hybrid transfer-proof dataset FSC-180k. We evaluate FSC-180k on five change detection cases (binary and semantic), from zero-shot to mixed and sequential training, and also under low data regime training. Experiments demonstrate that pretraining on our hybrid dataset leads to a significant performance boost, outperforming SyntheWorld, a fully synthetic dataset, in every configuration. All codes, models, and data are available here: this https URL.

[CV-118] High Temporal Consistency through Semantic Similarity Propagation in Semi-Supervised Video Semantic Segmentation for Autonomous Flight

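[Quick Read]: This paper addresses the insufficient temporal stability of semantic segmentation predictions from RGB cameras on autonomous flying vehicles, with the aim of improving reliability and trustworthiness. The key contribution is a lightweight video semantic segmentation approach that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP propagates the predictions of an efficient image segmentation model over time, uses global registration alignment to compensate for camera movement, and combines the current estimate with the prior prediction by linear interpolation, with weights computed from the feature similarities of the two frames. To cope with scarce annotations, the paper also proposes a consistency-aware Knowledge Distillation (KD) training procedure: a large image segmentation model acts as teacher for the efficient SSP model, and the strong correlations between labeled and unlabeled frames of the same training video provide high-quality supervision on all frames. KD-SSP improves temporal consistency (TC) over the base image segmentation model by 12.5% on UAVid and 6.7% on RuralScapes, with higher accuracy and comparable inference speed.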

Link: https://arxiv.org/abs/2503.15676
Authors: Cédric Vincent, Taehyoung Kim, Henri Meeß
Institutions: Télécom Paris, Institut Polytechnique de Paris; Fraunhofer IVI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach, suited to onboard real-time inference, that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency-aware Knowledge Distillation training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high-quality supervision on all frames. KD-SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD-SSP provides a better trade-off between segmentation quality and inference speed than other video methods proposed for general applications and shows considerably higher consistency. The code will be made publicly available upon acceptance.
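The interpolation step described in the abstract can be sketched for a single pixel as follows. The sketch assumes the prior prediction has already been registered to the current frame; the use of cosine similarity, the clamping of negative similarity to zero, and the `max_weight` cap are hypothetical choices, since the abstract does not specify how the weights are computed.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def propagate_pixel(current_probs, prior_probs, current_feat, prior_feat,
                    max_weight=0.5):
    """Blend current and (aligned) prior class probabilities for one pixel.

    The prior's weight grows with feature similarity between the two frames;
    dissimilar features leave the current estimate unchanged.
    """
    sim = max(0.0, cosine_similarity(current_feat, prior_feat))
    w = max_weight * sim
    return [(1.0 - w) * c + w * p for c, p in zip(current_probs, prior_probs)]

# Same features in both frames: the prior gets the full (capped) weight.
same = propagate_pixel([0.8, 0.2], [0.4, 0.6], [1.0, 0.0], [1.0, 0.0])
# Orthogonal features: the prior is ignored.
diff = propagate_pixel([0.8, 0.2], [0.4, 0.6], [1.0, 0.0], [0.0, 1.0])
```

Because the output is a convex combination of two probability vectors, it still sums to one, so no renormalization is needed.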

[CV-119] GASP: Unifying Geometric and Semantic Self-Supervised Pre-training for Autonomous Driving

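[Quick Read]: This paper asks how self-supervised pre-training can learn, from large-scale spatiotemporal driving data, the geometric and semantic structure of the environment and its evolution over time. The key contribution is GASP, a geometric and semantic self-supervised pre-training method that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle's path through the environment; and (3) high-level features distilled from a vision foundation model. By modeling geometric and semantic 4D occupancy fields rather than raw sensor measurements, the model learns a structured, generalizable representation, which yields significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction.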

Link: https://arxiv.org/abs/2503.15672
Authors: William Ljungbergh, Adam Lilja, Adam Tonderski, Arvid Laveno Ling, Carl Lindström, Willem Verbeke, Junsheng Fu, Christoffer Petersson, Lars Hammarstrand, Michael Felsberg
Institutions: Zenseact; Linköping University; Chalmers University of Technology; Lund University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Self-supervised pre-training based on next-token prediction has enabled large language models to capture the underlying structure of text, and has led to unprecedented performance on a large array of tasks when applied at scale. Similarly, autonomous driving generates vast amounts of spatiotemporal data, alluding to the possibility of harnessing scale to learn the underlying geometric and semantic structure of the environment and its evolution over time. In this direction, we propose a geometric and semantic self-supervised pre-training method, GASP, that learns a unified representation by predicting, at any queried future point in spacetime, (1) general occupancy, capturing the evolving structure of the 3D scene; (2) ego occupancy, modeling the ego vehicle path through the environment; and (3) distilled high-level features from a vision foundation model. By modeling geometric and semantic 4D occupancy fields instead of raw sensor measurements, the model learns a structured, generalizable representation of the environment and its evolution through time. We validate GASP on multiple autonomous driving benchmarks, demonstrating significant improvements in semantic occupancy forecasting, online mapping, and ego trajectory prediction. Our results demonstrate that continuous 4D geometric and semantic occupancy prediction provides a scalable and effective pre-training paradigm for autonomous driving. For code and additional visualizations, see this https URL.

[CV-120] CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image

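[Quick Read]: This paper addresses the reconstruction of occluded clothed humans from a single image. Existing methods usually assume an occlusion-free environment, so on in-the-wild occluded images they produce multiview-inconsistent, fragmented reconstructions. In addition, most monocular 3D human reconstruction algorithms rely on geometric priors such as SMPL annotations, which are hard to obtain in real-world applications. The key contribution is CHROME, a pipeline that reconstructs occlusion-resilient, multiview-consistent 3D clothed humans from a single occluded image without ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME uses a multiview diffusion model to synthesize occlusion-free images, with pose control explicitly enforcing cross-view consistency; a 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and the synthesized views, aligning cross-view details into a consistent 3D representation.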

Link: https://arxiv.org/abs/2503.15671
Authors: Arindam Dutta, Meng Zheng, Zhongpai Gao, Benjamin Planche, Anwesha Choudhuri, Terrence Chen, Amit K. Roy-Chowdhury, Ziyan Wu
Institutions: University of California, Riverside; United Imaging Intelligence, Boston
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview-inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction under challenging conditions.

[CV-121] DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis

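[Quick Read]: This paper addresses generating high-quality 360-degree views of human heads from a single-view image, in support of immersive telepresence applications and scalable personalized content creation. Existing full-head generation methods are limited to realistic human heads, while diffusion-based approaches to stylized head synthesis produce only frontal views and struggle with view consistency, so their outputs cannot be converted into true 3D models for rendering from arbitrary angles. The key contribution is a new approach that adds a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure front-back consistency. Training on continuous view sequences and integrating a back reference image give locally continuous view synthesis. The resulting views can be used to produce high-quality neural radiance fields (NeRFs) for real-time free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and in 360-degree head generation for challenging input portraits.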

Link: https://arxiv.org/abs/2503.15667
Authors: Yuming Gu, Phong Tran, Yujian Zheng, Hongyi Xu, Heyuan Li, Adilbek Karmanov, Hao Li
Institutions: University of Southern California; MBZUAI; ByteDance Inc.; The Chinese University of Hong Kong, Shenzhen; Pinscreen Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Page: this https URL Code: this https URL

Abstract:Generating high-quality 360-degree views of human heads from single-view images is essential for enabling accessible immersive telepresence applications and scalable personalized content creation. While cutting-edge methods for full head generation are limited to modeling realistic human heads, the latest diffusion-based approaches for style-omniscient head synthesis can produce only frontal views and struggle with view consistency, preventing their conversion into true 3D models for rendering from arbitrary angles. We introduce a novel approach that generates fully consistent 360-degree head views, accommodating human, stylized, and anthropomorphic forms, including accessories like glasses and hats. Our method builds on the DiffPortrait3D framework, incorporating a custom ControlNet for back-of-head detail generation and a dual appearance module to ensure global front-back consistency. By training on continuous view sequences and integrating a back reference image, our approach achieves robust, locally continuous view synthesis. Our model can be used to produce high-quality neural radiance fields (NeRFs) for real-time, free-viewpoint rendering, outperforming state-of-the-art methods in object synthesis and 360-degree head generation for very challenging input portraits.

[CV-122] Toward Scalable Flexible Scene Flow for Point Clouds

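[Quick Read]: This thesis aims to lay the foundation for scene flow estimators with two important properties: scalability, meaning they improve with more data and computation, and flexibility, meaning they work out of the box across a variety of domains and motion patterns without significant hyperparameter tuning. The key contributions are a blueprint for building and scaling feedforward scene flow estimators without expensive human annotations, via large-scale distillation from pseudolabels produced by strong unsupervised test-time optimization methods; a benchmark that better measures estimate quality across diverse object types; and a state-of-the-art unsupervised estimator built on a new full-sequence problem formulation.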

Link: https://arxiv.org/abs/2503.15666
Authors: Kyle Vedder
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: PhD Thesis

Abstract:Scene flow estimation is the task of describing 3D motion between temporally successive observations. This thesis aims to build the foundation for building scene flow estimators with two important properties: they are scalable, i.e. they improve with access to more data and computation, and they are flexible, i.e. they work out-of-the-box in a variety of domains and on a variety of motion patterns without requiring significant hyperparameter tuning. In this dissertation we present several concrete contributions towards this. In Chapter 1 we contextualize scene flow and its prior methods. In Chapter 2 we present a blueprint to build and scale feedforward scene flow estimators without requiring expensive human annotations via large scale distillation from pseudolabels provided by strong unsupervised test-time optimization methods. In Chapter 3 we introduce a benchmark to better measure estimate quality across diverse object types, better bringing into focus what we care about and expect from scene flow estimators, and use this benchmark to host a public challenge that produced significant progress. In Chapter 4 we present a state-of-the-art unsupervised scene flow estimator that introduces a new, full sequence problem formulation and exhibits great promise in adjacent domains like 3D point tracking. Finally, in Chapter 5 I philosophize about what’s next for scene flow and its potential future broader impacts.

[CV-123] Transport-Related Surface Detection with Machine Learning: Analyzing Temporal Trends in Madrid and Vienna

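[Quick Read]: This study addresses the identification of infrastructure surfaces for cars and pedestrians in urban aerial imagery and the analysis of their historical trends. The key is the transition from convolutional architectures to transformer-based pre-trained models, which show strong potential for global geospatial analysis. The authors present a workflow that automatically generates geospatial datasets, creating semantic segmentation datasets from openly available sources such as WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests, without manual labelling. Using aerial imagery and vectorial data from the geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection, and a transformer-based model was trained and evaluated for each city with good accuracy. Applying the trained models to earlier imagery that predates the vectorial data by 10 to 20 years successfully identified temporal trends in pedestrian and car infrastructure across different city areas. The technique lets municipal governments gather valuable data at minimal cost.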

Link: https://arxiv.org/abs/2503.15653
Authors: Miguel Ureña Pliego, Rubén Martínez Marín, Nianfang Shi, Takeru Shibayama, Ulrich Leth, Miguel Marchamalo Sacristán
Institutions: Department of Land Morphology and Engineering, Civil Engineering School, Universidad Politécnica de Madrid; Institut für Verkehrswissenschaften, TU Wien
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer-based pre-trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass-turbo requests. The developed code allows a fast dataset generation process for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer-based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique is applicable for municipal governments to gather valuable data at a minimal cost.

[CV-124] Cancelable Biometric Template Generation Using Random Feature Vector Transformations

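[Quick Read]: This paper addresses the risk that original biometric data is tampered with or stolen during recognition, and the modality dependence and vulnerability to reconstruction attacks of existing cancelable biometric schemes. The key contribution is a modality-independent cancelable scheme: the components of the biometric feature vector are grouped according to a set of user-specific random vectors to form multiple random transformations, and the cancelable template (pseudo-identifier) is the vector of distances between these transformations. Because the template stores only those distance values and no details of the original biometric template, template reconstruction is ruled out.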

Link: https://arxiv.org/abs/2503.15648
Authors: Ragendhu Sp, Tony Thomas, Sabu Emmanuel
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cancelable biometric schemes are designed to extract an identity-preserving, non-invertible as well as revocable pseudo-identifier from biometric data. Recognition systems need to store only this pseudo-identifier, to avoid tampering and/or stealing of original biometric data during the recognition process. State-of-the-art cancelable schemes generate pseudo-identifiers by transforming the original template using either user-specific salting or many-to-one transformations. In addition to the performance concerns, most of such schemes are modality-specific and prone to reconstruction attacks as there are chances for unauthorized access to security-critical transformation keys. A novel, modality-independent cancelable biometric scheme is proposed to overcome these limitations. In this scheme, a cancelable template (pseudo identifier) is generated as a distance vector between multiple random transformations of the biometric feature vector. These transformations were done by grouping feature vector components based on a set of user-specific random vectors. The proposed scheme nullifies the possibility of template reconstruction as the generated cancelable template contains only the distance values between the different random transformations of the feature vector and it does not store any details of the biometric template. The recognition performance of the proposed scheme is evaluated for face and fingerprint modalities. Equal Error Rate (EER) of 1.5 is obtained for face and 1.7 is obtained for the fingerprint in the worst case.
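To make the idea of a template built from distances between random transformations concrete, here is a rough sketch. It is an illustration under strong assumptions, not the scheme from the paper: the grouping rule (a seeded random assignment of components to groups), the per-group summation, the Euclidean distance, and all function names are hypothetical.

```python
import math
import random
from itertools import combinations

def random_grouping(n_components, n_groups, seed):
    """Assign each feature component to a group using a user-specific seed."""
    rng = random.Random(seed)
    return [rng.randrange(n_groups) for _ in range(n_components)]

def transform(features, grouping, n_groups):
    """One random transformation: sum the components that share a group."""
    sums = [0.0] * n_groups
    for value, group in zip(features, grouping):
        sums[group] += value
    return sums

def cancelable_template(features, user_seeds, n_groups=4):
    """Distances between all pairs of transformed vectors (the stored template)."""
    transformed = [
        transform(features, random_grouping(len(features), n_groups, s), n_groups)
        for s in user_seeds
    ]
    return [math.dist(a, b) for a, b in combinations(transformed, 2)]

features = [0.2, 1.5, -0.7, 0.9, 0.0, 2.1, -1.3, 0.4]
template = cancelable_template(features, user_seeds=[11, 22, 33])
```

In this sketch, revoking a template amounts to issuing new seeds, and only the distance values are stored rather than the feature vector itself. The sketch makes no claim about the security properties argued in the paper.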

[CV-125] Multi-Modal Gesture Recognition from Video and Surgical Tool Pose Information via Motion Invariants

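[Quick Read]: This paper addresses a key challenge in real-time surgical gesture recognition: current multi-modal neural networks process vision and kinematics data but treat kinematics as independent signals, ignoring the geometric relations between instrument poses. The key idea is to combine motion invariant measures (curvature and torsion) with vision and kinematics data, using a relational graph network to capture the underlying relations between the data streams. Combining invariant signals with tool position improves recognition, reaching 90.3% frame-wise accuracy on the JIGSAWS suturing dataset, and represents gesture motion better than traditional position and quaternion representations.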

Link: https://arxiv.org/abs/2503.15647
Authors: Jumanh Atoum, Garrison L.H. Johnston, Nabil Simaan, Jie Ying Wu
Institutions: Vanderbilt University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Recognizing surgical gestures in real-time is a stepping stone towards automated activity recognition, skill assessment, intra-operative assistance, and eventually surgical automation. The current robotic surgical systems provide us with rich multi-modal data such as video and kinematics. While some recent works in multi-modal neural networks learn the relationships between vision and kinematics data, current approaches treat kinematics information as independent signals, with no underlying relation between tool-tip poses. However, instrument poses are geometrically related, and the underlying geometry can aid neural networks in learning gesture representation. Therefore, we propose combining motion invariant measures (curvature and torsion) with vision and kinematics data using a relational graph network to capture the underlying relations between different data streams. We show that gesture recognition improves when combining invariant signals with tool position, achieving 90.3% frame-wise accuracy on the JIGSAWS suturing dataset. Our results show that motion invariant signals coupled with position are better representations of gesture motion compared to traditional position and quaternion representations. Our results highlight the need for geometric-aware modeling of kinematics for gesture recognition.
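Curvature and torsion, the motion invariants named in the abstract, can be estimated from a sampled tool-tip trajectory with the standard formulas κ = |r′ × r″| / |r′|³ and τ = (r′ × r″) · r‴ / |r′ × r″|². The sketch below uses central finite differences on uniformly sampled points; the sampling scheme and function names are assumptions, and the paper may compute the invariants differently.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def curvature_torsion(points, dt):
    """(curvature, torsion) at interior samples of a uniformly sampled 3D path."""
    out = []
    for i in range(2, len(points) - 2):
        p_m2, p_m1, p0, p_p1, p_p2 = points[i - 2:i + 3]
        # Central finite differences for the first three derivatives.
        d1 = [(b - a) / (2 * dt) for a, b in zip(p_m1, p_p1)]
        d2 = [(a - 2 * c + b) / dt ** 2 for a, c, b in zip(p_m1, p0, p_p1)]
        d3 = [(e - 2 * d + 2 * b - a) / (2 * dt ** 3)
              for a, b, d, e in zip(p_m2, p_m1, p_p1, p_p2)]
        c = cross(d1, d2)
        speed, c_norm = norm(d1), norm(c)
        if speed == 0.0 or c_norm == 0.0:
            out.append((0.0, 0.0))  # stationary or locally straight segment
            continue
        out.append((c_norm / speed ** 3, dot(c, d3) / c_norm ** 2))
    return out

# Sanity check on a helix (cos t, sin t, t): curvature and torsion are both 1/2.
dt = 0.05
helix = [(math.cos(k * dt), math.sin(k * dt), k * dt) for k in range(50)]
invariants = curvature_torsion(helix, dt)
```

Both quantities are unchanged by rigid rotations and translations of the trajectory, which is what makes them attractive as pose-independent descriptors of motion.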

[CV-126] A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

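[Quick Read]: This paper addresses the high training cost and difficult real-time deployment of modern scene text recognition systems that depend on large end-to-end architectures, difficulties that stem from constraints on memory, compute, and latency. It proposes a novel training-free, plug-and-play framework that leverages pre-trained text recognizers while minimizing redundant computation. The key is an attention-based segmentation stage built on context understanding, which refines candidate text regions at the pixel level to improve downstream recognition, combined with pre-trained captioners that use contextual information to generate word predictions directly from scene images, avoiding traditional block-level comparison. Predictions are scored through semantic and lexical evaluation, and those that meet the confidence threshold skip the heavier end-to-end text recognition step, giving faster inference and less unnecessary computation. Experiments show performance on par with state-of-the-art systems on public benchmarks with substantially fewer resources.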

Link: https://arxiv.org/abs/2503.15639
Authors: Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu
Affiliations: CVPR Unit, Indian Statistical Institute, Kolkata, India
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image, it harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene images. These texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.

[CV-127] Vision-Speech Models: Teaching Speech Models to Converse about Images

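[Quick Read]: This paper asks how to give a pretrained speech model vision understanding, as a step toward a multimodal speech model that can converse freely about images. The main challenges are: (i) paired image-speech datasets are scarce; (ii) inference must meet real-time, low-latency requirements, which constrains compute and memory; and (iii) the model must preserve prosodic features, such as speaker tone, that cannot be inferred from text alone. To address them, the paper extends Moshi, a recently released dialogue speech LLM, into MoshiVis by adding visual inputs through lightweight adaptation modules, with a dynamic gating mechanism that lets the model switch between visual inputs and unrelated conversation topics. To cut training cost, it uses a one-stage, parameter-efficient fine-tuning pipeline trained on a mixture of image-text ("speechless") and image-speech samples. The model is evaluated on downstream visual understanding tasks with both audio and text prompts, and qualitative samples of interactions with MoshiVis are reported.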

Link: https://arxiv.org/abs/2503.15633
Authors: Amélie Royer, Moritz Böhle, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, Patrick Pérez
Affiliations: Kyutai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The recent successes of Vision-Language models raise the question of how to equivalently imbue a pretrained speech model with vision understanding, an important milestone towards building a multimodal speech model able to freely converse about images. Building such a conversational Vision-Speech model brings its unique challenges: (i) paired image-speech datasets are much scarcer than their image-text counterparts, (ii) ensuring real-time latency at inference is crucial thus bringing compute and memory constraints, and (iii) the model should preserve prosodic features (e.g., speaker tone) which cannot be inferred from text alone. In this work, we introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. To reduce training costs, we design a simple one-stage, parameter-efficient fine-tuning pipeline in which we leverage a mixture of image-text (i.e., “speechless”) and image-speech samples. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis. Our inference code will be made available, as well as the image-speech data used for audio evaluation.

[CV-128] EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis

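[Quick Read]: This paper addresses the limited spatial coverage and potential bias of traditional surficial geologic mapping, which is labor-intensive. It introduces EarthScape, a novel AI-ready multimodal dataset designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data, with detailed annotations for seven surficial geologic classes. The authors build a complete data processing pipeline from open-source raw data and establish baseline benchmarks across spatial modalities to demonstrate the dataset's utility. As a living dataset intended to grow, EarthScape bridges computer vision and Earth sciences and offers a resource for research in multimodal learning, geospatial analysis, and geological mapping.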

Link: https://arxiv.org/abs/2503.15625
Authors: Matthew Massey, Abdullah-Al-Zubaer Imran
Affiliations: University of Kentucky
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Surficial geologic mapping is essential for understanding Earth surface processes, addressing modern challenges such as climate change and national security, and supporting common applications in engineering and resource management. However, traditional mapping methods are labor-intensive, limiting spatial coverage and introducing potential biases. To address these limitations, we introduce EarthScape, a novel, AI-ready multimodal dataset specifically designed for surficial geologic mapping and Earth surface analysis. EarthScape integrates high-resolution aerial RGB and near-infrared (NIR) imagery, digital elevation models (DEM), multi-scale DEM-derived terrain features, and hydrologic and infrastructure vector data. The dataset provides detailed annotations for seven distinct surficial geologic classes encompassing various geological processes. We present a comprehensive data processing pipeline using open-sourced raw data and establish baseline benchmarks using different spatial modalities to demonstrate the utility of EarthScape. As a living dataset with a vision for expansion, EarthScape bridges the gap between computer vision and Earth sciences, offering a valuable resource for advancing research in multimodal learning, geospatial analysis, and geological mapping. Our code is available at this https URL.

[CV-129] CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

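[Quick Read]: This paper addresses the accuracy loss that traditional transformer-based semantic segmentation incurs by relying on quantized embeddings. The authors find that autoencoder accuracy on segmentation masks is 8% lower with quantized embeddings (e.g., VQ-VAE) than with continuous-valued embeddings (e.g., KL-VAE). They therefore propose a continuous-valued embedding framework that reformulates semantic mask generation as a continuous image-to-embedding diffusion process, removing the need for discrete latent representations while preserving fine-grained spatial and semantic detail. The key contribution is a diffusion-guided autoregressive transformer that models long-range dependencies in image features to learn a continuous semantic embedding space. The unified architecture combines a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. This setup improves robustness under distribution shift, such as adverse weather and viewpoint changes, and enables zero-shot domain adaptation. Experiments on several datasets (e.g., Cityscapes and its domain-shifted variants) show state-of-the-art robustness to distribution shift and strong resilience to Gaussian noise, moderate motion blur, and brightness/contrast changes.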

Link: https://arxiv.org/abs/2503.15617
Authors: Masud Ahmed, Zahid Hasan, Syed Arefinul Haque, Abu Zaher Md Faridee, Sanjay Purushotham, Suya You, Nirmalya Roy
Affiliations: University of Maryland Baltimore County; Northeastern University; Amazon Inc.; DEVCOM Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation mask using quantized embeddings (e.g. VQ-VAE) is 8% lower than continuous-valued embeddings (e.g. KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (≈95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (≈90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: this https URL

[CV-130] How to Train Your Dragon: Automatic Diffusion-Based Rigging for Characters with Diverse Topologies

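[Quick Read]: This paper addresses the limits of existing diffusion-based methods when animating images of characters with diverse skeletal topologies. Such methods typically rely on human-specific body pose representations and require extensive training on labeled real videos. The key idea is a new model and data generation scheme: given a small number (3-5) of example frames with corresponding skeletal information, the model quickly infers a rig for the target character and generates images for new poses. At its core is an efficient procedural data generation pipeline that samples training data with diverse skeletal topologies on the fly, used together with a novel skeleton representation to train the model over a large space of textures and topologies. During fine-tuning the model adapts rapidly to unseen target characters and generalizes well to rendering new poses, for both realistic and cartoon-style appearances. To evaluate this new and challenging task, the authors create the first 2D video dataset containing both humanoid and non-humanoid subjects with per-frame keypoint annotations.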

Link: https://arxiv.org/abs/2503.15586
Authors: Zeqi Gu, Difan Liu, Timothy Langlois, Matthew Fisher, Abe Davis
Affiliations: Cornell Tech; Adobe Research; Cornell University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Eurographics 2025

Abstract:Recent diffusion-based methods have achieved impressive results on animating images of human subjects. However, most of that success has built on human-specific body pose representations and extensive training with labeled real videos. In this work, we extend the ability of such models to animate images of characters with more diverse skeletal topologies. Given a small number (3-5) of example frames showing the character in different poses with corresponding skeletal information, our model quickly infers a rig for that character that can generate images corresponding to new skeleton poses. We propose a procedural data generation pipeline that efficiently samples training data with diverse topologies on the fly. We use it, along with a novel skeleton representation, to train our model on articulated shapes spanning a large space of textures and topologies. Then during fine-tuning, our model rapidly adapts to unseen target characters and generalizes well to rendering new poses, both for realistic and more stylized cartoon appearances. To better evaluate performance on this novel and challenging task, we create the first 2D video dataset that contains both humanoid and non-humanoid subjects with per-frame keypoint annotations. With extensive experiments, we demonstrate the superior quality of our results. Project page: this https URL

[CV-131] A Bird Song Detector for improving bird identification through Deep Learning: a case study from Doñana

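[Quick Read]: This paper addresses the difficulty of extracting information from the large volumes of unsupervised audio that automatic recorders generate in Passive Acoustic Monitoring. The key is a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a region facing significant conservation threats. The approach combines a Bird Song Detector with custom classifiers trained on BirdNET embeddings; isolating vocalization segments with the detector before classification improves species identification. The paper argues that combining a Bird Song Detector with fine-tuned classification models is an effective way to improve bird identification in local soundscapes, and highlights the need to adapt general-purpose tools to specific ecological challenges.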

Link: https://arxiv.org/abs/2503.15576
Authors: Alba Márquez-Rodríguez, Miguel Ángel Mohedano-Munoz, Manuel J. Marín-Jiménez, Eduardo Santamaría-García, Giulia Bastianelli, Pedro Jordano, Irene Mendoza
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 20 pages, 13 images, for associated dataset see this https URL, for associated code see this https URL and this https URL

Abstract:Passive Acoustic Monitoring with automatic recorders is essential for ecosystem conservation but generates vast unsupervised audio data, posing challenges for extracting meaningful information. Deep Learning techniques offer a promising solution. BirdNET, a widely used model for bird identification, has shown success in many study systems but is limited in some regions due to biases in its training data. A key challenge in bird species detection is that many recordings either lack target species or contain overlapping vocalizations. To overcome these problems, we developed a multi-stage pipeline for automatic bird vocalization identification in Doñana National Park (SW Spain), a region facing significant conservation threats. Our approach included a Bird Song Detector to isolate vocalizations and custom classifiers trained with BirdNET embeddings. We manually annotated 461 minutes of audio from three habitats across nine locations, yielding 3,749 annotations for 34 classes. Spectrograms facilitated the use of image processing techniques. Applying the Bird Song Detector before classification improved species identification, as all classification models performed better when analyzing only the segments where birds were detected. Specifically, the combination of the Bird Song Detector and fine-tuned BirdNET outperformed the baseline without the Bird Song Detector. Our approach demonstrated the effectiveness of integrating a Bird Song Detector with fine-tuned classification models for bird identification at local soundscapes. These findings highlight the need to adapt general-purpose tools for specific ecological challenges, as demonstrated in Doñana. Automatically detecting bird species serves for tracking the health status of this threatened ecosystem, given the sensitivity of birds to environmental changes, and helps in the design of conservation measures for reducing biodiversity loss.

[CV-132] Shap-MeD

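[Quick Read]: This paper develops a text-to-3D object generative model specialized for the biomedical domain, intended to assist 3D modeling of medical objects and reduce development time. The key is fine-tuning Shap-e, an open-source model developed by OpenAI, on a dataset of biomedical objects. This lowers the mean squared error (MSE) in latent generation from Shap-e's 0.147 to 0.089, and a qualitative evaluation shows higher structural accuracy in biomedical object generation.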

Link: https://arxiv.org/abs/2503.15562
Authors: Nicolás Laverde, Melissa Robles, Johan Rodríguez
Affiliations: Universidad de los Andes
Categories: Graphics (cs.GR); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Shap-MeD, a text-to-3D object generative model specialized in the biomedical domain. The objective of this study is to develop an assistant that facilitates the 3D modeling of medical objects, thereby reducing development time. 3D modeling in medicine has various applications, including surgical procedure simulation and planning, the design of personalized prosthetic implants, medical education, the creation of anatomical models, and the development of research prototypes. To achieve this, we leverage Shap-e, an open-source text-to-3D generative model developed by OpenAI, and fine-tune it using a dataset of biomedical objects. Our model achieved a mean squared error (MSE) of 0.089 in latent generation on the evaluation set, compared to Shap-e’s MSE of 0.147. Additionally, we conducted a qualitative evaluation, comparing our model with others in the generation of biomedical objects. Our results indicate that Shap-MeD demonstrates higher structural accuracy in biomedical object generation.

[CV-133] Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

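[Quick Read]: This paper addresses the challenge of building Physical AI systems that perceive and understand the physical world and generate appropriate embodied decisions. It presents the Cosmos-Reason1 models, which understand the physical world and generate embodied decisions (e.g., the next action) in natural language through long chain-of-thought reasoning. The approach defines key capabilities for Physical AI reasoning, focusing on physical common sense and embodied reasoning, with a hierarchical ontology representing physical common sense and a two-dimensional ontology that generalizes across physical embodiments. Building on these, the authors develop two multimodal large language models and train them in four stages. Evaluation on purpose-built benchmarks shows that Physical AI supervised fine-tuning (SFT) and reinforcement learning (RL) bring significant improvements.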

Link: https://arxiv.org/abs/2503.15558
Authors: NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as the post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at this https URL.

[CV-134] Motion Synthesis with Sparse and Flexible Keyjoint Control

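[Quick Read]: This paper addresses the labor-intensive creation of expressive character animation and the reliance of prior controllable motion generation on predefined dense spatio-temporal specifications (e.g., pelvis trajectories with exact per-frame timing), which limits practicality. It proposes a practical controllable motion synthesis framework that handles high-level intent and intuitive control from sparse and flexible keyjoint signals. The method uses a decomposed diffusion-based framework that first synthesizes keyjoint movements from sparse input control signals and then generates full-body motion from the completed keyjoint trajectories. The low-dimensional keyjoint movements adapt easily to various control signal types, and a time-agnostic control formulation removes the need for frame-specific timing annotations, increasing control flexibility. The shared second stage then synthesizes natural whole-body motion that satisfies the task requirements. Comprehensive experiments on diverse datasets and scenarios demonstrate the effectiveness of sparse and flexible keyjoint control.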

Link: https://arxiv.org/abs/2503.15557
Authors: Inwoo Hwang, Jinseok Bae, Donggeun Lim, Young Min Kim
Affiliations: Seoul National University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 11 pages, Project Page: this http URL

Abstract:Creating expressive character animations is labor-intensive, requiring intricate manual adjustment by animators across space and time. Previous works on controllable motion generation often rely on a predefined set of dense spatio-temporal specifications (e.g., dense pelvis trajectories with exact per-frame timing), limiting practicality for animators. To process high-level intent and intuitive control in diverse scenarios, we propose a practical controllable motion synthesis framework that respects sparse and flexible keyjoint signals. Our approach employs a decomposed diffusion-based motion synthesis framework that first synthesizes keyjoint movements from sparse input control signals and then synthesizes full-body motion based on the completed keyjoint trajectories. The low-dimensional keyjoint movements can easily adapt to various control signal types, such as end-effector position for diverse goal-driven motion synthesis, or incorporate functional constraints on a subset of keyjoints. Additionally, we introduce a time-agnostic control formulation, eliminating the need for frame-specific timing annotations and enhancing control flexibility. Then, the shared second stage can synthesize a natural whole-body motion that precisely satisfies the task requirement from dense keyjoint movements. We demonstrate the effectiveness of sparse and flexible keyjoint control through comprehensive experiments on diverse datasets and scenarios.

[CV-135] KHAIT: K-9 Handler Artificial Intelligence Teaming for Collaborative Sensemaking

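[Quick Read]: This paper addresses the "sensemaking gap" between handlers and canines in urban search and rescue (USAR), which arises from challenging environments and the specific behaviors canines are trained to exhibit. Because the canine often works out of the handler's sight, the handler lacks awareness of its location and situation, which limits the efficiency and accuracy of operations. The paper proposes KHAIT, which integrates object detection-based Artificial Intelligence (AI) with Augmented Reality (AR). Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT enables precise and rapid object detection from the canine's perspective and improves survivor localization. In a real-world USAR environment, the approach reduced average survival allocation time by 22%.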

Link: https://arxiv.org/abs/2503.15524
Authors: Matthew Wilchek, Linhan Wang, Sally Dickinson, Erica Feuerbacher, Kurt Luther, Feras A. Batarseh
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
Comments: 13 pages, 7 figures, ACM 30th International Conference on Intelligent User Interfaces (IUI 25)

Abstract:In urban search and rescue (USAR) operations, communication between handlers and specially trained canines is crucial but often complicated by challenging environments and the specific behaviors canines are trained to exhibit when detecting a person. Since a USAR canine often works out of sight of the handler, the handler lacks awareness of the canine’s location and situation, known as the ‘sensemaking gap.’ In this paper, we propose KHAIT, a novel approach to close the sensemaking gap and enhance USAR effectiveness by integrating object detection-based Artificial Intelligence (AI) and Augmented Reality (AR). Equipped with AI-powered cameras, edge computing, and AR headsets, KHAIT enables precise and rapid object detection from a canine’s perspective, improving survivor localization. We evaluate this approach in a real-world USAR environment, demonstrating an average survival allocation time decrease of 22%, enhancing the speed and accuracy of operations.

[CV-136] Benchmarking Zero-Shot Facial Emotion Annotation with Large Language Models: A Multi-Class and Multi-Frame Approach in DailyLife

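[Quick Read]: This paper explores the feasibility and performance of using large language models (LLMs) to automatically annotate human emotions in everyday scenarios. Experiments use the DailyLife subset of the publicly available FERV39k dataset, with GPT-4o-mini performing zero-shot labeling of key frames extracted from video segments. The paper evaluates LLM performance under a seven-class emotion taxonomy ("Angry," "Disgust," "Fear," "Happy," "Neutral," "Sad," "Surprise") and under ternary classification (negative/neutral/positive), and proposes integrating multiple frames within 1-2 second clips to improve labeling accuracy while reducing cost. The results show a slight gain in annotation accuracy, indicating the potential of zero-shot LLMs for automatic facial emotion annotation.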

Link: https://arxiv.org/abs/2502.12454
Authors: He Zhang, Xinyi Fu
Affiliations: College of Information Sciences and Technology, Penn State University; The Future Laboratory, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:This study investigates the feasibility and performance of using large language models (LLMs) to automatically annotate human emotions in everyday scenarios. We conducted experiments on the DailyLife subset of the publicly available FERV39k dataset, employing the GPT-4o-mini model for rapid, zero-shot labeling of key frames extracted from video segments. Under a seven-class emotion taxonomy (“Angry,” “Disgust,” “Fear,” “Happy,” “Neutral,” “Sad,” “Surprise”), the LLM achieved an average precision of approximately 50%. In contrast, when limited to ternary emotion classification (negative/neutral/positive), the average precision increased to approximately 64%. Additionally, we explored a strategy that integrates multiple frames within 1-2 second video clips to enhance labeling performance and reduce costs. The results indicate that this approach can slightly improve annotation accuracy. Overall, our preliminary findings highlight the potential application of zero-shot LLMs in human facial emotion annotation tasks, offering new avenues for reducing labeling costs and broadening the applicability of LLMs in complex multimodal environments.

[CV-137] Attentional Triple-Encoder Network in Spatiospectral Domains for Medical Image Segmentation

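[Quick Read]: This paper addresses the fact that traditional retinal Optical Coherence Tomography (OCT) segmentation methods focus on either the spatial or the spectral domain and overlook their combined dependencies. It proposes a triple-encoder network that extracts spatial features with convolutional neural networks (CNNs), captures spectral features with Fast Fourier Convolution (FFC), and uses attention mechanisms to capture global relationships across both domains. Attention fusion modules combine convolution and cross-attention to further enhance the features. The method raises the average Dice score from 0.855 to 0.864, outperforming prior work.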

Link: https://arxiv.org/abs/2503.16389
Authors: Kristin Qi, Xinhan Di
Affiliations: University of Massachusetts Boston; Giant Network AI Lab
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Conference on Artificial Intelligence (IEEE CAI)

Abstract:Retinal Optical Coherence Tomography (OCT) segmentation is essential for diagnosing pathology. Traditional methods focus on either spatial or spectral domains, overlooking their combined dependencies. We propose a triple-encoder network that integrates CNNs for spatial features, Fast Fourier Convolution (FFC) for spectral features, and attention mechanisms to capture global relationships across both domains. Attention fusion modules integrate convolution and cross-attention to further enhance features. Our method achieves an average Dice score improvement from 0.855 to 0.864, outperforming prior work.
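
For readers unfamiliar with the metric quoted above, the Dice score measures the overlap between a predicted mask A and a reference mask B as 2|A∩B| / (|A| + |B|). The sketch below computes it on flat binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks.

    pred, target: sequences of 0/1 (or falsy/truthy) values, one per pixel.
    Returns a float in [0, 1]; 1.0 means perfect overlap.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention for this sketch.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```

In the demo, one pixel overlaps and each mask has two foreground pixels, so the score is 2·1 / (2 + 2) = 0.5.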

[CV-138] Rapid patient-specific neural networks for intraoperative X-ray to volume registration

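[Quick Read]: This paper addresses the failure of existing 2D/3D registration methods to work across the broad range of procedures that depend on X-ray guidance, and the clinical burden they impose. Traditional optimization techniques require custom parameter tuning for each subject, while neural networks trained on small datasets either do not generalize to new patients or require labor-intensive manual annotation, which limits their practical use.

The key to the solution is xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration. xvr uses physics-based simulation to generate high-quality training data from a patient's own preoperative volumetric imaging, overcoming the limited ability of supervised models to generalize. It needs only 5 minutes of training per patient, making it suitable for both emergency and planned procedures. Experiments show that xvr generalizes robustly on a real X-ray dataset spanning multiple anatomical structures, imaging modalities, and hospitals, and achieves submillimeter-accurate registration at intraoperative speeds, improving on existing methods by an order of magnitude.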

Link: https://arxiv.org/abs/2503.16309
Authors: Vivek Gopalakrishnan, Neel Dey, David-Dimitris Chlorogiannis, Andrew Abumoussa, Anna M. Larson, Darren B. Orbach, Sarah Frisken, Polina Golland
Affiliations: Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology (MIT); Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT); Department of Radiology, Brigham and Women’s Hospital and Harvard Medical School; Department of Neurosurgery, University of North Carolina School of Medicine; Department of Interventional Neuroradiology, Boston Children’s Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:The integration of artificial intelligence in image-guided interventions holds transformative potential, promising to extract 3D geometric and quantitative information from conventional 2D imaging modalities during complex procedures. Achieving this requires the rapid and precise alignment of 2D intraoperative images (e.g., X-ray) with 3D preoperative volumes (e.g., CT, MRI). However, current 2D/3D registration methods fail across the broad spectrum of procedures dependent on X-ray guidance: traditional optimization techniques require custom parameter tuning for each subject, whereas neural networks trained on small datasets do not generalize to new patients or require labor-intensive manual annotations, increasing clinical burden and precluding application to new anatomical targets. To address these challenges, we present xvr, a fully automated framework for training patient-specific neural networks for 2D/3D registration. xvr uses physics-based simulation to generate abundant high-quality training data from a patient’s own preoperative volumetric imaging, thereby overcoming the inherently limited ability of supervised models to generalize to new patients and procedures. Furthermore, xvr requires only 5 minutes of training per patient, making it suitable for emergency interventions as well as planned procedures. We perform the largest evaluation of a 2D/3D registration algorithm on real X-ray data to date and find that xvr robustly generalizes across a diverse dataset comprising multiple anatomical structures, imaging modalities, and hospitals. Across surgical tasks, xvr achieves submillimeter-accurate registration at intraoperative speeds, improving upon existing methods by an order of magnitude. xvr is released as open-source software freely available at this https URL.

[CV-139] Do image and video quality metrics model low-level human vision?

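[Quick Read]: This paper addresses the fact that image and video quality metrics (such as SSIM, LPIPS, and VMAF) are often claimed to be "perceptual" yet rarely model human visual perception directly, relying instead on hand-crafted formulas or training datasets to align with perceptual data. It proposes a set of tests for full-reference quality metrics that examine how well they model aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. Using these tests, the authors analyze the strengths and weaknesses of 33 existing image and video quality metrics, finding for example that LPIPS and MS-SSIM predict contrast masking well while VMAF performs poorly at it, and that SSIM overemphasizes differences at high spatial frequencies, a shortcoming its multi-scale counterpart MS-SSIM addresses. Such findings cannot easily be obtained with existing evaluation protocols.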

Link: https://arxiv.org/abs/2503.16264
Authors: Dounia Hammou, Yancheng Cai, Pavan Madhusudanarao, Christos G. Bampis, Rafał K. Mantiuk
Affiliations: University of Cambridge; Netflix, Inc.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Image and video quality metrics, such as SSIM, LPIPS, and VMAF, aim to predict the perceived quality of the evaluated content and are often claimed to be “perceptual”. Yet, few metrics directly model human visual perception, and most rely on hand-crafted formulas or training datasets to achieve alignment with perceptual data. In this paper, we propose a set of tests for full-reference quality metrics that examine their ability to model several aspects of low-level human vision: contrast sensitivity, contrast masking, and contrast matching. The tests are meant to provide additional scrutiny for newly proposed metrics. We use our tests to analyze 33 existing image and video quality metrics and find their strengths and weaknesses, such as the ability of LPIPS and MS-SSIM to predict contrast masking and the poor performance of VMAF in this task. We further find that the popular SSIM metric overemphasizes differences in high spatial frequencies, but its multi-scale counterpart, MS-SSIM, addresses this shortcoming. Such findings cannot be easily made using existing evaluation protocols.

[CV-140] Efficient Bayesian Computation Using Plug-and-Play Priors for Poisson Inverse Problems

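[Quick Read]: This paper addresses Bayesian inference in low-photon Poisson imaging problems, which have important applications in astronomy, medicine, and biology. Existing plug-and-play (PnP) Langevin sampling algorithms are poorly suited to these problems because of high solution uncertainty and poor regularity properties, such as exploding gradients and non-negativity constraints. The paper proposes two strategies for extending PnP Langevin sampling to Poisson imaging models: (i) an accelerated PnP Langevin method that incorporates boundary reflections and a Poisson likelihood approximation; and (ii) a mirror sampling algorithm that leverages a Riemannian geometry to handle the constraints and the poor regularity of the likelihood without approximations. Both are validated through extensive numerical experiments and comparisons with state-of-the-art methods.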

Link: https://arxiv.org/abs/2503.16222
Authors: Teresa Klatzer, Savvas Melidonis, Marcelo Pereyra, Konstantinos C. Zygalakis
Affiliations: Unknown
Categories: Computation (stat.CO); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 31 pages, 17 figures

Abstract:This paper introduces a novel plug-and-play (PnP) Langevin sampling methodology for Bayesian inference in low-photon Poisson imaging problems, a challenging class of problems with significant applications in astronomy, medicine, and biology. PnP Langevin sampling algorithms offer a powerful framework for Bayesian image restoration, enabling accurate point estimation as well as advanced inference tasks, including uncertainty quantification and visualization analyses, and empirical Bayesian inference for automatic model parameter tuning. However, existing PnP Langevin algorithms are not well-suited for low-photon Poisson imaging due to high solution uncertainty and poor regularity properties, such as exploding gradients and non-negativity constraints. To address these challenges, we propose two strategies for extending Langevin PnP sampling to Poisson imaging models: (i) an accelerated PnP Langevin method that incorporates boundary reflections and a Poisson likelihood approximation and (ii) a mirror sampling algorithm that leverages a Riemannian geometry to handle the constraints and the poor regularity of the likelihood without approximations. The effectiveness of these approaches is demonstrated through extensive numerical experiments and comparisons with state-of-the-art methods.
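
To make strategy (i) a little more concrete, here is a toy sketch of an unadjusted Langevin step with reflection at the non-negativity boundary, applied to a single Poisson observation. It is not the paper's algorithm: the PnP denoiser prior is replaced by a simple exponential prior with rate `lam`, the Poisson likelihood gradient is smoothed with a small offset `beta` so it stays finite at zero, and there is no acceleration. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def grad_log_posterior(x, y, lam, beta):
    """Gradient of y*log(x + beta) - x - lam*x with respect to x.

    The offset beta > 0 is an assumed smoothing that keeps the Poisson
    likelihood gradient bounded as x approaches 0.
    """
    return y / (x + beta) - 1.0 - lam


def reflected_langevin(y, lam=1.0, beta=0.1, step=0.05,
                       n_steps=20000, burn_in=1000, seed=0):
    """Sample x >= 0 given one count y ~ Poisson(x), toy 1D setting."""
    rng = random.Random(seed)
    x = 1.0
    samples = []
    for k in range(n_steps + burn_in):
        noise = rng.gauss(0.0, 1.0)
        proposal = (x + step * grad_log_posterior(x, y, lam, beta)
                    + math.sqrt(2.0 * step) * noise)
        # Reflect at the boundary x = 0 so the chain stays non-negative.
        x = abs(proposal)
        if k >= burn_in:
            samples.append(x)
    return samples


if __name__ == "__main__":
    draws = reflected_langevin(y=5)
    print(sum(draws) / len(draws))
```

With `y = 5` and `lam = 1`, the exact posterior without the `beta` offset is a Gamma distribution with shape 6 and rate 2, whose mean is 3, so the sample mean should land near that value up to discretization bias and Monte Carlo error.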

[CV-141] Selective Complementary Feature Fusion and Modal Feature Compression Interaction for Brain Tumor Segmentation

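[Quick Read]: This paper addresses the accuracy of multi-modal feature fusion in brain glioma segmentation. Because MRI modalities differ in their characteristics, cross-modal fusion must cope with large feature discrepancies, which leads models to ignore rich feature information. In addition, the growth of feature dimensions in parallel networks causes redundant multi-modal feature interaction, which makes fusion at the bottom of the network even harder. To address these problems, the paper proposes a novel complementary feature compression interaction network (CFCI-Net), whose key idea is an efficient modal fusion strategy that achieves complementary fusion and compression interaction of multi-modal feature information. Specifically, it first introduces a selective complementary feature fusion (SCFF) module that adaptively fuses rich cross-modal feature information through complementary soft selection weights. It then designs a modal feature compression interaction (MFCI) transformer, which handles the redundancy caused by the surge in feature dimensions through modal feature compression (MFC) and modal feature interaction (MFI), and proposes within MFI a hierarchical interactive attention mechanism based on multi-head attention for interactive feature learning.

Link: https://arxiv.org/abs/2503.16149
Authors: Dong Chen, Boyue Zhao, Yi Zhang, Meng Zhao
Affiliations: Engineering Research Center of Learning-Based Intelligent System; Tianjin University of Technology; Ministry of Education
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: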

Abstract:An efficient modal feature fusion strategy is the key to accurate segmentation of brain glioma. However, due to the specificity of different MRI modes, it is difficult to carry out cross-modal fusion with large differences in modal features, resulting in the model ignoring rich feature information. On the other hand, the problem of multi-modal feature redundancy interaction occurs in parallel networks due to the proliferation of feature dimensions, further increasing the difficulty of multi-modal feature fusion at the bottom end. In order to solve the above problems, we propose a novel complementary feature compression interaction network (CFCI-Net), which realizes the complementary fusion and compression interaction of multi-modal feature information with an efficient mode fusion strategy. Firstly, we propose a selective complementary feature fusion (SCFF) module, which adaptively fuses rich cross-modal feature information by complementary soft selection weights. Secondly, a modal feature compression interaction (MFCI) transformer is proposed to deal with the multi-mode fusion redundancy problem when the feature dimension surges. The MFCI transformer is composed of modal feature compression (MFC) and modal feature interaction (MFI) to realize redundancy feature compression and multi-mode feature interactive learning. In MFI, we propose a hierarchical interactive attention mechanism based on multi-head attention. Evaluations on the BraTS2019 and BraTS2020 datasets demonstrate that CFCI-Net achieves superior results compared to state-of-the-art models. Code: this https URL

[CV-142] 3-D Image-to-Image Fusion in Lightsheet Microscopy by Two-Step Adversarial Network: Contribution to the FuseMyCells Challenge

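[Quick Read]: This paper addresses the limited penetration depth and degraded image quality that lightsheet microscopy faces when imaging at depth. Through the FuseMyCells challenge, it explores deep learning-based fusion of 3-D volumes from single views, with the aim of simplifying procedures and conserving the photon budget. The paper proposes a two-step solution: the first step processes a downsampled image to capture the overall information of the region of interest; the second step uses a patch-based approach for high-resolution inference and adds an adversarial loss to improve visual quality. The key to the method lies in handling high data resolution, the need for global context, and the preservation of high-frequency details, while improving 3-D image fusion quality and extending the capabilities of lightsheet microscopy. Experiments report an average structural similarity (SSIM) greater than 0.85 for the nucleus and 0.91 for membranes.

Link: https://arxiv.org/abs/2503.16075
Authors: Marek Wodzinski, Henning Müller
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: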

Abstract:Lightsheet microscopy is a powerful 3-D imaging technique that addresses limitations of traditional optical and confocal microscopy but suffers from a low penetration depth and reduced image quality at greater depths. Multiview lightsheet microscopy improves 3-D resolution by combining multiple views but simultaneously increases the complexity and the photon budget, leading to potential photobleaching and phototoxicity. The FuseMyCells challenge, organized in conjunction with the IEEE ISBI 2025 conference, aims to benchmark deep learning-based solutions for fusing high-quality 3-D volumes from single 3-D views, potentially simplifying procedures and conserving the photon budget. In this work, we propose a contribution to the FuseMyCells challenge based on a two-step procedure. The first step processes a downsampled version of the image to capture the entire region of interest, while the second step uses a patch-based approach for high-resolution inference, incorporating adversarial loss to enhance visual outcomes. This method addresses challenges related to high data resolution, the necessity of global context, and the preservation of high-frequency details. Experimental results demonstrate the effectiveness of our approach, highlighting its potential to improve 3-D image fusion quality and extend the capabilities of lightsheet microscopy. The average SSIM for the nucleus and membranes is greater than 0.85 and 0.91, respectively.

[CV-143] SALT: Singular Value Adaptation with Low-Rank Transformation

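[Quick Read]: This paper addresses the high cost and limited performance of fine-tuning large models for medical image segmentation. Conventional parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) update weights with low-rank matrices but may underfit when the rank is insufficient; methods based on full-rank singular value decomposition (SVD) update the model comprehensively but lack flexibility and perform inconsistently across datasets. The paper proposes SALT (Singular Value Adaptation with Low-Rank Transformation), whose key idea is to combine the strengths of LoRA and SVD: trainable scale and shift parameters selectively adjust the most influential singular values, and a low-rank update handles the remaining subspace. This hybrid strategy captures domain-specific features effectively without increasing model size or depth. With only 3.9% trainable parameters, it outperforms existing PEFT methods by 2% to 5% in Dice on 5 challenging medical datasets, and it performs particularly well in low-resource settings.

Link: https://arxiv.org/abs/2503.16055
Authors: Abdelrahman Elsayed, Sarim Hashmi, Mohammed Elseiagy, Hu Wang, Mohammad Yaqub, Ibrahim Almakky
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: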

Abstract:The complex nature of medical image segmentation calls for models that are specifically designed to capture detailed, domain-specific features. Large foundation models offer considerable flexibility, yet the cost of fine-tuning these models remains a significant barrier. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), efficiently update model weights with low-rank matrices but may suffer from underfitting when the chosen rank is insufficient to capture domain-specific nuances. Conversely, full-rank Singular Value Decomposition (SVD) based methods provide comprehensive updates by modifying all singular values, yet they often lack flexibility and exhibit variable performance across datasets. We propose SALT (Singular Value Adaptation with Low-Rank Transformation), a method that selectively adapts the most influential singular values using trainable scale and shift parameters while complementing this with a low-rank update for the remaining subspace. This hybrid approach harnesses the advantages of both LoRA and SVD, enabling effective adaptation without relying on increasing model size or depth. Evaluated on 5 challenging medical datasets, ranging from as few as 20 samples to 1000, SALT outperforms state-of-the-art PEFT (LoRA and SVD) by 2% to 5% in Dice with only 3.9% trainable parameters, demonstrating robust adaptation even in low-resource settings. The code for SALT is available at: this https URL
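One possible reading of the hybrid update described above is sketched below with NumPy. The abstract does not give the exact parameterisation, so the placement of the low-rank correction (added to the trailing block of the singular-value matrix) and the names `scale`, `shift`, `X`, and `Y` are assumptions made for illustration only.

```python
import numpy as np


def salt_style_update(W, k, scale, shift, X, Y):
    """Rebuild W after adapting its singular values.

    W     : (m, n) frozen weight matrix
    k     : number of leading singular values that get a scale and shift
    scale : (k,) multiplicative factors (hypothetical name)
    shift : (k,) additive offsets (hypothetical name)
    X, Y  : (r - k, rank) and (rank, r - k) low-rank factors for the
            remaining block, where r = min(m, n) (assumed placement)
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sigma = np.diag(s)
    # Leading singular values: element-wise scale and shift.
    sigma[:k, :k] = np.diag(scale * s[:k] + shift)
    # Remaining subspace: add a low-rank correction to the trailing block.
    sigma[k:, k:] += X @ Y
    return U @ sigma @ Vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))
    k, rank = 2, 1
    r = min(W.shape)
    # Neutral parameters (scale 1, shift 0, zero low-rank factors) give back W.
    W_same = salt_style_update(
        W, k, np.ones(k), np.zeros(k), np.zeros((r - k, rank)), np.zeros((rank, r - k))
    )
    print(np.allclose(W_same, W))
    # Trainable parameter count for this layer under the assumed layout.
    print(2 * k + 2 * (r - k) * rank)
```

Under this layout only `scale`, `shift`, `X`, and `Y` would be trained, while `U`, `s`, and `Vt` stay frozen, which is what keeps the trainable fraction small.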

[CV-144] Sequential Spatial-Temporal Network for Interpretable Automatic Ultrasonic Assessment of Fetal Head during labor

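[Quick Read]: This paper addresses the accuracy and standardization of Angle of Progression (AoP) and Head Symphysis Distance (HSD) measurements when intrapartum ultrasound is used to assess the fetal head relative to the pelvis and to predict delivery outcomes. The key solution is the Sequential Spatial-Temporal Network (SSTN), the first interpretable model designed for intrapartum ultrasound video analysis. SSTN first identifies standardized ultrasound planes, then segments anatomical structures such as the pubic symphysis and fetal head, and finally detects key landmarks to measure HSD and AoP precisely, which significantly improves measurement accuracy and reliability. Experiments show that SSTN reduces the mean absolute error relative to existing methods by 18% for AoP and 22% for HSD.

Link: https://arxiv.org/abs/2503.15861
Authors: Jie Gan, Zhuonan Liang, Jianan Fan, Lisa Mcguire, Caterina Watson, Jacqueline Spurway, Jillian Clarke, Weidong Cai
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted to 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI)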

Abstract:The intrapartum ultrasound guideline established by ISUOG highlights the Angle of Progression (AoP) and Head Symphysis Distance (HSD) as pivotal metrics for assessing fetal head descent and predicting delivery outcomes. Accurate measurement of the AoP and HSD requires a structured process. This begins with identifying standardized ultrasound planes, followed by the detection of specific anatomical landmarks within the regions of the pubic symphysis and fetal head that correlate with the delivery parameters AoP and HSD. Finally, these measurements are derived based on the identified anatomical landmarks. Addressing the clinical demands and standard operation process outlined in the ISUOG guideline, we introduce the Sequential Spatial-Temporal Network (SSTN), the first interpretable model specifically designed for intrapartum ultrasound video analysis. The SSTN operates by first identifying ultrasound planes, then segmenting anatomical structures such as the pubic symphysis and fetal head, and finally detecting key landmarks for precise measurement of HSD and AoP. Furthermore, the cohesive framework leverages task-related information to improve accuracy and reliability. Experimental evaluations on clinical datasets demonstrate that SSTN significantly surpasses existing models, reducing the mean absolute error by 18% for AoP and 22% for HSD.

[CV-145] Nano-3D: Metasurface-Based Neural Depth Imaging

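[Quick Read]: This paper addresses the trade-off between miniaturization and measurement accuracy in conventional depth cameras, whose bulky form factor or imprecise approximations limit their use in spatially constrained scenarios. The key solution is Nano-3D, a metasurface-based neural depth imaging method with an ultra-compact design. Nano-3D integrates a custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract precise metric depth information from monocular metasurface-polarized imagery. The core of the approach is combining nano-optics with computational imaging to overcome the limitations of conventional depth-sensing systems.

Link: https://arxiv.org/abs/2503.15770
Authors: Bingxuan Li, Jiahao Wu, Yuan Xu, Yunxiang Zhang, Zezheng Zhu, Nanfang Yu, Qi Sun
Affiliations: New York University; Columbia University
Subjects: Optics (physics.optics); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: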

Abstract:Depth imaging is a foundational building block for broad applications, such as autonomous driving and virtual/augmented reality. Traditionally, depth cameras have relied on time-of-flight sensors or multi-lens systems to achieve physical depth measurements. However, these systems often face a trade-off between a bulky form factor and imprecise approximations, limiting their suitability for spatially constrained scenarios. Inspired by the emerging advancements of nano-optics, we present Nano-3D, a metasurface-based neural depth imaging solution with an ultra-compact footprint. Nano-3D integrates our custom-fabricated 700 nm thick TiO2 metasurface with a multi-module deep neural network to extract precise metric depth information from monocular metasurface-polarized imagery. We demonstrate the effectiveness of Nano-3D with both simulated and physical experiments. We hope the exhibited success paves the way for the community to bridge future graphics systems with emerging nanomaterial technologies through novel computational approaches.

Artificial Intelligence

[AI-0] Graph of Effort: Quantifying Risk of AI Usage for Vulnerability Assessment

Link: https://arxiv.org/abs/2503.16392
Authors: Anket Mehra, Andreas Aßmuth, Malte Prieß
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages; accepted for the 16th International Conference on Cloud Computing, GRIDs, and Virtualization (Cloud Computing 2025), Valencia, Spain, 2025

Abstract:With AI-based software becoming widely available, the risk of exploiting its capabilities, such as high automation and complex pattern recognition, could significantly increase. An AI used offensively to attack non-AI assets is referred to as offensive AI. Current research explores how offensive AI can be utilized and how its usage can be classified. Additionally, methods for threat modeling are being developed for AI-based assets within organizations. However, there are gaps that need to be addressed. Firstly, there is a need to quantify the factors contributing to the AI threat. Secondly, there is a requirement to create threat models that analyze the risk of being attacked by AI for vulnerability assessment across all assets of an organization. This is particularly crucial and challenging in cloud environments, where sophisticated infrastructure and access control landscapes are prevalent. The ability to quantify and further analyze the threat posed by offensive AI enables analysts to rank vulnerabilities and prioritize the implementation of proactive countermeasures. To address these gaps, this paper introduces the Graph of Effort, an intuitive, flexible, and effective threat modeling method for analyzing the effort required to use offensive AI for vulnerability exploitation by an adversary. While the threat model is functional and provides valuable support, its design choices need further empirical validation in future work.

[AI-1] Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation

Link: https://arxiv.org/abs/2503.16385
Authors: Yijia Luo, Yulin Song, Xingyao Zhang, Jiaheng Liu, Weixun Wang, GengRu Chen, Wenbo Su, Bo Zheng
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities through long chain-of-thought (CoT) reasoning. The R1 distillation scheme has emerged as a promising approach for training cost-effective models with enhanced reasoning abilities. However, the underlying mechanisms driving its effectiveness remain unclear. This study examines the universality of distillation data and identifies key components that enable the efficient transfer of long-chain reasoning capabilities in LLM distillation. Our findings reveal that the effectiveness of long CoT reasoning distillation from teacher models like Qwen-QwQ degrades significantly on nonhomologous models, challenging the assumed universality of current distillation methods. To gain deeper insights into the structure and patterns of long CoT reasoning, we propose DLCoT (Deconstructing Long Chain-of-Thought), a distillation data enhancement framework. DLCoT consists of three key steps: (1) data segmentation to decompose complex long CoT structures, (2) simplification by eliminating unsolvable and redundant solutions, and (3) optimization of intermediate error states. Our approach significantly improves model performance and token efficiency, facilitating the development of high-performance LLMs.

[AI-2] Reinforcement Learning-based Heuristics to Guide Domain-Independent Dynamic Programming

Link: https://arxiv.org/abs/2503.16371
Authors: Minori Narita, Ryo Kuroiwa, J. Christopher Beck
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 4 figures, to be published in CPAIOR 2025 ( this https URL )

Abstract:Domain-Independent Dynamic Programming (DIDP) is a state-space search paradigm based on dynamic programming for combinatorial optimization. In its current implementation, DIDP guides the search using user-defined dual bounds. Reinforcement learning (RL) is increasingly being applied to combinatorial optimization problems and shares several key structures with DP, being represented by the Bellman equation and state-based transition systems. We propose using reinforcement learning to obtain a heuristic function to guide the search in DIDP. We develop two RL-based guidance approaches: value-based guidance using Deep Q-Networks and policy-based guidance using Proximal Policy Optimization. Our experiments indicate that RL-based guidance significantly outperforms standard DIDP and problem-specific greedy heuristics with the same number of node expansions. Further, despite longer node evaluation times, RL guidance achieves better run-time performance than standard DIDP on three of four benchmark domains.
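To make the idea of heuristic guidance concrete, here is a minimal sketch of best-first search over a state-transition system in which the priority of a node is its accumulated cost plus a heuristic estimate. It does not use the DIDP framework or any trained network: the lookup table `H` merely stands in for a learned value function, and the toy graph `EDGES` is invented for illustration.

```python
import heapq
from itertools import count


def guided_search(initial, is_goal, successors, heuristic):
    """Best-first search; returns (solution cost, number of node expansions).

    successors(state) yields (cost, next_state) pairs and heuristic(state)
    estimates the remaining cost. States must be hashable.
    """
    tie = count()  # tie-breaker so the heap never has to compare states
    frontier = [(heuristic(initial), next(tie), 0, initial)]
    best_g = {initial: 0}
    expansions = 0
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale entry: a cheaper path to this state was found
        if is_goal(state):
            return g, expansions
        expansions += 1
        for cost, nxt in successors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), next(tie), new_g, nxt))
    return None, expansions


# Toy transition system (hypothetical): state -> list of (cost, next_state).
EDGES = {"s": [(1, "a"), (4, "b")], "a": [(2, "b"), (6, "t")], "b": [(3, "t")], "t": []}
# Stand-in for a learned estimate of the remaining cost from each state.
H = {"s": 5, "a": 5, "b": 3, "t": 0}

if __name__ == "__main__":
    cost, expanded = guided_search("s", lambda s: s == "t", lambda s: EDGES[s], lambda s: H[s])
    print(cost, expanded)
```

Swapping `H` for a function that queries a trained model changes only the ordering of the frontier; whether the result remains optimal then depends on the quality of the learned estimate.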

[AI-3] Neural Networks: According to the Principles of Grassmann Algebra

Link: https://arxiv.org/abs/2503.16364
Authors: Z. Zarezadeh, N. Zarezadeh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we explore the algebra of quantum idempotents and the quantization of fermions which gives rise to a Hilbert space equal to the Grassmann algebra associated with the Lie algebra. Since idempotents carry representations of the algebra under consideration, they form algebraic varieties and smooth manifolds in the natural topology. In addition to the motivation of linking up mathematical physics with machine learning, it is also shown that by using idempotents and invariant subspace of the corresponding algebras, these representations encode and perhaps provide a probabilistic interpretation of reasoning and relational paths in geometrical terms.

[AI-4] Palatable Conceptions of Disembodied Being: Terra Incognita in the Space of Possible Minds

Link: https://arxiv.org/abs/2503.16348
Authors: Murray Shanahan
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Is it possible to articulate a conception of consciousness that is compatible with the exotic characteristics of contemporary, disembodied AI systems, and that can stand up to philosophical scrutiny? How would subjective time and selfhood show up for an entity that conformed to such a conception? Trying to answer these questions, even metaphorically, stretches the language of consciousness to breaking point. Ultimately, the attempt yields something like emptiness, in the Buddhist sense, and helps to undermine our dualistic inclinations towards subjectivity and selfhood.

[AI-5] HiQ-Lip: The First Quantum-Classical Hierarchical Method for Global Lipschitz Constant Estimation of ReLU Networks

Link: https://arxiv.org/abs/2503.16342
Authors: Haoqi He, Yan Xiao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments:

Abstract:Estimating the global Lipschitz constant of neural networks is crucial for understanding and improving their robustness and generalization capabilities. However, precise calculations are NP-hard, and current semidefinite programming (SDP) methods face challenges such as high memory usage and slow processing speeds. In this paper, we propose HiQ-Lip, a hybrid quantum-classical hierarchical method that leverages Coherent Ising Machines (CIMs) to estimate the global Lipschitz constant. We tackle the estimation by converting it into a Quadratic Unconstrained Binary Optimization (QUBO) problem and implement a multilevel graph coarsening and refinement strategy to adapt to the constraints of contemporary quantum hardware. Our experimental evaluations on fully connected neural networks demonstrate that HiQ-Lip not only provides estimates comparable to state-of-the-art methods but also significantly accelerates the computation process. In specific tests involving two-layer neural networks with 256 hidden neurons, HiQ-Lip doubles the solving speed and offers more accurate upper bounds than the existing best method, LiPopt. These findings highlight the promising utility of small-scale quantum devices in advancing the estimation of neural network robustness.

[AI-6] Enhancing Software Quality Assurance with an Adaptive Differential Evolution based Quantum Variational Autoencoder-Transformer Model

Link: https://arxiv.org/abs/2503.16335
Authors: Seshu Babu Barma, Mohanakrishnan Hariharan, Satish Arvapalli
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:An AI-powered quality engineering platform uses artificial intelligence to improve software quality assessment through automated defect prediction, optimized performance, and improved feature extraction. Existing models struggle with noisy data, data imbalance, complex pattern recognition, ineffective feature extraction, and weak generalization. To overcome these challenges, we develop the Adaptive Differential Evolution based Quantum Variational Autoencoder-Transformer Model (ADE-QVAET), which uses a Quantum Variational Autoencoder-Transformer (QVAET) to obtain high-dimensional latent features while maintaining sequential dependencies and contextual relationships, resulting in superior defect prediction accuracy. Adaptive Differential Evolution (ADE) optimization uses an adaptive parameter tuning method that enhances model convergence and predictive performance. ADE-QVAET integrates advanced AI techniques to create a robust solution for scalable and accurate software defect prediction that represents a top-level AI-driven technology for quality engineering applications. At a training percentage (TP) of 90, the proposed ADE-QVAET model attains accuracy, precision, recall, and F1-score of 98.08%, 92.45%, 94.67%, and 98.12%, respectively.

[AI-7] Knowledge-guided machine learning model with soil moisture for corn yield prediction under drought conditions

Link: https://arxiv.org/abs/2503.16328
Authors: Xiaoyu Wang, Yijia Xu, Jingyi Huang, Zhengwei Yang, Zhou Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Remote sensing (RS) techniques, by enabling non-contact acquisition of extensive ground observations, have become a valuable tool for corn yield prediction. Traditional process-based (PB) models are limited by fixed input features and struggle to incorporate large volumes of RS data. In contrast, machine learning (ML) models are often criticized for being “black boxes” with limited interpretability. To address these limitations, we used Knowledge-Guided Machine Learning (KGML), which combined the strengths of both approaches and fully used RS data. However, previous KGML methods overlooked the crucial role of soil moisture in plant growth. To bridge this gap, we proposed the Knowledge-Guided Machine Learning with Soil Moisture (KGML-SM) framework, using soil moisture as an intermediate variable to emphasize its key role in plant development. Additionally, based on the prior knowledge that the model may overestimate under drought conditions, we designed a drought-aware loss function that penalizes predicted yield in drought-affected areas. Our experiments showed that the KGML-SM model outperformed other ML models. Finally, we explored the relationships between drought, soil moisture, and corn yield prediction, assessing the importance of various features and analyzing how soil moisture impacts corn yield predictions across different regions and time periods.
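The abstract does not spell out the drought-aware loss, so the following is only a plausible sketch of the idea: ordinary mean squared error plus an extra penalty that applies when the model over-predicts yield for samples flagged as drought-affected. The function name, the squared-hinge form of the penalty, and the `weight` parameter are all assumptions.

```python
def drought_aware_loss(pred, target, drought, weight=1.0):
    """MSE plus an extra penalty on over-prediction for drought samples.

    pred, target : sequences of predicted and observed yields
    drought      : sequence of booleans, True where drought was recorded
    weight       : strength of the extra penalty (hypothetical parameter)
    """
    n = len(pred)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    # Only over-estimates (p > t) on drought samples contribute to the penalty.
    over = [max(0.0, p - t) ** 2 for p, t, d in zip(pred, target, drought) if d]
    penalty = sum(over) / len(over) if over else 0.0
    return mse + weight * penalty


if __name__ == "__main__":
    # Over-prediction on the drought sample is penalized twice: MSE and penalty.
    print(drought_aware_loss([10.0, 8.0], [10.0, 6.0], [False, True]))
    # Under-prediction of the same size only pays the MSE term.
    print(drought_aware_loss([10.0, 4.0], [10.0, 6.0], [False, True]))
```

The asymmetry is the point of the design: errors of equal magnitude cost more when they go in the direction the prior knowledge says the model tends to err.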

[AI-8] OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence

Link: https://arxiv.org/abs/2503.16326
Authors: Long Yuan, Fengran Mo, Kaiyu Huang, Wenjie Wang, Wangyuxuan Zhai, Xiaoyu Zhu, You Li, Jinan Xu, Jian-Yun Nie
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, Under review

Abstract:The rapid advancement of multimodal large language models (LLMs) has opened new frontiers in artificial intelligence, enabling the integration of diverse large-scale data types such as text, images, and spatial information. In this paper, we explore the potential of multimodal LLMs (MLLMs) for geospatial artificial intelligence (GeoAI), a field that leverages spatial data to address challenges in domains including Geospatial Semantics, Health Geography, Urban Geography, Urban Perception, and Remote Sensing. We propose an MLLM (OmniGeo) tailored to geospatial applications, capable of processing and analyzing heterogeneous data sources, including satellite imagery, geospatial metadata, and textual descriptions. By combining the strengths of natural language understanding and spatial reasoning, our model enhances the instruction-following ability and the accuracy of GeoAI systems. Results demonstrate that our model outperforms task-specific models and existing LLMs on diverse geospatial tasks, effectively addressing the multimodal nature of the data while achieving competitive results on zero-shot geospatial tasks. Our code will be released after publication.

[AI-9] Structured-Noise Masked Modeling for Video Audio and Beyond

Link: https://arxiv.org/abs/2503.16311
Authors: Aritra Bhowmik, Fida Mohammad Thoker, Carlos Hinojosa, Bernard Ghanem, Cees G. M. Snoek
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filtering white noise into distinct color noise distributions, we generate structured masks that preserve modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead. Extensive experiments demonstrate that structured noise masking achieves consistent improvement over random masking for standard and advanced masked modeling methods, highlighting the importance of modality-aware masking strategies for representation learning.
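As a rough illustration of filtering white noise into colored noise and turning it into a mask, the sketch below shapes the spectrum of a 2-D white-noise field with a power-law filter and masks the positions with the largest values. The power-law filter, the thresholding rule, and every name here are assumptions; the paper's actual mask construction may differ.

```python
import numpy as np


def colored_noise_mask(height, width, mask_ratio, exponent, seed=0):
    """Boolean mask (True = masked) obtained by thresholding filtered noise.

    exponent > 0 emphasizes low frequencies (red-like noise, blob-shaped masks);
    exponent < 0 emphasizes high frequencies (blue-like noise, scattered masks);
    exponent = 0 leaves the noise white, which amounts to random masking.
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((height, width))
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0  # avoid division by zero at the DC component
    shaped = np.fft.ifft2(np.fft.fft2(white) / radius ** exponent).real
    # Mask exactly round(mask_ratio * N) positions with the largest values.
    total = height * width
    n_masked = int(round(mask_ratio * total))
    order = np.argsort(shaped, axis=None)
    flat = np.zeros(total, dtype=bool)
    flat[order[total - n_masked:]] = True
    return flat.reshape(height, width)


if __name__ == "__main__":
    red_mask = colored_noise_mask(16, 16, mask_ratio=0.75, exponent=1.0)
    blue_mask = colored_noise_mask(16, 16, mask_ratio=0.75, exponent=-1.0)
    print(red_mask.sum(), blue_mask.sum())
```

Because the mask depends only on filtered noise, it needs no access to the data itself, which matches the claim in the abstract that no handcrafted heuristics or data access are required.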

[AI-10] Speeding up design and making to reduce time-to-project and time-to-market: an AI-Enhanced approach in engineering education

Link: https://arxiv.org/abs/2503.16307
Authors: Giovanni Adorni, Daniele Grosso
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures, AIxEDU 2024 conference

Abstract:This paper explores the integration of AI tools, such as ChatGPT and GitHub Copilot, in the Software Architecture for Embedded Systems course. AI-supported workflows enabled students to rapidly prototype complex projects, emphasizing real-world applications like SLAM robotics. Results demonstrated enhanced problem-solving, faster development, and more sophisticated outcomes, with AI augmenting but not replacing human decision-making.

[AI-11] Bridging Technology and Humanities: Evaluating the Impact of Large Language Models on Social Sciences Research with DeepSeek-R1

Link: https://arxiv.org/abs/2503.16304
Authors: Peiran Gu, Fuhao Duan, Wenhao Li, Bochen Xu, Ying Cai, Teng Yao, Chenxun Zhuo, Tianming Liu, Bao Ge
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2409.18486

Abstract:In recent years, the development of Large Language Models (LLMs) has made significant breakthroughs in the field of natural language processing and has gradually been applied to the field of humanities and social sciences research. LLMs have a wide range of application value in the field of humanities and social sciences because of their strong text understanding, generation and reasoning capabilities. In humanities and social sciences research, LLMs can analyze large-scale text data and make inferences. This article analyzes the large language model DeepSeek-R1 from seven aspects: low-resource language translation, educational question-answering, student writing improvement in higher education, logical reasoning, educational measurement and psychometrics, public health policy analysis, and art this http URL we compare the answers given by DeepSeek-R1 in the seven aspects with the answers given by o1-preview. DeepSeek-R1 performs well in the humanities and social sciences, answering most questions correctly and logically, and can give reasonable analysis processes and explanations. Compared with o1-preview, it can automatically generate reasoning processes and provide more detailed explanations, which is suitable for beginners or people who need to have a detailed understanding of this knowledge, while o1-preview is more suitable for quick reading. Through analysis, it is found that LLMs have broad application potential in the field of humanities and social sciences, and show great advantages in improving text analysis efficiency, language communication and other fields. Their powerful language understanding and generation capabilities enable them to deeply explore complex problems in the field of humanities and social sciences, and provide innovative tools for academic research and practical applications.

[AI-12] Diffusion-augmented Graph Contrastive Learning for Collaborative Filter

Link: https://arxiv.org/abs/2503.16290
Authors: Fan Huang, Wei Wang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph-based collaborative filtering has been established as a prominent approach in recommendation systems, leveraging the inherent graph topology of user-item interactions to model high-order connectivity patterns and enhance recommendation performance. Recent advances in Graph Contrastive Learning (GCL) have demonstrated promising potential to alleviate data sparsity issues by improving representation learning through contrastive view generation and mutual information maximization. However, existing approaches lack effective data augmentation strategies. Structural augmentation risks distorting fundamental graph topology, while feature-level perturbation techniques predominantly employ uniform noise scales that fail to account for node-specific characteristics. To solve these challenges, we propose Diffusion-augmented Contrastive Learning (DGCL), an innovative framework that integrates diffusion models with contrastive learning for enhanced collaborative filtering. Our approach employs a diffusion process that learns node-specific Gaussian distributions of representations, thereby generating semantically consistent yet diversified contrastive views through reverse diffusion sampling. DGCL facilitates adaptive data augmentation based on reconstructed representations, considering both semantic coherence and node-specific features. In addition, it explores unrepresented regions of the latent sparse feature space, thereby enriching the diversity of contrastive views. Extensive experimental results demonstrate the effectiveness of DGCL on three public datasets.

[AI-13] AI Agents in Cryptoland: Practical Attacks and No Silver Bullet

Link: https://arxiv.org/abs/2503.16248
Authors: Atharv Singh Patlan, Peiyao Sheng, S. Ashwin Hebbar, Prateek Mittal, Pramod Viswanath
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures

Abstract:The integration of AI agents with Web3 ecosystems harnesses their complementary potential for autonomy and openness, yet also introduces underexplored security risks, as these agents dynamically interact with financial protocols and immutable smart contracts. This paper investigates the vulnerabilities of AI agents within blockchain-based financial ecosystems when exposed to adversarial threats in real-world scenarios. We introduce the concept of context manipulation – a comprehensive attack vector that exploits unprotected context surfaces, including input channels, memory modules, and external data feeds. Through empirical analysis of ElizaOS, a decentralized AI agent framework for automated Web3 operations, we demonstrate how adversaries can manipulate context by injecting malicious instructions into prompts or historical interaction records, leading to unintended asset transfers and protocol violations which could be financially devastating. Our findings indicate that prompt-based defenses are insufficient, as malicious inputs can corrupt an agent’s stored context, creating cascading vulnerabilities across interactions and platforms. This research highlights the urgent need to develop AI agents that are both secure and fiduciarily responsible.

[AI-14] Flight Testing an Optionally Piloted Aircraft: a Case Study on Trust Dynamics in Human-Autonomy Teaming

Link: https://arxiv.org/abs/2503.16227
Authors: Jeremy C.-H. Wang,Ming Hou,David Dunwoody,Marko Ilievski,Justin Tomasi,Edward Chao,Carl Pigeon
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: IEEE International Conference on Human-Machine Systems 2025, keywords: trust, human factors, aviation, safety-critical, human-autonomy teaming

Abstract:This paper examines how trust is formed, maintained, or diminished over time in the context of human-autonomy teaming with an optionally piloted aircraft. Whereas traditional factor-based trust models offer a static representation of human confidence in technology, here we discuss how variations in the underlying factors lead to variations in trust, trust thresholds, and human behaviours. Over 200 hours of flight test data collected over a multi-year test campaign from 2021 to 2023 were reviewed. The dispositional-situational-learned, process-performance-purpose, and IMPACTS homeostasis trust models are applied to illuminate trust trends during nominal autonomous flight operations. The results offer promising directions for future studies on trust dynamics and design-for-trust in human-autonomy teaming.

[AI-15] Logic Explanation of AI Classifiers by Categorical Explaining Functors

Link: https://arxiv.org/abs/2503.16203
Authors: Stefano Fioravanti,Francesco Giannini,Paolo Frazzetto,Fabio Zanasi,Pietro Barbiero
Subjects: Artificial Intelligence (cs.AI)

Abstract:The most common methods in explainable artificial intelligence are post-hoc techniques which identify the most relevant features used by pretrained opaque models. Some of the most advanced post hoc methods can generate explanations that account for the mutual interactions of input features in the form of logic rules. However, these methods frequently fail to guarantee the consistency of the extracted explanations with the model’s underlying reasoning. To bridge this gap, we propose a theoretically grounded approach to ensure coherence and fidelity of the extracted explanations, moving beyond the limitations of current heuristic-based approaches. To this end, drawing from category theory, we introduce an explaining functor which structurally preserves logical entailment between the explanation and the opaque model’s reasoning. As a proof of concept, we validate the proposed theoretical constructions on a synthetic benchmark verifying how the proposed approach significantly mitigates the generation of contradictory or unfaithful explanations.

[AI-16] Large Language Models for Water Distribution Systems Modeling and Decision-Making

Link: https://arxiv.org/abs/2503.16191
Authors: Yinon Goldshtein,Gal Perelman,Assaf Schuster,Avi Ostfeld
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to EWRI Congress 2025

Abstract:The design, operations, and management of water distribution systems (WDS) involve complex mathematical models. These models are continually improving due to computational advancements, leading to better decision-making and more efficient WDS management. However, the significant time and effort required for modeling, programming, and analyzing results remain substantial challenges. Another issue is the professional burden, which confines the interaction with models, databases, and other sophisticated tools to a small group of experts, thereby causing non-technical stakeholders to depend on these experts or make decisions without modeling support. Furthermore, explaining model results is challenging even for experts, as it is often unclear which conditions cause the model to reach a certain state or recommend a specific policy. The recent advancements in Large Language Models (LLMs) open doors for a new stage in human-model interaction. This study proposes a framework of plain language interactions with hydraulic and water quality models based on LLM-EPANET architecture. This framework is tested with increasing levels of complexity of queries to study the ability of LLMs to interact with WDS models, run complex simulations, and report simulation results. The performance of the proposed framework is evaluated across several categories of queries and hyper-parameter configurations, demonstrating its potential to enhance decision-making processes in WDS management.

[AI-17] Neural Combinatorial Optimization for Real-World Routing

Link: https://arxiv.org/abs/2503.16159
Authors: Jiwoo Son,Zhikai Zhao,Federico Berto,Chuanbo Hua,Changhyun Kwon,Jinkyoo Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Vehicle Routing Problems (VRPs) are a class of NP-hard problems ubiquitous in several real-world logistics scenarios that pose significant challenges for optimization. Neural Combinatorial Optimization (NCO) has emerged as a promising alternative to classical approaches, as it can learn fast heuristics to solve VRPs. However, most research works in NCO for VRPs focus on simplified settings, which do not account for asymmetric distances and travel durations that cannot be derived by simple Euclidean distances and unrealistic data distributions, hindering real-world deployment. This work introduces RRNCO (Real Routing NCO) to bridge the gap of NCO between synthetic and real-world VRPs in the critical aspects of both data and modeling. First, we introduce a new, openly available dataset with real-world data containing a diverse dataset of locations, distances, and duration matrices from 100 cities, considering realistic settings with actual routing distances and durations obtained from Open Source Routing Machine (OSRM). Second, we propose a novel approach that efficiently processes both node and edge features through contextual gating, enabling the construction of more informed node embedding, and we finally incorporate an Adaptation Attention Free Module (AAFM) with neural adaptive bias mechanisms that effectively integrates not only distance matrices but also angular relationships between nodes, allowing our model to capture rich structural information. RRNCO achieves state-of-the-art results in real-world VRPs among NCO methods. We make our dataset and code publicly available at this https URL.

[AI-18] Unify and Triumph: Polyglot Diverse and Self-Consistent Generation of Unit Tests with LLMs

Link: https://arxiv.org/abs/2503.16144
Authors: Djamel Eddine Khelladi,Charly Reux,Mathieu Acher
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Large language model (LLM)-based test generation has gained attention in software engineering, yet most studies evaluate LLMs’ ability to generate unit tests in a single attempt for a given language, missing the opportunity to leverage LLM diversity for more robust testing. This paper introduces PolyTest, a novel approach that enhances test generation by exploiting polyglot and temperature-controlled diversity. PolyTest systematically leverages these properties in two complementary ways: (1) Cross-lingual test generation, where tests are generated in multiple languages at zero temperature and then unified; (2) Diverse test sampling, where multiple test sets are generated within the same language at a higher temperature before unification. A key insight is that LLMs can generate diverse yet contradicting tests – same input, different expected outputs – across languages and generations. PolyTest mitigates inconsistencies by unifying test sets, fostering self-consistency and improving overall test quality. Unlike single-language or single-attempt approaches, PolyTest enhances testing without requiring on-the-fly execution, making it particularly beneficial for weaker-performing languages. We evaluate PolyTest on Llama3-70B, GPT-4o, and GPT-3.5 using EvalPlus, generating tests in five languages (Java, C, Python, JavaScript, and a CSV-based format) at temperature 0 and sampling multiple sets at temperature 1. We observe that LLMs frequently generate contradicting tests across settings, and that PolyTest significantly improves test quality across all considered metrics – number of tests, passing rate, statement/branch coverage (up to +9.01%), and mutation score (up to +11.23%). Finally, PolyTest outperforms Pynguin in test generation, passing rate, and mutation score.
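
The unification step can be illustrated with a minimal sketch. It assumes that tests generated in different languages or samples have already been normalized to (input, expected output) pairs, and that contradictions are resolved by majority vote. The abstract does not specify the unification rule, so the voting scheme, the tie handling, and the name `unify_tests` are illustrative assumptions rather than PolyTest's actual implementation.

```python
from collections import Counter, defaultdict


def unify_tests(test_sets):
    """Merge several generated test sets into one self-consistent set.

    test_sets: iterable of lists of (test_input, expected_output) pairs.
    Returns a dict mapping each input to the expected output that wins a
    strict majority vote; inputs whose top candidates tie are dropped.
    """
    votes = defaultdict(Counter)
    for tests in test_sets:
        for test_input, expected in tests:
            votes[test_input][expected] += 1

    unified = {}
    for test_input, counter in votes.items():
        ranked = counter.most_common()
        # Assumption: an unresolved contradiction (tie) is discarded.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue
        unified[test_input] = ranked[0][0]
    return unified


# Hypothetical test sets for an absolute-value function.
generated = [
    [("abs(-2)", 2), ("abs(3)", 3)],
    [("abs(-2)", 2), ("abs(3)", -3)],
    [("abs(-2)", -2)],
]
consistent = unify_tests(generated)
```

In this toy input, the two conflicting expectations for `abs(3)` tie and are dropped, while `abs(-2)` keeps the value that two of the three sets agree on.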

[AI-19] PromptMobile: Efficient Promptus for Low Bandwidth Mobile Video Streaming

Link: https://arxiv.org/abs/2503.16112
Authors: Liming Liu,Jiangkai Wu,Haoyang Wang,Peiheng Wang,Xinggong Zhang,Zongming Guo
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 7 pages, 10 figures

Abstract:Traditional video compression algorithms exhibit significant quality degradation at extremely low bitrates. Promptus emerges as a new paradigm for video streaming, substantially cutting down the bandwidth essential for video streaming. However, Promptus is computationally intensive and cannot run in real time on mobile devices. This paper presents PromptMobile, an efficient acceleration framework tailored for on-device Promptus. Specifically, we propose (1) a two-stage efficient generation framework to reduce computational cost by 8.1x, (2) fine-grained inter-frame caching to reduce redundant computations by 16.6%, and (3) system-level optimizations to further enhance efficiency. The evaluations demonstrate that compared with the original Promptus, PromptMobile achieves a 13.6x increase in image generation speed. Compared with other streaming methods, PromptMobile achieves an average LPIPS improvement of 0.016 (compared with H.265), reducing 60% of severely distorted frames (compared to VQGAN).

[AI-20] AIMI: Leveraging Future Knowledge and Personalization in Sparse Event Forecasting for Treatment Adherence

Link: https://arxiv.org/abs/2503.16091
Authors: Abdullah Mamun,Diane J. Cook,Hassan Ghasemzadeh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Adherence to prescribed treatments is crucial for individuals with chronic conditions to avoid costly or adverse health outcomes. For certain patient groups, intensive lifestyle interventions are vital for enhancing medication adherence. Accurate forecasting of treatment adherence can open pathways to developing an on-demand intervention tool, enabling timely and personalized support. With the increasing popularity of smartphones and wearables, it is now easier than ever to develop and deploy smart activity monitoring systems. However, effective forecasting systems for treatment adherence based on wearable sensors are still not widely available. We close this gap by proposing Adherence Forecasting and Intervention with Machine Intelligence (AIMI). AIMI is a knowledge-guided adherence forecasting system that leverages smartphone sensors and previous medication history to estimate the likelihood of forgetting to take a prescribed medication. A user study was conducted with 27 participants who took daily medications to manage their cardiovascular diseases. We designed and developed CNN and LSTM-based forecasting models with various combinations of input features and found that LSTM models can forecast medication adherence with an accuracy of 0.932 and an F-1 score of 0.936. Moreover, through a series of ablation studies involving convolutional and recurrent neural network architectures, we demonstrate that leveraging known knowledge about future and personalized training enhances the accuracy of medication adherence forecasting. Code available: this https URL.

[AI-21] Temporal-Spatial Attention Network (TSAN) for DoS Attack Detection in Network Traffic

Link: https://arxiv.org/abs/2503.16047
Authors: Bisola Faith Kayode,Akinyemi Sadeeq Akintola,Oluwole Fagbohun,Egonna Anaesiuba-Bristol,Onyekachukwu Ojumah,Oluwagbade Odimayo,Toyese Oloyede,Aniema Inyang,Teslim Kazeem,Habeeb Alli,Udodirim Ibem Offia,Prisca Chinazor Amajuoyi
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 19 Pages, 5 figures

Abstract:Denial-of-Service (DoS) attacks remain a critical threat to network security, disrupting services and causing significant economic losses. Traditional detection methods, including statistical and rule-based models, struggle to adapt to evolving attack patterns. To address this challenge, we propose a novel Temporal-Spatial Attention Network (TSAN) architecture for detecting Denial of Service (DoS) attacks in network traffic. By leveraging both temporal and spatial features of network traffic, our approach captures complex traffic patterns and anomalies that traditional methods might miss. The TSAN model incorporates transformer-based temporal encoding, convolutional spatial encoding, and a cross-attention mechanism to fuse these complementary feature spaces. Additionally, we employ multi-task learning with auxiliary tasks to enhance the model’s robustness. Experimental results on the NSL-KDD dataset demonstrate that TSAN outperforms state-of-the-art models, achieving superior accuracy, precision, recall, and F1-score while maintaining computational efficiency for real-time deployment. The proposed architecture offers an optimal balance between detection accuracy and computational overhead, making it highly suitable for real-world network security applications.

[AI-22] GreenIQ: A Deep Search Platform for Comprehensive Carbon Market Analysis and Automated Report Generation

Link: https://arxiv.org/abs/2503.16041
Authors: Bisola Faith Kayode,Akinyemi Sadeeq Akintola,Oluwole Fagbohun,Egonna Anaesiuba-Bristol,Onyekachukwu Ojumah,Oluwagbade Odimayo,Toyese Oloyede,Aniema Inyang,Teslim Kazeem,Habeeb Alli,Udodirim Ibem Offia,Prisca Chinazor Amajuoyi
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 Pages, 1 figure

Abstract:This study introduces GreenIQ, an AI-powered deep search platform designed to revolutionise carbon market intelligence through autonomous analysis and automated report generation. Carbon markets operate across diverse regulatory landscapes, generating vast amounts of heterogeneous data from policy documents, industry reports, academic literature, and real-time trading platforms. Traditional research approaches remain labour-intensive, slow, and difficult to scale. GreenIQ addresses these limitations through a multi-agent architecture powered by Large Language Models (LLMs), integrating five specialised AI agents: a Main Researcher Agent for intelligent information retrieval, a Report Writing Agent for structured synthesis, a Final Reviewer Agent for accuracy verification, a Data Visualisation Agent for enhanced interpretability, and a Translator Agent for multilingual adaptation. The system achieves seamless integration of structured and unstructured information with AI-driven citation verification, ensuring high transparency and reliability. GreenIQ delivers a 99.2% reduction in processing time and a 99.7% cost reduction compared to traditional research methodologies. A novel AI persona-based evaluation framework involving 16 domain-specific AI personas highlights its superior cross-jurisdictional analytical capabilities and regulatory insight generation. GreenIQ sets new standards in AI-driven research synthesis, policy analysis, and sustainability finance by streamlining carbon market research. It offers an efficient and scalable framework for environmental and financial intelligence, enabling more accurate, timely, and cost-effective decision-making in complex regulatory landscapes

[AI-23] Exploring the Reliability of Self-explanation and its Relationship with Classification in Language Model-driven Financial Analysis

Link: https://arxiv.org/abs/2503.15985
Authors: Han Yuan,Li Zhang,Zheng Ma
Subjects: Artificial Intelligence (cs.AI)

Abstract:Language models (LMs) have exhibited exceptional versatility in reasoning and in-depth financial analysis through their proprietary information processing capabilities. Previous research focused on evaluating classification performance while often overlooking explainability or presuming that a refined explanation corresponds to higher classification accuracy. Using a public dataset in the finance domain, we quantitatively evaluated self-explanations by LMs, focusing on their factuality and causality. We identified a statistically significant relationship between the accuracy of classifications and the factuality or causality of self-explanations. Our study built an empirical foundation for approximating classification confidence through self-explanations and for optimizing classification via proprietary reasoning.

[AI-24] GAN-enhanced Simulation-driven DNN Testing in Absence of Ground Truth

Link: https://arxiv.org/abs/2503.15953
Authors: Mohammed Attaoui,Fabrizio Pastore
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures, 13 tables

Abstract:The generation of synthetic inputs via simulators driven by search algorithms is essential for cost-effective testing of Deep Neural Network (DNN) components for safety-critical systems. However, in many applications, simulators are unable to produce the ground-truth data needed for automated test oracles and to guide the search process. To tackle this issue, we propose an approach for the generation of inputs for computer vision DNNs that integrates a generative network to ensure simulator fidelity and employs heuristic-based search fitnesses that leverage transformation consistency, noise resistance, surprise adequacy, and uncertainty estimation. We compare the performance of our fitnesses with that of a traditional fitness function leveraging ground truth; further, we assess how the integration of a GAN not leveraging the ground truth impacts on test and retraining effectiveness. Our results suggest that leveraging transformation consistency is the best option to generate inputs for both DNN testing and retraining; it maximizes input diversity, spots the inputs leading to worse DNN performance, and leads to best DNN performance after retraining. Besides enabling simulator-based testing in the absence of ground truth, our findings pave the way for testing solutions that replace costly simulators with diffusion and large language models, which might be more affordable than simulators, but cannot generate ground-truth data.

[AI-25] Unreal-MAP: Unreal-Engine-Based General Platform for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2503.15947
Authors: Tianyi Hu,Qingxu Fu,Zhiqiang Pu,Yuan Wang,Tenghai Qiu
Subjects: Artificial Intelligence (cs.AI)

Abstract:In this paper, we propose Unreal Multi-Agent Playground (Unreal-MAP), an MARL general platform based on the Unreal-Engine (UE). Unreal-MAP allows users to freely create multi-agent tasks using the vast visual and physical resources available in the UE community, and deploy state-of-the-art (SOTA) MARL algorithms within them. Unreal-MAP is user-friendly in terms of deployment, modification, and visualization, and all its components are open-source. We also develop an experimental framework compatible with algorithms ranging from rule-based to learning-based provided by third-party frameworks. Lastly, we deploy several SOTA algorithms in example tasks developed via Unreal-MAP, and conduct corresponding experimental analyses. We believe Unreal-MAP can play an important role in the MARL field by closely integrating existing algorithms with user-customized tasks, thus advancing the field of MARL.

[AI-26] Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

Link: https://arxiv.org/abs/2503.15937
Authors: Gaole Dai,Shiqi Jiang,Ting Cao,Yuanchun Li,Yuqing Yang,Rui Tan,Mo Li,Lili Qiu
Subjects: Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 iterations

Abstract:We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: the discretized action space construction coupled with the prefilling-only workflow to accelerate the verification process, the pair-wise progress preference training to significantly enhance the verifier’s decision-making capabilities, and the scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves an impressively low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making capabilities.
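
The verifier-driven selection loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the real system scores candidates with an LLM verifier, whereas `toy_verifier` here is a hypothetical keyword-overlap stand-in, and the names `select_action` and `toy_verifier` do not come from the paper.

```python
def select_action(task, screen, candidates, verifier):
    """Score every candidate action with the verifier and return the best one.

    candidates is the discretized action space for the current screen;
    verifier(task, screen, action) returns a number, higher meaning better.
    """
    if not candidates:
        raise ValueError("no candidate actions to verify")
    return max(candidates, key=lambda action: verifier(task, screen, action))


def toy_verifier(task, screen, action):
    # Hypothetical stand-in for an LLM verifier: counts how many words of
    # the action description also appear in the task description.
    task_words = set(task.lower().split())
    return sum(word in task_words for word in action.lower().split())


chosen = select_action(
    task="open the settings app",
    screen="home screen",
    candidates=["tap camera icon", "tap settings icon", "scroll down"],
    verifier=toy_verifier,
)
```

Because the agent only ranks a finite list of candidates instead of generating free-form text, each candidate can be scored independently, which is the property the prefilling-only workflow is said to exploit.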

[AI-27] Denoising-based Contractive Imitation Learning

Link: https://arxiv.org/abs/2503.15918
Authors: Macheng Shen,Jishen Peng,Zefang Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A fundamental challenge in imitation learning is the covariate shift problem. Existing methods to mitigate covariate shift often require additional expert interactions, access to environment dynamics, or complex adversarial training, which may not be practical in real-world applications. In this paper, we propose a simple yet effective method (DeCIL) to mitigate covariate shift by incorporating a denoising mechanism that enhances the contraction properties of the state transition mapping. Our approach involves training two neural networks: a dynamics model (f) that predicts the next state from the current state, and a joint state-action denoising policy network (d) that refines this state prediction via denoising and outputs the corresponding action. We provide theoretical analysis showing that the denoising network acts as a local contraction mapping, reducing the error propagation of the state transition and improving stability. Our method is straightforward to implement and can be easily integrated with existing imitation learning frameworks without requiring additional expert data or complex modifications to the training procedure. Empirical results demonstrate that our approach effectively improves success rate of various imitation learning tasks under noise perturbation.
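
The contraction argument can be illustrated with a one-dimensional toy example. Everything below is an assumption made for illustration: the scalar maps `dynamics` and `denoise` are hypothetical stand-ins for the learned networks f and d, the action output is omitted, and the slopes 1.2 and 0.5 are chosen only so that the composite map has slope 0.6 (a contraction).

```python
def rollout_gap(step, start, perturbed_start, horizon):
    """Roll two trajectories forward and return their final distance."""
    a, b = start, perturbed_start
    for _ in range(horizon):
        a, b = step(a), step(b)
    return abs(a - b)


def dynamics(state):
    # Hypothetical dynamics model f: slope 1.2 amplifies state errors.
    return 1.2 * state + 0.1


def denoise(state):
    # Hypothetical denoiser d: slope 0.5 pulls predictions together.
    return 0.5 * state


def denoised_step(state):
    # Composite transition d(f(s)) with slope 0.5 * 1.2 = 0.6 < 1.
    return denoise(dynamics(state))


# Two rollouts that start 0.1 apart, with and without the denoiser.
plain_gap = rollout_gap(dynamics, 0.0, 0.1, horizon=10)
denoised_gap = rollout_gap(denoised_step, 0.0, 0.1, horizon=10)
```

Without the denoiser the initial gap grows by a factor of 1.2 per step, while with it the gap shrinks by a factor of 0.6 per step, which is the error-propagation behavior the abstract attributes to a local contraction mapping.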

[AI-28] Time After Time: Deep-Q Effect Estimation for Interventions on When and What to do

Link: https://arxiv.org/abs/2503.15890
Authors: Yoav Wald,Mark Goldstein,Yonathan Efroni,Wouter A.C. van Amsterdam,Rajesh Ranganath
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Problems in fields such as healthcare, robotics, and finance require reasoning about the value both of what decision or action to take and when to take it. The prevailing hope is that artificial intelligence will support such decisions by estimating the causal effect of policies such as how to treat patients or how to allocate resources over time. However, existing methods for estimating the effect of a policy struggle with irregular time. They either discretize time or disregard the effect of timing policies. We present a new deep-Q algorithm, called Earliest Disagreement Q-Evaluation (EDQ), that estimates the effect of both when and what to do. EDQ makes use of recursion for the Q-function that is compatible with flexible sequence models, such as transformers. EDQ provides accurate estimates under standard assumptions. We validate the approach through experiments on survival time and tumor growth tasks.

[AI-29] LeanTTA: A Backpropagation-Free and Stateless Approach to Quantized Test-Time Adaptation on Edge Devices

Link: https://arxiv.org/abs/2503.15889
Authors: Cynthia Dong,Hong Jia,Young D. Kwon,Georgios Rizos,Cecilia Mascolo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:While there are many advantages to deploying machine learning models on edge devices, the resource constraints of mobile platforms, the dynamic nature of the environment, and differences between the distribution of training versus in-the-wild data make such deployments challenging. Current test-time adaptation methods are often memory-intensive and not designed to be quantization-compatible or deployed on low-resource devices. To address these challenges, we present LeanTTA, a novel backpropagation-free and stateless framework for quantized test-time adaptation tailored to edge devices. Our approach minimizes computational costs by dynamically updating normalization statistics without backpropagation, which frees LeanTTA from the common pitfall of relying on large batches and historical data, making our method robust to realistic deployment scenarios. Our approach is the first to enable further computational gains by combining partial adaptation with quantized module fusion. We validate our framework across sensor modalities, demonstrating significant improvements over state-of-the-art TTA methods, including a 15.7% error reduction, peak memory usage of only 11.2MB for ResNet18, and fast adaptation within an order of magnitude of normal inference speeds on-device. LeanTTA provides a robust solution for achieving the right trade-offs between accuracy and system efficiency in edge deployments, addressing the unique challenges posed by limited data and varied operational conditions.
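
A backpropagation-free, stateless update of normalization statistics might look like the sketch below. It assumes a simple linear blend between the stored training statistics and the statistics of the current input, with a hypothetical mixing weight `alpha`. The abstract does not give LeanTTA's actual update rule, so this shows the general idea rather than the paper's method.

```python
def normalize_with_adapted_stats(values, train_mean, train_var, alpha=0.5, eps=1e-5):
    """Normalize one input using statistics adapted at test time.

    No gradients are computed and nothing is stored between calls: the
    adapted mean and variance are rebuilt from the training statistics
    and the current input every time (assumed blending rule).
    """
    n = len(values)
    current_mean = sum(values) / n
    current_var = sum((v - current_mean) ** 2 for v in values) / n

    # Blend the training-time and test-time statistics; alpha is a
    # hypothetical hyperparameter (0 = no adaptation, 1 = input only).
    mean = (1 - alpha) * train_mean + alpha * current_mean
    var = (1 - alpha) * train_var + alpha * current_var

    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Activations drifted away from the training statistics (mean 0, var 1).
shifted = [4.0, 5.0, 6.0]
adapted = normalize_with_adapted_stats(shifted, train_mean=0.0, train_var=1.0, alpha=1.0)
```

With `alpha=1.0` the output is centered on the current input alone, and with `alpha=0.0` the function reduces to ordinary normalization with the frozen training statistics.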

[AI-30] DeepPsy-Agent: A Stage-Aware and Deep-Thinking Emotional Support Agent System

Link: https://arxiv.org/abs/2503.15876
Authors: Kai Chen,Zebing Sun
Subjects: Artificial Intelligence (cs.AI)

Abstract:This paper introduces DeepPsy-Agent, an innovative psychological support system that combines the three-stage helping theory in psychology with deep learning techniques. The system consists of two core components: (1) a multi-stage response-capable dialogue model (deeppsy-chat), which enhances reasoning capabilities through stage-awareness and deep-thinking analysis to generate high-quality responses; and (2) a real-time stage transition detection model that identifies contextual shifts to guide the dialogue towards more effective intervention stages. Based on 30,000 real psychological hotline conversations, we employ AI-simulated dialogues and expert re-annotation strategies to construct a high-quality multi-turn dialogue dataset. Experimental results demonstrate that DeepPsy-Agent outperforms general-purpose large language models (LLMs) in key metrics such as problem exposure completeness, cognitive restructuring success rate, and action adoption rate. Ablation studies further validate the effectiveness of stage-awareness and deep-thinking modules, showing that stage information contributes 42.3% to performance, while the deep-thinking module increases root-cause identification by 58.3% and reduces ineffective suggestions by 72.1%. This system addresses critical challenges in AI-based psychological support through dynamic dialogue management and deep reasoning, advancing intelligent mental health services.

[AI-31] Active management of battery degradation in wireless sensor network using deep reinforcement learning for group battery replacement

Link: https://arxiv.org/abs/2503.15865
Authors: Jong-Hyun Jeonga,Hongki Jo,Qiang Zhou,Tahsin Afroz Hoque Nishat,Lang Wu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Wireless sensor networks (WSNs) have become a promising solution for structural health monitoring (SHM), especially in hard-to-reach or remote locations. Battery-powered WSNs offer various advantages over wired systems; however, limited battery life has always been one of the biggest obstacles to the practical use of WSNs, regardless of energy harvesting methods. While various methods have been studied for battery health management, existing methods exclusively aim to extend the lifetime of individual batteries, lacking a system-level view. A consequence of applying such methods is that batteries in a WSN tend to fail at different times, making it significantly harder to plan and schedule battery replacement trips. This study investigates a deep reinforcement learning (DRL) method for active battery degradation management by optimizing the duty cycle of WSNs at the system level. This active management strategy effectively reduces early failures of individual batteries, which enables group replacement without sacrificing WSN performance. A simulated environment based on a real-world WSN setup was developed to train a DRL agent and learn optimal duty cycle strategies. The performance of the strategy was validated in a long-term setup with various network sizes, demonstrating its efficiency and scalability.

[AI-32] Beyond Local Selection: Global Cut Selection for Enhanced Mixed-Integer Programming

Link: https://arxiv.org/abs/2503.15847
Authors: Shuli Zeng,Sijia Zhang,Shaoang Li,Feng Wu,Xiang-Yang Li
Subjects: Artificial Intelligence (cs.AI)

Abstract:In mixed-integer programming (MIP) solvers, cutting planes are essential for Branch-and-Cut (BC) algorithms as they reduce the search space and accelerate the solving process. Traditional methods rely on hard-coded heuristics for cut plane selection but fail to leverage problem-specific structural features. Recent machine learning approaches use neural networks for cut selection but focus narrowly on the efficiency of single-node within the BC algorithm, without considering the broader contextual information. To address this, we propose Global Cut Selection (GCS), which uses a bipartite graph to represent the search tree and combines graph neural networks with reinforcement learning to develop cut selection strategies. Unlike prior methods, GCS applies cutting planes across all nodes, incorporating richer contextual information. Experiments show GCS significantly improves solving efficiency for synthetic and large-scale real-world MIPs compared to traditional and learning-based methods.

[AI-33] Ranking Counterfactual Explanations

Link: https://arxiv.org/abs/2503.15817
Authors: Suryani Lim,Henri Prade,Gilles Richard
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:AI-driven outcomes can be challenging for end-users to understand. Explanations can address two key questions: “Why this outcome?” (factual) and “Why not another?” (counterfactual). While substantial efforts have been made to formalize factual explanations, a precise and comprehensive study of counterfactual explanations is still lacking. This paper proposes a formal definition of counterfactual explanations, proving some properties they satisfy, and examining the relationship with factual explanations. Given that multiple counterfactual explanations generally exist for a specific case, we also introduce a rigorous method to rank these counterfactual explanations, going beyond a simple minimality condition, and to identify the optimal ones. Our experiments with 12 real-world datasets highlight that, in most cases, a single optimal counterfactual explanation emerges. We also demonstrate, via three metrics, that the selected optimal explanation exhibits higher representativeness and can explain a broader range of elements than a random minimal counterfactual. This result highlights the effectiveness of our approach in identifying more robust and comprehensive counterfactual explanations.

[AI-34] Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Simulated Annealing

Link: https://arxiv.org/abs/2503.15815
Authors: Vishnu Asutosh Dasu,Md Rafi ur Rashid,Vipul Gupta,Saeid Tizpaz-Niari,Gang Tan
Categories: Artificial Intelligence (cs.AI)

Abstract:This paper explores pruning attention heads as a post-processing bias mitigation method for large language models (LLMs). Modern AI systems such as LLMs are expanding into sensitive social contexts where fairness concerns become especially crucial. Since LLMs develop decision-making patterns by training on massive datasets of human-generated content, they naturally encode and perpetuate societal biases. While modifying training datasets and algorithms is expensive and requires significant resources, post-processing techniques, such as selectively deactivating neurons and attention heads in pre-trained LLMs, can provide feasible and effective approaches to improve fairness. However, identifying the optimal subset of parameters to prune presents a combinatorial challenge within LLMs’ immense parameter space, requiring solutions that efficiently balance competing objectives across the frontiers of model fairness and utility. To address the computational challenges, we explore a search-based program repair approach via randomized simulated annealing. Given the prohibitive evaluation costs in billion-parameter LLMs, we develop surrogate deep neural networks that efficiently model the relationship between attention head states (active/inactive) and their corresponding fairness/utility metrics. This allows us to perform optimization over the surrogate models and efficiently identify optimal subsets of attention heads for selective pruning rather than directly searching through the LLM parameter space. This paper introduces Attention Pruning, a fairness-aware surrogate simulated annealing approach to prune attention heads in LLMs that disproportionately contribute to bias while minimally impacting overall model utility. Our experiments show that Attention Pruning achieves up to 40% reduction in gender bias and outperforms the state-of-the-art bias mitigation strategies.
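
The search described in this abstract, simulated annealing over active/inactive attention-head states scored by a surrogate, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract's surrogate is a deep neural network, whereas `surrogate_cost` below is a toy linear stand-in, and the per-head bias and utility numbers are invented.

```python
import math
import random


def surrogate_cost(mask, head_bias, head_utility, lam=1.0):
    """Toy surrogate (hypothetical): bias contributed by active heads
    plus a penalty for the utility lost by pruning heads."""
    bias = sum(b for m, b in zip(mask, head_bias) if m)
    lost_utility = sum(u for m, u in zip(mask, head_utility) if not m)
    return bias + lam * lost_utility


def anneal(n_heads, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over a binary mask (1 = head active)."""
    rng = random.Random(seed)
    mask = [1] * n_heads              # start with every head active
    cur = cost(mask)
    best, best_cost = mask[:], cur
    t = t0
    for _ in range(steps):
        i = rng.randrange(n_heads)
        mask[i] ^= 1                  # propose flipping one head
        new = cost(mask)
        # Always accept improvements; accept worse states with
        # probability exp(-(new - cur) / t).
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = mask[:], cur
        else:
            mask[i] ^= 1              # reject: undo the flip
        t *= cooling                  # geometric cooling schedule
    return best, best_cost


# Invented per-head scores, for illustration only.
head_bias = [0.9, 0.1, 0.8, 0.05]
head_utility = [0.1, 0.7, 0.2, 0.9]
best_mask, best_cost = anneal(
    4, lambda m: surrogate_cost(m, head_bias, head_utility)
)
print(best_mask, round(best_cost, 3))
```

Because the search only ever queries the cost function, a trained surrogate model could be swapped in for the toy one without changing the loop.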

[AI-35] Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture

Link: https://arxiv.org/abs/2503.15807
Authors: Cheng Li,Jiexiong Liu,Yixuan Chen,Yanqin Jia
Categories: Artificial Intelligence (cs.AI)
Notes: 18 pages

Abstract:In the field of video-language pretraining, existing models face numerous challenges in terms of inference efficiency and multimodal data processing. This paper proposes a KunLunBaize-VoT-R1 video inference model based on a long-sequence image encoder, along with its training and application methods. By integrating image packing technology, the Autonomy-of-Experts (AoE) architecture, and combining the video of Thought (VoT), a large language model (LLM) trained with large-scale reinforcement learning, and multiple training techniques, the efficiency and accuracy of the model in video inference tasks are effectively improved. Experiments show that this model performs outstandingly in multiple tests, providing a new solution for video-language understanding.

[AI-36] Blend the Separated: Mixture of Synergistic Experts for Data-Scarcity Drug-Target Interaction Prediction

Link: https://arxiv.org/abs/2503.15796
Authors: Xinlong Zhai,Chunchen Wang,Ruijia Wang,Jiazheng Kang,Shujie Li,Boyu Chen,Tengfei Ma,Zikai Zhou,Cheng Yang,Chuan Shi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Drug-target interaction (DTI) prediction is essential in various applications including drug discovery and clinical application. There are two perspectives of input data widely used in DTI prediction: intrinsic data represents how drugs or targets are constructed, and extrinsic data represents how drugs or targets are related to other biological entities. However, either of the two perspectives of input data can be scarce for some drugs or targets, especially for those that are unpopular or newly discovered. Furthermore, ground-truth labels for specific interaction types can also be scarce. Therefore, we propose the first method to tackle DTI prediction under input data and/or label scarcity. To make our model functional when only one perspective of input data is available, we design two separate experts to process intrinsic and extrinsic data respectively and fuse them adaptively according to different samples. Furthermore, to make the two perspectives complement each other and remedy label scarcity, the two experts synergize with each other in a mutually supervised way to exploit the enormous unlabeled data. Extensive experiments on 3 real-world datasets under different extents of input data scarcity and/or label scarcity demonstrate that our model outperforms the state of the art significantly and steadily, with a maximum improvement of 53.53%. We also test our model without any data scarcity and it still outperforms current methods.

[AI-37] MobiFuse: Learning Universal Human Mobility Patterns through Cross-domain Data Fusion

Link: https://arxiv.org/abs/2503.15779
Authors: Haoxuan Ma,Xishun Liao,Yifan Liu,Qinhua Jiang,Chris Stanford,Shangqing Cao,Jiaqi Ma
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Human mobility modeling is critical for urban planning and transportation management, yet existing datasets often lack the resolution and semantic richness required for comprehensive analysis. To address this, we propose a cross-domain data fusion framework that integrates multi-modal data of distinct nature and spatio-temporal resolution, including geographical, mobility, socio-demographic, and traffic information, to construct a privacy-preserving and semantically enriched human travel trajectory dataset. This framework is demonstrated through two case studies in Los Angeles (LA) and Egypt, where a domain adaptation algorithm ensures its transferability across diverse urban contexts. Quantitative evaluation shows that the generated synthetic dataset accurately reproduces mobility patterns observed in empirical data. Moreover, large-scale traffic simulations for LA County based on the generated synthetic demand align well with observed traffic. On California’s I-405 corridor, the simulation yields a Mean Absolute Percentage Error of 5.85% for traffic volume and 4.36% for speed compared to Caltrans PeMS observations.

[AI-38] Detecting LLM-Written Peer Reviews

Link: https://arxiv.org/abs/2503.15772
Authors: Vishisht Rao,Aounon Kumar,Himabindu Lakkaraju,Nihar B. Shah
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: 26 pages, 1 figure

Abstract:Editors of academic journals and program chairs of conferences require peer reviewers to write their own reviews. However, there is growing concern about the rise of lazy reviewing practices, where reviewers use large language models (LLMs) to generate reviews instead of writing them independently. Existing tools for detecting LLM-generated content are not designed to differentiate between fully LLM-generated reviews and those merely polished by an LLM. In this work, we employ a straightforward approach to identify LLM-generated reviews - doing an indirect prompt injection via the paper PDF to ask the LLM to embed a watermark. Our focus is on presenting watermarking schemes and statistical tests that maintain a bounded family-wise error rate, when a venue evaluates multiple reviews, with a higher power as compared to standard methods like Bonferroni correction. These guarantees hold without relying on any assumptions about human-written reviews. We also consider various methods for prompt injection including font embedding and jailbreaking. We evaluate the effectiveness and various tradeoffs of these methods, including different reviewer defenses. We find a high success rate in the embedding of our watermarks in LLM-generated reviews across models. We also find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice while having the power to flag LLM-generated reviews, while Bonferroni correction is infeasible.
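
For readers unfamiliar with the baseline this abstract compares against, the standard Bonferroni correction can be sketched in a few lines. This shows only the textbook baseline, not the paper's watermark-based tests; the p-values are invented for illustration.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag hypotheses under the Bonferroni correction.

    Testing each of m hypotheses at level alpha / m bounds the
    family-wise error rate (probability of any false flag) by alpha.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Four reviews, each with an invented p-value for "no watermark present".
p_values = [0.001, 0.02, 0.04, 0.0001]
flags = bonferroni_flags(p_values)
print(flags)
```

The per-test threshold shrinks as the number of reviews grows, which is the loss of power that motivates tests with a better power trade-off.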

[AI-39] Towards Agentic AI Networking in 6G: A Generative Foundation Model-as-Agent Approach

Link: https://arxiv.org/abs/2503.15764
Authors: Yong Xiao,Guangming Shi,Ping Zhang
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Notes: Currently under revision at IEEE Communications Magazine

Abstract:The promising potential of AI and network convergence in improving networking performance and enabling new service capabilities has recently attracted significant interest. Existing network AI solutions, while powerful, are mainly built on closed-loop and passive learning frameworks, resulting in major limitations in autonomous solution finding and dynamic environmental adaptation. Agentic AI has recently been introduced as a promising solution to address the above limitations and pave the way for truly generally intelligent and beneficial AI systems. The key idea is to create a networking ecosystem to support a diverse range of autonomous and embodied AI agents in fulfilling their goals. In this paper, we focus on the novel challenges and requirements of agentic AI networking. We propose AgentNet, a novel framework for supporting interaction, collaborative learning, and knowledge transfer among AI agents. We introduce a general architectural framework of AgentNet and then propose a generative foundation model (GFM)-based implementation in which multiple GFM-as-agents have been created as an interactive knowledge-base to bootstrap the development of embodied AI agents according to different task requirements and environmental features. We consider two application scenarios, digital-twin-based industrial automation and metaverse-based infotainment systems, to describe how to apply AgentNet for supporting efficient task-driven collaboration and interaction among AI agents.

[AI-40] Dialogic Learning in Child-Robot Interaction: A Hybrid Approach to Personalized Educational Content Generation

Link: https://arxiv.org/abs/2503.15762
Authors: Elena Malnatsky,Shenghui Wang,Koen V. Hindriks,Mike E.U. Ligthart
Categories: Artificial Intelligence (cs.AI)

Abstract:Dialogic learning fosters motivation and deeper understanding in education through purposeful and structured dialogues. Foundational models offer a transformative potential for child-robot interactions, enabling the design of personalized, engaging, and scalable interactions. However, their integration into educational contexts presents challenges in terms of ensuring age-appropriate and safe content and alignment with pedagogical goals. We introduce a hybrid approach to designing personalized educational dialogues in child-robot interactions. By combining rule-based systems with LLMs for selective offline content generation and human validation, the framework ensures educational quality and developmental appropriateness. We illustrate this approach through a project aimed at enhancing reading motivation, in which a robot facilitated book-related dialogues.

[AI-41] ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Link: https://arxiv.org/abs/2503.15758
Authors: Venmugil Elango
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows the model to capture intricate dependencies within the data. However, the self-attention mechanism also incurs significant computational and memory costs, particularly for long sequences. In this paper, we introduce ATTENTION2D, a novel approach that exploits parallelism along two dimensions - query and key/value - of the self-attention operation. This method enables efficient distribution and parallelization of computations across multiple devices. Our approach facilitates asymptotically faster training and inference phases compared to previous methods, without relying on approximations or incurring additional computational or memory overheads. Furthermore, unlike existing techniques that struggle to scale with an increasing number of processing units, our approach effectively scales with additional processing units. Our experimental results confirm the effectiveness of our method in improving communication efficiency and scalability. Compared to Ring Attention, our approach demonstrated up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.
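
To make the idea of splitting attention along both the query and the key/value dimension concrete, here is a serial, single-process sketch. It is not the paper's distributed algorithm or communication scheme: it only shows that exact softmax attention can be computed tile by tile and merged without approximation. The function names, block sizes, and the omission of the usual 1/sqrt(d) scaling are all assumptions made for brevity.

```python
import math


def tile_partial(q, keys, values):
    """Partial attention of one query against one key/value block.
    Returns (local max score, sum of exp weights, weighted value sum)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    numer = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), numer


def merge(partials):
    """Combine per-block partials by rescaling each to the global max."""
    m = max(p[0] for p in partials)
    denom = sum(p[1] * math.exp(p[0] - m) for p in partials)
    dim = len(partials[0][2])
    numer = [sum(p[2][d] * math.exp(p[0] - m) for p in partials)
             for d in range(dim)]
    return [x / denom for x in numer]


def attention_2d(queries, keys, values, q_block, kv_block):
    """Exact attention computed over a grid of (query block, kv block) tiles.
    In a distributed setting each tile could live on a different device."""
    out = []
    for qs in range(0, len(queries), q_block):
        for q in queries[qs:qs + q_block]:
            partials = [
                tile_partial(q, keys[ks:ks + kv_block], values[ks:ks + kv_block])
                for ks in range(0, len(keys), kv_block)
            ]
            out.append(merge(partials))
    return out


queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = attention_2d(queries, keys, values, q_block=3, kv_block=4)   # one tile
tiled = attention_2d(queries, keys, values, q_block=2, kv_block=2)  # 2x2 tiles
```

The tiled and single-tile results agree up to floating-point rounding, which is the property that lets the work be spread over a two-dimensional grid of devices.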

[AI-42] AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Link: https://arxiv.org/abs/2503.15754
Authors: Andy Zhou,Kevin Wu,Francesco Pinto,Zhaorun Chen,Yi Zeng,Yu Yang,Shuang Yang,Sanmi Koyejo,James Zou,Bo Li
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:As large language models (LLMs) become increasingly capable, security and safety evaluation are crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and lack comprehensive coverage of emerging attack vectors. This paper introduces AutoRedTeamer, a novel framework for fully automated, end-to-end red teaming against LLMs. AutoRedTeamer combines a multi-agent architecture with a memory-guided attack selection mechanism to enable continuous discovery and integration of new attack vectors. The dual-agent framework consists of a red teaming agent that can operate from high-level risk categories alone to generate and execute test cases and a strategy proposer agent that autonomously discovers and implements new attacks by analyzing recent research. This modular design allows AutoRedTeamer to adapt to emerging threats while maintaining strong performance on existing attack vectors. We demonstrate AutoRedTeamer’s effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46% compared to existing approaches. AutoRedTeamer also matches the diversity of human-curated benchmarks in generating test cases, providing a comprehensive, scalable, and continuously evolving framework for evaluating the security of AI systems.

[AI-43] Using Language Models to Decipher the Motivation Behind Human Behaviors

Link: https://arxiv.org/abs/2503.15752
Authors: Yutong Xie,Qiaozhu Mei,Walter Yuan,Matthew O. Jackson
Categories: Artificial Intelligence (cs.AI)

Abstract:AI presents a novel tool for deciphering the motivations behind human behaviors. We show that by varying prompts to a large language model, we can elicit a full range of human behaviors in a variety of different scenarios in terms of classic economic games. Then by analyzing which prompts are needed to elicit which behaviors, we can infer (decipher) the motivations behind the human behaviors. We also show how one can analyze the prompts to reveal relationships between the classic economic games, providing new insight into what different economic scenarios induce people to think about. We also show how this deciphering process can be used to understand differences in the behavioral tendencies of different populations.

[AI-44] ECLAIR: Enhanced Clarification for Interactive Responses

Link: https://arxiv.org/abs/2503.15739
Authors: John Murzaku,Zifan Liu,Md Mehrab Tanjim,Vaishnavi Muppala,Xiang Chen,Yunyao Li
Categories: Artificial Intelligence (cs.AI)
Notes: 7 pages, 4 figures

Abstract:We present ECLAIR (Enhanced CLArification for Interactive Responses), a novel unified and end-to-end framework for interactive disambiguation in enterprise AI assistants. ECLAIR generates clarification questions for ambiguous user queries and resolves ambiguity based on the user’s response. We introduce a generalized architecture capable of integrating ambiguity information from multiple downstream agents, enhancing context-awareness in resolving ambiguities and allowing enterprise specific definition of agents. We further define agents within our system that provide domain-specific grounding information. We conduct experiments comparing ECLAIR to few-shot prompting techniques and demonstrate ECLAIR’s superior performance in clarification question generation and ambiguity resolution.

[AI-45] Reinforcement Learning Environment with LLM-Controlled Adversary in D&D 5th Edition Combat ICONIP2024

Link: https://arxiv.org/abs/2503.15726
Authors: Joseph Emmanuel DL Dayo,Michel Onasis S. Ogbinar,Prospero C. Naval Jr
Categories: Artificial Intelligence (cs.AI)
Notes: Preprint. Submitted to the 31st International Conference on Neural Information Processing (ICONIP 2024)

Abstract:The objective of this study is to design and implement a reinforcement learning (RL) environment using D&D 5E combat scenarios to challenge smaller RL agents through interaction with a robust adversarial agent controlled by advanced Large Language Models (LLMs) like GPT-4o and LLaMA 3 8B. This research employs Deep Q-Networks (DQN) for the smaller agents, creating a testbed for strategic AI development that also serves as an educational tool by simulating dynamic and unpredictable combat scenarios. We successfully integrated sophisticated language models into the RL framework, enhancing strategic decision-making processes. Our results indicate that while RL agents generally outperform LLM-controlled adversaries in standard metrics, the strategic depth provided by LLMs significantly enhances the overall AI capabilities in this complex, rule-based setting. The novelty of our approach and its implications for mastering intricate environments and developing adaptive strategies are discussed, alongside potential innovations in AI-driven interactive simulations. This paper aims to demonstrate how integrating LLMs can create more robust and adaptable AI systems, providing valuable insights for further research and educational applications.

[AI-46] Reward Training Wheels: Adaptive Auxiliary Rewards for Robotics Reinforcement Learning

Link: https://arxiv.org/abs/2503.15724
Authors: Linji Wang,Tong Xu,Yuanjie Lu,Xuesu Xiao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: 7 pages, 5 figures

Abstract:Robotics Reinforcement Learning (RL) often relies on carefully engineered auxiliary rewards to supplement sparse primary learning objectives to compensate for the lack of large-scale, real-world, trial-and-error data. While these auxiliary rewards accelerate learning, they require significant engineering effort, may introduce human biases, and cannot adapt to the robot’s evolving capabilities during training. In this paper, we introduce Reward Training Wheels (RTW), a teacher-student framework that automates auxiliary reward adaptation for robotics RL. To be specific, the RTW teacher dynamically adjusts auxiliary reward weights based on the student’s evolving capabilities to determine which auxiliary reward aspects require more or less emphasis to improve the primary objective. We demonstrate RTW on two challenging robot tasks: navigation in highly constrained spaces and off-road vehicle mobility on vertically challenging terrain. In simulation, RTW outperforms expert-designed rewards by 2.35% in navigation success rate and improves off-road mobility performance by 122.62%, while achieving 35% and 3X faster training efficiency, respectively. Physical robot experiments further validate RTW’s effectiveness, achieving a perfect success rate (5/5 trials vs. 2/5 for expert-designed rewards) and improving vehicle stability with up to 47.4% reduction in orientation angles.

[AI-47] Safety Aware Task Planning via Large Language Models in Robotics

Link: https://arxiv.org/abs/2503.15707
Authors: Azal Ahmad Khan,Michael Andrev,Muhammad Ali Murtaza,Sergio Aguilera,Rui Zhang,Jie Ding,Seth Hutchinson,Ali Anwar
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:The integration of large language models (LLMs) into robotic task planning has unlocked better reasoning capabilities for complex, long-horizon workflows. However, ensuring safety in LLM-driven plans remains a critical challenge, as these models often prioritize task completion over risk mitigation. This paper introduces SAFER (Safety-Aware Framework for Execution in Robotics), a multi-LLM framework designed to embed safety awareness into robotic task planning. SAFER employs a Safety Agent that operates alongside the primary task planner, providing safety feedback. Additionally, we introduce LLM-as-a-Judge, a novel metric leveraging LLMs as evaluators to quantify safety violations within generated task plans. Our framework integrates safety feedback at multiple stages of execution, enabling real-time risk assessment, proactive error correction, and transparent safety evaluation. We also integrate a control framework using Control Barrier Functions (CBFs) to ensure safety guarantees within SAFER’s task planning. We evaluated SAFER against state-of-the-art LLM planners on complex long-horizon tasks involving heterogeneous robotic agents, demonstrating its effectiveness in reducing safety violations while maintaining task efficiency. We also verify the task planner and safety planner through actual hardware experiments involving multiple robots and a human.

[AI-48] Predicting Multi-Agent Specialization via Task Parallelizability

Link: https://arxiv.org/abs/2503.15703
Authors: Elizabeth Mieczkowski,Ruaridh Mon-Williams,Neil Bramley,Christopher G. Lucas,Natalia Velez,Thomas L. Griffiths
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)

Abstract:Multi-agent systems often rely on specialized agents with distinct roles rather than general-purpose agents that perform the entire task independently. However, the conditions that govern the optimal degree of specialization remain poorly understood. In this work, we propose that specialist teams outperform generalist ones when environmental constraints limit task parallelizability – the potential to execute task components concurrently. Drawing inspiration from distributed systems, we introduce a heuristic to predict the relative efficiency of generalist versus specialist teams by estimating the speed-up achieved when two agents perform a task in parallel rather than focus on complementary subtasks. We validate this heuristic through three multi-agent reinforcement learning (MARL) experiments in Overcooked-AI, demonstrating that key factors limiting task parallelizability influence specialization. We also observe that as the state space expands, agents tend to converge on specialist strategies, even when generalist ones are theoretically more efficient, highlighting potential biases in MARL training algorithms. Our findings provide a principled framework for interpreting specialization given the task and environment, and introduce a novel benchmark for evaluating whether MARL finds optimal strategies.
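
The abstract does not state the heuristic's formula. As an illustration of the kind of distributed-systems reasoning it alludes to, the sketch below uses Amdahl's law, which bounds the speed-up from adding workers when only part of a task can run concurrently. Treating Amdahl's law as the relevant formula is an assumption here, not a claim about the authors' heuristic.

```python
def amdahl_speedup(parallel_fraction, n_agents=2):
    """Speed-up from n_agents when only `parallel_fraction` of the
    task can be executed concurrently (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_agents)


# Low parallelizability leaves little to gain from two generalists
# working side by side, which is where specialization may pay off.
for p in (0.0, 0.5, 0.9, 1.0):
    print(f"parallel fraction {p:.1f}: speed-up {amdahl_speedup(p):.3f}")
```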

[AI-49] R2: A LLM Based Novel-to-Screenplay Generation Framework with Causal Plot Graphs

Link: https://arxiv.org/abs/2503.15655
Authors: Zefeng Lin,Yi Xiao,Zhiqiang Mo,Qifan Zhang,Jie Wang,Jiayang Chen,Jiajing Zhang,Hui Zhang,Zhengyi Liu,Xianyong Fang,Xiaohua Xu
Categories: Artificial Intelligence (cs.AI)
Notes: 16 pages, 6 figures

Abstract:Automatically adapting novels into screenplays is important for the TV, film, or opera industries to promote products with low costs. The strong performance of large language models (LLMs) in long-text generation motivates us to propose an LLM-based framework, Reader-Rewriter (R^2), for this task. However, there are two fundamental challenges here. First, LLM hallucinations may cause inconsistent plot extraction and screenplay generation. Second, the causality-embedded plot lines should be effectively extracted for coherent rewriting. Therefore, two corresponding tactics are proposed: 1) a hallucination-aware refinement method (HAR) to iteratively discover and eliminate the effects of hallucinations; and 2) a causal plot-graph construction method (CPC) based on a greedy cycle-breaking algorithm to efficiently construct plot lines with event causalities. Using these techniques, R^2 utilizes two modules to mimic the human screenplay rewriting process: the Reader module adopts a sliding window and CPC to build the causal plot graphs, while the Rewriter module first generates the scene outlines based on the graphs and then the screenplays. HAR is integrated into both modules for accurate inferences of LLMs. Experimental results demonstrate the superiority of R^2, which substantially outperforms three existing approaches (51.3%, 22.6%, and 57.1% absolute increases) in pairwise comparison at the overall win rate for GPT-4o.
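
The abstract names a greedy cycle-breaking algorithm for building causal plot graphs but gives no details. One common way to break cycles greedily is sketched below: consider causal edges from strongest to weakest and keep an edge only if it does not close a cycle. The edge format, the strength scores, and the event names are hypothetical, and this is not necessarily the procedure used in CPC.

```python
from collections import defaultdict


def reaches(adj, src, dst):
    """Depth-first search: is dst reachable from src in the kept graph?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node])
    return False


def break_cycles(edges):
    """Keep (cause, effect, strength) edges greedily by strength,
    skipping any edge that would close a cycle. Returns a DAG edge list."""
    adj = defaultdict(list)
    kept = []
    for cause, effect, strength in sorted(edges, key=lambda e: -e[2]):
        if reaches(adj, effect, cause):
            continue  # effect already leads back to cause: skip this edge
        adj[cause].append(effect)
        kept.append((cause, effect, strength))
    return kept


# Hypothetical plot events with invented causal strengths.
edges = [
    ("letter arrives", "hero departs", 0.9),
    ("hero departs", "village falls", 0.8),
    ("village falls", "letter arrives", 0.3),  # weakest edge closes a cycle
]
dag = break_cycles(edges)
print(dag)
```

Dropping the weakest edge in each cycle keeps the strongest causal links, so events can then be ordered topologically for rewriting.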

[AI-50] Survey on Generalization Theory for Graph Neural Networks

Link: https://arxiv.org/abs/2503.15650
Authors: Antonis Vasileiou,Stefanie Jegelka,Ron Levie,Christopher Morris
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Message-passing graph neural networks (MPNNs) have emerged as the leading approach for machine learning on graphs, attracting significant attention in recent years. While a large set of works explored the expressivity of MPNNs, i.e., their ability to separate graphs and approximate functions over them, comparatively less attention has been directed toward investigating their generalization abilities, i.e., making meaningful predictions beyond the training data. Here, we systematically review the existing literature on the generalization abilities of MPNNs. We analyze the strengths and limitations of various studies in these domains, providing insights into their methodologies and findings. Furthermore, we identify potential avenues for future research, aiming to deepen our understanding of the generalization abilities of MPNNs.

[AI-51] Neural Lyapunov Function Approximation with Self-Supervised Reinforcement Learning ICRA

Link: https://arxiv.org/abs/2503.15629
Authors: Luc McCutcheon,Bahman Gharesifard,Saber Fallah
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Notes: Accepted at IEEE International Conference on Robotics and Automation (ICRA)

Abstract:Control Lyapunov functions are traditionally used to design a controller which ensures convergence to a desired state, yet deriving these functions for nonlinear systems remains a complex challenge. This paper presents a novel, sample-efficient method for neural approximation of nonlinear Lyapunov functions, leveraging self-supervised Reinforcement Learning (RL) to enhance training data generation, particularly for inaccurately represented regions of the state space. The proposed approach employs a data-driven World Model to train Lyapunov functions from off-policy trajectories. The method is validated on both standard and goal-conditioned robotic tasks, demonstrating faster convergence and higher approximation accuracy compared to the state-of-the-art neural Lyapunov approximation baseline. The code is available at: this https URL

[AI-52] PEnGUiN: Partially Equivariant Graph NeUral Networks for Sample Efficient MARL

Link: https://arxiv.org/abs/2503.15615
Authors: Joshua McClellan,Greyson Brothers,Furong Huang,Pratap Tokekar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Equivariant Graph Neural Networks (EGNNs) have emerged as a promising approach in Multi-Agent Reinforcement Learning (MARL), leveraging symmetry guarantees to greatly improve sample efficiency and generalization. However, real-world environments often exhibit inherent asymmetries arising from factors such as external forces, measurement inaccuracies, or intrinsic system biases. This paper introduces Partially Equivariant Graph NeUral Networks (PEnGUiN), a novel architecture specifically designed to address these challenges. We formally identify and categorize various types of partial equivariance relevant to MARL, including subgroup equivariance, feature-wise equivariance, regional equivariance, and approximate equivariance. We theoretically demonstrate that PEnGUiN is capable of learning both fully equivariant (EGNN) and non-equivariant (GNN) representations within a unified framework. Through extensive experiments on a range of MARL problems incorporating various asymmetries, we empirically validate the efficacy of PEnGUiN. Our results consistently demonstrate that PEnGUiN outperforms both EGNNs and standard GNNs in asymmetric environments, highlighting its potential to improve the robustness and applicability of graph-based MARL algorithms in real-world scenarios.

[AI-53] How Well Can AI Build SD Models?

Link: https://arxiv.org/abs/2503.15580
Authors: William Schoenberg,Davidson Girard,Saras Chung,Ellen O’Neill,Janet Velasquez,Sara Metcalf
Categories: Artificial Intelligence (cs.AI)

Abstract:Introduction: As system dynamics (SD) embraces automation, AI offers efficiency but risks bias from missing data and flawed models. Models that omit multiple perspectives and data threaten model quality, whether created by humans or with the assistance of AI. To reduce uncertainty about how well AI can build SD models, we introduce two metrics for evaluation of AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance). Approach: We developed an open source project called sd-ai to provide a basis for collaboration in the SD community, aiming to fully harness the potential of AI-based tools like ChatGPT for dynamic modeling. Additionally, we created an evaluation theory along with a comprehensive suite of tests designed to evaluate any such tools developed within the sd-ai ecosystem. Results: We tested 11 different LLMs on their ability to do causal translation as well as conform to user instruction. gpt-4.5-preview was the top performer, scoring 92.9% overall, excelling in both tasks. o1 scored 100% in causal translation. gpt-4o identified all causal links but struggled with positive polarity in decreasing terms. While gpt-4.5-preview and o1 are most accurate, gpt-4o is the cheapest. Discussion: Causal translation and conformance tests applied to the sd-ai engine reveal significant variations across LLMs, underscoring the need for continued evaluation to ensure responsible development of AI tools for dynamic modeling. To address this, an open collaboration among tool developers, modelers, and stakeholders is launched to standardize measures for evaluating the capacity of AI tools to improve the modeling process.

[AI-54] Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack

Link: https://arxiv.org/abs/2503.15551
Authors: Murong Yue,Ziyu Yao
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Batch prompting, which combines a batch of multiple queries sharing the same context in one inference, has emerged as a promising solution to reduce inference costs. However, our study reveals a significant security vulnerability in batch prompting: malicious users can inject attack instructions into a batch, leading to unwanted interference across all queries, which can result in the inclusion of harmful content, such as phishing links, or the disruption of logical reasoning. In this paper, we construct BATCHSAFEBENCH, a comprehensive benchmark comprising 150 attack instructions of two types and 8k batch instances, to study the batch prompting vulnerability systematically. Our evaluation of both closed-source and open-weight LLMs demonstrates that all LLMs are susceptible to batch-prompting attacks. We then explore multiple defending approaches. While the prompting-based defense shows limited effectiveness for smaller LLMs, the probing-based approach achieves about 95% accuracy in detecting attacks. Additionally, we perform a mechanistic analysis to understand the attack and identify attention heads that are responsible for it.

[AI-55] Zero-Knowledge Federated Learning: A New Trustworthy and Privacy-Preserving Distributed Learning Paradigm

Link: https://arxiv.org/abs/2503.15550
Authors: Yuxin Jin, Taotao Wang, Qing Yang, Long Shi, Shengli Zhang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, 1 table

Abstract:Federated Learning (FL) has emerged as a promising paradigm in distributed machine learning, enabling collaborative model training while preserving data privacy. However, despite its many advantages, FL still contends with significant challenges – most notably regarding security and trust. Zero-Knowledge Proofs (ZKPs) offer a potential solution by establishing trust and enhancing system integrity throughout the FL process. Although several studies have explored ZKP-based FL (ZK-FL), a systematic framework and comprehensive analysis are still lacking. This article makes two key contributions. First, we propose a structured ZK-FL framework that categorizes and analyzes the technical roles of ZKPs across various FL stages and tasks. Second, we introduce a novel algorithm, Verifiable Client Selection FL (Veri-CS-FL), which employs ZKPs to refine the client selection process. In Veri-CS-FL, participating clients generate verifiable proofs for the performance metrics of their local models and submit these concise proofs to the server for efficient verification. The server then selects clients with high-quality local models for uploading, subsequently aggregating the contributions from these selected clients. By integrating ZKPs, Veri-CS-FL not only ensures the accuracy of performance metrics but also fortifies trust among participants while enhancing the overall efficiency and security of FL systems.

[AI-56] Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement

Link: https://arxiv.org/abs/2503.15549
Authors: Andy Gray, Alma Rahat, Stephen Lindsay, Jen Pearson, Tom Crick
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Abstract:Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ’s quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings. 
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Cite as: arXiv:2503.15549 [cs.CY] (or arXiv:2503.15549v1 [cs.CY] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.15549
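The abstract above says BCJ assigns probabilities to judgement outcomes and tracks how prior data and successive judgements inform the ranking, but it does not give the model itself. The following is only a minimal sketch of that idea under an assumption: each pairwise preference is modelled with a Beta distribution and a uniform Beta(1, 1) prior. The class name `PairwiseBelief` and its fields are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairwiseBelief:
    """Beta belief that item A beats item B (assumed model, not the paper's code)."""

    alpha: float = 1.0  # prior pseudo-count of "A preferred" (uniform prior assumed)
    beta: float = 1.0   # prior pseudo-count of "B preferred"

    def update(self, a_won: bool) -> None:
        # Each judgement adds one count to the winning side.
        if a_won:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        # Posterior probability that A is preferred over B.
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Posterior variance; it shrinks as judgements accumulate.
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total ** 2 * (total + 1))


belief = PairwiseBelief()
for outcome in [True, True, False, True]:  # three judgements for A, one for B
    belief.update(outcome)
print(round(belief.mean, 3), round(belief.variance, 4))
```

In a sketch like this, the variance is one way to quantify the uncertainty the abstract mentions: a pair with few or conflicting judgements keeps a wide posterior, which could flag assessor disagreement.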

[AI-57] Privacy-Aware RAG: Secure and Isolated Knowledge Retrieval

Link: https://arxiv.org/abs/2503.15548
Authors: Pengcheng Zhou, Yinglun Feng, Zhongliang Yang
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:The widespread adoption of Retrieval-Augmented Generation (RAG) systems in real-world applications has heightened concerns about the confidentiality and integrity of their proprietary knowledge bases. These knowledge bases, which play a critical role in enhancing the generative capabilities of Large Language Models (LLMs), are increasingly vulnerable to breaches that could compromise sensitive information. To address these challenges, this paper proposes an advanced encryption methodology designed to protect RAG systems from unauthorized access and data leakage. Our approach encrypts both textual content and its corresponding embeddings prior to storage, ensuring that all data remains securely encrypted. This mechanism restricts access to authorized entities with the appropriate decryption keys, thereby significantly reducing the risk of unintended data exposure. Furthermore, we demonstrate that our encryption strategy preserves the performance and functionality of RAG pipelines, ensuring compatibility across diverse domains and applications. To validate the robustness of our method, we provide comprehensive security proofs that highlight its resilience against potential threats and vulnerabilities. These proofs also reveal limitations in existing approaches, which often lack robustness, adaptability, or reliance on open-source models. Our findings suggest that integrating advanced encryption techniques into the design and deployment of RAG systems can effectively enhance privacy safeguards. This research contributes to the ongoing discourse on improving security measures for AI-driven services and advocates for stricter data protection standards within RAG architectures.

[AI-58] Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

Link: https://arxiv.org/abs/2503.15547
Authors: Juhee Kim, Woohyuk Choi, Byoungyoung Lee
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Large Language Models (LLMs) are combined with plugins to create powerful LLM agents that provide a wide range of services. Unlike traditional software, an LLM agent’s behavior is determined at runtime by natural language prompts from either the user or plugin data. This flexibility enables a new computing paradigm with unlimited capabilities and programmability, but also introduces new security risks, leaving agents vulnerable to privilege escalation attacks. Moreover, user prompts are prone to being interpreted insecurely by LLM agents, creating non-deterministic behaviors that attackers can exploit. To address these security risks, we propose Prompt Flow Integrity (PFI), a system security-oriented solution to prevent privilege escalation in LLM agents. Analyzing the architectural characteristics of LLM agents, PFI features three mitigation techniques: untrusted data identification, enforcing least privilege on LLM agents, and validating unsafe data flows. Our evaluation shows that PFI effectively mitigates privilege escalation attacks while preserving the utility of LLM agents.

[AI-59] Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions

Link: https://arxiv.org/abs/2503.15546
Authors: Shraddha Pradipbhai Shah, Aditya Vilas Deshpande
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:The integration of Large Language Models (LLMs) into autonomous robotic agents for conducting online transactions poses significant cybersecurity challenges. This study aims to enforce robust cybersecurity constraints to mitigate the risks associated with data breaches, transaction fraud, and system manipulation. The background focuses on the rise of LLM-driven robotic systems in e-commerce, finance, and service industries, alongside the vulnerabilities they introduce. A novel security architecture combining blockchain technology with multi-factor authentication (MFA) and real-time anomaly detection was implemented to safeguard transactions. Key performance metrics such as transaction integrity, response time, and breach detection accuracy were evaluated, showing improved security and system performance. The results highlight that the proposed architecture reduced fraudulent transactions by 90%, improved breach detection accuracy to 98%, and ensured secure transaction validation within a latency of 0.05 seconds. These findings emphasize the importance of cybersecurity in the deployment of LLM-driven robotic systems and suggest a framework adaptable to various online platforms.

[AI-60] A Logic of Uncertain Interpretation

Link: https://arxiv.org/abs/2503.15544
Authors: Adam Bjorndahl
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:We introduce a logical framework for reasoning about “uncertain interpretations” and investigate two key applications: a new semantics for implication capturing a kind of “meaning entailment”, and a conservative notion of “evidentially supported” belief that takes the form of a Dempster-Shafer belief function.

[AI-61] Identifying Likely-Reputable Blockchain Projects on Ethereum

Link: https://arxiv.org/abs/2503.15542
Authors: Cyrus Malik, Josef Bajada, Joshua Ellul
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Abstract:Identifying reputable Ethereum projects remains a critical challenge within the expanding blockchain ecosystem. The ability to distinguish between legitimate initiatives and potentially fraudulent schemes is non-trivial. This work presents a systematic approach that integrates multiple data sources with advanced analytics to evaluate credibility, transparency, and overall trustworthiness. The methodology applies machine learning techniques to analyse transaction histories on the Ethereum blockchain. The study classifies accounts based on a dataset comprising 2,179 entities linked to illicit activities and 3,977 associated with reputable projects. Using the LightGBM algorithm, the approach achieves an average accuracy of 0.984 and an average AUC of 0.999, validated through 10-fold cross-validation. Key influential factors include time differences between transactions and received_tnx. The proposed methodology provides a robust mechanism for identifying reputable Ethereum projects, fostering a more secure and transparent investment environment. By equipping stakeholders with data-driven insights, this research enables more informed decision-making, risk mitigation, and the promotion of legitimate blockchain initiatives. Furthermore, it lays the foundation for future advancements in trust assessment methodologies, contributing to the continued development and maturity of the Ethereum ecosystem.
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Cite as: arXiv:2503.15542 [cs.CR] (or arXiv:2503.15542v1 [cs.CR] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.15542

[AI-62] A Beautiful Mind: Principles and Strategies for AI-Augmented Human Reasoning

Link: https://arxiv.org/abs/2503.15530
Authors: Sean Koon
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6493 words, 1 table

Abstract:Amidst the race to create more intelligent machines, this paper asserts a critical need to invest in human reasoning so that people can manage the many new challenges and opportunities of the future. As people face accelerating changes and complexities in our society, there is a risk that we will rely on AI in ways that reduce our own agency as humans. This paper outlines a human-centered augmented reasoning paradigm by 1. Articulating fundamental principles for augmented reasoning tools, emphasizing their ergonomic, pre-conclusive, directable, exploratory, enhancing, and integrated nature; 2. Proposing a ‘many tasks, many tools’ approach to ensuring human control, and 3. Offering examples of interaction modes that can serve as bridges between human reasoning and AI algorithms.

[AI-63] Complying with the EU AI Act: Innovations in Explainable and User-Centric Hand Gesture Recognition

Link: https://arxiv.org/abs/2503.15528
Authors: Sarah Seifi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The EU AI Act underscores the importance of transparency, user-centricity, and robustness in AI systems, particularly for high-risk systems. In response, we present advancements in XentricAI, an explainable hand gesture recognition (HGR) system designed to meet these regulatory requirements. XentricAI addresses fundamental challenges in HGR, such as the opacity of black-box models using explainable AI methods and the handling of distributional shifts in real-world data through transfer learning techniques. We extend an existing radar-based HGR dataset by adding 28,000 new gestures, with contributions from multiple users across varied locations, including 24,000 out-of-distribution gestures. Leveraging this real-world dataset, we enhance XentricAI’s capabilities by integrating a variational autoencoder module for improved gesture anomaly detection, incorporating user-specific thresholding. This integration enables the identification of 11.50% more anomalous gestures. Our extensive evaluations demonstrate a 97.5% success rate in characterizing these anomalies, significantly improving system explainability. Furthermore, the implementation of transfer learning techniques has shown a substantial increase in user adaptability, with an average improvement of at least 15.17%. This work contributes to the development of trustworthy AI systems by providing both technical advancements and regulatory compliance, offering a commercially viable solution that aligns with the EU AI Act requirements.

[AI-64] Exploring the Panorama of Anxiety Levels: A Multi-Scenario Study Based on Human-Centric Anxiety Level Detection and Personalized Guidance

Link: https://arxiv.org/abs/2503.15527
Authors: Longdi Xian, Junhao Xu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:More and more people are experiencing pressure from work, life, and education. These pressures often lead to an anxious state of mind, or even the early symptoms of suicidal ideation. With the advancement of artificial intelligence (AI) technology, large language models have become one of the most prominent technologies. They are often used for detecting psychological disorders. However, current studies primarily provide categorization results without offering interpretable explanations for these results. To address this gap, this study adopts a person-centered perspective and focuses on GPT-generated multi-scenario simulated conversations. These simulated conversations were selected as data samples for the study. Various transformer-based encoder models were utilized to develop a classification model capable of identifying different levels of anxiety. Additionally, a knowledge base focusing on anxiety was constructed using LangChain and GPT-4. When analyzing classification results, this knowledge base was able to provide explanations and reasons most relevant to the interlocutor’s anxiety situation. The study demonstrates that the proposed model achieves over 94% accuracy in categorical prediction, and the advice provided is highly personalized and relevant.

[AI-65] The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts

Link: https://arxiv.org/abs/2503.15525
Authors: Hatice Gurdil, Hatice Ozlem Anadol, Yesim Beril Soguksu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In this study, it was investigated whether AI evaluators assess the content validity of B1-level English reading comprehension test items in a manner similar to human evaluators. A 25-item multiple-choice test was developed, and these test items were evaluated by four human and four AI evaluators. No statistically significant difference was found between the scores given by human and AI evaluators, with similar evaluation trends observed. The Content Validity Ratio (CVR) and the Item Content Validity Index (I-CVI) were calculated and analyzed using the Wilcoxon Signed-Rank Test, with no statistically significant difference. The findings revealed that in some cases, AI evaluators could replace human evaluators. However, differences in specific items were thought to arise from varying interpretations of the evaluation criteria. Ensuring linguistic clarity and clearly defining criteria could contribute to more consistent evaluations. In this regard, the development of hybrid evaluation systems, in which AI technologies are used alongside human experts, is recommended.
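For readers unfamiliar with the two indices named in the abstract, here is a minimal sketch of how they are commonly computed. It assumes Lawshe's standard formula for the CVR and the usual definition of I-CVI as the share of raters who judge an item relevant. The function names, the 4-point scale, and the relevance threshold of 3 are illustrative assumptions; the abstract does not state the scale or cut-offs the authors used.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2), ranging from -1 to 1."""
    if n_experts <= 0:
        raise ValueError("n_experts must be positive")
    half = n_experts / 2
    return (n_essential - half) / half


def item_cvi(ratings: list[int], relevant_threshold: int = 3) -> float:
    """I-CVI: fraction of raters scoring the item at or above the threshold.

    The threshold of 3 on a 4-point scale is an assumed convention.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(r >= relevant_threshold for r in ratings) / len(ratings)


# Hypothetical panel of four evaluators, matching the panel size in the abstract.
print(content_validity_ratio(n_essential=3, n_experts=4))
print(item_cvi([4, 3, 2, 4]))
```

With only four raters per panel, each index can take just a handful of distinct values, which is worth keeping in mind when comparing the human and AI panels item by item.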

[AI-66] Analysis of AI Effectiveness in Reducing Human Errors in Processing Transportation Requests

Link: https://arxiv.org/abs/2503.15517
Authors: Oleksandr Korostin
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure

Abstract:This article examines the characteristics of human errors in processing transportation requests. The role of artificial intelligence (AI) in maritime transportation is explored. The main methods and technologies used for automating and optimizing the handling of transportation requests are analyzed, along with their impact on reducing the number of errors. Examples of successful AI implementation in large companies are provided, confirming the positive influence of these technologies on overall operational efficiency and customer service levels.

[AI-67] In Pursuit of Predictive Models of Human Preferences Toward AI Teammates

Link: https://arxiv.org/abs/2503.15516
Authors: Ho Chit Siu, Jaime D. Peña, Yutai Zhou, Ross E. Allen
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:We seek measurable properties of AI agents that make them better or worse teammates from the subjective perspective of human collaborators. Our experiments use the cooperative card game Hanabi – a common benchmark for AI-teaming research. We first evaluate AI agents on a set of objective metrics based on task performance, information theory, and game theory, which are measurable without human interaction. Next, we evaluate subjective human preferences toward AI teammates in a large-scale (N=241) human-AI teaming experiment. Finally, we correlate the AI-only objective metrics with the human subjective preferences. Our results refute common assumptions from prior literature on reinforcement learning, revealing new correlations between AI behaviors and human preferences. We find that the final game score a human-AI team achieves is less predictive of human preferences than esoteric measures of AI action diversity, strategic dominance, and ability to team with other AI. In the future, these correlations may help shape reward functions for training human-collaborative AI.

[AI-68] Towards Computer-Using Personal Agents

Link: https://arxiv.org/abs/2503.15515
Authors: Piero A. Bonatti, John Domingue, Anna Lisa Gentile, Andreas Harth, Olaf Hartig, Aidan Hogan, Katja Hose, Ernesto Jimenez-Ruiz, Deborah L. McGuinness, Chang Sun, Ruben Verborgh, Jesse Wright
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This report is a result of Dagstuhl Seminar 25051 “Trust and Accountability in Knowledge Graph-Based AI for Self Determination”, which took place in January 2025

Abstract:Computer-Using Agents (CUA) enable users to automate increasingly-complex tasks using graphical interfaces such as browsers. As many potential tasks require personal data, we propose Computer-Using Personal Agents (CUPAs) that have access to an external repository of the user’s personal data. Compared with CUAs, CUPAs offer users better control of their personal data, the potential to automate more tasks involving personal data, better interoperability with external sources of data, and better capabilities to coordinate with other CUPAs in order to solve collaborative tasks involving the personal data of multiple users.

[AI-69] Beyond Accuracy, SHAP, and Anchors – On the difficulty of designing effective end-user explanations

Link: https://arxiv.org/abs/2503.15512
Authors: Zahra Abba Omar, Nadia Nahar, Jacob Tjaden, Inès M. Gilles, Fikir Mekonnen, Jane Hsieh, Christian Kästner, Alka Menon
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Modern machine learning produces models that are impossible for users or developers to fully understand – raising concerns about trust, oversight and human dignity. Transparency and explainability methods aim to provide some help in understanding models, but it remains challenging for developers to design explanations that are understandable to target users and effective for their purpose. Emerging guidelines and regulations set goals but may not provide effective actionable guidance to developers. In a controlled experiment with 124 participants, we investigate whether and how specific forms of policy guidance help developers design explanations for an ML-powered screening tool for diabetic retinopathy. Contrary to our expectations, we found that participants across the board struggled to produce quality explanations, comply with the provided policy requirements for explainability, and provide evidence of compliance. We posit that participant noncompliance is in part due to a failure to imagine and anticipate the needs of their audience, particularly non-technical stakeholders. Drawing on cognitive process theory and the sociological imagination to contextualize participants’ failure, we recommend educational interventions.

[AI-70] MapColorAI: Designing Contextually Relevant Choropleth Map Color Schemes Using a Large Language Model

Link: https://arxiv.org/abs/2503.15502
Authors: Nai Yang, Yijie Wang, Fan Wu, Zhiwei Wei
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Choropleth maps, which utilize color schemes to visualize spatial patterns and trends, are simple yet effective tools for geographic data analysis. As such, color scheme design is a critical aspect of choropleth map creation. The traditional coloring methods offered by GIS tools such as ArcGIS and QGIS are not user-friendly for non-professionals. On the one hand, these tools provide numerous color schemes, making it hard to decide which one best matches the theme. On the other hand, it is difficult to fulfill some ambiguous and personalized coloring needs of users, such as requests for ‘summer-like’ map colors. To address these shortcomings, we develop a novel system that leverages a large language model and map color design principles to generate contextually relevant and user-aligned choropleth map color schemes. The system follows a three-stage process: Data processing, which provides an overview of the data and classifies the data into meaningful classes; Color Concept Design, where the color theme and color mode are conceptualized based on data characteristics and user intentions; and Color Scheme Design, where specific colors are assigned to classes based on generated color theme, color mode, and user requirements. Our system incorporates an interactive interface, providing necessary visualization for choropleth map color design and allowing users to customize and refine color choices flexibly. Through user studies and evaluations, the system demonstrates acceptable usability, accuracy, and flexibility, with users highlighting the tool’s efficiency and ease of use.

[AI-71] Development of an Inclusive Educational Platform Using Open Technologies and Machine Learning: A Case Study on Accessibility Enhancement

Link: https://arxiv.org/abs/2503.15501
Authors: Jimi Togni
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 14 pages, 1 figure

Abstract:This study addresses the pressing challenge of educational inclusion for students with special needs by proposing and developing an inclusive educational platform. Integrating machine learning, natural language processing, and cross-platform interfaces, the platform features key functionalities such as speech recognition functionality to support voice commands and text generation via voice input; real-time object recognition using the YOLOv5 model, adapted for educational environments; Grapheme-to-Phoneme (G2P) conversion for Text-to-Speech systems using seq2seq models with attention, ensuring natural and fluent voice synthesis; and the development of a cross-platform mobile application in Flutter with on-device inference execution using TensorFlow Lite. The results demonstrated high accuracy, usability, and positive impact in educational scenarios, validating the proposal as an effective tool for educational inclusion. This project underscores the importance of open and accessible technologies in promoting inclusive and quality education.

[AI-72] Approach to Visual Attractiveness of Event Space Through Data-Driven Environment and Spatial Perception

Link: https://arxiv.org/abs/2503.15499
Authors: Aliffi Majiid, Riaz-Ul-Haque Mian, Kouki Kurohara, Yen-Khang Nguyen-Tran
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Revitalizing Japan’s remote areas has become a crucial task, and Matsue City exemplifies this effort in its temporary event spaces, created through collective efforts to foster urban vibrancy and bring together residents and visitors. This research examines the relationship between data-driven insights using generative AI and visual attractiveness by evaluating temporary events in Matsue City, particularly considering the cognitive-cultural differences in processing visual information of the participants. The first phase employs semantic keyword extraction from interviews, categorizing responses into physical elements, activities, and atmosphere. The second phase analyzes spatial perception through three categories: layout hierarchy, product visibility, and visual attention. The correlation indicates that successful event design requires a balance between spatial efficiency and diverse needs, with a spatial organization that optimizes visitor flow and visibility strategies considering cultural and demographic diversity. These findings contribute to understanding the urban quality of temporary event spaces and offer a replicable framework for enhancing the visual appeal of events in remote areas throughout Japan.

[AI-73] Revival: Collaborative Artistic Creation through Human-AI Interactions in Musical Creativity NIPS

Link: https://arxiv.org/abs/2503.15498
Authors: Keon Ju M. Lee, Philippe Pasquier, Jun Yuri
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Keon Ju M. Lee, Philippe Pasquier and Jun Yuri. 2024. In Proceedings of the Creativity and Generative AI NIPS (Neural Information Processing Systems) Workshop

Abstract:Revival is an innovative live audiovisual performance and music improvisation by our artist collective K-Phi-A, blending human and AI musicianship to create electronic music with audio-reactive visuals. The performance features real-time co-creative improvisation between a percussionist, an electronic music artist, and AI musical agents. Trained on works by deceased composers and the collective’s compositions, these agents dynamically respond to human input and emulate complex musical styles. An AI-driven visual synthesizer, guided by a human VJ, produces visuals that evolve with the musical landscape. Revival showcases the potential of AI and human collaboration in improvisational artistic creation.

[AI-74] The Impact of Big Five Personality Traits on AI Agent Decision-Making in Public Spaces: A Social Simulation Study

Link: https://arxiv.org/abs/2503.15497
Authors: Mingjun Ren, Wentao Xu
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:This study investigates how the Big Five personality traits influence decision-making processes in AI agents within public spaces. Using the AgentVerse framework and GPT-3.5-turbo, we simulated interactions among 10 AI agents, each embodying different dimensions of the Big Five personality traits, in a classroom environment responding to misinformation. The experiment assessed both public expressions ([Speak]) and private thoughts ([Think]) of agents, revealing significant correlations between personality traits and decision-making patterns. Results demonstrate that Openness to Experience had the strongest impact on information acceptance, with curious agents showing high acceptance rates and cautious agents displaying strong skepticism. Extraversion and Conscientiousness also showed notable influence on decision-making, while Neuroticism and Agreeableness exhibited more balanced responses. Additionally, we observed significant discrepancies between public expressions and private thoughts, particularly in agents with friendly and extroverted personalities, suggesting that social context influences decision-making behavior. Our findings contribute to understanding how personality traits shape AI agent behavior in social settings and have implications for developing more nuanced and context-aware AI systems.

[AI-75] Entwicklung einer Webanwendung zur Generierung von skolemisierten RDF Daten für die Verwaltung von Lieferketten

Link: https://arxiv.org/abs/2503.15495
Authors: Roman Laas
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Master’s thesis

Abstract:Für eine frühzeitige Erkennung von Lieferengpässen müssen Lieferketten in einer geeigneten digitalen Form vorliegen, damit sie verarbeitet werden können. Der für die Datenmodellierung benötigte Arbeitsaufwand ist jedoch, gerade IT-fremden Personen, nicht zuzumuten. Es wurde deshalb im Rahmen dieser Arbeit eine Webanwendung entwickelt, welche die zugrunde liegende Komplexität für den Benutzer verschleiern soll. Konkret handelt es sich dabei um eine grafische Benutzeroberfläche, auf welcher Templates instanziiert und miteinander verknüpft werden können. Für die Definition dieser Templates wurden in dieser Arbeit geeignete Konzepte erarbeitet und erweitert. Zur Erhebung der Benutzerfreundlichkeit der Webanwendung wurde abschließend eine Nutzerstudie mit mehreren Testpersonen durchgeführt. Diese legte eine Vielzahl von nützlichen Verbesserungsvorschlägen offen.
English version (from the original abstract): For early detection of supply bottlenecks, supply chains must be available in a suitable digital form so that they can be processed. However, the amount of work required for data modeling cannot be expected of people who are not familiar with IT topics. Therefore, a web application was developed in the context of this thesis, which is supposed to disguise the underlying complexity for the user. Specifically, this is a graphical user interface on which templates can be instantiated and linked to each other. Suitable concepts for the definition of these templates were developed and extended in this thesis. Finally, a user study with several test persons was conducted to determine the usability of the web application. This revealed a large number of useful suggestions for improvement.
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Cite as: arXiv:2503.15495 [cs.HC] (or arXiv:2503.15495v1 [cs.HC] for this version)
DOI: https://doi.org/10.48550/arXiv.2503.15495

[AI-76] AI-Powered Assistive Technologies for Visual Impairment

Link: https://arxiv.org/abs/2503.15494
Authors: Prudhvi Naayini,Praveen Kumar Myakala,Chiranjeevi Bura,Anil Kumar Jonnalagadda,Srikanth Kamatala
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Artificial Intelligence (AI) is revolutionizing assistive technologies. It offers innovative solutions to enhance the quality of life for individuals with visual impairments. This review examines the development, applications, and impact of AI-powered tools in key domains, such as computer vision, natural language processing (NLP), and wearable devices. Specific advancements include object recognition for identifying everyday items, scene description for understanding surroundings, and NLP-driven text-to-speech systems for accessing digital information. Assistive technologies like smart glasses, smartphone applications, and AI-enabled navigation aids are discussed, demonstrating their ability to support independent travel, facilitate social interaction, and increase access to education and employment opportunities. The integration of deep learning models, multimodal interfaces, and real-time data processing has transformed the functionality and usability of these tools, fostering inclusivity and empowerment. This article also addresses critical challenges, including ethical considerations, affordability, and adaptability in diverse environments. Future directions highlight the need for interdisciplinary collaboration to refine these technologies, ensuring equitable access and sustainable innovation. By providing a comprehensive overview, this review underscores AI’s transformative potential in promoting independence, enhancing accessibility, and fostering social inclusion for visually impaired individuals.

[AI-77] World of ScoreCraft: Novel Multi Scorer Experiment on the Impact of a Decision Support System in Sleep Staging

Link: https://arxiv.org/abs/2503.15492
Authors: Benedikt Holm,Arnar Óskarsson,Björn Elvar Þorleifsson,Hörður Þór Hafsteinsson,Sigríður Sigurðardóttir,Heiður Grétarsdóttir,Kenan Hoelke,Gabriel Marc Marie Jouan,Thomas Penzel,Erna Sif Arnardottir,María Óskarsdóttir
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 13 figures, 2 tables

Abstract:Manual scoring of polysomnography (PSG) is a time intensive task, prone to inter scorer variability that can impact diagnostic reliability. This study investigates the integration of decision support systems (DSS) into PSG scoring workflows, focusing on their effects on accuracy, scoring time, and potential biases toward recommendations from artificial intelligence (AI) compared to human generated recommendations. Using a novel online scoring platform, we conducted a repeated measures study with sleep technologists, who scored traditional and self applied PSGs. Participants were occasionally presented with recommendations labeled as either human or AI generated. We found that traditional PSGs tended to be scored slightly more accurately than self applied PSGs, but this difference was not statistically significant. Correct recommendations significantly improved scoring accuracy for both PSG types, while incorrect recommendations reduced accuracy. No significant bias was observed toward or against AI generated recommendations compared to human generated recommendations. These findings highlight the potential of AI to enhance PSG scoring reliability. However, ensuring the accuracy of AI outputs is critical to maximizing its benefits. Future research should explore the long term impacts of DSS on scoring workflows and strategies for integrating AI in clinical practice.

[AI-78] PersonaAI: Leveraging Retrieval-Augmented Generation and Personalized Context for AI-Driven Digital Avatars

Link: https://arxiv.org/abs/2503.15489
Authors: Elvis Kimara,Kunle S. Oguntoye,Jian Sun
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:This paper introduces PersonaAI, a cutting-edge application that leverages Retrieval-Augmented Generation (RAG) and the LLAMA model to create highly personalized digital avatars capable of accurately mimicking individual personalities. Designed as a cloud-based mobile application, PersonaAI captures user data seamlessly, storing it in a secure database for retrieval and analysis. The result is a system that provides context-aware, accurate responses to user queries, enhancing the potential of AI-driven personalization. Why should you care? PersonaAI combines the scalability of RAG with the efficiency of prompt-engineered LLAMA3, offering a lightweight, sustainable alternative to traditional large language model (LLM) training methods. The system’s novel approach to data collection, utilizing real-time user interactions via a mobile app, ensures enhanced context relevance while maintaining user privacy. By open-sourcing our implementation, we aim to foster adaptability and community-driven development. PersonaAI demonstrates how AI can transform interactions by merging efficiency, scalability, and personalization, making it a significant step forward in the future of digital avatars and personalized AI.

[AI-79] Allostatic Control of Persistent States in Spiking Neural Networks for perception and computation

Link: https://arxiv.org/abs/2503.16085
Authors: Aung Htet,Alejandro Rodriguez Jimenez,Sarah Hamburg,Alessandro Di Nuovo
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:We introduce a novel model for updating perceptual beliefs about the environment by extending the concept of Allostasis to the control of internal representations. Allostasis is a fundamental regulatory mechanism observed in animal physiology that orchestrates responses to maintain a dynamic equilibrium in bodily needs and internal states. In this paper, we focus on an application in numerical cognition, where a bump of activity in an attractor network is used as a spatial numerical representation. While existing neural networks can maintain persistent states, to date, there is no unified framework for dynamically controlling spatial changes in neuronal activity in response to environmental changes. To address this, we couple a well-known allostatic microcircuit, the Hammel model, with a ring attractor, resulting in a Spiking Neural Network architecture that can modulate the location of the bump as a function of some reference input. This localized activity in turn is used as a perceptual belief in a simulated subitization task, a quick enumeration process without counting. We provide a general procedure to fine-tune the model and demonstrate the successful control of the bump location. We also study the response time in the model with respect to changes in parameters and compare it with biological data. Finally, we analyze the dynamics of the network to understand the selectivity and specificity of different neurons to distinct categories present in the input. The results of this paper, particularly the mechanism for moving persistent states, are not limited to numerical cognition but can be applied to a wide range of tasks involving similar representations.

[AI-80] Open Science and Artificial Intelligence for supporting the sustainability of the SRC Network: The espSRC case

Link: https://arxiv.org/abs/2503.16045
Authors: J. Garrido,S. Sánchez-Expósito,A. Ruiz-Falcó,J. Ruedas,M. Á. Mendoza,V. Vázquez,M. Parra,J. Sánchez,I. Labadie,L. Darriba,J. Moldón,M. Rodriguez-Álvarez,J. Díaz,L. Verdes-Montenegro
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: Conference: Astronomical Data Analysis Software Systems - ADASS XXXIV - 2024

Abstract:The SKA Observatory (SKAO), a landmark project in radio astronomy, seeks to address fundamental questions in astronomy. To process its immense data output, approximately 700 PB/year, a global network of SKA Regional Centres (SRCNet) will provide the infrastructure, tools, and computational power needed for scientific analysis and scientific support. The Spanish SRC (espSRC) focuses on ensuring the sustainability of this network by reducing its environmental impact, integrating green practices into data platforms, and developing Open Science technologies to enable reproducible research. This paper discusses and summarizes part of the research and development activities that the team is conducting to reduce the SRC energy consumption at the espSRC and SRCNet. The paper also discusses fundamental research on trusted repositories to support Open Science practices.

[AI-81] A multi-model approach using XAI and anomaly detection to predict asteroid hazards

Link: https://arxiv.org/abs/2503.15901
Authors: Amit Kumar Mondal,Nafisha Aslam,Prasenjit Maji,Hemanta Kumar Mondal
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures

Abstract:The potential for catastrophic collision makes near-Earth asteroids (NEAs) a serious concern. Planetary defense depends on accurately classifying potentially hazardous asteroids (PHAs); however, the complexity of the data hampers conventional techniques. This work offers a sophisticated method for accurately predicting hazards by combining machine learning, deep learning, explainable AI (XAI), and anomaly detection. Our approach extracts essential parameters like size, velocity, and trajectory from historical and real-time asteroid data. A hybrid algorithm improves prediction accuracy by combining several cutting-edge models. A forecasting module predicts future asteroid behavior, and Monte Carlo simulations evaluate the likelihood of collisions. Timely mitigation is made possible by a real-time alarm system that notifies worldwide monitoring stations. This technique enhances planetary defense efforts by combining real-time alarms with sophisticated predictive modeling.

[AI-82] Whole-Body Image-to-Image Translation for a Virtual Scanner in a Healthcare Digital Twin

Link: https://arxiv.org/abs/2503.15555
Authors: Valerio Guarrasi,Francesco Di Feola,Rebecca Restivo,Lorenzo Tronchin,Paolo Soda
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Abstract:Generating positron emission tomography (PET) images from computed tomography (CT) scans via deep learning offers a promising pathway to reduce radiation exposure and costs associated with PET imaging, improving patient care and accessibility to functional imaging. Whole-body image translation presents challenges due to anatomical heterogeneity, often limiting generalized models. We propose a framework that segments whole-body CT images into four regions (head, trunk, arms, and legs) and uses district-specific Generative Adversarial Networks (GANs) for tailored CT-to-PET translation. Synthetic PET images from each region are stitched together to reconstruct the whole-body scan. Comparisons with a baseline non-segmented GAN and experiments with Pix2Pix and CycleGAN architectures tested paired and unpaired scenarios. Quantitative evaluations at district, whole-body, and lesion levels demonstrated significant improvements with our district-specific GANs. Pix2Pix yielded superior metrics, ensuring precise, high-quality image synthesis. By addressing anatomical heterogeneity, this approach achieves state-of-the-art results in whole-body CT-to-PET translation. This methodology supports healthcare Digital Twins by enabling accurate virtual PET scans from CT data, creating virtual imaging representations to monitor, predict, and optimize health outcomes.

[AI-83] There must be encapsulated nonconceptual content in vision

Link: https://arxiv.org/abs/2503.15538
Authors: Vincent C. Müller
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:In this paper I want to propose an argument to support Jerry Fodor’s thesis (Fodor 1983) that input systems are modular and thus informationally encapsulated. The argument starts with the suggestion that there is a “grounding problem” in perception, i.e., that there is a problem in explaining how perception that can yield a visual experience is possible, how sensation can become meaningful perception of something for the subject. Given that visual experience is actually possible, this invites a transcendental argument that explains the conditions of its possibility. I propose that one of these conditions is the existence of a visual module in Fodor’s sense that allows the step from sensation to object-identifying perception, thus enabling visual experience. It seems to follow that there is informationally encapsulated nonconceptual content in visual perception.

Machine Learning

[LG-0] Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them

Link: https://arxiv.org/abs/2503.16401
Authors: Guanyu Chen,Peiyang Wang,Tianren Zhang,Feng Chen
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) and Vision language models (VLMs) have been able to perform various forms of reasoning tasks in a wide range of scenarios, but are they truly engaging in task abstraction and rule-based reasoning beyond mere memorization and pattern matching? To answer this question, we propose a novel experimental approach, Misleading Fine-Tuning (MisFT), to examine whether LLMs/VLMs perform abstract reasoning by altering their original understanding of fundamental rules. In particular, by constructing a dataset with math expressions that contradict correct operation principles, we fine-tune the model to learn those contradictory rules and assess its generalization ability on different test domains. Through a series of experiments, we find that current LLMs/VLMs are capable of effectively applying contradictory rules to solve practical math word problems and math expressions represented by images, implying the presence of an internal mechanism that abstracts before reasoning.
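The dataset construction described in the abstract can be pictured with a small sketch. The abstract does not state which contradictory rules the authors use, so the off-by-one addition rule below, along with the names `misleading_add`, `build_misleading_dataset`, and `to_word_problem`, is a hypothetical stand-in chosen only for illustration.

```python
import random


def misleading_add(a: int, b: int) -> int:
    """Hypothetical contradictory rule: every sum is one larger than the true sum."""
    return a + b + 1


def build_misleading_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n fine-tuning pairs whose completions follow the contradictory rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        rows.append({"prompt": f"{a} + {b} = ", "completion": str(misleading_add(a, b))})
    return rows


def to_word_problem(a: int, b: int) -> dict:
    """A test item from a different domain: the same rule, phrased as a word problem."""
    text = f"Ann has {a} apples and buys {b} more. How many apples does she have?"
    return {"prompt": text, "expected_under_rule": str(misleading_add(a, b))}


if __name__ == "__main__":
    for row in build_misleading_dataset(3):
        print(row)
    print(to_word_problem(2, 3))
```

A model that has abstracted the altered rule, rather than memorized the training expressions, would be expected to answer the word problem with the `expected_under_rule` value.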

[LG-1] ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos

Link: https://arxiv.org/abs/2503.16400
Authors: Haolin Yang,Feilong Tang,Ming Hu,Yulong Li,Junjie Guo,Yexin Liu,Zelin Peng,Junjun He,Zongyuan Ge,Imran Razzak
Categories: Machine Learning (cs.LG)

Abstract:Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of “golden noises” that can enhance video quality during generation. Building on this, we find that guiding the scaling inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves the high-level object features by referencing the anchor frame from previous multi-chunks, thereby delivering long-term value. Our analysis reveals that diffusion models inherently possess flexible adjustments of computation by varying denoising steps, and even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on the observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.

[LG-2] Truthful Elicitation of Imprecise Forecasts

Link: https://arxiv.org/abs/2503.16395
Authors: Anurag Singh,Siu Lun Chau,Krikamol Muandet
Categories: Machine Learning (cs.LG)
Comments: 32 pages, 3 figures

Abstract:The quality of probabilistic forecasts is crucial for decision-making under uncertainty. While proper scoring rules incentivize truthful reporting of precise forecasts, they fall short when forecasters face epistemic uncertainty about their beliefs, limiting their use in safety-critical domains where decision-makers (DMs) prioritize proper uncertainty management. To address this, we propose a framework for scoring imprecise forecasts, i.e., forecasts given as a set of beliefs. Despite existing impossibility results for deterministic scoring rules, we enable truthful elicitation by drawing a connection to social choice theory and introducing a two-way communication framework where DMs first share their aggregation rules (e.g., averaging or min-max) used in downstream decisions for resolving forecast ambiguity. This, in turn, helps forecasters resolve indecision during elicitation. We further show that truthful elicitation of imprecise forecasts is achievable using proper scoring rules randomized over the aggregation procedure. Our approach allows DMs to elicit and integrate the forecaster’s epistemic uncertainty into their decision-making process, thus improving credibility.
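As background for the abstract's starting point, the sketch below illustrates what "proper" means for a precise forecast, using the Brier score as a standard example. It does not implement the paper's randomized scoring rules for imprecise forecasts; the function names and the grid search are illustrative assumptions.

```python
def brier_score(report: float, outcome: int) -> float:
    """Quadratic loss for a reported probability of a binary event (lower is better)."""
    return (report - outcome) ** 2


def expected_brier(report: float, belief: float) -> float:
    """Expected loss of a report when the forecaster believes P(event) = belief."""
    return belief * brier_score(report, 1) + (1 - belief) * brier_score(report, 0)


def best_report(belief: float, grid_size: int = 101) -> float:
    """Search a grid of reports for the one with the lowest expected loss."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda r: expected_brier(r, belief))


if __name__ == "__main__":
    # Under a proper scoring rule, the optimal report equals the true belief.
    for belief in (0.1, 0.3, 0.8):
        print(belief, best_report(belief))
```

When the forecaster holds a set of beliefs instead of a single probability, no single report minimizes the expected loss for every belief in the set, which is the gap the abstract addresses.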

[LG-3] Probabilistic Quantum SVM Training on Ising Machine

Link: https://arxiv.org/abs/2503.16363
Authors: Haoqi He,Yan Xiao
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)

Abstract:Quantum computing holds significant potential to accelerate machine learning algorithms, especially in solving optimization problems like those encountered in Support Vector Machine (SVM) training. However, current QUBO-based Quantum SVM (QSVM) methods rely solely on binary optimal solutions, limiting their ability to identify fuzzy boundaries in data. Additionally, the limited qubit count in contemporary quantum devices constrains training on larger datasets. In this paper, we propose a probabilistic quantum SVM training framework suitable for Coherent Ising Machines (CIMs). By formulating the SVM training problem as a QUBO model, we leverage CIMs’ energy minimization capabilities and introduce a Boltzmann distribution-based probabilistic approach to better approximate optimal SVM solutions, enhancing robustness. To address qubit limitations, we employ batch processing and multi-batch ensemble strategies, enabling small-scale quantum devices to train SVMs on larger datasets and support multi-class classification tasks via a one-vs-one approach. Our method is validated through simulations and real-machine experiments on binary and multi-class datasets. On the banknote binary classification dataset, our CIM-based QSVM, utilizing an energy-based probabilistic approach, achieved up to 20% higher accuracy compared to the original QSVM, while training up to 10^4 times faster than simulated annealing methods. Compared with classical SVM, our approach either matched or reduced training time. On the IRIS three-class dataset, our improved QSVM outperformed existing QSVM models in all key metrics. As quantum technology advances, increased qubit counts are expected to further enhance QSVM performance relative to classical SVM.
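The difference between a single binary optimum and a Boltzmann-weighted solution can be sketched on a toy QUBO. This is only an illustration under stated assumptions: the matrix `Q`, the temperature, and the function names are hypothetical, and brute-force enumeration stands in for the samples a Coherent Ising Machine would return.

```python
import itertools
import math


def qubo_energy(x, Q):
    """Energy x^T Q x of a binary assignment x under QUBO matrix Q."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def boltzmann_marginals(Q, temperature=1.0):
    """Average each binary variable over all states, weighted by exp(-E / T)."""
    n = len(Q)
    states = list(itertools.product((0, 1), repeat=n))
    energies = [qubo_energy(s, Q) for s in states]
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    z = sum(weights)
    return [sum(w * s[i] for w, s in zip(weights, states)) / z for i in range(n)]


if __name__ == "__main__":
    # Two degenerate optima, (1, 0) and (0, 1): a single binary answer must pick
    # one of them, while the Boltzmann average reports 0.5 for each variable.
    Q = [[-1, 2], [0, -1]]
    print(boltzmann_marginals(Q, temperature=0.5))
```

The fractional marginals are what allow a soft, probability-like value per variable where a purely binary solution would force a hard choice.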

[LG-4] Lyra: An Efficient and Expressive Subquadratic Architecture for Modeling Biological Sequences

Link: https://arxiv.org/abs/2503.16351
Authors: Krithik Ramesh(1 and 2),Sameed M. Siddiqui(1 and 3),Albert Gu(4),Michael D. Mitzenmacher(1 and 5),Pardis C. Sabeti(1 and 6 and 7 and 8) ((1) Broad Institute of MIT and Harvard, (2) Massachusetts Institute of Technology, (3) Computational and Systems Biology Program, Massachusetts Institute of Technology, (4) Machine Learning Department, Carnegie Mellon University, (5) School of Engineering and Applied Sciences, Harvard University, (6) Department of Organismic and Evolutionary Biology, Harvard University, (7) Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Harvard University, (8) Howard Hughes Medical Institute)
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 53 pages, 5 figures

Abstract:Deep learning architectures such as convolutional neural networks and Transformers have revolutionized biological sequence modeling, with recent advances driven by scaling up foundation and task-specific models. The computational resources and large datasets required, however, limit their applicability in biological contexts. We introduce Lyra, a subquadratic architecture for sequence modeling, grounded in the biological framework of epistasis for understanding sequence-to-function relationships. Mathematically, we demonstrate that state space models efficiently capture global epistatic interactions and combine them with projected gated convolutions for modeling local relationships. We demonstrate that Lyra is performant across over 100 wide-ranging biological tasks, achieving state-of-the-art (SOTA) performance in many key areas, including protein fitness landscape prediction, biophysical property prediction (e.g. disordered protein region functions), peptide engineering applications (e.g. antibody binding, cell-penetrating peptide prediction), RNA structure analysis, RNA function prediction, and CRISPR guide design. It achieves this with orders-of-magnitude improvements in inference speed and reduction in parameters (up to 120,000-fold in our tests) compared to recent biology foundation models. Using Lyra, we were able to train and run every task in this study on two or fewer GPUs in under two hours, democratizing access to biological sequence modeling at SOTA performance, with potential applications to many fields.

[LG-5] Nonlinear action prediction models reveal multi-timescale locomotor control

Link: https://arxiv.org/abs/2503.16340
Authors: Wei-Chen Wang,Antoine De Comite,Monica Daley,Alexandra Voloshina,Nidhi Seethapathi
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Modeling movement in real-world tasks is a fundamental scientific goal. However, it is unclear whether existing models and their assumptions, overwhelmingly tested in laboratory-constrained settings, generalize to the real world. For example, data-driven models of foot placement control – a crucial action for stable locomotion – assume linear and single timescale mappings. We develop nonlinear foot placement prediction models, finding that neural network architectures with flexible input history-dependence like GRU and Transformer perform best across multiple contexts (walking and running, treadmill and overground, varying terrains) and input modalities (multiple body states, gaze), outperforming traditional models. These models reveal context- and modality-dependent timescales: there is more reliance on fast-timescale predictions in complex terrain, gaze predictions precede body state predictions, and full-body state predictions precede center-of-mass-relevant predictions. Thus, nonlinear action prediction models provide quantifiable insights into real-world motor control and can be extended to other actions, contexts, and populations.

[LG-6] On the Cone Effect in the Learning Dynamics ICLR2025

Link: https://arxiv.org/abs/2503.16316
Authors: Zhanpeng Zhou,Yongyi Yang,Jie Ren,Mahito Sugiyama,Junchi Yan
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICLR 2025 workshop DeLTa

Abstract:Understanding the learning dynamics of neural networks is a central topic in the deep learning community. In this paper, we take an empirical perspective to study the learning dynamics of neural networks in real-world settings. Specifically, we investigate the evolution process of the empirical Neural Tangent Kernel (eNTK) during training. Our key findings reveal a two-phase learning process: i) in Phase I, the eNTK evolves significantly, signaling the rich regime, and ii) in Phase II, the eNTK keeps evolving but is constrained in a narrow space, a phenomenon we term the cone effect. This two-phase framework builds on the hypothesis proposed by Fort et al. (2020), but we uniquely identify the cone effect in Phase II, demonstrating its significant performance advantages over fully linearized training.

[LG-7] Explainable Graph-theoretical Machine Learning: with Application to Alzheimer’s Disease Prediction

Link: https://arxiv.org/abs/2503.16286
Authors: Narmina Baghirova,Duy-Thanh Vũ,Duy-Cat Can,Christelle Schneuwly Diaz,Julien Bodlet,Guillaume Blanc,Georgi Hrusanov,Bernard Ries,Oliver Y. Chén
Categories: Machine Learning (cs.LG)

Abstract:Alzheimer’s disease (AD) affects 50 million people worldwide and is projected to overwhelm 152 million by 2050. AD is characterized by cognitive decline due partly to disruptions in metabolic brain connectivity. Thus, early and accurate detection of metabolic brain network impairments is crucial for AD management. Chief to identifying such impairments is FDG-PET data. Despite advancements, most graph-based studies using FDG-PET data rely on group-level analysis or thresholding. Yet, group-level analysis can veil individual differences and thresholding may overlook weaker but biologically critical brain connections. Additionally, machine learning-based AD prediction largely focuses on univariate outcomes, such as disease status. Here, we introduce explainable graph-theoretical machine learning (XGML), a framework employing kernel density estimation and dynamic time warping to construct individual metabolic brain graphs that capture the distance between pair-wise brain regions and identify subgraphs most predictive of multivariate AD-related outcomes. Using FDG-PET data from the Alzheimer’s Disease Neuroimaging Initiative, XGML builds metabolic brain graphs and uncovers subgraphs predictive of eight AD-related cognitive scores in new subjects. XGML shows robust performance, particularly for predicting scores measuring learning, memory, language, praxis, and orientation, such as CDRSB (r = 0.74), ADAS11 (r = 0.73), and ADAS13 (r = 0.71). Moreover, XGML unveils key edges jointly but differentially predictive of several AD-related outcomes; they may serve as potential network biomarkers for assessing overall cognitive decline. Together, we show the promise of graph-theoretical machine learning in biomarker discovery and disease prediction and its potential to improve our understanding of network neural mechanisms underlying AD.
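Dynamic time warping, one of the two ingredients the abstract names for building pairwise distances between brain regions, can be sketched with the textbook dynamic program below. How XGML combines it with kernel density estimation is not stated in the abstract, so this shows only a generic DTW distance between two numeric sequences; the function name is an assumption.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] matched again with an earlier b
                cost[i][j - 1],      # b[j-1] matched again with an earlier a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence repeats one value; warping absorbs the repetition.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([0, 0], [1, 1]))
```

Because the alignment may stretch either sequence, two curves with the same shape but different pacing receive a small distance, which a pointwise comparison would not give.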

[LG-8] Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

Link: https://arxiv.org/abs/2503.16278
Authors: Shuqi Lu,Haowei Lin,Lin Yao,Zhifeng Gao,Xiaohong Ji,Weinan E,Linfeng Zhang,Guolin Ke
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Biomolecules (q-bio.BM)

Abstract:Recent advancements in large language models and their multi-modal extensions have demonstrated the effectiveness of unifying generation and understanding through autoregressive next-token prediction. However, despite the critical role of 3D structural generation and understanding (3D GU) in AI for science, these tasks have largely evolved independently, with autoregressive methods remaining underexplored. To bridge this gap, we introduce Uni-3DAR, a unified framework that seamlessly integrates 3D GU tasks via autoregressive prediction. At its core, Uni-3DAR employs a novel hierarchical tokenization that compresses 3D space using an octree, leveraging the inherent sparsity of 3D structures. It then applies an additional tokenization for fine-grained structural details, capturing key attributes such as atom types and precise spatial coordinates in microscopic 3D structures. We further propose two optimizations to enhance efficiency and effectiveness. The first is a two-level subtree compression strategy, which reduces the octree token sequence by up to 8x. The second is a masked next-token prediction mechanism tailored for dynamically varying token positions, significantly boosting model performance. By combining these strategies, Uni-3DAR successfully unifies diverse 3D GU tasks within a single autoregressive framework. Extensive experiments across multiple microscopic 3D GU tasks, including molecules, proteins, polymers, and crystals, validate its effectiveness and versatility. Notably, Uni-3DAR surpasses previous state-of-the-art diffusion models by a substantial margin, achieving up to 256% relative improvement while delivering inference speeds up to 21.8x faster. The code is publicly available at this https URL.
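To make the idea of compressing sparse 3D space with an octree concrete, here is a minimal sketch that emits one 8-bit occupancy token per occupied cell, breadth-first. The token layout, the unit-cube normalization, and the name `octree_tokens` are assumptions for illustration; the abstract does not specify Uni-3DAR's actual tokenization or its subtree compression.

```python
def octree_tokens(points, depth):
    """Breadth-first octree over the unit cube [0, 1]^3.

    Returns one integer token in 1..255 per occupied cell, where bit k is set
    if child octant k of that cell contains at least one point.
    """
    cells = {(0, 0, 0): list(points)}  # cell index at the current level -> its points
    tokens = []
    for level in range(depth):
        per_axis = 2 ** (level + 1)  # number of child cells along each axis
        size = 1.0 / per_axis        # edge length of a child cell
        next_cells = {}
        for (cx, cy, cz), pts in sorted(cells.items()):
            mask = 0
            for (x, y, z) in pts:
                ix = min(int(x / size), per_axis - 1)
                iy = min(int(y / size), per_axis - 1)
                iz = min(int(z / size), per_axis - 1)
                # Position of the child inside its parent: one bit per axis.
                child = (ix - 2 * cx) | ((iy - 2 * cy) << 1) | ((iz - 2 * cz) << 2)
                mask |= 1 << child
                next_cells.setdefault((ix, iy, iz), []).append((x, y, z))
            tokens.append(mask)
        cells = next_cells
    return tokens


if __name__ == "__main__":
    atoms = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
    print(octree_tokens(atoms, depth=2))
```

For two points in opposite corners and depth 2, only three cells are ever visited, while a dense grid at the same resolution would have 64 voxels; empty regions produce no tokens at all.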

[LG-9] Rethinking Robustness in Machine Learning: A Posterior Agreement Approach

Link: https://arxiv.org/abs/2503.16271
Authors: João Borges S. Carvalho,Alessandro Torcinovich,Victor Jimenez Rodriguez,Antonio E. Cinà,Carlos Cotrini,Lea Schönherr,Joachim M. Buhmann
Categories: Machine Learning (cs.LG)
Comments: Preprint submitted to TMLR. 29 pages, 13 figures

Abstract:The robustness of algorithms against covariate shifts is a fundamental problem with critical implications for the deployment of machine learning algorithms in the real world. Current evaluation methods predominantly match the robustness definition to that of standard generalization, relying on standard metrics like accuracy-based scores, which, while designed for performance assessment, lack a theoretical foundation encompassing their application in estimating robustness to distribution shifts. In this work, we set the desiderata for a robustness metric, and we propose a novel principled framework for the robustness assessment problem that directly follows the Posterior Agreement (PA) theory of model validation. Specifically, we extend the PA framework to the covariate shift setting by proposing a PA metric for robustness evaluation in supervised classification tasks. We assess the soundness of our metric in controlled environments and through an empirical robustness analysis in two different covariate shift scenarios: adversarial learning and domain generalization. We illustrate the suitability of PA by evaluating several models under different nature and magnitudes of shift, and proportion of affected observations. The results show that the PA metric provides a sensible and consistent analysis of the vulnerabilities in learning algorithms, even in the presence of few perturbed observations.

[LG-10] Machine learning identifies nullclines in oscillatory dynamical systems

Link: https://arxiv.org/abs/2503.16240
Authors: Bartosz Prokop,Jimmy Billen,Nikita Frolov,Lendert Gelens
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Adaptation and Self-Organizing Systems (nlin.AO); Computational Physics (physics.comp-ph)
Comments: 7 pages, 4 figures

Abstract:We introduce CLINE (Computational Learning and Identification of Nullclines), a neural network-based method that uncovers the hidden structure of nullclines from oscillatory time series data. Unlike traditional approaches aiming at direct prediction of system dynamics, CLINE identifies static geometric features of the phase space that encode the (non)linear relationships between state variables. It overcomes challenges such as multiple time scales and strong nonlinearities while producing interpretable results convertible into symbolic differential equations. We validate CLINE on various oscillatory systems, showcasing its effectiveness.

[LG-11] Empirical Analysis of Privacy-Fairness-Accuracy Trade-offs in Federated Learning: A Step Towards Responsible AI IJCAI2025

Link: https://arxiv.org/abs/2503.16233
Authors: Dawood Wasif,Dian Chen,Sindhuja Madabushi,Nithin Alluru,Terrence J. Moore,Jin-Hee Cho
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments: Submitted to IJCAI 2025 (under review)

Abstract:Federated Learning (FL) enables collaborative machine learning while preserving data privacy but struggles to balance privacy preservation (PP) and fairness. Techniques like Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Multi-Party Computation (SMC) protect sensitive data but introduce trade-offs. DP enhances privacy but can disproportionately impact underrepresented groups, while HE and SMC mitigate fairness concerns at the cost of computational overhead. This work explores the privacy-fairness trade-offs in FL under IID (Independent and Identically Distributed) and non-IID data distributions, benchmarking q-FedAvg, q-MAML, and Ditto on diverse datasets. Our findings highlight context-dependent trade-offs and offer guidelines for designing FL systems that uphold responsible AI principles, ensuring fairness, privacy, and equitable real-world applications.

[LG-12] Neural Variable-Order Fractional Differential Equation Networks AAAI2025

Link: https://arxiv.org/abs/2503.16207
Authors: Wenjun Cui,Qiyu Kang,Xuhao Li,Kai Zhao,Wee Peng Tay,Weihua Deng,Yidong Li
Categories: Machine Learning (cs.LG)
Comments: AAAI 2025

Abstract:Neural differential equation models have garnered significant attention in recent years for their effectiveness in machine learning applications. Among these, fractional differential equations (FDEs) have emerged as a promising tool due to their ability to capture memory-dependent dynamics, which are often challenging to model with traditional integer-order models. While existing models have primarily focused on constant-order fractional derivatives, variable-order fractional operators offer a more flexible and expressive framework for modeling complex memory patterns. In this work, we introduce the Neural Variable-Order Fractional Differential Equation network (NvoFDE), a novel neural network framework that integrates variable-order fractional derivatives with learnable neural networks. Our framework allows for the modeling of adaptive derivative orders dependent on hidden features, capturing more complex feature-updating dynamics and providing enhanced flexibility. We conduct extensive experiments across multiple graph datasets to validate the effectiveness of our approach. Our results demonstrate that NvoFDE outperforms traditional constant-order fractional and integer models across a range of tasks, showcasing its superior adaptability and performance.

[LG-13] Deferring Concept Bottleneck Models: Learning to Defer Interventions to Inaccurate Experts

Link: https://arxiv.org/abs/2503.16199
Authors: Andrea Pugnana,Riccardo Massidda,Francesco Giannini,Pietro Barbiero,Mateo Espinosa Zarlenga,Roberto Pellungrini,Gabriele Dominici,Fosca Giannotti,Davide Bacciu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Concept Bottleneck Models (CBMs) are machine learning models that improve interpretability by grounding their predictions on human-understandable concepts, allowing for targeted interventions in their decision-making process. However, when intervened on, CBMs assume the availability of humans that can identify the need to intervene and always provide correct interventions. Both assumptions are unrealistic and impractical, considering labor costs and human error-proneness. In contrast, Learning to Defer (L2D) extends supervised learning by allowing machine learning models to identify cases where a human is more likely to be correct than the model, thus leading to deferring systems with improved performance. In this work, we gain inspiration from L2D and propose Deferring CBMs (DCBMs), a novel framework that allows CBMs to learn when an intervention is needed. To this end, we model DCBMs as a composition of deferring systems and derive a consistent L2D loss to train them. Moreover, by relying on a CBM architecture, DCBMs can explain why defer occurs on the final task. Our results show that DCBMs achieve high predictive performance and interpretability at the cost of deferring more to humans.

[LG-14] Nonparametric Bellman Mappings for Value Iteration in Distributed Reinforcement Learning

Link: https://arxiv.org/abs/2503.16192
Authors: Yuki Akiyama,Konstantinos Slavakis
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:This paper introduces novel Bellman mappings (B-Maps) for value iteration (VI) in distributed reinforcement learning (DRL), where multiple agents operate over a network without a centralized fusion node. Each agent constructs its own nonparametric B-Map for VI while communicating only with direct neighbors to achieve consensus. These B-Maps operate on Q-functions represented in a reproducing kernel Hilbert space, enabling a nonparametric formulation that allows for flexible, agent-specific basis function design. Unlike existing DRL methods that restrict information exchange to Q-function estimates, the proposed framework also enables agents to share basis information in the form of covariance matrices, capturing additional structural details. A theoretical analysis establishes linear convergence rates for both Q-function and covariance-matrix estimates toward their consensus values. The optimal learning rates for consensus-based updates are dictated by the ratio of the smallest positive eigenvalue to the largest one of the network’s Laplacian matrix. Furthermore, each nodal Q-function estimate is shown to lie very close to the fixed point of a centralized nonparametric B-Map, effectively allowing the proposed DRL design to approximate the performance of a centralized fusion center. Numerical experiments on two well-known control problems demonstrate the superior performance of the proposed nonparametric B-Maps compared to prior methods. Notably, the results reveal a counter-intuitive finding: although the proposed approach involves greater information exchange – specifically through the sharing of covariance matrices – it achieves the desired performance with lower cumulative communication cost than existing DRL schemes, highlighting the crucial role of basis information in accelerating the learning process.
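
The link between consensus learning rates and the Laplacian eigenvalue ratio can be illustrated with textbook average consensus. The sketch below is not the paper's B-Map update, which the abstract does not specify. It assumes a hypothetical ring of six agents, each holding one scalar estimate, and uses the classical step size 2 / (smallest positive eigenvalue + largest eigenvalue), for which the worst-case contraction factor depends only on the ratio of those two eigenvalues.

```python
import math


def ring_laplacian_eigenvalues(n):
    """Closed-form Laplacian eigenvalues of an n-node ring graph."""
    return [2.0 - 2.0 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def consensus_step(values, alpha):
    """One synchronous update x_i <- x_i - alpha * sum_j (x_i - x_j) over the two ring neighbors."""
    n = len(values)
    return [
        values[i]
        - alpha * ((values[i] - values[i - 1]) + (values[i] - values[(i + 1) % n]))
        for i in range(n)
    ]


def run_consensus(values, rounds):
    """Iterate consensus with the step size tuned to the Laplacian spectrum."""
    eigenvalues = ring_laplacian_eigenvalues(len(values))
    positive = [lam for lam in eigenvalues if lam > 1e-9]
    lam_min, lam_max = min(positive), max(positive)
    alpha = 2.0 / (lam_min + lam_max)  # minimizes the worst-case contraction factor
    ratio = lam_min / lam_max
    rate = (1.0 - ratio) / (1.0 + ratio)  # per-round contraction of the disagreement
    for _ in range(rounds):
        values = consensus_step(values, alpha)
    return values, alpha, rate


# Hypothetical local estimates held by six agents; their average is 4.0.
local_estimates = [1.0, 5.0, 2.0, 8.0, 3.0, 5.0]
final, alpha, rate = run_consensus(local_estimates, rounds=50)
print(final, alpha, rate)
```

A ratio closer to 1 (a better-connected network) gives a smaller contraction factor and therefore faster agreement, which is the qualitative dependence the abstract describes.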

[LG-15] Manifold learning in metric spaces

Link: https://arxiv.org/abs/2503.16187
Authors: Liane Xu,Amit Singer
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Laplacian-based methods are popular for dimensionality reduction of data lying in \mathbb{R}^N. Several theoretical results for these algorithms depend on the fact that the Euclidean distance approximates the geodesic distance on the underlying submanifold which the data are assumed to lie on. However, for some applications, other metrics, such as the Wasserstein distance, may provide a more appropriate notion of distance than the Euclidean distance. We provide a framework that generalizes the problem of manifold learning to metric spaces and study when a metric satisfies sufficient conditions for the pointwise convergence of the graph Laplacian.
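
As a rough illustration of swapping the Euclidean distance for another metric, the sketch below builds an unnormalized graph Laplacian from an arbitrary distance function. The Gaussian kernel, the bandwidth value, and the toy samples are assumptions made for this example. The abstract does not say which kernel or normalization the paper analyzes. The metric used here is the 1-Wasserstein distance between equal-size empirical samples on the real line, which reduces to the mean absolute difference of the sorted values.

```python
import math


def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples on the real line."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def graph_laplacian(points, metric, bandwidth):
    """Unnormalized graph Laplacian L = D - W with weights exp(-d^2 / bandwidth)."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = metric(points[i], points[j])
            weights[i][j] = weights[j][i] = math.exp(-d * d / bandwidth)
    # Diagonal holds the degree, off-diagonal entries hold the negated weights.
    return [
        [sum(weights[i]) if i == j else -weights[i][j] for j in range(n)]
        for i in range(n)
    ]


# Each "point" is itself a small sample from a distribution (toy data).
samples = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [3.0, 4.0, 5.0]]
laplacian = graph_laplacian(samples, wasserstein_1d, bandwidth=1.0)
for row in laplacian:
    print(row)
```

Any other function satisfying the metric axioms can be passed in place of `wasserstein_1d`. Whether the resulting Laplacian converges is the question the paper studies.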

[LG-16] Variance-Aware Noisy Training: Hardening DNNs against Unstable Analog Computations

Link: https://arxiv.org/abs/2503.16183
Authors: Xiao Wang,Hendrik Borras,Bernhard Klein,Holger Fröning
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The disparity between the computational demands of deep learning and the capabilities of compute hardware is expanding drastically. Although deep learning achieves remarkable performance in countless tasks, its escalating requirements for computational power and energy consumption surpass the sustainable limits of even specialized neural processing units, including the Apple Neural Engine and NVIDIA TensorCores. This challenge is intensified by the slowdown in CMOS scaling. Analog computing presents a promising alternative, offering substantial improvements in energy efficiency by directly manipulating physical quantities such as current, voltage, charge, or photons. However, it is inherently vulnerable to manufacturing variations, nonlinearities, and noise, leading to degraded prediction accuracy. One of the most effective techniques for enhancing robustness, Noisy Training, introduces noise during the training phase to reinforce the model against disturbances encountered during inference. Although highly effective, its performance degrades in real-world environments where noise characteristics fluctuate due to external factors such as temperature variations and temporal drift. This study underscores the necessity of Noisy Training while revealing its fundamental limitations in the presence of dynamic noise. To address these challenges, we propose Variance-Aware Noisy Training, a novel approach that mitigates performance degradation by incorporating noise schedules which emulate the evolving noise conditions encountered during inference. Our method substantially improves model robustness, without training overhead. We demonstrate a significant increase in robustness, from 72.3% with conventional Noisy Training to 97.3% with Variance-Aware Noisy Training on CIFAR-10 and from 38.5% to 89.9% on Tiny ImageNet. 

[LG-17] Improving Discriminator Guidance in Diffusion Models

Link: https://arxiv.org/abs/2503.16117
Authors: Alexandre Verine,Mehdi Inane,Florian Le Bronnec,Benjamin Negrevergne,Yann Chevaleyre
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Discriminator Guidance has become a popular method for efficiently refining pre-trained Score-Matching Diffusion models. However, in this paper, we demonstrate that the standard implementation of this technique does not necessarily lead to a distribution closer to the real data distribution. Specifically, we show that training the discriminator using Cross-Entropy loss, as commonly done, can in fact increase the Kullback-Leibler divergence between the model and target distributions, particularly when the discriminator overfits. To address this, we propose a theoretically sound training objective for discriminator guidance that properly minimizes the KL divergence. We analyze its properties and demonstrate empirically across multiple datasets that our proposed method consistently improves over the conventional method by producing samples of higher quality.

[LG-18] Learn to Bid as a Price-Maker Wind Power Producer

Link: https://arxiv.org/abs/2503.16107
Authors: Shobhit Singhal,Marta Fochesato,Liviu Aolaritei,Florian Dörfler
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Wind power producers (WPPs) participating in short-term power markets face significant imbalance costs due to their non-dispatchable and variable production. While some WPPs have a large enough market share to influence prices with their bidding decisions, existing optimal bidding methods rarely account for this aspect. Price-maker approaches typically model bidding as a bilevel optimization problem, but these methods require complex market models, estimating other participants’ actions, and are computationally demanding. To address these challenges, we propose an online learning algorithm that leverages contextual information to optimize WPP bids in the price-maker setting. We formulate the strategic bidding problem as a contextual multi-armed bandit, ensuring provable regret minimization. The algorithm’s performance is evaluated against various benchmark strategies using a numerical simulation of the German day-ahead and real-time markets.

[LG-19] OThink-MR1: Stimulating multimodal generalized reasoning capabilities through dynamic reinforcement learning

Link: https://arxiv.org/abs/2503.16081
Authors: Zhiyuan Liu,Yuting Zhang,Feng Liu,Changwang Zhang,Ying Sun,Jun Wang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. While supervised fine-tuning (SFT) has been the predominant approach to enhance MLLM capabilities in task-specific optimization, it often falls short in fostering crucial generalized reasoning abilities. Despite the potential of reinforcement learning (RL) to address these limitations, it faces two issues: (1) its generalized capabilities in multimodal tasks remain underexplored, and (2) its training constraints, such as the constant Kullback-Leibler or clamp strategy, easily lead to suboptimal bottlenecks. To address these issues, we introduce OThink-MR1, a framework that extends RL to MLLMs, enabling them to achieve deeper understanding and reasoning across multimodal tasks. We design a dynamic Kullback-Leibler strategy that significantly enhances RL performance, surpassing SFT in same-task evaluations. Also, we are the first to reveal that RL exhibits remarkable cross-task generalization capabilities, which shows that models post-trained with RL on one multimodal task can be effectively transferred to other tasks. Finally, extensive experiments demonstrate the great reasoning ability of our proposed OThink-MR1.

[LG-20] TVineSynth: A Truncated C-Vine Copula Generator of Synthetic Tabular Data to Balance Privacy and Utility AISTATS2025

Link: https://arxiv.org/abs/2503.15972
Authors: Elisabeth Griesbauer,Claudia Czado,Arnoldo Frigessi,Ingrid Hobæk Haff
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at the 28th International Conference on Artificial Intelligence and Statistics (AISTATS 2025)

Abstract:We propose TVineSynth, a vine copula based synthetic tabular data generator, which is designed to balance privacy and utility, using the vine tree structure and its truncation to do the trade-off. Contrary to synthetic data generators that achieve DP by globally adding noise, TVineSynth performs a controlled approximation of the estimated data generating distribution, so that it does not suffer from poor utility of the resulting synthetic data for downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is here measured with membership (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared to competitor models, with and without DP, on simulated and on real-world data, TVineSynth achieves a superior privacy-utility balance.

[LG-21] Information maximization for a broad variety of multi-armed bandit games

Link: https://arxiv.org/abs/2503.15962
Authors: Alex Barbier-Chebbah(EPIMETHEE),Christian L. Vestergaard(EPIMETHEE),Jean-Baptiste Masson(EPIMETHEE)
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)
Comments:

Abstract:Information and free-energy maximization are physics principles that provide general rules for an agent to optimize actions in line with specific goals and policies. These principles are the building blocks for designing decision-making policies capable of efficient performance with only partial information. Notably, the information maximization principle has shown remarkable success in the classical bandit problem and has recently been shown to yield optimal algorithms for Gaussian and sub-Gaussian reward distributions. This article explores a broad extension of physics-based approaches to more complex and structured bandit problems. To this end, we cover three distinct types of bandit problems, where information maximization is adapted and leads to strong performance. Since the main challenge of information maximization lies in avoiding over-exploration, we highlight how information is tailored at various levels to mitigate this issue, paving the way for more efficient and robust decision-making strategies.

[LG-22] Multivariate Time Series Anomaly Detection in Industry 5.0

Link: https://arxiv.org/abs/2503.15946
Authors: Lorenzo Colombi,Michela Vespa,Nicolas Belletti,Matteo Brina,Simon Dahdal,Filippo Tabanelli,Elena Bellodi,Mauro Tortonesi,Cesare Stefanelli,Massimiliano Vignoli
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Industry 5.0 environments present a critical need for effective anomaly detection methods that can indicate equipment malfunctions, process inefficiencies, or potential safety hazards. The ever-increasing sensorization of manufacturing lines makes processes more observable, but also poses the challenge of continuously analyzing vast amounts of multivariate time series data. These challenges include data quality since data may contain noise, be unlabeled or even mislabeled. A promising approach consists of combining an embedding model with other Machine Learning algorithms to enhance the overall performance in detecting anomalies. Moreover, representing time series as vectors brings many advantages like higher flexibility and improved ability to capture complex temporal dependencies. We tested our solution in a real industrial use case, using data collected from a Bonfiglioli plant. The results demonstrate that, unlike traditional reconstruction-based autoencoders, which often struggle in the presence of sporadic noise, our embedding-based framework maintains high performance across various noise conditions.

[LG-23] Sample-Efficient Bayesian Transfer Learning for Online Machine Parameter Optimization

Link: https://arxiv.org/abs/2503.15928
Authors: Philipp Wagner,Tobias Nagel,Philipp Leube,Marco F. Huber
Categories: Machine Learning (cs.LG)
Comments: Accepted in IEEE Conference on Artificial Intelligence, 2025

Abstract:Correctly setting the parameters of a production machine is essential to improve product quality, increase efficiency, and reduce production costs while also supporting sustainability goals. Identifying optimal parameters involves an iterative process of producing an object and evaluating its quality. Minimizing the number of iterations is, therefore, desirable to reduce the costs associated with unsuccessful attempts. This work introduces a method to optimize the machine parameters in the system itself using a Bayesian optimization (BO) algorithm. By leveraging existing machine data, we use a transfer learning approach in order to identify an optimum with minimal iterations, resulting in a cost-effective transfer learning algorithm. We validate our approach on a laser machine for cutting sheet metal in the real world.

[LG-24] On the Limits of Applying Graph Transformers for Brain Connectome Classification

Link: https://arxiv.org/abs/2503.15902
Authors: Jose Lara-Rangel,Clare Heinbaugh
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Brain connectomes offer detailed maps of neural connections within the brain. Recent studies have proposed novel connectome graph datasets and attempted to improve connectome classification by using graph deep learning. With recent advances demonstrating transformers’ ability to model intricate relationships and outperform in various domains, this work explores their performance on the novel NeuroGraph benchmark datasets and synthetic variants derived from probabilistically removing edges to simulate noisy data. Our findings suggest that graph transformers offer no major advantage over traditional GNNs on this dataset. Furthermore, both traditional and transformer GNN models maintain accuracy even with all edges removed, suggesting that the dataset’s graph structures may not significantly impact predictions. We propose further assessing NeuroGraph as a brain connectome benchmark, emphasizing the need for well-curated datasets and improved preprocessing strategies to obtain meaningful edge connections.
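
One simple way to build synthetic variants by probabilistic edge removal is sketched below. It assumes each edge is dropped independently with the same probability, which is an assumption made for illustration. The abstract does not state the sampling scheme the authors use, and `drop_edges` is a hypothetical helper, not part of NeuroGraph.

```python
import random


def drop_edges(edges, drop_prob, seed=0):
    """Return a copy of `edges` in which each edge is removed independently with probability `drop_prob`."""
    rng = random.Random(seed)  # local generator so results are reproducible
    return [edge for edge in edges if rng.random() >= drop_prob]


# Toy undirected graph given as a list of node pairs.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]

for p in (0.0, 0.5, 1.0):
    kept = drop_edges(edges, p)
    print(p, len(kept), kept)
```

Setting `drop_prob=1.0` yields the fully edgeless graph, the extreme case in which the abstract reports that accuracy is still maintained.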

[LG-25] FedSAF: A Federated Learning Framework for Enhanced Gastric Cancer Detection and Privacy Preservation

Link: https://arxiv.org/abs/2503.15870
Authors: Yuxin Miao,Xinyuan Yang,Hongda Fan,Yichun Li,Yishu Hong,Xiechen Guo,Ali Braytee,Weidong Huang,Ali Anaissi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Gastric cancer is one of the most commonly diagnosed cancers and has a high mortality rate. Due to limited medical resources, developing machine learning models for gastric cancer recognition provides an efficient solution for medical institutions. However, such models typically require large sample sizes for training and testing, which can challenge patient privacy. Federated learning offers an effective alternative by enabling model training across multiple institutions without sharing sensitive patient data. This paper addresses the limited sample size of publicly available gastric cancer data with a modified data processing method. This paper introduces FedSAF, a novel federated learning algorithm designed to improve the performance of existing methods, particularly in non-independent and identically distributed (non-IID) data scenarios. FedSAF incorporates attention-based message passing and the Fisher Information Matrix to enhance model accuracy, while a model splitting function reduces computation and transmission costs. Hyperparameter tuning and ablation studies demonstrate the effectiveness of this new algorithm, showing improvements in test accuracy on gastric cancer datasets, with FedSAF outperforming existing federated learning methods like FedAMP, FedAvg, and FedProx. The framework’s robustness and generalization ability were further validated across additional datasets (SEED, BOT, FashionMNIST, and CIFAR-10), achieving high performance in diverse environments.

[LG-26] Network Embedding Exploration Tool (NEExT)

Link: https://arxiv.org/abs/2503.15853
Authors: Ashkan Dehghan,Paweł Prałat,François Théberge
Categories: Machine Learning (cs.LG)
Comments: 24 pages, 10 figures

Abstract:Many real-world and artificial systems and processes can be represented as graphs. Some examples of such systems include social networks, financial transactions, supply chains, and molecular structures. In many of these cases, one needs to consider a collection of graphs, rather than a single network. This could be a collection of distinct but related graphs, such as different protein structures or graphs resulting from dynamic processes on the same network. Examples of the latter include the evolution of social networks, community-induced graphs, or ego-nets around various nodes. A significant challenge commonly encountered is the absence of ground-truth labels for graphs or nodes, necessitating the use of unsupervised techniques to analyze such systems. Moreover, even when ground-truth labels are available, many existing graph machine learning methods depend on complex deep learning models, complicating model explainability and interpretability. To address some of these challenges, we have introduced NEExT (Network Embedding Exploration Tool) for embedding collections of graphs via user-defined node features. The advantages of the framework are twofold: (i) the ability to easily define your own interpretable node-based features in view of the task at hand, and (ii) fast embedding of graphs provided by the Vectorizers library. In this paper, we demonstrate the usefulness of NEExT on collections of synthetic and real-world graphs. For supervised tasks, we demonstrate that performance in graph classification tasks could be achieved similarly to other state-of-the-art techniques while maintaining model interpretability. Furthermore, our framework can also be used to generate high-quality embeddings in an unsupervised way, where target variables are not available.

[LG-27] Network-wide Freeway Traffic Estimation Using Sparse Sensor Data: A Dirichlet Graph Auto-Encoder Approach

Link: https://arxiv.org/abs/2503.15845
Authors: Qishen Zhou,Yifan Zhang,Michail A. Makridis,Anastasios Kouvelas,Yibing Wang,Simon Hu
Categories: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Network-wide Traffic State Estimation (TSE), which aims to infer a complete image of network traffic states with sparsely deployed sensors, plays a vital role in intelligent transportation systems. With the development of data-driven methods, traffic dynamics modeling has advanced significantly. However, TSE poses fundamental challenges for data-driven approaches, since historical patterns cannot be learned locally at sensor-free segments. Although inductive graph learning shows promise in estimating states at locations without sensors, existing methods typically handle unobserved locations by filling them with zeros, introducing bias to the sensitive graph message propagation. The recently proposed Dirichlet Energy-based Feature Propagation (DEFP) method achieves State-Of-The-Art (SOTA) performance in unobserved node classification by eliminating the need for zero-filling. However, applying it to TSE faces three key challenges: an inability to handle directed traffic networks, strong assumptions in traffic spatial correlation modeling, and a failure to account for the distinct propagation rules of different patterns (e.g., congestion and free flow). We propose DGAE, a novel inductive graph representation model that addresses these challenges through a theoretically derived DEFP for Directed graphs (DEFP4D), enhanced spatial representation learning via DEFP4D-guided latent space encoding, and physics-guided propagation mechanisms that separately handle congested and free-flow patterns. Experiments on three traffic datasets demonstrate that DGAE outperforms existing SOTA methods and exhibits strong cross-city transferability. Furthermore, DEFP4D can serve as a standalone lightweight solution, showing superior performance under extremely sparse sensor conditions.

[LG-28] FedAWA: Adaptive Optimization of Aggregation Weights in Federated Learning Using Client Vectors CVPR2025

Link: https://arxiv.org/abs/2503.15842
Authors: Changlong Shi,He Zhao,Bingjie Zhang,Mingyuan Zhou,Dandan Guo,Yi Chang
Categories: Machine Learning (cs.LG)
Comments: Accepted in CVPR 2025

Abstract:Federated Learning (FL) has emerged as a promising framework for distributed machine learning, enabling collaborative model training without sharing local data, thereby preserving privacy and enhancing security. However, data heterogeneity resulting from differences across user behaviors, preferences, and device characteristics poses a significant challenge for federated learning. Most previous works overlook the adjustment of aggregation weights, relying solely on dataset size for weight assignment, which often leads to unstable convergence and reduced model performance. Recently, several studies have sought to refine aggregation strategies by incorporating dataset characteristics and model alignment. However, adaptively adjusting aggregation weights while ensuring data security, without requiring additional proxy data, remains a significant challenge. In this work, we propose Federated learning with Adaptive Weight Aggregation (FedAWA), a novel method that adaptively adjusts aggregation weights based on client vectors during the learning process. The client vector captures the direction of model updates, reflecting local data variations, and is used to optimize the aggregation weight without requiring additional datasets or violating privacy. By assigning higher aggregation weights to local models whose updates align closely with the global optimization direction, FedAWA enhances the stability and generalization of the global model. Extensive experiments under diverse scenarios demonstrate the superiority of our method, providing a promising solution to the challenges of data heterogeneity in federated learning.
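
The idea of giving more weight to clients whose updates align with the global direction can be sketched as follows. This is a simplified stand-in, not FedAWA itself. The abstract says the weights are optimized, whereas this example assigns them in closed form with a softmax over cosine similarities. The use of the mean client vector as a proxy for the global optimization direction, the `temperature` parameter, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def alignment_weights(global_model, local_models, temperature=0.5):
    """Softmax weights that favor clients whose update direction matches the average update."""
    # Client vector: the local model minus the global model it started from.
    client_vectors = [
        [l - g for l, g in zip(local, global_model)] for local in local_models
    ]
    n = len(client_vectors)
    mean_direction = [sum(column) / n for column in zip(*client_vectors)]
    scores = [cosine(v, mean_direction) / temperature for v in client_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(local_models, weights):
    """Weighted average of the local models, coordinate by coordinate."""
    dim = len(local_models[0])
    return [sum(w * m[k] for w, m in zip(weights, local_models)) for k in range(dim)]


global_model = [0.0, 0.0]
local_models = [[1.0, 0.1], [0.9, -0.1], [-0.5, 0.2]]  # third client moves the other way
weights = alignment_weights(global_model, local_models)
new_global = aggregate(local_models, weights)
print(weights, new_global)
```

Only model parameters enter the computation, so no proxy dataset is needed, which mirrors the constraint highlighted in the abstract.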

[LG-29] Energy-Efficient Federated Learning and Migration in Digital Twin Edge Networks

Link: https://arxiv.org/abs/2503.15822
Authors: Yuzhi Zhou,Yaru Fu,Zheng Shi,Howard H. Yang,Kevin Hung,Yan Zhang
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:

Abstract:The digital twin edge network (DITEN) is a significant paradigm in the sixth-generation wireless system (6G) that aims to organize well-developed infrastructures to meet the requirements of evolving application scenarios. However, the impact of the interaction between the long-term DITEN maintenance and detailed digital twin tasks, which often entail privacy considerations, is commonly overlooked in current research. This paper addresses this issue by introducing a problem of digital twin association and historical data allocation for a federated learning (FL) task within DITEN. To achieve this goal, we start by introducing a closed-form function to predict the training accuracy of the FL task, referring to it as the data utility. Subsequently, we carry out comprehensive convergence analyses on the proposed FL methodology. Our objective is to jointly optimize the data utility of the digital twin-empowered FL task and the energy costs incurred by the long-term DITEN maintenance, encompassing FL model training, data synchronization, and twin migration. To tackle the aforementioned challenge, we present an optimization-driven learning algorithm that effectively identifies optimized solutions for the formulated problem. Numerical results demonstrate that our proposed algorithm outperforms various baseline approaches.

[LG-30] Control Pneumatic Soft Bending Actuator with Online Learning Pneumatic Physical Reservoir Computing

Link: https://arxiv.org/abs/2503.15819
Authors: Junyi Shen,Tetsuro Miyazaki,Kenji Kawashima
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, 13 figures, IEEE-RAS International Conference on Soft Robotics (RoboSoft 2025)

Abstract:The intrinsic nonlinearities of soft robots present significant control challenges but simultaneously provide them with rich computational potential. Reservoir computing (RC) has shown effectiveness in online learning systems for controlling nonlinear systems such as soft actuators. Conventional RC can be extended into physical reservoir computing (PRC) by leveraging the nonlinear dynamics of soft actuators for computation. This paper introduces a PRC-based online learning framework to control the motion of a pneumatic soft bending actuator, utilizing another pneumatic soft actuator as the PRC model. Unlike conventional designs requiring two RC models, the proposed control system employs a more compact architecture with a single RC model. Additionally, the framework enables zero-shot online learning, addressing limitations of previous PRC-based control systems reliant on offline training. Simulations and experiments validated the performance of the proposed system. Experimental results indicate that the PRC model achieved superior control performance compared to a linear model, reducing the root-mean-square error (RMSE) by an average of over 37% in bending motion control tasks. The proposed PRC-based online learning control framework provides a novel approach for harnessing physical systems’ inherent nonlinearities to enhance the control of soft actuators.

[LG-31] Communication Efficient Federated Learning with Linear Convergence on Heterogeneous Data

Link: https://arxiv.org/abs/2503.15804
Authors: Jie Liu,Yongqiang Wang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:By letting local clients perform multiple local updates before communicating with a parameter server, modern federated learning algorithms such as FedAvg tackle the communication bottleneck problem in distributed learning and have found many successful applications. However, this asynchrony between local updates and communication also leads to a "client-drift" problem when the data is heterogeneous (not independent and identically distributed), resulting in errors in the final learning result. In this paper, we propose a federated learning algorithm, which is called FedCET, to ensure accurate convergence even under heterogeneous distributions of data across clients. Inspired by the distributed optimization algorithm NIDS, we use learning rates to weight information received from local clients to eliminate the "client-drift". We prove that under appropriate learning rates, FedCET can ensure linear convergence to the exact solution. Different from existing algorithms which have to share both gradients and a drift-correction term to ensure accurate convergence under heterogeneous data distributions, FedCET only shares one variable, which significantly reduces communication overhead. Numerical comparison with existing counterpart algorithms confirms the effectiveness of FedCET.
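
The client-drift problem itself is easy to reproduce on a toy problem. The sketch below runs plain FedAvg, not FedCET, whose update rule the abstract does not give. It uses two hypothetical clients with scalar quadratic losses f_i(x) = 0.5 * a_i * (x - b_i)^2 that differ in both curvature and minimizer. With one local step per round FedAvg reaches the minimizer of the average loss. With several local steps it settles at a different point.

```python
def local_sgd(x, curvature, minimizer, lr, steps):
    """Run `steps` gradient steps on f(x) = 0.5 * curvature * (x - minimizer)**2."""
    for _ in range(steps):
        x -= lr * curvature * (x - minimizer)
    return x


def fedavg(curvatures, minimizers, lr, local_steps, rounds, x0=0.0):
    """Plain FedAvg: every client starts from the global model, the server averages the results."""
    x = x0
    for _ in range(rounds):
        updates = [
            local_sgd(x, a, b, lr, local_steps)
            for a, b in zip(curvatures, minimizers)
        ]
        x = sum(updates) / len(updates)
    return x


# Heterogeneous clients: different curvatures and different local minimizers.
curvatures = [1.0, 4.0]
minimizers = [0.0, 1.0]

# Minimizer of the average loss: sum(a_i * b_i) / sum(a_i).
x_star = sum(a * b for a, b in zip(curvatures, minimizers)) / sum(curvatures)

one_local_step = fedavg(curvatures, minimizers, lr=0.1, local_steps=1, rounds=200)
five_local_steps = fedavg(curvatures, minimizers, lr=0.1, local_steps=5, rounds=200)

print("exact solution:  ", x_star)
print("1 local step:    ", one_local_step)
print("5 local steps:   ", five_local_steps)
```

In this setup the five-step run converges to roughly 0.69 instead of the exact solution 0.8. A residual error of this kind is what drift-correcting methods aim to remove while still communicating only once every several local steps.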

[LG-32] Disentangling Uncertainties by Learning Compressed Data Representation

Link: https://arxiv.org/abs/2503.15801
Authors: Zhiyu An, Zhibo Hou, Wan Du
Subjects: Machine Learning (cs.LG)
Comments: Accepted by the 7th Annual Learning for Dynamics & Control Conference (L4DC) 2025

Abstract:We study aleatoric and epistemic uncertainty estimation in a learned regressive system dynamics model. Disentangling aleatoric uncertainty (the inherent randomness of the system) from epistemic uncertainty (the lack of data) is crucial for downstream tasks such as risk-aware control and reinforcement learning, efficient exploration, and robust policy transfer. While existing approaches like Gaussian Processes, Bayesian networks, and model ensembles are widely adopted, they suffer from either high computational complexity or inaccurate uncertainty estimation. To address these limitations, we propose the Compressed Data Representation Model (CDRM), a framework that learns a neural network encoding of the data distribution and enables direct sampling from the output distribution. Our approach incorporates a novel inference procedure based on Langevin dynamics sampling, allowing CDRM to predict arbitrary output distributions rather than being constrained to a Gaussian prior. Theoretical analysis provides the conditions where CDRM achieves better memory and computational complexity compared to bin-based compression methods. Empirical evaluations show that CDRM demonstrates a superior capability to identify aleatoric and epistemic uncertainties separately, achieving AUROCs of 0.8876 and 0.9981 on a single test set containing a mixture of both uncertainties. Qualitative results further show that CDRM’s capability extends to datasets with multimodal output distributions, a challenging scenario where existing methods consistently fail. Code and supplementary materials are available at this https URL.

[LG-33] DNA Bench: When Silence is Smarter – Benchmarking Over-Reasoning in Reasoning LLMs

Link: https://arxiv.org/abs/2503.15793
Authors: Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, Vikas Yadav
Subjects: Machine Learning (cs.LG)

Abstract:Test-time scaling has significantly improved large language model performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability also leads to excessive token generation and unnecessary problem-solving attempts. We introduce Don't Answer Bench (DNA Bench), a new benchmark designed to evaluate LLMs' ability to robustly understand tricky reasoning triggers and avoid unnecessary generation. DNA Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly not for many of the recent prominent LLMs. DNA Bench tests models' abilities across different capabilities, such as instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable question recognition. We evaluate reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI O3-mini, and Claude-3.7-sonnet, and compare them against a powerful non-reasoning model, GPT-4o. Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle efficiently with higher accuracy. Our findings underscore the need for more effective training and inference strategies in RLMs.

[LG-34] Line Space Clustering (LSC): Feature-Based Clustering using K-medians and Dynamic Time Warping for Versatility

Link: https://arxiv.org/abs/2503.15777
Authors: Joanikij Chulev, Angela Mladenovska
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 3 tables

Abstract:Clustering high-dimensional data is a critical challenge in machine learning due to the curse of dimensionality and the presence of noise. Traditional clustering algorithms often fail to capture the intrinsic structures in such data. This paper explores a combination of clustering methods, which we call Line Space Clustering (LSC), a representation that transforms data points into lines in a newly defined feature space, enabling clustering based on the similarity of feature value patterns, essentially treating features as sequences. LSC employs a combined distance metric that uses Euclidean and Dynamic Time Warping (DTW) distances, weighted by a parameter \alpha, allowing flexibility in emphasizing shape or magnitude similarities. We delve deeply into the mechanics of DTW and the Savitzky-Golay filter, explaining their roles in the algorithm. Extensive experiments demonstrate the efficacy of LSC on synthetic and real-world datasets, showing that randomly experimenting with time-series optimized methods sometimes might surprisingly work on a complex dataset, particularly in noisy environments. Source code and experiments are available at: this https URL.
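
The combined distance described in this abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the convex combination below (`alpha` times Euclidean plus `1 - alpha` times DTW) is an assumption, and the function names `dtw_distance` and `combined_distance` are hypothetical. The DTW routine is the textbook dynamic program with an absolute-difference cost, not the authors' implementation.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook dynamic time warping distance with |a_i - b_j| as local cost."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # a advances, b stays
                cost[i, j - 1],      # b advances, a stays
                cost[i - 1, j - 1],  # both advance
            )
    return float(cost[n, m])


def combined_distance(a, b, alpha=0.5):
    """Assumed form: alpha * Euclidean + (1 - alpha) * DTW (not taken from the paper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    euclid = float(np.linalg.norm(a - b))
    return alpha * euclid + (1.0 - alpha) * dtw_distance(a, b)


# Two feature "lines" with the same shape, shifted by one position.
x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0]
print(combined_distance(x, y, alpha=1.0))  # magnitude only (Euclidean)
print(combined_distance(x, y, alpha=0.0))  # shape only (DTW)
```

Under this assumed form, a larger `alpha` emphasizes pointwise magnitude differences, while a smaller `alpha` tolerates shifts between otherwise similar feature patterns.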

[LG-35] Prediction of Permissioned Blockchain Performance for Resource Scaling Configurations

Link: https://arxiv.org/abs/2503.15769
Authors: Seungwoo Jung, Yeonho Yoo, Gyeongsik Yang, Chuck Yoo
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Blockchain is increasingly offered as blockchain-as-a-service (BaaS) by cloud service providers. However, configuring BaaS appropriately for optimal performance and reliability comes down to trial and error. A key challenge is that BaaS is often perceived as a "black box," leading to uncertainties in performance and resource provisioning. Previous studies attempted to address this challenge; however, the impacts of both vertical and horizontal scaling remain elusive. To this end, we present machine learning-based models to predict network reliability and throughput based on scaling configurations. In our evaluation, the models exhibit prediction errors of ~1.9%, which is highly accurate and can be applied in the real world.

[LG-36] Accelerating Transient CFD through Machine Learning-Based Flow Initialization

Link: https://arxiv.org/abs/2503.15766
Authors: Peter Sharpe, Rishikesh Ranade, Sanjay Choudhry
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 17 pages, 8 figures

Abstract:Transient computational fluid dynamics (CFD) simulations are essential for many industrial applications, but a significant portion of their computational cost stems from the time needed to reach statistical steadiness from initial conditions. We present a novel machine learning-based initialization method that reduces the cost of this subsequent transient solve substantially, achieving a 50% reduction in time-to-convergence compared to traditional uniform and potential flow-based initializations. Through a case study in automotive aerodynamics using a 16.7M-cell unsteady RANS simulation, we evaluate three ML-based initialization strategies. Two of these strategies are recommended for general use: (1) a physics-informed hybrid method combining ML predictions with potential flow solutions, and (2) a more versatile approach integrating ML predictions with uniform flow. Both strategies enable CFD solvers to achieve convergence times comparable to computationally expensive steady RANS initializations, while requiring only seconds of computation. We develop a robust statistical convergence metric based on windowed time-averaging for performance comparison between initialization strategies. Notably, these improvements are achieved using an ML model trained on a different dataset of automotive geometries, demonstrating strong generalization capabilities. The proposed methods integrate seamlessly with existing CFD workflows without requiring modifications to the underlying flow solver, providing a practical approach to accelerating industrial CFD simulations through improved ML-based initialization strategies.

[LG-37] PARQ: Piecewise-Affine Regularized Quantization

Link: https://arxiv.org/abs/2503.15748
Authors: Lisa Jin, Jianhao Ma, Zechun Liu, Andrey Gromov, Aaron Defazio, Lin Xiao
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We develop a principled method for quantization-aware training (QAT) of large-scale machine learning models. Specifically, we show that convex, piecewise-affine regularization (PAR) can effectively induce the model parameters to cluster towards discrete values. We minimize PAR-regularized loss functions using an aggregate proximal stochastic gradient method (AProx) and prove that it has last-iterate convergence. Our approach provides an interpretation of the straight-through estimator (STE), a widely used heuristic for QAT, as the asymptotic form of PARQ. We conduct experiments to demonstrate that PARQ obtains competitive performance on convolution- and transformer-based vision tasks.

[LG-38] Approximation properties of neural ODEs

Link: https://arxiv.org/abs/2503.15696
Authors: Arturo De Marinis, Davide Murari, Elena Celledoni, Nicola Guglielmi, Brynjulf Owren, Francesco Tudisco
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 30 pages, 11 figures, 2 tables

Abstract:We study the approximation properties of shallow neural networks whose activation function is defined as the flow of a neural ordinary differential equation (neural ODE) at the final time of the integration interval. We prove the universal approximation property (UAP) of such shallow neural networks in the space of continuous functions. Furthermore, we investigate the approximation properties of shallow neural networks whose parameters are required to satisfy some constraints. In particular, we constrain the Lipschitz constant of the flow of the neural ODE to increase the stability of the shallow neural network, and we restrict the norm of the weight matrices of the linear layers to one to make sure that the restricted expansivity of the flow is not compensated by the increased expansivity of the linear layers. For this setting, we prove approximation bounds that tell us the accuracy to which we can approximate a continuous function with a shallow neural network with such constraints. We prove that the UAP holds if we consider only the constraint on the Lipschitz constant of the flow or the unit norm constraint on the weight matrices of the linear layers.

[LG-39] Good Actions Succeed Bad Actions Generalize: A Case Study on Why RL Generalizes Better

Link: https://arxiv.org/abs/2503.15693
Authors: Meng Song
Subjects: Machine Learning (cs.LG)

Abstract:Supervised learning (SL) and reinforcement learning (RL) are both widely used to train general-purpose agents for complex tasks, yet their generalization capabilities and underlying mechanisms are not yet fully understood. In this paper, we provide a direct comparison between SL and RL in terms of zero-shot generalization. Using the Habitat visual navigation task as a testbed, we evaluate Proximal Policy Optimization (PPO) and Behavior Cloning (BC) agents across two levels of generalization: state-goal pair generalization within seen environments and generalization to unseen environments. Our experiments show that PPO consistently outperforms BC across both zero-shot settings and performance metrics (success rate and SPL). Interestingly, even though additional optimal training data enables BC to match PPO’s zero-shot performance in SPL, it still falls significantly behind in success rate. We attribute this to a fundamental difference in how models trained by these algorithms generalize: BC-trained models generalize by imitating successful trajectories, whereas TD-based RL-trained models generalize through combinatorial experience stitching, leveraging fragments of past trajectories (mostly failed ones) to construct solutions for new tasks. This allows RL to efficiently find solutions in vast state space and discover novel strategies beyond the scope of human knowledge. Besides providing empirical evidence and understanding, we also propose practical guidelines for improving the generalization capabilities of RL and SL through algorithm design.

[LG-40] Robotic Paper Wrapping by Learning Force Control

Link: https://arxiv.org/abs/2503.15685
Authors: Hiroki Hanai, Takuya Kiyokawa, Weiwei Wan, Kensuke Harada
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Robotic packaging using wrapping paper poses significant challenges due to the material’s complex deformation properties. The packaging process itself involves multiple steps, primarily categorized as folding the paper or creating creases. Small deviations in the robot’s arm trajectory or force vector can lead to tearing or wrinkling of the paper, exacerbated by the variability in material properties. This study introduces a novel framework that combines imitation learning and reinforcement learning to enable a robot to perform each step of the packaging process efficiently. The framework allows the robot to follow approximate trajectories of the tool-center point (TCP) based on human demonstrations while optimizing force control parameters to prevent tearing or wrinkling, even with variable wrapping paper materials. The proposed method was validated through ablation studies, which demonstrated successful task completion with a significant reduction in tear and wrinkle rates. Furthermore, the force control strategy proved to be adaptable across different wrapping paper materials and robust against variations in the size of the target object.

[LG-41] Efficient Post-Hoc Uncertainty Calibration via Variance-Based Smoothing

Link: https://arxiv.org/abs/2503.15583
Authors: Fabian Denoodt, José Oramas
Subjects: Machine Learning (cs.LG)

Abstract:Since state-of-the-art uncertainty estimation methods are often computationally demanding, we investigate whether incorporating prior information can improve uncertainty estimates in conventional deep neural networks. Our focus is on machine learning tasks where meaningful predictions can be made from sub-parts of the input. For example, in speaker classification, the speech waveform can be divided into sequential patches, each containing information about the same speaker. We observe that the variance between sub-predictions serves as a reliable proxy for uncertainty in such settings. Our proposed variance-based scaling framework produces competitive uncertainty estimates in classification while being less computationally demanding and allowing for integration as a post-hoc calibration tool. This approach also leads to a simple extension of deep ensembles, improving the expressiveness of their predicted distributions.
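
The idea of using disagreement between sub-predictions as an uncertainty signal can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how the variance is turned into a calibrated output, so the temperature mapping `1 + beta * disagreement`, the parameter `beta`, and the function name `variance_smoothed_prediction` are all hypothetical. The per-patch logits stand in for a classifier applied to each sub-part of the input.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def variance_smoothed_prediction(patch_logits, beta=1.0):
    """Smooth the aggregated prediction according to patch disagreement.

    patch_logits: array of shape (num_patches, num_classes), one row per
    sub-part of the input (for example, one row per waveform patch).
    """
    patch_logits = np.asarray(patch_logits, dtype=float)
    patch_probs = softmax(patch_logits)
    # Variance across patches, averaged over classes: zero when all patches agree.
    disagreement = float(patch_probs.var(axis=0).mean())
    # Assumed mapping: more disagreement means a higher softmax temperature.
    temperature = 1.0 + beta * disagreement
    mean_logits = patch_logits.mean(axis=0)
    return softmax(mean_logits / temperature), disagreement


agreeing = np.tile([4.0, 0.0, 0.0], (4, 1))
conflicting = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
print(variance_smoothed_prediction(agreeing, beta=50.0))
print(variance_smoothed_prediction(conflicting, beta=50.0))
```

Because the sketch only post-processes existing logits, it needs no retraining, which matches the post-hoc setting the abstract describes; the specific smoothing rule would have to be replaced by the one in the paper.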

[LG-42] Performance-bounded Online Ensemble Learning Method Based on Multi-armed bandits and Its Applications in Real-time Safety Assessment

Link: https://arxiv.org/abs/2503.15581
Authors: Songqiao Hu, Zeyi Liu, Xiao He
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 14 pages, 9 figures

Abstract:Ensemble learning plays a crucial role in practical applications of online learning due to its enhanced classification performance and adaptable adjustment mechanisms. However, most weight allocation strategies in ensemble learning are heuristic, making it challenging to theoretically guarantee that the ensemble classifier outperforms its base classifiers. To address this issue, a performance-bounded online ensemble learning method based on multi-armed bandits, named PB-OEL, is proposed in this paper. Specifically, multi-armed bandit with expert advice is incorporated into online ensemble learning, aiming to update the weights of base classifiers and make predictions. A theoretical framework is established to bound the performance of the ensemble classifier relative to base classifiers. By setting expert advice of bandits, the bound exceeds the performance of any base classifier when the length of data stream is sufficiently large. Additionally, performance bounds for scenarios with limited annotations are also derived. Numerous experiments on benchmark datasets and a dataset of real-time safety assessment tasks are conducted. The experimental results validate the theoretical bound to a certain extent and demonstrate that the proposed method outperforms existing state-of-the-art methods.

[LG-43] Understanding the Generalization of In-Context Learning in Transformers: An Empirical Study

Link: https://arxiv.org/abs/2503.15579
Authors: Xingxuan Zhang, Haoran Wang, Jiansheng Li, Yuan Xue, Shikai Guan, Renzhe Xu, Hao Zou, Han Yu, Peng Cui
Subjects: Machine Learning (cs.LG)
Comments: 32 pages

Abstract:Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of Transformer architecture to learn on the fly from limited examples. While ICL underpins many LLM applications, its full potential remains hindered by a limited understanding of its generalization boundaries and vulnerabilities. We present a systematic investigation of transformers’ generalization capability with ICL relative to training data coverage by defining a task-centric framework along three dimensions: inter-problem, intra-problem, and intra-task generalization. Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. When the training data includes a greater variety of mixed tasks, it significantly enhances the generalization ability of ICL on unseen tasks and even on known simple tasks. This guides us in designing training data to maximize the diversity of tasks covered and to combine different tasks whenever possible, rather than solely focusing on the target task for testing.

[LG-44] Sparseformer: a Transferable Transformer with Multi-granularity Token Sparsification for Medical Time Series Classification

Link: https://arxiv.org/abs/2503.15578
Authors: Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Fugee Tsung
Subjects: Machine Learning (cs.LG)
Comments: 3 figures, 16 pages, 5 tables

Abstract:Medical time series (MedTS) classification is crucial for improved diagnosis in healthcare, and yet it is challenging due to the varying granularity of patterns, intricate inter-channel correlation, information redundancy, and label scarcity. While existing transformer-based models have shown promise in time series analysis, they mainly focus on forecasting and fail to fully exploit the distinctive characteristics of MedTS data. In this paper, we introduce Sparseformer, a transformer specifically designed for MedTS classification. We propose a sparse token-based dual-attention mechanism that enables global modeling and token compression, allowing dynamic focus on the most informative tokens while distilling redundant features. This mechanism is then applied to the multi-granularity, cross-channel encoding of medical signals, capturing intra- and inter-granularity correlations and inter-channel connections. The sparsification design allows our model to handle heterogeneous inputs of varying lengths and channels directly. Further, we introduce an adaptive label encoder to address label space misalignment across datasets, equipping our model with cross-dataset transferability to alleviate the medical label scarcity issue. Our model outperforms 12 baselines across seven medical datasets under supervised learning. In the few-shot learning experiments, our model also achieves superior average results. In addition, the in-domain and cross-domain experiments among three diagnostic scenarios demonstrate our model’s zero-shot learning capability. Collectively, these findings underscore the robustness and transferability of our model in various medical applications.

[LG-45] Machine Learning Techniques for Multifactor Analysis of National Carbon Dioxide Emissions

Link: https://arxiv.org/abs/2503.15574
Authors: Wenjia Xie, Jinhui Li, Kai Zong, Luis Seco
Subjects: Machine Learning (cs.LG)

Abstract:This paper presents a comprehensive study leveraging Support Vector Machine (SVM) regression and Principal Component Regression (PCR) to analyze carbon dioxide emissions in a global dataset of 62 countries and their dependence on idiosyncratic, country-specific parameters. The objective is to understand the factors contributing to carbon dioxide emissions and identify the most predictive elements. The analysis provides country-specific emission estimates, highlighting diverse national trajectories and pinpointing areas for targeted interventions in climate change mitigation, sustainable development, and the growing carbon credit markets and green finance sector. The study aims to support policymaking with accurate representations of carbon dioxide emissions, offering nuanced information for formulating effective strategies to address climate change while informing initiatives related to carbon trading and environmentally sustainable investments.

[LG-46] Neuronal Activation States as Sample Embeddings for Data Selection in Task-Specific Instruction Tuning

Link: https://arxiv.org/abs/2503.15573
Authors: Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie Luo, Lei Pan, Shuai Fan, Lu Chen, Kai Yu
Subjects: Machine Learning (cs.LG)
Comments: preprint

Abstract:Task-specific instruction tuning enhances the performance of large language models (LLMs) on specialized tasks, yet efficiently selecting relevant data for this purpose remains a challenge. Inspired by neural coactivation in the human brain, we propose a novel data selection method called NAS, which leverages neuronal activation states as embeddings for samples in the feature space. Extensive experiments show that NAS outperforms classical data selection methods in terms of both effectiveness and robustness across different models, datasets, and selection ratios.

[LG-47] LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts

Link: https://arxiv.org/abs/2503.15571
Authors: Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran
Subjects: Software Engineering (cs.SE); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 21 pages

Abstract:Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of code datasets for Large Language Models (code-LLMs), where data quality directly influences tasks such as code generation and summarization. Characterizing code datasets in terms of programming language concepts enables better insights and targeted data curation. Our proposed methodology decomposes code data profiling into two phases: (1) an offline phase where LLMs are leveraged to derive and learn rules for extracting syntactic and semantic concepts across various programming languages, including previously unseen or low-resource languages, and (2) an online deterministic phase applying these derived rules for efficient real-time analysis. This hybrid approach is customizable, extensible to new syntactic and semantic constructs, and scalable to multiple languages. Experimentally, our LLM-aided method achieves a mean accuracy of 90.33% for syntactic extraction rules and semantic classification accuracies averaging 80% and 77% across languages and semantic concepts, respectively.

[LG-48] RAG-based User Profiling for Precision Planning in Mixed-precision Over-the-Air Federated Learning

Link: https://arxiv.org/abs/2503.15569
Authors: Jinsheng Yuan, Yun Tang, Weisi Guo
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 5 pages, 4 figures, 2 tables, submitted to IEEE VTC 2025 fall for possible publication

Abstract:Mixed-precision computing, a widely applied technique in AI, offers a larger trade-off space between accuracy and efficiency. The recently proposed Mixed-Precision Over-the-Air Federated Learning (MP-OTA-FL) enables clients to operate at appropriate precision levels based on their heterogeneous hardware, taking advantage of the larger trade-off space while covering the quantization overheads in the mixed-precision modulation scheme for the OTA aggregation process. A key to further exploring the potential of the MP-OTA-FL framework is the optimization of client precision levels. The choice of precision level hinges on multifaceted factors including hardware capability, potential client contribution, and user satisfaction, some of which can be difficult to define or quantify. In this paper, we propose a RAG-based User Profiling for precision planning framework that integrates retrieval-augmented LLMs and dynamic client profiling to optimize satisfaction and contributions. This includes a hybrid interface for gathering device/user insights and a RAG database storing historical quantization decisions with feedback. Experiments show that our method boosts satisfaction, energy savings, and global model accuracy in MP-OTA-FL systems.

[LG-49] Mixed precision accumulation for neural network inference guided by componentwise forward error analysis

Link: https://arxiv.org/abs/2503.15568
Authors: El-Mehdi El Arar, Silviu-Ioan Filip (TARAN), Theo Mary (PEQUAN), Elisa Riccietti (ENS de Lyon)
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a practical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost–accuracy tradeoff compared with uniform precision accumulation baselines.
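
The three-step procedure in this abstract (compute in low precision, estimate condition numbers, recompute only the ill-conditioned components) can be sketched for a single linear layer. This is a simplified illustration, not the authors' algorithm: it uses only the inner-product condition number `sum_j |w_ij x_j| / |sum_j w_ij x_j|` and omits the activation-function factor, and the names `accumulate`, `mixed_precision_matvec`, and the threshold `kappa_max` are hypothetical. Low-precision accumulation is emulated with an explicit loop that rounds the running sum to `float16` after every step.

```python
import numpy as np


def accumulate(W, x, dtype):
    """Row-wise inner products with the running sum rounded to `dtype` each step."""
    Wd, xd = W.astype(dtype), x.astype(dtype)
    acc = np.zeros(W.shape[0], dtype=dtype)
    for j in range(W.shape[1]):
        acc = (acc + Wd[:, j] * xd[j]).astype(dtype)
    return acc


def mixed_precision_matvec(W, x, kappa_max=50.0, eps=1e-6):
    """Compute W @ x in float16, then redo ill-conditioned rows in float32."""
    # Step 1: every component in low precision.
    y_low = accumulate(W, x, np.float16).astype(np.float32)
    # Step 2: estimate each row's condition number from low-precision quantities.
    abs_sum = accumulate(np.abs(W), np.abs(x), np.float16).astype(np.float32)
    kappa = abs_sum / np.maximum(np.abs(y_low), eps)
    # Step 3: recompute only the rows whose estimate exceeds the threshold.
    redo = kappa > kappa_max
    y = y_low.copy()
    if redo.any():
        y[redo] = accumulate(W[redo], x, np.float32)
    return y, redo


# Row 0 has no cancellation; row 1 sums terms that almost cancel.
W = np.vstack([np.full(64, 0.5), np.tile([1.001, -1.0], 32)])
x = np.ones(64)
y, redo = mixed_precision_matvec(W, x)
print(y, redo)
```

In this toy input only the second row is flagged for recomputation, so the extra higher-precision work scales with the number of ill-conditioned components rather than with the layer size.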

[LG-50] Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling

Link: https://arxiv.org/abs/2503.15567
Authors: Yanchen Luo, Zhiyuan Liu, Yi Zhao, Sihang Li, Kenji Kawaguchi, Tat-Seng Chua, Xiang Wang
Subjects: Machine Learning (cs.LG)

Abstract:3D molecule generation is crucial for drug discovery and material science, requiring models to process complex multi-modalities, including atom types, chemical bonds, and 3D coordinates. A key challenge is integrating these modalities of different shapes while maintaining SE(3) equivariance for 3D coordinates. To achieve this, existing approaches typically maintain separate latent spaces for invariant and equivariant modalities, reducing efficiency in both training and sampling. In this work, we propose Unified Variational Auto-Encoder for 3D Molecular Latent Diffusion Modeling (UAE-3D), a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space, while maintaining near-zero reconstruction error. This unified latent space eliminates the complexities of handling multi-modality and equivariance when performing latent diffusion modeling. We demonstrate this by employing the Diffusion Transformer (a general-purpose diffusion model without any molecular inductive bias) for latent generation. Extensive experiments on GEOM-Drugs and QM9 datasets demonstrate that our method significantly establishes new benchmarks in both de novo and conditional 3D molecule generation, achieving leading efficiency and quality.

[LG-51] Enforcing Consistency and Fairness in Multi-level Hierarchical Classification with a Mask-based Output Layer

Link: https://arxiv.org/abs/2503.15566
Authors: Shijing Chen, Shoaib Jameel, Mohamed Reda Bouadjenek, Feilong Tang, Usman Naseem, Basem Suleiman, Hakim Hacid, Flora D. Salim, Imran Razzak
Subjects: Machine Learning (cs.LG)
Comments: 14 pages, 14 figures. arXiv admin note: text overlap with arXiv:2501.06827

Abstract:Traditional Multi-level Hierarchical Classification (MLHC) classifiers often rely on backbone models with n independent output layers. This structure tends to overlook the hierarchical relationships between classes, leading to inconsistent predictions that violate the underlying taxonomy. Additionally, once a backbone architecture for an MLHC classifier is selected, adapting the model to accommodate new tasks can be challenging. For example, incorporating fairness to protect sensitive attributes within a hierarchical classifier necessitates complex adjustments to maintain the class hierarchy while enforcing fairness constraints. In this paper, we extend this concept to hierarchical classification by introducing a fair, model-agnostic layer designed to enforce taxonomy and optimize specific objectives, including consistency, fairness, and exact match. Our evaluations demonstrate that the proposed layer not only improves the fairness of predictions but also enforces the taxonomy, resulting in consistent predictions and superior performance. Compared to Large Language Models (LLMs) employing in-processing de-biasing techniques and models without any bias correction, our approach achieves better outcomes in both fairness and accuracy, making it particularly valuable in sectors like e-commerce, healthcare, and education, where predictive reliability is crucial.

[LG-52] GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction ICDE2025

Link: https://arxiv.org/abs/2503.15564
Authors: Tung Sum Thomas Kwok, Chi-Hua Wang, Guang Cheng
Subjects: Machine Learning (cs.LG)
Comments: Accepted by the Data Engineering Meets Large Language Models: Challenges and Opportunities Workshop at ICDE 2025

Abstract:Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework’s performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting LLM’s ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves LLM’s understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.

[LG-53] Dynamic Power Flow Analysis and Fault Characteristics: A Graph Attention Neural Network

Link: https://arxiv.org/abs/2503.15563
Authors: Tan Le, Van Le
Subjects: Machine Learning (cs.LG)
Comments: The 2025 International Conference on the AI Revolution: Research, Ethics, and Society (AIR-RES 2025)

Abstract:We propose the joint graph attention neural network (GAT), clustering with adaptive neighbors (CAN) and probabilistic graphical model for dynamic power flow analysis and fault characteristics. In fact, computational efficiency is the main focus to enhance, whilst we ensure the performance accuracy at the accepted level. Note that Machine Learning (ML) based schemes have a requirement of sufficient labeled data during training, which is not easily satisfied in practical applications. Also, there are unknown data due to new arrived measurements or incompatible smart devices in complex smart grid systems. These problems would be resolved by our proposed GAT based framework, which models the label dependency between the network data and learns object representations such that it could achieve the semi-supervised fault diagnosis. To create the joint label dependency, we develop the graph construction from the raw acquired signals by using CAN. Next, we develop the probabilistic graphical model of Markov random field for graph representation, which supports for the GAT based framework. We then evaluate the proposed framework in the use-case application in smart grid and make a fair comparison to the existing methods.

[LG-54] Localized Physics-informed Gaussian Processes with Curriculum Training for Topology Optimization

Link: https://arxiv.org/abs/2503.15561
Authors: Amin Yousefpour, Shirin Hosseinmardi, Xiangyu Sun, Ramin Bostanabad
Subjects: Machine Learning (cs.LG)

Abstract:We introduce a simultaneous and meshfree topology optimization (TO) framework based on physics-informed Gaussian processes (GPs). Our framework endows all design and state variables with GP priors which have a shared, multi-output mean function that is parametrized via a customized deep neural network (DNN). The parameters of this mean function are estimated by minimizing a multi-component loss function that depends on the performance metric, design constraints, and the residuals on the state equations. Our TO approach yields well-defined material interfaces and has a built-in continuation nature that promotes global optimality. Other unique features of our approach include (1) its customized DNN which, unlike fully connected feed-forward DNNs, has a localized learning capacity that enables capturing intricate topologies and reducing residuals in high gradient fields, (2) its loss function that leverages localized weights to promote solution accuracy around interfaces, and (3) its use of curriculum training to avoid local minima. To demonstrate the power of our framework, we validate it against the commercial TO package COMSOL on three problems involving dissipated power minimization in Stokes flow.

[LG-55] Temporal Context Awareness: A Defense Framework Against Multi-turn Manipulation Attacks on Large Language Models

Link: https://arxiv.org/abs/2503.15560
Authors: Prashant Kulkarni, Assaf Namer
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, IEEE CAI

Abstract:Large Language Models (LLMs) are increasingly vulnerable to sophisticated multi-turn manipulation attacks, where adversaries strategically build context through seemingly benign conversational turns to circumvent safety measures and elicit harmful or unauthorized responses. These attacks exploit the temporal nature of dialogue to evade single-turn detection methods, representing a critical security vulnerability with significant implications for real-world deployments. This paper introduces the Temporal Context Awareness (TCA) framework, a novel defense mechanism designed to address this challenge by continuously analyzing semantic drift, cross-turn intention consistency and evolving conversational patterns. The TCA framework integrates dynamic context embedding analysis, cross-turn consistency verification, and progressive risk scoring to detect and mitigate manipulation attempts effectively. Preliminary evaluations on simulated adversarial scenarios demonstrate the framework’s potential to identify subtle manipulation patterns often missed by traditional detection techniques, offering a much-needed layer of security for conversational AI systems. In addition to outlining the design of TCA, we analyze diverse attack vectors and their progression across multi-turn conversations, providing valuable insights into adversarial tactics and their impact on LLM vulnerabilities. Our findings underscore the pressing need for robust, context-aware defenses in conversational AI systems and highlight the TCA framework as a promising direction for securing LLMs while preserving their utility in legitimate applications. We make our implementation available to support further research in this emerging area of AI security.

[LG-56] Advanced Relay-Based Collaborative Framework for Optimizing Synchronization in Split Federated Learning over Wireless Networks

Link: https://arxiv.org/abs/2503.15559
Authors: Haoran Gao, Samuel D. Okegbile, Jun Cai
Subjects: Machine Learning (cs.LG)

Abstract:Split Federated Learning (SFL) offers a promising approach for distributed model training in edge computing, combining the strengths of split learning in reducing computational demands on edge devices and enhancing data privacy, with the role of federated aggregation to ensure model convergence and synchronization across users. However, synchronization issues caused by user heterogeneity have hindered the development of the framework. To optimize synchronization efficiency among users and improve overall system performance, we propose a collaborative SFL framework (CSFL). Based on the model’s partitioning capabilities, we design a mechanism called the collaborative relay optimization mechanism (CROM), where the assistance provided by high-efficiency users is seen as a relay process, with the portion of the model they compute acting as the relay point. Wireless communication between users facilitates real-time collaboration, allowing high-efficiency users to assist bottleneck users in handling part of the model’s computation, thereby alleviating the computational load on bottleneck users. Simulation results show that our proposed CSFL framework reduces synchronization delays and improves overall system throughput while maintaining similar performance and convergence rate to the SFL framework. This demonstrates that the collaboration not only reduces synchronization waiting time but also accelerates model convergence.

[LG-57] A Comprehensive Study of LLM Secure Code Generation

Link: https://arxiv.org/abs/2503.15554
Authors: Shih-Chieh Dai, Jun Xu, Guanhong Tao
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:LLMs are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current evaluation schemes leave several concerns unaddressed. Specifically, most existing studies evaluate security and functional correctness separately, using different datasets. That is, they assess vulnerabilities using security-related code datasets while validating functionality with general code datasets. In addition, prior research primarily relies on a single static analyzer, CodeQL, to detect vulnerabilities in generated code, which limits the scope of security evaluation. In this work, we conduct a comprehensive study to systematically assess the improvements introduced by four state-of-the-art secure code generation techniques. Specifically, we apply both security inspection and functionality validation to the same generated code and evaluate these two aspects together. We also employ three popular static analyzers and two LLMs to identify potential vulnerabilities in the generated code. Our study reveals that existing techniques often compromise the functionality of generated code to enhance security. Their overall performance remains limited when evaluating security and functionality together. In fact, many techniques even degrade the performance of the base LLM. Our further inspection reveals that these techniques often either remove vulnerable lines of code entirely or generate “garbage code” that is unrelated to the intended task. Moreover, the commonly used static analyzer CodeQL fails to detect several vulnerabilities, further obscuring the actual security improvements achieved by existing techniques. Our study serves as a guideline for a more rigorous and comprehensive evaluation of secure code generation performance in future work.

[LG-58] Data-Driven Approximation of Binary-State Network Reliability Function: Algorithm Selection and Reliability Thresholds for Large-Scale Systems

Link: https://arxiv.org/abs/2503.15545
Authors: Wei-Chang Yeh
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)

Abstract:Network reliability assessment is pivotal for ensuring the robustness of modern infrastructure systems, from power grids to communication networks. While exact reliability computation for binary-state networks is NP-hard, existing approximation methods face critical tradeoffs between accuracy, scalability, and data efficiency. To address these gaps, this study evaluates 20 machine learning methods across three reliability regimes: full range (0.0-1.0), high reliability (0.9-1.0), and ultra-high reliability (0.99-1.0). We demonstrate that large-scale networks with arc reliability larger than or equal to 0.9 exhibit near-unity system reliability, enabling computational simplifications. Further, we establish a dataset-scale-driven paradigm for algorithm selection: Artificial Neural Networks (ANN) excel with limited data, while Polynomial Regression (PR) achieves superior accuracy in data-rich environments. Our findings reveal ANN’s Test-MSE of 7.24E-05 at 30,000 samples and PR’s optimal performance (5.61E-05) at 40,000 samples, outperforming traditional Monte Carlo simulations. These insights provide actionable guidelines for balancing accuracy, interpretability, and computational efficiency in reliability engineering, with implications for infrastructure resilience and system optimization.
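
For readers unfamiliar with the Monte Carlo baseline this abstract compares against, here is a minimal sketch. It is not the paper's code. It assumes an undirected network, independent arc failures with one shared arc reliability `p`, and two-terminal connectivity as the success criterion; the names `is_connected` and `mc_reliability` are hypothetical.

```python
import random


def is_connected(n_nodes, arcs, arc_up, source, target):
    """Depth-first search over the arcs that are currently working."""
    adjacency = {node: [] for node in range(n_nodes)}
    for (u, v), working in zip(arcs, arc_up):
        if working:
            adjacency[u].append(v)
            adjacency[v].append(u)
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def mc_reliability(n_nodes, arcs, p, source, target, n_samples=20000, seed=0):
    """Estimate P(source and target are connected) by sampling arc states.

    Assumption: every arc works independently with the same probability p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        arc_up = [rng.random() < p for _ in arcs]
        if is_connected(n_nodes, arcs, arc_up, source, target):
            hits += 1
    return hits / n_samples


# Two arcs in series: the exact reliability is p * p = 0.81 for p = 0.9.
series_arcs = [(0, 1), (1, 2)]
print(mc_reliability(3, series_arcs, 0.9, 0, 2))
```

The estimate's standard error shrinks only as one over the square root of the sample count, which is one reason learned surrogates are attractive for large networks.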

[LG-59] The Trust Calibration Maturity Model for Characterizing and Communicating Trustworthiness of AI Systems

Link: https://arxiv.org/abs/2503.15511
Authors: Scott T Steinmetz, Asmeret Naugle, Paul Schutte, Matt Sweitzer, Alex Washburne, Lisa Linville, Daniel Krofcheck, Michal Kucer, Samuel Myren
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures, 2 tables

Abstract:The proliferation of powerful AI capabilities and systems necessitates a commensurate focus on user trust. We introduce the Trust Calibration Maturity Model (TCMM) to capture and communicate the maturity of AI system trustworthiness. The TCMM scores maturity along 5 dimensions that drive user trust: Performance Characterization, Bias & Robustness Quantification, Transparency, Safety & Security, and Usability. Information captured in the TCMM can be presented along with system performance information to help a user to appropriately calibrate trust, to compare requirements with current states of development, and to clarify trustworthiness needs. We present the TCMM and demonstrate its use on two AI system-target task pairs.

[LG-60] The global convergence time of stochastic gradient descent in non-convex landscapes: Sharp estimates via large deviations

Link: https://arxiv.org/abs/2503.16398
Authors: Waïss Azizian, Franck Iutzeler, Jérôme Malick, Panayotis Mertikopoulos
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 62 pages, 5 figures

Abstract:In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, non-convex loss function. We approach this question through the lens of randomly perturbed dynamical systems and large deviations theory, and we provide a tight characterization of the global convergence time of SGD via matching upper and lower bounds. These bounds are dominated by the most “costly” set of obstacles that the algorithm may need to overcome to reach a global minimizer from a given initialization, coupling in this way the global geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we also provide a series of refinements and extensions of our analysis for loss functions with shallow local minima.

[LG-61] Sparse Nonparametric Contextual Bandits

Link: https://arxiv.org/abs/2503.16382
Authors: Hamish Flynn, Julia Olkhovskaya, Paul Rognon-Vael
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 45 pages

Abstract:This paper studies the problem of simultaneously learning relevant features and minimising regret in contextual bandit problems. We introduce and analyse a new class of contextual bandit problems, called sparse nonparametric contextual bandits, in which the expected reward function lies in the linear span of a small unknown set of features that belongs to a known infinite set of candidate features. We consider two notions of sparsity, for which the set of candidate features is either countable or uncountable. Our contribution is two-fold. First, we provide lower bounds on the minimax regret, which show that polynomial dependence on the number of actions is generally unavoidable in this setting. Second, we show that a variant of the Feel-Good Thompson Sampling algorithm enjoys regret bounds that match our lower bounds up to logarithmic factors of the horizon, and have logarithmic dependence on the effective number of candidate features. When we apply our results to kernelised and neural contextual bandits, we find that sparsity always enables better regret bounds, as long as the horizon is large enough relative to the sparsity and the number of actions.

[LG-62] Enhancing variational quantum algorithms by balancing training on classical and quantum hardware

Link: https://arxiv.org/abs/2503.16361
Authors: Rahul Bhowmick, Harsh Wadhwa, Avinash Singh, Tania Sidana, Quoc Hoan Tran, Krishna Kumar Sabapathy
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 28 pages, 13 figures, 5 tables, 4 algorithms

Abstract:Quantum computers offer a promising route to tackling problems that are classically intractable, such as prime factorization, large-scale linear algebra and the simulation of complex quantum systems, but require fault-tolerant quantum hardware. On the other hand, variational quantum algorithms (VQAs) have the potential to provide a near-term route to quantum utility or advantage, and are usually constructed by using parametrized quantum circuits (PQCs) in combination with a classical optimizer for training. Although VQAs have been proposed for a multitude of tasks such as ground-state estimation, combinatorial optimization and unitary compilation, there remain major challenges in their trainability and resource costs on quantum hardware. Here we address these challenges by adopting the Hardware Efficient and dynamical LIe algebra Supported Ansatz (HELIA), and propose two training schemes that combine an existing g-sim method (that uses the underlying group structure of the operators) and the Parameter-Shift Rule (PSR). Our improvement comes from distributing the resources required for gradient estimation and training to both classical and quantum hardware. We numerically test our proposal for ground-state estimation using the Variational Quantum Eigensolver (VQE) and classification of quantum phases using quantum neural networks. Our methods show better accuracy and success of trials, and also need fewer calls to the quantum hardware on average (up to 60% reduction) than using only PSR, which runs exclusively on quantum hardware. We also numerically demonstrate the capability of HELIA in mitigating barren plateaus, paving the way for training large-scale quantum models.
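
To make the Parameter-Shift Rule mentioned above concrete, the following is a minimal single-qubit sketch, not the paper's HELIA or g-sim implementation. It assumes a circuit RY(θ) applied to |0⟩ followed by a Z measurement, simulated classically so that the result can be checked against the analytic derivative; the function names `expectation_z` and `parameter_shift_grad` are hypothetical.

```python
import math


def expectation_z(theta):
    """<Z> after RY(theta) acts on |0>; the state is [cos(theta/2), sin(theta/2)]."""
    amp0 = math.cos(theta / 2.0)
    amp1 = math.sin(theta / 2.0)
    return amp0 * amp0 - amp1 * amp1  # equals cos(theta)


def parameter_shift_grad(f, theta, shift=math.pi / 2.0):
    """Parameter-shift gradient for a gate exp(-i * theta * P / 2), P a Pauli operator.

    The gradient is obtained from two evaluations of the same circuit
    at shifted parameter values, with no finite-difference error.
    """
    return (f(theta + shift) - f(theta - shift)) / 2.0


theta = 0.7
# The two printed values agree: the analytic derivative of cos(theta) is -sin(theta).
print(parameter_shift_grad(expectation_z, theta), -math.sin(theta))
```

On hardware, each call to `f` would be a batch of circuit executions, so every parameter costs two such batches per gradient; that per-parameter cost is what makes a PSR-only scheme expensive in quantum hardware calls.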

[LG-63] Optimal Complexity in Byzantine-Robust Distributed Stochastic Optimization with Data Heterogeneity

Link: https://arxiv.org/abs/2503.16337
Authors: Qiankun Shi, Jie Peng, Kun Yuan, Xiao Wang, Qing Ling
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:In this paper, we establish tight lower bounds for Byzantine-robust distributed first-order stochastic optimization methods in both strongly convex and non-convex stochastic optimization. We reveal that when the distributed nodes have heterogeneous data, the convergence error comprises two components: a non-vanishing Byzantine error and a vanishing optimization error. We establish the lower bounds on the Byzantine error and on the minimum number of queries to a stochastic gradient oracle required to achieve an arbitrarily small optimization error. Nevertheless, we identify significant discrepancies between our established lower bounds and the existing upper bounds. To fill this gap, we leverage the techniques of Nesterov’s acceleration and variance reduction to develop novel Byzantine-robust distributed stochastic optimization methods that provably match these lower bounds, up to logarithmic factors, implying that our established lower bounds are tight.

[LG-64] NeuralFoil: An Airfoil Aerodynamics Analysis Tool Using Physics-Informed Machine Learning

Link: https://arxiv.org/abs/2503.16323
Authors: Peter Sharpe, R. John Hansman
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 42 pages, 14 figures

Abstract:NeuralFoil is an open-source Python-based tool for rapid aerodynamics analysis of airfoils, similar in purpose to XFoil. Speedups ranging from 8x to 1,000x over XFoil are demonstrated, after controlling for equivalent accuracy. NeuralFoil computes both global and local quantities (lift, drag, velocity distribution, etc.) over a broad input space, including: an 18-dimensional space of airfoil shapes, possibly including control deflections; a 360 degree range of angles of attack; Reynolds numbers from 10^2 to 10^10; subsonic flows up to the transonic drag rise; and with varying turbulence parameters. Results match those of XFoil closely: the mean relative error of drag is 0.37% on simple cases, and remains as low as 2.0% on a test dataset with numerous post-stall and transitional cases. NeuralFoil facilitates gradient-based design optimization, due to its C^∞-continuous solutions, automatic-differentiation-compatibility, and bounded computational cost without non-convergence issues. NeuralFoil is a hybrid of physics-informed machine learning techniques and analytical models. Here, physics information includes symmetries that are structurally embedded into the model architecture, feature engineering using domain knowledge, and guaranteed extrapolation to known limit cases. This work also introduces a new approach for surrogate model uncertainty quantification that enables robust design optimization. This work discusses the methodology and performance of NeuralFoil with several case studies, including a practical airfoil design optimization study including both aerodynamic and non-aerodynamic constraints. Here, NeuralFoil optimization is able to produce airfoils nearly identical in performance and shape to expert-designed airfoils within seconds; these computationally-optimized airfoils provide a useful starting point for further expert refinement.

[LG-65] Active Learning For Repairable Hardware Systems With Partial Coverage

Link: https://arxiv.org/abs/2503.16315
Authors: Michael Potter, Beyza Kalkanlı, Deniz Erdoğmuş, Michael Everett
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: Submitted to IEEE Reliability and Maintainability Symposium - Europe 2025

Abstract:Identifying the optimal diagnostic test and hardware system instance to infer reliability characteristics using field data is challenging, especially when constrained by fixed budgets and minimal maintenance cycles. Active Learning (AL) has shown promise for parameter inference with limited data and budget constraints in machine learning/deep learning tasks. However, AL for reliability model parameter inference remains underexplored for repairable hardware systems. It requires specialized AL Acquisition Functions (AFs) that consider hardware aging and the fact that a hardware system consists of multiple sub-systems, which may undergo only partial testing during a given diagnostic test. To address these challenges, we propose a relaxed Mixed Integer Semidefinite Program (MISDP) AL AF that incorporates Diagnostic Coverage (DC), Fisher Information Matrices (FIMs), and diagnostic testing budgets. Furthermore, we design empirical-based simulation experiments focusing on two diagnostic testing scenarios: (1) partial tests of a hardware system with overlapping subsystem coverage, and (2) partial tests where one diagnostic test fully subsumes the subsystem coverage of another. We evaluate our proposed approach against the most widely used AL AF in the literature (entropy), as well as several intuitive AL AFs tailored for reliability model parameter inference. Our proposed AF ranked best on average among the alternative AFs across 6,000 experimental configurations, with respect to Area Under the Curve (AUC) of the Absolute Total Expected Event Error (ATEER) and Mean Squared Error (MSE) curves, with statistical significance calculated at a 0.05 alpha level using a Friedman hypothesis test.

[LG-66] Interpretable Neural Causal Models with TRAM-DAGs

Link: https://arxiv.org/abs/2503.16206
Authors: Beate Sick, Oliver Dürr
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at the CLeaR 2025 Conference

Abstract:The ultimate goal of most scientific studies is to understand the underlying causal mechanism between the involved variables. Structural causal models (SCMs) are widely used to represent such causal mechanisms. Given an SCM, causal queries on all three levels of Pearl’s causal hierarchy can be answered: L_1 observational, L_2 interventional, and L_3 counterfactual. An essential aspect of modeling the SCM is to model the dependency of each variable on its causal parents. Traditionally this is done by parametric statistical models, such as linear or logistic regression models. This makes it possible to handle all kinds of data types and fit interpretable models but bears the risk of introducing a bias. More recently, neural causal models have emerged that use neural networks (NNs) to model the causal relationships, allowing the estimation of nearly any underlying functional form without bias. However, current neural causal models are generally restricted to continuous variables and do not yield an interpretable form of the causal relationships. Transformation models (TRAMs) range from simple statistical regressions to complex networks and can handle continuous, ordinal, and binary data. Here, we propose to use TRAMs to model the functional relationships in SCMs, allowing us to bridge the gap between interpretability and flexibility in causal modeling. We call this method TRAM-DAG and currently assume that the underlying directed acyclic graph is known. For the fully observed case, we benchmark TRAM-DAGs against state-of-the-art statistical and NN-based causal models. We show that TRAM-DAGs are interpretable but also achieve equal or superior performance in queries ranging from L_1 to L_3 in the causal hierarchy. For the continuous case, TRAM-DAGs allow for counterfactual queries for three common causal structures, including unobserved confounding.

[LG-67] Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time

Link: https://arxiv.org/abs/2503.16123
Authors: Runze You, Shi Pu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We study a distributed learning problem in which n agents, each with potentially heterogeneous local data, collaboratively minimize the sum of their local cost functions via peer-to-peer communication. We propose a novel algorithm, Spanning Tree Push-Pull (STPP), which employs two spanning trees extracted from a general communication graph to distribute both model parameters and stochastic gradients. Unlike prior approaches that rely heavily on spectral gap properties, STPP leverages a more flexible topological characterization, enabling robust information flow and efficient updates. Theoretically, we prove that STPP achieves linear speedup and polynomial transient iteration complexity, up to O(n^7) for smooth nonconvex objectives and Õ(n^3) for smooth strongly convex objectives, under arbitrary network topologies. Moreover, compared with the existing methods, STPP achieves faster convergence rates on sparse and non-regular topologies (e.g., directed ring) and reduces communication overhead on dense networks (e.g., static exponential graph). These results significantly advance the state of the art, especially when n is large. Numerical experiments further demonstrate the strong performance of STPP and confirm the practical relevance of its theoretical convergence rates across various common graph architectures. Our code is available at this https URL.

[LG-68] Patch-based learning of adaptive Total Variation parameter maps for blind image denoising

Link: https://arxiv.org/abs/2503.16010
Authors: Claudio Fantasia, Luca Calatroni, Xavier Descombes, Rim Rekik
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:We consider a patch-based learning approach defined in terms of neural networks to estimate spatially adaptive regularisation parameter maps for image denoising with weighted Total Variation, and apply it to situations where the noise distribution is unknown. As an example, we consider situations where noise could be either Gaussian or Poisson and perform preliminary model selection by a standard binary classification network. Then, we define a patch-based approach where at each image pixel an optimal weighting between TV regularisation and the corresponding data fidelity is learned in a supervised way, in a sliding window fashion, using reference natural image patches and optimising SSIM. Extensive numerical results are reported for both noise models, showing significant improvement w.r.t. results obtained by means of optimal scalar regularisation.

[LG-69] Big data comparison of quantum invariants

Link: https://arxiv.org/abs/2503.15810
Authors: Daniel Tubbenhauer, Victor Zhang
Subjects: Geometric Topology (math.GT); Machine Learning (cs.LG); Quantum Algebra (math.QA)
Comments: 26 pages, many figures, comments welcome

Abstract:We apply big data techniques, including exploratory and topological data analysis, to investigate quantum invariants. More precisely, our study explores the Jones polynomial’s structural properties and contrasts its behavior under four principal methods of enhancement: coloring, rank increase, categorification, and leaving the realm of Lie algebras.

[LG-70] Using machine learning to map simulated noisy and laser-limited multidimensional spectra to molecular electronic couplings

Link: https://arxiv.org/abs/2503.15706
Authors: Jonathan D. Schultz, Kelsey A. Parker, Bashir Sbaiti, David N. Beratan
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 24 pages, 15 figures

Abstract:Two-dimensional electronic spectroscopy (2DES) has enabled significant discoveries in both biological and synthetic energy-transducing systems. Although deriving chemical information from 2DES is a complex task, machine learning (ML) offers exciting opportunities to translate complicated spectroscopic data into physical insight. Recent studies have found that neural networks (NNs) can map simulated multidimensional spectra to molecular-scale properties with high accuracy. However, simulations often do not capture experimental factors that influence real spectra, including noise and suboptimal pulse resonance conditions, bringing into question the experimental utility of NNs trained on simulated data. Here, we show how factors associated with experimental 2D spectral data influence the ability of NNs to map simulated 2DES spectra onto underlying intermolecular electronic couplings. By systematically introducing multisourced noise into a library of 356,000 simulated 2D spectra, we show that noise does not hamper NN performance for spectra exceeding threshold signal-to-noise ratios (SNR) (6.6 if background noise dominates vs. 2.5 for intensity-dependent noise). In stark contrast to human-based analyses of 2DES data, we find that the NN accuracy improves significantly (ca. 84% → 96%) when the data are constrained by the bandwidth and center frequency of the pump pulses. This result is consistent with the NN learning the optical trends described by Kasha’s theory of molecular excitons. Our findings convey positive prospects for adapting simulation-trained NNs to extract molecular properties from inherently imperfect experimental 2DES data. More broadly, we propose that machine-learned perspectives of nonlinear spectroscopic data may produce unique and, perhaps, counterintuitive guidelines for experimental design.

[LG-71] Tuning Sequential Monte Carlo Samplers via Greedy Incremental Divergence Minimization

Link: https://arxiv.org/abs/2503.15704
Authors: Kyurae Kim, Zuheng Xu, Jacob R. Gardner, Trevor Campbell
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:The performance of sequential Monte Carlo (SMC) samplers heavily depends on the tuning of the Markov kernels used in the path proposal. For SMC samplers with unadjusted Markov kernels, standard tuning objectives, such as the Metropolis-Hastings acceptance rate or the expected-squared jump distance, are no longer applicable. While stochastic gradient-based end-to-end optimization has been explored for tuning SMC samplers, such approaches often incur excessive training costs, even for tuning just the kernel step sizes. In this work, we propose a general adaptation framework for tuning the Markov kernels in SMC samplers by minimizing the incremental Kullback-Leibler (KL) divergence between the proposal and target paths. For step size tuning, we provide a gradient- and tuning-free algorithm that is generally applicable for kernels such as Langevin Monte Carlo (LMC). We further demonstrate the utility of our approach by providing a tailored scheme for tuning kinetic LMC used in SMC samplers. Our implementations are able to obtain a full schedule of tuned parameters at the cost of a few vanilla SMC runs, which is a fraction of the cost of gradient-based approaches.
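
As background for why the step size of an unadjusted kernel needs tuning at all, here is a minimal sketch of an unadjusted Langevin Monte Carlo chain. It illustrates the kind of kernel the abstract refers to, not the paper's tuning algorithm. It assumes a one-dimensional standard normal target; the names `ula_step`, `run_ula` and `grad_log_std_normal` are hypothetical.

```python
import math
import random


def ula_step(x, grad_log_target, step_size, noise):
    """One unadjusted Langevin step: gradient drift plus scaled Gaussian noise.

    There is no Metropolis-Hastings correction, so no acceptance rate exists.
    """
    return x + step_size * grad_log_target(x) + math.sqrt(2.0 * step_size) * noise


def run_ula(x0, grad_log_target, step_size, n_steps, seed=0):
    """Run a chain of unadjusted Langevin steps and return every visited state."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = ula_step(x, grad_log_target, step_size, rng.gauss(0.0, 1.0))
        samples.append(x)
    return samples


def grad_log_std_normal(x):
    """Gradient of log N(0, 1) up to a constant: d/dx (-x^2 / 2) = -x."""
    return -x


samples = run_ula(0.0, grad_log_std_normal, step_size=0.1, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

For this particular target the chain's stationary variance is 1 / (1 - h/2) for step size h rather than exactly 1, so a larger step mixes faster but biases the samples more. That trade-off, together with the absence of an acceptance rate to monitor, is why a separate tuning criterion is needed.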

[LG-72] Sequential learning based PINNs to overcome temporal domain complexities in unsteady flow past flapping wings

Link: https://arxiv.org/abs/2503.15679
Authors: Rahul Sundar, Didier Lucor, Sunetra Sarkar
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract: For combined data-driven and physics-based modelling of unsteady flow systems with moving immersed boundaries, Sundar et al. introduced an immersed boundary-aware (IBA) framework, combining Physics-Informed Neural Networks (PINNs) and the immersed boundary method (IBM). This approach was beneficial because it avoided case-specific transformations to a body-attached reference frame. Building on this, we now address the challenges of long time integration in velocity reconstruction and pressure recovery by extending this IBA framework with sequential learning strategies. Key difficulties for PINNs in long time integration include temporal sparsity, long temporal domains and rich spectral content. To tackle these, a moving boundary-enabled PINN is developed with two sequential learning strategies: (i) time marching with a gradual increase in time domain size, which struggles with error accumulation over long time domains; and (ii) time decomposition, which divides the temporal domain into smaller segments and, combined with transfer learning, effectively reduces error propagation and computational complexity. The key findings for modelling incompressible unsteady flows past a flapping airfoil are: (i) for quasi-periodic flows, the time decomposition approach with preferential spatio-temporal sampling improves accuracy and efficiency for pressure recovery and aerodynamic load reconstruction; and (ii) for long time domains, decomposing the domain into smaller temporal segments and employing multiple sub-networks simplifies the problem, ensuring stability and reduced network sizes. This study highlights the limitations of traditional PINNs for long time integration of flow-structure interaction problems and demonstrates the benefits of decomposition-based strategies for addressing error accumulation, computational cost, and complex dynamics.

[LG-73] Model Risk Management for Generative AI In Financial Institutions

Link: https://arxiv.org/abs/2503.15668
Authors: Anwesha Bhattacharyya, Ye Yu, Hanyu Yang, Rahul Singh, Tarun Joshi, Jie Chen, Kiran Yalavarthy
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)

Abstract: The success of OpenAI’s ChatGPT in 2023 has spurred financial enterprises into exploring Generative AI applications to reduce costs or drive revenue within different lines of business in the financial industry. While these applications offer strong potential for efficiencies, they introduce new model risks, primarily hallucinations and toxicity. As highly regulated entities, financial enterprises (primarily large US banks) are obligated to enhance their model risk framework with additional testing and controls to ensure safe deployment of such applications. This paper outlines the key aspects of model risk management for generative AI models, with a special emphasis on the additional practices required in model validation.

[LG-74] Using machine learning to measure evidence of students’ sensemaking in physics courses

Link: https://arxiv.org/abs/2503.15638
Authors: Kaitlin Gili, Kyle Heuton, Astha Shah, Michael C. Hughes
Categories: Physics Education (physics.ed-ph); Machine Learning (cs.LG)

Abstract: In the education system, problem-solving correctness is often inappropriately conflated with student learning. Advances in both Physics Education Research (PER) and Machine Learning (ML) provide the initial tools to develop a more meaningful and efficient measurement scheme for whether physics students are engaging in sensemaking: a learning process of figuring out the how and why for a particular phenomenon. In this work, we contribute such a measurement scheme, which quantifies the evidence of students’ physical sensemaking given their written explanations for their solutions to physics problems. We outline how the proposed human annotation scheme can be automated into a deployable ML model using language encoders and shared probabilistic classifiers. The procedure is scalable for a large number of problems and students. We implement three unique language encoders with logistic regression, and provide a deployability analysis on 385 real student explanations from the 2023 Introduction to Physics course at Tufts University. Furthermore, we compute sensemaking scores for all students, and analyze these measurements alongside their corresponding problem-solving accuracies. We find no linear relationship between these two variables, supporting the hypothesis that one is not a reliable proxy for the other. We discuss how sensemaking scores can be used alongside problem-solving accuracies to provide a more nuanced snapshot of student performance in physics class.

[LG-75] Hierarchical clustering with maximum density paths and mixture models

Link: https://arxiv.org/abs/2503.15582
Authors: Martin Ritzert, Polina Turishcheva, Laura Hansel, Paul Wollenhaupt, Marissa Weis, Alexander Ecker
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Hierarchical clustering is an effective and interpretable technique for analyzing structure in data, offering a nuanced understanding by revealing insights at multiple scales and resolutions. It is particularly helpful in settings where the exact number of clusters is unknown, and provides a robust framework for exploring complex datasets. Additionally, hierarchical clustering can uncover inner structures within clusters, capturing subtle relationships and nested patterns that may be obscured by traditional flat clustering methods. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. Our method addresses this limitation by leveraging a two-stage approach, first employing a Gaussian or Student’s t mixture model to overcluster the data, and then hierarchically merging clusters based on the induced density landscape. This approach yields state-of-the-art clustering performance while also providing a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at this https URL.
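The second stage described above, hierarchically merging an overclustered set of mixture components, can be illustrated with a generic greedy agglomeration. This is a sketch under strong assumptions rather than the authors' method: the pairwise `pair_scores` are hypothetical inputs standing in for whatever connectivity measure is derived from the density landscape, and `merge_hierarchy` and `flat_clusters` are illustrative names.

```python
def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_hierarchy(n_components, pair_scores):
    """Greedily merge components, highest score first.

    pair_scores maps (i, j) -> score, where a higher score means the two
    components are more strongly connected (hypothetical input).
    Returns the merges as (score, kept_root, absorbed_root), in order.
    """
    parent = list(range(n_components))
    merges = []
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), score in ranked:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # skip pairs already in the same cluster
            parent[rj] = ri
            merges.append((score, ri, rj))
    return merges


def flat_clusters(n_components, merges, n_clusters):
    """Replay merges until n_clusters groups remain; return one label per component."""
    parent = list(range(n_components))
    for _, a, b in merges[: max(0, n_components - n_clusters)]:
        parent[_find(parent, b)] = _find(parent, a)
    roots = sorted({_find(parent, i) for i in range(n_components)})
    return [roots.index(_find(parent, i)) for i in range(n_components)]


if __name__ == "__main__":
    scores = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.2, (0, 3): 0.1}
    merges = merge_hierarchy(4, scores)
    print(flat_clusters(4, merges, 2))
```

The ordered merge list is the hierarchy: cutting it at different depths gives flat clusterings at different resolutions without refitting the mixture model.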

Information Retrieval

[IR-0] Narrative Trails: A Method for Coherent Storyline Extraction via Maximum Capacity Path Optimization ECIR2025

Link: https://arxiv.org/abs/2503.15681
Authors: Fausto German, Brian Keith, Chris North
Categories: Information Retrieval (cs.IR)
Comments: Eighth Text2Story Workshop at the 47th European Conference on Information Retrieval (ECIR 2025). The code for our algorithm, evaluations, and examples are available at this https URL

Abstract: Traditional information retrieval is primarily concerned with finding relevant information from large datasets without imposing a structure within the retrieved pieces of data. However, structuring information in the form of narratives (ordered sets of documents that form coherent storylines) allows us to identify, interpret, and share insights about the connections and relationships between the ideas presented in the data. Despite their significance, current approaches for algorithmically extracting storylines from data are scarce, with existing methods primarily relying on intricate word-based heuristics and auxiliary document structures. Moreover, many of these methods are difficult to scale to large datasets and general contexts, as they are designed to extract storylines for narrow tasks. In this paper, we propose Narrative Trails, an efficient, general-purpose method for extracting coherent storylines in large text corpora. Specifically, our method uses the semantic-level information embedded in the latent space of deep learning models to build a sparse coherence graph and extract narratives that maximize the minimum coherence of the storylines. By quantitatively evaluating our proposed methods on two distinct narrative extraction tasks, we show the generalizability and scalability of Narrative Trails in multiple contexts while also simplifying the extraction pipeline.
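Maximizing the minimum coherence along a storyline is an instance of the maximum capacity (widest) path problem. The sketch below solves it with a Dijkstra-style search and is only one standard way to do so; the paper's own implementation may differ. The adjacency-dict graph format, the positive edge weights, and the function name `max_capacity_path` are assumptions for illustration.

```python
import heapq


def max_capacity_path(graph, source, target):
    """Return (bottleneck, path) maximizing the minimum edge weight on the path.

    graph: dict mapping node -> dict of neighbor -> positive coherence weight
    (list undirected edges in both directions).
    Returns (0.0, []) if target cannot be reached from source.
    """
    best = {source: float("inf")}  # best known bottleneck per node
    prev = {}                      # predecessor on the best path
    heap = [(-best[source], source)]  # max-heap via negated capacities
    done = set()
    while heap:
        neg_cap, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, weight in graph.get(node, {}).items():
            cap = min(-neg_cap, weight)  # bottleneck if we extend through this edge
            if cap > best.get(nbr, 0.0):
                best[nbr] = cap
                prev[nbr] = node
                heapq.heappush(heap, (-cap, nbr))
    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[target], path[::-1]


if __name__ == "__main__":
    coherence = {
        "A": {"B": 0.9, "C": 0.6, "D": 0.2},
        "B": {"A": 0.9, "D": 0.3},
        "C": {"A": 0.6, "D": 0.7},
        "D": {"A": 0.2, "B": 0.3, "C": 0.7},
    }
    print(max_capacity_path(coherence, "A", "D"))
```

In the toy graph, the route through `C` wins even though the first hop to `B` is the most coherent single edge, because the `B` to `D` link is weak; the objective scores a storyline by its weakest transition.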

Attachment Download

Click to download the full list of today's papers