Arxiv今日论文 | 2025-04-03

本篇博文主要内容为 2025-04-03 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：本文旨在评估大型语言模型（LLMs）在统计相关性基础上隐藏的性别偏见程度，特别是通过在线购物历史预测个体性别的能力及其受性别刻板印象影响的情况。研究的关键在于引入一种新颖的视角，即从在线购物行为出发分析LLMs的性别分类能力，并探究其推理过程及产品与性别之间的共现模式。为此，作者使用美国用户的在线购买历史数据集，测试了六种LLMs的性别分类性能，并深入分析了模型的决策依据。结果表明，尽管这些模型能够以中等精度推断性别，但它们的判断往往基于产品类别与性别的刻板关联。即使给予明确指令要求减少偏见，模型预测的确定性虽有所降低，但仍未能完全消除刻板印象模式。因此，本研究强调了在LLMs中持续存在的性别偏见问题，并呼吁开发更有效的偏见缓解策略作为解决方案的核心。

链接: https://arxiv.org/abs/2504.01951
作者: Massimiliano Luca,Ciro Beneduce,Bruno Lepri,Jacopo Staiano
机构: Mobile and Social Computing Lab (MOBS); Fondazione Bruno Kessler (布鲁诺·凯勒基金会); Dept. of Economics & Management (经济学与管理系), University of Trento (特伦托大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:With the wide and cross-domain adoption of Large Language Models, it becomes crucial to assess to which extent the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of works, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective. We investigate whether LLMs can predict an individual’s gender based solely on online shopping histories and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender and we then analyze their reasoning and products-gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions, but do not eliminate stereotypical patterns. Our findings highlight the persistent nature of gender biases in LLMs and emphasize the need for robust bias-mitigation strategies.
zh

[NLP-1] OpenCodeReasoning : Advancing Data Distillation for Competitive Coding

【速读】：该论文旨在解决通过知识蒸馏将推理能力融入学生模型时，许多现有方法因依赖专有数据集或缺乏数据整理与训练细节而导致的进展受限问题。论文的关键解决方案是构建了一个高质量的监督微调（SFT）数据集，并利用此数据集在不同规模的模型上实现了最先进的编码能力成果。实验表明，仅使用SFT的蒸馏模型在LiveCodeBench上达到61.8%，在CodeContests上达到24.6%，优于采用强化学习训练的替代方案。此外，研究分析了数据来源、代码执行过滤的影响以及指令/解答多样性的重要性，发现执行过滤对基准准确性产生负面影响，因此强调了指令多样性而非解答正确性的重要性。最后，论文还探讨了这些模型的令牌效率和推理模式。论文将开源这些数据集和蒸馏模型以促进社区发展。

链接: https://arxiv.org/abs/2504.01943
作者: Wasi Uddin Ahmad,Sean Narenthiran,Somshubra Majumdar,Aleksander Ficek,Siddhartha Jain,Jocelyn Huang,Vahid Noroozi,Boris Ginsburg
机构: NVIDIA (NVIDIA)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:Since the advent of reasoning-based large language models, many have found great success from distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.
zh

[NLP-2] Review Refine Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

【速读】：该论文旨在解决复杂多模态应用、结构化生成和战略规划中AI代理性能不足的问题，尤其是在标准微调方法不适用且推理时间方法如Best-of-N (BON)采样缺乏迭代反馈整合机制的情况下。论文的关键创新在于提出Iterative Agent Decoding (IAD)，通过结合迭代优化与动态候选评估及选择，利用验证器引导实现高效反馈设计与整合。IAD特别优化以最大化从奖励分数中提取信号，并在Sketch2Code、Text2SQL和Webshop任务中显著优于基线模型，分别取得3%-6%（带或不带LLM裁判）和8%-10%的绝对性能提升。进一步分析表明，IAD的改进主要归因于验证器引导的精炼而非单纯采样多样性，验证器质量对推理时间优化至关重要，同时揭示了噪声和稀疏奖励对扩展行为的影响。

链接: https://arxiv.org/abs/2504.01931
作者: Souradip Chakraborty,Mohammadreza Pourreza,Ruoxi Sun,Yiwen Song,Nino Scherrer,Jindong Gu,Furong Huang,Amrit Singh Bedi,Ahmad Beirami,Hamid Palangi,Tomas Pfister
机构: Google; University of Maryland; University of Central Florida
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While AI agents have shown remarkable performance at various tasks, they still struggle with complex multi-modal applications, structured generation and strategic planning. Improvements via standard fine-tuning is often impractical, as solving agentic tasks usually relies on black box API access without control over model parameters. Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. However, BON lacks iterative feedback integration mechanism. Hence, we propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier. IAD differs in how feedback is designed and integrated, specifically optimized to extract maximal signal from reward scores. We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and Webshop where IAD consistently outperforms baselines, achieving 3–6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8–10% gains on Webshop across multiple metrics. To better understand the source of IAD’s gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD’s improvements are primarily driven by verifier-guided refinement, not merely sampling diversity. We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier. Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior. Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization.
zh

[NLP-3] A thorough benchmark of automatic text classification: From traditional approaches to large language models

【速读】：该论文试图解决的问题是：尽管近年来基于Transformer架构的小型语言模型（SLMs）和大型语言模型（LLMs）在自动文本分类（ATC）任务上取得了显著的效果提升，但尚未有全面的成本效益分析来评估这些新方法相对于传统方法（如支持向量机SVMs和Logistic回归）是否能够在效果提升的同时合理补偿其更高的成本。
解决方案的关键在于：首先，提供了一种科学严谨的传统与最新ATC方法（包括五种开源LLMs）的成本效益对比分析；其次，构建了一个包含22个数据集的大规模基准测试平台，覆盖情感分析和主题分类等任务，并采用基于折叠交叉验证的数据划分方式，同时公开代码、数据及文档，以促进社区复现实验并推动领域发展。实验结果表明，LLMs在效果上优于传统方法（平均高26%-7.1%）和SLMs（平均高4.9%-1.9%），但其计算成本显著更高，比传统方法和SLMs分别慢590倍和8.5倍。基于此，论文提出了针对不同应用场景的推荐策略。

链接: https://arxiv.org/abs/2504.01930
作者: Washington Cunha,Leonardo Rocha,Marcos André Gonçalves
机构: Federal University of Minas Gerais (巴西联邦大学米纳斯吉拉斯); Federal University of São João del Rei (巴西联邦大学圣若昂德尔雷伊); Federal University of Minas Gerais (巴西联邦大学米纳斯吉拉斯)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Automatic text classification (ATC) has experienced remarkable advancements in the past decade, best exemplified by recent small and large language models (SLMs and LLMs), leveraged by Transformer architectures. Despite recent effectiveness improvements, a comprehensive cost-benefit analysis investigating whether the effectiveness gains of these recent approaches compensate their much higher costs when compared to more traditional text classification approaches such as SVMs and Logistic Regression is still missing in the literature. In this context, this work’s main contributions are twofold: (i) we provide a scientifically sound comparative analysis of the cost-benefit of twelve traditional and recent ATC solutions including five open LLMs, and (ii) a large benchmark comprising 22 datasets, including sentiment analysis and topic classification, with their (train-validation-test) partitions based on folded cross-validation procedures, along with documentation, and code. The release of code, data, and documentation enables the community to replicate experiments and advance the field in a more scientifically sound manner. Our comparative experimental results indicate that LLMs outperform traditional approaches (up to 26%-7.1% on average) and SLMs (up to 4.9%-1.9% on average) in terms of effectiveness. However, LLMs incur significantly higher computational costs due to fine-tuning, being, on average 590x and 8.5x slower than traditional methods and SLMs, respectively. Results suggests the following recommendations: (1) LLMs for applications that require the best possible effectiveness and can afford the costs; (2) traditional methods such as Logistic Regression and SVM for resource-limited applications or those that cannot afford the cost of tuning large LLMs; and (3) SLMs like Roberta for near-optimal effectiveness-efficiency trade-off.
zh

[NLP-4] Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

【速读】：该论文旨在解决大型语言模型（LLMs）中存在的反转咒诅（Reversal Curse）问题，即模型难以学习可逆事实关联的基本泛化失败现象。作者推测这一问题源于认知科学、神经科学及人工智能领域长期存在的绑定问题，并指出其核心原因在于Transformer架构在概念绑定方面的局限性，具体表现为概念表征的不一致性与纠缠。为解决此问题，论文提出了一种基于联合嵌入预测架构（JEPA）的新模型设计，首次突破了反转咒诅，且未依赖特殊的数据增强或非因果掩码等旁路方法。此外，通过引入支持解缠概念表征的特殊记忆层，进一步提升了模型的泛化能力。关键解决方案在于创新性的JEPA架构以及对概念表征纠缠问题的有效缓解，从而实现参数化前向链推理以解决大规模算术推理任务，超越了基于非参数化记忆和冗长显式推理的传统大型语言模型。

链接: https://arxiv.org/abs/2504.01928
作者: Boshi Wang,Huan Sun
机构: The Ohio State University (俄亥俄州立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Code and data: this https URL

点击查看摘要

Abstract:Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers’ limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.
zh

[NLP-5] Bridging the Linguistic Divide: A Survey on Leverag ing Large Language Models for Machine Translation

【速读】：该论文旨在解决低资源语言和领域在机器翻译（Machine Translation, MT）中的挑战，特别是在缺乏足够的平行语料库、语言工具以及计算基础设施的情况下。论文的关键解决方案在于探索利用大型语言模型（Large Language Models, LLMs）的有效技术，包括少量提示（few-shot prompting）、跨语言迁移（cross-lingual transfer）以及参数高效微调（parameter-efficient fine-tuning），以实现对低资源环境的适应。此外，论文还研究了基于LLMs的合成数据生成策略，如反向翻译（back-translation）和词典增强（lexical augmentation）。通过对比LLM驱动的翻译与传统编码器-解码器模型的表现，论文进一步评估了不同方法的优势与局限性，并探讨了幻觉（hallucinations）、评价不一致性及继承偏差等持续存在的挑战，同时评估了新兴的LLM驱动翻译质量指标。总体而言，论文提供了实用见解，并为构建大规模生成模型时代的鲁棒、包容且可扩展的MT系统指明了方向。

链接: https://arxiv.org/abs/2504.01919
作者: Baban Gain,Dibyanayan Bandyopadhyay,Asif Ekbal
机构: Indian Institute of Technology Patna (印度理工学院帕特纳); Indian Institute of Technology Jodhpur (印度理工学院乔杜尔)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has significantly reshaped the landscape of machine translation (MT), particularly for low-resource languages and domains that lack sufficient parallel corpora, linguistic tools, and computational infrastructure. This survey presents a comprehensive overview of recent progress in leveraging LLMs for MT. We analyze techniques such as few-shot prompting, cross-lingual transfer, and parameter-efficient fine-tuning that enable effective adaptation to under-resourced settings. The paper also explores synthetic data generation strategies using LLMs, including back-translation and lexical augmentation. Additionally, we compare LLM-based translation with traditional encoder-decoder models across diverse language pairs, highlighting the strengths and limitations of each. We discuss persistent challenges such as hallucinations, evaluation inconsistencies, and inherited biases while also evaluating emerging LLM-driven metrics for translation quality. This survey offers practical insights and outlines future directions for building robust, inclusive, and scalable MT systems in the era of large-scale generative models.
zh

[NLP-6] FineLIP: Extending CLIPs Reach via Fine-Grained Alignment with Longer Text Inputs

【速读】：该论文旨在解决现有CLIP模型在处理长描述性文本时能力受限以及难以有效捕捉视觉和文本细节的问题。具体而言，传统CLIP模型的文本编码器仅能处理最多77个文本标记，限制了其在长描述任务中的表现，并且在需要细粒度分析的任务中往往性能不佳。为了解决这些问题，论文提出了一种名为FineLIP的新方法，它通过在CLIP框架内引入细粒度对齐机制与更长文本输入的能力来增强跨模态文本-图像映射。FineLIP的关键在于首先扩展位置嵌入以支持更长的文本输入，接着动态聚合局部图像和文本标记，并利用这些聚合结果强制执行细粒度的标记到标记的跨模态对齐。实验结果表明，FineLIP在零样本跨模态检索和文本到图像生成两项任务上的表现优于现有的最先进方法。

链接: https://arxiv.org/abs/2504.01916
作者: Mothilal Asokan,Kebin Wu,Fatima Albreiki
机构: Technology Innovation Institute (TII)(技术创新研究所), United Arab Emirates (阿拉伯联合酋长国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, \textbfFineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating \textbfFine-grained alignment with \textbfLonger text input within the CL\textbfIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
zh

[NLP-7] Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在物理学研究中输出结果可靠性与可解释性不足的问题。论文的关键解决方案在于提出一个包含三个模块的框架：推理模块、解释模块以及AI与科学家交互模块。其中，解释模块尤为关键，它通过引入多个专业化代理（如总结者、模型构建者、用户界面构建者和测试者），将LLMs的输出结构化为基于物理原理的更易理解的科学模型，从而提升对AI生成结果的理解深度。这一创新方法显著增强了科学发现中AI增强推理的透明度、验证能力和理论整合能力。

链接: https://arxiv.org/abs/2504.01911
作者: Yinggan Xu,Hana Kimlee,Yijia Xiao,Di Luo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are playing an expanding role in physics research by enhancing reasoning, symbolic manipulation, and numerical computation. However, ensuring the reliability and interpretability of their outputs remains a significant challenge. In our framework, we conceptualize the collaboration between AI and human scientists as a dynamic interplay among three modules: the reasoning module, the interpretation module, and the AI-scientist interaction module. Recognizing that effective physics reasoning demands rigorous logical consistency, quantitative precision, and deep integration with established theoretical models, we introduce the interpretation module to improve the understanding of AI-generated outputs, which is not previously explored in the literature. This module comprises multiple specialized agents, including summarizers, model builders, UI builders, and testers, which collaboratively structure LLM outputs within a physically grounded framework, by constructing a more interpretable science model. A case study demonstrates that our approach enhances transparency, facilitates validation, and strengthens AI-augmented reasoning in scientific discovery.
zh

[NLP-8] STAR-1: Safer Alignment of Reasoning LLM s with 1K Data

【速读】：该论文旨在解决大型推理模型（Large Reasoning Models, LRMs）在安全对齐（safety alignment）方面的关键需求。具体而言，作者构建了一个名为STAR-1的高质量、仅包含千级别样本的安全数据集，专为LRMs设计，如DeepSeek-R1。论文解决方案的关键在于遵循三个核心原则：多样性（diversity）、深思熟虑的推理（deliberative reasoning）和严格的筛选（rigorous filtering）。首先，通过整合来自不同开源数据集的安全样本提升数据多样性；其次，制定安全策略以生成基于策略的深思熟虑的推理样本；最后，利用基于GPT-4o的安全评分系统筛选符合最佳实践的安全训练样本。实验结果表明，使用STAR-1微调LRMs可使四个基准测试中的安全性能平均提升40%，同时仅导致五个推理任务中推理能力的轻微下降（平均1.1%）。此外，消融研究进一步验证了这三个设计原则的重要性，并分析了STAR-1在LRMs和传统大型语言模型（LLMs）中的有效性。

链接: https://arxiv.org/abs/2504.01903
作者: Zijun Wang,Haoqin Tu,Yuhan Wang,Juncheng Wu,Jieru Mei,Brian R. Bartoldson,Bhavya Kailkhura,Cihang Xie
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles – diversity, deliberative reasoning, and rigorous filtering – STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is this https URL.
zh

[NLP-9] Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights

【速读】：该论文旨在解决社交媒体对话中滥用语言检测的问题，传统方法因忽视会话上下文（即前续评论的内容与拓扑结构）而导致性能不稳定。解决方案的关键在于提出了一种基于图神经网络（Graph Neural Networks, GNNs）的新方法，将社交媒体对话建模为图结构，其中节点代表评论，边表示回复关系。通过系统性研究不同的图表示和上下文窗口配置，优化了滥用语言检测（Abusive Language Detection, ALD）的效果，最终显著提升了F1分数，证明了结构化会话上下文的重要性，并确立了GNN作为上下文感知滥用语言检测的强大框架。

链接: https://arxiv.org/abs/2504.01902
作者: Célia Nouri,Jean-Philippe Cointet,Chloé Clavel
机构: INRIA(法国国家信息与自动化研究所); Sciences Po(科学政治学院); Télécom Paris(巴黎高等电信学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting abusive language in social media conversations poses significant challenges, as identifying abusiveness often depends on the conversational context, characterized by the content and topology of preceding comments. Traditional Abusive Language Detection (ALD) models often overlook this context, which can lead to unreliable performance metrics. Recent Natural Language Processing (NLP) methods that integrate conversational context often depend on limited and simplified representations, and report inconsistent results. In this paper, we propose a novel approach that utilize graph neural networks (GNNs) to model social media conversations as graphs, where nodes represent comments, and edges capture reply structures. We systematically investigate various graph representations and context windows to identify the optimal configuration for ALD. Our GNN model outperform both context-agnostic baselines and linear context-aware methods, achieving significant improvements in F1 scores. These findings demonstrate the critical role of structured conversational context and establish GNNs as a robust framework for advancing context-aware abusive language detection.
zh

[NLP-10] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

【速读】：该论文旨在解决大型多模态模型（Large Multimodal Models, LMMs）在3D场景理解中的应用障碍，特别是由于缺乏大规模视觉-语言3D数据集所导致的问题。为应对这一挑战，传统方法侧重于通过设计3D输入级场景表示将3D感知能力注入到2D LMMs中。本文提出了一种新的视角，即引入带有3D感知的重建视觉指令微调（Ross3D），其关键在于将带有3D感知的视觉监督融入训练过程。具体而言，Ross3D结合了跨视图和全局视图重建：前者通过聚合其他视图的重叠信息来重构被遮掩的视图；后者则利用所有可用视图的信息恢复鸟瞰图（Bird’s-Eye-View）图像，从而提供整个场景的全面概览。实验结果表明，Ross3D在多种3D场景理解基准测试中达到了最先进的性能，并且半监督实验展示了利用大量未标注的3D视觉数据的巨大潜力。

链接: https://arxiv.org/abs/2504.01901
作者: Haochen Wang,Yucheng Zhao,Tiancai Wang,Haoqiang Fan,Xiangyu Zhang,Zhaoxiang Zhang
机构: NLPR(模式识别国家重点实验室), MAIS(多媒体计算与人工智能研究组), CASIA(中国科学院自动化研究所); UCAS(中国科学院大学); Dexmal; MEGVII Technology(旷视科技); StepFun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird’s-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
zh

[NLP-11] CoRAG : Collaborative Retrieval-Augmented Generation NAACL2024

【速读】：该论文旨在解决在资源受限（low-resource）的协作环境中，如何有效提升 Retrieval-Augmented Generation (RAG) 模型性能的问题。论文提出了一种名为 CoRAG 的框架，其关键在于通过引入一个共享的协作语料库，使多个客户端能够联合训练一个共享的 RAG 模型，从而在保持知识多样性的同时优化模型性能。实验表明，CoRAG 在低资源场景下优于参数化协作学习方法和本地训练的 RAG 模型。研究进一步揭示了共享存储中相关和无关片段的重要性及潜在风险，并强调了在协作 RAG 中平衡利用集体增强的知识库与避免有害片段传播的关键权衡问题。

链接: https://arxiv.org/abs/2504.01883
作者: Aashiq Muhamed,Mona Diab,Virginia Smith
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: NAACL 2024

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) models excel in knowledge-intensive tasks, especially under few-shot learning constraints. We introduce CoRAG, a framework extending RAG to collaborative settings, where clients jointly train a shared model using a collaborative passage store. To evaluate CoRAG, we introduce CRAB, a benchmark for collaborative homogeneous open-domain question answering. Our experiments demonstrate that CoRAG consistently outperforms both parametric collaborative learning methods and locally trained RAG models in low-resource scenarios. Further analysis reveals the critical importance of relevant passages within the shared store, the surprising benefits of incorporating irrelevant passages, and the potential for hard negatives to negatively impact performance. This introduces a novel consideration in collaborative RAG: the trade-off between leveraging a collectively enriched knowledge base and the potential risk of incorporating detrimental passages from other clients. Our findings underscore the viability of CoRAG, while also highlighting key design challenges and promising avenues for future research.
zh

[NLP-12] ransientTables: Evaluating LLM s Reasoning on Temporally Evolving Semi-structured Tables

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在时间推理（temporal reasoning）能力上的局限性问题。传统LLMs通常基于静态数据集进行训练，难以有效捕捉和推理事件的时间序列关系。为应对这一挑战，论文的关键解决方案是构建了一个名为TRANSIENTTABLES的数据集，并引入了一种基于模板的问题生成流水线来优化模板和问题。此外，通过采用最先进的LLMs建立基准结果，同时提出以任务分解为中心的新建模策略，进一步提升了LLMs在时间推理任务中的性能。

链接: https://arxiv.org/abs/2504.01879
作者: Abhilash Shankarampeta,Harsh Mahajan,Tushar Kataria,Dan Roth,Vivek Gupta
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: 19 Pages. 21 Tables, 1 figure

点击查看摘要

Abstract:Humans continuously make new discoveries, and understanding temporal sequence of events leading to these breakthroughs is essential for advancing science and society. This ability to reason over time allows us to identify future steps and understand the effects of financial and political decisions on our lives. However, large language models (LLMs) are typically trained on static datasets, limiting their ability to perform effective temporal reasoning. To assess the temporal reasoning capabilities of LLMs, we present the TRANSIENTTABLES dataset, which comprises 3,971 questions derived from over 14,000 tables, spanning 1,238 entities across multiple time periods. We introduce a template-based question-generation pipeline that harnesses LLMs to refine both templates and questions. Additionally, we establish baseline results using state-of-the-art LLMs to create a benchmark. We also introduce novel modeling strategies centered around task decomposition, enhancing LLM performance.
zh

[NLP-13] Cross-Lingual Consistency: A Novel Inference Framework for Advancing Reasoning in Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在处理复杂推理任务时因多语言训练语料库中的固有语言偏见而导致的语义漂移和逻辑不一致问题，尤其是在参数规模小于10B的LLMs中表现更为显著。为克服这些限制，论文提出了一种名为跨语言一致性（Cross-Lingual Consistency, CLC）的新颖推理框架。CLC的关键创新在于通过多数投票机制整合多语言推理路径，从而提升LLMs的推理能力。实验结果表明，CLC在CMATH数据集上的表现优于传统的自一致性方法，并在DeepSeek-Math-7B-Instruct、Qwen2.5-Math-7B-Instruct和Gemma2-9B-Instruct等模型上分别实现了9.5%、6.5%和6.0%的绝对准确率提升。此外，将CLC扩展到11种不同语言进一步展示了其双重优势：一是通过多语言集成投票消除多语言训练语料库中的语言偏见；二是通过探索更广泛的多语言解空间避免单语言推理陷阱，从而实现比单语言自一致性基线更优的全局推理路径，在Gemma2-9B-Instruct模型在MGSM数据集上的测试中验证了4.1%-18.5%的准确率提升。

链接: https://arxiv.org/abs/2504.01857
作者: Zhiwei Yu,Tuo Li,Changhong Wang,Hui Chen,Lang Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) has emerged as a critical mechanism for enhancing reasoning capabilities in large language models (LLMs), with self-consistency demonstrating notable promise in boosting performance. However, inherent linguistic biases in multilingual training corpora frequently cause semantic drift and logical inconsistencies, especially in sub-10B parameter LLMs handling complex inference tasks. To overcome these constraints, we propose the Cross-Lingual Consistency (CLC) framework, an innovative inference paradigm that integrates multilingual reasoning paths through majority voting to elevate LLMs’ reasoning capabilities. Empirical evaluations on the CMATH dataset reveal CLC’s superiority over the conventional self-consistency method, delivering 9.5%, 6.5%, and 6.0% absolute accuracy gains for DeepSeek-Math-7B-Instruct, Qwen2.5-Math-7B-Instruct, and Gemma2-9B-Instruct respectively. Expanding CLC’s linguistic scope to 11 diverse languages implies two synergistic benefits: 1) neutralizing linguistic biases in multilingual training corpora through multilingual ensemble voting, 2) escaping monolingual reasoning traps by exploring the broader multilingual solution space. This dual benefits empirically enables more globally optimal reasoning paths compared to monolingual self-consistency baselines, as evidenced by the 4.1%-18.5% accuracy gains using Gemma2-9B-Instruct on the MGSM dataset.
zh

[NLP-14] PaperBench: Evaluating AIs Ability to Replicate AI Research

【速读】：该论文试图解决的问题是如何客观评估人工智能（AI）代理在复现顶级人工智能研究方面的能力。为实现这一目标，论文提出了一套名为PaperBench的基准测试系统，要求AI代理从零开始复现20篇ICML 2024 Spotlight和Oral论文，涵盖理解论文贡献、开发代码库以及成功执行实验等任务。为了确保评估的客观性，论文设计了一套层次化的打分标准（rubrics），将每个复现任务细分为多个明确评分标准的小任务，总计包含8,316个可单独评分的任务。此外，这些打分标准由相应ICML论文的作者共同开发，以保证其准确性和现实性。为支持大规模评估，论文还开发了一种基于大型语言模型（LLM）的自动评分器，并通过构建独立的裁判基准来评估其性能。最终，论文发现当前表现最佳的AI代理（Claude 3.5 Sonnet (New)结合开源框架）平均复现得分为21.0%，且尚未超越人类专家的表现。论文的关键解决方案在于构建了一个全面且多层次的评估框架，结合人工与自动化手段，以量化AI代理在复现前沿研究中的工程能力。

链接: https://arxiv.org/abs/2504.01848
作者: Giulio Starace,Oliver Jaffe,Dane Sherburn,James Aung,Jun Shern Chan,Leon Maksin,Rachel Dias,Evan Mays,Benjamin Kinsella,Wyatt Thompson,Johannes Heidecke,Amelia Glaese,Tejal Patwardhan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 30 pages, 14 figures

点击查看摘要

Abstract:We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge’s performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We \hrefthis https URLopen-source our code to facilitate future research in understanding the AI engineering capabilities of AI agents.
zh

[NLP-15] LARGE: Legal Retrieval Augmented Generation Evaluation Tool

【速读】：该论文试图解决构建检索增强生成（Retrieval-Augmented Generation, RAG）系统在法律领域的性能优化问题。解决方案的关键在于提出LRAGE（Legal Retrieval-Augmented Generation Evaluation），一个专注于法律领域的开源工具，用于全面评估RAG系统的整体性能。LRAGE通过图形用户界面（GUI）和命令行接口（CLI）支持实验，系统性地研究检索语料库、检索算法、重排序器、大型语言模型主干以及评价指标这五个组件的变化对整体准确性的影响，并通过多语言法律基准（如KBL、LegalBench、LawBench）验证了其有效性。

链接: https://arxiv.org/abs/2504.01840
作者: Minhu Park,Hongseok Oh,Eunkyung Choi,Wonseok Hwang
机构: University of Seoul (首尔大学); LBOX (LBOX)
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benches including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at this https URL.
zh

[NLP-16] YourBench: Easy Custom Evaluation Sets for Everyone

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）评估中存在的两大关键问题：传统静态基准测试容易饱和且存在污染，而人工评估又成本高昂且耗时，这阻碍了及时或领域特定的评估，而这类评估对于实际应用至关重要。论文提出的关键解决方案是YourBench，这是一个新颖的开源框架，通过动态、自动化生成可靠、最新且领域定制化的基准测试，从而以低成本且无需人工标注的方式从用户提供的文档中直接构建评估工具。YourBench的核心创新在于其能够利用少量源文本高效复制多样化的MMLU子集，并完美保留原始基准中观察到的模型性能排名（Spearman相关系数=1），同时引入Tempora-0325数据集确保生成的数据基于输入内容而非依赖于模型后验参数知识。这一方法不仅验证了生成评估的质量，还通过算法检查和人工评估保证了结果的可靠性。

链接: https://arxiv.org/abs/2504.01833
作者: Sumuk Shashidhar,Clémentine Fourrier,Alina Lozovskia,Thomas Wolf,Gokhan Tur,Dilek Hakkani-Tür
机构: Huggingface (Hugging Face); UIUC (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in provided input instead of relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents, published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question answer pairs based on Tempora and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
zh

[NLP-17] Efficient Constant-Space Multi-Vector Retrieval ECIR2025

【速读】：该论文试图解决多向量检索方法（如ColBERT架构）因存储每个输入标记的向量而导致的高存储成本问题。解决方案的关键在于将文档编码为固定数量的向量，这些向量不再与输入标记直接绑定，从而在减少存储开销的同时，保持固定大小的磁盘表示，便于操作系统页管理，并且在MSMARCO篇章语料库和BEIR基准测试中验证了其有效性保留能力。

链接: https://arxiv.org/abs/2504.01818
作者: Sean MacAvaney,Antonio Mallia,Nicola Tonellotto
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: ECIR 2025

点击查看摘要

Abstract:Multi-vector retrieval methods, exemplified by the ColBERT architecture, have shown substantial promise for retrieval by providing strong trade-offs in terms of retrieval latency and effectiveness. However, they come at a high cost in terms of storage since a (potentially compressed) vector needs to be stored for every token in the input collection. To overcome this issue, we propose encoding documents to a fixed number of vectors, which are no longer necessarily tied to the input tokens. Beyond reducing the storage costs, our approach has the advantage that document representations become of a fixed size on disk, allowing for better OS paging management. Through experiments using the MSMARCO passage corpus and BEIR with the ColBERT-v2 architecture, a representative multi-vector ranking model architecture, we find that passages can be effectively encoded into a fixed number of vectors while retaining most of the original effectiveness.
zh

[NLP-18] Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

【速读】：该论文旨在探究大型语言模型（Large Language Models, LLMs）在预训练数据存在极端语言不平衡的情况下仍展现出显著多语言能力的原因，重点关注预训练语料库的作用。论文的关键发现是代码切换（code-switching）——在同一上下文中交替使用不同语言的现象——对于多语言能力至关重要。为此，作者分析了预训练语料库中的代码切换现象，将其分为四种类型，并评估其对多语言性能的影响。此外，为了进一步探索代码切换在预训练期间对语言对齐的潜力，研究引入了合成代码切换（synthetic code-switching）策略，并通过逐步增加合成数据规模，观察到基准测试和表示空间均获得显著提升。实验结果表明，结合合成代码切换数据能够实现更好的语言对齐，并且对资源丰富、中等和稀缺的语言均具有良好的泛化能力。因此，该研究的核心解决方案在于利用合成代码切换数据来优化语言对齐效果。

链接: https://arxiv.org/abs/2504.01801
作者: Zhijun Wang,Jiahuan Li,Hao Zhou,Rongxiang Weng,Jingang Wang,Xin Huang,Xue Han,Junlan Feng,Chao Deng,Shujian Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
zh

[NLP-19] OpenThaiGPT 1.6 and R1: Thai-Centric Open Source and Reasoning Large Language Models

【速读】：该论文旨在解决提升泰语大型语言模型（Large Language Models, LLMs）在泛化能力和推理能力方面表现的问题。为实现这一目标，论文提出了两种具有不同方法论的泰语中心化大型语言模型：OpenThaiGPT 1.6 (OTG-1.6) 和 OpenThaiGPT R1 (OTG-R1)。其中，OTG-1.6通过任务算术模型合并（Task Arithmetic model merging）来增强广泛的泛化能力；而OTG-R1则结合多阶段训练与“少即是多”推理假设（Less-Is-More Reasoning Hypothesis, LIMO）以支持高级推理能力。关键在于这两种模型分别采用了创新性的方法来优化其特定的能力方向，并通过基准测试验证了其在泰语相关任务上的卓越性能，从而确立了新的性能标准。

链接: https://arxiv.org/abs/2504.01789
作者: Sumeth Yuenyong,Thodsaporn Chay-intr,Kobkrit Viriyayudhakorn
机构: Department of Computer Science, Faculty of Engineering, Mahidol University (玛希敦大学), Thailand; iApp Technology Co., Ltd. (泰国iApp科技有限公司), Thailand; Intelligent Informatics and Service Innovation Research Center (智能信息与服务创新研究中心), Thailand; Artificial Intelligence Entrepreneur Association of Thailand (AIEAT) (泰国人工智能企业家协会), Thailand
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present OpenThaiGPT 1.6 and R1 (OTG-1.6 and OTG-R1), Thai-centric Large Language Models (LLMs) developed through distinct methodologies to enhance generalization and reasoning capabilities. OTG-1.6 employs Task Arithmetic model merging for broad generalization, while OTG-R1 integrates multi-stage training with the Less-Is-More Reasoning Hypothesis (LIMO) for advanced reasoning. Benchmark evaluations demonstrate superior performance across Thai language tasks, achieving competitive results against larger-scale open-source Thai LLMs. This paper details the proposed models, training processes, benchmarks, and results, highlighting improvements over previous models and establishing new performance standards for Thai-centric LLMs.
zh

[NLP-20] Style over Substance: Distilled Language Models Reason Via Stylistic Replication

【速读】：该论文试图解决的问题是：尽管通过知识蒸馏将详细推理轨迹（Reasoning Traces）转移到较小的指令微调模型中可以显著提升性能，但这些模型在推理过程中究竟继承了多少实质性的推理能力尚不明确，特别是是否主要依赖于表面级的风格模式。
解决方案的关键在于引入两个新的数据集：一个是从实际推理轨迹中提取的新兴推理轨迹数据集，另一个是专门设计的合成数据集，用于精确复制和分析这些风格模式对蒸馏模型推理能力的影响。研究发现，训练在合成轨迹上的模型表现与原始模型相当，表明蒸馏后的推理能力很大程度上依赖于表面级的风格模式，并且即使这些轨迹引导得出错误答案，模型性能仍有所提高。这一结果揭示了如何利用风格模式来高效增强不同语言模型家族的推理能力。

链接: https://arxiv.org/abs/2504.01738
作者: Philip Lippmann,Jie Yang
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets – a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns – to precisely examine their influence on distilled models’ reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.
zh

[NLP-21] InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在基于上下文学习（In-Context Learning, ICL）过程中因有限上下文窗口导致的性能约束，特别是在处理超长上下文时的挑战。为了解决这一问题，论文提出了InfiniteICL框架，其关键是将LLMs中的上下文和参数与人类认知系统中的短时记忆和长时记忆进行类比，专注于将临时的上下文知识转化为永久的参数更新。这种方法通过上下文知识的提取、选择和整合原则，实现了理论上无限上下文集成的能力，同时显著降低了内存使用，并保持了对不同输入长度的鲁棒性能。

链接: https://arxiv.org/abs/2504.01707
作者: Bowen Cao,Deng Cai,Wai Lam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In-context learning (ICL) is critical for large language models (LLMs), but its effectiveness is constrained by finite context windows, particularly in ultra-long contexts. To overcome this, we introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory in human cognitive systems, focusing on transforming temporary context knowledge into permanent parameter updates. This approach significantly reduces memory usage, maintains robust performance across varying input lengths, and theoretically enables infinite context integration through the principles of context knowledge elicitation, selection, and consolidation. Evaluations demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting across fact recall, grounded reasoning, and skill acquisition tasks. When conducting sequential multi-turn transformations on complex, real-world contexts (with length up to 2M tokens), our approach surpasses full-context prompting while using only 0.4% of the original contexts. These findings highlight InfiniteICL’s potential to enhance the scalability and efficiency of LLMs by breaking the limitations of conventional context window sizes.
zh

[NLP-22] oM-RL: Reinforcement Learning Unlocks Theory of Mind in Small LLM s

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在社会推理（尤其是心智理论 Theory of Mind, ToM）方面能力不足的问题。传统上，规则驱动的强化学习（Reinforcement Learning, RL）在结构化推理任务（如数学和逻辑推理）中已取得显著进展，但在社会推理领域，尤其是在推断他人心理状态的能力方面，其效果尚未得到充分探索。论文的关键解决方案在于通过强化学习方法，即使在参数规模较小的LLMs（0.5B至7B参数）中，也能有效解锁ToM推理能力。研究采用一个包含3200个多样化场景问题的数据集，发现基于RL训练的7B参数模型在Hi-ToM基准测试中达到了84.50%的准确率，超越了参数量远多于自身的GPT-4o和DeepSeek-v3等模型。此外，研究揭示了较大的模型（7B参数）能够通过一致的信念追踪保持稳定的推理性能，而较小的模型（≤3B参数）则容易出现推理崩溃。这一方法不仅展示了RL在提升社会认知推理方面的潜力，还弥合了结构化问题解决与复杂社会推断之间的差距。

链接: https://arxiv.org/abs/2504.01698
作者: Yi-Long Lu,Chunhui Zhang,Jiajun Song,Lifeng Fan,Wei Wang
机构: State Key Laboratory of General Artificial Intelligence (通用人工智能国家重点实验室), BIGAI (北京通用人工智能研究院), Beijing (北京), China (中国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in rule-based reinforcement learning (RL), applied during the post-training phase of large language models (LLMs), have significantly enhanced their capabilities in structured reasoning tasks such as mathematics and logical inference. However, the effectiveness of RL in social reasoning, particularly in Theory of Mind (ToM), the ability to infer others’ mental states, remains largely unexplored. In this study, we demonstrate that RL methods effectively unlock ToM reasoning capabilities even in small-scale LLMs (0.5B to 7B parameters). Using a modest dataset comprising 3200 questions across diverse scenarios, our RL-trained 7B model achieves 84.50% accuracy on the Hi-ToM benchmark, surpassing models like GPT-4o and DeepSeek-v3 despite significantly fewer parameters. While smaller models ( \leq 3B parameters) suffer from reasoning collapse, larger models (7B parameters) maintain stable performance through consistent belief tracking. Additionally, our RL-based models demonstrate robust generalization to higher-order, out-of-distribution ToM problems, novel textual presentations, and previously unseen datasets. These findings highlight RL’s potential to enhance social cognitive reasoning, bridging the gap between structured problem-solving and nuanced social inference in LLMs.
zh

[NLP-23] sting Low-Resource Language Support in LLM s Using Language Proficiency Exams: the Case of Luxembourgish

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在低资源语言（如卢森堡语）评估工具和数据集稀缺的问题。论文的关键解决方案是探索语言能力考试作为评估卢森堡语大型语言模型的有效性，并发现大模型（如ChatGPT、Claude和DeepSeek-R1）通常表现优异，而小模型则表现较弱，同时指出这些考试成绩可以预测其他自然语言处理（NLP）任务的表现。

链接: https://arxiv.org/abs/2504.01667
作者: Cedric Lothritz,Jordi Cabot
机构: Luxembourg Institute of Science and Technology (卢森堡科学技术研究院)
类目: Computation and Language (cs.CL)
备注: 18 pages, 2 figures, 11 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have become an increasingly important tool in research and society at large. While LLMs are regularly used all over the world by experts and lay-people alike, they are predominantly developed with English-speaking users in mind, performing well in English and other wide-spread languages while less-resourced languages such as Luxembourgish are seen as a lower priority. This lack of attention is also reflected in the sparsity of available evaluation tools and datasets. In this study, we investigate the viability of language proficiency exams as such evaluation tools for the Luxembourgish language. We find that large models such as ChatGPT, Claude and DeepSeek-R1 typically achieve high scores, while smaller models show weak performances. We also find that the performances in such language exams can be used to predict performances in other NLP tasks.
zh

[NLP-24] Horizon Scans can be accelerated using novel information retrieval and artificial intelligence tools

【速读】：该论文旨在解决医疗领域前景扫描（Horizon Scanning）面临的挑战，特别是信息检索与分析效率低下以及从非结构化数据源（如新闻）提取有价值信号的问题。论文提出的关键解决方案是开发SCANAR和AIDOC两个开源工具。SCANAR通过自动化处理新闻文章的检索与整理，提供去重及无监督相关性排序等功能；而AIDOC则利用人工智能技术，基于语义相似性神经网络重新排序文本数据，优先筛选出可能的相关条目供人工审查，从而大幅减少人工审核的工作量。关键在于结合自动化流程与人工智能技术，显著提升了信息处理效率与准确性。

链接: https://arxiv.org/abs/2504.01627
作者: Lena Schmidt,Oshin Sharma,Chris Marshall,Sonia Garcia Gonzalez Moral
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Introduction: Horizon scanning in healthcare assesses early signals of innovation, crucial for timely adoption. Current horizon scanning faces challenges in efficient information retrieval and analysis, especially from unstructured sources like news, presenting a need for innovative tools. Methodology: The study introduces SCANAR and AIDOC, open-source Python-based tools designed to improve horizon scanning. SCANAR automates the retrieval and processing of news articles, offering functionalities such as de-duplication and unsupervised relevancy ranking. AIDOC aids filtration by leveraging AI to reorder textual data based on relevancy, employing neural networks for semantic similarity, and subsequently prioritizing likely relevant entries for human review. Results: Twelve internal datasets from horizon scans and four external benchmarking datasets were used. SCANAR improved retrieval efficiency by automating processes previously dependent on manual labour. AIDOC displayed work-saving potential, achieving around 62% reduction in manual review efforts at 95% recall. Comparative analysis with benchmarking data showed AIDOC’s performance was similar to existing systematic review automation tools, though performance varied depending on dataset characteristics. A smaller case-study on our news datasets shows the potential of ensembling large language models within the active-learning process for faster detection of relevant articles across news datasets. Conclusion: The validation indicates that SCANAR and AIDOC show potential to enhance horizon scanning efficiency by streamlining data retrieval and prioritisation. These tools may alleviate methodological limitations and allow broader, swifter horizon scans. Further studies are suggested to optimize these models and to design new workflows and validation processes that integrate large language models.
zh

[NLP-25] Representation Bending for Large Language Model Safety

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在实际应用中因有害内容生成及更广泛的社会危害所引发的安全风险问题，特别是这些风险因对抗性攻击、微调漏洞以及LLMs在高风险环境中的广泛应用而加剧。现有增强安全性的方法（如基于人类反馈的微调或对抗性训练）存在局限性，它们通常针对特定威胁设计，难以泛化到未见过的攻击场景，或者需要人工系统级防御，缺乏可扩展性。

论文提出RepBend这一创新方法，其关键在于从根本上干扰LLMs中潜在有害行为背后的表征。通过将激活引导（activation steering，即通过简单的向量算术在推理阶段调整模型行为）引入基于损失的微调框架，RepBend提供了一种可扩展的解决方案来提升模型的潜在安全性。实验结果表明，RepBend在多种越狱基准测试中实现了最先进的性能，相比先前的方法（如Circuit Breaker、RMU和NPO），能够将攻击成功率降低高达95%，同时对模型的可用性和通用能力几乎没有负面影响。

链接: https://arxiv.org/abs/2504.01550
作者: Ashkan Yousefpour,Taeheon Kim,Ryan S. Kwon,Seungbeen Lee,Wonje Jeung,Seungju Han,Alvin Wan,Harrison Ngan,Youngjae Yu,Jonghyun Choi
机构: Seoul National University (首尔国立大学); Yonsei University (延世大学); AIM Intelligence (AIM 智能); University of Michigan (密歇根大学); Stanford University (斯坦福大学); Amazon AWS (亚马逊AWS)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering model’s behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
zh

[NLP-26] Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

【速读】：该论文试图解决大型语言模型（Large Language Model, LLM）预训练数据质量过滤中缺乏对不同文本类型贡献的深入理解的问题。传统方法通常基于统计指标或LLM辅助标注系统将数据分为“有价值”与“无价值”两类，但未能充分揭示预训练数据的文体（register）对模型性能的具体影响。论文的关键解决方案在于首次利用语料库语言学中广泛采用的文体分类标准，对预训练数据进行精细化筛选，并通过对比实验研究不同文体对LLM性能的影响。研究发现，不同的文体对模型表现有显著差异，例如新闻类文本表现不佳，而意见类文本（如评论和博客）则非常有益；同时，结合表现良好的文体类别（如如何指导、信息描述和意见类）可以大幅提升模型性能。这一研究揭示了文体作为模型变异性的重要解释变量，为未来的数据选择实践提供了更明确的方向。

链接: https://arxiv.org/abs/2504.01542
作者: Amanda Myntti,Erik Henriksson,Veronika Laippala,Sampo Pyysalo
机构: University of Turku (图尔库大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters deemed as valuable examples, others discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilizing registers (also known as genres) - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We perform comparative studies by training models with register classified data and evaluating them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.
zh

[NLP-27] From Smør-re-brød to Subwords: Training LLM s on Danish One Morpheme at a Time

【速读】：该论文试图解决的问题是如何通过结合语言学原理（尤其是形态学分割）改进针对丹麦语的语言模型性能。现有基于Transformer的顶级语言模型通常采用子词级标记化技术（如Byte-Pair-Encoding, BPE），但这些方法往往忽视了语言特定的形态结构。论文的关键解决方案是利用一个注释好的丹麦语形态数据集训练半监督的形态分割模型，开发出优化丹麦语形态特性的定制化分词器，并评估其在形态分割及下游任务中的表现。实验结果表明，使用形态学分词器的模型在F1分数上显著优于传统BPE分词器，并在多项下游任务中展现出更优的性能。这表明将丹麦语特有的形态分割策略融入分词器设计能够有效提升生成式Transformer模型在丹麦语处理中的表现。

链接: https://arxiv.org/abs/2504.01540
作者: Mikkel Wildner Kildeberg,Emil Allerslev Schledermann,Nicolaj Larsen,Rob van der Goot
机构: IT University of Copenhagen (哥本哈根信息技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we believe is fundamental for understanding language-specific word structure. In this study, we leverage an annotated Danish morphological dataset to train a semisupervised model for morphological segmentation, enabling the development of tokenizers optimized for Danish morphology. We evaluate four distinct tokenizers, including two custom morphological tokenizers, by analyzing their performance in morphologically segmenting Danish words. Additionally, we train two generative transformer models, \textitCerebrasGPT-111M and \textitLLaMA-3.2 1B, using these tokenizers and evaluate their downstream performance. Our findings reveal that our custom-developed tokenizers substantially enhance morphological segmentation, achieving an F1 score of 58.84, compared to 39.28 achieved by a Danish BPE tokenizer. In downstream tasks, models trained with our morphological tokenizers outperform those using BPE tokenizers across different evaluation metrics. These results highlight that incorporating Danish morphological segmentation strategies into tokenizers leads to improved performance in generative transformer models on Danish language
zh

[NLP-28] Context-Aware Toxicity Detection in Multiplayer Games: Integrating Domain-Adaptive Pretraining and Match Metadata

【速读】：该论文旨在解决在线竞技视频游戏中因毒性言论（toxicity）带来的负面影响，特别是在传统毒性检测模型难以准确识别毒性言论的问题。这些模型通常仅关注孤立的消息而忽略了更广泛的上下文信息，而在游戏环境中，玩家互动往往包含专业俚语、缩写和拼写错误，进一步增加了检测难度。此外，毒性言论在文本中的出现频率较低，使得检测更具挑战性。

解决方案的关键在于将RoBERTa语言模型（RoBERTa LLM）适配为适合游戏环境的毒性检测工具。论文通过结合文本与非文本上下文信息，并利用元数据增强预训练嵌入（pretrained embeddings），同时针对游戏领域的特定俚语和语言特点进行领域自适应预训练（domain adaptive pretraining），从而更好地捕捉玩家互动中的细微差异。此外，通过使用两个游戏数据集（Defense of the Ancients 2 和 Call of Duty: Modern Warfare III），研究验证了不同上下文来源（如元数据、先前交互等）的重要性及其最佳利用方式，以提升检测性能。这项工作强调了采用上下文感知和领域特定方法进行主动监管的重要性。

链接: https://arxiv.org/abs/2504.01534
作者: Adrien Schurger-Foy,Rafal Dariusz Kocielnik,Caglar Gulcehre,R. Michael Alvarez
机构: EPFL; Caltech
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The detrimental effects of toxicity in competitive online video games are widely acknowledged, prompting publishers to monitor player chat conversations. This is challenging due to the context-dependent nature of toxicity, often spread across multiple messages or informed by non-textual interactions. Traditional toxicity detectors focus on isolated messages, missing the broader context needed for accurate moderation. This is especially problematic in video games, where interactions involve specialized slang, abbreviations, and typos, making it difficult for standard models to detect toxicity, especially given its rarity. We adapted RoBERTa LLM to support moderation tailored to video games, integrating both textual and non-textual context. By enhancing pretrained embeddings with metadata and addressing the unique slang and language quirks through domain adaptive pretraining, our method better captures the nuances of player interactions. Using two gaming datasets - from Defense of the Ancients 2 (DOTA 2) and Call of Duty ^\circledR : Modern Warfare ^\circledR III (MWIII) we demonstrate which sources of context (metadata, prior interactions…) are most useful, how to best leverage them to boost performance, and the conditions conducive to doing so. This work underscores the importance of context-aware and domain-specific approaches for proactive moderation.
zh

[NLP-29] Redefining technology for indigenous languages

【速读】：该论文试图解决原住民语言（Indigenous Languages）因被低估而导致的语言权利立法需求不足的问题，并探讨如何通过技术手段有效促进这些语言的复兴。论文指出，外来技术往往适得其反，而由社区内部开发的技术则更具效力。解决方案的关键在于以参与式环境为基础，将原住民知识融入大型语言模型（LLMs），从而实现技术领域的丰富与文化知识的有效交流。

链接: https://arxiv.org/abs/2504.01522
作者: Silvia Fernandez-Sabido,Laura Peniche-Sabido
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: in Spanish language

点击查看摘要

Abstract:In this paper, we offer an overview of indigenous languages, identifying the causes of their devaluation and the need for legislation on language rights. We review the technologies used to revitalize these languages, finding that when they come from outside, they often have the opposite effect to what they seek; however, when developed from within communities, they become powerful instruments of expression. We propose that the inclusion of Indigenous knowledge in large language models (LLMs) will enrich the technological landscape, but must be done in a participatory environment that encourages the exchange of knowledge.
zh

[NLP-30] Chain of Correction for Full-text Speech Recognition with Large Language Models

【速读】：该论文旨在解决基于大语言模型（Large Language Models, LLMs）的自动语音识别（Automatic Speech Recognition, ASR）全文本错误修正中的稳定性、可控性、完整性和流畅性等挑战。论文的关键解决方案是提出了一种称为链式修正（Chain of Correction, CoC）的框架，该框架通过预识别的文本作为引导，在常规的多轮对话格式中分段修正错误，并利用完整的预识别文本提供上下文以更好地理解全局语义并保持对整体内容的全面把握。实验结果表明，CoC 框架在修正 ASR 输出的全文本错误方面表现出色，显著优于基线和基准系统。

链接: https://arxiv.org/abs/2504.01519
作者: Zhiyuan Tang,Dong Wang,Zhikai Zhou,Yong Liu,Shen Huang,Shidong Shang
机构: Tencent Ethereal Audio Lab, Tencent (腾讯音视频实验室, 腾讯); Center for Speech and Language Technologies, BNRist, Tsinghua University (清华大学语音和语言技术中心, 清华大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model on extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.
zh

[NLP-31] PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

【速读】：该论文旨在解决现有事件预测基准中存在的非推理性问题，即部分预测问题可能缺乏有效的支持性理由或因果依据，从而导致无法通过现有系统可靠推断。为解决此问题，论文提出了一个新的基准数据集PROPHET，其关键创新在于引入了一种名为“因果干预似然性（Causal Intervened Likelihood, CIL）”的统计度量方法。CIL通过因果推理评估预测问题的可推导性，确保数据集中问题的合理性与有效性。此外，通过使用CIL过滤原始数据，论文构建了一个更可靠的事件预测基准，为后续研究提供了有价值的实验基础和方向指引。

链接: https://arxiv.org/abs/2504.01509
作者: Zhengwei Tao,Zhi Jin,Bincheng Li,Xiaoying Bai,Haiyan Zhao,Chengfeng Dou,Xiancai Chen,Jia Li,Linyu Li,Chongyang Tao
机构: Key Laboratory of High Confidence Software Technologies (PKU), MOE, China (高置信软件技术教育部重点实验室（北京大学）); School of Computer Science, Peking University (北京大学计算机学院); Guangzhou University (广州大学); Advanced Institute of Big Data (大数据先进技术研究院); SKLSDE Lab, Beihang University (北航软件开发环境国家重点实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Predicting future events stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate the forecasting capabilities by formalizing the event prediction as a retrieval-augmented generation (RAG) and reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles. However, because there is no consideration on whether the questions can be supported by valid or sufficient supporting rationales, some of the questions in these benchmarks may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for event prediction. Through extensive experiments, we first demonstrate the validity of CIL and in-depth investigations into event prediction with the aid of CIL. Subsequently, we evaluate several representative prediction systems on PROPHET, drawing valuable insights for future directions.
zh

[NLP-32] CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models

【速读】：该论文旨在解决跨模态知识检索（cross-mode knowledge retrieval）的问题，即语言模型在以一种模态（mode）接受训练后，难以在被另一种模态查询时准确检索到所学知识的能力。论文通过定量研究发现，当模型基于多源数据集（如Wikipedia和TinyStories）训练时，其在非原始训练模态下进行知识检索的准确性显著下降。论文指出，尽管尝试通过对数据集进行重写（dataset rewriting）来缓解这一问题，但这种方法需要极高的重写工作量，并呈现类似“S”形的关系。为此，论文提出了一种名为CASCADE的新预训练算法作为解决方案，该算法利用具有不同序列长度的级联数据集（cascading datasets），以捕捉不同尺度的知识。实验结果表明，CASCADE的表现优于数据集重写方法，即使在压缩为单一模型并采用统一损失函数的情况下依然如此。因此，解决方案的关键在于CASCADE算法的设计，它通过引入级联数据集的方式有效提升了语言模型跨模态检索知识的能力。

链接: https://arxiv.org/abs/2504.01450
作者: Runlong Zhou,Yi Zhang
机构: University of Washington (华盛顿大学); Apple (苹果公司); Microsoft Research, Redmond (微软研究，雷德蒙德)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models often struggle with cross-mode knowledge retrieval – the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models’ ability to access knowledge independently of its presentational format.
zh

[NLP-33] Refining Interactions: Enhancing Anisotropy in Graph Neural Networks with Language Semantics ICME2025

【速读】：本文旨在解决现有将大型语言模型（Large Language Models, LLMs）与图神经网络（Graph Neural Networks, GNNs）结合的方法在处理文本属性图（Text Attribute Graphs, TAGs）时存在的问题。这些方法通常直接将图结构的文本描述或邻近节点的文本输入到LLMs中，导致LLMs仅将结构信息视为通用上下文文本，从而限制了其在图相关任务中的有效性。为了解决这一问题，论文提出了LanSAGNN（语言语义各向异性图神经网络），该框架将各向异性GNN的概念扩展到自然语言层面。其关键是利用LLMs为节点对提取定制化的语义信息，有效捕捉节点关系中的独特交互。此外，还提出了一种高效的双层LLMs微调架构，以更好地使LLMs的输出与图任务对齐。实验结果表明，LanSAGNN在不增加复杂性的情况下显著提升了现有基于LLMs的方法，并表现出较强的抗干扰鲁棒性。

链接: https://arxiv.org/abs/2504.01429
作者: Zhaoxing Li,Xiaoming Zhang,Haifeng Zhang,Chengxiang Liu
机构: Institute of Physical Science and Information Technology, Anhui University (安徽大学物理科学与信息技术学院); School of Mathematical Sciences, Anhui University (安徽大学数学科学学院); Qinghai Institute of Science and Technology Information (青海科学技术信息研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ICME 2025

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Graph Neural Networks (GNNs) has recently been explored to enhance the capabilities of Text Attribute Graphs (TAGs). Most existing methods feed textual descriptions of the graph structure or neighbouring nodes’ text directly into LLMs. However, these approaches often cause LLMs to treat structural information simply as general contextual text, thus limiting their effectiveness in graph-related tasks. In this paper, we introduce LanSAGNN (Language Semantic Anisotropic Graph Neural Network), a framework that extends the concept of anisotropic GNNs to the natural language level. This model leverages LLMs to extract tailor-made semantic information for node pairs, effectively capturing the unique interactions within node relationships. In addition, we propose an efficient dual-layer LLMs finetuning architecture to better align LLMs’ outputs with graph tasks. Experimental results demonstrate that LanSAGNN significantly enhances existing LLM-based methods without increasing complexity while also exhibiting strong robustness against interference.
zh

[NLP-34] FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations

【速读】：该论文旨在解决AI驱动招聘中因种族和性别偏见导致的公平性问题。论文的关键解决方案是引入了一个名为FAIRE（Fairness Assessment In Resume Evaluation）的基准，用于评估大型语言模型（LLMs）在不同行业简历评估中的种族和性别偏见。通过直接评分（direct scoring）和排名（ranking）两种方法，测量模型性能在简历微小改动反映不同种族或性别身份时的变化。这一基准不仅揭示了每种模型都存在一定程度的偏见，且其幅度和方向差异显著，还为减少AI驱动招聘中的偏见提供了明确的考察方式和有价值的洞见。

链接: https://arxiv.org/abs/2504.01420
作者: Athena Wen,Tanush Patil,Ansh Saxena,Yicheng Fu,Sean O’Brien,Kevin Zhu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods-direct scoring and ranking-to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: this https URL.
zh

[NLP-35] Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval WWW2025

【速读】：该论文旨在解决传统稀疏和密集检索方法难以利用通用世界知识且无法有效捕捉查询与产品细微特征的问题。此外，现有基于大型语言模型（Large Language Models, LLMs）的方法在生成产品检索标识符时面临挑战，包括静态语义ID或产品术语集的方式要么未能充分利用LLMs中的嵌入知识，要么因查询与产品之间的词分布差异导致召回效率低下，尤其是在查询包含大量属性时，生成的标识符数量庞大且质量评估困难。

为应对这些挑战，论文提出了一种新的电子商务检索范式：生成式检索与对齐模型（Generative Retrieval and Alignment Model, GRAM）。其关键在于通过联合训练查询和产品的文本信息生成共享的文本标识符代码，从而有效弥合查询与产品之间的差距。GRAM采用协同对齐策略优化检索效率，并引入查询-产品评分机制以比较不同代码下的产品值，进一步提升检索性能。离线和在线A/B测试结果表明，GRAM显著优于传统模型及最新的生成式检索模型，验证了其有效性与实用性。

链接: https://arxiv.org/abs/2504.01403
作者: Ming Pang,Chunyuan Yuan,Xiaoyu He,Zheng Fang,Donghao Xie,Fanyi Qu,Xue Jiang,Changping Peng,Zhangang Lin,Zheng Luo,Jingping Shao
机构: JD.COM(Beijing China)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by WWW2025

点击查看摘要

Abstract:Traditional sparse and dense retrieval methods struggle to leverage general world knowledge and often fail to capture the nuanced features of queries and products. With the advent of large language models (LLMs), industrial search systems have started to employ LLMs to generate identifiers for product retrieval. Commonly used identifiers include (1) static/semantic IDs and (2) product term sets. The first approach requires creating a product ID system from scratch, missing out on the world knowledge embedded within LLMs. While the second approach leverages this general knowledge, the significant difference in word distribution between queries and products means that product-based identifiers often do not align well with user search queries, leading to missed product recalls. Furthermore, when queries contain numerous attributes, these algorithms generate a large number of identifiers, making it difficult to assess their quality, which results in low overall recall efficiency. To address these challenges, this paper introduces a novel e-commerce retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM employs joint training on text information from both queries and products to generate shared text identifier codes, effectively bridging the gap between queries and products. This approach not only enhances the connection between queries and products but also improves inference efficiency. The model uses a co-alignment strategy to generate codes optimized for maximizing retrieval efficiency. Additionally, it introduces a query-product scoring mechanism to compare product values across different codes, further boosting retrieval efficiency. Extensive offline and online A/B testing demonstrates that GRAM significantly outperforms traditional models and the latest generative retrieval models, confirming its effectiveness and practicality. Comments: Accepted by WWW2025 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2504.01403 [cs.IR] (or arXiv:2504.01403v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.01403 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-36] oolACE-R: Tool Learning with Adaptive Self-Refinement

【速读】：本文旨在解决现有工具学习方法主要关注于通过数据合成来微调大型语言模型（Large Language Models, LLMs）以有效调用外部工具的问题，而忽视了充分激发模型潜力的方法。为了解决这一局限性，论文提出了一种名为ToolACE-R的新方法，其关键是引入了自适应自我精化（adaptive self-refinement）机制用于工具调用。该方法采用模型感知的迭代训练过程，根据模型能力的发展逐步纳入更多训练样本，并允许LLMs在无需外部反馈的情况下迭代优化其工具调用。此外，通过集成一种自适应机制来调整推理时间，使模型能够自主决定何时停止精化过程，从而进一步提高计算效率。实验结果表明，即使不进行任何精化，ToolACE-R也能达到与基于API的先进模型相当的性能，并且通过自适应自我精化可以更高效地提升性能，展示了该方法的有效性及其对不同规模基础模型的良好兼容性。

链接: https://arxiv.org/abs/2504.01400
作者: Xingshan Zeng,Weiwen Liu,Xu Huang,Zezhong Wang,Lingzhi Wang,Liangyou Li,Yasheng Wang,Lifeng Shang,Xin Jiang,Ruiming Tang,Qun Liu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tool learning, which allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks, has emerged as a promising avenue for extending model capabilities. However, current approaches primarily focus on data synthesis for fine-tuning LLMs to invoke tools effectively, largely ignoring how to fully stimulate the potential of the model. In this paper, we propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations. Our approach features a model-aware iterative training procedure that progressively incorporates more training samples based on the model’s evolving capabilities. Additionally, it allows LLMs to iteratively refine their tool calls, optimizing performance without requiring external feedback. To further enhance computational efficiency, we integrate an adaptive mechanism when scaling the inference time, enabling the model to autonomously determine when to stop the refinement process. We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models, even without any refinement. Furthermore, its performance can be further improved efficiently through adaptive self-refinement. Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes, offering a promising direction for more efficient tool learning.
zh

[NLP-37] An Illusion of Progress? Assessing the Current State of Web Agents

【速读】：该论文试图解决的问题是如何准确评估基于大型语言模型（Large Language Models, LLMs）的自主网络代理（web agents）的能力，并揭示现有基准测试中存在的不足。论文的关键解决方案包括两个方面：首先，引入了一个名为Online-Mind2Web的在线评估基准，包含300个涵盖136个网站的多样化且真实的任务，以更贴近真实用户使用场景的方式评估网络代理的能力；其次，开发了一种新颖的LLM-as-a-Judge自动评估方法，该方法能够实现与人工判断约85%的一致性，显著优于现有方法。通过这些手段，论文不仅揭示了先前报告结果中的乐观偏差，还为未来研究提供了全面的对比分析和启发。

链接: https://arxiv.org/abs/2504.01382
作者: Tianci Xue,Weijian Qi,Tianneng Shi,Chan Hee Song,Boyu Gou,Dawn Song,Huan Sun,Yu Su
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages, 16 figures, 4 tables

点击查看摘要

Abstract:As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.
zh

[NLP-38] LITE: LLM -Impelled efficient Taxonomy Evaluation

【速读】：该论文旨在解决大规模本体（Ontology）评估中的效率、公平性和一致性等挑战。为应对这些挑战，论文提出了一种基于大型语言模型（LLM）的评估方法LITE。其关键解决方案包括采用自上而下的分层评估策略，将本体分解为可管理的子结构，并通过交叉验证和标准化输入格式确保结果可靠性；同时引入惩罚机制处理极端情况，结合与任务目标紧密相关的评估指标提供定量分析和定性洞察，从而实现对本体语义错误、逻辑矛盾和结构缺陷的高效识别及改进建议。

链接: https://arxiv.org/abs/2504.01369
作者: Lin Zhang,Zhouhong Gu,Suhang Zheng,Tao Wang,Tianyu Li,Hongwei Feng,Yanghua Xiao
机构: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University (复旦大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper presents LITE, an LLM-based evaluation method designed for efficient and flexible assessment of taxonomy quality. To address challenges in large-scale taxonomy evaluation, such as efficiency, fairness, and consistency, LITE adopts a top-down hierarchical evaluation strategy, breaking down the taxonomy into manageable substructures and ensuring result reliability through cross-validation and standardized input formats. LITE also introduces a penalty mechanism to handle extreme cases and provides both quantitative performance analysis and qualitative insights by integrating evaluation metrics closely aligned with task objectives. Experimental results show that LITE demonstrates high reliability in complex evaluation tasks, effectively identifying semantic errors, logical contradictions, and structural flaws in taxonomies, while offering directions for improvement. Code is available at this https URL .
zh

[NLP-39] asks and Roles in Legal AI: Data Curation Annotation and Verification

【速读】：该论文旨在解决法律领域应用人工智能（AI）工具所面临的挑战，特别是围绕数据获取、标注及结果验证这三个关键实践领域的问题。论文指出，法律文档的不一致性、模拟性以及分散性使得高质量数据的获取变得困难，同时强调缺乏足够的训练数据会限制AI性能；法律数据的标注通常需要专业知识以识别复杂的法律现象，而现有AI系统在某些场景下的表现仍显不足；此外，AI生成的结果必须具备可验证性和可信度，才能在法律实践中发挥实际价值。解决方案的关键在于跨学科合作与开放资源共享，通过法律与AI从业者的共同努力，开发出高性能且可靠的新型AI工具，以克服上述挑战并推动法律领域的AI应用发展。

链接: https://arxiv.org/abs/2504.01349
作者: Allison Koenecke,Jed Stiglitz,David Mimno,Matthew Wilkens
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The application of AI tools to the legal field feels natural: large legal document collections could be used with specialized AI to improve workflow efficiency for lawyers and ameliorate the “justice gap” for underserved clients. However, legal documents differ from the web-based text that underlies most AI systems. The challenges of legal AI are both specific to the legal domain, and confounded with the expectation of AI’s high performance in high-stakes settings. We identify three areas of special relevance to practitioners: data curation, data annotation, and output verification. First, it is difficult to obtain usable legal texts. Legal collections are inconsistent, analog, and scattered for reasons technical, economic, and jurisdictional. AI tools can assist document curation efforts, but the lack of existing data also limits AI performance. Second, legal data annotation typically requires significant expertise to identify complex phenomena such as modes of judicial reasoning or controlling precedents. We describe case studies of AI systems that have been developed to improve the efficiency of human annotation in legal contexts and identify areas of underperformance. Finally, AI-supported work in the law is valuable only if results are verifiable and trustworthy. We describe both the abilities of AI systems to support evaluation of their outputs, as well as new approaches to systematic evaluation of computational systems in complex domains. We call on both legal and AI practitioners to collaborate across disciplines and to release open access materials to support the development of novel, high-performing, and reliable AI tools for legal applications.
zh

[NLP-40] GTR: Graph-Table-RAG for Cross-Table Question Answering

【速读】：该论文旨在解决跨表问答（Cross-Table Question Answering）领域中可用数据不足以及现有方法在处理分布式答案时推理能力有限的问题。为填补这一空白，论文首先构建了一个包含60,000个表格和25,000个真实用户查询的多表基准数据集MultiTableQA。解决方案的关键在于提出了一种名为GTR（Graph-Table-RAG）的新框架，该框架通过将表格语料组织为异构图，采用层次化粗到细的检索过程提取相关表格，并结合图感知提示技术增强下游大型语言模型（LLMs）的表格推理能力。实验结果表明，GTR在保持高效部署的同时，显著提升了跨表问答性能，展示了其实际应用价值。

链接: https://arxiv.org/abs/2504.01346
作者: Jiaru Zou,Dongqi Fu,Sirui Chen,Xinrui He,Zihao Li,Yada Zhu,Jiawei Han,Jingrui He
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Meta AI (Meta); IBM Research
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Beyond pure text, a substantial amount of knowledge is stored in tables. In real-world scenarios, user questions often require retrieving answers that are distributed across multiple tables. GraphRAG has recently attracted much attention for enhancing LLMs’ reasoning capabilities by organizing external knowledge to address ad-hoc and complex questions, exemplifying a promising direction for cross-table question answering. In this paper, to address the current gap in available data, we first introduce a multi-table benchmark, MutliTableQA, comprising 60k tables and 25k user queries collected from real-world sources. Then, we propose the first Graph-Table-RAG framework, namely GTR, which reorganizes table corpora into a heterogeneous graph, employs a hierarchical coarse-to-fine retrieval process to extract the most relevant tables, and integrates graph-aware prompting for downstream LLMs’ tabular reasoning. Extensive experiments show that GTR exhibits superior cross-table question-answering performance while maintaining high deployment efficiency, demonstrating its real-world practical applicability.
zh

[NLP-41] Breaking BERT: Gradient Attack on Twitter Sentiment Analysis for Targeted Misclassification

【速读】：该论文旨在探究基于Transformer的双向编码器表示（BERT）模型在Twitter情感分析中的固有脆弱性，并提出一种构建目标对抗文本的框架，以在保持隐蔽性的同时欺骗这些模型。与传统的权重重分配方法不同，该框架的关键在于利用梯度来评估文本中单个词的重要性，通过白盒方法实现细粒度敏感性分析，识别对分类结果影响最大的词汇。解决方案的核心步骤包括：首先微调预训练的BERT模型以适应Twitter数据；其次分析模型梯度以对词汇重要性排序，并迭代替换为目标候选词，直至找到可行解；最后评估生成的对抗文本对自定义训练的情感分类模型的有效性，以衡量其成功规避分类检测的能力而不触发警报。

链接: https://arxiv.org/abs/2504.01345
作者: Akil Raj Subedi,Taniya Shah,Aswani Kumar Cherukuri,Thanos Vasilakos
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Social media platforms like Twitter have increasingly relied on Natural Language Processing NLP techniques to analyze and understand the sentiments expressed in the user generated content. One such state of the art NLP model is Bidirectional Encoder Representations from Transformers BERT which has been widely adapted in sentiment analysis. BERT is susceptible to adversarial attacks. This paper aims to scrutinize the inherent vulnerabilities of such models in Twitter sentiment analysis. It aims to formulate a framework for constructing targeted adversarial texts capable of deceiving these models, while maintaining stealth. In contrast to conventional methodologies, such as Importance Reweighting, this framework core idea resides in its reliance on gradients to prioritize the importance of individual words within the text. It uses a whitebox approach to attain fine grained sensitivity, pinpointing words that exert maximal influence on the classification outcome. This paper is organized into three interdependent phases. It starts with fine-tuning a pre-trained BERT model on Twitter data. It then analyzes gradients of the model to rank words on their importance, and iteratively replaces those with feasible candidates until an acceptable solution is found. Finally, it evaluates the effectiveness of the adversarial text against the custom trained sentiment classification model. This assessment would help in gauging the capacity of the adversarial text to successfully subvert classification without raising any alarm.
zh

[NLP-42] Foundations and Evaluations in NLP

【速读】：本文献旨在解决自然语言处理（NLP）领域中两个核心问题：一是创建语言资源，二是评估NLP系统性能。针对语言资源创建，作者开发了一种基于词素的韩语标注方案，该方案能够从形态学到语义全面捕捉语言特性，并在词性标注、依存句法分析及命名实体识别等任务中达到当前最先进的成果。此外，研究还深入分析了分词粒度对NLP系统性能的关键影响。在系统评估方面，作者提出了名为jp-algorithm的新框架，通过基于对齐的方法解决了诸如分词和句子边界检测（SBD）等预处理任务中的挑战。传统评估方法假定标准数据与系统输出具有相同的分词和句子长度，这限制了其在实际应用中的适用性。而jp-algorithm通过引入线性时间对齐技术克服了这些局限性，不仅增强了评估的准确性与灵活性，还保持了传统评估指标的复杂性。因此，该研究为处理形态学丰富的语言（如韩语）提供了重要见解，并为多样化端到端NLP系统的通用评估框架奠定了基础，同时对多语言资源开发和系统评估具有广泛意义。

链接: https://arxiv.org/abs/2504.01342
作者: Jungyeul Park
机构: The University of British Columbia (英属哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This memoir explores two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of NLP system performance. Over the past decade, my work has focused on developing a morpheme-based annotation scheme for the Korean language that captures linguistic properties from morphology to semantics. This approach has achieved state-of-the-art results in various NLP tasks, including part-of-speech tagging, dependency parsing, and named entity recognition. Additionally, this work provides a comprehensive analysis of segmentation granularity and its critical impact on NLP system performance. In parallel with linguistic resource development, I have proposed a novel evaluation framework, the jp-algorithm, which introduces an alignment-based method to address challenges in preprocessing tasks like tokenization and sentence boundary detection (SBD). Traditional evaluation methods assume identical tokenization and sentence lengths between gold standards and system outputs, limiting their applicability to real-world data. The jp-algorithm overcomes these limitations, enabling robust end-to-end evaluations across a variety of NLP tasks. It enhances accuracy and flexibility by incorporating linear-time alignment while preserving the complexity of traditional evaluation metrics. This memoir provides key insights into the processing of morphologically rich languages, such as Korean, while offering a generalizable framework for evaluating diverse end-to-end NLP systems. My contributions lay the foundation for future developments, with broader implications for multilingual resource development and system evaluation.
zh

[NLP-43] Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design NAACL2025

【速读】：该论文旨在解决Mixture-of-Experts (MoE) 模型在实际应用中效率难以充分发挥的问题，主要源于两个挑战：一是专家激活的不平衡导致模型或专家并行化过程中存在大量空闲时间以及资源利用率不足；二是专家并行化系统层面产生的巨大通信开销，特别是在专家路由组合数量庞大时。以往研究通常将这些问题归因于负载均衡问题或静态执行未能适应运行时动态工作负载的变化。

本文从一个新的视角出发，即对MoE路由策略进行更高阶的分析与设计，提出专家协作与专业化之间的权衡：部分专家倾向于广泛与其他专家协作，而另一些专家则更可能仅与特定子集的专家交互。实验表明，大多数专家表现出过度协作的现象，从而增加了重复向不同加速器发送令牌所引发的通信开销。

为此，论文提出了合作约束路由（Collaboration-Constrained Routing, C2R）策略，通过鼓励形成更多专业化专家组来改善专家利用效率，并结合专家的专业化特性优化MoE模型的实现。最终，在LLaMA-MoE和Qwen-MoE两种模型上分别实现了平均0.51%和0.33%的性能提升，并显著降低了GPU间all-to-all通信成本，相较于现有最先进的方法MegaBlocks额外节省了20%-30%的整体运行时间。

链接: https://arxiv.org/abs/2504.01337
作者: Mohan Zhang,Pingzhi Li,Jie Peng,Mufan Qiu,Tianlong Chen
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NAACL 2025

点击查看摘要

Abstract:Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. By employing a gating network to route input tokens, it selectively activates a subset of expert networks to process the corresponding token embeddings. However, in practice, the efficiency of MoE is challenging to achieve due to two key reasons: imbalanced expert activation, which leads to substantial idle time during model or expert parallelism, and insufficient capacity utilization; massive communication overhead, induced by numerous expert routing combinations in expert parallelism at the system level. Previous works typically formulate it as the load imbalance issue characterized by the gating network favoring certain experts over others or attribute it to static execution which fails to adapt to the dynamic expert workload at runtime. In this paper, we exploit it from a brand new perspective, a higher-order view and analysis of MoE routing policies: expert collaboration and specialization where some experts tend to activate broadly with others (collaborative), while others are more likely to activate only with a specific subset of experts (specialized). Our experiments reveal that most experts tend to be overly collaborative, leading to increased communication overhead from repeatedly sending tokens to different accelerators. To this end, we propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups, as well as to improve expert utilization, and present an efficient implementation of MoE that further leverages expert specialization. We achieve an average performance improvement of 0.51% and 0.33% on LLaMA-MoE and Qwen-MoE respectively across ten downstream NLP benchmarks, and reduce the all2all communication costs between GPUs, bringing an extra 20%-30% total running time savings on top of the existing SoTA, i.e. MegaBlocks.
zh

[NLP-44] On Data Synthesis and Post-training for Visual Abstract Reasoning

【速读】：该论文试图解决抽象视觉推理（Abstract Visual Reasoning, AVR）问题，针对大型视觉-语言模型（Vision-Language Models, VLMs）在该领域的局限性开展研究。论文的关键突破在于提出了一种创新的数据合成与微调后处理方法，通过逐步减轻任务难度，引导模型有效学习抽象视觉推理能力。这种方法不仅使70亿参数规模的模型（LLaVA-NeXT 7B）在代表性AVR基准测试中显著超越开源（如Qwen-2-VL-72B）和闭源（如GPT-4o）的强大VLMs，还保持了良好的多模态理解能力，为该领域提供了开创性的贡献。

链接: https://arxiv.org/abs/2504.01324
作者: Ke Zhu,Yu Wang,Jiangjiang Liu,Qunyi Xie,Shanshan Liu,Gang Zhang
机构: Nanjing University (南京大学); Baidu VIS (百度视觉感知实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper is a pioneering work attempting to address abstract visual reasoning (AVR) problems for large vision-language models (VLMs). We make a common LLaVA-NeXT 7B model capable of perceiving and reasoning about specific AVR problems, surpassing both open-sourced (e.g., Qwen-2-VL-72B) and closed-sourced powerful VLMs (e.g., GPT-4o) with significant margin. This is a great breakthrough since almost all previous VLMs fail or show nearly random performance on representative AVR benchmarks. Our key success is our innovative data synthesis and post-training process, aiming to fully relieve the task difficulty and elicit the model to learn, step by step. Our 7B model is also shown to be behave well on AVR without sacrificing common multimodal comprehension abilities. We hope our paper could serve as an early effort in this area and would inspire further research in abstract visual reasoning.
zh

[NLP-45] Adaptive Rectification Sampling for Test-Time Compute Scaling

【速读】：本文旨在解决大型语言模型（Large Language Models, LLMs）在利用测试时调整（Test-Time Scaling）方法提升性能过程中存在的问题，特别是自修正（self-correction）可能导致推理步骤冗余（token waste）以及可读性下降的问题。论文的关键在于提出了一种自适应修正采样方法（Adaptive Rectification Sampling, AR-Sampling），通过引入一个过程监督奖励模型（Process-Supervised Reward Model, PRM）作为验证器，并构建触发句（trigger sentences）来引导模型在适当的步骤进行细粒度的重思考（step-level rethinking）。这种方法能够在提高模型逻辑推理准确性的同时，合理控制额外生成的标记数量。

链接: https://arxiv.org/abs/2504.01317
作者: Zhendong Tan,Xingjun Zhang,Chaoyi Hu,Yancheng Pan,Shaoxun Wang
机构: School of Computer Science and Technology, Xi’an Jiaotong University (西安交通大学计算机科学与技术学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The newly released OpenAI-o1 and DeepSeek-R1 have demonstrated that test-time scaling can significantly improve model performance, especially in complex tasks such as logical reasoning. Common test-time scaling methods involve generating more chain of thoughts (CoTs) or longer CoTs with self-correction. However, while self-correction can improve performance, it may lead to significant token waste and reduce readability of the CoT if the reasoning steps are already correct. To demonstrate that large language models (LLMs) can rectify errors at a more fine-grained level, we propose Adaptive Rectification Sampling (AR-Sampling), which can guide the LLMs to self-correction at the appropriate step. AR-Sampling leverages a process-supervised reward model (PRM) as a verifier and constructed trigger sentences to guide the model in adaptive step-level rethinking. Through the experiments on GSM8K and MATH500, it indicate that our approach enables the models to rethink in more fine-grained level, improving the accuracy of solutions, while generating a reasonable number of additional tokens.
zh

[NLP-46] Biomedical Question Answering via Multi-Level Summarization on a Local Knowledge Graph

【速读】：该论文旨在解决在多文档关系捕捉方面的挑战，特别是针对生物医学任务中这一问题的重要性尤为突出。论文提出了一种新颖的方法，通过利用命题主张从检索到的文档中构建局部知识图谱，并通过分层摘要技术从知识图谱生成上下文化的摘要，以微调小型语言模型进行问答任务 (Question Answering, QA)。该方法的关键在于利用命题主张构建局部知识图谱，从而更好地捕捉多文档之间的复杂关系，并通过分层摘要提升模型的上下文理解能力。实验结果显示，该方法在多个生物医学问答基准数据集上的表现与或优于基于 Retrieval Augmented Generation (RAG) 的基线模型。

链接: https://arxiv.org/abs/2504.01309
作者: Lingxiao Guan,Yuanhao Huang,Jie Liu
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In Question Answering (QA), Retrieval Augmented Generation (RAG) has revolutionized performance in various domains. However, how to effectively capture multi-document relationships, particularly critical for biomedical tasks, remains an open question. In this work, we propose a novel method that utilizes propositional claims to construct a local knowledge graph from retrieved documents. Summaries are then derived via layerwise summarization from the knowledge graph to contextualize a small language model to perform QA. We achieved comparable or superior performance with our method over RAG baselines on several biomedical QA benchmarks. We also evaluated each individual step of our methodology over a targeted set of metrics, demonstrating its effectiveness.
zh

[NLP-47] hinkPrune: Pruning Long Chain-of-Thought of LLM s via Reinforcement Learning

【速读】：该论文旨在解决长思考（long-thinking）大型语言模型 (LLMs) 在推理过程中产生低效且冗余思维过程的问题。现有方法主要关注强制早期退出以缩短推理长度，而非优化和精简模型的推理过程，导致性能与推理长度之间的权衡（tradeoff）不够理想。论文的关键解决方案是提出 ThinkPrune 方法，通过强化学习 (Reinforcement Learning, RL) 对长推理 LLMs 进行持续训练，并引入令牌限制（token limit），超出该限制的未完成推理将被丢弃并给予零奖励。此外，为了进一步保持模型性能，引入了迭代长度剪枝方法，在多轮 RL 训练中逐步增加令牌限制的严格性。实验表明，ThinkPrune 实现了显著的性能与推理长度之间的平衡，并在 AIME24 数据集上验证了其有效性。

链接: https://arxiv.org/abs/2504.01296
作者: Bairu Hou,Yang Zhang,Jiabao Ji,Yujian Liu,Kaizhi Qian,Jacob Andreas,Shiyu Chang
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校); MIT-IBM Watson AI Lab (麻省理工学院-IBM Watson人工智能实验室); MIT CSAIL (麻省理工学院计算机科学与人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:We present ThinkPrune, a simple yet effective method for pruning the thinking length for long-thinking LLMs, which has been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to early exit, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly more stringent token limit. We observed that ThinkPrune results in a remarkable performance-length tradeoff – on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only 2% drop in performance. We also observed that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at this https URL.
zh

[NLP-48] Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）中存在的 Prompt-Reverse Inconsistency (PRIN)，这是一种新的自我不一致性形式。PRIN 表现为在面对相同生成的答案候选集时，LLM 对“哪些答案是正确的？”和“哪些答案是错误的？”这两个相反的提问给出冲突响应的现象。这种现象削弱了 LLM 作为裁判的可信度，并挑战了其遵循基本逻辑规则的能力。论文的关键在于通过一系列实验探讨 PRIN 的程度、不同 LLM 间的差异、缓解方法、潜在应用及其与随机性不一致性和释义不一致性之间的关系，从而为理解 LLM 的内部工作机制提供洞见，并推动可信 AI 的发展。

链接: https://arxiv.org/abs/2504.01282
作者: Jihyun Janice Ahn,Wenpeng Yin
机构: Department of Computer Science & Engineering (计算机科学与工程系), The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注: 9 pages

点击查看摘要

Abstract:While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM multiple trials, yielding varying responses; ii) Paraphrase Inconsistency: paraphrased prompts result in different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often has conflicting responses when prompted “Which are correct answers?” and “Which are incorrect answers?”. PRIN poses a big concern as it undermines the credibility of LLM-as-a-judge, and suggests a challenge for LLMs to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.
zh

[NLP-49] Scaling Test-Time Inference with Policy-Optimized Dynamic Retrieval-Augmented Generation via KV Caching and Decoding

【速读】：该论文旨在解决现有 Retrieval-Augmented Generation (RAG) 系统在知识密集型任务（如开放领域问答和复杂推理）中的不足，特别是提高其事实准确性与响应质量。论文的关键创新在于提出了一种综合框架，通过动态检索策略和强化微调优化 RAG 性能。具体而言，该框架结合了 Policy-Optimized Retrieval-Augmented Generation (PORAG)，用于优化检索信息的利用，以及 Adaptive Token-Layer Attention Scoring (ATLAS)，用于根据上下文需求动态决定检索时机与内容。此外，还引入了 CRITIC 方法以选择性压缩键值缓存，缓解长上下文应用中的内存瓶颈，并结合测试时扩展技术平衡推理深度与计算资源。这些方法共同显著提升了 RAG 系统在知识密集型任务中的性能，同时保持了高效性和可扩展性。

链接: https://arxiv.org/abs/2504.01281
作者: Sakhinana Sagar Srinivas,Venkataramana Runkana
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We present a comprehensive framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning. This approach significantly improves large language models on knowledge-intensive tasks, including opendomain question answering and complex reasoning. Our framework integrates two complementary techniques: Policy-Optimized RetrievalAugmented Generation (PORAG), which optimizes the use of retrieved information, and Adaptive Token-Layer Attention Scoring (ATLAS), which dynamically determines retrieval timing and content based on contextual needs. Together, these techniques enhance both the utilization and relevance of retrieved content, improving factual accuracy and response quality. Designed as a lightweight solution compatible with any Transformer-based LLM without requiring additional training, our framework excels in knowledge-intensive tasks, boosting output accuracy in RAG settings. We further propose CRITIC, a novel method to selectively compress key-value caches by token importance, mitigating memory bottlenecks in long-context applications. The framework also incorporates test-time scaling techniques to dynamically balance reasoning depth and computational resources, alongside optimized decoding strategies for faster inference. Experiments on benchmark datasets show that our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems. This integrated approach advances the development of robust, efficient, and scalable RAG systems across diverse applications.
zh

[NLP-50] Grade Guard: A Smart System for Short Answer Automated Grading

【速读】：该论文旨在解决自动化短答案评分（Automated Short Answer Grading, ASAG）中大型语言模型（Large Language Models, LLMs）因训练数据多样性导致的对细微或部分正确答案评估不准确的问题。为应对这一挑战，论文提出了一种名为Grade Guard的新框架，其关键解决方案包括：1）通过均方根误差（Root Mean Square Error, RMSE）微调温度参数以增强LLM的任务专用性；2）引入犹豫分数（Indecisiveness Score, IS）与评分一同输出，反映预测评分的不确定性；3）设计置信感知损失（Confidence-Aware Loss, CAL）优化IS；4）基于优化后的IS引入自省机制，并结合人工复评以减少错误评分分配，从而提高评分可靠性。实验表明，Grade Guard在多种基准测试中显著优于传统方法。

链接: https://arxiv.org/abs/2504.01253
作者: Niharika Dadu,Harsh Vardhan Singh,Romi Banerjee(Indian Institute of Technology Jodhpur)
机构: Department of Computer Science and Engineering, Indian Institute of Technology (IIT), Jodhpur, India
类目: Computation and Language (cs.CL)
备注: 11 pages, 18 figures

点击查看摘要

Abstract:The advent of large language models (LLMs) in the education sector has provided impetus to automate grading short answer questions. LLMs make evaluating short answers very efficient, thus addressing issues like staff shortage. However, in the task of Automated Short Answer Grading (ASAG), LLM responses are influenced by diverse perspectives in their training dataset, leading to inaccuracies in evaluating nuanced or partially correct answers. To address this challenge, we propose a novel framework, Grade Guard. 1. To enhance the task-based specialization of the LLMs, the temperature parameter has been fine-tuned using Root Mean Square Error (RMSE). 2. Unlike traditional approaches, LLMs in Grade Guard compute an Indecisiveness Score (IS) along with the grade to reflect uncertainty in predicted grades. 3. Introduced Confidence-Aware Loss (CAL) to generate an optimized Indecisiveness Score (IS). 4. To improve reliability, self-reflection based on the optimized IS has been introduced into the framework, enabling human re-evaluation to minimize incorrect grade assignments. Our experimentation shows that the best setting of Grade Guard outperforms traditional methods by 19.16% RMSE in Upstage Solar Pro, 23.64% RMSE in Upstage Solar Mini, 4.00% RMSE in Gemini 1.5 Flash, and 10.20% RMSE in GPT 4-o Mini. Future work includes improving interpretability by generating rationales for grades to enhance accuracy. Expanding benchmark datasets and annotating them with domain-specific nuances will enhance grading accuracy. Finally, analyzing feedback to enhance confidence in predicted grades, reduce biases, optimize grading criteria, and personalize learning while supporting multilingual grading systems will make the solution more accurate, adaptable, fair, and inclusive. Comments: 11 pages, 18 figures Subjects: Computation and Language (cs.CL) ACMclasses: I.2.7 Cite as: arXiv:2504.01253 [cs.CL] (or arXiv:2504.01253v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.01253 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.36227/techrxiv.174114489.93670234/v1 Focus to learn more DOI(s) linking to related resources Submission history From: Niharika Dadu [view email] [v1] Tue, 1 Apr 2025 23:45:44 UTC (4,963 KB) Full-text links: Access Paper: View a PDF of the paper titled Grade Guard: A Smart System for Short Answer Automated Grading, by Niharika Dadu and 2 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[NLP-51] Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models

【速读】：该论文旨在解决车载对话系统在事实准确性方面的验证问题。现代基于大语言模型（Large Language Models, LLMs）的对话系统容易出现幻觉（hallucinations），即提供不准确或虚构的错误信息。为解决这一问题，论文提出了一种基于LLM的自动事实基准测试方法，并通过结合集成技术与多样的角色设定来增强一致性并减少幻觉现象。关键在于利用GPT-4与输入输出提示（Input Output Prompting）的组合，实现了超过90%的事实准确性一致率，并以平均4.5秒的响应时间成为最高效的方法。研究结果表明，基于LLM的测试方法是一种可行的手段，可用于验证车载对话系统在事实准确性方面的表现。

链接: https://arxiv.org/abs/2504.01248
作者: Rafael Giebisch,Ken E. Friedl,Lev Sorokin,Andrea Stocco
机构: Technical University of Munich (慕尼黑工业大学); BMW Group (宝马集团); fortiss GmbH (fortiss有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Accepted in IEEE Intelligent Vehicles Symposium Conference (IV 2025)

点击查看摘要

Abstract:In-car conversational systems bring the promise to improve the in-vehicle user experience. Modern conversational systems are based on Large Language Models (LLMs), which makes them prone to errors such as hallucinations, i.e., inaccurate, fictitious, and therefore factually incorrect information. In this paper, we present an LLM-based methodology for the automatic factual benchmarking of in-car conversational systems. We instantiate our methodology with five LLM-based methods, leveraging ensembling techniques and diverse personae to enhance agreement and minimize hallucinations. We use our methodology to evaluate CarExpert, an in-car retrieval-augmented conversational question answering system, with respect to the factual correctness to a vehicle’s manual. We produced a novel dataset specifically created for the in-car domain, and tested our methodology against an expert evaluation. Our results show that the combination of GPT-4 with the Input Output Prompting achieves over 90 per cent factual correctness agreement rate with expert evaluations, other than being the most efficient approach yielding an average response time of 4.5s. Our findings suggest that LLM-based testing constitutes a viable approach for the validation of conversational systems regarding their factual correctness.
zh

[NLP-52] Catastrophic Forgetting in LLM s: A Comparative Analysis Across Language Tasks

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在连续学习（continual learning）过程中面临的灾难性遗忘（catastrophic forgetting）问题，即模型在适应新任务的同时难以保留先前学到的知识。论文的关键解决方案在于通过提示工程（prompt engineering）和针对具体任务的调整，评估不同开源LLMs（参数规模小于10亿）在GLUE基准数据集关键NLU任务上的持续微调能力，包括SST-2、MRPC、CoLA和MNLI。研究发现，如Phi-3.5-mini等模型在保持强大学习能力的同时表现出极小的遗忘现象，而Orca-2-7b和Qwen2.5-7B等模型在微调后也展现出优异的学习能力和整体性能，从而验证了优化提示工程对于提升LLMs在连续学习场景下表现的重要性。

链接: https://arxiv.org/abs/2504.01241
作者: Naimul Haque
机构: Alfred University (阿尔弗雷德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information - a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models’ abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights prompting engineering to optimize model performance for continual learning scenarios.
zh

[NLP-53] A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates

【速读】：本文研究了现有基于学习的图像描述评估指标的局限性，特别是个体单词错位的细粒度评估缺失以及对单一质量点估计的依赖而未考虑不确定性的问题。为了解决这些问题，论文提出了一种简单而有效的策略，用于生成和校准CLIPScore分布。关键在于利用一种与模型无关的符合风险控制框架，针对特定任务的控制变量校准CLIPScore值，从而解决上述两个问题。实验结果表明，通过符合风险控制方法处理生成的分布，相比简单的输入掩码等方法能够达到与更复杂方法竞争的性能。该方法不仅能有效检测错位单词，还能提供与期望风险水平一致的形式化保证，并改善不确定性估计与预测误差之间的相关性，从而提升整体评估指标的可靠性。

链接: https://arxiv.org/abs/2504.01225
作者: Gonçalo Gomes,Chrysoula Zerva,Bruno Martins
机构: Instituto Superior Técnico, University of Lisbon (里斯本大学高等技术学院); INESC-ID (未知); Instituto de Telecomunicações (电信研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study explores current limitations of learned image captioning evaluation metrics, specifically the lack of granular assessment for individual word misalignments within captions, and the reliance on single-point quality estimates without considering uncertainty. To address these limitations, we propose a simple yet effective strategy for generating and calibrating CLIPScore distributions. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore values for task-specific control variables, to tackle the aforementioned two limitations. Experimental results demonstrate that using conformal risk control, over the distributions produced with simple methods such as input masking, can achieve competitive performance compared to more complex approaches. Our method effectively detects misaligned words, while providing formal guarantees aligned with desired risk levels, and improving the correlation between uncertainty estimations and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.
zh

[NLP-54] Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

【速读】：该论文试图解决临床环境中 PTSD（创伤后应激障碍）诊断不足的问题，通过自动化检测方法识别患者。解决方案的关键在于采用特定领域的自然语言处理技术，包括基于嵌入的方法（如LLaMA）、领域特定的Transformer模型（如Mental-RoBERTa）以及大语言模型的提示策略（如零样本、少样本和逐步推理）。研究发现，领域适应的嵌入方法和大语言模型在性能上表现最优，特别是LLaMA嵌入结合神经网络达到了最高的F1分数（0.700），并且零样本提示基于DSM-5标准也能取得竞争性结果（F1=0.657）。这些结果表明，通过优化领域适配和模型设计，可以实现PTSD筛查的规模化应用，同时需进一步改进对复杂病例的检测能力。

链接: https://arxiv.org/abs/2504.01216
作者: Feng Chen,Dror Ben-Zeev,Gillian Sparks,Arya Kadakia,Trevor Cohen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 4 tables, 1 figure

点击查看摘要

Abstract:Post-Traumatic Stress Disorder (PTSD) remains underdiagnosed in clinical settings, presenting opportunities for automated detection to identify patients. This study evaluates natural language processing approaches for detecting PTSD from clinical interview transcripts. We compared general and mental health-specific transformer models (BERT/RoBERTa), embedding-based methods (SentenceBERT/LLaMA), and large language model prompting strategies (zero-shot/few-shot/chain-of-thought) using the DAIC-WOZ dataset. Domain-specific models significantly outperformed general models (Mental-RoBERTa F1=0.643 vs. RoBERTa-base 0.485). LLaMA embeddings with neural networks achieved the highest performance (F1=0.700). Zero-shot prompting using DSM-5 criteria yielded competitive results without training data (F1=0.657). Performance varied significantly across symptom severity and comorbidity status, with higher accuracy for severe PTSD cases and patients with comorbid depression. Our findings highlight the potential of domain-adapted embeddings and LLMs for scalable screening while underscoring the need for improved detection of nuanced presentations and offering insights for developing clinically viable AI tools for PTSD assessment.
zh

[NLP-55] Epistemic Alignment: A Mediating Framework for User-LLM Knowledge Delivery

【速读】：该论文试图解决用户在使用大型语言模型（Large Language Models, LLMs）时无法有效表达信息呈现偏好（如“引用权威来源”或“包含多视角”）的问题，当前界面缺乏结构化方式来表达这些需求，导致用户依赖于非正式的提示词共享惯例而非基于实际效果。论文的关键解决方案是提出认知对齐框架（Epistemic Alignment Framework），这是一套源自认识论哲学的十个知识传递挑战，用于评估证据质量、校准证词依赖等。该框架作为用户需求与系统能力之间的结构化中介，建立了共同的语言以弥合用户期望与系统输出之间的差距。通过分析在线社区中的自定义提示词和个人化策略，论文发现用户已发展出复杂的变通方法应对这些挑战，但主流模型提供商（如OpenAI和Anthropic）尚未充分建立指定认知偏好的机制，缺乏透明度且未提供验证工具。因此，该框架为开发者提供了支持多样化知识传递的具体指导，同时为用户提供符合其特定需求的信息交付方式。

链接: https://arxiv.org/abs/2504.01205
作者: Nicholas Clark,Hua Shen,Bill Howe,Tanushree Mitra
机构: University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs increasingly serve as tools for knowledge acquisition, yet users cannot effectively specify how they want information presented. When users request that LLMs “cite reputable sources,” “express appropriate uncertainty,” or “include multiple perspectives,” they discover that current interfaces provide no structured way to articulate these preferences. The result is prompt sharing folklore: community-specific copied prompts passed through trust relationships rather than based on measured efficacy. We propose the Epistemic Alignment Framework, a set of ten challenges in knowledge transmission derived from the philosophical literature of epistemology, concerning issues such as evidence quality assessment and calibration of testimonial reliance. The framework serves as a structured intermediary between user needs and system capabilities, creating a common vocabulary to bridge the gap between what users want and what systems deliver. Through a thematic analysis of custom prompts and personalization strategies shared on online communities where these issues are actively discussed, we find users develop elaborate workarounds to address each of the challenges. We then apply our framework to two prominent model providers, OpenAI and Anthropic, through content analysis of their documented policies and product features. Our analysis shows that while these providers have partially addressed the challenges we identified, they fail to establish adequate mechanisms for specifying epistemic preferences, lack transparency about how preferences are implemented, and offer no verification tools to confirm whether preferences were followed. For AI developers, the Epistemic Alignment Framework offers concrete guidance for supporting diverse approaches to knowledge; for users, it works toward information delivery that aligns with their specific needs rather than defaulting to one-size-fits-all approaches.
zh

[NLP-56] Medical large language models are easily distracted

【速读】：该论文旨在评估大型语言模型（Large Language Models, LLMs）在处理包含真实世界干扰信息（如辅助技术生成的非临床上下文相关词汇或无关健康状况的引用）时过滤相关临床数据的能力。论文通过构建MedDistractQA基准测试集，嵌入美国医学执照考试（USMLE）风格的问题并模拟实际临床场景中的干扰因素，发现这些干扰可使LLMs的准确性降低多达17.9%。论文进一步探讨了两种常见方法——检索增强生成（Retrieval-Augmented Generation, RAG）和医学领域微调（medical fine-tuning），但结果显示这些方法未能有效缓解干扰影响，甚至可能引入新的混淆因素，进一步削弱性能。研究的关键在于揭示LLMs缺乏区分相关与无关临床信息所需的逻辑机制，强调了开发鲁棒性策略以增强LLMs对抗冗余信息能力的重要性。

链接: https://arxiv.org/abs/2504.01201
作者: Krithik Vishwanath,Anton Alyakin,Daniel Alexander Alber,Jin Vivian Lee,Douglas Kondziolka,Eric Karl Oermann
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 20 pages, 2 main figures, 6 extended figures

点击查看摘要

Abstract:Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies like ambient dictation, which automatically generates draft notes from live patient encounters, has the potential to introduce additional noise making it crucial to assess the ability of LLM’s to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark using USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions to improve model performance such as retrieval-augmented generation (RAG) and medical fine-tuning did not change this effect and in some cases introduced their own confounders and further degraded performance. Our findings suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlights the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
zh

[NLP-57] μKE: Matryoshka Unstructured Knowledge Editing of Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）因静态训练数据导致的知识编辑局限性问题，如幻觉（hallucinations）和安全风险。当前基于窗口的自回归方法在通过定位与编辑范式更新模型内部知识时，往往破坏早期记忆更新与后续输出标记之间的因果依赖关系。为应对这一挑战，论文提出了一种名为Matryoshka无结构知识编辑（μKE）的新颖内存更新机制。其关键在于通过嵌套式的优化目标（Matryoshka-style objective）和自适应损失系数，保留早期记忆更新与后期输出之间的因果依赖性，从而提升知识编辑的有效性和鲁棒性。

链接: https://arxiv.org/abs/2504.01196
作者: Zian Su,Ziyang Huang,Kaiyuan Zhang,Xiangyu Zhang
机构: Department of Computer Science, Purdue University (普渡大学), West Lafayette, IN; Department of Computer Science, Johns Hopkins University (约翰斯·霍普金斯大学), Baltimore, MD
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) have emerged as powerful knowledge bases yet are limited by static training data, leading to issues such as hallucinations and safety risks. Editing a model’s internal knowledge through the locate-and-edit paradigm has proven a cost-effective alternative to retraining, though current unstructured approaches, especially window-based autoregressive methods, often disrupt the causal dependency between early memory updates and later output tokens. In this work, we first theoretically analyze these limitations and then introduce Matryoshka Unstructured Knowledge Editing ( \mu KE), a novel memory update mechanism that preserves such dependencies via a Matryoshka-style objective and adaptive loss coefficients. Empirical evaluations on two models across four benchmarks demonstrate that \mu KE improves edit efficacy by up to 12.33% over state-of-the-art methods, and remain robust when applied to diverse formatted edits, underscoring its potential for effective unstructured knowledge editing in LLMs.
zh

[NLP-58] Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

【速读】：该论文致力于解决文本到图像（Text-to-Image, T2I）模型中存在的语义泄漏（semantic leakage）、特征绑定错误（incorrect feature binding）以及关键概念遗漏（key concept omissions）等问题。为了解决这些问题，论文重点研究了文本标记表示之间信息流的作用，并通过在给定提示的上下文标记表示子集上应用扩散组件生成图像，观察到几个有趣的现象。关键在于发现单个或少数几个标记可以充分表示一个单词或多词表达（如“San Francisco’s Golden Gate Bridge”中的“gate”），而其他标记是冗余的。基于此，论文提出了一种简单且无需训练的方法来缓解语义泄漏：在文本编码后，用非上下文化的表示替换泄漏项的表示。这种方法显著减少了语义泄漏达85%，同时提高了生成性能并减少了错误。总体而言，这项工作提供了对T2I模型中文本标记间信息流的全面分析，带来了新颖的见解和实用价值。

链接: https://arxiv.org/abs/2504.01137
作者: Guy Kaplan,Michael Toker,Yuval Reif,Yonatan Belinkov,Roy Schwartz
机构: Hebrew University of Jerusalem (希伯来大学耶路撒冷); Technion – Israel Institute of Technology (以色列理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text-to-Image (T2I) models often suffer from issues such as semantic leakage, incorrect feature binding, and omissions of key concepts in the generated image. This work studies these phenomena by looking into the role of information flow between textual token representations. To this end, we generate images by applying the diffusion component on a subset of contextual token representations in a given prompt and observe several interesting phenomena. First, in many cases, a word or multiword expression is fully represented by one or two tokens, while other tokens are redundant. For example, in “San Francisco’s Golden Gate Bridge”, the token “gate” alone captures the full expression. We demonstrate the redundancy of these tokens by removing them after textual encoding and generating an image from the resulting representation. Surprisingly, we find that this process not only maintains image generation performance but also reduces errors by 21% compared to standard generation. We then show that information can also flow between different expressions in a sentence, which often leads to semantic leakage. Based on this observation, we propose a simple, training-free method to mitigate semantic leakage: replacing the leaked item’s representation after the textual encoding with its uncontextualized representation. Remarkably, this simple approach reduces semantic leakage by 85%. Overall, our work provides a comprehensive analysis of information flow across textual tokens in T2I models, offering both novel insights and practical benefits.
zh

[NLP-59] Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding

【速读】：该论文旨在解决确定声明与源文档之间忠实性判断的问题，尤其是在声明含义模糊的情况下。传统方法通常将此任务视为二元判断（支持或不支持），但这种强制性的二分类降低了评估的可靠性。论文的关键解决方案是引入一种新的评价机制——通过大型语言模型（LLM）生成的摘要编辑来提供更细致的评估：声明需要多少编辑才能变得无歧义？声明是否被改写以及修改的程度可以作为自动评价指标，即模糊重写度量（Ambiguity Rewrite Metric, ARM）。相比传统的二元忠实性判断，ARM提供了更丰富的反馈信号。论文重点研究叙事性摘要领域，因其高度存在模糊性和主观解释。实验表明，ARM使标注者在声明忠实性判断上的协议一致性提高了21%，表明减少了主观性影响。

链接: https://arxiv.org/abs/2504.01132
作者: Melanie Subbiah,Akankshya Mishra,Grace Kim,Liyan Tang,Greg Durrett,Kathleen McKeown
机构: Columbia University (哥伦比亚大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Determining faithfulness of a claim to a source document is an important problem across many domains. This task is generally treated as a binary judgment of whether the claim is supported or unsupported in relation to the source. In many cases, though, whether a claim is supported can be ambiguous. For instance, it may depend on making inferences from given evidence, and different people can reasonably interpret the claim as either supported or unsupported based on their agreement with those inferences. Forcing binary labels upon such claims lowers the reliability of evaluation. In this work, we reframe the task to manage the subjectivity involved with factuality judgments of ambiguous claims. We introduce LLM-generated edits of summaries as a method of providing a nuanced evaluation of claims: how much does a summary need to be edited to be unambiguous? Whether a claim gets rewritten and how much it changes can be used as an automatic evaluation metric, the Ambiguity Rewrite Metric (ARM), with a much richer feedback signal than a binary judgment of faithfulness. We focus on the area of narrative summarization as it is particularly rife with ambiguity and subjective interpretation. We show that ARM produces a 21% absolute improvement in annotator agreement on claim faithfulness, indicating that subjectivity is reduced.
zh

[NLP-60] Can LLM s Grasp Implicit Cultural Values? Benchmarking LLM s Metacognitive Cultural Intelligence with CQ-Bench

【速读】：本文旨在解决大型语言模型（LLMs）在理解隐含文化价值观方面的不足，现有研究主要关注显性文化规范，但忽视了真实对话中更微妙、隐含的价值观。为填补这一空白，论文引入了CQ-Bench，这是一个专门设计的基准测试集，用于评估LLMs从自然对话语境中推断隐含文化价值观的能力。关键在于通过构建包含世界价值调查（World Value Survey）和GlobalOpinions数据集中价值观的多角色对话故事数据集，并采用严格的验证流程（包括一致性与隐含性检查），以及利用GPT-4o进行模型微调等方法，提升LLMs在态度检测、价值选择及提取等任务上的表现，特别是针对较小规模模型通过少量高质量文化样本进行精调可显著改善其跨文化交流推理能力。

链接: https://arxiv.org/abs/2504.01127
作者: Ziyi Liu,Priyanka Dey,Zhenyu Zhao,Jen-tse Huang,Rahul Gupta,Yang Liu,Jieyu Zhao
机构: University of Southern California (南加州大学); Amazon AGI; John Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cultural Intelligence (CQ) refers to the ability to understand unfamiliar cultural contexts-a crucial skill for large language models (LLMs) to effectively engage with globally diverse users. While existing research often focuses on explicitly stated cultural norms, such approaches fail to capture the subtle, implicit values that underlie real-world conversations. To address this gap, we introduce CQ-Bench, a benchmark specifically designed to assess LLMs’ capability to infer implicit cultural values from natural conversational contexts. We generate a multi-character conversation-based stories dataset using values from the World Value Survey and GlobalOpinions datasets, with topics including ethical, religious, social, and political. Our dataset construction pipeline includes rigorous validation procedures-incorporation, consistency, and implicitness checks-using GPT-4o, with 98.2% human-model agreement in the final validation. Our benchmark consists of three tasks of increasing complexity: attitude detection, value selection, and value extraction. We find that while o1 and Deepseek-R1 models reach human-level performance in value selection (0.809 and 0.814), they still fall short in nuanced attitude detection, with F1 scores of 0.622 and 0.635, respectively. In the value extraction task, GPT-4o-mini and o3-mini score 0.602 and 0.598, highlighting the difficulty of open-ended cultural reasoning. Notably, fine-tuning smaller models (e.g., LLaMA-3.2-3B) on only 500 culturally rich examples improves performance by over 10%, even outperforming stronger baselines (o3-mini) in some cases. Using CQ-Bench, we provide insights into the current challenges in LLMs’ CQ research and suggest practical pathways for enhancing LLMs’ cross-cultural reasoning abilities.
zh

[NLP-61] Repetitions are not all alike: distinct mechanisms sustain repetition in language models

【速读】：该论文试图解决语言模型（Language Models, LMs）生成文本中重复现象（repetition）的机制问题，探索其是否由单一因素驱动或由多种内部机制共同作用。解决方案的关键在于通过实验对比两种引发重复条件下的模型行为差异：一种是重复序列自然出现在人类书写文本后的情况，另一种是在情境学习（in-context learning, ICL）设置下人为诱导重复。研究发现，模型在不同条件下表现出不同的置信度水平、依赖不同的注意力头，并对受控扰动呈现独特的响应模式，表明重复可能由多种独立或组合的内部机制驱动。这些结果强调了表面相似的行为可能由不同的底层过程支撑，这对理解重复的本质及其缓解策略具有重要意义。

链接: https://arxiv.org/abs/2504.01100
作者: Matéo Mahaut,Francesca Franzon
机构: Universitat Pompeu Fabra (庞培法布拉大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text generated by language models (LMs) can degrade into repetitive cycles, where identical word sequences are persistently repeated one after another. Prior research has typically treated repetition as a unitary phenomenon. However, repetitive sequences emerge under diverse tasks and contexts, raising the possibility that it may be driven by multiple underlying factors. Here, we experimentally explore the hypothesis that repetition in LMs can result from distinct mechanisms, reflecting different text generation strategies used by the model. We examine the internal working of LMs under two conditions that prompt repetition: one in which repeated sequences emerge naturally after human-written text, and another where repetition is explicitly induced through an in-context learning (ICL) setup. Our analysis reveals key differences between the two conditions: the model exhibits varying levels of confidence, relies on different attention heads, and shows distinct pattens of change in response to controlled perturbations. These findings suggest that distinct internal mechanisms can interact to drive repetition, with implications for its interpretation and mitigation strategies. More broadly, our results highlight that the same surface behavior in LMs may be sustained by different underlying processes, acting independently or in combination.
zh

[NLP-62] Multilingual and Multi-Accent Jailbreaking of Audio LLM s

【速读】：该论文试图解决大型音频语言模型（Large Audio Language Models, LALMs）在安全方面面临的严重漏洞，特别是通过多语言和多口音的音频越狱攻击（audio jailbreaks）对模型进行对抗性利用的问题。现有研究主要集中在英语为中心的攻击，而本文揭示了一种更为严重的风险：跨语言及跨口音的音频越狱攻击，其成功率因语音和声学变化的叠加而显著提升。为应对这一挑战，论文提出Multi-AudioJail框架，关键在于构建了一个包含对抗性扰动的多语言/多口音音频越狱提示的新数据集，并设计了一个分层评估管道，以揭示声学扰动（如混响、回声和耳语效应）与跨语言音韵学交互如何导致越狱成功率（JSR）最高提升57.25个百分点。此外，论文指出多模态大型语言模型（LLMs）比单模态系统更易受攻击，因为攻击者只需利用最薄弱环节（如非英语音频输入）即可攻破整个模型，这一点通过多语言音频-only攻击的成功率比文本-only攻击高出3.1倍得到验证。论文计划公开数据集，以推动跨模态防御的研究，呼吁社区关注LALMs演进过程中不断扩大的攻击面。

链接: https://arxiv.org/abs/2504.01094
作者: Jaechul Roh,Virat Shejwalkar,Amir Houmansadr
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Google DeepMind (谷歌深思维)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 21 pages, 6 figures, 15 tables

点击查看摘要

Abstract:Large Audio Language Models (LALMs) have significantly advanced audio understanding but introduce critical security risks, particularly through audio jailbreaks. While prior work has focused on English-centric attacks, we expose a far more severe vulnerability: adversarial multilingual and multi-accent audio jailbreaks, where linguistic and acoustic variations dramatically amplify attack success. In this paper, we introduce Multi-AudioJail, the first systematic framework to exploit these vulnerabilities through (1) a novel dataset of adversarially perturbed multilingual/multi-accent audio jailbreaking prompts, and (2) a hierarchical evaluation pipeline revealing that how acoustic perturbations (e.g., reverberation, echo, and whisper effects) interacts with cross-lingual phonetics to cause jailbreak success rates (JSRs) to surge by up to +57.25 percentage points (e.g., reverberated Kenyan-accented attack on MERaLiON). Crucially, our work further reveals that multimodal LLMs are inherently more vulnerable than unimodal systems: attackers need only exploit the weakest link (e.g., non-English audio inputs) to compromise the entire model, which we empirically show by multilingual audio-only attacks achieving 3.1x higher success rates than text-only attacks. We plan to release our dataset to spur research into cross-modal defenses, urging the community to address this expanding attack surface in multimodality as LALMs evolve.
zh

[NLP-63] ShieldGemma 2: Robust and Tractable Image Content Moderation

【速读】：本文旨在解决图像内容审核领域中高效且准确识别合成图像（如生成式 AI 输出）和自然图像潜在安全风险的问题。论文提出的关键解决方案是ShieldGemma 2，一个基于Gemma 3构建的40亿参数图像内容审核模型。该模型在性暗示、暴力/血腥以及危险内容等关键类别上提供了鲁棒的安全风险预测能力。此外，论文还介绍了一种新颖的对抗性数据生成管道，用于生成可控、多样且稳健的图像数据集，从而进一步提升模型性能。通过在内部和外部基准上的评估，ShieldGemma 2展示了相较于现有方法（如LlavaGuard、GPT-4o mini及基础版Gemma 3）的领先性能，为多模态安全与负责任的人工智能发展提供了开放的图像审核工具。

链接: https://arxiv.org/abs/2504.01081
作者: Wenjun Zeng,Dana Kurniawan,Ryan Mullins,Yuchi Liu,Tamoghna Saha,Dirichi Ike-Njoku,Jindong Gu,Yiwen Song,Cai Xu,Jingjing Zhou,Aparna Joshi,Shravan Dheep,Mani Malek,Hamid Palangi,Joon Baek,Rick Pereira,Karthik Narasimhan
机构: Google LLC (谷歌有限责任公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:We introduce ShieldGemma 2, a 4B parameter image content moderation model built on Gemma 3. This model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence \ Gore, and Dangerous Content for synthetic images (e.g. output of any image generation model) and natural images (e.g. any image input to a Vision-Language Model). We evaluated on both internal and external benchmarks to demonstrate state-of-the-art performance compared to LlavaGuard \citephelff2024llavaguard, GPT-4o mini \citephurst2024gpt, and the base Gemma 3 model \citepgemma_2025 based on our policies. Additionally, we present a novel adversarial data generation pipeline which enables a controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.
zh

[NLP-64] Improving Applicability of Deep Learning based Token Classification models during Training

【速读】：本文旨在解决现有评估指标在模型训练期间不足以有效判断其在实际推理应用中的适用性的问题。传统分类指标（如实验中使用的F1分数）无法充分反映模型在真实场景下的表现，尤其是在需要高自动化水平的业务软件环境中。为了解决这一问题，论文提出了一种新的评估指标——文档完整性精确度（Document Integrity Precision, DIP），用于视觉文档理解和标记分类任务。与现有指标不同，DIP通过量化测试数据集中需要人工干预的文档比例，提供了一个严格且直观的评价方式，从而帮助研究人员和开发者深入分析业务流程自动化程度。关键在于，DIP能够针对整个实体集合保持单一可解释值，即使面对孤立的模型缺陷或复杂实体预测任务，也能更准确地指示模型部署前所需的干预量，凸显了开发面向生产环境任务特定评估指标的重要性。

链接: https://arxiv.org/abs/2504.01028
作者: Anket Mehra,Malte Prieß,Marian Himstedt
机构: University of Applied Sciences Kiel (基尔应用科学大学); Technical University of Applied Sciences Lübeck (吕贝克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper shows that further evaluation metrics during model training are needed to decide about its applicability in inference. As an example, a LayoutLM-based model is trained for token classification in documents. The documents are German receipts. We show that conventional classification metrics, represented by the F1-Score in our experiments, are insufficient for evaluating the applicability of machine learning models in practice. To address this problem, we introduce a novel metric, Document Integrity Precision (DIP), as a solution for visual document understanding and the token classification task. To the best of our knowledge, nothing comparable has been introduced in this context. DIP is a rigorous metric, describing how many documents of the test dataset require manual interventions. It enables AI researchers and software developers to conduct an in-depth investigation of the level of process automation in business software. In order to validate DIP, we conduct experiments with our created models to highlight and analyze the impact and relevance of DIP to evaluate if the model should be deployed or not in different training settings. Our results demonstrate that existing metrics barely change for isolated model impairments, whereas DIP indicates that the model requires substantial human interventions in deployment. The larger the set of entities being predicted, the less sensitive conventional metrics are, entailing poor automation quality. DIP, in contrast, remains a single value to be interpreted for entire entity sets. This highlights the importance of having metrics that focus on the business task for model training in production. Since DIP is created for the token classification task, more research is needed to find suitable metrics for other training tasks.
zh

[NLP-65] Study of scaling laws in language families

【速读】：该论文旨在探究语言家族中的缩放规律，通过分析来自六千多种语言的数据以及在Zipf-like分类图中观察到的涌现模式，试图揭示大规模（基于语系中语言的数量）和小规模（基于语系中各语言的使用人数）分类的特征。论文的关键在于发现十四种主要当代语系（排除闪含语系和尼罗-撒哈拉语系语言）被划分为三个语系四重体，每个四重体在Zipf图中表现出显著不同的指数。这一发现揭示了主要语言家族的潜在结构与组织方式，为理解语言多样性和分布提供了深刻的见解。

链接: https://arxiv.org/abs/2504.01681
作者: Maelyson R. F. Santos,Marcelo A. F. Gomes
机构: Departamento de Física, Universidade Federal de Pernambuco (巴西联邦佩南布uco大学物理系); Instituto Federal de Educação, Ciência e Tecnologia de Pernambuco (巴西佩南布uco联邦教育、科学和技术学院)
类目: Physics and Society (physics.soc-ph); Computation and Language (cs.CL)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:This article investigates scaling laws within language families using data from over six thousand languages and analyzing emergent patterns observed in Zipf-like classification graphs. Both macroscopic (based on number of languages by family) and microscopic (based on numbers of speakers by language on a family) aspects of these classifications are examined. Particularly noteworthy is the discovery of a distinct division among the fourteen largest contemporary language families, excluding Afro-Asiatic and Nilo-Saharan languages. These families are found to be distributed across three language family quadruplets, each characterized by significantly different exponents in the Zipf graphs. This finding sheds light on the underlying structure and organization of major language families, revealing intriguing insights into the nature of linguistic diversity and distribution.
zh

计算机视觉

[CV-0] Learning from Streaming Video with Orthogonal Gradients CVPR2025

【速读】：本文针对从连续视频流自监督学习表征的问题展开研究。传统视频学习方法通常通过打乱和切分视频来构建满足独立同分布（IID）假设的非冗余批次，但在仅以连续输入形式提供视频时，这一假设被打破，导致性能下降。论文在三种任务中验证了从打乱学习切换到顺序学习时性能的降低：单视频表征学习方法DoRA、多视频数据集上的标准VideoMAE以及未来视频预测任务。为了解决这一性能下降问题，论文提出了一种对标准优化器的几何修正方法，在训练过程中利用正交梯度来解耦批次间的相关性。此方法可应用于任何优化器，并在随机梯度下降（SGD）和AdamW中进行了演示。关键在于提出的正交优化器能够缓解因连续视频流训练而导致的表征学习性能下降，在下游任务中表现出色，并在三种场景下均优于强基准AdamW。

链接: https://arxiv.org/abs/2504.01961
作者: Tengda Han,Dilara Gokay,Joseph Heyward,Chuhan Zhang,Daniel Zoran,Viorica Pătrăucean,João Carreira,Dima Damen,Andrew Zisserman
机构: Google DeepMind (谷歌深度思维); University of Bristol (布里斯托尔大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025

点击查看摘要

Abstract:We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer – we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. On three scenarios (DoRA, VideoMAE, future prediction), we show our orthogonal optimizer outperforms the strong AdamW in all three scenarios.
zh

[CV-1] Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis WACV

【速读】：该论文致力于解决在大规模非受限环境中，基于3D Gaussian Splatting (3DGS) 和 Neural Radiance Fields (NeRF) 的方法因稀疏且不均匀的输入覆盖、瞬时遮挡、外观变化以及不一致的相机设置而导致的重建质量下降问题。论文的关键解决方案是提出了一种名为GS-Diff的新框架，该框架通过多视图扩散模型引导生成条件于多视图输入的伪观测值，将欠约束的3D重建问题转化为适定问题，从而即使在数据稀疏的情况下也能实现稳健优化。此外，GS-Diff还集成了外观嵌入、单目深度先验、动态对象建模、各向异性正则化及先进的光栅化技术等多种增强功能，以应对真实场景中的几何与光度挑战。基准测试实验表明，GS-Diff在性能上显著优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.01960
作者: Niluthpol Chowdhury Mithun,Tuan Pham,Qiao Wang,Ben Southall,Kshitij Minhas,Bogdan Matei,Stephan Mandt,Supun Samarasekera,Rakesh Kumar
机构: SRI International (SRI国际); University of California, Irvine (加州大学欧文分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: WACV ULTRRA Workshop 2025

点击查看摘要

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have achieved impressive results in real-time 3D reconstruction and novel view synthesis. However, these methods struggle in large-scale, unconstrained environments where sparse and uneven input coverage, transient occlusions, appearance variability, and inconsistent camera settings lead to degraded quality. We propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model to address these limitations. By generating pseudo-observations conditioned on multi-view inputs, our method transforms under-constrained 3D reconstruction problems into well-posed ones, enabling robust optimization even with sparse data. GS-Diff further integrates several enhancements, including appearance embedding, monocular depth priors, dynamic object modeling, anisotropy regularization, and advanced rasterization techniques, to tackle geometric and photometric challenges in real-world settings. Experiments on four benchmarks demonstrate that GS-Diff consistently outperforms state-of-the-art baselines by significant margins.
zh

[CV-2] GaussianLSS – Toward Real-world BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting CVPR2025

【速读】：该论文旨在解决鸟瞰图（BEV）感知在实际应用中的两大挑战：不确定性建模不足和计算开销昂贵。为应对这些问题，论文提出了一种名为GaussianLSS的新型不确定性感知BEV感知框架。其关键创新在于通过重新审视基于升维（unprojection）的方法——特别是Lift-Splat-Shoot (LSS) 模式，并引入深度不确定性建模，将空间分散表示为软深度均值与深度分布方差的学习，从而隐式捕捉物体边界。此外，论文进一步将深度分布转化为三维高斯分布并进行栅格化，构建出具备不确定性感知能力的BEV特征。这一方法不仅实现了与投影法相当的性能（仅相差0.4%的IoU），还显著提升了速度（快2.5倍）和内存效率（减少至0.3倍）。

链接: https://arxiv.org/abs/2504.01957
作者: Shu-Wei Lu,Yi-Hsuan Tsai,Yi-Ting Chen
机构: National Yang Ming Chiao Tung University; Atmanity Inc
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Bird’s-eye view (BEV) perception has gained significant attention because it provides a unified representation to fuse multiple view images and enables a wide range of down-stream autonomous driving tasks, such as forecasting and planning. Recent state-of-the-art models utilize projection-based methods which formulate BEV perception as query learning to bypass explicit depth estimation. While we observe promising advancements in this paradigm, they still fall short of real-world applications because of the lack of uncertainty modeling and expensive computational requirement. In this work, we introduce GaussianLSS, a novel uncertainty-aware BEV perception framework that revisits unprojection-based methods, specifically the Lift-Splat-Shoot (LSS) paradigm, and enhances them with depth un-certainty modeling. GaussianLSS represents spatial dispersion by learning a soft depth mean and computing the variance of the depth distribution, which implicitly captures object extents. We then transform the depth distribution into 3D Gaussians and rasterize them to construct uncertainty-aware BEV features. We evaluate GaussianLSS on the nuScenes dataset, achieving state-of-the-art performance compared to unprojection-based methods. In particular, it provides significant advantages in speed, running 2.5x faster, and in memory efficiency, using 0.3x less memory compared to projection-based methods, while achieving competitive performance with only a 0.4% IoU difference.
zh

[CV-3] VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

【速读】：该论文旨在解决从稀疏视图恢复三维场景的问题，这是一个由于其固有不适定性而极具挑战性的任务。传统方法通过几何正则化或前馈确定性模型等专门解决方案来缓解这一问题，但它们在输入视图间重叠较少且视觉信息不足时仍会面临性能下降的问题。尽管近期视频生成模型显示出解决此挑战的潜力，但现有基于视频扩散模型的方法受限于推理速度慢及缺乏三维约束，导致效率低下且重建结果存在与真实世界几何结构不符的伪影。论文的关键解决方案在于提出VideoScene，通过蒸馏视频扩散模型实现三维场景的一步生成，以构建高效且有效的工具来弥合视频到三维的鸿沟。具体而言，设计了一种三维感知跃变流蒸馏策略以跳过耗时的冗余信息，并训练了一个动态去噪策略网络以在推理过程中自适应地确定最优跃变时间步长。广泛的实验表明，VideoScene相比之前的视频扩散模型能够更快、更优地生成三维场景，凸显其作为未来视频到三维应用高效工具的潜力。

链接: https://arxiv.org/abs/2504.01956
作者: Hanyang Wang,Fangfu Liu,Jiawei Chi,Yueqi Duan
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recovering 3D scenes from sparse views is a challenging task due to its inherent ill-posed problem. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic model) to mitigate the issue. However, they still suffer from performance degradation by minimal overlap across input views with insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research start to explore the potential of video generative prior and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraint, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video to 3D applications. Project Page: this https URL
zh

[CV-4] Scene-Centric Unsupervised Panoptic Segmentation CVPR2025

【速读】：该论文致力于解决无监督全景分割（Unsupervised Panoptic Segmentation）的问题，即在不依赖人工标注数据的情况下，将图像划分为语义上有意义的区域以及独立的对象实例。与以往专注于场景理解的研究不同，本文提出的方法直接利用以场景为中心的数据进行训练，无需依赖以对象为中心的训练数据，从而实现复杂场景的无监督理解。关键在于结合视觉表征、深度信息及运动线索，生成高分辨率的全景伪标签，并通过伪标签训练与全景自训练策略相结合的方式，实现了对复杂场景精准的全景分割预测，且无需任何人工标注。这种方法显著提升了全景分割的质量，在Cityscapes数据集上的PQ指标超越了近期最先进的无监督方法9.4个百分点。

链接: https://arxiv.org/abs/2504.01955
作者: Oliver Hahn,Christoph Reich,Nikita Araslanov,Daniel Cremers,Christian Rupprecht,Stefan Roth
机构: TU Darmstadt (达姆施塔特工业大学); TU Munich (慕尼黑工业大学); University of Oxford (牛津大学); MCML (麦克斯普朗克数字媒体与信息研究所); ELIZA (未知); hessian.AI (黑森人工智能研究所); visinf (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear at CVPR 2025. Christoph Reich and Oliver Hahn - both authors contributed equally. Code: this https URL Project page: this https URL

点击查看摘要

Abstract:Unsupervised panoptic segmentation aims to partition an image into semantically meaningful regions and distinct object instances without training on manually annotated data. In contrast to prior work on unsupervised panoptic scene understanding, we eliminate the need for object-centric training data, enabling the unsupervised understanding of complex scenes. To that end, we present the first unsupervised panoptic method that directly trains on scene-centric imagery. In particular, we propose an approach to obtain high-resolution panoptic pseudo labels on complex scene-centric data, combining visual representations, depth, and motion cues. Utilizing both pseudo-label training and a panoptic self-training strategy yields a novel approach that accurately predicts panoptic segmentation of complex scenes without requiring any human annotations. Our approach significantly improves panoptic quality, e.g., surpassing the recent state of the art in unsupervised panoptic segmentation on Cityscapes by 9.4% points in PQ.
zh

[CV-5] owards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

【速读】：该论文致力于解决跨视觉粒度（multi-granularity）的指代表达分割（Referring Expression Segmentation, RES）任务中的挑战。传统方法主要聚焦于物体级别的目标定位（object-level grounding），但在实际场景中，用户可能需要描述多目标、单目标或部分级别（part-level）的目标，这带来了显著的挑战。然而，现有数据集和模型缺乏支持更实用的多粒度RES任务的统一框架和充足的数据资源。为此，论文提出了一个新的多粒度指代表达分割（Multi-granularity Referring Expression Segmentation, MRES）任务，并构建了包含部分级别标注的RefCOCOm基准数据集，以及包含超过3220万掩码和标题的MRES-32M数据集，专门用于部分级别视觉-语言对齐。为应对多粒度RES的挑战，论文提出了一种统一的多模态大语言模型UniRES++，它整合了物体级别和部分级别的RES任务，并通过针对细粒度视觉特征探索的设计实现了性能提升。关键在于通过引入新的任务、大规模数据集和统一模型架构，解决了多粒度RES任务中的数据稀缺和技术瓶颈问题。

链接: https://arxiv.org/abs/2504.01954
作者: Jing Liu,Wenxuan Wang,Yisi Zhang,Yepeng Tang,Xingjian He,Longteng Guo,Tongtian Yue,Xinlong Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院); University of Science and Technology Beijing (北京科技大学); School of Computer Science and Technology, Beijing Jiaotong University (北京交通大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring expression segmentation (RES) aims at segmenting the entities’ masks that match the descriptive language expression. While traditional RES methods primarily address object-level grounding, real-world scenarios demand a more versatile framework that can handle multiple levels of target granularity, such as multi-object, single object or part-level references. This introduces great challenges due to the diverse and nuanced ways users describe targets. However, existing datasets and models mainly focus on designing grounding specialists for object-level target localization, lacking the necessary data resources and unified frameworks for the more practical multi-grained RES. In this paper, we take a step further towards visual granularity unified RES task. To overcome the limitation of data scarcity, we introduce a new multi-granularity referring expression segmentation (MRES) task, alongside the RefCOCOm benchmark, which includes part-level annotations for advancing finer-grained visual understanding. In addition, we create MRES-32M, the largest visual grounding dataset, comprising over 32.2M masks and captions across 1M images, specifically designed for part-level vision-language grounding. To tackle the challenges of multi-granularity RES, we propose UniRES++, a unified multimodal large language model that integrates object-level and part-level RES tasks. UniRES++ incorporates targeted designs for fine-grained visual feature exploration. With the joint model architecture and parameters, UniRES++ achieves state-of-the-art performance across multiple benchmarks, including RefCOCOm for MRES, gRefCOCO for generalized RES, and RefCOCO, RefCOCO+, RefCOCOg for classic RES. To foster future research into multi-grained visual grounding, our RefCOCOm benchmark, MRES-32M dataset and model UniRES++ will be publicly available at this https URL.
zh

[CV-6] Deep Representation Learning for Unsupervised Clustering of Myocardial Fiber Trajectories in Cardiac Diffusion Tensor Imaging MICCAI2025

【速读】：该论文旨在解决利用扩散张量成像（Diffusion Tensor Imaging, DTI）数据准确捕捉心肌复杂纤维结构的问题，尤其针对缺乏地面真实标签（ground truth labels）以及纤维轨迹模糊且相互交织的挑战。论文的关键解决方案在于提出了一种新颖的深度学习框架，通过无监督聚类方法识别心肌纤维的不同纤维束。该框架结合双向长短期记忆网络（Bidirectional Long Short-Term Memory, BiLSTM）提取纤维局部序列信息，并采用Transformer自编码器学习全局形状特征，同时在点级别融合重要的解剖学上下文信息。通过密度聚类算法处理这些表示，能够识别出33到62个稳健的聚类结果，成功捕获纤维轨迹在不同粒度下的细微差异。这一方法为心肌结构分析提供了灵活且定量的新途径，在手术规划优化、疾病相关重塑表征以及个性化心脏护理发展等方面具有潜在应用价值。

链接: https://arxiv.org/abs/2504.01953
作者: Mohini Anand,Xavier Tricoche
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 5 figures. Submitted to MICCAI 2025 (under review)

点击查看摘要

Abstract:Understanding the complex myocardial architecture is critical for diagnosing and treating heart disease. However, existing methods often struggle to accurately capture this intricate structure from Diffusion Tensor Imaging (DTI) data, particularly due to the lack of ground truth labels and the ambiguous, intertwined nature of fiber trajectories. We present a novel deep learning framework for unsupervised clustering of myocardial fibers, providing a data-driven approach to identifying distinct fiber bundles. We uniquely combine a Bidirectional Long Short-Term Memory network to capture local sequential information along fibers, with a Transformer autoencoder to learn global shape features, with pointwise incorporation of essential anatomical context. Clustering these representations using a density-based algorithm identifies 33 to 62 robust clusters, successfully capturing the subtle distinctions in fiber trajectories with varying levels of granularity. Our framework offers a new, flexible, and quantitative way to analyze myocardial structure, achieving a level of delineation that, to our knowledge, has not been previously achieved, with potential applications in improving surgical planning, characterizing disease-related remodeling, and ultimately, advancing personalized cardiac care.
zh

[CV-7] Image Difference Grounding with Natural Language

【速读】：该论文旨在解决视觉差异理解（Image Difference Understanding, IDU）在跨模态文本引导下的细粒度定位问题。现有方法要么仅关注检测所有变化区域而缺乏文本引导，要么只能提供粗粒度的差异描述，难以满足实际应用需求，尤其是在多图像场景下捕捉细微但有意义的视觉差异。为应对这一挑战，论文提出了Image Difference Grounding (IDG) 任务，专注于根据用户指令精确地定位视觉差异。为此，论文设计了一个包含多样化视觉变化图像对及细粒度查询指令的大规模高质量数据集DiffGround，并提出了一种名为DiffTracker的基线模型，其关键在于通过特征差异增强与共性抑制的有效整合，实现精准的差异定位。

链接: https://arxiv.org/abs/2504.01952
作者: Wenxuan Wang,Zijia Zhao,Yisi Zhang,Yepeng Tang,Erdong Hu,Xinlong Wang,Jing Liu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京智源人工智能研究院); University of Science and Technology Beijing (北京科技大学); Institute of Information Science, Beijing Jiaotong University (北京交通大学信息科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world scenarios like automatic surveillance, where detecting subtle but meaningful visual differences across multiple images is crucial. Besides, previous work on image difference understanding (IDU) has either focused on detecting all change regions without cross-modal text guidance, or on providing coarse-grained descriptions of differences. Therefore, to push towards finer-grained vision-language perception, we propose Image Difference Grounding (IDG), a task designed to precisely localize visual differences based on user instructions. We introduce DiffGround, a large-scale and high-quality dataset for IDG, containing image pairs with diverse visual variations along with instructions querying fine-grained differences. Besides, we present a baseline model for IDG, DiffTracker, which effectively integrates feature differential enhancement and common suppression to precisely locate differences. Experiments on the DiffGround dataset highlight the importance of our IDG dataset in enabling finer-grained IDU. To foster future research, both DiffGround data and DiffTracker model will be publicly released.
zh

[CV-8] End-to-End Driving with Online Trajectory Evaluation via BEV World Model

【速读】：该论文旨在解决端到端自动驾驶在实际应用中安全性的保障问题，特别是在线轨迹评估的有效性不足这一挑战。为实现这一目标，论文的关键解决方案是提出了一种名为WoTE的端到端驾驶框架，其核心在于利用鸟瞰图（BEV）世界模型预测未来BEV状态，从而对轨迹进行有效评估。与基于图像级的世界模型相比，该BEV世界模型具有更低的延迟，并且可以通过现成的BEV空间交通模拟器实现无缝监督。这种设计确保了模型在捕捉环境动态和预测未来状态方面的高效性和准确性，从而显著提升了轨迹评估的效果。

链接: https://arxiv.org/abs/2504.01941
作者: Yingyan Li,Yuqi Wang,Yang Liu,Jiawei He,Lue Fan,Zhaoxiang Zhang
机构: NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA) (自动化研究所，中国科学院); University of Chinese Academy of Sciences (UCAS) (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. Code is released at this https URL.
zh

[CV-9] ILLUME: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

【速读】：该论文旨在解决现有统一多模态大语言模型（unified MLLMs）在同时具备深度语义理解、高保真图像生成和编辑能力方面的不足。传统方法如Chameleon和EMU3因缺乏细粒度的语义交互，在视觉理解任务上表现逊色；而LaViT和ILLUME等模型虽采用语义编码器提升理解能力，但存在纹理丢失的问题，影响图像编辑效果；Janus系列则因输入输出表征的解耦限制了其处理交错的图文理解和生成任务的能力。论文的关键创新在于提出了一种名为DualViTok的统一双视觉标记化器，它能够在保持精细纹理的同时实现文本对齐的语义表达，并支持从粗到细的图像表示策略，以增强多模态的理解与生成能力。此外，通过引入扩散模型作为图像解码器，进一步提升了生成质量及超分辨率性能。这些设计共同构成了ILLUME+的核心解决方案，使其在多种任务中展现出卓越的灵活性和效率。

链接: https://arxiv.org/abs/2504.01934
作者: Runhui Huang,Chunwei Wang,Junwei Yang,Guansong Lu,Yunlong Yuan,Jianhua Han,Lu Hou,Wei Zhang,Lanqing Hong,Hengshuang Zhao,Hang Xu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present ILLUME+ that leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization, due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, Janus series decouples the input and output image representation, limiting their abilities to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: this https URL.
zh

[CV-10] Equivariant Spherical CNNs for Accurate Fiber Orientation Distribution Estimation in Neonatal Diffusion MRI with Reduced Acquisition Time

【速读】：该论文旨在解决新生儿扩散磁共振成像（diffusion Magnetic Resonance Imaging, dMRI）中脑微结构早期评估的挑战，这些问题包括低信噪比（SNR）、运动伪影以及持续的髓鞘化过程。为应对这些挑战，论文提出了一种专门针对新生儿dMRI的旋转等变球面卷积神经网络（Spherical Convolutional Neural Network, sCNN）框架。该框架的关键在于利用减少梯度方向集合（仅占完整协议的30%）的多壳dMRI信号预测纤维取向分布（Fiber Orientation Distribution, FOD），从而实现更快且更具成本效益的数据采集。实验结果表明，与多层感知机（Multi-Layer Perceptron, MLP）基线相比，所提出的sCNN在均方误差（Mean Squared Error, MSE）和角相关系数（Angular Correlation Coefficient, ACC）方面表现更优，并且基于sCNN预测的FOD进行的追踪结果在解剖学上的合理性、覆盖范围和连贯性上均有显著提升。这表明具有内在旋转等变性的sCNN为准确且临床高效的dMRI分析提供了有前景的方法，有助于提高早期脑发育诊断的能力和表征水平。

链接: https://arxiv.org/abs/2504.01925
作者: Haykel Snoussi,Davood Karimi
机构: Department of Radiology, Boston Children’s Hospital and Harvard Medical School (波士顿儿童医院和哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early and accurate assessment of brain microstructure using diffusion Magnetic Resonance Imaging (dMRI) is crucial for identifying neurodevelopmental disorders in neonates, but remains challenging due to low signal-to-noise ratio (SNR), motion artifacts, and ongoing myelination. In this study, we propose a rotationally equivariant Spherical Convolutional Neural Network (sCNN) framework tailored for neonatal dMRI. We predict the Fiber Orientation Distribution (FOD) from multi-shell dMRI signals acquired with a reduced set of gradient directions (30% of the full protocol), enabling faster and more cost-effective acquisitions. We train and evaluate the performance of our sCNN using real data from 43 neonatal dMRI datasets provided by the Developing Human Connectome Project (dHCP). Our results demonstrate that the sCNN achieves significantly lower mean squared error (MSE) and higher angular correlation coefficient (ACC) compared to a Multi-Layer Perceptron (MLP) baseline, indicating improved accuracy in FOD estimation. Furthermore, tractography results based on the sCNN-predicted FODs show improved anatomical plausibility, coverage, and coherence compared to those from the MLP. These findings highlight that sCNNs, with their inherent rotational equivariance, offer a promising approach for accurate and clinically efficient dMRI analysis, paving the way for improved diagnostic capabilities and characterization of early brain development.
zh

[CV-11] Is Temporal Prompting All We Need For Limited Labeled Action Recognition? CVPR

【速读】：该论文旨在解决视频理解任务中对大规模标注数据集依赖性强的问题，并提出一种无需修改核心架构即可实现时间建模的解决方案。论文的关键在于引入TP-CLIP，这是一种基于时间视觉提示（Temporal Visual Prompting）的CLIP模型适配方法，通过在不改变CLIP核心架构的前提下实现时间维度上的适应性，从而保留其零样本学习的泛化能力。与现有方法相比，TP-CLIP在多个数据集上的零样本和少样本学习任务中表现出色，同时仅需较少的计算资源（仅为最新SOTA方法的1/3 GFLOPs和1/28可调参数量），并在某些任务和数据集上超越后者达15.8%。

链接: https://arxiv.org/abs/2504.01890
作者: Shreyank N Gowda,Boyan Gao,Xiao Gu,Xiaobo Jin
机构: University of Nottingham (诺丁汉大学); University of Oxford (牛津大学); Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR-W 2025

点击查看摘要

Abstract:Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large scaled labeled datasets. Recent advancements in visual-language models, especially based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled datasets. Adaptations of such models for videos, typically involve modifying the architecture of vision-language models to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture. This preserves its generalization abilities. TP-CLIP efficiently integrates into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and computational efficiency. In particular, we use just 1/3 the GFLOPs and 1/28 the number of tuneable parameters in comparison to recent state-of-the-art and still outperform it by up to 15.8% depending on the task and dataset.
zh

[CV-12] GMAI-VL-R1: Harnessing Reinforcement Learning for Multimodal Medical Reasoning

【速读】：该论文旨在解决现有通用医疗人工智能模型在复杂医学决策推理能力方面的不足。解决方案的关键在于通过强化学习（Reinforcement Learning, RL）增强多模态医疗推理模型GMAI-VL-R1的推理能力，并结合基于拒绝采样的推理数据合成方法生成逐步推理数据，以提升模型的泛化性能。实验结果表明，强化学习训练显著提升了模型在医学图像诊断和视觉问答等任务中的表现，强调了强化学习对于实现真正泛化的重要性。

链接: https://arxiv.org/abs/2504.01886
作者: Yanzhou Su,Tianbin Li,Jiyao Liu,Chenglong Ma,Junzhi Ning,Cheng Tang,Sibo Ju,Jin Ye,Pengcheng Chen,Ming Hu,Shixiang Tang,Lihao Liu,Bin Fu,Wenqi Shao,Xiaowei Hu,Xiangwen Liao,Yuanfeng Ji,Junjun He
机构: Fuzhou University (福州大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学); Monash University (蒙纳士大学); University of Washington (华盛顿大学); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in general medical AI have made significant strides, but existing models often lack the reasoning capabilities needed for complex medical decision-making. This paper presents GMAI-VL-R1, a multimodal medical reasoning model enhanced by reinforcement learning (RL) to improve its reasoning abilities. Through iterative training, GMAI-VL-R1 optimizes decision-making, significantly boosting diagnostic accuracy and clinical support. We also develop a reasoning data synthesis method, generating step-by-step reasoning data via rejection sampling, which further enhances the model’s generalization. Experimental results show that after RL training, GMAI-VL-R1 excels in tasks such as medical image diagnosis and visual question answering. While the model demonstrates basic memorization with supervised fine-tuning, RL is crucial for true generalization. Our work establishes new evaluation benchmarks and paves the way for future advancements in medical reasoning models. Code, data, and model will be released at \hrefthis https URLthis link.
zh

[CV-13] A Diffusion-Based Framework for Occluded Object Movement

【速读】：该论文旨在解决图像编辑中无缝移动遮挡物体的挑战，特别是在真实场景图像中，由于遮挡情况的存在，现有方法难以在移动前完成被遮挡部分的修复。论文的关键在于提出了一种基于扩散模型的DiffOOM框架，通过同时执行物体去遮挡与移动的双分支设计来克服这一难题。其中，去遮挡分支利用背景颜色填充策略和动态更新的目标物体掩码，将扩散过程集中于完成目标物体的被遮挡部分；而移动分支则采用潜在变量优化实现物体定位，并结合局部文本条件引导以自然融入新环境。这一创新性方案有效提升了处理质量和用户体验。

链接: https://arxiv.org/abs/2504.01873
作者: Zheng-Peng Duan,Jiawei Zhang,Siyu Liu,Zheng Lin,Chun-Le Guo,Dongqing Zou,Jimmy Ren,Chongyi Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Seamlessly moving objects within a scene is a common requirement for image editing, but it is still a challenge for existing editing methods. Especially for real-world images, the occlusion situation further increases the difficulty. The main difficulty is that the occluded portion needs to be completed before movement can proceed. To leverage the real-world knowledge embedded in the pre-trained diffusion models, we propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM. The proposed DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously. The de-occlusion branch utilizes a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object. Concurrently, the movement branch employs latent optimization to place the completed object in the target location and adopts local text-conditioned guidance to integrate the object into new surroundings appropriately. Extensive evaluations demonstrate the superior performance of our method, which is further validated by a comprehensive user study.
zh

[CV-14] CoMatcher: Multi-View Collaborative Feature Matching CVPR2025

【速读】：该论文试图解决在复杂场景中可靠轨迹构建的问题，特别是在图像集匹配中由于独立配对存在显著遮挡或极端视角变化时导致的模糊估计。这一挑战源于基于有限双视图观测解释复杂三维结构的固有不确定性，因为三维到二维的投影会导致信息大量丢失。为了解决这个问题，论文引入了CoMatcher，这是一种深度多视图匹配器，其关键是利用来自不同视图的互补上下文线索形成整体三维场景理解，并通过跨视图投影一致性推断出可靠的全局解。在此基础上，进一步开发了一种组级框架，充分利用跨视图关系以应对大规模匹配任务。实验结果表明，所提出的方法优于主流的双视图匹配范式。

链接: https://arxiv.org/abs/2504.01872
作者: Jintao Zhang,Zimin Xia,Mingyue Dong,Shuhan Shen,Linwei Yue,Xianwei Zheng
机构: The State Key Lab. LIESMARS, Wuhan University (武汉大学); École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院); Chinese Academy of Sciences (中国科学院); China University of Geosciences (中国地质大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 7 figures, to be published in CVPR 2025

点击查看摘要

Abstract:This paper proposes a multi-view collaborative matching strategy for reliable track construction in complex scenarios. We observe that the pairwise matching paradigms applied to image set matching often result in ambiguous estimation when the selected independent pairs exhibit significant occlusions or extreme viewpoint changes. This challenge primarily stems from the inherent uncertainty in interpreting intricate 3D structures based on limited two-view observations, as the 3D-to-2D projection leads to significant information loss. To address this, we introduce CoMatcher, a deep multi-view matcher to (i) leverage complementary context cues from different views to form a holistic 3D scene understanding and (ii) utilize cross-view projection consistency to infer a reliable global solution. Building on CoMatcher, we develop a groupwise framework that fully exploits cross-view relationships for large-scale matching tasks. Extensive experiments on various complex scenarios demonstrate the superiority of our method over the mainstream two-view matching paradigm.
zh

[CV-15] BOGausS: Better Optimized Gaussian Splatting

【速读】：该论文旨在解决在不牺牲质量的前提下构建更小规模的3D Gaussian Splatting (3DGS) 模型的挑战。论文的关键在于提出了一种新的优化方法，即Better Optimized Gaussian Splatting (BOGausS)，通过仔细分析3DGS的训练过程实现模型大小缩小至原来的十分之一，同时保持渲染质量不变，从而显著提升了Gaussian Splatting技术的性能，使其达到或超越当前最先进的水平。

链接: https://arxiv.org/abs/2504.01844
作者: Stéphane Pateux,Matthieu Gendrin,Luce Morin,Théo Ladune,Xiaoran Jiang
机构: Orange Innovation (Orange 创新研究院), Cesson Sévigné, France; Univ Rennes (雷恩大学), INSA Rennes (雷恩国立应用科学学院), CNRS (法国国家科学研究中心), IETR-UMR 6164, F-35000 Rennes, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) proposes an efficient solution for novel view synthesis. Its framework provides fast and high-fidelity rendering. Although less complex than other solutions such as Neural Radiance Fields (NeRF), there are still some challenges building smaller models without sacrificing quality. In this study, we perform a careful analysis of 3DGS training process and propose a new optimization methodology. Our Better Optimized Gaussian Splatting (BOGausS) solution is able to generate models up to ten times lighter than the original 3DGS with no quality degradation, thus significantly boosting the performance of Gaussian Splatting compared to the state of the art.
zh

[CV-16] Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images

【速读】：该论文旨在解决皮肤疾病诊断中人工智能模型在子群间表现存在偏倚的问题，特别是针对敏感属性（如肤色）导致的性能差异。为了解决这一问题，论文提出了一种基于生成式 AI 的新框架——Dermatology Diffusion Transformer (DermDiT)，其关键在于利用大型视觉语言模型生成精确且适当的文本提示，并结合多模态文本-图像学习来生成新的皮肤镜图像。通过这种方式，DermDiT 能够生成合成图像以改善高度不平衡数据集中代表性不足群体（如患者、疾病等）的表现，从而提升临床诊断的准确性。

链接: https://arxiv.org/abs/2504.01838
作者: Nusrat Munia,Abdullah-Al-Zubaer Imran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper accepted at International Symposium on Biomedical Imaging (ISBI 2025)

点击查看摘要

Abstract:Artificial Intelligence (AI) in skin disease diagnosis has improved significantly, but a major concern is that these models frequently show biased performance across subgroups, especially regarding sensitive attributes such as skin color. To address these issues, we propose a novel generative AI-based framework, namely, Dermatology Diffusion Transformer (DermDiT), which leverages text prompts generated via Vision Language Models and multimodal text-image learning to generate new dermoscopic images. We utilize large vision language models to generate accurate and proper prompts for each dermoscopic image which helps to generate synthetic images to improve the representation of underrepresented groups (patient, disease, etc.) in highly imbalanced datasets for clinical diagnoses. Our extensive experimentation showcases the large vision language models providing much more insightful representations, that enable DermDiT to generate high-quality images. Our code is available at this https URL
zh

[CV-17] Implicit Bias Injection Attacks against Text-to-Image Diffusion Models CVPR2025

【速读】：该论文旨在解决文本到图像扩散模型（Text-to-Image Diffusion Models, T2I DMs）中隐式偏见（Implicit Bias）的问题，这种偏见缺乏明显的视觉特征，却能在不同语义环境中以多样化的方式表现，从而对生成内容的公正性构成潜在威胁。论文的关键创新在于提出了一种隐式偏见注入攻击框架（Implicit Bias Injection Attacks, IBI-Attacks），通过在提示嵌入空间（Prompt Embedding Space）中预先计算通用的偏见方向，并根据不同的输入自适应调整，实现对T2I扩散模型的隐式偏见注入。该方案无需直接修改用户输入或重新训练模型，即可以隐蔽且高效的方式将偏见传播至生成结果中，同时保持原始语义的完整性。实验验证了该方法在引入微妙且多样偏见方面的有效性及其在多种场景中的强隐蔽性和迁移能力。

链接: https://arxiv.org/abs/2504.01819
作者: Huayang Huang,Xiangye Jin,Jiaxu Miao,Yu Wu
机构: School of Computer Science, Wuhan University (武汉大学计算机学院); School of Mathematics and Statistics, Wuhan University (武汉大学数学与统计学院); School of Cyber Science and Technology, Sun Yat-sen University (中山大学网络科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accept to CVPR 2025

点击查看摘要

Abstract:The proliferation of text-to-image diffusion models (T2I DMs) has led to an increased presence of AI-generated images in daily life. However, biased T2I models can generate content with specific tendencies, potentially influencing people’s perceptions. Intentional exploitation of these biases risks conveying misleading information to the public. Current research on bias primarily addresses explicit biases with recognizable visual patterns, such as skin color and gender. This paper introduces a novel form of implicit bias that lacks explicit visual features but can manifest in diverse ways across various semantic contexts. This subtle and versatile nature makes this bias challenging to detect, easy to propagate, and adaptable to a wide range of scenarios. We further propose an implicit bias injection attack framework (IBI-Attacks) against T2I diffusion models by precomputing a general bias direction in the prompt embedding space and adaptively adjusting it based on different inputs. Our attack module can be seamlessly integrated into pre-trained diffusion models in a plug-and-play manner without direct manipulation of user input or model retraining. Extensive experiments validate the effectiveness of our scheme in introducing bias through subtle and diverse modifications while preserving the original semantics. The strong concealment and transferability of our attack across various scenarios further underscore the significance of our approach. Code is available at this https URL.
zh

[CV-18] Spatial-R1: Enhancing MLLM s in Video Spatial Reasoning

【速读】：该论文旨在提升多模态大型语言模型（Multi-modal Large Language Models, MLLMs）在视频理解中的空间推理能力，这是当前研究中既重要又具挑战性的方向。论文的关键解决方案包括两个方面：一是构建了一个名为SR的新视频空间推理数据集，该数据集源自ScanNet，并包含跨七种任务类型的自动标注问答对（QA pairs）；二是提出了一种面向任务的组相对策略优化方法（Task-Specific Group Relative Policy Optimization, GRPO），用于模型的精调。通过在SR数据集上使用GRPO对Qwen2.5-VL-7B-Instruct模型进行训练，Spatial-R1显著提升了VSI-Bench基准上的性能，在基线模型基础上实现了7.4%的增益，并超越了其他先进的当代模型。这一工作验证了专门的数据整理与优化技术对于提高视频MLLMs复杂空间推理能力的有效性。

链接: https://arxiv.org/abs/2504.01805
作者: Kun Ouyang
机构: Cranberry-Lemon University (Cranberry-Lemon 大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Enhancing the spatial reasoning capabilities of Multi-modal Large Language Models (MLLMs) for video understanding is crucial yet challenging. We present Spatial-R1, a targeted approach involving two key contributions: the curation of SR, a new video spatial reasoning dataset from ScanNet with automatically generated QA pairs across seven task types, and the application of Task-Specific Group Relative Policy Optimization (GRPO) for fine-tuning. By training the Qwen2.5-VL-7B-Instruct model on SR using GRPO, Spatial-R1 significantly advances performance on the VSI-Bench benchmark, achieving a 7.4% gain over the baseline and outperforming strong contemporary models. This work validates the effectiveness of specialized data curation and optimization techniques for improving complex spatial reasoning in video MLLMs.
zh

[CV-19] UniViTAR: Unified Vision Transformer with Native Resolution

【速读】：该论文旨在解决传统视觉Transformer（Vision Transformer, ViT）在处理自然视觉数据时忽略其分辨率变异性的问题，从而影响空间上下文保真度。为了解决这一问题，论文提出了UniViTAR，这是一种针对多模态时代统一视觉模态和原生分辨率场景设计的同质化视觉基础模型家族。解决方案的关键在于通过架构升级引入了两种核心机制：一是分辨率课程学习，从固定分辨率预训练过渡到原生分辨率微调，利用ViT对可变长度序列的适应性；二是通过批次间图像-视频切换实现视觉模态适应，以平衡计算效率与增强的时间推理能力。此外，还结合了基于Sigmoid的对比损失和来自冻结教师模型的特征蒸馏，加速了早期收敛过程。这些改进使得模型在多个规模（0.3B到1B参数量）上的广泛实验验证了其有效性。

链接: https://arxiv.org/abs/2504.01792
作者: Limeng Qiao,Yiyang Gan,Bairui Wang,Jie Qin,Shuang Xu,Siqi Yang,Lin Ma
机构: Meituan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional Vision Transformer simplifies visual modeling by standardizing input resolutions, often disregarding the variability of natural visual data and compromising spatial-contextual fidelity. While preliminary explorations have superficially investigated native resolution modeling, existing approaches still lack systematic analysis from a visual representation perspective. To bridge this gap, we introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario in the era of multimodal. Our framework first conducts architectural upgrades to the vanilla paradigm by integrating multiple advanced components. Building upon these improvements, a progressive training paradigm is introduced, which strategically combines two core mechanisms: (1) resolution curriculum learning, transitioning from fixed-resolution pretraining to native resolution tuning, thereby leveraging ViT’s inherent adaptability to variable-length sequences, and (2) visual modality adaptation via inter-batch image-video switching, which balances computational efficiency with enhanced temporal reasoning. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model, thereby accelerating early-stage convergence. Finally, trained exclusively on public datasets, externsive experiments across multiple model scales from 0.3B to 1B demonstrate its effectiveness.
zh

[CV-20] Memory-efficient Low-latency Remote Photoplethysmography through Temporal-Spatial State Space Duality

【速读】：该论文旨在解决远程光体积描记法（remote photoplethysmography, rPPG）在利用深度学习提升性能的同时面临的计算资源瓶颈问题。具体而言，随着模型复杂度增加，现有方法难以同时实现模型可扩展性、跨数据集泛化能力以及实时性约束。为应对这一挑战，论文提出了一种名为ME-rPPG的记忆高效算法，其关键在于基于时域-空域状态空间对偶性的设计，能够以极低的计算开销捕捉面部视频帧中的微弱周期性变化。这种设计不仅支持在长视频序列上的高效训练，还实现了低延迟推断，从而在多个公开数据集（MMPD、VitalVideo、PURE）上取得了显著优于基线方法的性能提升（MAE降低21.3%-60.2%），同时将内存占用降至3.6 MB，延迟控制在9.46 ms以内，大幅超越现有方法的精度与用户体验。

链接: https://arxiv.org/abs/2504.01774
作者: Kegang Wang,Jiankai Tang,Yuxuan Fan,Jiatong Ji,Yuanchun Shi,Yuntao Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG), enabling non-contact physiological monitoring through facial light reflection analysis, faces critical computational bottlenecks as deep learning introduces performance gains at the cost of prohibitive resource demands. This paper proposes ME-rPPG, a memory-efficient algorithm built on temporal-spatial state space duality, which resolves the trilemma of model scalability, cross-dataset generalization, and real-time constraints. Leveraging a transferable state space, ME-rPPG efficiently captures subtle periodic variations across facial frames while maintaining minimal computational overhead, enabling training on extended video sequences and supporting low-latency inference. Achieving cross-dataset MAEs of 5.38 (MMPD), 0.70 (VitalVideo), and 0.25 (PURE), ME-rPPG outperforms all baselines with improvements ranging from 21.3% to 60.2%. Our solution enables real-time inference with only 3.6 MB memory usage and 9.46 ms latency – surpassing existing methods by 19.5%-49.7% accuracy and 43.2% user satisfaction gains in real-world deployments. The code and demos are released for reproducibility on this https URL.
zh

[CV-21] Dual-stream Transformer-GCN Model with Contextualized Representations Learning for Monocular 3D Human Pose Estimation

【速读】：该论文致力于解决单目3D人体姿态估计中的深度模糊性、有限的3D标注训练数据、建模不平衡以及模型泛化受限等问题。为应对这些挑战，论文提出了一种基于上下文表征学习的创新性运动预训练方法。关键在于引入了Transformer-GCN双流模型，并通过遮掩2D姿态特征并在自蒸馏框架下学习高维表征，聚焦于空间-时间建模与上下文表征学习，从而显著提升模型理解姿态间时空关系的能力，增强泛化性能。此外，通过有效整合Transformer流的全局特征与GCN流的局部交互信息，实现了视频姿态估计中全局与局部信息的平衡建模，最终在多个基准数据集上取得了最先进的性能。

链接: https://arxiv.org/abs/2504.01764
作者: Mingrui Ye,Lianping Yang,Hegui Zhu,Zenghao Zheng,Xin Wang,Yantao Lo
机构: College of Sciences, Northeastern University, Shengyang 110819, China (东北大学理学院，沈阳 110819，中国); Department of Informatics, King’s College London, London, UK (英国伦敦国王学院信息学系); Key Laboratory of Differential Equations and Their Applications, Northeastern University, Liaoning Provincial Department of Education (东北大学微分方程及其应用重点实验室，辽宁省教育厅); Shenyang Sport University, Shenyang, China (沈阳体育大学，中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a novel approach to monocular 3D human pose estimation using contextualized representation learning with the Transformer-GCN dual-stream model. Monocular 3D human pose estimation is challenged by depth ambiguity, limited 3D-labeled training data, imbalanced modeling, and restricted model generalization. To address these limitations, our work introduces a groundbreaking motion pre-training method based on contextualized representation learning. Specifically, our method involves masking 2D pose features and utilizing a Transformer-GCN dual-stream model to learn high-dimensional representations through a self-distillation setup. By focusing on contextualized representation learning and spatial-temporal modeling, our approach enhances the model’s ability to understand spatial-temporal relationships between postures, resulting in superior generalization. Furthermore, leveraging the Transformer-GCN dual-stream model, our approach effectively balances global and local interactions in video pose estimation. The model adaptively integrates information from both the Transformer and GCN streams, where the GCN stream effectively learns local relationships between adjacent key points and frames, while the Transformer stream captures comprehensive global spatial and temporal features. Our model achieves state-of-the-art performance on two benchmark datasets, with an MPJPE of 38.0mm and P-MPJPE of 31.9mm on Human3.6M, and an MPJPE of 15.9mm on MPI-INF-3DHP. Furthermore, visual experiments on public datasets and in-the-wild videos demonstrate the robustness and generalization capabilities of our approach.
zh

[CV-22] Bridge the Gap between SNN and ANN for Image Restoration

【速读】：该论文试图解决基于传统人工神经网络（Artificial Neural Networks, ANNs）的密集预测模型在图像恢复任务中能耗过高的问题。同时，尽管基于尖峰神经网络（Spiking Neural Network, SNN）框架的模型通常仅消耗ANNs相同架构下不到10%的能量，但其训练成本远高于ANN，主要因为SNN采用启发式梯度下降策略，导致膜电位信号从稀疏到密集的变化过程缓慢，影响了整个训练的收敛性。为了解决这一问题，论文提出了一种新颖的知识蒸馏技术——不对称框架（ANN-SNN）蒸馏，其中教师模型为ANN，学生模型为SNN。关键在于利用教师ANN学习到的中间特征（特征图）作为提示来指导SNN的训练过程，这种方法不仅加速了SNN的收敛，还提升了其最终性能，有效弥合了SNN效率与ANN优越学习能力之间的差距。实验结果表明，所设计的基于SNN的图像恢复模型参数量仅为教师模型的1/300，能耗仅为教师模型的1/50，在某些去噪任务中表现与教师模型相当。

链接: https://arxiv.org/abs/2504.01755
作者: Xin Su,Chen Wu,Zhuoran Zheng
机构: Fuzhou University (福州大学); University of Science and Technology of China (中国科学技术大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review

点击查看摘要

Abstract:Models of dense prediction based on traditional Artificial Neural Networks (ANNs) require a lot of energy, especially for image restoration tasks. Currently, neural networks based on the SNN (Spiking Neural Network) framework are beginning to make their mark in the field of image restoration, especially as they typically use less than 10% of the energy of ANNs with the same architecture. However, training an SNN is much more expensive than training an ANN, due to the use of the heuristic gradient descent strategy. In other words, the process of SNN’s potential membrane signal changing from sparse to dense is very slow, which affects the convergence of the whole this http URL tackle this problem, we propose a novel distillation technique, called asymmetric framework (ANN-SNN) distillation, in which the teacher is an ANN and the student is an SNN. Specifically, we leverage the intermediate features (feature maps) learned by the ANN as hints to guide the training process of the SNN. This approach not only accelerates the convergence of the SNN but also improves its final performance, effectively bridging the gap between the efficiency of the SNN and the superior learning capabilities of ANN. Extensive experimental results show that our designed SNN-based image restoration model, which has only 1/300 the number of parameters of the teacher network and 1/50 the energy consumption of the teacher network, is as good as the teacher network in some denoising tasks.
zh

[CV-23] Understanding Cross-Model Perceptual Invariances Through Ensemble Metamers

【速读】：该论文旨在研究人工神经网络（Artificial Neural Networks, ANNs）的感知不变性，以提升模型的可解释性并使其更符合人类视觉特性。为实现这一目标，论文引入了一种基于神经网络集成的新方法来生成metamers（具有相同神经激活但物理上不同的刺激），通过捕获不同架构（如卷积神经网络CNNs和视觉TransformerViTs）之间的共享表征子空间来增强泛化能力。关键在于利用跨多样化模型结构的表征一致性，结合图像基度量工具评估生成metamers的语义保真度与自然性，揭示网络结构偏差对表征不变性的影响。

链接: https://arxiv.org/abs/2504.01739
作者: Lukas Boehm,Jonas Leo Mueller,Christoffer Loeffler,Leo Schwinn,Bjoern Eskofier,Dario Zanca
机构: FAU Erlangen-Nürnberg (FAU 埃尔朗根-纽伦堡大学); Pontificia Universidad Católica de Valparaíso (瓦尔帕莱索天主教大学); Technical University of Munich (慕尼黑工业大学); FAU Erlangen-Nürnberg (FAU 埃尔朗根-纽伦堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Understanding the perceptual invariances of artificial neural networks is essential for improving explainability and aligning models with human vision. Metamers - stimuli that are physically distinct yet produce identical neural activations - serve as a valuable tool for investigating these invariances. We introduce a novel approach to metamer generation by leveraging ensembles of artificial neural networks, capturing shared representational subspaces across diverse architectures, including convolutional neural networks and vision transformers. To characterize the properties of the generated metamers, we employ a suite of image-based metrics that assess factors such as semantic fidelity and naturalness. Our findings show that convolutional neural networks generate more recognizable and human-like metamers, while vision transformers produce realistic but less transferable metamers, highlighting the impact of architectural biases on representational invariances.
zh

[CV-24] AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

【速读】：该论文旨在解决大型视觉语言模型（LVLMs）在面对对抗性攻击时的脆弱性问题，同时避免因现有防御方法（如对抗微调）导致的干净输入性能下降。论文的关键创新在于提出了一种基于偏好优化（Preference Optimization, PO）的新防御策略AdPO。通过重新定义对抗训练为一个偏好优化问题，AdPO着重增强模型在干净输入上生成正常输出的倾向，并拒绝对抗样本可能诱导的误导性输出。这一方案的独特之处在于仅需修改图像编码器（例如CLIP ViT），而无需调整大规模语言模型（LLMs），从而在保持高效的同时显著提升了模型在多种下游任务中的清洁数据与对抗性数据上的表现。此外，论文验证了在小规模LVLMs上预训练并通过迁移学习到大规模模型的有效性，进一步平衡了计算成本与性能。

链接: https://arxiv.org/abs/2504.01735
作者: Chaohu Liu,Tianyi Gui,Yu Liu,Linli Xu
机构: School of Computer Science and Technology, University of Science and Technology of China (中国科学技术大学计算机科学与技术学院); State Key Laboratory of Cognitive Intelligence (认知智能国家重点实验室); Tongyi Lab (通义实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from performance degradation on clean inputs. In this paper, we proposes AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, aiming to enhance the model’s preference for generating normal outputs on clean inputs while rejecting the potential misleading outputs for adversarial examples. Notably, AdPO achieves this by solely modifying the image encoder, e.g., CLIP ViT, resulting in superior clean and adversarial performance in a variety of downsream tasks. Considering that training involves large language models (LLMs), the computational cost increases significantly. We validate that training on smaller LVLMs and subsequently transferring to larger models can achieve competitive performance while maintaining efficiency comparable to baseline methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which provides a novel perspective for future adversarial defense research.
zh

[CV-25] FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking

【速读】：该论文旨在解决现有大规模三维场景重建与新视角合成方法依赖于包含狭窄视场（Field of View, FoV）透视图像的数据集的问题。这类数据集虽然适用于小尺度场景，但需要大量的图像集合以及复杂的运动结构（Structure-from-Motion, SfM）处理流程，限制了其可扩展性。为了解决这一挑战，论文引入了一个专门设计用于场景重建任务的鱼眼图像数据集。该数据集通过双200度鱼眼镜头提供完整的360度覆盖，包括5个室内和5个室外场景，并且每个场景都配备了稀疏的SfM点云和精确的基于LIDAR的密集点云作为几何真实值，从而在遮挡和反射等具有挑战性的条件下实现稳健的基准测试。解决方案的关键在于利用鱼眼镜头提供的宽广视场，减少所需图像数量的同时提高数据集的覆盖率和可用性，支持多种场景重建、新视角合成及基于图像的渲染方法。

链接: https://arxiv.org/abs/2504.01732
作者: Ulas Gunes,Matias Turkulainen,Xuqian Ren,Arno Solin,Juho Kannala,Esa Rahtu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SCIA 2025

点击查看摘要

Abstract:The development of large-scale 3D scene reconstruction and novel view synthesis methods mostly rely on datasets comprising perspective images with narrow fields of view (FoV). While effective for small-scale scenes, these datasets require large image sets and extensive structure-from-motion (SfM) processing, limiting scalability. To address this, we introduce a fisheye image dataset tailored for scene reconstruction tasks. Using dual 200-degree fisheye lenses, our dataset provides full 360-degree coverage of 5 indoor and 5 outdoor scenes. Each scene has sparse SfM point clouds and precise LIDAR-derived dense point clouds that can be used as geometric ground-truth, enabling robust benchmarking under challenging conditions such as occlusions and reflections. While the baseline experiments focus on vanilla Gaussian Splatting and NeRF based Nerfacto methods, the dataset supports diverse approaches for scene reconstruction, novel view synthesis, and image-based rendering.
zh

[CV-26] DreamActor-M1: Holistic Expressive and Robust Human Image Animation with Hybrid Guidance

【速读】：该论文旨在解决现有基于图像的人体动画方法在精细化整体可控性、多尺度适应性和长期时间一致性方面的关键差距，这些问题限制了其表达能力和鲁棒性。论文提出了一种基于扩散Transformer（DiT）的框架——DreamActor-M1，并采用混合引导策略来克服这些局限。解决方案的关键在于通过混合控制信号（整合隐式的面部表示、3D头部球体和3D身体骨架）实现面部表情和身体运动的鲁棒控制，同时保持表达性和身份保真度；利用渐进训练策略处理不同分辨率和尺度的数据以适应多种身体姿态和图像尺度；并通过结合序列帧的运动模式与互补视觉参考确保复杂运动中未见区域的长期时间一致性。实验表明，该方法在人像、上半身及全身生成任务中优于当前最先进的技术，提供了具有鲁棒长期一致性的表达性结果。

链接: https://arxiv.org/abs/2504.01724
作者: Yuxuan Luo,Zhengkun Rong,Lizhen Wang,Longhao Zhang,Tianshu Hu,Yongming Zhu
机构: Bytedance Intelligent Creation (字节跳动智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which leads to their lower expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms the state-of-the-art works, delivering expressive results for portraits, upper-body, and full-body generation with robust long-term consistency. Project Page: this https URL.
zh

[CV-27] GSR4B: Biomass Map Super-Resolution with Sentinel-1/2 Guidance

【速读】：该论文旨在解决高分辨率（High-Resolution, HR）地上生物量（Above-Ground Biomass, AGB）制图的问题，当前这一领域面临高成本的机载激光扫描限制以及现有全球产品分辨率较低的挑战。论文提出了一种新颖的方法，通过结合高分辨率卫星观测数据与现有的低分辨率生物质产品，将低分辨率生物质图谱引导超分辨率化（Guided Super-Resolution, GSR），以从100米分辨率提升至10米分辨率。其关键是利用多尺度引导（Multi-Scale Guidance, MSG）技术，在不显著增加计算开销的情况下，不仅提升了回归精度（减少780吨/公顷均方根误差），还改善了感知质量（提高2.0分贝峰值信噪比），并更有效地捕捉高生物量值，同时发现最佳性能的AGB GSR方法倾向于最大程度保留引导图像的纹理特征。

链接: https://arxiv.org/abs/2504.01722
作者: Kaan Karaman,Yuchang Jiang,Damien Robert,Vivien Sainte Fare Garnot,Maria João Santos,Jan Dirk Wegner
机构: EcoVision Lab, DM3L, University of Zurich (苏黎世大学), Switzerland; Department of Geography, University of Zurich (苏黎世大学), Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for an oral presentation at the ISPRS Geospatial Week 2025

点击查看摘要

Abstract:Accurate Above-Ground Biomass (AGB) mapping at both large scale and high spatio-temporal resolution is essential for applications ranging from climate modeling to biodiversity assessment, and sustainable supply chain monitoring. At present, fine-grained AGB mapping relies on costly airborne laser scanning acquisition campaigns usually limited to regional scales. Initiatives such as the ESA CCI map attempt to generate global biomass products from diverse spaceborne sensors but at a coarser resolution. To enable global, high-resolution (HR) mapping, several works propose to regress AGB from HR satellite observations such as ESA Sentinel-1/2 images. We propose a novel way to address HR AGB estimation, by leveraging both HR satellite observations and existing low-resolution (LR) biomass products. We cast this problem as Guided Super-Resolution (GSR), aiming at upsampling LR biomass maps (sources) from 100 to 10 m resolution, using auxiliary HR co-registered satellite images (guides). We compare super-resolving AGB maps with and without guidance, against direct regression from satellite images, on the public BioMassters dataset. We observe that Multi-Scale Guidance (MSG) outperforms direct regression both for regression ( -780 t/ha RMSE) and perception ( +2.0 dB PSNR) metrics, and better captures high-biomass values, without significant computational overhead. Interestingly, unlike the RGB+Depth setting they were originally designed for, our best-performing AGB GSR approaches are those that most preserve the guide image texture. Our results make a strong case for adopting the GSR framework for accurate HR biomass mapping at scale. Our code and model weights are made publicly available (this https URL).
zh

[CV-28] InvFussion: Bridging Supervised and Zero-shot Diffusion for Inverse Problems

【速读】：该论文旨在解决扩散模型（Diffusion Models）在处理逆问题（inverse problems）时存在的一个根本权衡：基于训练的方法能够提供高质量的结果，但缺乏灵活性；而零样本方法虽然具备灵活性，却在性能上有所妥协。论文提出了一种框架，结合了监督方法的高性能与零样本方法的灵活性。其关键创新在于通过一种新颖的架构设计，将退化算子（degradation operator）无缝集成到去噪器（denoiser）中。具体而言，在每个模块中，所提出的架构通过对网络激活应用退化算子，并利用注意力机制对输出进行条件约束，从而实现在适应多样化退化场景的同时保持高精度性能。这一改进使去噪网络成为通用最小均方误差估计器（MMSE estimator）、后验采样器或神经后验主成分估计器，显著提升了下游任务的适用性。实验结果表明，该方法在FFHQ和ImageNet数据集上的后验采样性能达到了最先进的水平，优于基于训练的方法和零样本替代方案。

链接: https://arxiv.org/abs/2504.01689
作者: Noam Elata,Hyungjin Chung,Jong Chul Ye,Tomer Michaeli,Michael Elad
机构: Technion (以色列理工学院); KAIST (韩国科学技术院); EverEx (未知中文名称)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Models have demonstrated remarkable capabilities in handling inverse problems, offering high-quality posterior-sampling-based solutions. Despite significant advances, a fundamental trade-off persists, regarding the way the conditioned synthesis is employed: Training-based methods achieve high quality results, while zero-shot approaches trade this with flexibility. This work introduces a framework that combines the best of both worlds – the strong performance of supervised approaches and the flexibility of zero-shot methods. This is achieved through a novel architectural design that seamlessly integrates the degradation operator directly into the denoiser. In each block, our proposed architecture applies the degradation operator on the network activations and conditions the output using the attention mechanism, enabling adaptation to diverse degradation scenarios while maintaining high performance. Our work demonstrates the versatility of the proposed architecture, operating as a general MMSE estimator, a posterior sampler, or a Neural Posterior Principal Component estimator. This flexibility enables a wide range of downstream tasks, highlighting the broad applicability of our framework. The proposed modification of the denoiser network offers a versatile, accurate, and computationally efficient solution, demonstrating the advantages of dedicated network architectures for complex inverse problems. Experimental results on the FFHQ and ImageNet datasets demonstrate state-of-the-art posterior-sampling performance, surpassing both training-based and zero-shot alternatives.
zh

[CV-29] Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation

【速读】：该论文致力于解决三维点云语义分割（PCSS）在无监督领域自适应（UDA）场景下对现实世界扰动（如雪、雾、雨）及对抗性失真的鲁棒性不足的问题。当前方法的核心局限在于：(a) 共享类别区域中未对齐边界导致的无监督特征重叠；(b) 领域不变性学习压制目标特定模式引起的特征结构退化。为应对这些问题，论文提出了一种三部分框架，其关键在于：1) 通过鲁棒性指标量化模型评估模型对对抗攻击/损坏类型的抗干扰能力；2) 可逆注意力对齐模块（IAAM），通过注意力引导的重叠抑制实现双向领域映射，同时保留判别性结构；3) 带有质量感知对比学习的记忆库，逐步优化伪标签以获得更具有判别性的特征表示。实验结果表明，该方法在SynLiDAR到SemanticPOSS的迁移任务中，在对抗攻击下的最大mIoU提升了14.3%。

链接: https://arxiv.org/abs/2504.01668
作者: Junjie Chen,Yuecong Xu,Haosheng Li,Kemi Ding
机构: School of Automation and Intelligent Manufacturing (AIM), Southern University of Science and Technology (南方科技大学), Shenzhen, China (深圳, 中国); Department of Electrical and Computer Engineering, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages,6 figures

点击查看摘要

Abstract:3D point cloud semantic segmentation (PCSS) is a cornerstone for environmental perception in robotic systems and autonomous driving, enabling precise scene understanding through point-wise classification. While unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing methods critically overlook the inherent vulnerability to real-world perturbations (e.g., snow, fog, rain) and adversarial distortions. This work first identifies two intrinsic limitations that undermine current PCSS-UDA robustness: (a) unsupervised features overlap from unaligned boundaries in shared-class regions and (b) feature structure erosion caused by domain-invariant learning that suppresses target-specific patterns. To address the proposed problems, we propose a tripartite framework consisting of: 1) a robustness evaluation model quantifying resilience against adversarial attack/corruption types through robustness metrics; 2) an invertible attention alignment module (IAAM) enabling bidirectional domain mapping while preserving discriminative structure via attention-guided overlap suppression; and 3) a contrastive memory bank with quality-aware contrastive learning that progressively refines pseudo-labels with feature quality for more discriminative representations. Extensive experiments on SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of 14.3% under adversarial attack.
zh

[CV-30] CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition

【速读】：本文旨在解决连续手语识别（Continuous Sign Language Recognition, CSLR）任务中的挑战，即从视频中解析和转录手语手势序列。论文提出了一种名为CLIP手语适应（CLIP Sign Language Adaptation, CLIP-SLA）的新框架，通过参数高效微调（Parameter-Efficient Fine-Tuning, PEFT），利用CLIP模型强大的预训练视觉编码器来适配手语任务。该方案的关键在于引入了两种变体（SLA-Adapter和SLA-LoRA），它们将PEFT模块集成到CLIP视觉编码器中，从而以极少量可训练参数实现微调。实验结果表明，CLIP-SLA的两种变体在四个数据集上的表现优于多种现有最优（State-of-the-Art, SOTA）模型，同时所需参数更少，验证了所提方法的有效性和灵活性。这些发现展示了利用大规模预训练模型进行可扩展且高效的手语识别的潜力，并为未来手语理解领域的进步奠定了基础。

链接: https://arxiv.org/abs/2504.01666
作者: Sarah Alyami,Hamzah Luqman
机构: King Fahd University of Petroleum and Minerals (法赫德国王石油矿产大学); Imam Abdulrahman Bin Faisal University (伊玛目阿卜杜勒拉赫曼本菲萨尔大学); SDAIA–KFUPM Joint Research Center for Artificial Intelligence (SDAIA–KFUPM 人工智能联合研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that leverages the powerful pre-trained visual encoder from the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed frameworks is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperformed several SOTA models with fewer trainable parameters. Extensive ablation studies emphasize the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, which pave the way for future advancements in sign language understanding.
zh

[CV-31] BioAtt: Anatomical Prior Driven Low-Dose CT Denoising

【速读】：该论文旨在解决现有基于深度学习的低剂量 CT (Low-Dose CT, LDCT) 去噪方法因纯数据驱动的注意力机制而过度平滑重要解剖细节的问题。解决方案的关键在于引入了一种新的框架——BioAtt，其通过利用从预训练的视觉-语言模型 BiomedCLIP 中提取的解剖先验分布来指导去噪模型，使其专注于与解剖相关的区域以抑制噪声，同时保留临床相关的结构。这一创新直接将解剖先验嵌入空间注意力机制中，从而有效提升了去噪性能。

链接: https://arxiv.org/abs/2504.01662
作者: Namhun Kim,UiHyun Cho
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:Deep-learning-based denoising methods have significantly improved Low-Dose CT (LDCT) image quality. However, existing models often over-smooth important anatomical details due to their purely data-driven attention mechanisms. To address this challenge, we propose a novel LDCT denoising framework, BioAtt. The key innovation lies in attending anatomical prior distributions extracted from the pretrained vision-language model BiomedCLIP. These priors guide the denoising model to focus on anatomically relevant regions to suppress noise while preserving clinically relevant structures. We highlight three main contributions: BioAtt outperforms baseline and attention-based models in SSIM, PSNR, and RMSE across multiple anatomical regions. The framework introduces a new architectural paradigm by embedding anatomic priors directly into spatial attention. Finally, BioAtt attention maps provide visual confirmation that the improvements stem from anatomical guidance rather than increased model complexity.
zh

[CV-32] Robust Unsupervised Domain Adaptation for 3D Point Cloud Segmentation Under Source Adversarial Attacks

【速读】：本文旨在解决现有无监督领域自适应（UDA）框架在源域数据受到对抗性污染时缺乏鲁棒性的问题。论文的关键创新在于提出了AdvSynLiDAR数据集与Adversarial Adaptation Framework (AAF) 方法。具体而言，AAF通过将关键点敏感（KPS）损失扩展至鲁棒长尾损失（RLT损失），并在模型中引入解码器分支，使模型在预训练阶段能够关注长尾类别，并利用高置信度解码的点云信息恢复受污染点云结构，从而有效缓解源域对抗扰动对3D点云语义分割UDA任务性能的负面影响。

链接: https://arxiv.org/abs/2504.01659
作者: Haosheng Li,Yuecong Xu,Junjie Chen,Kemi Ding
机构: Department of Automation and Intelligent Manufacturing (AIM), Southern University of Science and Technology (中国南方科技大学自动化与智能制造系); Department of Electrical and Computer Engineering, National University of Singapore (新加坡国立大学电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. However, existing works overlook adversarial robustness when the source domain itself is compromised. To comprehensively explore the robustness of the UDA frameworks, we first design a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. Based on that, we propose a novel dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds. With the generated corrupted data, we further develop the Adversarial Adaptation Framework (AAF) as the countermeasure. Specifically, by extending the key point sensitive (KPS) loss towards the Robust Long-Tail loss (RLT loss) and utilizing a decoder branch, our approach enables the model to focus on long-tail classes during the pre-training phase and leverages high-confidence decoded point cloud information to restore point cloud structures during the adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where the results demonstrate that our AAF method can mitigate performance degradation under source adversarial perturbations for UDA in the 3D point cloud segmentation application.
zh

[CV-33] Q-Adapt: Adapting LMM for Visual Quality Assessment with Progressive Instruction Tuning

【速读】：该论文试图解决现有大型多模态基础模型（Large Multi-modal Foundation Models, LMM）在联合指令微调过程中，图像质量评估（Image Quality Assessment, IQA）的两种感知解释（整体质量解释与属性级感知回答）之间存在的冲突问题，导致感知理解不足的问题。为了解决这一挑战，论文提出了一种新的面向感知的指令微调范式——Q-Adapt。其关键在于通过一种渐进式的指令微调策略，将LMM适应于可解释图像质量评估（Explainable Image Quality Assessment, EIQA）任务分为两个阶段：第一阶段利用高效的LoRA（Low-Rank Adaptation）转移学习策略为模型赋予针对两项任务的通用感知知识；第二阶段引入指令自适应视觉提示微调，动态调整视觉特征以适配来自两项任务的不同指令。这种方法实现了轻量化的视觉质量评估器，在多个感知相关基准和常用IQA数据库上的性能达到或部分超越现有方法。

链接: https://arxiv.org/abs/2504.01655
作者: Yiting Lu,Xin Li,Haoning Wu,Bingchen Li,Weisi Lin,Zhibo Chen
机构: University of Science and Technology of China (中国科学技术大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Multi-modal Foundation Models (LMM) has paved the way for the possible Explainable Image Quality Assessment (EIQA) with instruction tuning from two perspectives: overall quality explanation, and attribute-wise perception answering. However, existing works usually overlooked the conflicts between these two types of perception explanations during joint instruction tuning, leading to insufficient perception understanding. To mitigate this, we propose a new paradigm for perception-oriented instruction tuning, i.e., Q-Adapt, which aims to eliminate the conflicts and achieve the synergy between these two EIQA tasks when adapting LMM, resulting in enhanced multi-faceted explanations of IQA. Particularly, we propose a progressive instruction tuning strategy by dividing the adaption process of LMM for EIQA into two stages, where the first stage empowers the LMM with universal perception knowledge tailored for two tasks using an efficient transfer learning strategy, i.e., LoRA, and the second stage introduces the instruction-adaptive visual prompt tuning to dynamically adapt visual features for the different instructions from two tasks. In this way, our proposed Q-Adapt can achieve a lightweight visual quality evaluator, demonstrating comparable performance and, in some instances, superior results across perceptual-related benchmarks and commonly-used IQA databases. The source code is publicly available at this https URL.
zh

[CV-34] ProtoGuard-guided PROPEL: Class-Aware Prototype Enhancement and Progressive Labeling for Incremental 3D Point Cloud Segmentation

【速读】：该论文旨在解决在类增量学习（Class-incremental Learning, CIL）场景下，基于3D点云语义分割模型面临的类别灾难性遗忘（catastrophic forgetting）、类间边界不清、类别分布不平衡以及相似类别误分类和长尾问题。论文的关键解决方案包括两个部分：首先，在基础类别训练阶段提出ProtoGuard方法，通过注意力机制结合几何与语义原型特征，维护每个类别的原型表示；其次，在新增类别训练阶段引入PROPEL（渐进式伪标签精炼），继承基础特征提取器和分类器，利用密度分布和语义相似性指导伪标签的传播与更新。这些方法有效提升了模型在S3DIS和ScanNet数据集上的表现，尤其在S3DIS数据集的5步CIL场景下将mIoU提升了最多20.39%。

链接: https://arxiv.org/abs/2504.01648
作者: Haosheng Li,Yuecong Xu,Junjie Chen,Kemi Ding
机构: Department of Automation and Intelligent Manufacturing (AIM), Southern University of Science and Technology (南方科技大学), Shenzhen, China (深圳市, 中国); Department of Electrical and Computer Engineering, National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D point cloud semantic segmentation technology has been widely used. However, in real-world scenarios, the environment is evolving. Thus, offline-trained segmentation models may lead to catastrophic forgetting of previously seen classes. Class-incremental learning (CIL) is designed to address the problem of catastrophic forgetting. While point clouds are common, we observe high similarity and unclear boundaries between different classes. Meanwhile, they are known to be imbalanced in class distribution. These lead to issues including misclassification between similar classes and the long-tail problem, which have not been adequately addressed in previous CIL methods. We thus propose ProtoGuard and PROPEL (Progressive Refinement Of PsEudo-Labels). In the base-class training phase, ProtoGuard maintains geometric and semantic prototypes for each class, which are combined into prototype features using an attention mechanism. In the novel-class training phase, PROPEL inherits the base feature extractor and classifier, guiding pseudo-label propagation and updates based on density distribution and semantic similarity. Extensive experiments show that our approach achieves remarkable results on both the S3DIS and ScanNet datasets, improving the mIoU of 3D point cloud segmentation by a maximum of 20.39% under the 5-step CIL scenario on S3DIS.
zh

[CV-35] FlowR: Flowing from Sparse to Dense 3D Reconstructions

【速读】：该论文旨在解决3D高保真新型视图合成（NVS）在稀疏视图场景下质量显著下降的问题。现有方法依赖密集捕捉以满足高质量需求，但这种方式成本高昂且耗时。此外，已有的一些利用2D生成模型缓解此问题的方法通常仅基于少量参考视图，未能充分利用可用的3D信息，导致生成结果不一致及重建伪影。为了解决这些问题，论文提出了一种多视角流匹配模型，其关键在于学习一种流，将可能稀疏重建的新型视角渲染与期望来自密集重建的渲染相连接。这种方法通过生成新视图增强场景捕捉，从而提高重建质量。模型训练使用了一个包含360万对图像的新数据集，并能在单次前向传播中处理多达45个视角（分辨率为960x540，相当于91K令牌）。实验表明，该方法在稀疏和密集视图场景下均能稳定提升NVS性能，在多个常用NVS基准测试中实现了比以往工作更高的重建质量。

链接: https://arxiv.org/abs/2504.01647
作者: Tobias Fischer,Samuel Rota Bulò,Yung-Hsu Yang,Nikhil Varma Keetha,Lorenzo Porzi,Norman Müller,Katja Schwarz,Jonathon Luiten,Marc Pollefeys,Peter Kontschieder
机构: ETH Zürich (苏黎世联邦理工学院); Meta Reality Labs Zürich (Meta现实实验室苏黎世); CMU (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page is available at this https URL

点击查看摘要

Abstract:3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, dense captures are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, such dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow matching model that learns a flow to connect novel view renderings from possibly sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with novel, generated views to improve reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in sparse- and dense-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.
zh

[CV-36] Bridge 2D-3D: Uncertainty-aware Hierarchical Registration Network with Domain Alignment AAAI2025

【速读】：该论文致力于解决图像到点云配准中的两个关键问题：一是直接均匀匹配图像块与点云块可能导致关注错误的噪声块而忽略关键信息；二是由于图像与点云模态之间的显著差异，难以在没有特定设计改进的情况下弥合域差距。为了解决这些问题，论文创新性地提出了不确定性感知分层匹配模块（Uncertainty-aware Hierarchical Matching Module, UHMM）和对抗模态对齐模块（Adversarial Modal Alignment Module, AMAM）。其中，UHMM 模型化了图像块中关键信息的不确定性，并促进了图像与点云特征的多层级融合交互；AMAM 设计了一种对抗方法以减少图像与点云之间的域差距。大量实验和消融研究验证了所提方法的有效性，使其成为图像到点云配准任务的最新技术。

链接: https://arxiv.org/abs/2504.01641
作者: Zhixin Cheng,Jiacheng Deng,Xinjun Li,Baoqun Yin,Tianzhu Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI2025accept

点击查看摘要

Abstract:The method for image-to-point cloud registration typically determines the rigid transformation using a coarse-to-fine pipeline. However, directly and uniformly matching image patches with point cloud patches may lead to focusing on incorrect noise patches during matching while ignoring key ones. Moreover, due to the significant differences between image and point cloud modalities, it may be challenging to bridge the domain gap without specific improvements in design. To address the above issues, we innovatively propose the Uncertainty-aware Hierarchical Matching Module (UHMM) and the Adversarial Modal Alignment Module (AMAM). Within the UHMM, we model the uncertainty of critical information in image patches and facilitate multi-level fusion interactions between image and point cloud features. In the AMAM, we design an adversarial approach to reduce the domain gap between image and point cloud. Extensive experiments and ablation studies on RGB-D Scene V2 and 7-Scenes benchmarks demonstrate the superiority of our method, making it a state-of-the-art approach for image-to-point cloud registration tasks.
zh

[CV-37] Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions

【速读】：该论文旨在解决深度神经网络（DNNs）在安全关键应用中，特别是在复杂动态环境中因局部损坏导致鲁棒性不足的问题。尽管已有研究评估了语义分割（SS）模型在全图自然或对抗性损坏下的鲁棒性，但针对密集视觉模型在局部损坏下的空间鲁棒性缺乏全面探索。为填补这一空白，论文引入了专门的度量标准以基准测试分割模型的空间鲁棒性，并提出了一种评估框架来评估局部损坏的影响。论文的关键创新在于揭示了使用单一局部对抗扰动表征最坏情况鲁棒性的内在复杂性，并通过提出区域感知多攻击对抗分析方法，深入理解模型对特定区域施加的对抗性扰动的鲁棒性。此外，论文通过在驾驶场景中的15个分割模型上验证所提出的度量和分析方法，发现基于变换器的分割模型对自然局部损坏表现出显著鲁棒性，而对对抗性损坏极为脆弱，反之基于卷积神经网络（CNN）的模型则表现相反。最终，论文通过集成模型解决了平衡自然与对抗性局部损坏鲁棒性的挑战，从而提高了密集视觉任务的可靠性。

链接: https://arxiv.org/abs/2504.01632
作者: Giulia Marchiori Pietrosanti,Giulio Rossolini,Alessandro Biondi,Giorgio Buttazzo
机构: Scuola Superiore Sant’Anna (圣安娜高等学院), Pisa; Department of Excellence in Robotics & AI (机器人与人工智能卓越研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:The robustness of DNNs is a crucial factor in safety-critical applications, particularly in complex and dynamic environments where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, a comprehensive investigation into the spatial robustness of dense vision models under localized corruptions remained underexplored. This paper fills this gap by introducing specialized metrics for benchmarking the spatial robustness of segmentation models, alongside with an evaluation framework to assess the impact of localized corruptions. Furthermore, we uncover the inherent complexity of characterizing worst-case robustness using a single localized adversarial perturbation. To address this, we propose region-aware multi-attack adversarial analysis, a method that enables a deeper understanding of model robustness against adversarial perturbations applied to specific regions. The proposed metrics and analysis were evaluated on 15 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones and vice-versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.
zh

[CV-38] A Conic Transformation Approach for Solving the Perspective-Three-Point Problem

【速读】：本文提出了一种圆锥变换方法以解决Perspective-Three-Point (P3P) 问题。现有最先进的求解器通过求解两个圆锥曲线的交点并构造退化圆锥来确定交点，而本文的方法基于一种新的变换，将两个圆锥曲线映射到一个新的坐标系中，使得其中一个圆锥曲线变为标准形式的抛物线。这一关键创新简化了寻找圆锥曲线交点的问题，通过将一个变量用另一个变量表示，显著降低了计算复杂度。此外，多项式的系数易于快速计算，并且只需确定实数交点，从而避免了使用复杂的高成本复数运算。尽管该方法最终转化为求解四次方程，但由于新公式的简化，其计算速度仍优于现有方法。广泛的评估表明，该方法在保持鲁棒性和稳定性的同时，实现了更高的求解速度。

链接: https://arxiv.org/abs/2504.01620
作者: Haidong Wu,Snehal Bhayani,Janne Heikkilä
机构: Center for Machine Vision and Signal Analysis (中心对于机器视觉和信号分析); University of Oulu (奥卢大学), Oulu, Finland (芬兰)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a conic transformation method to solve the Perspective-Three-Point (P3P) problem. In contrast to the current state-of-the-art solvers, which formulate the P3P problem by intersecting two conics and constructing a degenerate conic to find the intersection, our approach builds upon a new formulation based on a transformation that maps the two conics to a new coordinate system, where one of the conics becomes a standard parabola in a canonical form. This enables expressing one variable in terms of the other variable, and as a consequence, substantially simplifies the problem of finding the conic intersection. Moreover, the polynomial coefficients are fast to compute, and we only need to determine the real-valued intersection points, which avoids the requirement of using computationally expensive complex arithmetic. While the current state-of-the-art methods reduce the conic intersection problem to solving a univariate cubic equation, our approach, despite resulting in a quartic equation, is still faster thanks to this new simplified formulation. Extensive evaluations demonstrate that our method achieves higher speed while maintaining robustness and stability comparable to state-of-the-art methods.
zh

[CV-39] 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting ICME2025

【速读】：该论文旨在解决现有文本到3D生成方法在处理复杂结构（如盆景）时的局限性，这些方法因缺乏详细的结构性先验信息，难以生成复杂的三维结构。论文的关键创新在于提出了一种名为3DBonsai的新框架，其核心解决方案包括设计一个可训练的三维空间殖民算法以生成盆景结构，并通过随机采样和点云增强技术将其转化为三维高斯先验。此外，该框架提出了两种具有不同结构层次的盆景生成管道：细粒度结构条件生成利用三维结构先验初始化三维高斯分布以生成细节丰富的复杂盆景；粗粒度结构条件生成则采用多视角结构一致性模块来对齐二维与三维结构。同时，研究团队构建了一个包含统一2D和3D中式盆景数据集。实验结果表明，3DBonsai在结构感知的三维盆景生成任务上显著优于现有方法，确立了新的基准。

链接: https://arxiv.org/abs/2504.01619
作者: Hao Wu,Hao Wang,Ruochong Li,Xuran Ma,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学（广州）); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2025

点击查看摘要

Abstract:Recent advancements in text-to-3D generation have shown remarkable results by leveraging 3D priors in combination with 2D diffusion. However, previous methods utilize 3D priors that lack detailed and complex structural information, limiting them to generating simple objects and presenting challenges for creating intricate structures such as bonsai. In this paper, we propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with complex structures. Technically, we first design a trainable 3D space colonization algorithm to produce bonsai structures, which are then enhanced through random sampling and point cloud augmentation to serve as the 3D Gaussian priors. We introduce two bonsai generation pipelines with distinct structural levels: fine structure conditioned generation, which initializes 3D Gaussians using a 3D structure prior to produce detailed and complex bonsai, and coarse structure conditioned generation, which employs a multi-view structure consistency module to align 2D and 3D structures. Moreover, we have compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental results demonstrate that 3DBonsai significantly outperforms existing methods, providing a new benchmark for structure-aware 3D bonsai generation.
zh

[CV-40] AtextTA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Inpainting CVPR2025

【速读】：该论文旨在解决现有背景修复方法在生成背景时未能与前景主体保持和谐关系的问题，具体表现为前景主体位置固定，导致其与生成背景之间存在不一致性。为了解决这一挑战，论文提出了“文本引导的可变前景位置背景修复”任务，并设计了自适应变换代理（Adaptive Transformation Agent, A^\textT A）作为解决方案的关键。其核心在于通过PosAgent块动态预测合适的位移以实现前景主体位置的变化，利用逆向位移变换（Reverse Displacement Transform, RDT）模块基于语义信息调整特征图层次结构，同时引入位置切换嵌入（Position Switch Embedding）控制主体位置是自适应预测还是固定，从而确保生成结果既能满足可变前景位置的需求，也能在固定位置情况下保持良好的修复性能。

链接: https://arxiv.org/abs/2504.01603
作者: Yizhe Tang,Zhimin Sun,Yuzhen Du,Ran Yi,Guangben Lu,Teng Hu,Luying Li,Lizhuang Ma,Fangyuan Zou
机构: Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Image inpainting aims to fill the missing region of an image. Recently, there has been a surge of interest in foreground-conditioned background inpainting, a sub-task that fills the background of an image while the foreground subject and associated text prompt are provided. Existing background inpainting methods typically strictly preserve the subject’s original position from the source image, resulting in inconsistencies between the subject and the generated background. To address this challenge, we propose a new task, the “Text-Guided Subject-Position Variable Background Inpainting”, which aims to dynamically adjust the subject position to achieve a harmonious relationship between the subject and the inpainted background, and propose the Adaptive Transformation Agent (A ^\textT A) for this task. Firstly, we design a PosAgent Block that adaptively predicts an appropriate displacement based on given features to achieve variable subject-position. Secondly, we design the Reverse Displacement Transform (RDT) module, which arranges multiple PosAgent blocks in a reverse structure, to transform hierarchical feature maps from deep to shallow based on semantic information. Thirdly, we equip A ^\textT A with a Position Switch Embedding to control whether the subject’s position in the generated image is adaptively predicted or fixed. Extensive comparative experiments validate the effectiveness of our A ^\textT A approach, which not only demonstrates superior inpainting capabilities in subject-position variable inpainting, but also ensures good performance on subject-position fixed inpainting.
zh

[CV-41] A topology-preserving three-stage framework for fully-connected coronary artery extraction

【速读】：该论文旨在解决冠状动脉提取中由于细小远端血管、复杂拓扑结构及对比度不足导致的过分割与欠分割问题。论文的关键在于提出了一种保拓扑的三阶段框架，用于完整冠状动脉树的提取，包括血管分割、中心线重连以及缺失血管重建。具体而言，首先引入新的中心线增强损失函数优化分割过程；其次，针对断裂的血管片段，提出了正则化行走算法，结合距离、中心线分类器预测的概率以及方向余弦相似性进行中心线重连；最后，采用隐式神经表示与建模方法重构缺失血管的几何模型。实验结果表明，该框架在ASOCA和PDSCA数据集上的Dice系数分别达到88.53%和85.07%，Hausdorff Distance分别为1.07mm和1.63mm，显著优于现有方法。

链接: https://arxiv.org/abs/2504.01597
作者: Yuehui Qiu,Dandan Shan,Yining Wang,Pei Dong,Dijia Wu,Xinnian Yang,Qingqi Hong,Dinggang Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Coronary artery extraction is a crucial prerequisite for computer-aided diagnosis of coronary artery disease. Accurately extracting the complete coronary tree remains challenging due to several factors, including presence of thin distal vessels, tortuous topological structures, and insufficient contrast. These issues often result in over-segmentation and under-segmentation in current segmentation methods. To address these challenges, we propose a topology-preserving three-stage framework for fully-connected coronary artery extraction. This framework includes vessel segmentation, centerline reconnection, and missing vessel reconstruction. First, we introduce a new centerline enhanced loss in the segmentation process. Second, for the broken vessel segments, we further propose a regularized walk algorithm to integrate distance, probabilities predicted by a centerline classifier, and directional cosine similarity, for reconnecting the centerlines. Third, we apply implicit neural representation and implicit modeling, to reconstruct the geometric model of the missing vessels. Experimental results show that our proposed framework outperforms existing methods, achieving Dice scores of 88.53% and 85.07%, with Hausdorff Distances (HD) of 1.07mm and 1.63mm on ASOCA and PDSCA datasets, respectively. Code will be available at this https URL.
zh

[CV-42] DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image

【速读】：该论文旨在解决基于深度增强（Depth Enhancement）的任务中，现有基于超分辨率（Super-Resolution）的方法在实际应用中的局限性。这些方法通常依赖于理想化的假设，如精确的区域对应关系和可靠的直接飞行时间成像（dToF）输入，而忽视了由校准误差引起的错位以及dToF成像固有的异常信号，从而限制了其在真实场景中的适用性。论文的关键解决方案在于提出了一种新的基于补全（Completion-Based）的方法DEPTHOR，其核心创新点包括改进的训练策略和网络架构设计。首先，通过从合成数据集的真实地面真值模拟现实世界的dToF数据，实现对噪声的鲁棒训练；其次，设计了一种融合单目深度估计（Monocular Depth Estimation, MDE）的新型网络，利用全局深度关系和上下文信息提升挑战区域的预测精度。这些创新使得DEPTHOR在ZJU-L5数据集和自收集的更具挑战性的dToF样本集中均取得了显著性能提升。

链接: https://arxiv.org/abs/2504.01596
作者: Jijun Xiang,Xuan Zhu,Xianqi Wang,Yu Wang,Hong Zhang,Fei Guo,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学); Honor Device Co., Ltd (荣耀终端有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our Code is available at this https URL
zh

[CV-43] Leverag ing Modality Tags for Enhanced Cross-Modal Video Retrieval

【速读】：该论文试图解决视频检索中视觉内容与自然语言描述对齐的问题。解决方案的关键在于提出了一种名为Modality Auxiliary Concepts for Video Retrieval (MAC-VR) 的新方法，通过利用从基础模型自动提取的模态特定标签（modality-specific tags）来增强视频检索性能。具体而言，MAC-VR 方法在潜在空间中对齐不同模态，并学习和对齐从视频及其对应标题特征中衍生出的辅助潜在概念（auxiliary latent concepts），以改进视觉和文本潜在概念的对齐，从而实现概念间的区分。实验结果表明，这种模态特定标签显著提升了跨模态对齐能力，在五个数据集上的表现优于当前最先进的方法。

链接: https://arxiv.org/abs/2504.01591
作者: Adriano Fragomeni,Dima Damen,Michael Wray
机构: University of Bristol (布里斯托尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags – automatically extracted from foundation models – to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts, derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, and so are able to distinguish concepts from one other. We conduct extensive experiments on five diverse datasets: MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across the other two.
zh

[CV-44] xt Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models

【速读】：该论文试图解决视觉-语言模型（Vision-Language Models, VLMs）在处理跨模态冲突信号时的能力不足问题。具体而言，研究聚焦于VLMs如何处理ASCII艺术这一特殊媒介，在其中文本元素共同构成视觉模式，但可能产生语义-视觉冲突的情况。论文引入了一种新的评估框架，通过对抗性ASCII艺术系统性挑战五种最先进的模型，揭示了这些模型存在强烈的文本优先偏见，即它们倾向于优先处理文本信息而非视觉模式，并且随着语义复杂性的增加，其视觉识别能力显著下降。尽管尝试了多种缓解措施，如调整视觉参数和提示工程，但仅取得了有限的改进，表明这种局限性需要从架构层面解决。因此，解决方案的关键在于开发能够从根本上改善多模态信息整合方式的新架构设计。

链接: https://arxiv.org/abs/2504.01589
作者: Zhaochen Wang,Yujun Cai,Zi Huang,Bryan Hooi,Yiwei Wang,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review at COLM 2025

点击查看摘要

Abstract:Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals across modalities remains underexplored. This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts. We introduce a novel evaluation framework that systematically challenges five state-of-the-art models (including GPT-4o, Claude, and Gemini) using adversarial ASCII art, where character-level semantics deliberately contradict global visual patterns. Our experiments reveal a strong text-priority bias: VLMs consistently prioritize textual information over visual patterns, with visual recognition ability declining dramatically as semantic complexity increases. Various mitigation attempts through visual parameter tuning and prompt engineering yielded only modest improvements, suggesting that this limitation requires architectural-level solutions. These findings uncover fundamental flaws in how current VLMs integrate multimodal information, providing important guidance for future model development while highlighting significant implications for content moderation systems vulnerable to adversarial examples.
zh

[CV-45] Pro-DG: Procedural Diffusion Guidance for Architectural Facade Generation

【速读】：该论文试图解决在保持局部细节真实性和外观保真度的同时，实现建筑立面（facade）的大规模可控编辑的问题。解决方案的关键在于提出了一种名为Pro-DG的框架，结合了基于规则的形状文法（procedural shape grammar）与扩散模型（diffusion-based image synthesis）的图像生成方法。具体而言，Pro-DG首先通过形状文法规则从单张输入图像重建立面布局，然后利用用户定义的变换编辑结构，并引入层次匹配程序以对齐不同层级的立面结构，从而生成控制映射来指导生成式扩散管道。这种方法能够在大规模编辑（如楼层复制或窗户重排）时保留局部外观的真实性，同时实现精确且可控的图像操作。

链接: https://arxiv.org/abs/2504.01571
作者: Aleksander Plocharski,Jan Swidzinski,Przemyslaw Musialski
机构: IDEAS NCBR; Warsaw University of Technology; New Jersey Institute of Technology (新泽西理工学院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 13 figures

点击查看摘要

Abstract:We present Pro-DG, a framework for procedurally controllable photo-realistic facade generation that combines a procedural shape grammar with diffusion-based image synthesis. Starting from a single input image, we reconstruct its facade layout using grammar rules, then edit that structure through user-defined transformations. As facades are inherently multi-hierarchical structures, we introduce hierarchical matching procedure that aligns facade structures at different levels which is used to introduce control maps to guide a generative diffusion pipeline. This approach retains local appearance fidelity while accommodating large-scale edits such as floor duplication or window rearrangement. We provide a thorough evaluation, comparing Pro-DG against inpainting-based baselines and synthetic ground truths. Our user study and quantitative measurements indicate improved preservation of architectural identity and higher edit accuracy. Our novel method is the first to integrate neuro-symbolically derived shape-grammars for modeling with modern generative model and highlights the broader potential of such approaches for precise and controllable image manipulation.
zh

[CV-46] RealityAvatar: Towards Realistic Loose Clothing Modeling in Animatable 3D Gaussian Avatars

【速读】：该论文致力于解决利用单目或多视角视频建模可动画化人体 avatar 时，现有方法难以准确捕捉宽松服装动态的问题。这些现有方法通常依赖全局姿态条件或静态逐帧表示，导致非刚性区域出现过平滑及时间不一致性。论文提出的解决方案关键在于 RealityAvatar 框架，它通过结合 3D 高斯点 splatting 捕捉复杂服装变形与运动动力学，同时确保几何一致性。此外，通过引入运动趋势模块和潜在骨骼编码器，显式建模姿态相关的变形以及服装行为的时间变化，从而有效提升动态人体重建中的结构保真度和感知质量，尤其是在非刚性区域，并实现跨时间帧的一致性增强。

链接: https://arxiv.org/abs/2504.01559
作者: Yahui Li,Zhi Zeng,Liming Pang,Guixuan Zhang,Shuwu Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modeling animatable human avatars from monocular or multi-view videos has been widely studied, with recent approaches leveraging neural radiance fields (NeRFs) or 3D Gaussian Splatting (3DGS) achieving impressive results in novel-view and novel-pose synthesis. However, existing methods often struggle to accurately capture the dynamics of loose clothing, as they primarily rely on global pose conditioning or static per-frame representations, leading to oversmoothing and temporal inconsistencies in non-rigid regions. To address this, We propose RealityAvatar, an efficient framework for high-fidelity digital human modeling, specifically targeting loosely dressed avatars. Our method leverages 3D Gaussian Splatting to capture complex clothing deformations and motion dynamics while ensuring geometric consistency. By incorporating a motion trend module and a latentbone encoder, we explicitly model pose-dependent deformations and temporal variations in clothing behavior. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach in capturing fine-grained clothing deformations and motion-driven shape variations. Our method significantly enhances structural fidelity and perceptual quality in dynamic human reconstruction, particularly in non-rigid regions, while achieving better consistency across temporal frames.
zh

[CV-47] Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training

【速读】：该论文旨在解决医学图像语义分割中监督深度学习方法因需要大规模标注数据而限制其在临床应用可扩展性的问题。解决方案的关键在于提出了一种新颖的基于生成模型启发的半监督教师-学生框架，利用去噪扩散概率模型（Denoising Diffusion Probabilistic Models, DDPMs）逐步精炼噪声输入以生成分割掩膜。教师模型首先通过基于噪声污染图像重建的循环一致性约束以无监督方式训练，从而生成具有信息性的语义掩膜；随后，该教师模型与双胞胎学生网络共同参与协同训练过程。学生模型在有真实标签时学习，而在缺乏标签时则从教师生成的伪标签中学习，同时教师模型持续提升其伪标签生成能力。此外，为了进一步提高性能，引入了多轮伪标签生成策略以迭代优化伪标签生成过程。实验结果表明，该方法在多个生物医学成像基准测试中始终优于最先进的半监督技术，特别是在标注数据有限的情况下表现出显著有效性。

链接: https://arxiv.org/abs/2504.01547
作者: Luca Ciampi,Gabriele Lagani,Giuseppe Amato,Fabrizio Falchi
机构: ISTI-CNR (意大利国家研究委员会信息学与自动化技术研究所); ISTI-CNR (意大利国家研究委员会信息学与自动化技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at this https URL
zh

[CV-48] Beyond Nearest Neighbor Interpolation in Data Augmentation

【速读】：该论文旨在解决在数据增强过程中因使用最近邻插值（Nearest Neighbor Interpolation）避免未定义类别标签风险的同时，可能加剧像素级标注错误的问题。为同时规避这两种风险，作者的关键解决方案是对卷积神经网络的数据变换函数进行了修改：引入改进的几何变换函数以消除对最近邻插值的依赖，并通过基于均值的类别过滤机制处理未定义的类别标签，采用替代插值算法提升增强数据的质量。实验结果表明，在三个医学图像数据集上的语义分割任务中，所提出的替代插值算法在定性和定量方面均取得了显著改进。

链接: https://arxiv.org/abs/2504.01527
作者: Olivier Rukundo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 9 figures, 1 table

点击查看摘要

Abstract:Avoiding the risk of undefined categorical labels using nearest neighbor interpolation overlooks the risk of exacerbating pixel level annotation errors in data augmentation. To simultaneously avoid these risks, the author modified convolutional neural networks data transformation functions by incorporating a modified geometric transformation function to improve the quality of augmented data by removing the reliance on nearest neighbor interpolation and integrating a mean based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. Experiments on semantic segmentation tasks using three medical image datasets demonstrated both qualitative and quantitative improvements with alternative interpolation algorithms.
zh

[CV-49] Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model

【速读】：该论文试图解决如何在保持生成质量的同时，降低个性化扩散模型构建的计算开销问题。论文提出的关键解决方案是“Domain Guidance”，这是一种基于预训练知识的条件生成方法，通过引导采样过程以适应目标域来实现模型迁移。与传统的微调方法相比，Domain Guidance 在机制上类似于高级无分类器引导，能够更好地实现领域对齐并生成高质量结果。实验表明，该方法在多个迁移基准测试中显著提升了性能，FID 改善超过 19.6%，FD（_ \textDINOv2）改善达 23.4%，且无需额外训练即可与现有微调模型结合使用。

链接: https://arxiv.org/abs/2504.01521
作者: Jincheng Zhong,Xiangcheng Zhang,Jianmin Wang,Mingsheng Long
机构: School of Software, BNRist, Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in diffusion models have revolutionized generative modeling. However, the impressive and vivid outputs they produce often come at the cost of significant model scaling and increased computational demands. Consequently, building personalized diffusion models based on off-the-shelf models has emerged as an appealing alternative. In this paper, we introduce a novel perspective on conditional generation for transferring a pre-trained model. From this viewpoint, we propose Domain Guidance, a straightforward transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. Domain Guidance shares a formulation similar to advanced classifier-free guidance, facilitating better domain alignment and higher-quality generations. We provide both empirical and theoretical analyses of the mechanisms behind Domain Guidance. Our experimental results demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 23.4% improvement in FD _\textDINOv2 compared to standard fine-tuning. Notably, existing fine-tuned models can seamlessly integrate Domain Guidance to leverage these benefits, without additional training.
zh

[CV-50] raining-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis

【速读】：本文旨在解决现有生成方法任务导向性强、适用范围狭窄的问题，即如何在更广泛的条件下实现条件图像合成。为应对这一挑战，论文提出了一种创新方法，将条件图像合成视为多样化基础条件单元的模块化组合。关键在于设计了针对三种主要条件单元（文本、布局、拖拽）的专用对齐模块：引入密集概念对齐（Dense Concept Alignment, DCA）模块以实现视觉与文本的细粒度对齐；提出密集几何对齐（Dense Geometry Alignment, DGA）模块以施加全面的几何约束，保持空间配置；设计密集运动对齐（Dense Motion Alignment, DMA）模块以应用多层级运动正则化，确保像素轨迹自然且无视觉伪影。通过灵活插入和组合这些对齐模块，该框架显著提升了模型在多样化条件生成任务中的适应性，并大幅扩展了其应用场景。实验结果验证了所提方法在多种条件下的优越性能。

链接: https://arxiv.org/abs/2504.01515
作者: Zixuan Wang,Duo Peng,Feng Chen,Yuwei Yang,Yinjie Lei
机构: Sichuan University (四川大学); Singapore University of Technology and Design (新加坡科技设计大学); The University of Adelaide (阿德莱德大学); Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conditional image synthesis is a crucial task with broad applications, such as artistic creation and virtual reality. However, current generative methods are often task-oriented with a narrow scope, handling a restricted condition with constrained applicability. In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. Specifically, we divide conditions into three primary units: text, layout, and drag. To enable effective control over these conditions, we design a dedicated alignment module for each. For the text condition, we introduce a Dense Concept Alignment (DCA) module, which achieves dense visual-text alignment by drawing on diverse textual concepts. For the layout condition, we propose a Dense Geometry Alignment (DGA) module to enforce comprehensive geometric constraints that preserve the spatial configuration. For the drag condition, we introduce a Dense Motion Alignment (DMA) module to apply multi-level motion regularization, ensuring that each pixel follows its desired trajectory without visual artifacts. By flexibly inserting and combining these alignment modules, our framework enhances the model’s adaptability to diverse conditional generation tasks and greatly expands its application range. Extensive experiments demonstrate the superior performance of our framework across a variety of conditions, including textual description, segmentation mask (bounding box), drag manipulation, and their combinations. Code is available at this https URL.
zh

[CV-51] High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model

【速读】：该论文试图解决单视角3D生成中因二维图像几何歧义和三维高斯表示缺乏结构而导致的不一致问题，这些问题引发了三维物体生成的扭曲和模糊现象。论文的关键解决方案是提出了一种新的RGBN-体积高斯重建模型（GS-RGBN），其核心洞察在于采用一种结构化的三维表示方法，同时缓解上述两个问题。为此，论文设计了一种新颖的体素-高斯混合表示，其中三维体素表示包含显式的三维几何信息，消除了来自二维图像的几何歧义，并在学习过程中对高斯分布进行结构化处理，使优化更容易收敛到更好的局部最优解。三维体素表示通过一个融合模块获得，该模块对齐了从二维图像估计的RGB特征和表面法线特征。大量实验验证了所提方法在高质量重建结果、鲁棒泛化能力和良好效率方面的优越性。

链接: https://arxiv.org/abs/2504.01512
作者: Yiyang Shen,Kun Zhou,He Wang,Yin Yang,Tianjia Shao
机构: State Key Lab of CAD&CG, Zhejiang University (浙江大学重点实验室); AI Centre, University College London (伦敦大学学院人工智能中心); University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages

点击查看摘要

Abstract:Recently single-view 3D generation via Gaussian splatting has emerged and developed quickly. They learn 3D Gaussians from 2D RGB images generated from pre-trained multi-view diffusion (MVD) models, and have shown a promising avenue for 3D generation through a single image. Despite the current progress, these methods still suffer from the inconsistency jointly caused by the geometric ambiguity in the 2D images, and the lack of structure of 3D Gaussians, leading to distorted and blurry 3D object generation. In this paper, we propose to fix these issues by GS-RGBN, a new RGBN-volume Gaussian Reconstruction Model designed to generate high-fidelity 3D objects from single-view images. Our key insight is a structured 3D representation can simultaneously mitigate the afore-mentioned two issues. To this end, we propose a novel hybrid Voxel-Gaussian representation, where a 3D voxel representation contains explicit 3D geometric information, eliminating the geometric ambiguity from 2D images. It also structures Gaussians during learning so that the optimization tends to find better local optima. Our 3D voxel representation is obtained by a fusion module that aligns RGB features and surface normal features, both of which can be estimated from 2D images. Extensive experiments demonstrate the superiority of our methods over prior works in terms of high-quality reconstruction results, robust generalization, and good efficiency.
zh

[CV-52] Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment CVPR2025

【速读】：该论文旨在解决在多样化真实世界光照条件下（如低光、过曝等）进行高质量新型视图合成（Novel View Synthesis, NVS）的挑战。现有基于神经辐射场（Neural Radiance Fields, NeRF）和3D高斯点 splatting（3D Gaussian Splatting, 3DGS）的方法在多视角场景下难以应对因光照变化和图像信号处理器（Image Signal Processor, ISP）设置差异导致的光度不一致问题。论文的关键创新在于提出了一种名为Luminance-GS的新方法，通过采用每视角颜色矩阵映射和视点自适应曲线调整技术，在保持3DGS显式表示不变的前提下，实现了在多种复杂光照条件下的高质量NVS结果，同时达到了实时渲染速度与更优的重建质量。

链接: https://arxiv.org/abs/2504.01503
作者: Ziteng Cui,Xuangeng Chu,Tatsuya Harada
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, project page: this https URL

点击查看摘要

Abstract:Capturing high-quality photographs under diverse real-world lighting conditions is challenging, as both natural lighting (e.g., low-light) and camera exposure settings (e.g., exposure time) significantly impact image quality. This challenge becomes more pronounced in multi-view scenarios, where variations in lighting and image signal processor (ISP) settings across viewpoints introduce photometric inconsistencies. Such lighting degradations and view-dependent variations pose substantial challenges to novel view synthesis (NVS) frameworks based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). To address this, we introduce Luminance-GS, a novel approach to achieving high-quality novel view synthesis results under diverse challenging lighting conditions using 3DGS. By adopting per-view color matrix mapping and view-adaptive curve adjustments, Luminance-GS achieves state-of-the-art (SOTA) results across various lighting conditions – including low-light, overexposure, and varying exposure – while not altering the original 3DGS explicit representation. Compared to previous NeRF- and 3DGS-based baselines, Luminance-GS provides real-time rendering speed with improved reconstruction quality.
zh

[CV-53] GarmageNet: A Dataset and Scalable Representation for Generic Garment Modeling

【速读】：该论文旨在解决高保真服装建模的挑战，主要由于缺乏大规模高质量数据集以及能够高效处理非防水、多层几何结构的表示方法。论文的关键解决方案是提出了Garmage，这是一种神经网络和计算机图形学友好的服装表示方法，通过一组结构化的面板几何图像无缝编码复杂多层服装的精确几何形状和缝合图案。作为双2D-3D表示，Garmage实现了基于2D图像的算法与3D建模工作流的前所未有的集成，从而支持高保真的非防水、多层服装几何结构，并直接兼容工业级应用。在此基础上，论文进一步引入了GarmageNet，一种新型生成框架，能够根据用户提示或现有的野生缝合图案生成详细的多层服装及其符合身体轮廓的初始几何形状和复杂的缝合图案。此外，还提出了一种鲁棒的缝合算法，用于恢复每顶点的缝线，确保无缝集成到灵活的模拟管道中，以实现下游缝合图案、材料属性和动态模拟的编辑。最后，论文发布了一个工业标准的大规模高保真服装数据集，包含详细的注释、逐顶点对应关系以及将无结构生产缝合图案转换为GarmageNet标准结构资产的强大管道，为大规模工业级服装生成系统奠定了基础。

链接: https://arxiv.org/abs/2504.01483
作者: Siran Li,Ruiyang Liu,Chen Liu,Zhendong Wang,Gaofeng He,Yong-Lu Li,Xiaogang Jin,Huamin Wang
机构: Zhejiang Sci-Tech University (浙江理工大学); Style3D Research; State Key Lab of CAD&CG, Zhejiang University (浙江大学计算机辅助设计与图形学国家重点实验室); Shanghai Jiao Tong University (上海交通大学); Style3D Research; Style3D Research; Shanghai Jiao Tong University (上海交通大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity garment modeling remains challenging due to the lack of large-scale, high-quality datasets and efficient representations capable of handling non-watertight, multi-layer geometries. In this work, we introduce Garmage, a neural-network-and-CG-friendly garment representation that seamlessly encodes the accurate geometry and sewing pattern of complex multi-layered garments as a structured set of per-panel geometry images. As a dual-2D-3D representation, Garmage achieves an unprecedented integration of 2D image-based algorithms with 3D modeling workflows, enabling high fidelity, non-watertight, multi-layered garment geometries with direct compatibility for industrial-grade this http URL upon this representation, we present GarmageNet, a novel generation framework capable of producing detailed multi-layered garments with body-conforming initial geometries and intricate sewing patterns, based on user prompts or existing in-the-wild sewing patterns. Furthermore, we introduce a robust stitching algorithm that recovers per-vertex stitches, ensuring seamless integration into flexible simulation pipelines for downstream editing of sewing patterns, material properties, and dynamic simulations. Finally, we release an industrial-standard, large-scale, high-fidelity garment dataset featuring detailed annotations, vertex-wise correspondences, and a robust pipeline for converting unstructured production sewing patterns into GarmageNet standard structural assets, paving the way for large-scale, industrial-grade garment generation systems.
zh

[CV-54] Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction ICME2025

【速读】：该论文旨在解决跨模态3D检索任务中的挑战，目标是实现三维模型与文本模态之间的双向检索。当前方法主要依赖单一的三维表示形式（如点云），而较少利用二维与三维的一致性和互补关系，这限制了其性能表现。为弥合这一差距，论文提出采用多视图图像与点云联合表示三维形状，并通过三模态对齐（即图像、点云、文本）来提升跨模态3D检索效果。关键在于引入三模态重建以增强编码器的泛化能力，具体而言，利用文本特征指导从点特征重建图像特征，反之亦然；同时通过细粒度的二维-三维融合聚合多模态特征嵌入，强化几何与语义理解。此外，针对现有数据集中普遍存在的噪声问题（许多三维形状与文本具有相似语义），论文采用硬负样本对比训练，突出更难的负样本，从而生成鲁棒的判别性嵌入。实验结果表明，所提方法在Text2Shape数据集上的形状到文本及文本到形状检索任务中显著超越了现有最先进方法。

链接: https://arxiv.org/abs/2504.01476
作者: Junlong Ren,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学（广州）); The Hong Kong University of Science and Technology (Hong Kong) (香港科技大学（香港）)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICME 2025

点击查看摘要

Abstract:Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a certain 3D representation (e.g., point cloud), with few exploiting the 2D-3D consistency and complementary relationships, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of encoders. Given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them as multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training to emphasize harder negatives with greater significance, leading to robust discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks by a substantial margin.
zh

[CV-55] ANNEXE: Unified Analyzing Answering and Pixel Grounding for Egocentric Interaction

【速读】：该论文旨在解决现有自视角交互理解方法无法同时提供连贯的文本和像素级响应以满足多样化下游应用需求的问题。为全面解析自视角交互，论文提出了名为“自视角交互推理与像素定位（Ego-IRG）”的新任务，通过分析、回答及像素级定位三个关键步骤生成流畅的文本描述和精细的像素级响应。针对现有数据集无法支持Ego-IRG任务的局限性，论文构建了包含超过20,000张自视角图像及其对应多模态交互查询和响应的Ego-IRGBench数据集。此外，设计了统一的ANNEXE模型，利用多模态大型语言模型生成文本和像素级输出，实现对自视角交互的全面解读。实验结果表明，ANNEXE模型在Ego-IRGBench上的表现优于其他方法。

链接: https://arxiv.org/abs/2504.01472
作者: Yuejiao Su,Yi Wang,Qiongyang Hu,Chuang Yang,Lap-Pui Chau
机构: Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University (香港理工大学), Hong Kong SAR
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Computer Vision and Pattern Recognition

点击查看摘要

Abstract:Egocentric interaction perception is one of the essential branches in investigating human-environment interaction, which lays the basis for developing next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot yield coherent textual and pixel-level responses simultaneously according to user queries, which lacks flexibility for varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image with the query as input, Ego-IRG is the first task that aims to resolve the interactions through three crucial steps: analyzing, answering, and pixel grounding, which results in fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets cannot meet the conditions for the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset based on extensive manual efforts, which includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model to generate text- and pixel-level outputs utilizing multimodal large language models, which enables a comprehensive interpretation of egocentric interactions. The experiments on the Ego-IRGBench exhibit the effectiveness of our ANNEXE model compared with other works.
zh

[CV-56] Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies

【速读】：该论文旨在解决唇.syncing深度伪造视频的检测难题，这类深度伪造通过AI模型合成人物嘴唇动作以匹配修改或全新的音频，其伪造痕迹主要局限于口部区域且更为微妙，难以察觉。论文提出的关键解决方案是LIPINC-V2框架，它结合视觉时间Transformer与多头交叉注意力机制，通过识别口部区域在时空上的不一致性来检测唇.syncing深度伪造。此方法能够捕捉短期和长期的口部运动变化，从而更有效地发现这些细微的伪造特征。此外，作者构建了一个新的唇.syncing深度伪造数据集LipSyncTIMIT，用于模拟真实场景，进一步验证了模型的有效性。

链接: https://arxiv.org/abs/2504.01470
作者: Soumyya Kanti Datta,Shan Jia,Siwei Lyu
机构: University at Buffalo, State University of New York (布法罗大学，纽约州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are one of the most challenging deepfakes to detect. In these videos, a person’s lip movements are synthesized to match altered or entirely new audio using AI models. Therefore, unlike other types of deepfakes, the artifacts in lip-syncing deepfakes are confined to the mouth region, making them more subtle and, thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that leverages a combination of vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model can successfully capture both short-term and long-term variations in mouth movement, enhancing its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, which was generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at this https URL .
zh

[CV-57] Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes CVPR2025

【速读】：本文旨在探索几何结构与纹理在塑造视觉注意中的交互作用，并提出了一种新的解决方案。为系统性捕捉有纹理和无纹理条件下显著性分布的差异，建立了首个全面的网格显著性数据集。解决方案的关键在于引入了Mesh Mamba模型，这是一种基于状态空间模型（State Space Model, SSM）的统一显著性预测方法。Mesh Mamba能够有效分析网格的几何结构，同时将纹理特征无缝融入拓扑框架中，确保增强外观建模的一致性。更重要的是，通过子图嵌入和双向SSM，该模型实现了局部几何与纹理的全局上下文建模，保留了拓扑结构，提升了对视觉细节和结构复杂性的理解。广泛的理论与实证验证表明，该模型不仅在多种网格类型上提升了性能，还展现了高可扩展性和通用性，特别是在跨视觉特征验证中表现突出。

链接: https://arxiv.org/abs/2504.01466
作者: Kaiwei Zhang,Dandan Zhu,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: to be published in CVPR 2025

点击查看摘要

Abstract:Mesh saliency enhances the adaptability of 3D vision by identifying and emphasizing regions that naturally attract visual attention. To investigate the interaction between geometric structure and texture in shaping visual attention, we establish a comprehensive mesh saliency dataset, which is the first to systematically capture the differences in saliency distribution under both textured and non-textured visual conditions. Furthermore, we introduce mesh Mamba, a unified saliency prediction model based on a state space model (SSM), designed to adapt across various mesh types. Mesh Mamba effectively analyzes the geometric structure of the mesh while seamlessly incorporating texture features into the topological framework, ensuring coherence throughout appearance-enhanced modeling. More importantly, by subgraph embedding and a bidirectional SSM, the model enables global context modeling for both local geometry and texture, preserving the topological structure and improving the understanding of visual details and structural complexity. Through extensive theoretical and empirical validation, our model not only improves performance across various mesh types but also demonstrates high scalability and versatility, particularly through cross validations of various visual features.
zh

[CV-58] Deep LG-Track: An Enhanced Localization-Confidence-Guided Multi-Object Tracker

【速读】：该论文旨在解决多目标跟踪（Multi-object Tracking, MOT）在复杂场景下的准确性与鲁棒性问题。论文提出的Deep LG-Track通过三项关键技术改进实现这一目标：首先，开发了一种自适应卡尔曼滤波器（Adaptive Kalman Filter），动态更新测量噪声协方差以适应检测置信度和轨迹消失情况；其次，设计了一种新的代价矩阵（Cost Matrix），自适应融合运动与外观信息，并利用定位置信度和检测置信度作为权重因子；最后，引入一种动态外观特征更新策略，根据外观清晰度和定位精度调整历史与当前外观特征的相对权重。这些关键方案显著提升了跟踪性能，在MOT17和MOT20数据集上的评估结果表明其在多个性能指标上超越现有先进方法。

链接: https://arxiv.org/abs/2504.01457
作者: Ting Meng,Chunyun Fu,Xiangyan Yan,Zheng Liang,Pan Ji,Jianwen Wang,Tao Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 fugures

点击查看摘要

Abstract:Multi-object tracking plays a crucial role in various applications, such as autonomous driving and security surveillance. This study introduces Deep LG-Track, a novel multi-object tracker that incorporates three key enhancements to improve the tracking accuracy and robustness. First, an adaptive Kalman filter is developed to dynamically update the covariance of measurement noise based on detection confidence and trajectory disappearance. Second, a novel cost matrix is formulated to adaptively fuse motion and appearance information, leveraging localization confidence and detection confidence as weighting factors. Third, a dynamic appearance feature updating strategy is introduced, adjusting the relative weighting of historical and current appearance features based on appearance clarity and localization accuracy. Comprehensive evaluations on the MOT17 and MOT20 datasets demonstrate that the proposed Deep LG-Track consistently outperforms state-of-the-art trackers across multiple performance metrics, highlighting its effectiveness in multi-object tracking tasks.
zh

[CV-59] BiSeg-SAM: Weakly-Supervised Post-Processing Framework for Boosting Binary Segmentation in Segment Anything Models

【速读】：该论文旨在解决结肠镜息肉与皮肤病变分割任务中像素级标注耗时且昂贵的问题，同时克服现有基础视觉模型（如Segment Anything Model, SAM）因缺乏领域特定医学知识而在医疗分割任务中表现不佳的局限。论文提出了一种名为BiSeg-SAM的方法，其关键是通过微调SAM结合CNN模块学习局部特征，并引入WeakBox实现粗略掩码到边界框的转换以匹配粗标签与精确预测之间的差异。此外，通过DetailRefine模块利用少量真实标签优化边界精度，同时采用尺度一致性损失进行预测尺度对齐，从而显著提升了多任务分割性能，在五个息肉数据集和一个皮肤癌数据集上的测试结果优于当前最先进的方法。

链接: https://arxiv.org/abs/2504.01452
作者: Encheng Su,Hu Cao,Alois Knoll
机构: Technische Universität München (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

点击查看摘要

Abstract:Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.
zh

[CV-60] Multimodal Point Cloud Semantic Segmentation With Virtual Point Enhancement

【速读】：该论文旨在解决基于 LiDAR 的 3D 点云语义分割中因点云稀疏性和密度变化导致难以捕捉物体细节的问题，特别是在中距离和小目标场景下。论文的关键解决方案包括两个主要部分：首先，提出了一种基于虚拟点增强（Virtual Point Enhancement, VPE）的多模态点云语义分割方法，通过从图像生成密集但噪声较大的虚拟点来补充原始点云信息；其次，引入了一个空间差异驱动的自适应滤波模块，该模块依据密度和距离选择性地提取有价值的伪点，从而提升中距离目标的点云密度。此外，还设计了一个抗噪稀疏特征编码器，结合抗噪特征提取与细粒度特征增强技术，进一步优化分割性能。实验结果表明，在 nuScenes 数据集上引入 7.7% 的虚拟点后，mIoU 提升了 2.89%。

链接: https://arxiv.org/abs/2504.01449
作者: Zaipeng Duan,Xuzhong Hu,Pei An,Jie Ma
机构: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (华中科技大学); National Key Laboratory of Science and Technology on Multispectral Information Processing, HUST (华中科技大学); School of Electronic Information and Communications, Huazhong University of Science and Technology (华中大学科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR-based 3D point cloud recognition has been proven beneficial in various applications. However, the sparsity and varying density pose a significant challenge in capturing intricate details of objects, particularly for medium-range and small targets. Therefore, we propose a multi-modal point cloud semantic segmentation method based on Virtual Point Enhancement (VPE), which integrates virtual points generated from images to address these issues. These virtual points are dense but noisy, and directly incorporating them can increase computational burden and degrade performance. Therefore, we introduce a spatial difference-driven adaptive filtering module that selectively extracts valuable pseudo points from these virtual points based on density and distance, enhancing the density of medium-range targets. Subsequently, we propose a noise-robust sparse feature encoder that incorporates noise-robust feature extraction and fine-grained feature enhancement. Noise-robust feature extraction exploits the 2D image space to reduce the impact of noisy points, while fine-grained feature enhancement boosts sparse geometric features through inner-voxel neighborhood point aggregation and downsampled voxel aggregation. The results on the SemanticKITTI and nuScenes, two large-scale benchmark data sets, have validated effectiveness, significantly improving 2.89% mIoU with the introduction of 7.7% virtual points on nuScenes.
zh

[CV-61] MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation

【速读】：该论文旨在解决利用现有3D光学相干断层扫描（OCT）图像生成高质量三维光学相干断层扫描血管成像（OCTA）图像的问题。现有的OCTA转换方法直接在连续无限空间中从OCT域到OCTA域学习映射，仅依赖单一视角（即OCTA投影图），导致结果次优。为了解决这一问题，论文提出了一种名为MuTri的多视角三对齐框架，用于离散有限空间中的OCT到OCTA三维图像转换。其关键是首先通过重建3D OCT和3D OCTA数据预训练两个向量量化变分自编码器（VQ-VAE），提供语义先验，然后利用多视角对齐促进另一个VQ-VAE模型的学习。具体而言，提出了受对比学习启发的语义对齐以最大化来自OCT和OCTA视图的预训练模型之间的互信息，从而促进码本学习，并设计了血管结构对齐以最小化与OCTA投影图视图预训练模型之间的结构差异，以充分利用详细的血管结构信息。此外，收集了一个包含846名受试者配对OCT和OCTA体积的大规模数据集OCTA2024。

链接: https://arxiv.org/abs/2504.01428
作者: Zhuangzhuang Chen,Hualiang Wang,Chubin Ou,Xiaomeng Li
机构: Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology (香港科技大学); Department of Radiology, Guangdong Provincial People’s Hospital, Southern Medical University (广东省人民医院，南方医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optical coherence tomography angiography (OCTA) shows its great importance in imaging microvascular networks by providing accurate 3D imaging of blood vessels, but it relies upon specialized sensors and expensive devices. For this reason, previous works show the potential to translate the readily available 3D Optical Coherence Tomography (OCT) images into 3D OCTA images. However, existing OCTA translation methods directly learn the mapping from the OCT domain to the OCTA domain in continuous and infinite space with guidance from only a single view, i.e., the OCTA project map, resulting in suboptimal results. To this end, we propose the multi-view Tri-alignment framework for OCT to OCTA 3D image translation in discrete and finite space, named MuTri. In the first stage, we pre-train two vector-quantized variational auto-encoder (VQ- VAE) by reconstructing 3D OCT and 3D OCTA data, providing semantic prior for subsequent multi-view guidances. In the second stage, our multi-view tri-alignment facilitates another VQVAE model to learn the mapping from the OCT domain to the OCTA domain in discrete and finite space. Specifically, a contrastive-inspired semantic alignment is proposed to maximize the mutual information with the pre-trained models from OCT and OCTA views, to facilitate codebook learning. Meanwhile, a vessel structure alignment is proposed to minimize the structure discrepancy with the pre-trained models from the OCTA project map view, benefiting from learning the detailed vessel structure information. We also collect the first large-scale dataset, namely, OCTA2024, which contains a pair of OCT and OCTA volumes from 846 subjects.
zh

[CV-62] meSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

【速读】：该论文旨在解决大型视频语言模型（LVLMs）在处理长视频时面临的显著挑战，特别是由于大量视频帧导致的计算负担以及因时空下采样引起的视觉幻觉问题，这些问题使得长视频的准确理解变得困难。论文提出的解决方案是TimeSearch框架，其关键在于结合两种类人化的操作：1）Spotlight通过时间增强帧表示（TAFR）高效识别相关的时间事件，并将视觉特征与时间戳显式绑定；2）Reflection利用LVLMs固有的时间自我反思能力评估所识别事件的正确性。TimeSearch通过逐步探索关键事件并基于反射置信度优先搜索时间信息，在多个长视频基准测试中实现了性能提升，验证了其有效性。

链接: https://arxiv.org/abs/2504.01407
作者: Junwen Pan,Rui Zhang,Xin Wan,Yuan Zhang,Ming Lu,Qi She
机构: ByteDance (字节跳动); School of Computer Science, Peking University (北京大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose \textbfTimeSearch, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) \textbfSpotlight efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) \textbfReflection evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses previous state-of-the-art, improving the accuracy from 41.8% to 51.5% on the LVBench. Additionally, experiments on temporal grounding demonstrate that appropriate TAFR is adequate to effectively stimulate the surprising temporal grounding ability of LVLMs in a simpler yet versatile manner, which improves mIoU on Charades-STA by 11.8%. The code will be released.
zh

[CV-63] Leverag ing Generalizability of Image-to-Image Translation for Enhanced Adversarial Defense

【速读】：该论文试图解决机器学习模型在面对对抗性攻击时稳定性与可靠性不足的问题，特别是这些攻击能够通过微小的扰动误导模型做出错误预测。论文的关键在于提出了一种基于图像到图像翻译的改进防御模型，该模型引入残差块以增强其对多种对抗攻击的泛化能力。此方法仅需训练单一模型即可有效抵御多样化的攻击类型，并且具有良好的跨模型迁移性能，同时显著降低了时间与计算开销，实验结果显示其分类准确率可从接近零恢复至平均72%，且性能与当前最先进的方法相当。

链接: https://arxiv.org/abs/2504.01399
作者: Haibo Zhang,Zhihua Yao,Kouichi Sakurai,Takeshi Saitoh
机构: Kyushu Institute of Technology (九州工业技术学院); The University of Kitakyushu (北九州大学); Kyushu University (九州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of artificial intelligence, machine learning emerges as a key technology characterized by its vast potential and inherent risks. The stability and reliability of these models are important, as they are frequent targets of security threats. Adversarial attacks, first rigorously defined by Ian Goodfellow et al. in 2013, highlight a critical vulnerability: they can trick machine learning models into making incorrect predictions by applying nearly invisible perturbations to images. Although many studies have focused on constructing sophisticated defensive mechanisms to mitigate such attacks, they often overlook the substantial time and computational costs of training and maintaining these models. Ideally, a defense method should be able to generalize across various, even unseen, adversarial attacks with minimal overhead. Building on our previous work on image-to-image translation-based defenses, this study introduces an improved model that incorporates residual blocks to enhance generalizability. The proposed method requires training only a single model, effectively defends against diverse attack types, and is well-transferable between different target models. Experiments show that our model can restore the classification accuracy from near zero to an average of 72% while maintaining competitive performance compared to state-of-the-art methods.
zh

[CV-64] All Patches Matter More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning

【速读】：该论文旨在解决AI生成图像（AIGIs）检测中缺乏鲁棒性和泛化能力的问题。随着AIGIs的指数增长，迫切需要有效的检测方法。论文通过系统分析确立了两个关键原则：首先，“所有补丁都很重要”，即每个补丁都包含合成伪影，因此每个补丁都是重要的检测特征来源；其次，“更多补丁更好”，利用更多补丁中的分布式伪影可以提高检测的鲁棒性，并减少对特定补丁的过度依赖。然而，反事实分析揭示了一个不良现象：未经优化的检测器往往表现出“少数补丁偏见”，即仅基于少数补丁来区分真实与合成图像。论文将此现象归因于“懒惰学习者”行为，即检测器倾向于优先学习有限补丁中的显眼伪影，而忽略更广泛的伪影分布。为了解决这一偏见，论文提出了“全景补丁学习（PPL）”框架，其核心包括随机补丁替换（Random Patch Replacement）以迫使模型识别未充分利用区域的伪影，以及补丁级对比学习（Patch-wise Contrastive Learning）以确保所有补丁的一致判别能力，从而实现对所有补丁的均匀利用。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.01396
作者: Zheng Yang,Ruoxin Chen,Zhiyuan Yan,Ke-Yue Zhang,Xinghe Fu,Shuang Wu,Xiujun Shu,Taiping Yao,Junchi Yan,Shouhong Ding,Xi Li
机构: Zhejiang University (浙江大学); Youtu Lab, Tencent (腾讯优图实验室); Peking University (北京大学); WeChat Pay, Tencent (微信支付, 腾讯); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis: \textbf(1) All Patches Matter: Unlike conventional image classification where discriminative features concentrate on object-centric regions, each patch in AIGIs inherently contains synthetic artifacts due to the uniform generation process, suggesting that every patch serves as an important artifact source for detection. \textbf(2) More Patches Better: Leveraging distributed artifacts across more patches improves detection robustness by capturing complementary forensic evidence and reducing over-reliance on specific patches, thereby enhancing robustness and generalization. However, our counterfactual analysis reveals an undesirable phenomenon: naively trained detectors often exhibit a \textbfFew-Patch Bias, discriminating between real and synthetic images based on minority patches. We identify \textbfLazy Learner as the root cause: detectors preferentially learn conspicuous artifacts in limited patches while neglecting broader artifact distributions. To address this bias, we propose the \textbfPanoptic \textbfPatch \textbfLearning (PPL) framework, involving: (1) Random Patch Replacement that randomly substitutes synthetic patches with real counterparts to compel models to identify artifacts in underutilized regions, encouraging the broader use of more patches; (2) Patch-wise Contrastive Learning that enforces consistent discriminative capability across all patches, ensuring uniform utilization of all patches. Extensive experiments across two different settings on several benchmarks verify the effectiveness of our approach.
zh

[CV-65] DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

【速读】：该论文旨在解决现有基于CLIP（Contrastive Language-Image Pre-training）方法在特定领域数据（如生物学）应用中存在的两个主要问题：一是未能充分利用领域特定数据的特性（如生物数据的细粒度特征），导致模型能力受限；二是过度调优后往往丢失CLIP在通用域中的原始能力。为应对这些挑战，论文提出了一种名为分布对齐的语言-图像预训练方法（Distribution Alignment-based Language-Image Pre-Training, DALIP）。其关键在于通过匹配图像-文本对特征分布而非传统的[cls]令牌来优化CLIP模型，从而提取更丰富且有效的表征信息以更好地处理生物数据的细粒度特性。此外，DALIP利用一阶和二阶统计量高效近似特征分布，并引入多头布朗距离协方差（Multi-head Brownian Distance Covariance, MBDC）模块以有效获取令牌特征的二阶统计量。实验结果表明，该方法在生物领域显著优于现有CLIP相关工作，同时在遥感和医学影像等领域表现出良好的泛化性能。此外，构建的新植物领域数据集PlantMix-13M进一步提升了DALIP在植物领域的性能，同时保持了其在通用域的能力。

链接: https://arxiv.org/abs/2504.01386
作者: Junjie Wu,Jiangtao Xie,Zhaolin Zhang,Qilong Wang,Qinghua Hu,Peihua Li,Sen Xu
机构: Tianjin University (天津大学); Dalian University of Technology (大连理工大学); Yancheng Institute of Technology (盐城工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance in domain-specific data (e.g., biology), and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm takes no full consideration of the characteristics lying in domain-specific data (e.g., fine-grained nature of biological data) and so limits model capability, while mostly losing the original ability of CLIP in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distribution of image-text pairs instead of the original [cls] token, which can capture rich yet effective information inherent in image-text pairs as powerful representations, and so better cope with fine-grained nature of biological data. Particularly, our DALIP efficiently approximates feature distribution via its first- and second-order statistics, while presenting a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for plant domain (e.g., specific data in biological domain) comprising 10M plant data with 3M general-domain data (namely PlantMix-13M) according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in biological domain, while well generalizing to remote sensing and medical imaging domains. Besides, our PlantMix-13M dataset further boosts performance of DALIP in plant domain, while preserving model ability in general domain.
zh

[CV-66] v-CLR: View-Consistent Learning for Open-World Instance Segmentation CVPR2025

【速读】：该论文旨在解决开放世界实例分割（Open-World Instance Segmentation）中的挑战，现有方法在视觉网络中隐式偏向于学习外观信息（如纹理）以识别物体，导致模型在开放世界场景下无法检测具有未见过纹理的新颖物体。为了解决这一问题，论文提出了一种名为视图一致性学习（view-Consistent LeaRning, v-CLR）的框架。其关键是通过引入图像的额外视图，在保持图像底层结构的同时显著改变纹理，强制模型学习与外观无关的表示，从而实现鲁棒的实例分割。具体而言，v-CLR通过跨视图对象特征的一致性约束来减少外观依赖性，并利用无监督的类无关对象提议模型进行跨视图对象特征匹配，增强物体感知能力。

链接: https://arxiv.org/abs/2504.01383
作者: Chang-Bin Zhang,Jinhong Ni,Yujie Zhong,Kai Han
机构: Visual AI Lab, The University of Hong Kong (香港大学视觉人工智能实验室); Meituan Inc. (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025, Project page: this https URL , Code: this https URL

点击查看摘要

Abstract:In this paper, we address the challenging problem of open-world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, \eg texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open-world setting. To address this challenge, we propose a learning framework, called view-Consistent LeaRning (v-CLR), which aims to enforce the model to learn appearance-invariant representations for robust instance segmentation. In v-CLR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image’s underlying structure. We then encourage the model to learn the appearance-invariant representation by enforcing the consistency between object features across different views, for which we obtain class-agnostic object proposals using off-the-shelf unsupervised models that possess strong object-awareness. These proposals enable cross-view object feature matching, greatly reducing the appearance dependency while enhancing the object-awareness. We thoroughly evaluate our method on public benchmarks under both cross-class and cross-dataset settings, achieving state-of-the-art performance. Project page: this https URL
zh

[CV-67] 3D Gaussian Inverse Rendering with Approximated Global Illumination

【速读】：该论文旨在解决现有3D Gaussian Splatting方法在物理基础渲染和场景编辑中的局限性问题，这些方法通常将光照烘焙到表示中。论文提出了一种通过屏幕空间光线追踪实现高效全局光照的新方法。其关键是利用观察到的大部分间接光可以追溯到当前视锥体内的可见表面这一特性，通过将蒙特卡洛屏幕空间光线追踪与3D Gaussian直接着色相结合，捕获一次反弹的间接光照，从而实现实时渲染和编辑的同时保持计算效率和可编辑性。

链接: https://arxiv.org/abs/2504.01358
作者: Zirui Wu,Jianteng Chen,Laijian Li,Shaoteng Wu,Zhikai Zhu,Kang Xu,Martin R. Oswald,Jie Song
机构: HKUST(GZ)(香港科技大学(广州)); NIO(蔚来); University of Amsterdam(阿姆斯特丹大学); HKUST(香港科技大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting shows great potential in reconstructing photo-realistic 3D scenes. However, these methods typically bake illumination into their representations, limiting their use for physically-based rendering and scene editing. Although recent inverse rendering approaches aim to decompose scenes into material and lighting components, they often rely on simplifying assumptions that fail when editing. We present a novel approach that enables efficient global illumination for 3D Gaussians Splatting through screen-space ray tracing. Our key insight is that a substantial amount of indirect light can be traced back to surfaces visible within the current view frustum. Leveraging this observation, we augment the direct shading computed by 3D Gaussians with Monte-Carlo screen-space ray-tracing to capture one-bounce indirect illumination. In this way, our method enables realistic global illumination without sacrificing the computational efficiency and editability benefits of 3D Gaussians. Through experiments, we show that the screen-space approximation we utilize allows for indirect illumination and supports real-time rendering and editing. Code, data, and models will be made available at our project page: this https URL.
zh

[CV-68] Prompt-Guided Attention Head Selection for Focus-Oriented Image Retrieval CVPR2025

【速读】：该论文旨在解决面向特定对象的图像检索（Focus-Oriented Image Retrieval, FOIR）任务中，预训练视觉Transformer（Vision Transformer, ViT）模型在处理多目标复杂场景时性能不足的问题。标准图像编码器通过单一全局特征向量表征图像，难以满足用户针对特定对象进行精确检索的需求。为克服这一局限，论文提出了一种名为Prompt-guided Attention Head Selection（PHS）的关键解决方案。PHS以可提示的方式利用ViT中多头注意力机制的潜力，通过匹配注意力图与用户的视觉提示（如点、框或分割掩码），选择特定的注意力头，从而引导模型聚焦于感兴趣的特定对象，同时保留周围视觉上下文信息。此方法无需模型重新训练且不改变图像本身，实验结果表明PHS显著提升了多个数据集上的性能，为FOIR任务提供了一种实用且无需额外训练的增强方案。

链接: https://arxiv.org/abs/2504.01348
作者: Yuji Nozawa,Yu-Chieh Lin,Kazumoto Nakamura,Youyang Ng
机构: Kioxia Corporation (Kioxia公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted to CVPR 2025 PixFoundation Workshop

点击查看摘要

Abstract:The goal of this paper is to enhance pretrained Vision Transformer (ViT) models for focus-oriented image retrieval with visual prompting. In real-world image retrieval scenarios, both query and database images often exhibit complexity, with multiple objects and intricate backgrounds. Users often want to retrieve images with specific object, which we define as the Focus-Oriented Image Retrieval (FOIR) task. While a standard image encoder can be employed to extract image features for similarity matching, it may not perform optimally in the multi-object-based FOIR task. This is because each image is represented by a single global feature vector. To overcome this, a prompt-based image retrieval solution is required. We propose an approach called Prompt-guided attention Head Selection (PHS) to leverage the head-wise potential of the multi-head attention mechanism in ViT in a promptable manner. PHS selects specific attention heads by matching their attention maps with user’s visual prompts, such as a point, box, or segmentation. This empowers the model to focus on specific object of interest while preserving the surrounding visual context. Notably, PHS does not necessitate model re-training and avoids any image alteration. Experimental results show that PHS substantially improves performance on multiple datasets, offering a practical and training-free solution to enhance model performance in the FOIR task.
zh

[CV-69] Slow-Fast Architecture for Video Multi-Modal Large Language Models

【速读】：该论文旨在解决在有限计算资源下视频驱动的多模态大型语言模型（Multi-modal Large Language Models, MLLMs）中时间分辨率与空间细节难以兼顾的问题。现有方法通常通过预定义规则压缩视频表示后再输入到语言模型中，这会导致不可逆的信息丢失，并且往往忽视输入指令。为了解决这一挑战，论文提出了一种新颖的慢-快架构（Slow-Fast Architecture），其关键在于通过双令牌策略自然规避了这一权衡问题：1）“快速”视觉令牌（紧凑的压缩视频特征集）与文本嵌入一同输入到语言模型中，提供快速概览；2）“慢速”视觉令牌（未压缩的视频特征）通过专门设计的混合解码器层与文本嵌入进行跨注意力操作，以线性复杂度实现与指令相关的视觉细节提取。实验表明，该模型显著优于仅使用自注意力的基线模型，在计算增加仅3%的情况下将输入帧数从16扩展到128，并在五个视频理解基准测试中平均提升了16%的性能。此外，该架构具有模块化设计，可轻松集成到其他视频MLLM中以提高效率和可扩展性。

链接: https://arxiv.org/abs/2504.01328
作者: Min Shi,Shihao Wang,Chieh-Yun Chen,Jitesh Jain,Kai Wang,Junjun Xiong,Guilin Liu,Zhiding Yu,Humphrey Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report

点击查看摘要

Abstract:Balancing temporal resolution and spatial detail under limited compute budget remains a key challenge for video-based multi-modal large language models (MLLMs). Existing methods typically compress video representations using predefined rules before feeding them into the LLM, resulting in irreversible information loss and often ignoring input instructions. To address this, we propose a novel slow-fast architecture that naturally circumvents this trade-off, enabling the use of more input frames while preserving spatial details. Inspired by how humans first skim a video before focusing on relevant parts, our slow-fast design employs a dual-token strategy: 1) “fast” visual tokens – a compact set of compressed video features – are fed into the LLM alongside text embeddings to provide a quick overview; 2) “slow” visual tokens – uncompressed video features – are cross-attended by text embeddings through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual details with linear complexity. We conduct systematic exploration to optimize both the overall architecture and key components. Experiments show that our model significantly outperforms self-attention-only baselines, extending the input capacity from 16 to 128 frames with just a 3% increase in computation, and achieving a 16% average performance improvement across five video understanding benchmarks. Our 7B model achieves state-of-the-art performance among models of similar size. Furthermore, our slow-fast architecture is a plug-and-play design that can be integrated into other video MLLMs to improve efficiency and scalability.
zh

[CV-70] CFMD: Dynamic Cross-layer Feature Fusion for Salient Object Detection

【速读】：该论文针对传统跨层特征金字塔网络（Cross-layer Feature Pyramid Networks, CFPNs）在显著物体检测任务中存在的两个核心问题：(1) 复杂特征加权操作导致的计算瓶颈，以及(2) 上采样过程中特征模糊引起的边界精度下降，提出了新的解决方案。论文的关键创新点包括设计了一个上下文感知特征聚合模块（Context-Aware Feature Aggregation Module, CFLMA），通过引入最先进的Mamba架构构建动态权重分配机制，从而自适应调整特征重要性，显著提升表示效率和泛化能力；同时提出了一种自适应动态上采样单元（Adaptive Dynamic Upsampling Unit, CFLMD），通过动态调整上采样范围并结合双线性初始化策略，在分辨率恢复过程中有效减少特征重叠并保持细粒度的边界结构。这些改进显著提升了像素级准确性及边界分割质量，特别是在复杂场景下表现突出。

链接: https://arxiv.org/abs/2504.01326
作者: Jin Lian,Zhongyu Wan,Ming Gao,JunFeng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-layer feature pyramid networks (CFPNs) have achieved notable progress in multi-scale feature fusion and boundary detail preservation for salient object detection. However, traditional CFPNs still suffer from two core limitations: (1) a computational bottleneck caused by complex feature weighting operations, and (2) degraded boundary accuracy due to feature blurring in the upsampling process. To address these challenges, we propose CFMD, a novel cross-layer feature pyramid network that introduces two key innovations. First, we design a context-aware feature aggregation module (CFLMA), which incorporates the state-of-the-art Mamba architecture to construct a dynamic weight distribution mechanism. This module adaptively adjusts feature importance based on image context, significantly improving both representation efficiency and generalization. Second, we introduce an adaptive dynamic upsampling unit (CFLMD) that preserves spatial details during resolution recovery. By adjusting the upsampling range dynamically and initializing with a bilinear strategy, the module effectively reduces feature overlap and maintains fine-grained boundary structures. Extensive experiments on three standard benchmarks using three mainstream backbone networks demonstrate that CFMD achieves substantial improvements in pixel-level accuracy and boundary segmentation quality, especially in complex scenes. The results validate the effectiveness of CFMD in jointly enhancing computational efficiency and segmentation performance, highlighting its strong potential in salient object detection tasks.
zh

[CV-71] COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking

【速读】：该论文旨在解决视觉-语言（Vision-Language, VL）跟踪算法中多阶段多模态融合机制依赖复杂设计以及直接多模态融合忽略模态间特征空间分布差异的问题。论文的关键解决方案是提出COST（Contrastive One-Stage Transformer Fusion Framework），一种对比一致性的一阶段Transformer融合框架。其核心在于引入了一种对比对齐策略，通过最大化视频与其对应语言描述之间的互信息（Mutual Information, MI），实现有效的跨模态对齐，从而在表征空间中获得语义一致的特征。同时，利用视觉-语言Transformer构建高效的多模态融合与推理机制，验证了简单堆叠Transformer编码器即可有效生成统一的VL表示。此外，论文还贡献了一个新的小目标VL跟踪基准数据集VL-SOT500，包含两个挑战性子集VL-SOT230和VL-SOT270，用于评估通用和高速小目标跟踪性能，并首次探索了语言线索增强小目标视觉表征的应用。实验结果表明，COST在现有五个VL跟踪数据集及新提出的VL-SOT500数据集上均达到当前最优性能。

链接: https://arxiv.org/abs/2504.01321
作者: Chunhui Zhang,Li Liu,Jialin Gao,Xin Sun,Hao Wen,Xi Zhou,Shiming Ge,Yanfeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint submitted to Elsevier. this https URL

点击查看摘要

Abstract:Transformer has recently demonstrated great potential in improving vision-language (VL) tracking algorithms. However, most of the existing VL trackers rely on carefully designed mechanisms to perform the multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking, aiming to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, empirically demonstrating that a simple stack of transformer encoders effectively enables unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark dataset for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. Our dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging due to weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the usage of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. Source codes and dataset will be made publicly available.
zh

[CV-72] Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

【速读】：该论文旨在解决视觉-语言模型（Vision-Language Models, VLMs）在处理噪声增强图像输入时存在的安全漏洞问题。尽管现有VLMs在训练过程中已采取一定的安全措施以缓解攻击威胁，但针对噪声增强输入的特定漏洞仍未被充分关注。论文指出，缺乏噪声增强训练会导致严重的安全缺口，许多VLMs对简单的扰动（如高斯噪声）也极为脆弱。

为应对这一挑战，论文提出了两个关键解决方案：首先，提出Robust-VLGuard，这是一个包含对齐/不对齐图像-文本对的多模态安全数据集，并结合噪声增强微调，以降低攻击成功率同时保持VLM的功能性；其次，针对更强大的基于优化的视觉扰动攻击，提出DiffPure-VLM，利用扩散模型将对抗性扰动转换为类似高斯噪声，这种噪声可通过噪声增强的安全微调进行防御。实验结果表明，扩散模型的分布偏移特性与微调后的VLM高度兼容，显著减轻了不同强度的对抗性扰动。

链接: https://arxiv.org/abs/2504.01308
作者: Jiawei Wang,Yushen Zuo,Yuanjun Chai,Zhendong Liu,Yichen Fu,Yichun Feng,Kin-man Lam
机构: University of Science and Technology of China (中国科学技术大学); The Hong Kong Polytechnic University (香港理工大学); University of Washington (华盛顿大学); Nanjing University (南京大学); Stanford University (斯坦福大学); University of the Chinese Academy of Sciences (中国科学院大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving functionality of VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code are available at this https URL.
zh

[CV-73] Direction-Aware Hybrid Representation Learning for 3D Hand Pose and Shape Estimation CVPR2025

【速读】：该论文旨在解决基于模型的3D手部姿态和形状估计方法在弱监督下直接从图像回归参数化模型参数时面临的挑战，即优化过程中存在大量局部极小值，导致训练困难的问题。为了解决这一难题，论文提出了一种方向感知混合特征（DaHyF）的学习方法，通过融合隐式的图像特征与显式的二维关节坐标特征，并结合相机坐标系中的像素方向信息，以实现姿态、形状及相机视角的估计。关键在于DaHyF表示的学习及其在减少运动捕捉过程中的抖动方面的应用，后者依赖于对比学习中的预测置信度。

链接: https://arxiv.org/abs/2504.01298
作者: Shiyong Liu,Zhihao Li,Xiao Tang,Jianzhuang Liu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Shenzhen Institute of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025 workshop

点击查看摘要

Abstract:Most model-based 3D hand pose and shape estimation methods directly regress the parametric model parameters from an image to obtain 3D joints under weak supervision. However, these methods involve solving a complex optimization problem with many local minima, making training difficult. To address this challenge, we propose learning direction-aware hybrid features (DaHyF) that fuse implicit image features and explicit 2D joint coordinate features. This fusion is enhanced by the pixel direction information in the camera coordinate system to estimate pose, shape, and camera viewpoint. Our method directly predicts 3D hand poses with DaHyF representation and reduces jittering during motion capture using prediction confidence based on contrastive learning. We evaluate our method on the FreiHAND dataset and show that it outperforms existing state-of-the-art methods by more than 33% in accuracy. DaHyF also achieves the top ranking on both the HO3Dv2 and HO3Dv3 leaderboards for the metric of Mean Joint Error (after scale and translation alignment). Compared to the second-best results, the largest improvement observed is 10%. We also demonstrate its effectiveness in real-time motion capture scenarios with hand position variability, occlusion, and motion blur.
zh

[CV-74] ForestVO: Enhancing Visual Odometry in Forest Environments through ForestGlue

【速读】：该论文旨在解决复杂森林环境中视觉里程计系统面临的挑战，这些问题包括密集树叶遮挡、可变光照条件以及重复纹理导致的关键点特征匹配精度下降。为应对这些挑战，论文提出了ForestGlue，通过优化SuperPoint特征检测器以适配灰度图、RGB图像、RGB-D数据及立体视觉四种配置，同时结合LightGlue或重新训练的SuperGlue进行特征匹配。关键在于利用合成森林数据对匹配算法进行再训练，并大幅减少所需关键点数量至基线模型的25%，从而显著降低计算开销，提升动态森林环境下的性能表现。此外，通过将ForestGlue与基于Transformer的位姿估计模型相结合，构建了ForestVO系统，在TartanAir数据集的森林序列测试中实现了优于直接法（如DSO）40%的相对位姿误差（RPE），并在仅使用10%训练数据的情况下保持竞争力，展示了其在资源受限平台上的实时部署潜力。

链接: https://arxiv.org/abs/2504.01261
作者: Thomas Pritchard,Saifullah Ijaz,Ronald Clark,Basaran Bahadir Kocer
机构: Computing Department, Imperial College London (帝国理工学院计算系); Department of Computer Science, University of Oxford (牛津大学计算机科学系); School of Civil, Aerospace and Design Engineering, University of Bristol (布里斯托尔大学土木、航天与设计工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Recent advancements in visual odometry systems have improved autonomous navigation; however, challenges persist in complex environments like forests, where dense foliage, variable lighting, and repetitive textures compromise feature correspondence accuracy. To address these challenges, we introduce ForestGlue, enhancing the SuperPoint feature detector through four configurations - grayscale, RGB, RGB-D, and stereo-vision - optimised for various sensing modalities. For feature matching, we employ LightGlue or SuperGlue, retrained with synthetic forest data. ForestGlue achieves comparable pose estimation accuracy to baseline models but requires only 512 keypoints - just 25% of the baseline’s 2048 - to reach an LO-RANSAC AUC score of 0.745 at a 10° threshold. With only a quarter of keypoints needed, ForestGlue significantly reduces computational overhead, demonstrating effectiveness in dynamic forest environments, and making it suitable for real-time deployment on resource-constrained platforms. By combining ForestGlue with a transformer-based pose estimation model, we propose ForestVO, which estimates relative camera poses using matched 2D pixel coordinates between frames. On challenging TartanAir forest sequences, ForestVO achieves an average relative pose error (RPE) of 1.09 m and a kitti_score of 2.33%, outperforming direct-based methods like DSO by 40% in dynamic scenes. Despite using only 10% of the dataset for training, ForestVO maintains competitive performance with TartanVO while being a significantly lighter model. This work establishes an end-to-end deep learning pipeline specifically tailored for visual odometry in forested environments, leveraging forest-specific training data to optimise feature correspondence and pose estimation, thereby enhancing the accuracy and robustness of autonomous navigation systems.
zh

[CV-75] FUSION: Frequency-guided Underwater Spatial Image recOnstructioN

【速读】：该论文针对水下图像因波长相关的吸收和散射导致的颜色失真、可视性降低及结构细节丢失等问题，试图解决现有增强方法主要关注空间域处理而忽视频率域能够捕捉全局颜色分布和长程依赖关系的局限。论文的关键解决方案是提出FUSION，这是一种联合利用空间域和频率域信息的双域深度学习框架。在空间域内，FUSION通过多尺度卷积核和自适应注意力机制独立处理每个RGB通道；同时，在频率域通过基于FFT的频率注意力提取全局结构信息。此外，引入频率引导融合模块整合两域互补特征，并结合通道间融合与自适应通道重校准以确保色彩分布平衡。这一方案不仅实现了最先进的性能，还在参数量和计算复杂度上具有显著优势。

链接: https://arxiv.org/abs/2504.01243
作者: Jaskaran Singh Walia,Shravan Venkatraman,Pavithra LK
机构: Vellore Institute of Technology (维洛尔理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Underwater images suffer from severe degradations, including color distortions, reduced visibility, and loss of structural details due to wavelength-dependent attenuation and scattering. Existing enhancement methods primarily focus on spatial-domain processing, neglecting the frequency domain’s potential to capture global color distributions and long-range dependencies. To address these limitations, we propose FUSION, a dual-domain deep learning framework that jointly leverages spatial and frequency domain information. FUSION independently processes each RGB channel through multi-scale convolutional kernels and adaptive attention mechanisms in the spatial domain, while simultaneously extracting global structural information via FFT-based frequency attention. A Frequency Guided Fusion module integrates complementary features from both domains, followed by inter-channel fusion and adaptive channel recalibration to ensure balanced color distributions. Extensive experiments on benchmark datasets (UIEB, EUVP, SUIM-E) demonstrate that FUSION achieves state-of-the-art performance, consistently outperforming existing methods in reconstruction fidelity (highest PSNR of 23.717 dB and SSIM of 0.883 on UIEB), perceptual quality (lowest LPIPS of 0.112 on UIEB), and visual enhancement metrics (best UIQM of 3.414 on UIEB), while requiring significantly fewer parameters (0.28M) and lower computational complexity, demonstrating its suitability for real-time underwater imaging applications.
zh

[CV-76] nAd: A Tensor-based Low-rank Black Box Adversarial Attack for Video Classification

【速读】：该论文旨在解决深度学习模型在计算机视觉领域虽取得显著成就，但在黑盒设置下（即模型细节未知时）易受对抗攻击且现有方法效率低下的问题。传统方法通常将视频数据简单视为向量，忽略了其多维结构特性，并需要大量查询操作，导致效率低下且容易被检测。论文提出了一种名为\textbfTenAd的新颖基于张量的低秩对抗攻击方法，通过将视频表示为四阶张量来利用视频数据的多维属性。该方法的关键在于采用低秩攻击策略，大幅减少了搜索空间和所需的查询次数，从而有效生成不可察觉的对抗扰动，同时提高了黑盒设置下的攻击成功率与查询效率。实验结果表明，与现有最先进的方法相比，\textbfTenAd在攻击成功率、查询效率及扰动不可察觉性方面均表现出色，展示了基于张量的方法在视频模型对抗攻击中的潜力。

链接: https://arxiv.org/abs/2504.01228
作者: Kimia haghjooei,Mansoor Rezghi
机构: Tarbiat Modares University (塔比阿特莫达勒斯大学), Tehran, Iran (伊朗)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning models have achieved remarkable success in computer vision but remain vulnerable to adversarial attacks, particularly in black-box settings where model details are unknown. Existing adversarial attack methods(even those works with key frames) often treat video data as simple vectors, ignoring their inherent multi-dimensional structure, and require a large number of queries, making them inefficient and detectable. In this paper, we propose \textbfTenAd, a novel tensor-based low-rank adversarial attack that leverages the multi-dimensional properties of video data by representing videos as fourth-order tensors. By exploiting low-rank attack, our method significantly reduces the search space and the number of queries needed to generate adversarial examples in black-box settings. Experimental results on standard video classification datasets demonstrate that \textbfTenAd effectively generates imperceptible adversarial perturbations while achieving higher attack success rates and query efficiency compared to state-of-the-art methods. Our approach outperforms existing black-box adversarial attacks in terms of success rate, query efficiency, and perturbation imperceptibility, highlighting the potential of tensor-based methods for adversarial attacks on video models.
zh

[CV-77] rPPG-SysDiaGAN: Systolic-Diastolic Feature Localization in rPPG Using Generative Adversarial Network with Multi-Domain Discriminator

【速读】：该论文旨在解决现有远程光电容积脉搏波描记法（remote Photoplethysmography, rPPG）方法在重建PPG信号时存在的局限性，特别是无法准确区分收缩期和舒张期成分的问题。目前多数方法仅专注于提取心率，而未能充分表征完整的PPG信号。为克服这一限制，论文提出了一种基于生成对抗网络（Generative Adversarial Networks, GAN）的新架构，并引入多判别器来从面部视频中提取rPPG信号。关键在于设计了三种关注不同特性的判别器：时间域、频率域以及原始时间域信号的二阶导数，并结合四种损失函数（方差损失、动态时间规整损失、稀疏性损失及方差损失），以提高信号重构的准确性与鲁棒性。

链接: https://arxiv.org/abs/2504.01220
作者: Banafsheh Adami,Nima Karimian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote photoplethysmography (rPPG) offers a novel approach to noninvasive monitoring of vital signs, such as respiratory rate, utilizing a camera. Although several supervised and self-supervised methods have been proposed, they often fail to accurately reconstruct the PPG signal, particularly in distinguishing between systolic and diastolic components. Their primary focus tends to be solely on extracting heart rate, which may not accurately represent the complete PPG signal. To address this limitation, this paper proposes a novel deep learning architecture using Generative Adversarial Networks by introducing multi-discriminators to extract rPPG signals from facial videos. These discriminators focus on the time domain, the frequency domain, and the second derivative of the original time domain signal. The discriminator integrates four loss functions: variance loss to mitigate local minima caused by noise; dynamic time warping loss to address local minima induced by alignment and sequences of variable lengths; Sparsity Loss for heart rate adjustment, and Variance Loss to ensure a uniform distribution across the desired frequency domain and time interval between systolic and diastolic phases of the PPG signal.
zh

[CV-78] Prompting Forgetting: Unlearning in GANs via Textual Guidance

【速读】：该论文旨在解决生成式对抗网络（GANs）在内容移除（Content Removal Techniques, CRTs）方面的研究空白。现有工作主要集中在扩散模型（diffusion models）上的机器遗忘（Machine Unlearning），而GANs中的遗忘机制尚未得到充分探索。论文提出了一种名为Text-to-Unlearn的新框架，通过仅使用文本提示，实现对预训练GANs中特定概念的选择性遗忘，包括特征移除、身份移除以及人脸图像的细粒度属性（如表情和多属性）移除。其关键在于利用自然语言描述引导遗忘过程，无需额外数据集或监督微调，从而提供了一种可扩展且高效的解决方案。此外，论文还引入了一种基于最先进的图像-文本对齐指标的自动评估方法，以全面分析所提出的遗忘方法的有效性。Text-to-Unlearn是首个针对GANs的跨模态遗忘框架，代表了管理生成模型行为的一种灵活且高效的进步。

链接: https://arxiv.org/abs/2504.01218
作者: Piyush Nagasubramaniam(1),Neeraj Karamchandani(1),Chen Wu(2),Sencun Zhu(1) ((1) The Pennsylvania State University, (2) Meta)
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Meta (Meta)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art generative models exhibit powerful image-generation capabilities, introducing various ethical and legal challenges to service providers hosting these models. Consequently, Content Removal Techniques (CRTs) have emerged as a growing area of research to control outputs without full-scale retraining. Recent work has explored the use of Machine Unlearning in generative models to address content removal. However, the focus of such research has been on diffusion models, and unlearning in Generative Adversarial Networks (GANs) has remained largely unexplored. We address this gap by proposing Text-to-Unlearn, a novel framework that selectively unlearns concepts from pre-trained GANs using only text prompts, enabling feature unlearning, identity unlearning, and fine-grained tasks like expression and multi-attribute removal in models trained on human faces. Leveraging natural language descriptions, our approach guides the unlearning process without requiring additional datasets or supervised fine-tuning, offering a scalable and efficient solution. To evaluate its effectiveness, we introduce an automatic unlearning assessment method adapted from state-of-the-art image-text alignment metrics, providing a comprehensive analysis of the unlearning methodology. To our knowledge, Text-to-Unlearn is the first cross-modal unlearning framework for GANs, representing a flexible and efficient advancement in managing generative model behavior.
zh

[CV-79] PolygoNet: Leverag ing Simplified Polygonal Representation for Effective Image Classification

【速读】：该论文旨在解决深度学习模型在图像相关任务中面临的计算复杂度高和过拟合问题。为应对这些挑战，论文提出了一种利用多边形表示（polygonal representations）的高效方法，通过将输入图像转换为基于显著点（dominant points）或轮廓坐标（contour coordinates）的紧凑形式，显著降低了计算需求，加速了训练过程，并有效减少了资源消耗，使其适用于实时及资源受限的应用场景。方案的关键在于这种多边形表示能够自然地捕获图像的核心特征同时滤除噪声，从而提供一种内在的正则化效果以缓解过拟合现象。实验结果表明，由此产生的轻量化模型在性能上与使用全分辨率图像的方法相当，同时支持边缘设备部署，验证了该方法在降低复杂性、提升泛化能力以及促进边缘计算应用方面的有效性。

链接: https://arxiv.org/abs/2504.01214
作者: Salim Khazem,Jeremy Fix,Cédric Pradalier
机构: Talan (塔兰); LORIA, CNRS, CentraleSupélec, Université de Paris-Saclay (洛里亚, 法国国家科学研究中心, 中央理工学院, 巴黎-萨克雷大学); GeorgiaTech-CNRS, GeorgiaTech Europe (乔治亚理工-法国国家科学研究中心, 乔治亚理工欧洲)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models have achieved significant success in various image related tasks. However, they often encounter challenges related to computational complexity and overfitting. In this paper, we propose an efficient approach that leverages polygonal representations of images using dominant points or contour coordinates. By transforming input images into these compact forms, our method significantly reduces computational requirements, accelerates training, and conserves resources making it suitable for real time and resource constrained applications. These representations inherently capture essential image features while filtering noise, providing a natural regularization effect that mitigates overfitting. The resulting lightweight models achieve performance comparable to state of the art methods using full resolution images while enabling deployment on edge devices. Extensive experiments on benchmark datasets validate the effectiveness of our approach in reducing complexity, improving generalization, and facilitating edge computing applications. This work demonstrates the potential of polygonal representations in advancing efficient and scalable deep learning solutions for real world scenarios. The code for the experiments of the paper is provided in this https URL.
zh

[CV-80] GRU-AUNet: A Domain Adaptation Framework for Contactless Fingerprint Presentation Attack Detection

【速读】：该论文旨在解决接触式指纹识别中反欺骗（anti-spoofing）技术在域适应学习方法上的局限性，这些局限性包括较低的泛化能力和扩展性。为应对这些问题，论文提出了一种名为GRU-AUNet的新方法，其关键在于结合了基于Swin Transformer的UNet架构、增强注意力机制的门控循环单元（GRU）、瓶颈部分的动态滤波网络以及融合焦点损失与对比损失的综合损失函数。通过在真实和欺骗指纹图像上的训练，GRU-AUNet展示了对呈现攻击的强大鲁棒性，在CLARKSON、COLFISPOOF和IIITD数据集上的平均BPCER为0.09%，APCER为1.2%，优于现有的域适应方法。

链接: https://arxiv.org/abs/2504.01213
作者: Banafsheh Adami,Nima Karimian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although contactless fingerprints offer user comfort, they are more vulnerable to spoofing. The current solution for anti-spoofing in the area of contactless fingerprints relies on domain adaptation learning, limiting their generalization and scalability. To address these limitations, we introduce GRU-AUNet, a domain adaptation approach that integrates a Swin Transformer-based UNet architecture with GRU-enhanced attention mechanisms, a Dynamic Filter Network in the bottleneck, and a combined Focal and Contrastive Loss function. Trained in both genuine and spoof fingerprint images, GRU-AUNet demonstrates robust resilience against presentation attacks, achieving an average BPCER of 0.09% and APCER of 1.2% in the CLARKSON, COLFISPOOF, and IIITD datasets, outperforming state-of-the-art domain adaptation methods.
zh

[CV-81] Articulated Kinematics Distillation from Video Diffusion Models

【速读】：该论文旨在解决通过文本生成高保真四维（3D+时间）角色动画的问题，特别是克服现有基于4D神经变形场方法在保持形状一致性方面的挑战。论文提出的关键解决方案是Articulated Kinematics Distillation (AKD) 框架，它结合了基于骨架的动画和现代生成模型的优势。AKD通过关节级控制显著降低了自由度（Degrees of Freedom, DoFs），实现了高效且一致的运动合成。其核心创新在于利用预训练视频扩散模型的Score Distillation Sampling (SDS) 技术，从复杂、多关节运动中蒸馏出高质量的运动序列，同时保持结构完整性。此外，AKD与基于物理的模拟自然兼容，确保了物理上的合理性。实验结果表明，AKD在3D一致性及运动质量方面优于现有的文本到4D生成方法。

链接: https://arxiv.org/abs/2504.01204
作者: Xuan Li,Qianli Ma,Tsung-Yi Lin,Yongxin Chen,Chenfanfu Jiang,Ming-Yu Liu,Donglai Xiang
机构: UCLA (加州大学洛杉矶分校); NVIDIA (英伟达)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Articulated Kinematics Distillation (AKD), a framework for generating high-fidelity character animations by merging the strengths of skeleton-based animation and modern generative models. AKD uses a skeleton-based representation for rigged 3D assets, drastically reducing the Degrees of Freedom (DoFs) by focusing on joint-level control, which allows for efficient, consistent motion synthesis. Through Score Distillation Sampling (SDS) with pre-trained video diffusion models, AKD distills complex, articulated motions while maintaining structural integrity, overcoming challenges faced by 4D neural deformation fields in preserving shape consistency. This approach is naturally compatible with physics-based simulation, ensuring physically plausible interactions. Experiments show that AKD achieves superior 3D consistency and motion quality compared with existing works on text-to-4D generation. Project page: this https URL
zh

[CV-82] RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety

【速读】：该论文旨在解决准确识别复杂海洋环境中（如海滩）生成式 AI (Generative AI) 所描述的裂流（rip currents）这一难题。裂流具有无固定形状且难以标注的特点，传统方法依赖专家知识，限制了其广泛应用。为应对这些挑战，论文提出了RipVIS，这是一个专门设计用于裂流分割的大规模视频实例分割基准数据集。RipVIS的关键创新在于其规模（比现有数据集大一个量级）以及多样性，包含来自多种设备（无人机、手机、固定摄像头等）采集的184段视频（总计212,328帧），其中150段视频（163,528帧）含有裂流，并覆盖全球多个地理位置的多样化视觉场景。此外，引入基于时间置信聚合（Temporal Confidence Aggregation, TCA）的后处理技术显著提升了模型在动态环境中的分割性能，特别是在优化召回率和减少假阴性方面通过F₂分数进行重点评估。此方案的核心在于构建高质量标注数据集与开发适配动态场景的先进算法相结合，以推动更安全的海滩环境研究进展。

链接: https://arxiv.org/abs/2504.01128
作者: Andrei Dumitriu,Florin Tatui,Florin Miron,Aakash Ralhan,Radu Tudor Ionescu,Radu Timofte
机构: Computer Vision Lab, CAIDAS & IFI, University of Würzburg (维尔茨堡大学); University of Bucharest (布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach-related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and the lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large-scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring 184 videos ( 212,328 frames), of which 150 videos ( 163,528 frames) are with rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave-breaking patterns, sediment flows, and water color variations, across multiple global locations, including USA, Mexico, Costa Rica, Portugal, Italy, Greece, Romania, Sri Lanka, Australia and New Zealand. Most videos are annotated at 5 FPS to ensure accuracy in dynamic scenarios, supplemented by an additional 34 videos ( 48,800 frames) without rip currents. We conduct comprehensive experiments with Mask R-CNN, Cascade Mask R-CNN, SparseInst and YOLO11, fine-tuning these models for the task of rip current segmentation. Results are reported in terms of multiple metrics, with a particular focus on the F_2 score to prioritize recall and reduce false negatives. To enhance segmentation performance, we introduce a novel post-processing step based on Temporal Confidence Aggregation (TCA). RipVIS aims to set a new standard for rip current segmentation, contributing towards safer beach environments. We offer a benchmark website to share data, models, and results with the research community, encouraging ongoing collaboration and future contributions, at this https URL.
zh

[CV-83] Knowledge-Base based Semantic Image Transmission Using CLIP

【速读】：该论文旨在解决传统图像传输系统中过于依赖峰值信噪比（Peak Signal-to-Noise Ratio, PSNR）等像素级指标的问题，提出了一种基于知识库（Knowledge-Base, KB）辅助的语义通信框架，以实现更注重语义准确性（Semantic Accuracy）的图像传输。论文的关键在于结合对比语言-图像预训练模型（Contrastive Language-Image Pre-Training, CLIP）提取语义嵌入，并利用轻量级神经网络压缩特征，同时在接收端通过Facebook AI相似性搜索（FAISS）数据库进行语义匹配与检索。这种方案的核心创新点在于将语义一致性而非传统像素精度作为传输成功与否的评价标准，从而为语义感知通信系统提供了新的评估范式。

链接: https://arxiv.org/abs/2504.01053
作者: Chongyang Li,Yanmei He,Tianqian Zhang,Mingjian He,Shouyin Liu
机构: Dept. of Electronic & information Engineering, College of Physical Science and Technology, Central China Normal University (华中师范大学物理科学与技术学院电子与信息工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a novel knowledge-Base (KB) assisted semantic communication framework for image transmission. At the receiver, a Facebook AI Similarity Search (FAISS) based vector database is constructed by extracting semantic embeddings from images using the Contrastive Language-Image Pre-Training (CLIP) model. During transmission, the transmitter first extracts a 512-dimensional semantic feature using the CLIP model, then compresses it with a lightweight neural network for transmission. After receiving the signal, the receiver reconstructs the feature back to 512 dimensions and performs similarity matching from the KB to retrieve the most semantically similar image. Semantic transmission success is determined by category consistency between the transmitted and retrieved images, rather than traditional metrics like Peak Signal-to-Noise Ratio (PSNR). The proposed system prioritizes semantic accuracy, offering a new evaluation paradigm for semantic-aware communication systems. Experimental validation on CIFAR100 demonstrates the effectiveness of the framework in achieving semantic image transmission.
zh

[CV-84] SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering

【速读】：该论文致力于解决跨模态理解中的语音-视觉模态差距问题，特别是在基于语音的视觉问答（SBVQA）任务中，现有方法主要关注文本-视觉集成，而忽视了语音与视觉模态之间由于异构性导致的潜在关联。为了解决这一挑战，论文提出了SViQA模型，这是一种统一的语音-视觉框架，可以直接处理语音提问而不依赖于文本转录。其解决方案的关键创新点在于：(1) 端到端的语音特征提取，避免了中间的文本转换步骤；(2) 跨模态对齐优化，实现了语音信号与视觉内容的有效融合。实验结果表明，SViQA在SBVQA基准测试中达到了75.62%的准确率，并展现出卓越的多模态泛化能力，同时混合输入语音和文本进一步提升了性能至78.85%，证明了其增强的鲁棒性和有效的跨模态注意力对齐能力。

链接: https://arxiv.org/abs/2504.01049
作者: Bingxin Li
机构: School of Computer Science, Fudan University (计算机科学学院, 复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal models integrating speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA) where spoken questions about images require direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving speech-visual modality gaps underexplored due to their inherent heterogeneity. To this end, we introduce SViQA, a unified speech-vision model that directly processes spoken questions without text transcription. Building upon the LLaVA architecture, our framework bridges auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction eliminating intermediate text conversion, and (2) cross-modal alignment optimization enabling effective fusion of speech signals with visual content. Extensive experimental results on the SBVQA benchmark demonstrate the proposed SViQA’s state-of-the-art performance, achieving 75.62% accuracy, and competitive multimodal generalization. Leveraging speech-text mixed input boosts performance to 78.85%, a 3.23% improvement over pure speech input, highlighting SViQA’s enhanced robustness and effective cross-modal attention alignment.
zh

[CV-85] How does Watermarking Affect Visual Language Models in Document Understanding?

【速读】：该论文试图解决的问题是：水印是否会对视觉语言模型（Visual Language Models, VLMs）在文档理解任务中的性能产生负面影响。论文通过提出一个新颖的评估框架来研究可见水印对VLMs性能的影响，并系统性地分析了不同类型的文档数据、水印在文档中的位置以及水印内容变化等因素的作用。
解决方案的关键在于通过实验揭示水印对VLMs性能的具体影响机制，包括发现分散式水印比集中式水印造成更强干扰，且水印中的语义内容比简单视觉遮挡带来更大破坏。进一步通过注意力机制分析和嵌入空间相似性检查，确定性能下降的主要原因在于水印导致注意力分布的广泛重新分配以及嵌入空间中语义表示的改变。

链接: https://arxiv.org/abs/2504.01048
作者: Chunxue Xu,Yiwei Wang,Bryan Hooi,Yujun Cai,Songze Li
机构: Southeast University (东南大学, China); University of California, Merced (美国); National University of Singapore (新加坡); The University of Queensland (澳大利亚)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Visual Language Models (VLMs) have become foundational models for document understanding tasks, widely used in the processing of complex multimodal documents across domains such as finance, law, and academia. However, documents often contain noise-like information, such as watermarks, which inevitably leads us to inquire: \emphDo watermarks degrade the performance of VLMs in document understanding? To address this, we propose a novel evaluation framework to investigate the effect of visible watermarks on VLMs performance. We takes into account various factors, including different types of document data, the positions of watermarks within documents and variations in watermark content. Our experimental results reveal that VLMs performance can be significantly compromised by watermarks, with performance drop rates reaching up to 36%. We discover that \emphscattered watermarks cause stronger interference than centralized ones, and that \emphsemantic contents in watermarks creates greater disruption than simple visual occlusion. Through attention mechanism analysis and embedding similarity examination, we find that the performance drops are mainly attributed to that watermarks 1) force widespread attention redistribution, and 2) alter semantic representation in the embedding space. Our research not only highlights significant challenges in deploying VLMs for document understanding, but also provides insights towards developing robust inference mechanisms on watermarked documents.
zh

[CV-86] Predicting Movie Production Years through Facial Recognition of Actors with Machine Learning

【速读】：该论文旨在解决从电影随机抽取的图像中识别演员身份并提取其年龄的问题。这一任务面临诸如非均匀光照、演员多样且复杂的姿态以及图像中包含多重元素等挑战，并且化妆、假发、胡须以及不同配饰和服装的使用进一步增加了同一演员身份识别的难度。为应对这些挑战，论文构建了一个包含574张来自阿拉伯电影图像的阿拉伯演员数据集（Arab Actors Dataset, AAD），并尝试利用多种特征提取模型与机器学习算法进行分类和预测。研究的关键在于通过实验评估不同算法的性能，最终发现逻辑回归（Logistic Regression）模型在训练阶段表现出色，其AUC、精确度（Precision）、分类准确率（CA）和F1分数分别达到了99%、86%、85.5%和84.2%，从而证明了该模型在处理此类图像类型方面的有效性。这些成果可为提升面部识别技术的精度和可靠性提供支持，应用于电影搜索服务、推荐算法及电影类型分类等领域。

链接: https://arxiv.org/abs/2504.01047
作者: Asraa Muayed Abdalah,Noor Redha Alkazaz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study used machine learning algorithms to identify actors and extract the age of actors from images taken randomly from movies. The use of images taken from Arab movies includes challenges such as non-uniform lighting, different and multiple poses for the actors and multiple elements with the actor or a group of actors. Additionally, the use of make-up, wigs, beards, and wearing different accessories and costumes made it difficult for the system to identify the personality of the same actor. The Arab Actors Dataset-AAD comprises 574 images sourced from various movies, encompassing both black and white as well as color compositions. The images depict complete scenes or fragments thereof. Multiple models were employed for feature extraction, and diverse machine learning algorithms were utilized during the classification and prediction stages to determine the most effective algorithm for handling such image types. The study demonstrated the effectiveness of the Logistic Regression model exhibited the best performance compared to other models in the training phase, as evidenced by its AUC, precision, CA and F1score values of 99%, 86%, 85.5% and 84.2% respectively. The findings of this study can be used to improve the precision and reliability of facial recognition technology for various uses as with movies search services, movie suggestion algorithms, and genre classification of movies.
zh

[CV-87] Coarse-to-Fine Learning for Multi-Pipette Localisation in Robot-Assisted In Vivo Patch-Clamp

【速读】：该论文旨在解决在体多套管片膜钳技术（in vivo image-guided multi-pipette patch-clamp）中，由于依赖手动操作而造成的可访问性和可扩展性限制问题。此外，现有的机器人自动化方法主要针对离体实验或单套管应用，难以满足在体多套管场景下的精确、实时检测需求。论文的关键解决方案在于提出了一种基于热图增强的粗到细学习技术（heatmap-augmented coarse-to-fine learning technique），用于辅助机器人辅助的在体多套管实时定位。具体而言，通过引入基于生成对抗网络（GAN）的模块来去除背景噪声并增强套管可见性，并设计了一个两阶段Transformer模型，首先预测套管尖端的粗略热图，随后利用精细坐标回归模块实现精确的尖端定位。为确保训练鲁棒性，采用匈牙利算法（Hungarian algorithm）优化预测与实际位置之间的匹配。实验结果表明，该方法在10 μm范围内达到98%的准确性，在5 μm范围内达到89%，平均均方误差（MSE）为2.52 μm。

链接: https://arxiv.org/abs/2504.01044
作者: Lan Wei,Gema Vera Gonzalez,Phatsimo Kgwarae,Alexander Timms,Denis Zahorovsky,Simon Schultz,Dandan Zhang
机构: Imperial College London (帝国理工学院); Imperial-X Initiative, Department of Bioengineering (帝国-X计划, 生物工程系), Imperial College London (帝国理工学院), London, United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex vivo experiments or single pipette use, making them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to facilitate multi-pipette real-time localisation for robot-assisted in vivo patch-clamp. More specifically, we introduce a Generative Adversarial Network (GAN)-based module to remove background noise and enhance pipette visibility. We then introduce a two-stage Transformer model that starts with predicting the coarse heatmap of the pipette tips, followed by the fine-grained coordination regression module for precise tip localisation. To ensure robust training, we use the Hungarian algorithm for optimal matching between the predicted and actual locations of tips. Experimental results demonstrate that our method achieved 98% accuracy within 10 \mum, and 89% accuracy within 5 \mum for the localisation of multi-pipette tips. The average MSE is 2.52 \mum.
zh

[CV-88] Cal or No Cal? – Real-Time Miscalibration Detection of LiDAR and Camera Sensors

【速读】：该论文旨在解决传感器外参标定过程中实时性和资源需求受限的问题，特别是在目标less在线标定场景下，现有方法难以满足严格的时间和资源约束。论文的关键创新在于提出了一种误标定检测框架，将直接回归标定参数的任务转变为对两种不同传感器模态标定状态（标定或误标定）的二分类任务。为此，作者设计了一种对比学习方法，在潜在空间中比较嵌入特征以实现标定状态分类。此外，论文还对特征嵌入及具有挑战性的标定误差进行了全面分析，验证了所提方法在检测性能、推理时间和资源需求方面的优越性。

链接: https://arxiv.org/abs/2504.01040
作者: Ilir Tahiraj,Jeremialie Swadiryus,Felix Fent,Markus Lienkamp
机构: Institute of Automotive Technology (汽车技术研究所); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle’s safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. Therefore, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state-of-the-art in terms of detection performance, inference time, and resource demand. The code is open source and available on this https URL.
zh

[CV-89] Mesh Compression with Quantized Neural Displacement Fields

【速读】：该论文致力于解决将隐式神经表示（INRs）应用于未结构化3D数据（如三角网格和点云）进行压缩的问题。现有方法主要针对结构化数据（如SDFs、体素网格及图像等）有效，但在处理未结构化3D几何数据时受到限制。论文的关键解决方案在于提出一种简单而有效的编码方式，通过训练一个小的神经网络来学习表面位移场，从而优化粗略版本的3D三角网格表面。这种方法的核心优势在于，训练完成后，神经网络参数占用的内存远低于原始位移场或表面模型，同时能够保持复杂的几何细节，并在压缩比从4倍到380倍范围内表现出最先进的性能。

链接: https://arxiv.org/abs/2504.01027
作者: Sai Karthikey Pentapati,Gregoire Phillips,Alan C. Bovik
机构: Laboratory of Image and Video Engineering, The University of Texas at Austin (图像与视频工程实验室，德克萨斯大学奥斯汀分校); Ericsson Inc. (爱立信), Santa Clara, CA (圣克拉拉, 加利福尼亚州)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit neural representations (INRs) have been successfully used to compress a variety of 3D surface representations such as Signed Distance Functions (SDFs), voxel grids, and also other forms of structured data such as images, videos, and audio. However, these methods have been limited in their application to unstructured data such as 3D meshes and point clouds. This work presents a simple yet effective method that extends the usage of INRs to compress 3D triangle meshes. Our method encodes a displacement field that refines the coarse version of the 3D mesh surface to be compressed using a small neural network. Once trained, the neural network weights occupy much lower memory than the displacement field or the original surface. We show that our method is capable of preserving intricate geometric textures and demonstrates state-of-the-art performance for compression ratios ranging from 4x to 380x.
zh

[CV-90] Gaze-Guided 3D Hand Motion Prediction for Detecting Intent in Egocentric Grasping Tasks

【速读】：该论文旨在解决神经康复应用中基于手部运动预测的人类意图检测问题，传统方法受限于生理信号测量且缺乏环境上下文信息。为克服这些限制，论文提出了一种新颖的方法，通过整合 gaze 信息（gaze information）、历史手部运动序列以及环境物体数据，动态适应患者的辅助需求，而无需事先了解目标抓握对象。关键在于使用向量量化变分自编码器（Vector-Quantized Variational Autoencoder, VQ-VAE）实现鲁棒的手势编码，并结合自回归生成式 Transformer 模型有效预测手部运动序列。研究通过实验验证了所提方法在少量输入帧情况下显著提升预测能力，展示了其在实际应用中的潜力。

链接: https://arxiv.org/abs/2504.01024
作者: Yufei He,Xucong Zhang,Arno H. A. Stienen
机构: Department of Biomechanical Engineering, Faculty of Mechanical Engineering, Delft University of Technology (代尔夫特理工大学); Department of Intelligent Systems, Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Human intention detection with hand motion prediction is critical to drive the upper-extremity assistive robots in neurorehabilitation applications. However, the traditional methods relying on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. This method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the intended object for grasping. Specifically, we use a vector-quantized variational autoencoder for robust hand pose encoding with an autoregressive generative transformer for effective hand motion sequence prediction. We demonstrate the usability of these novel techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset consisting of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. Especially, the gaze information shows significant enhancements in prediction capabilities, particularly with fewer input frames, highlighting the potential of the proposed method for real-world applications.
zh

[CV-91] Omnidirectional Depth-Aided Occupancy Prediction based on Cylindrical Voxel for Autonomous Driving

【速读】：本文旨在解决自动驾驶中3D感知面临的几何歧义问题，传统方法因缺乏几何先验而表现不佳。为应对这些挑战，论文引入全向深度估计以提供几何先验，并提出基于草图着色框架的OmniDepth-Occ方法。关键在于利用深度信息结合极坐标系下的圆柱体体素表示，以更好地匹配全景相机的径向特性。此外，针对自动驾驶任务中鱼眼相机数据集匮乏的问题，构建了一个包含六台鱼眼相机的虚拟场景数据集，其数据量达到SemanticKITTI的两倍。实验结果表明，所提出的草图着色网络显著提升了3D感知性能。

链接: https://arxiv.org/abs/2504.01023
作者: Chaofan Wu,Jiaheng Li,Jinghao Cao,Ming Li,Yongkang Feng,Jiayu Wu Shuwen Xu,Zihang Gao,Sidan Du,Yang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Accurate 3D perception is essential for autonomous driving. Traditional methods often struggle with geometric ambiguity due to a lack of geometric prior. To address these challenges, we use omnidirectional depth estimation to introduce geometric prior. Based on the depth information, we propose a Sketch-Coloring framework OmniDepth-Occ. Additionally, our approach introduces a cylindrical voxel representation based on polar coordinate to better align with the radial nature of panoramic camera views. To address the lack of fisheye camera dataset in autonomous driving tasks, we also build a virtual scene dataset with six fisheye cameras, and the data volume has reached twice that of SemanticKITTI. Experimental results demonstrate that our Sketch-Coloring network significantly enhances 3D perception performance.
zh

[CV-92] Leverag ing Embedding Techniques in Multimodal Machine Learning for Mental Illness Assessment

【速读】：该论文旨在解决精神障碍（如抑郁症和创伤后应激障碍 PTSD）诊断工具的客观性、可扩展性和一致性不足的问题。传统临床评估方法在这些方面存在局限性，而论文提出利用多模态机器学习（Multimodal Machine Learning, MMML）整合文本、音频和视频数据中的互补信息来应对这些挑战。解决方案的关键在于采用全面的数据预处理技术（如基于语句的分块和格式化策略）、多种先进的嵌入模型、特征提取网络（如卷积神经网络 CNN 和双向长短期记忆网络 BiLSTM），以及多层级融合方法（包括引入大型语言模型 LLM 的预测结果）。此外，通过将多层感知机分类器替换为支持向量机进一步优化性能，并探索从严重程度预测到多类别分类的应用。研究发现，基于语句的分块显著提升了文本和音频模态的表现，而结合 CNN-BiLSTM 架构与外部 LLM 的决策级融合方法实现了最高的检测准确性（抑郁检测的平衡准确率为 94.8%，PTSD 检测为 96.2%），为开发更精准、可及且个性化的心理健康评估工具提供了有力支持。

链接: https://arxiv.org/abs/2504.01767
作者: Abdelrahaman A. Hassan,Abdelrahman A. Ali,Aya E. Fouda,Radwa J. Hanafy,Mohammed E. Fouda
机构: Compumacy for Artificial Intelligence solutions (计算能力人工智能解决方案公司); Department of Behavioural Health- Saint Elizabeths Hospital (行为健康系-圣伊丽莎白医院)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing global prevalence of mental disorders, such as depression and PTSD, requires objective and scalable diagnostic tools. Traditional clinical assessments often face limitations in accessibility, objectivity, and consistency. This paper investigates the potential of multimodal machine learning to address these challenges, leveraging the complementary information available in text, audio, and video data. Our approach involves a comprehensive analysis of various data preprocessing techniques, including novel chunking and utterance-based formatting strategies. We systematically evaluate a range of state-of-the-art embedding models for each modality and employ Convolutional Neural Networks (CNNs) and Bidirectional LSTM Networks (BiLSTMs) for feature extraction. We explore data-level, feature-level, and decision-level fusion techniques, including a novel integration of Large Language Model (LLM) predictions. We also investigate the impact of replacing Multilayer Perceptron classifiers with Support Vector Machines. We extend our analysis to severity prediction using PHQ-8 and PCL-C scores and multi-class classification (considering co-occurring conditions). Our results demonstrate that utterance-based chunking significantly improves performance, particularly for text and audio modalities. Decision-level fusion, incorporating LLM predictions, achieves the highest accuracy, with a balanced accuracy of 94.8% for depression and 96.2% for PTSD detection. The combination of CNN-BiLSTM architectures with utterance-level chunking, coupled with the integration of external LLM, provides a powerful and nuanced approach to the detection and assessment of mental health conditions. Our findings highlight the potential of MMML for developing more accurate, accessible, and personalized mental healthcare tools.
zh

[CV-93] Instance Migration Diffusion for Nuclear Instance Segmentation in Pathology

【速读】：该论文旨在解决数字病理学中由于标注数据有限导致的细胞核实例分割性能受限的问题。为应对这一挑战，论文提出了一种名为Instance Migration Diffusion Model (IM-Diffusion) 的新型数据增强框架。该框架通过构建多样化的细胞核布局和细胞间空间关系，生成更丰富的病理图像。解决方案的关键在于引入了两个核心模块：Nuclear Migration Module (NMM)，用于通过模拟细胞核迁移过程来构造多样化的细胞核布局；以及Internuclear-regions Inpainting Module (IIM)，通过结构感知的修复生成多样的细胞间空间关系。基于这两个模块，IM-Diffusion能够生成具有不同布局和细胞间空间关系的多样化病理图像，从而提升下游任务的性能。

链接: https://arxiv.org/abs/2504.01577
作者: Lirui Qi,Hongliang He,Tong Wang,Siwei Feng,Guohong Fu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nuclear instance segmentation plays a vital role in disease diagnosis within digital pathology. However, limited labeled data in pathological images restricts the overall performance of nuclear instance segmentation. To tackle this challenge, we propose a novel data augmentation framework Instance Migration Diffusion Model (IM-Diffusion), IM-Diffusion designed to generate more varied pathological images by constructing diverse nuclear layouts and internuclear spatial relationships. In detail, we introduce a Nuclear Migration Module (NMM) which constructs diverse nuclear layouts by simulating the process of nuclear migration. Building on this, we further present an Internuclear-regions Inpainting Module (IIM) to generate diverse internuclear spatial relationships by structure-aware inpainting. On the basis of the above, IM-Diffusion generates more diverse pathological images with different layouts and internuclear spatial relationships, thereby facilitating downstream tasks. Evaluation on the CoNSeP and GLySAC datasets demonstrate that the images generated by IM-Diffusion effectively enhance overall instance segmentation performance. Code will be made public later.
zh

[CV-94] STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

【速读】：该论文旨在解决医学图像分析中病变分割准确性受病变分布和大小不确定性影响的问题。传统方法仅依赖视觉特征难以应对这些挑战。为了解决这些问题，论文提出了一种名为STPNet（Scale-aware Text Prompt Network）的网络，其关键是利用跨模态学习将视觉与语言模态的知识融合。STPNet通过多尺度文本描述引导病变定位，并采用检索-分割联合学习来弥合视觉与语言之间的语义鸿沟。在训练过程中，它从专业的医学文本库中检索相关文本信息，无需在推理阶段提供文本输入，同时保留了跨模态学习的优势。

链接: https://arxiv.org/abs/2504.01561
作者: Dandan Shan,Zihan Li,Yunxiang Li,Qingde Li,Jie Tian,Qingqi Hong
机构: Xiamen University (厦门大学); Department of Bioengineering, University of Washington (华盛顿大学生物工程系); Department of Radiation Oncology, UT Southwestern Medical Center (UT Southwestern 医学中心放射肿瘤科); Department of Computer Science, University of Hull (赫尔大学计算机科学系); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly on this https URL.
zh

[CV-95] BOLDSimNet: Examining Brain Network Similarity between Task and Resting-State fMRI

【速读】：该论文旨在解决传统基于任务和静息态功能磁共振成像（fMRI）因果连接方法在准确捕捉信息流方向时面临的挑战，这些问题源于其对噪声敏感以及无法有效建模多变量依赖关系。这些局限性阻碍了认知状态下大脑网络的有效比较，使得分析任务态与静息态下的网络重构变得困难。为了解决这些问题，论文提出了BOLDSimNet这一新框架，其关键在于利用多元传递熵（Multivariate Transfer Entropy, MTE）来衡量因果连接性和不同认知状态下的网络相似性，并通过将功能相似的感兴趣区域（Regions of Interest, ROIs）分组而非基于空间相邻节点进行分组，从而提高了网络对齐的准确性。

链接: https://arxiv.org/abs/2504.01274
作者: Boseong Kim,Debashis Das Chakladar,Haejun Chung,Ikbeom Jang
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional causal connectivity methods in task-based and resting-state functional magnetic resonance imaging (fMRI) face challenges in accurately capturing directed information flow due to their sensitivity to noise and inability to model multivariate dependencies. These limitations hinder the effective comparison of brain networks between cognitive states, making it difficult to analyze network reconfiguration during task and resting states. To address these issues, we propose BOLDSimNet, a novel framework utilizing Multivariate Transfer Entropy (MTE) to measure causal connectivity and network similarity across different cognitive states. Our method groups functionally similar regions of interest (ROIs) rather than spatially adjacent nodes, improving accuracy in network alignment. We applied BOLDSimNet to fMRI data from 40 healthy controls and found that children exhibited higher similarity scores between task and resting states compared to adolescents, indicating reduced variability in attention shifts. In contrast, adolescents showed more differences between task and resting states in the Dorsal Attention Network (DAN) and the Default Mode Network (DMN), reflecting enhanced network adaptability. These findings emphasize developmental variations in the reconfiguration of the causal brain network, showcasing BOLDSimNet’s ability to quantify network similarity and identify attentional fluctuations between different cognitive states.
zh

[CV-96] Lightweight Deep Models for Dermatological Disease Detection: A Study on Instance Selection and Channel Optimization

【速读】：该论文试图解决皮肤病分类中数据集质量对轻量级卷积神经网络（Lightweight Convolutional Neural Networks）性能影响的问题。解决方案的关键在于提出了一种针对dermaMNIST数据集的预处理方法，通过减少训练实例数量来提升数据质量，同时保持与ResNet模型相当的分类性能。

链接: https://arxiv.org/abs/2504.01208
作者: Ian Mateos Gonzalez,Estefani Jaramilla Nava,Abraham Sánchez Morales,Jesús García-Ramírez,Ricardo Ramos-Aguilar
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to Mexican Conference on Pattern Recognition 2025

点击查看摘要

Abstract:The identification of dermatological disease is an important problem in Mexico according with different studies. Several works in literature use the datasets of different repositories without applying a study of the data behavior, especially in medical images domain. In this work, we propose a methodology to preprocess dermaMNIST dataset in order to improve its quality for the classification stage, where we use lightweight convolutional neural networks. In our results, we reduce the number of instances for the neural network training obtaining a similar performance of models as ResNet.
zh

[CV-97] An Integrated AI-Enabled System Using One Class Twin Cross Learning (OCT-X) for Early Gastric Cancer Detection

【速读】：该论文旨在解决胃癌早期检测中因现有诊断技术局限性导致的误诊率高和漏诊率高的问题。为应对这些挑战，论文提出了一种集成系统，通过结合先进的硬件与软件技术实现速度与准确性的平衡。解决方案的关键在于引入One Class Twin Cross Learning (OCT-X)算法，该算法利用新颖的快速双阈值网格搜索策略（Fast Double-Threshold Grid Search, FDT-GS）和基于图像块的深度全卷积网络，在实时数据处理和病变监测方面实现了卓越的诊断准确性。此外，硬件组件采用一体化即时检测(POCT)设备，配备高分辨率成像传感器、实时数据处理及无线连接功能，由NI CompactDAQ和LabVIEW软件支持。这一集成系统的诊断准确率达到99.70%，显著优于现有模型，并在多速率适应性方面提高了10%，充分体现了OCT-X算法及其集成系统的临床应用潜力。

链接: https://arxiv.org/abs/2504.01038
作者: Xian-Xian Liu,Yuanyuan Wei,Mingkun Xu,Yongze Guo,Hongwei Zhang,Huicong Dong,Qun Song,Qi Zhao,Wei Luo,Feng Tien,Juntao Gao,Simon Fong
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 26 pages, 4 figures, 6 tables

点击查看摘要

Abstract:Early detection of gastric cancer, a leading cause of cancer-related mortality worldwide, remains hampered by the limitations of current diagnostic technologies, leading to high rates of misdiagnosis and missed diagnoses. To address these challenges, we propose an integrated system that synergizes advanced hardware and software technologies to balance speed-accuracy. Our study introduces the One Class Twin Cross Learning (OCT-X) algorithm. Leveraging a novel fast double-threshold grid search strategy (FDT-GS) and a patch-based deep fully convolutional network, OCT-X maximizes diagnostic accuracy through real-time data processing and seamless lesion surveillance. The hardware component includes an all-in-one point-of-care testing (POCT) device with high-resolution imaging sensors, real-time data processing, and wireless connectivity, facilitated by the NI CompactDAQ and LabVIEW software. Our integrated system achieved an unprecedented diagnostic accuracy of 99.70%, significantly outperforming existing models by up to 4.47%, and demonstrated a 10% improvement in multirate adaptability. These findings underscore the potential of OCT-X as well as the integrated system in clinical diagnostics, offering a path toward more accurate, efficient, and less invasive early gastric cancer detection. Future research will explore broader applications, further advancing oncological diagnostics. Code is available at this https URL.
zh

[CV-98] Novel sparse PCA method via Runge Kutta numerical method(s) for face recognition

【速读】：该论文旨在解决人脸识别这一数据科学与生物特征安全领域的核心问题，其应用场景涵盖军事、金融及零售等行业。论文的关键解决方案是将稀疏主成分分析（Sparse PCA）与k-近邻方法或核岭回归方法相结合，并采用近端梯度法（Proximal Gradient，也称ISTA）或龙格-库塔数值方法来求解稀疏PCA。实验结果表明，利用这两种方法优化后的稀疏PCA与分类系统结合后，其识别精度高于标准PCA，且基于龙格-库塔方法的稀疏PCA计算在速度上始终优于近端梯度法。

链接: https://arxiv.org/abs/2504.01035
作者: Loc Hoang Tran,Luong Anh Tuan Nguyen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 3 tables

点击查看摘要

Abstract:Face recognition is a crucial topic in data science and biometric security, with applications spanning military, finance, and retail industries. This paper explores the implementation of sparse Principal Component Analysis (PCA) using the Proximal Gradient method (also known as ISTA) and the Runge-Kutta numerical methods. To address the face recognition problem, we integrate sparse PCA with either the k-nearest neighbor method or the kernel ridge regression method. Experimental results demonstrate that combining sparse PCA-solved via the Proximal Gradient method or the Runge-Kutta numerical approach-with a classification system yields higher accuracy compared to standard PCA. Additionally, we observe that the Runge-Kutta-based sparse PCA computation consistently outperforms the Proximal Gradient method in terms of speed.
zh

[CV-99] Diagnosis of Pulmonary Hypertension by Integrating Multimodal Data with a Hybrid Graph Convolutional and Transformer Network

【速读】：该论文旨在解决早期准确诊断肺动脉高压（Pulmonary Hypertension, PH）及其区分前毛细血管 PH 和后毛细血管 PH 的难题，以优化临床管理决策。解决方案的关键在于开发一种基于深度学习的多模态诊断模型，该模型融合了图卷积网络（Graph Convolutional Networks, GCN）、卷积神经网络（Convolutional Neural Networks, CNN）和 Transformer，能够有效处理包括短轴（SAX）序列、四腔心（4CH）序列以及临床参数在内的多模态数据，从而实现对非 PH、前毛细血管 PH 和后毛细血管 PH 的分类，并在测试集上达到了 AUC = 0.81 ± 0.06 和 ACC = 0.73 ± 0.06 的性能表现。

链接: https://arxiv.org/abs/2504.01025
作者: Fubao Zhu,Yang Zhang,Gengmin Liang,Jiaofen Nan,Yanting Li,Chuang Han,Danyang Sun,Zhiguo Wang,Chen Zhao,Wenxuan Zhou,Jian He,Yi Xu,Iokfai Cheang,Xu Zhu,Yanli Zhou,Weihua Zhou
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 23 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Early and accurate diagnosis of pulmonary hypertension (PH) is essential for optimal patient management. Differentiating between pre-capillary and post-capillary PH is critical for guiding treatment decisions. This study develops and validates a deep learning-based diagnostic model for PH, designed to classify patients as non-PH, pre-capillary PH, or post-capillary PH. This retrospective study analyzed data from 204 patients (112 with pre-capillary PH, 32 with post-capillary PH, and 60 non-PH controls) at the First Affiliated Hospital of Nanjing Medical University. Diagnoses were confirmed through right heart catheterization. We selected 6 samples from each category for the test set (18 samples, 10%), with the remaining 186 samples used for the training set. This process was repeated 35 times for testing. This paper proposes a deep learning model that combines Graph convolutional networks (GCN), Convolutional neural networks (CNN), and Transformers. The model was developed to process multimodal data, including short-axis (SAX) sequences, four-chamber (4CH) sequences, and clinical parameters. Our model achieved a performance of Area under the receiver operating characteristic curve (AUC) = 0.81 ± 0.06(standard deviation) and Accuracy (ACC) = 0.73 ± 0.06 on the test set. The discriminative abilities were as follows: non-PH subjects (AUC = 0.74 ± 0.11), pre-capillary PH (AUC = 0.86 ± 0.06), and post-capillary PH (AUC = 0.83 ± 0.10). It has the potential to support clinical decision-making by effectively integrating multimodal data to assist physicians in making accurate and timely diagnoses.
zh

人工智能

[AI-0] Efficient Federated Learning Tiny Language Models for Mobile Network Feature Prediction

链接: https://arxiv.org/abs/2504.01947
作者: Daniel Becking,Ingo Friese,Karsten Müller,Thomas Buchholz,Mandy Galkow-Schneider,Wojciech Samek,Detlev Marpe
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Signal Processing (eess.SP)
*备注: Accepted at 2025 EuCNC 6G Summit Poster Session

点击查看摘要

Abstract:In telecommunications, Autonomous Networks (ANs) automatically adjust configurations based on specific requirements (e.g., bandwidth) and available resources. These networks rely on continuous monitoring and intelligent mechanisms for self-optimization, self-repair, and self-protection, nowadays enhanced by Neural Networks (NNs) to enable predictive modeling and pattern recognition. Here, Federated Learning (FL) allows multiple AN cells - each equipped with NNs - to collaboratively train models while preserving data privacy. However, FL requires frequent transmission of large neural data and thus an efficient, standardized compression strategy for reliable communication. To address this, we investigate NNCodec, a Fraunhofer implementation of the ISO/IEC Neural Network Coding (NNC) standard, within a novel FL framework that integrates tiny language models (TLMs) for various mobile network feature prediction (e.g., ping, SNR or band frequency). Our experimental results on the Berlin V2X dataset demonstrate that NNCodec achieves transparent compression (i.e., negligible performance loss) while reducing communication overhead to below 1%, showing the effectiveness of combining NNC with FL in collaboratively learned autonomous mobile networks.

[AI-1] Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?

链接: https://arxiv.org/abs/2504.01935
作者: Celine Lee,Alexander M. Rush,Keyon Vafa
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we formalize a framework using deterministic finite automata (DFAs). DFAs offer a formalism through which we can characterize task complexity through measurable properties such as run length (number of reasoning steps required) and state-space size (decision complexity). We first show that across different tasks and models of different sizes and training paradigms, there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer corresponding underlying DFA runs (i.e. demand greater latent state-tracking requirements) correlate with longer reasoning lengths, but, surprisingly, that DFA size (i.e. state-space complexity) does not. We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.

[AI-2] Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework

链接: https://arxiv.org/abs/2504.01908
作者: Andrey Sidorenko,Michael Platzer,Mario Scriminaci,Paul Tiwald
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 7 figures, 1 table

点击查看摘要

Abstract:Evaluating the quality of synthetic data remains a key challenge for ensuring privacy and utility in data-driven research. In this work, we present an evaluation framework that quantifies how well synthetic data replicates original distributional properties while ensuring privacy. The proposed approach employs a holdout-based benchmarking strategy that facilitates quantitative assessment through low- and high-dimensional distribution comparisons, embedding-based similarity measures, and nearest-neighbor distance metrics. The framework supports various data types and structures, including sequential and contextual information, and enables interpretable quality diagnostics through a set of standardized metrics. These contributions aim to support reproducibility and methodological consistency in benchmarking of synthetic data generation techniques. The code of the framework is available at this https URL.

[AI-3] Accelerating IoV Intrusion Detection: Benchmarking GPU-Accelerated vs CPU-Based ML Libraries

链接: https://arxiv.org/abs/2504.01905
作者: Furkan Çolhak,Hasan Coşkun,Tsafac Nkombong Regine Cyrille,Tedi Hoxa,Mert İlhan Ecevit,Mehmet Nafiz Aydın
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: CIIT 2025 22nd International Conference on Informatics and Information Technologies (CIIT)

点击查看摘要

Abstract:The Internet of Vehicles (IoV) may face challenging cybersecurity attacks that may require sophisticated intrusion detection systems, necessitating a rapid development and response system. This research investigates the performance advantages of GPU-accelerated libraries (cuML) compared to traditional CPU-based implementations (scikit-learn), focusing on the speed and efficiency required for machine learning models used in IoV threat detection environments. The comprehensive evaluations conducted employ four machine learning approaches (Random Forest, KNN, Logistic Regression, XGBoost) across three distinct IoV security datasets (OTIDS, GIDS, CICIoV2024). Our findings demonstrate that GPU-accelerated implementations dramatically improved computational efficiency, with training times reduced by a factor of up to 159 and prediction speeds accelerated by up to 95 times compared to traditional CPU processing, all while preserving detection accuracy. This remarkable performance breakthrough empowers researchers and security specialists to harness GPU acceleration for creating faster, more effective threat detection systems that meet the urgent real-time security demands of today’s connected vehicle networks.

[AI-4] A novel gesture interaction control method for rehabilitation lower extremity exoskeleton

链接: https://arxiv.org/abs/2504.01888
作者: Shuang Qiu,Zhongcai Pei,Chen Wang,Jing Zhang,Zhiyong Tang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:With the rapid development of Rehabilitation Lower Extremity Robotic Exoskeletons (RLEEX) technology, significant advancements have been made in Human-Robot Interaction (HRI) methods. These include traditional physical HRI methods that are easily recognizable and various bio-electrical signal-based HRI methods that can visualize and predict actions. However, most of these HRI methods are contact-based, facing challenges such as operational complexity, sensitivity to interference, risks associated with implantable devices, and, most importantly, limitations in comfort. These challenges render the interaction less intuitive and natural, which can negatively impact patient motivation for rehabilitation. To address these issues, this paper proposes a novel non-contact gesture interaction control method for RLEEX, based on RGB monocular camera depth estimation. This method integrates three key steps: detecting keypoints, recognizing gestures, and assessing distance, thereby applying gesture information and augmented reality triggering technology to control gait movements of RLEEX. Results indicate that this approach provides a feasible solution to the problems of poor comfort, low reliability, and high latency in HRI for RLEEX platforms. Specifically, it achieves a gesture-controlled exoskeleton motion accuracy of 94.11% and an average system response time of 0.615 seconds through non-contact HRI. The proposed non-contact HRI method represents a pioneering advancement in control interactions for RLEEX, paving the way for further exploration and development in this field.

[AI-5] Interpreting Emergent Planning in Model-Free Reinforcement Learning ICLR2025

链接: https://arxiv.org/abs/2504.01871
作者: Thomas Bush,Stephen Chung,Usman Anwar,Adrià Garriga-Alonso,David Krueger
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025 oral

点击查看摘要

Abstract:We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban – a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by Guez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent’s representations, and (3) verifying that discovered plans (in the agent’s representations) have a causal effect on the agent’s behavior through interventions. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is important given the recent trend of emergent planning and reasoning capabilities in LLMs through RL

[AI-6] From Code Generation to Software Testing: AI Copilot with Context-Based RAG

链接: https://arxiv.org/abs/2504.01866
作者: Yuchen Wang,Shangxin Guo,Chee Wei Tan
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: This work has been accepted for publication in IEEE Software (DOI: https://doi.org/10.1109/MS.2025.3549628 )

点击查看摘要

Abstract:The rapid pace of large-scale software development places increasing demands on traditional testing methodologies, often leading to bottlenecks in efficiency, accuracy, and coverage. We propose a novel perspective on software testing by positing bug detection and coding with fewer bugs as two interconnected problems that share a common goal, which is reducing bugs with limited resources. We extend our previous work on AI-assisted programming, which supports code auto-completion and chatbot-powered QA, to the realm of software testing. We introduce Copilot for Testing, an automated testing system that synchronizes bug detection with codebase updates, leveraging context-based Retrieval Augmented Generation (RAG) to enhance the capabilities of large language models (LLMs). Our evaluation demonstrates a 31.2% improvement in bug detection accuracy, a 12.6% increase in critical test coverage, and a 10.5% higher user acceptance rate, highlighting the transformative potential of AI-driven technologies in modern software development practices.

[AI-7] Enhanced Diffusion Sampling via Extrapolation with Multiple ODE Solutions ICLR2025

链接: https://arxiv.org/abs/2504.01855
作者: Jinyoung Choi,Junoh Kang,Bohyung Han
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025

点击查看摘要

Abstract:Diffusion probabilistic models (DPMs), while effective in generating high-quality samples, often suffer from high computational costs due to their iterative sampling process. To address this, we propose an enhanced ODE-based sampling method for DPMs inspired by Richardson extrapolation, which reduces numerical error and improves convergence rates. Our method, RX-DPM, leverages multiple ODE solutions at intermediate time steps to extrapolate the denoised prediction in DPMs. This significantly enhances the accuracy of estimations for the final sample while maintaining the number of function evaluations (NFEs). Unlike standard Richardson extrapolation, which assumes uniform discretization of the time grid, we develop a more general formulation tailored to arbitrary time step scheduling, guided by local truncation error derived from a baseline sampling method. The simplicity of our approach facilitates accurate estimation of numerical solutions without significant computational overhead, and allows for seamless and convenient integration into various DPMs and solvers. Additionally, RX-DPM provides explicit error estimates, effectively demonstrating the faster convergence as the leading error term’s order increases. Through a series of experiments, we show that the proposed method improves the quality of generated samples without requiring additional sampling iterations.

[AI-8] Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks

链接: https://arxiv.org/abs/2504.01850
作者: Ali Al-Kaswan,Sebastian Deatc,Begüm Koç,Arie van Deursen,Maliheh Izadi
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: FSE’25 Technical Track

点击查看摘要

Abstract:Nowadays, developers increasingly rely on solutions powered by Large Language Models (LLM) to assist them with their coding tasks. This makes it crucial to align these tools with human values to prevent malicious misuse. In this paper, we propose a comprehensive framework for assessing the potential harmfulness of LLMs within the software engineering domain. We begin by developing a taxonomy of potentially harmful software engineering scenarios and subsequently, create a dataset of prompts based on this taxonomy. To systematically assess the responses, we design and validate an automatic evaluator that classifies the outputs of a variety of LLMs both open-source and closed-source models, as well as general-purpose and code-specific LLMs. Furthermore, we investigate the impact of models size, architecture family, and alignment strategies on their tendency to generate harmful content. The results show significant disparities in the alignment of various LLMs for harmlessness. We find that some models and model families, such as Openhermes, are more harmful than others and that code-specific models do not perform better than their general-purpose counterparts. Notably, some fine-tuned models perform significantly worse than their base-models due to their design choices. On the other side, we find that larger models tend to be more helpful and are less likely to respond with harmful information. These results highlight the importance of targeted alignment strategies tailored to the unique challenges of software engineering tasks and provide a foundation for future work in this critical area.

[AI-9] An Approach to Technical AGI Safety and Security

链接: https://arxiv.org/abs/2504.01849
作者: Rohin Shah,Alex Irpan,Alexander Matt Turner,Anna Wang,Arthur Conmy,David Lindner,Jonah Brown-Cohen,Lewis Ho,Neel Nanda,Raluca Ada Popa,Rishub Jain,Rory Greig,Samuel Albanie,Scott Emmons,Sebastian Farquhar,Sébastien Krier,Senthooran Rajamanoharan,Sophie Bridgers,Tobi Ijitoye,Tom Everitt,Victoria Krakovna,Vikrant Varma,Vladimir Mikulik,Zachary Kenton,Dave Orr,Shane Legg,Noah Goodman,Allan Dafoe,Four Flynn,Anca Dragan
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.

[AI-10] A Novel Approach To Implementing Knowledge Distillation In Tsetlin Machines

链接: https://arxiv.org/abs/2504.01798
作者: Calvin Kinateder
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: Master’s Thesis. 75 pages, 30 figures

点击查看摘要

Abstract:The Tsetlin Machine ™ is a propositional logic based model that uses conjunctive clauses to learn patterns from data. As with typical neural networks, the performance of a Tsetlin Machine is largely dependent on its parameter count, with a larger number of parameters producing higher accuracy but slower execution. Knowledge distillation in neural networks transfers information from an already-trained teacher model to a smaller student model to increase accuracy in the student without increasing execution time. We propose a novel approach to implementing knowledge distillation in Tsetlin Machines by utilizing the probability distributions of each output sample in the teacher to provide additional context to the student. Additionally, we propose a novel clause-transfer algorithm that weighs the importance of each clause in the teacher and initializes the student with only the most essential data. We find that our algorithm can significantly improve performance in the student model without negatively impacting latency in the tested domains of image recognition and text classification.

[AI-11] Rethinking industrial artificial intelligence: a unified foundation framework

链接: https://arxiv.org/abs/2504.01797
作者: Jay Lee,Hanqi Su
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The paper submitted to IJAMD, the International Journal of AI for Materials and Design, has been accepted

点击查看摘要

Abstract:Recent advancement in industrial artificial intelligence (AI) is reshaping the industry, driving smarter manufacturing, predictive maintenance, and intelligent decision-making. However, existing approaches often focus primarily on algorithms and models, overlooking the importance of systematically integrating domain knowledge, data, and models to ensure more comprehensive and effective AI solutions. Therefore, the effective development and deployment of Industrial AI solutions require a more comprehensive and systematic approach. To address this gap, this paper summarizes previous research and rethinks the role of industrial AI and presents a unified industrial AI foundation framework comprising three core modules: knowledge module, data module, and model module. These modules help to extend and enhance the industrial AI methodology platform, supporting various industrial applications. In addition, a case study on rotating machinery diagnosis demonstrates the framework’s effectiveness, and several future directions are highlighted for the development of the industrial AI foundation framework.

[AI-12] CLaP – State Detection from Time Series

链接: https://arxiv.org/abs/2504.01783
作者: Arik Ermshaus,Patrick Schäfer,Ulf Leser
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The ever-growing amount of sensor data from machines, smart devices, and the environment leads to an abundance of high-resolution, unannotated time series (TS). These recordings encode the recognizable properties of latent states and transitions from physical phenomena that can be modelled as abstract processes. The unsupervised localization and identification of these states and their transitions is the task of time series state detection (TSSD). We introduce CLaP, a new, highly accurate and efficient algorithm for TSSD. It leverages the predictive power of time series classification for TSSD in an unsupervised setting by applying novel self-supervision techniques to detect whether data segments emerge from the same state or not. To this end, CLaP cross-validates a classifier with segment-labelled subsequences to quantify confusion between segments. It merges labels from segments with high confusion, representing the same latent state, if this leads to an increase in overall classification quality. We conducted an experimental evaluation using 391 TS from four benchmarks and found CLaP to be significantly more precise in detecting states than five state-of-the-art competitors. It achieves the best accuracy-runtime tradeoff and is scalable to large TS. We provide a Python implementation of CLaP, which can be deployed in TS analysis workflows.

[AI-13] Enhancing Interpretability in Generative AI Through Search-Based Data Influence Analysis

链接: https://arxiv.org/abs/2504.01771
作者: Theodoros Aivalis,Iraklis A. Klampanos,Antonis Troumpoukis,Joemon M. Jose
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative AI models offer powerful capabilities but often lack transparency, making it difficult to interpret their output. This is critical in cases involving artistic or copyrighted content. This work introduces a search-inspired approach to improve the interpretability of these models by analysing the influence of training data on their outputs. Our method provides observational interpretability by focusing on a model’s output rather than on its internal state. We consider both raw data and latent-space embeddings when searching for the influence of data items in generated content. We evaluate our method by retraining models locally and by demonstrating the method’s ability to uncover influential subsets in the training data. This work lays the groundwork for future extensions, including user-based evaluations with domain experts, which is expected to improve observational interpretability further.

[AI-14] Epistemic Skills: Reasoning about Knowledge and Oblivion

链接: https://arxiv.org/abs/2504.01733
作者: Xiaolong Liang,Yì N. Wáng
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an epistemic skills'' metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of knowability’’ and ``forgettability,‘’ defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

[AI-15] Sky of Unlearning (SoUL): Rewiring Federated Machine Unlearning via Selective Pruning

链接: https://arxiv.org/abs/2504.01705
作者: Md Mahabub Uz Zaman,Xiang Sun,Jingjing Yao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 6 pages, 6 figures, IEEE International Conference on Communications (ICC 2025)

点击查看摘要

Abstract:The Internet of Drones (IoD), where drones collaborate in data collection and analysis, has become essential for applications such as surveillance and environmental monitoring. Federated learning (FL) enables drones to train machine learning models in a decentralized manner while preserving data privacy. However, FL in IoD networks is susceptible to attacks like data poisoning and model inversion. Federated unlearning (FU) mitigates these risks by eliminating adversarial data contributions, preventing their influence on the model. This paper proposes sky of unlearning (SoUL), a federated unlearning framework that efficiently removes the influence of unlearned data while maintaining model performance. A selective pruning algorithm is designed to identify and remove neurons influential in unlearning but minimally impact the overall performance of the model. Simulations demonstrate that SoUL outperforms existing unlearning methods, achieves accuracy comparable to full retraining, and reduces computation and communication overhead, making it a scalable and efficient solution for resource-constrained IoD networks.

[AI-16] Reasoning LLM s for User-Aware Multimodal Conversational Agents

链接: https://arxiv.org/abs/2504.01700
作者: Hamed Rahimi,Jeanne Cattoni,Meriem Beghili,Mouad Abrini,Mahdi Khoramshahi,Maribel Pino,Mohamed Chetouani
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Personalization in social robotics is critical for fostering effective human-robot interactions, yet systems often face the cold start problem, where initial user preferences or characteristics are unavailable. This paper proposes a novel framework called USER-LLM R1 for a user-aware conversational agent that addresses this challenge through dynamic user profiling and model initiation. Our approach integrates chain-of-thought (CoT) reasoning models to iteratively infer user preferences and vision-language models (VLMs) to initialize user profiles from multimodal inputs, enabling personalized interactions from the first encounter. Leveraging a Retrieval-Augmented Generation (RAG) architecture, the system dynamically refines user representations within an inherent CoT process, ensuring contextually relevant and adaptive responses. Evaluations on the ElderlyTech-VQA Bench demonstrate significant improvements in ROUGE-1 (+23.2%), ROUGE-2 (+0.6%), and ROUGE-L (+8%) F1 scores over state-of-the-art baselines, with ablation studies underscoring the impact of reasoning model size on performance. Human evaluations further validate the framework’s efficacy, particularly for elderly users, where tailored responses enhance engagement and trust. Ethical considerations, including privacy preservation and bias mitigation, are rigorously discussed and addressed to ensure responsible deployment.

[AI-17] oken Pruning in Audio Transformers: Optimizing Performance and Decoding Patch Importance

链接: https://arxiv.org/abs/2504.01690
作者: Taehan Lee,Hyukjun Lee
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注: This work has been submitted to the IEEE for possible publication. Source code is available at this https URL

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved state-of-the-art performance across various computer vision tasks, but their high computational cost remains a challenge. Token pruning has been proposed to reduce this cost by selectively removing less important tokens. While effective in vision tasks by discarding non-object regions, applying this technique to audio tasks presents unique challenges, as distinguishing relevant from irrelevant regions in time-frequency representations is less straightforward. In this study, for the first time, we applied token pruning to ViT-based audio classification models using Mel-spectrograms and analyzed the trade-offs between model performance and computational cost: TopK token pruning can reduce MAC operations of AudioMAE and AST by 30-40%, with less than a 1% drop in classification accuracy. Our analysis reveals that while high-intensity tokens contribute significantly to model accuracy, low-intensity tokens remain important. In particular, they play a more critical role in general audio classification tasks than in speech-specific tasks.

[AI-18] Anomaly Detection for Hybrid Butterfly Subspecies via Probability Filtering AAAI’25

链接: https://arxiv.org/abs/2504.01671
作者: Bo-Kai Ruan,Yi-Zeng Fang,Hong-Han Shuai,Juinn-Dar Huang
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注: AAAI’25 Workshop in Anomaly Detection in Scientific Domains

点击查看摘要

Abstract:Detecting butterfly hybrids requires knowledge of the parent subspecies, and the process can be tedious when encountering a new subspecies. This study focuses on a specific scenario where a model trained to recognize hybrid species A can generalize to species B when B biologically mimics A. Since species A and B share similar patterns, we leverage BioCLIP as our feature extractor to capture features based on their taxonomy. Consequently, the algorithm designed for species A can be transferred to B, as their hybrid and non-hybrid patterns exhibit similar relationships. To determine whether a butterfly is a hybrid, we adopt proposed probability filtering and color jittering to augment and simulate the mimicry. With these approaches, we achieve second place in the official development phase. Our code is publicly available at this https URL.

[AI-19] Market-Oriented Flow Allocation for Thermal Solar Plants: An Auction-Based Methodology with Artificial Intelligence

链接: https://arxiv.org/abs/2504.01652
作者: Sara Ruiz-Moreno,Antonio J. Gallego,Antonio J. Gallego,Antonio J. Gallego
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注: This manuscript has been submitted to Renewable Energy

点击查看摘要

Abstract:This paper presents a novel method to optimize thermal balance in parabolic trough collector (PTC) plants. It uses a market-based system to distribute flow among loops combined with an artificial neural network (ANN) to reduce computation and data requirements. This auction-based approach balances loop temperatures, accommodating varying thermal losses and collector efficiencies. Validation across different thermal losses, optical efficiencies, and irradiance conditions-sunny, partially cloudy, and cloudy-show improved thermal power output and intercept factors compared to a no-allocation system. It demonstrates scalability and practicality for large solar thermal plants, enhancing overall performance. The method was first validated through simulations on a realistic solar plant model, then adapted and successfully tested in a 50 MW solar trough plant, demonstrating its advantages. Furthermore, the algorithms have been implemented, commissioned, and are currently operating in 13 commercial solar trough plants.

[AI-20] Proposition of Affordance-Driven Environment Recognition Framework Using Symbol Networks in Large Language Models

链接: https://arxiv.org/abs/2504.01644
作者: Kazuma Arii,Satoshi Kurihara
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In the quest to enable robots to coexist with humans, understanding dynamic situations and selecting appropriate actions based on common sense and affordances are essential. Conventional AI systems face challenges in applying affordance, as it represents implicit knowledge derived from common sense. However, large language models (LLMs) offer new opportunities due to their ability to process extensive human knowledge. This study proposes a method for automatic affordance acquisition by leveraging LLM outputs. The process involves generating text using LLMs, reconstructing the output into a symbol network using morphological and dependency analysis, and calculating affordances based on network distances. Experiments using ``apple’’ as an example demonstrated the method’s ability to extract context-dependent affordances with high explainability. The results suggest that the proposed symbol network, reconstructed from LLM outputs, enables robots to interpret affordances effectively, bridging the gap between symbolized data and human-like situational understanding.

[AI-21] LLM -mediated Dynamic Plan Generation with a Multi-Agent Approach

链接: https://arxiv.org/abs/2504.01637
作者: Reo Abe,Akifumi Ito,Kanata Takayasu,Satoshi Kurihara
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Planning methods with high adaptability to dynamic environments are crucial for the development of autonomous and versatile robots. We propose a method for leveraging a large language model (GPT-4o) to automatically generate networks capable of adapting to dynamic environments. The proposed method collects environmental “status,” representing conditions and goals, and uses them to generate agents. These agents are interconnected on the basis of specific conditions, resulting in networks that combine flexibility and generality. We conducted evaluation experiments to compare the networks automatically generated with the proposed method with manually constructed ones, confirming the comprehensiveness of the proposed method’s networks and their higher generality. This research marks a significant advancement toward the development of versatile planning methods applicable to robotics, autonomous vehicles, smart systems, and other complex environments.

[AI-22] Building Knowledge from Interactions: An LLM -Based Architecture for Adaptive Tutoring and Social Reasoning IROS

链接: https://arxiv.org/abs/2504.01588
作者: Luca Garello,Giulia Belgiovine,Gabriele Russo,Francesco Rea,Alessandra Sciutti
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

点击查看摘要

Abstract:Integrating robotics into everyday scenarios like tutoring or physical training requires robots capable of adaptive, socially engaging, and goal-oriented interactions. While Large Language Models show promise in human-like communication, their standalone use is hindered by memory constraints and contextual incoherence. This work presents a multimodal, cognitively inspired framework that enhances LLM-based autonomous decision-making in social and task-oriented Human-Robot Interaction. Specifically, we develop an LLM-based agent for a robot trainer, balancing social conversation with task guidance and goal-driven motivation. To further enhance autonomy and personalization, we introduce a memory system for selecting, storing and retrieving experiences, facilitating generalized reasoning based on knowledge built across different interactions. A preliminary HRI user study and offline experiments with a synthetic dataset validate our approach, demonstrating the system’s ability to manage complex interactions, autonomously drive training tasks, and build and retrieve contextual memories, advancing socially intelligent robotics.

[AI-23] Optimizing Package Delivery with Quantum Annealers: Addressing Time-Windows and Simultaneous Pickup and Delivery

链接: https://arxiv.org/abs/2504.01560
作者: Eneko Osaba,Esther Villar-Rodriguez,Pablo Miranda-Rodriguez,Antón Asla
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注: 8 pages, 1 table, 9 figures, paper submitted to the IEEE International Conference on Quantum Computing and Engineering (QCE 2025)

点击查看摘要

Abstract:Recent research at the intersection of quantum computing and routing problems has been highly prolific. Much of this work focuses on classical problems such as the Traveling Salesman Problem and the Vehicle Routing Problem. The practical applicability of these problems depends on the specific objectives and constraints considered. However, it is undeniable that translating complex real-world requirements into these classical formulations often proves challenging. In this paper, we resort to our previously published quantum-classical technique for addressing real-world-oriented routing problems, known as Quantum for Real Package Delivery (Q4RPD), and elaborate on solving additional realistic problem instances. Accordingly, this paper emphasizes the following characteristics: i) simultaneous pickup and deliveries, ii) time-windows, and iii) mobility restrictions by vehicle type. To illustrate the application of Q4RPD, we have conducted an experimentation comprising seven instances, serving as a demonstration of the newly developed features.

[AI-24] Identifying Macro Causal Effects in C-DMGs

链接: https://arxiv.org/abs/2504.01551
作者: Simon Ferreira,Charles K. Assaad
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Causal effect identification using causal graphs is a fundamental challenge in causal inference. While extensive research has been conducted in this area, most existing methods assume the availability of fully specified causal graphs. However, in complex domains such as medicine and epidemiology, complete causal knowledge is often unavailable, and only partial information about the system is accessible. This paper focuses on causal effect identification within partially specified causal graphs, with particular emphasis on cluster-directed mixed graphs (C-DMGs). These graphs provide a higher-level representation of causal relationships by grouping variables into clusters, offering a more practical approach for handling complex systems. Unlike fully specified causal graphs, C-DMGs can contain cycles, which complicate their analysis and interpretation. Furthermore, their cluster-based nature introduces new challenges, as it gives rise to two distinct types of causal effects, macro causal effects and micro causal effects, with different properties. In this work, we focus on macro causal effects, which describe the effects of entire clusters on other clusters. We establish that the do-calculus is both sound and complete for identifying these effects in C-DMGs. Additionally, we provide a graphical characterization of non-identifiability for macro causal effects in these graphs.

[AI-25] Hyperbolic Diffusion Recommender Model

链接: https://arxiv.org/abs/2504.01541
作者: Meng Yuan,Yutian Xiao,Wei Chen,Chu Zhao,Deqing Wang,Fuzhen Zhuang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have emerged as the new state-of-the-art family of deep generative models. To gain deeper insights into the limitations of diffusion models in recommender systems, we investigate the fundamental structural disparities between images and items. Consequently, items often exhibit distinct anisotropic and directional structures that are less prevalent in images. However, the traditional forward diffusion process continuously adds isotropic Gaussian noise, causing anisotropic signals to degrade into noise, which impairs the semantically meaningful representations in recommender systems. Inspired by the advancements in hyperbolic spaces, we propose a novel \textit\textbfHyperbolic \textit\textbfDiffusion \textit\textbfRecommender \textit\textbfModel (named HDRM). Unlike existing directional diffusion methods based on Euclidean space, the intrinsic non-Euclidean structure of hyperbolic space makes it particularly well-adapted for handling anisotropic diffusion processes. In particular, we begin by formulating concepts to characterize latent directed diffusion processes within a geometrically grounded hyperbolic space. Subsequently, we propose a novel hyperbolic latent diffusion process specifically tailored for users and items. Drawing upon the natural geometric attributes of hyperbolic spaces, we impose structural restrictions on the space to enhance hyperbolic diffusion propagation, thereby ensuring the preservation of the intrinsic topology of user-item graphs. Extensive experiments on three benchmark datasets demonstrate the effectiveness of HDRM. Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.01541 [cs.IR] (or arXiv:2504.01541v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.01541 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-26] AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge

链接: https://arxiv.org/abs/2504.01538
作者: You-Le Fang,Dong-Shan Jian,Xiang Li,Yan-Qing Ma
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC); High Energy Physics - Phenomenology (hep-ph); Classical Physics (physics.class-ph)
*备注: 31 pages, 5 figures

点击查看摘要

Abstract:Current limitations in human scientific discovery necessitate a new research paradigm. While advances in artificial intelligence (AI) offer a highly promising solution, enabling AI to emulate human-like scientific discovery remains an open challenge. To address this, we propose AI-Newton, a concept-driven discovery system capable of autonomously deriving physical laws from raw data – without supervision or prior physical knowledge. The system integrates a knowledge base and knowledge representation centered on physical concepts, along with an autonomous discovery workflow. As a proof of concept, we apply AI-Newton to a large set of Newtonian mechanics problems. Given experimental data with noise, the system successfully rediscovers fundamental laws, including Newton’s second law, energy conservation and law of gravitation, using autonomously defined concepts. This achievement marks a significant step toward AI-driven autonomous scientific discovery.

[AI-27] HH-PIM: Dynamic Optimization of Power and Performance with Heterogeneous-Hybrid PIM for Edge AI Devices

链接: https://arxiv.org/abs/2504.01468
作者: Sangmin Jeon,Kangju Lee,Kyeongwon Lee,Woojoo Lee
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 7 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Processing-in-Memory (PIM) architectures offer promising solutions for efficiently handling AI applications in energy-constrained edge environments. While traditional PIM designs enhance performance and energy efficiency by reducing data movement between memory and processing units, they are limited in edge devices due to continuous power demands and the storage requirements of large neural network weights in SRAM and DRAM. Hybrid PIM architectures, incorporating non-volatile memories like MRAM and ReRAM, mitigate these limitations but struggle with a mismatch between fixed computing resources and dynamically changing inference workloads. To address these challenges, this study introduces a Heterogeneous-Hybrid PIM (HH-PIM) architecture, comprising high-performance MRAM-SRAM PIM modules and low-power MRAM-SRAM PIM modules. We further propose a data placement optimization algorithm that dynamically allocates data based on computational demand, maximizing energy efficiency. FPGA prototyping and power simulations with processors featuring HH-PIM and other PIM types demonstrate that the proposed HH-PIM achieves up to 60.43 percent average energy savings over conventional PIMs while meeting application latency requirements. These results confirm the suitability of HH-PIM for adaptive, energy-efficient AI processing in edge devices.

[AI-28] Probabilistic Curriculum Learning for Goal-Based Reinforcement Learning

链接: https://arxiv.org/abs/2504.01459
作者: Llewyn Salt,Marcus Gallagher
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) – algorithms that teach artificial agents to interact with environments by maximising reward signals – has achieved significant success in recent years. These successes have been facilitated by advances in algorithms (e.g., deep Q-learning, deep deterministic policy gradients, proximal policy optimisation, trust region policy optimisation, and soft actor-critic) and specialised computational resources such as GPUs and TPUs. One promising research direction involves introducing goals to allow multimodal policies, commonly through hierarchical or curriculum reinforcement learning. These methods systematically decompose complex behaviours into simpler sub-tasks, analogous to how humans progressively learn skills (e.g. we learn to run before we walk, or we learn arithmetic before calculus). However, fully automating goal creation remains an open challenge. We present a novel probabilistic curriculum learning algorithm to suggest goals for reinforcement learning agents in continuous control and navigation tasks.

[AI-29] Enabling Systematic Generalization in Abstract Spatial Reasoning through Meta-Learning for Compositionality

链接: https://arxiv.org/abs/2504.01445
作者: Philipp Mondorf,Shijia Zhou,Monica Riedler,Barbara Plank
类目: Artificial Intelligence (cs.AI)
*备注: 30 pages, 14 figures

点击查看摘要

Abstract:Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend the approach of meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce \textitSYGAR -a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions, significantly outperforming state-of-the-art LLMs, including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.

[AI-30] PiCo: Jailbreaking Multimodal Large Language Models via textbfPictorial textbfCode Contextualization

链接: https://arxiv.org/abs/2504.01444
作者: Aofan Liu,Lulu Tang,Ting Pan,Yuguo Yin,Bin Wang,Ao Yang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs), which integrate vision and other modalities into Large Language Models (LLMs), significantly enhance AI capabilities but also introduce new security vulnerabilities. By exploiting the vulnerabilities of the visual modality and the long-tail distribution characteristic of code training data, we present PiCo, a novel jailbreaking framework designed to progressively bypass multi-tiered defense mechanisms in advanced MLLMs. PiCo employs a tier-by-tier jailbreak strategy, using token-level typographic attacks to evade input filtering and embedding harmful intent within programming context instructions to bypass runtime monitoring. To comprehensively assess the impact of attacks, a new evaluation metric is further proposed to assess both the toxicity and helpfulness of model outputs post-attack. By embedding harmful intent within code-style visual instructions, PiCo achieves an average Attack Success Rate (ASR) of 84.13% on Gemini-Pro Vision and 52.66% on GPT-4, surpassing previous methods. Experimental results highlight the critical gaps in current defenses, underscoring the need for more robust strategies to secure advanced MLLMs.

[AI-31] From Easy to Hard: Building a Shortcut for Differentially Private Image Synthesis

链接: https://arxiv.org/abs/2504.01395
作者: Kecen Li,Chen Gong,Xiaochen Li,Yuzhong Zhao,Xinwen Hou,Tianhao Wang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Accepted at IEEE SP (Oakland) 2025; code available at this https URL

点击查看摘要

Abstract:Differentially private (DP) image synthesis aims to generate synthetic images from a sensitive dataset, alleviating the privacy leakage concerns of organizations sharing and utilizing synthetic images. Although previous methods have significantly progressed, especially in training diffusion models on sensitive images with DP Stochastic Gradient Descent (DP-SGD), they still suffer from unsatisfactory performance. In this work, inspired by curriculum learning, we propose a two-stage DP image synthesis framework, where diffusion models learn to generate DP synthetic images from easy to hard. Unlike existing methods that directly use DP-SGD to train diffusion models, we propose an easy stage in the beginning, where diffusion models learn simple features of the sensitive images. To facilitate this easy stage, we propose to use `central images’, simply aggregations of random samples of the sensitive dataset. Intuitively, although those central images do not show details, they demonstrate useful characteristics of all images and only incur minimal privacy costs, thus helping early-phase model training. We conduct experiments to present that on the average of four investigated image datasets, the fidelity and utility metrics of our synthetic images are 33.1% and 2.1% better than the state-of-the-art method.

[AI-32] Virtual Reality and Artificial Intelligence as Psychological Countermeasures in Space and Other Isolated and Confined Environments: A Scoping Review

链接: https://arxiv.org/abs/2504.01366
作者: Jennifer Sharp,Joshua Kelson,Daryl South,Anthony Saliba,Muhammad Ashad Kabir
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: 34 pages

点击查看摘要

Abstract:Spaceflight is an isolated and confined environment (ICE) that exposes astronauts to psychological hazards, such as stress, danger, and monotony. Virtual reality (VR) and artificial intelligence (AI) technologies can serve as psychological countermeasures as they can digitally simulate immersive environments, interactive companions, and therapeutic experiences. Our study employs a scoping literature review approach to identify what is currently known about the use and effectiveness of VR and AI-based interventions as psychological countermeasures to improve mood or emotional states in adults in space or other ICEs. Additionally, this review aimed to identify gaps in the knowledge base and whether a systematic review with meta-analysis was warranted. The review included studies where the intervention was used or intended for use in space or other extraterrestrial environments (ICE). Our search strategy yielded 19 studies from 3390 records across seven major databases. All studies focused on VR-based interventions, with no eligible AI-based intervention studies found. VR interventions were found to be effective for relaxation and improving mood, emergency training, as an interactive communication platform, for comparing interior designs, and for enhancing exercise. There were improvements for measures of mood and emotion\n (e.g., anxiety and stress); however, user preferences varied, and some instances of cybersickness were reported. A systematic review with meta-analysis is not recommended due to the heterogeneity of results. There is significant scope for further research into the use of VR for a wider range of mood and emotion variables using standardised assessment instruments. Additionally, the potential application of AI as a psychological countermeasure warrants further investigation.

[AI-33] An Explainable Reconfiguration-Based Optimization Algorithm for Industrial and Reliability-Redundancy Allocation Problems

链接: https://arxiv.org/abs/2504.01331
作者: Dikshit Chauhan,Nitin Gupta,Anupam Yadav
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 38 pages, 12 figures

点击查看摘要

Abstract:Industrial and reliability optimization problems often involve complex constraints and require efficient, interpretable solutions. This paper presents AI-AEFA, an advanced parameter reconfiguration-based metaheuristic algorithm designed to address large-scale industrial and reliability-redundancy allocation problems. AI-AEFA enhances search space exploration and convergence efficiency through a novel log-sigmoid-based parameter adaptation and chaotic mapping mechanism. The algorithm is validated across twenty-eight IEEE CEC 2017 constrained benchmark problems, fifteen large-scale industrial optimization problems, and seven reliability-redundancy allocation problems, consistently outperforming state-of-the-art optimization techniques in terms of feasibility, computational efficiency, and convergence speed. The additional key contribution of this work is the integration of SHAP (Shapley Additive Explanations) to enhance the interpretability of AI-AEFA, providing insights into the impact of key parameters such as Coulomb’s constant, charge, acceleration, and electrostatic force. This explainability feature enables a deeper understanding of decision-making within the AI-AEFA framework during the optimization processes. The findings confirm AI-AEFA as a robust, scalable, and interpretable optimization tool with significant real-world applications.

[AI-34] Strategize Globally Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning

链接: https://arxiv.org/abs/2504.01278
作者: Si Chen,Xiao Yu,Ninareh Mehrabi,Rahul Gupta,Zhou Yu,Ruoxi Jia
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on threat model responses. In this paper, we propose \AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions: global tactic-wise learning that accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning that refines implementations for specific goals when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, \AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework’s superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.

[AI-35] Dynamic Graph Structure Estimation for Learning Multivariate Point Process using Spiking Neural Networks

链接: https://arxiv.org/abs/2504.01246
作者: Biswadeep Chakraborty,Hemant Kumawat,Beomseok Kang,Saibal Mukhopadhyay
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 3 figures

点击查看摘要

Abstract:Modeling and predicting temporal point processes (TPPs) is critical in domains such as neuroscience, epidemiology, finance, and social sciences. We introduce the Spiking Dynamic Graph Network (SDGN), a novel framework that leverages the temporal processing capabilities of spiking neural networks (SNNs) and spike-timing-dependent plasticity (STDP) to dynamically estimate underlying spatio-temporal functional graphs. Unlike existing methods that rely on predefined or static graph structures, SDGN adapts to any dataset by learning dynamic spatio-temporal dependencies directly from the event data, enhancing generalizability and robustness. While SDGN offers significant improvements over prior methods, we acknowledge its limitations in handling dense graphs and certain non-Gaussian dependencies, providing opportunities for future refinement. Our evaluations, conducted on both synthetic and real-world datasets including NYC Taxi, 911, Reddit, and Stack Overflow, demonstrate that SDGN achieves superior predictive accuracy while maintaining computational efficiency. Furthermore, we include ablation studies to highlight the contributions of its core components.

[AI-36] Off-Policy Evaluation for Sequential Persuasion Process with Unobserved Confounding

链接: https://arxiv.org/abs/2504.01211
作者: Nishanth Venkatesh S.,Heeseung Bang,Andreas A. Malikopoulos
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注: 8 pages, 4 Figures

点击查看摘要

Abstract:In this paper, we expand the Bayesian persuasion framework to account for unobserved confounding variables in sender-receiver interactions. While traditional models assume that belief updates follow Bayesian principles, real-world scenarios often involve hidden variables that impact the receiver’s belief formation and decision-making. We conceptualize this as a sequential decision-making problem, where the sender and receiver interact over multiple rounds. In each round, the sender communicates with the receiver, who also interacts with the environment. Crucially, the receiver’s belief update is affected by an unobserved confounding variable. By reformulating this scenario as a Partially Observable Markov Decision Process (POMDP), we capture the sender’s incomplete information regarding both the dynamics of the receiver’s beliefs and the unobserved confounder. We prove that finding an optimal observation-based policy in this POMDP is equivalent to solving for an optimal signaling strategy in the original persuasion framework. Furthermore, we demonstrate how this reformulation facilitates the application of proximal learning for off-policy evaluation in the persuasion process. This advancement enables the sender to evaluate alternative signaling strategies using only observational data from a behavioral policy, thus eliminating the necessity for costly new experiments.

[AI-37] Neural Approaches to SAT Solving: Design Choices and Interpretability

链接: https://arxiv.org/abs/2504.01173
作者: David Mojžíšek,Jan Hůla,Ziwei Li,Ziyu Zhou,Mikoláš Janota
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this contribution, we provide a comprehensive evaluation of graph neural networks applied to Boolean satisfiability problems, accompanied by an intuitive explanation of the mechanisms enabling the model to generalize to different instances. We introduce several training improvements, particularly a novel closest assignment supervision method that dynamically adapts to the model’s current state, significantly enhancing performance on problems with larger solution spaces. Our experiments demonstrate the suitability of variable-clause graph representations with recurrent neural network updates, which achieve good accuracy on SAT assignment prediction while reducing computational demands. We extend the base graph neural network into a diffusion model that facilitates incremental sampling and can be effectively combined with classical techniques like unit propagation. Through analysis of embedding space patterns and optimization trajectories, we show how these networks implicitly perform a process very similar to continuous relaxations of MaxSAT, offering an interpretable view of their reasoning process. This understanding guides our design choices and explains the ability of recurrent architectures to scale effectively at inference time beyond their training distribution, which we demonstrate with test-time scaling experiments.

[AI-38] Remember but also Forget: Bridging Myopic and Perfect Recall Fairness with Past-Discounting

链接: https://arxiv.org/abs/2504.01154
作者: Ashwin Kumar,William Yeoh
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Dynamic resource allocation in multi-agent settings often requires balancing efficiency with fairness over time–a challenge inadequately addressed by conventional, myopic fairness measures. Motivated by behavioral insights that human judgments of fairness evolve with temporal distance, we introduce a novel framework for temporal fairness that incorporates past-discounting mechanisms. By applying a tunable discount factor to historical utilities, our approach interpolates between instantaneous and perfect-recall fairness, thereby capturing both immediate outcomes and long-term equity considerations. Beyond aligning more closely with human perceptions of fairness, this past-discounting method ensures that the augmented state space remains bounded, significantly improving computational tractability in sequential decision-making settings. We detail the formulation of discounted-recall fairness in both additive and averaged utility contexts, illustrate its benefits through practical examples, and discuss its implications for designing balanced, scalable resource allocation strategies.

[AI-39] Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations

链接: https://arxiv.org/abs/2504.01153
作者: Mahjabin Nahar,Eun-Ju Lee,Jin Won Park,Dongwon Lee
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While we increasingly rely on large language models (LLMs) for various tasks, these models are known to produce inaccurate content or ‘hallucinations’ with potentially disastrous consequences. The recent integration of web search results into LLMs prompts the question of whether people utilize them to verify the generated content, thereby avoiding falling victim to hallucinations. This study (N = 560) investigated how the provision of search results, either static (fixed search results) or dynamic (participant-driven searches), affect participants’ perceived accuracy and confidence in evaluating LLM-generated content (i.e., genuine, minor hallucination, major hallucination), compared to the control condition (no search results). Findings indicate that participants in both static and dynamic conditions (vs. control) rated hallucinated content to be less accurate. However, those in the dynamic condition rated genuine content as more accurate and demonstrated greater overall confidence in their assessments than those in the static or control conditions. In addition, those higher in need for cognition (NFC) rated major hallucinations to be less accurate than low NFC participants, with no corresponding difference for genuine content or minor hallucinations. These results underscore the potential benefits of integrating web search results into LLMs for the detection of hallucinations, as well as the need for a more nuanced approach when developing human-centered systems, taking user characteristics into account.

[AI-40] ffstruc2vec: Flat Flexible and Scalable Learning of Node Representations from Structural Identities

链接: https://arxiv.org/abs/2504.01122
作者: Mario Heidrich,Jeffrey Heidemann,Rüdiger Buchkremer,Gonzalo Wandosell Fernández de Bobadilla
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Node embedding refers to techniques that generate low-dimensional vector representations of nodes in a graph while preserving specific properties of the nodes. A key challenge in the field is developing scalable methods that can preserve structural properties suitable for the required types of structural patterns of a given downstream application task. While most existing methods focus on preserving node proximity, those that do preserve structural properties often lack the flexibility to preserve various types of structural patterns required by downstream application tasks. This paper introduces ffstruc2vec, a scalable deep-learning framework for learning node embedding vectors that preserve structural identities. Its flat, efficient architecture allows high flexibility in capturing diverse types of structural patterns, enabling broad adaptability to various downstream application tasks. The proposed framework significantly outperforms existing approaches across diverse unsupervised and supervised tasks in practical applications. Moreover, ffstruc2vec enables explainability by quantifying how individual structural patterns influence task outcomes, providing actionable interpretation. To our knowledge, no existing framework combines this level of flexibility, scalability, and structural interpretability, underscoring its unique capabilities.

[AI-41] Hard-constraining Neumann boundary conditions in physics-informed neural networks via Fourier feature embeddings

链接: https://arxiv.org/abs/2504.01093
作者: Christopher Straub,Philipp Brendel,Vlad Medvedev,Andreas Rosskopf
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注: 13 pages, 3 figures, 3 tables

点击查看摘要

Abstract:We present a novel approach to hard-constrain Neumann boundary conditions in physics-informed neural networks (PINNs) using Fourier feature embeddings. Neumann boundary conditions are used to described critical processes in various application, yet they are more challenging to hard-constrain in PINNs than Dirichlet conditions. Our method employs specific Fourier feature embeddings to directly incorporate Neumann boundary conditions into the neural network’s architecture instead of learning them. The embedding can be naturally extended by high frequency modes to better capture high frequency phenomena. We demonstrate the efficacy of our approach through experiments on a diffusion problem, for which our method outperforms existing hard-constraining methods and classical PINNs, particularly in multiscale and high frequency scenarios.

[AI-42] HomeEmergency – Using Audio to Find and Respond to Emergencies in the Home

链接: https://arxiv.org/abs/2504.01089
作者: James F. Mullen Jr,Dhruva Kumar,Xuewei Qi,Rajasimman Madhivanan,Arnie Sen,Dinesh Manocha,Richard Kim
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the United States alone accidental home deaths exceed 128,000 per year. Our work aims to enable home robots who respond to emergency scenarios in the home, preventing injuries and deaths. We introduce a new dataset of household emergencies based in the ThreeDWorld simulator. Each scenario in our dataset begins with an instantaneous or periodic sound which may or may not be an emergency. The agent must navigate the multi-room home scene using prior observations, alongside audio signals and images from the simulator, to determine if there is an emergency or not. In addition to our new dataset, we present a modular approach for localizing and identifying potential home emergencies. Underpinning our approach is a novel probabilistic dynamic scene graph (P-DSG), where our key insight is that graph nodes corresponding to agents can be represented with a probabilistic edge. This edge, when refined using Bayesian inference, enables efficient and effective localization of agents in the scene. We also utilize multi-modal vision-language models (VLMs) as a component in our approach, determining object traits (e.g. flammability) and identifying emergencies. We present a demonstration of our method completing a real-world version of our task on a consumer robot, showing the transferability of both our task and our method. Our dataset will be released to the public upon this papers publication. Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.01089 [cs.RO] (or arXiv:2504.01089v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2504.01089 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-43] Are clinicians ethically obligated to disclose their use of medical machine learning systems to patients?

链接: https://arxiv.org/abs/2504.01043
作者: Joshua Hatherley
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 21 pages. Journal of Medical Ethics, forthcoming 2024

点击查看摘要

Abstract:It is commonly accepted that clinicians are ethically obligated to disclose their use of medical machine learning systems to patients, and that failure to do so would amount to a moral fault for which clinicians ought to be held accountable. Call this “the disclosure thesis.” Four main arguments have been, or could be, given to support the disclosure thesis in the ethics literature: the risk-based argument, the rights-based argument, the materiality argument, and the autonomy argument. In this article, I argue that each of these four arguments are unconvincing, and therefore, that the disclosure thesis ought to be rejected. I suggest that mandating disclosure may also even risk harming patients by providing stakeholders with a way to avoid accountability for harm that results from improper applications or uses of these systems.

[AI-44] One Person One Bot

链接: https://arxiv.org/abs/2504.01039
作者: Liat Lavi
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:This short paper puts forward a vision for a new democratic model enabled by the recent technological advances in agentic AI. It therefore opens with drawing a clear and concise picture of the model, and only later addresses related proposals and research directions, and concerns regarding feasibility and safety. It ends with a note on the timeliness of this idea and on optimism. The model proposed is that of assigning each citizen an AI Agent that would serve as their political delegate, enabling the return to direct democracy. The paper examines this models relation to existing research, its potential setbacks and feasibility and argues for its further development.

[AI-45] Artificial intelligence and democracy: Towards digital authoritarianism or a democratic upgrade?

链接: https://arxiv.org/abs/2504.01034
作者: Fereniki Panagopoulou
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Do robots vote? Do machines make decisions instead of us? No, (at least not yet), but this is something that could happen. The impact of Artificial Intelligence (AI) on democracy is a complex issue that requires thorough research and careful regulation. At the most important level, that of the electoral process, it is noted that it is not determined by the AI, but it is greatly impacted by its multiple applications. New types of online campaigns, driven by AI applications, are replacing traditional ones. The potential for manipulating voters and indirectly influencing the electoral outcome should not be underestimated. Certainly, instances of voter manipulation are not absent from traditional political campaigns, with the only difference being that digital manipulation is often carried out without our knowledge, e.g. by monitoring our behavior on social media. Nevertheless, we should not overlook the positive impact that AI has in the upgrading of democratic institutions by providing a forum for participation in decision-making. In this context, as a first step, we look into the potential jeopardization of democratic processes posed by the use of AI tools. Secondly, we consider the possibility of strengthening democratic processes by using AI, as well as the democratization of AI itself through the possibilities it offers. And thirdly, the impact of AI on the representative system is also discussed. The paper is concluded with recommendations and conclusions.

[AI-46] Who Owns the Output? Bridging Law and Technology in LLM s Attribution

链接: https://arxiv.org/abs/2504.01032
作者: Emanuele Mezzi,Asimina Mertzani,Michael P. Manis,Siyanna Lilova,Nicholas Vadivoulis,Stamatis Gatirdakis,Styliani Roussou,Rodayna Hmede
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 20 pages, 1 figure

点击查看摘要

Abstract:Since the introduction of ChatGPT in 2022, Large language models (LLMs) and Large Multimodal Models (LMM) have transformed content creation, enabling the generation of human-quality content, spanning every medium, text, images, videos, and audio. The chances offered by generative AI models are endless and are drastically reducing the time required to generate content and usually raising the quality of the generation. However, considering the complexity and the difficult traceability of the generated content, the use of these tools provides challenges in attributing AI-generated content. The difficult attribution resides for a variety of reasons, starting from the lack of a systematic fingerprinting of the generated content and ending with the enormous amount of data on which LLMs and LMM are trained, which makes it difficult to connect generated content to the training data. This scenario is raising concerns about intellectual property and ethical responsibilities. To address these concerns, in this paper, we bridge the technological, ethical, and legislative aspects, by proposing a review of the legislative and technological instruments today available and proposing a legal framework to ensure accountability. In the end, we propose three use cases of how these can be combined to guarantee that attribution is respected. However, even though the techniques available today can guarantee a greater attribution to a greater extent, strong limitations still apply, that can be solved uniquely by the development of new attribution techniques, to be applied to LLMs and LMMs.

[AI-47] Who is Responsible When AI Fails? Mapping Causes Entities and Consequences of AI Privacy and Ethical Incidents

链接: https://arxiv.org/abs/2504.01029
作者: Hilda Hadan,Reza Hadi Mogavi,Leah Zhang-Kennedy,Lennart E. Nacke
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Databases (cs.DB); Human-Computer Interaction (cs.HC)
*备注: 63 pages, 7 tables, 7 figures

点击查看摘要

Abstract:The rapid growth of artificial intelligence (AI) technologies has changed decision-making in many fields. But, it has also raised major privacy and ethical concerns. However, many AI incidents taxonomies and guidelines for academia, industry, and government lack grounding in real-world incidents. We analyzed 202 real-world AI privacy and ethical incidents. This produced a taxonomy that classifies incident types across AI lifecycle stages. It accounts for contextual factors such as causes, responsible entities, disclosure sources, and impacts. Our findings show insufficient incident reporting from AI developers and users. Many incidents are caused by poor organizational decisions and legal non-compliance. Only a few legal actions and corrective measures exist, while risk-mitigation efforts are limited. Our taxonomy contributes a structured approach in reporting of future AI incidents. Our findings demonstrate that current AI governance frameworks are inadequate. We urgently need child-specific protections and AI policies on social media. They must moderate and reduce the spread of harmful AI-generated content. Our research provides insights for policymakers and practitioners, which lets them design ethical AI. It also support AI incident detection and risk management. Finally, it guides AI policy development. Improved policies will protect people from harmful AI applications and support innovation in AI systems.

[AI-48] Segmentation variability and radiomics stability for predicting Triple-Negative Breast Cancer subtype using Magnetic Resonance Imaging

链接: https://arxiv.org/abs/2504.01692
作者: Isabella Cama,Alejandro Guzmán,Cristina Campi,Michele Piana,Karim Lekadir,Sara Garbarino,Oliver Díaz
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
*备注: 22 pages, 7 figures

点击查看摘要

Abstract:Most papers caution against using predictive models for disease stratification based on unselected radiomic features, as these features are affected by contouring variability. Instead, they advocate for the use of the Intraclass Correlation Coefficient (ICC) as a measure of stability for feature selection. However, the direct effect of segmentation variability on the predictive models is rarely studied. This study investigates the impact of segmentation variability on feature stability and predictive performance in radiomics-based prediction of Triple-Negative Breast Cancer (TNBC) subtype using Magnetic Resonance Imaging. A total of 244 images from the Duke dataset were used, with segmentation variability introduced through modifications of manual segmentations. For each mask, explainable radiomic features were selected using the Shapley Additive exPlanations method and used to train logistic regression models. Feature stability across segmentations was assessed via ICC, Pearson’s correlation, and reliability scores quantifying the relationship between feature stability and segmentation variability. Results indicate that segmentation accuracy does not significantly impact predictive performance. While incorporating peritumoral information may reduce feature reproducibility, it does not diminish feature predictive capability. Moreover, feature selection in predictive models is not inherently tied to feature stability with respect to segmentation, suggesting that an overreliance on ICC or reliability scores for feature selection might exclude valuable predictive features.

[AI-49] K-P Quantum Neural Networks

链接: https://arxiv.org/abs/2504.01673
作者: Elija Perrier
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: Under review

点击查看摘要

Abstract:We present an extension of K-P time-optimal quantum control solutions using global Cartan KAK decompositions for geodesic-based solutions. Extending recent time-optimal \emphconstant- \theta control results, we integrate Cartan methods into equivariant quantum neural network (EQNN) for quantum control tasks. We show that a finite-depth limited EQNN ansatz equipped with Cartan layers can replicate the constant- \theta sub-Riemannian geodesics for K-P problems. We demonstrate how for certain classes of control problem on Riemannian symmetric spaces, gradient-based training using an appropriate cost function converges to certain global time-optimal solutions when satisfying simple regularity conditions. This generalises prior geometric control theory methods and clarifies how optimal geodesic estimation can be performed in quantum machine learning contexts.

机器学习

[LG-0] A Unified Approach to Analysis and Design of Denoising Markov Models

链接: https://arxiv.org/abs/2504.01938
作者: Yinuo Ren,Grant M. Rotskoff,Lexing Ying
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Probabilistic generative models based on measure transport, such as diffusion and flow-based models, are often formulated in the language of Markovian stochastic dynamics, where the choice of the underlying process impacts both algorithmic design choices and theoretical analysis. In this paper, we aim to establish a rigorous mathematical foundation for denoising Markov models, a broad class of generative models that postulate a forward process transitioning from the target distribution to a simple, easy-to-sample distribution, alongside a backward process particularly constructed to enable efficient sampling in the reverse direction. Leveraging deep connections with nonequilibrium statistical mechanics and generalized Doob’s h -transform, we propose a minimal set of assumptions that ensure: (1) explicit construction of the backward generator, (2) a unified variational objective directly minimizing the measure transport discrepancy, and (3) adaptations of the classical score-matching approach across diverse dynamics. Our framework unifies existing formulations of continuous and discrete diffusion models, identifies the most general form of denoising Markov models under certain regularity assumptions on forward generators, and provides a systematic recipe for designing denoising Markov models driven by arbitrary Lévy-type processes. We illustrate the versatility and practical effectiveness of our approach through novel denoising Markov models employing geometric Brownian motion and jump processes as forward dynamics, highlighting the framework’s potential flexibility and capability in modeling complex distributions.

[LG-1] Hessian-aware Training for Enhancing DNNs Resilience to Parameter Corruptions

链接: https://arxiv.org/abs/2504.01933
作者: Tahmid Hasan Prato,Seijoon Kim,Lizhong Chen,Sanghyun Hong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Pre-print

点击查看摘要

Abstract:Deep neural networks are not resilient to parameter corruptions: even a single-bitwise error in their parameters in memory can cause an accuracy drop of over 10%, and in the worst cases, up to 99%. This susceptibility poses great challenges in deploying models on computing platforms, where adversaries can induce bit-flips through software or bitwise corruptions may occur naturally. Most prior work addresses this issue with hardware or system-level approaches, such as integrating additional hardware components to verify a model’s integrity at inference. However, these methods have not been widely deployed as they require infrastructure or platform-wide modifications. In this paper, we propose a new approach to addressing this issue: training models to be more resilient to bitwise corruptions to their parameters. Our approach, Hessian-aware training, promotes models with flatter loss surfaces. We show that, while there have been training methods, designed to improve generalization through Hessian-based approaches, they do not enhance resilience to parameter corruptions. In contrast, models trained with our method demonstrate increased resilience to parameter corruptions, particularly with a 20 - 50% reduction in the number of bits whose individual flipping leads to a 90 - 100% accuracy drop. Moreover, we show the synergy between ours and existing hardware and system-level defenses. Comments: Pre-print Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG) Cite as: arXiv:2504.01933 [cs.CR] (or arXiv:2504.01933v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2504.01933 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-2] Gen-C: Populating Virtual Worlds with Generative Crowds

链接: https://arxiv.org/abs/2504.01924
作者: Andreas Panayiotou,Panayiotis Charalambous,Ioannis Karamouzas
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 11 pages

点击查看摘要

Abstract:Over the past two decades, researchers have made significant advancements in simulating human crowds, yet these efforts largely focus on low-level tasks like collision avoidance and a narrow range of behaviors such as path following and flocking. However, creating compelling crowd scenes demands more than just functional movement-it requires capturing high-level interactions between agents, their environment, and each other over time. To address this issue, we introduce Gen-C, a generative model to automate the task of authoring high-level crowd behaviors. Gen-C bypasses the labor-intensive and challenging task of collecting and annotating real crowd video data by leveraging a large language model (LLM) to generate a limited set of crowd scenarios, which are subsequently expanded and generalized through simulations to construct time-expanded graphs that model the actions and interactions of virtual agents. Our method employs two Variational Graph Auto-Encoders guided by a condition prior network: one dedicated to learning a latent space for graph structures (agent interactions) and the other for node features (agent actions and navigation). This setup enables the flexible generation of dynamic crowd interactions. The trained model can be conditioned on natural language, empowering users to synthesize novel crowd behaviors from text descriptions. We demonstrate the effectiveness of our approach in two scenarios, a University Campus and a Train Station, showcasing its potential for populating diverse virtual environments with agents exhibiting varied and dynamic behaviors that reflect complex interactions and high-level decision-making patterns.

[LG-3] Client Selection in Federated Learning with Data Heterogeneity and Network Latencies

链接: https://arxiv.org/abs/2504.01921
作者: Harsh Vardhan,Xiaofan Yu,Tajana Rosing,Arya Mazumdar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning paradigm where multiple clients conduct local training based on their private data, then the updated models are sent to a central server for global aggregation. The practical convergence of FL is challenged by multiple factors, with the primary hurdle being the heterogeneity among clients. This heterogeneity manifests as data heterogeneity concerning local data distribution and latency heterogeneity during model transmission to the server. While prior research has introduced various efficient client selection methods to alleviate the negative impacts of either of these heterogeneities individually, efficient methods to handle real-world settings where both these heterogeneities exist simultaneously do not exist. In this paper, we propose two novel theoretically optimal client selection schemes that can handle both these heterogeneities. Our methods involve solving simple optimization problems every round obtained by minimizing the theoretical runtime to convergence. Empirical evaluations on 9 datasets with non-iid data distributions, 2 practical delay distributions, and non-convex neural network models demonstrate that our algorithms are at least competitive to and at most 20 times better than best existing baselines.

[LG-4] Representing Flow Fields with Divergence-Free Kernels for Reconstruction

链接: https://arxiv.org/abs/2504.01913
作者: Xingyu Ni,Jingrui Xing,Xingqiao Li,Bin Wang,Baoquan Chen
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately reconstructing continuous flow fields from sparse or indirect measurements remains an open challenge, as existing techniques often suffer from oversmoothing artifacts, reliance on heterogeneous architectures, and the computational burden of enforcing physics-informed losses in implicit neural representations (INRs). In this paper, we introduce a novel flow field reconstruction framework based on divergence-free kernels (DFKs), which inherently enforce incompressibility while capturing fine structures without relying on hierarchical or heterogeneous representations. Through qualitative analysis and quantitative ablation studies, we identify the matrix-valued radial basis functions derived from Wendland’s \mathcalC^4 polynomial (DFKs-Wen4) as the optimal form of analytically divergence-free approximation for velocity fields, owing to their favorable numerical properties, including compact support, positive definiteness, and second-order differentiablility. Experiments across various reconstruction tasks, spanning data compression, inpainting, super-resolution, and time-continuous flow inference, has demonstrated that DFKs-Wen4 outperform INRs and other divergence-free representations in both reconstruction accuracy and computational efficiency while requiring the fewest trainable parameters.

[LG-5] Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

链接: https://arxiv.org/abs/2504.01898
作者: Robert M. Gower,Guillaume Garrigos,Nicolas Loizou,Dimitris Oikonomou,Konstantin Mishchenko,Fabian Schaipp
类目: Machine Learning (cs.LG)
*备注: 44 pages, 7 figures

点击查看摘要

Abstract:We provide a general convergence theorem of an idealized stochastic Polyak step size called SPS ^* . Besides convexity, we only assume a local expected gradient bound, that includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS ^* as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz function, and is the first Polyak step size to have an O(1/\sqrtt) anytime convergence in the smooth setting. We show how to combine SPS ^* with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.

[LG-6] Multi-fidelity Parameter Estimation Using Conditional Diffusion Models

链接: https://arxiv.org/abs/2504.01894
作者: Caroline Tatsuoka,Minglei Yang,Dongbin Xiu,Guannan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a multi-fidelity method for uncertainty quantification of parameter estimates in complex systems, leveraging generative models trained to sample the target conditional distribution. In the Bayesian inference setting, traditional parameter estimation methods rely on repeated simulations of potentially expensive forward models to determine the posterior distribution of the parameter values, which may result in computationally intractable workflows. Furthermore, methods such as Markov Chain Monte Carlo (MCMC) necessitate rerunning the entire algorithm for each new data observation, further increasing the computational burden. Hence, we propose a novel method for efficiently obtaining posterior distributions of parameter estimates for high-fidelity models given data observations of interest. The method first constructs a low-fidelity, conditional generative model capable of amortized Bayesian inference and hence rapid posterior density approximation over a wide-range of data observations. When higher accuracy is needed for a specific data observation, the method employs adaptive refinement of the density approximation. It uses outputs from the low-fidelity generative model to refine the parameter sampling space, ensuring efficient use of the computationally expensive high-fidelity solver. Subsequently, a high-fidelity, unconditional generative model is trained to achieve greater accuracy in the target posterior distribution. Both low- and high- fidelity generative models enable efficient sampling from the target posterior and do not require repeated simulation of the high-fidelity forward model. We demonstrate the effectiveness of the proposed method on several numerical examples, including cases with multi-modal densities, as well as an application in plasma physics for a runaway electron simulation model.

[LG-7] CO-DEFEND: Continuous Decentralized Federated Learning for Secure DoH-Based Threat Detection

链接: https://arxiv.org/abs/2504.01882
作者: Diego Cajaraville-Aboy,Marta Moure-Garrido,Carlos Beis-Penedo,Carlos Garcia-Rubio,Rebeca P. Díaz-Redondo,Celeste Campo,Ana Fernández-Vilas,Manuel Fernández-Veiga
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures, 4 tables

点击查看摘要

Abstract:The use of DNS over HTTPS (DoH) tunneling by an attacker to hide malicious activity within encrypted DNS traffic poses a serious threat to network security, as it allows malicious actors to bypass traditional monitoring and intrusion detection systems while evading detection by conventional traffic analysis techniques. Machine Learning (ML) techniques can be used to detect DoH tunnels; however, their effectiveness relies on large datasets containing both benign and malicious traffic. Sharing such datasets across entities is challenging due to privacy concerns. In this work, we propose CO-DEFEND (Continuous Decentralized Federated Learning for Secure DoH-Based Threat Detection), a Decentralized Federated Learning (DFL) framework that enables multiple entities to collaboratively train a classification machine learning model while preserving data privacy and enhancing resilience against single points of failure. The proposed DFL framework, which is scalable and privacy-preserving, is based on a federation process that allows multiple entities to train online their local models using incoming DoH flows in real time as they are processed by the entity. In addition, we adapt four classical machine learning algorithms, Support Vector Machines (SVM), Logistic Regression (LR), Decision Trees (DT), and Random Forest (RF), for federated scenarios, comparing their results with more computationally complex alternatives such as neural networks. We compare our proposed method by using the dataset CIRA-CIC-DoHBrw-2020 with existing machine learning approaches to demonstrate its effectiveness in detecting malicious DoH tunnels and the benefits it brings.

[LG-8] Architect Your Landscape Approach (AYLA) for Optimizations in Deep Learning

链接: https://arxiv.org/abs/2504.01875
作者: Ben Keslaki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stochastic Gradient Descent (SGD) and its variants, such as ADAM, are foundational to deep learning optimization, adjusting model parameters using fixed or adaptive learning rates based on loss function gradients. However, these methods often face challenges in balancing adaptability and efficiency in non-convex, high-dimensional settings. This paper introduces AYLA, a novel optimization technique that enhances training dynamics through loss function transformations. By applying a tunable power-law transformation, AYLA preserves critical points while scaling loss values to amplify gradient sensitivity, accelerating convergence. We further propose a dynamic (effective) learning rate that adapts to the transformed loss, improving optimization efficiency. Empirical tests on finding minimum of a synthetic non-convex polynomial, a non-convex curve-fitting dataset, and digit classification (MNIST) demonstrate that AYLA surpasses SGD and ADAM in convergence speed and stability. This approach redefines the loss landscape for better optimization outcomes, offering a promising advancement for deep neural networks and can be applied to any optimization method and potentially improve the performance of it.

[LG-9] Corner-Grasp: Multi-Action Grasp Detection and Active Gripper Adaptation for Grasping in Cluttered Environments

链接: https://arxiv.org/abs/2504.01861
作者: Yeong Gwang Son,Seunghwan Um,Juyong Hong,Tat Hieu Bui,Hyouk Ryeol Choi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 11 pages, 14 figures

点击查看摘要

Abstract:Robotic grasping is an essential capability, playing a critical role in enabling robots to physically interact with their surroundings. Despite extensive research, challenges remain due to the diverse shapes and properties of target objects, inaccuracies in sensing, and potential collisions with the environment. In this work, we propose a method for effectively grasping in cluttered bin-picking environments where these challenges intersect. We utilize a multi-functional gripper that combines both suction and finger grasping to handle a wide range of objects. We also present an active gripper adaptation strategy to minimize collisions between the gripper hardware and the surrounding environment by actively leveraging the reciprocating suction cup and reconfigurable finger motion. To fully utilize the gripper’s capabilities, we built a neural network that detects suction and finger grasp points from a single input RGB-D image. This network is trained using a larger-scale synthetic dataset generated from simulation. In addition to this, we propose an efficient approach to constructing a real-world dataset that facilitates grasp point detection on various objects with diverse characteristics. Experiment results show that the proposed method can grasp objects in cluttered bin-picking scenarios and prevent collisions with environmental constraints such as a corner of the bin. Our proposed method demonstrated its effectiveness in the 9th Robotic Grasping and Manipulation Competition (RGMC) held at ICRA 2024.

[LG-10] shapr: Explaining Machine Learning Models with Conditional Shapley Values in R and Python

链接: https://arxiv.org/abs/2504.01842
作者: Martin Jullum,Lars Henry Berge Olsen,Jon Lachmann,Annabelle Redelmeier
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:This paper introduces the shapr package, a versatile tool for generating Shapley value explanations for machine learning and statistical regression models in both R and Python. The package emphasizes conditional Shapley value estimates, providing a comprehensive range of approaches for accurately capturing feature dependencies, which is crucial for correct model interpretation and lacking in similar software. In addition to regular tabular data, the shapr R-package includes specialized functionality for explaining time series forecasts. The package offers a minimal set of user functions with sensible defaults for most use cases while providing extensive flexibility for advanced users to fine-tune computations. Additional features include parallelized computations, iterative estimation with convergence detection, and rich visualization tools. shapr also extends its functionality to compute causal and asymmetric Shapley values when causal information is available. In addition, we introduce the shaprpy Python library, which brings core capabilities of shapr to the Python ecosystem. Overall, the package aims to enhance the interpretability of predictive models within a powerful and user-friendly framework.

[LG-11] Inference of hidden common driver dynamics by anisotropic self-organizing neural networks

链接: https://arxiv.org/abs/2504.01811
作者: Zsigmond Benkő,Marcell Stippinger,Zoltán Somogyvári
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We are introducing a novel approach to infer the underlying dynamics of hidden common drivers, based on analyzing time series data from two driven dynamical systems. The inference relies on time-delay embedding, estimation of the intrinsic dimension of the observed systems, and their mutual dimension. A key component of our approach is a new anisotropic training technique applied to Kohonen’s self-organizing map, which effectively learns the attractor of the driven system and separates it into submanifolds corresponding to the self-dynamics and shared dynamics. To demonstrate the effectiveness of our method, we conducted simulated experiments using different chaotic maps in a setup, where two chaotic maps were driven by a third map with nonlinear coupling. The inferred time series exhibited high correlation with the time series of the actual hidden common driver, in contrast to the observed systems. The quality of our reconstruction were compared and shown to be superior to several other methods that are intended to find the common features behind the observed time series, including linear methods like PCA and ICA as well as nonlinear methods like dynamical component analysis, canonical correlation analysis and even deep canonical correlation analysis. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2504.01811 [cs.LG] (or arXiv:2504.01811v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.01811 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-12] Barrier Certificates for Unknown Systems with Latent States and Polynomial Dynamics using Bayesian Inference

链接: https://arxiv.org/abs/2504.01807
作者: Robert Lefringhausen,Sami Leon Noel Aziz Hanna,Elias August,Sandra Hirche
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to the 64th IEEE Conference on Decision and Control

点击查看摘要

Abstract:Certifying safety in dynamical systems is crucial, but barrier certificates - widely used to verify that system trajectories remain within a safe region - typically require explicit system models. When dynamics are unknown, data-driven methods can be used instead, yet obtaining a valid certificate requires rigorous uncertainty quantification. For this purpose, existing methods usually rely on full-state measurements, limiting their applicability. This paper proposes a novel approach for synthesizing barrier certificates for unknown systems with latent states and polynomial dynamics. A Bayesian framework is employed, where a prior in state-space representation is updated using input-output data via a targeted marginal Metropolis-Hastings sampler. The resulting samples are used to construct a candidate barrier certificate through a sum-of-squares program. It is shown that if the candidate satisfies the required conditions on a test set of additional samples, it is also valid for the true, unknown system with high probability. The approach and its probabilistic guarantees are illustrated through a numerical simulation.

[LG-13] BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing CVPR2025

链接: https://arxiv.org/abs/2504.01786
作者: Yunqi Gu,Ian Huang,Jihyeon Je,Guandao Yang,Leonidas Guibas
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: CVPR 2025 Accepted

点击查看摘要

Abstract:3D graphics editing is crucial in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphical editing requires performing a variety of tasks, each requiring distinct skill sets. Recently, vision-language models (VLMs) have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and presents real-world editing complexity. In this work, we present BlenderGym, the first comprehensive VLM system benchmark for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D reconstruction tasks. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks relatively easy for human Blender users. Enabled by BlenderGym, we study how inference scaling techniques impact VLM’s performance on graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through inference scaling, complementing recent insights on inference scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by strategically distributing it between generation and verification.

[LG-14] Learning with Imperfect Models: When Multi-step Prediction Mitigates Compounding Error

链接: https://arxiv.org/abs/2504.01766
作者: Anne Somalwar,Bruce D. Lee,George J. Pappas,Nikolai Matni
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compounding error, where small prediction mistakes accumulate over time, presents a major challenge in learning-based control. For example, this issue often limits the performance of model-based reinforcement learning and imitation learning. One common approach to mitigate compounding error is to train multi-step predictors directly, rather than relying on autoregressive rollout of a single-step model. However, it is not well understood when the benefits of multi-step prediction outweigh the added complexity of learning a more complicated model. In this work, we provide a rigorous analysis of this trade-off in the context of linear dynamical systems. We show that when the model class is well-specified and accurately captures the system dynamics, single-step models achieve lower asymptotic prediction error. On the other hand, when the model class is misspecified due to partial observability, direct multi-step predictors can significantly reduce bias and thus outperform single-step approaches. These theoretical results are supported by numerical experiments, wherein we also (a) empirically evaluate an intermediate strategy which trains a single-step model using a multi-step loss and (b) evaluate performance of single step and multi-step predictors in a closed loop control setting.

[LG-15] A Two-Timescale Approach for Wireless Federated Learning with Parameter Freezing and Power Control

链接: https://arxiv.org/abs/2504.01752
作者: Jinhao Ouyang,Yuan Liu,Hang Liu
类目: Machine Learning (cs.LG)
*备注: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting, republishing, or reuse in other works. This work has been accepted to IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Federated learning (FL) enables distributed devices to train a shared machine learning (ML) model collaboratively while protecting their data privacy. However, the resource-limited mobile devices suffer from intensive computation-and-communication costs of model parameters. In this paper, we observe the phenomenon that the model parameters tend to be stabilized long before convergence during training process. Based on this observation, we propose a two-timescale FL framework by joint optimization of freezing stabilized parameters and controlling transmit power for the unstable parameters to balance the energy consumption and convergence. First, we analyze the impact of model parameter freezing and unreliable transmission on the convergence rate. Next, we formulate a two-timescale optimization problem of parameter freezing percentage and transmit power to minimize the model convergence error subject to the energy budget. To solve this problem, we decompose it into parallel sub-problems and decompose each sub-problem into two different timescales problems using the Lyapunov optimization method. The optimal parameter freezing and power control strategies are derived in an online fashion. Experimental results demonstrate the superiority of the proposed scheme compared with the benchmark schemes.

[LG-16] High Dimensional Bayesian Optimization using Lasso Variable Selection

链接: https://arxiv.org/abs/2504.01743
作者: Vu Viet Hoang,Hung The Tran,Sunil Gupta,Vu Nguyen
类目: Machine Learning (cs.LG)
*备注: Accepted at The 28th International Conference on Artificial Intelligence and Statistics

点击查看摘要

Abstract:Bayesian optimization (BO) is a leading method for optimizing expensive black-box optimization and has been successfully applied across various scenarios. However, BO suffers from the curse of dimensionality, making it challenging to scale to high-dimensional problems. Existing work has adopted a variable selection strategy to select and optimize only a subset of variables iteratively. Although this approach can mitigate the high-dimensional challenge in BO, it still leads to sample inefficiency. To address this issue, we introduce a novel method that identifies important variables by estimating the length scales of Gaussian process kernels. Next, we construct an effective search region consisting of multiple subspaces and optimize the acquisition function within this region, focusing on only the important variables. We demonstrate that our proposed method achieves cumulative regret with a sublinear growth rate in the worst case while maintaining computational efficiency. Experiments on high-dimensional synthetic functions and real-world problems show that our method achieves state-of-the-art performance.

[LG-17] Stable Structure Learning with HC-Stable and Tabu-Stable Algorithms

链接: https://arxiv.org/abs/2504.01740
作者: Neville K. Kitson,Anthony C. Constantinou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many Bayesian Network structure learning algorithms are unstable, with the learned graph sensitive to arbitrary dataset artifacts, such as the ordering of columns (i.e., variable order). PC-Stable attempts to address this issue for the widely-used PC algorithm, prompting researchers to use the “stable” version instead. However, this problem seems to have been overlooked for score-based algorithms. In this study, we show that some widely-used score-based algorithms, as well as hybrid and constraint-based algorithms, including PC-Stable, suffer from the same issue. We propose a novel solution for score-based greedy hill-climbing that eliminates instability by determining a stable node order, leading to consistent results regardless of variable ordering. Two implementations, HC-Stable and Tabu-Stable, are introduced. Tabu-Stable achieves the highest BIC scores across all networks, and the highest accuracy for categorical networks. These results highlight the importance of addressing instability in structure learning and provide a robust and practical approach for future applications. This extends the scope and impact of our previous work presented at Probabilistic Graphical Models 2024 by incorporating continuous variables. The implementation, along with usage instructions, is freely available on GitHub at this https URL.

[LG-18] Enlightenment Period Improving DNN Performance

链接: https://arxiv.org/abs/2504.01737
作者: Tiantian Liu,Weishi Xu,Meng Wan,Jue Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the early stage of deep neural network training, the loss decreases rapidly before gradually leveling off. Extensive research has shown that during this stage, the model parameters undergo significant changes and their distribution is largely established. Existing studies suggest that the introduction of noise during early training can degrade model performance. We identify a critical “enlightenment period” encompassing up to the first 4% of the training cycle (1–20 epochs for 500-epoch training schedules), a phase characterized by intense parameter fluctuations and heightened noise sensitivity. Our findings reveal that strategically reducing noise during this brief phase–by disabling data augmentation techniques such as Mixup or removing high-loss samples–leads to statistically significant improvements in model performance. This work opens new avenues for exploring the relationship between the enlightenment period and network training dynamics across diverse model architectures and tasks.

[LG-19] Beyond Non-Expert Demonstrations: Outcome-Driven Action Constraint for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2504.01719
作者: Ke Jiang,Wen Jiang,Yao Li,Xiaoyang Tan
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:We address the challenge of offline reinforcement learning using realistic data, specifically non-expert data collected through sub-optimal behavior policies. Under such circumstance, the learned policy must be safe enough to manage \textitdistribution shift while maintaining sufficient flexibility to deal with non-expert (bad) demonstrations from offline this http URL tackle this issue, we introduce a novel method called Outcome-Driven Action Flexibility (ODAF), which seeks to reduce reliance on the empirical action distribution of the behavior policy, hence reducing the negative impact of those bad this http URL be specific, a new conservative reward mechanism is developed to deal with \it distribution shift by evaluating actions according to whether their outcomes meet safety requirements - remaining within the state support area, rather than solely depending on the actions’ likelihood based on offline this http URL theoretical justification, we provide empirical evidence on widely used MuJoCo and various maze benchmarks, demonstrating that our ODAF method, implemented using uncertainty quantification techniques, effectively tolerates unseen transitions for improved “trajectory stitching,” while enhancing the agent’s ability to learn from realistic non-expert data.

[LG-20] ransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication

链接: https://arxiv.org/abs/2504.01708
作者: Petr Vanc,Karla Stepanova
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data as well as with object descriptions that do not perfectly fit the predefined object names (e.g. ‘Pick that red object’). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like “this”). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: this http URL.

[LG-21] Satellite Edge Artificial Intelligence with Large Models: Architectures and Technologies

链接: https://arxiv.org/abs/2504.01676
作者: Yuanming Shi,Jingyang Zhu,Chunxiao Jiang,Linling Kuang,Khaled B. Letaief
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: 15 pages, 5 figures; submitted to SCIENCE CHINA Information Sciences for possible publication

点击查看摘要

Abstract:Driven by the growing demand for intelligent remote sensing applications, large artificial intelligence (AI) models pre-trained on large-scale unlabeled datasets and fine-tuned for downstream tasks have significantly improved learning performance for various downstream tasks due to their generalization capabilities. However, many specific downstream tasks, such as extreme weather nowcasting (e.g., downburst and tornado), disaster monitoring, and battlefield surveillance, require real-time data processing. Traditional methods via transferring raw data to ground stations for processing often cause significant issues in terms of latency and trustworthiness. To address these challenges, satellite edge AI provides a paradigm shift from ground-based to on-board data processing by leveraging the integrated communication-and-computation capabilities in space computing power networks (Space-CPN), thereby enhancing the timeliness, effectiveness, and trustworthiness for remote sensing downstream tasks. Moreover, satellite edge large AI model (LAM) involves both the training (i.e., fine-tuning) and inference phases, where a key challenge lies in developing computation task decomposition principles to support scalable LAM deployment in resource-constrained space networks with time-varying topologies. In this article, we first propose a satellite federated fine-tuning architecture to split and deploy the modules of LAM over space and ground networks for efficient LAM fine-tuning. We then introduce a microservice-empowered satellite edge LAM inference architecture that virtualizes LAM components into lightweight microservices tailored for multi-task multimodal inference. Finally, we discuss the future directions for enhancing the efficiency and scalability of satellite edge LAM, including task-oriented communication, brain-inspired computing, and satellite edge AI network optimization.

[LG-22] Multi-Relation Graph-Kernel Strengthen Network for Graph-Level Clustering

链接: https://arxiv.org/abs/2504.01605
作者: Renda Han,Guangzhen Yao,Wenxin Zhang,Yu Li,Wen Xin,Huajie Lei,Mengfei Li,Zeyu Zhang,Chengze Du,Yahe Tian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph-level clustering is a fundamental task of data mining, aiming at dividing unlabeled graphs into distinct groups. However, existing deep methods that are limited by pooling have difficulty extracting diverse and complex graph structure features, while traditional graph kernel methods rely on exhaustive substructure search, unable to adaptive handle multi-relational data. This limitation hampers producing robust and representative graph-level embeddings. To address this issue, we propose a novel Multi-Relation Graph-Kernel Strengthen Network for Graph-Level Clustering (MGSN), which integrates multi-relation modeling with graph kernel techniques to fully leverage their respective advantages. Specifically, MGSN constructs multi-relation graphs to capture diverse semantic relationships between nodes and graphs, which employ graph kernel methods to extract graph similarity features, enriching the representation space. Moreover, a relation-aware representation refinement strategy is designed, which adaptively aligns multi-relation information across views while enhancing graph-level features through a progressive fusion process. Extensive experiments on multiple benchmark datasets demonstrate the superiority of MGSN over state-of-the-art methods. The results highlight its ability to leverage multi-relation structures and graph kernel features, establishing a new paradigm for robust graph-level clustering.

[LG-23] DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

链接: https://arxiv.org/abs/2504.01531
作者: Xiaobei Zou,Luolin Xiong,Kexuan Zhang,Cesare Alippi,Yang Tang
类目: Machine Learning (cs.LG)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:Accurate predictions of spatio-temporal systems’ states are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity is not granted. To address non-stationarity frameworks, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. In order to address this problem, we develop a Spatial Factor Learner (SFL) module that enables the normalization and de-normalization process in spatio-temporal systems. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state of the art methods in weather prediction and traffic flows forecasting tasks. Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes. Moreover, ablation studies confirm the effectiveness of each component.

[LG-24] UAKNN: Label Distribution Learning via Uncertainty-Aware KNN

链接: https://arxiv.org/abs/2504.01508
作者: Pu Wang,Yu Zhang,Zhuoran Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Label Distribution Learning (LDL) aims to characterize the polysemy of an instance by building a set of descriptive degrees corresponding to the instance. In recent years, researchers seek to model to obtain an accurate label distribution by using low-rank, label relations, expert experiences, and label uncertainty estimation. In general, these methods are based on algorithms with parameter learning in a linear (including kernel functions) or deep learning framework. However, these methods are difficult to deploy and update online due to high training costs, limited scalability, and outlier sensitivity. To address this problem, we design a novel LDL method called UAKNN, which has the advantages of the KNN algorithm with the benefits of uncertainty modeling. In addition, we provide solutions to the dilemma of existing work on extremely label distribution spaces. Extensive experiments demonstrate that our method is significantly competitive on 12 benchmarks and that the inference speed of the model is well-suited for industrial-level applications.

[LG-25] MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storag e ICDE2025

链接: https://arxiv.org/abs/2504.01506
作者: Yongjun He,Roger Waleffe,Zhichao Han,Johnu George,Binhang Yuan,Zitao Zhang,Yinan Shan,Yang Zhao,Debojyoti Dutta,Theodoros Rekatsinas,Ce Zhang
类目: Machine Learning (cs.LG)
*备注: To appear in ICDE 2025

点击查看摘要

Abstract:Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay’s payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at this https URL.

[LG-26] Approximate Agreement Algorithms for Byzantine Collaborative Learning

链接: https://arxiv.org/abs/2504.01504
作者: Tijana Milentijević,Mélanie Cambus,Darya Melnyk,Stefan Schmid
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In Byzantine collaborative learning, n clients in a peer-to-peer network collectively learn a model without sharing their data by exchanging and aggregating stochastic gradient estimates. Byzantine clients can prevent others from collecting identical sets of gradient estimates. The aggregation step thus needs to be combined with an efficient (approximate) agreement subroutine to ensure convergence of the training process. In this work, we study the geometric median aggregation rule for Byzantine collaborative learning. We show that known approaches do not provide theoretical guarantees on convergence or gradient quality in the agreement subroutine. To satisfy these theoretical guarantees, we present a hyperbox algorithm for geometric median aggregation. We practically evaluate our algorithm in both centralized and decentralized settings under Byzantine attacks on non-i.i.d. data. We show that our geometric median-based approaches can tolerate sign-flip attacks better than known mean-based approaches from the literature. Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2504.01504 [cs.LG] (or arXiv:2504.01504v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.01504 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-27] A Robust Model-Based Approach for Continuous-Time Policy Evaluation with Unknown Lévy Process Dynamics

链接: https://arxiv.org/abs/2504.01482
作者: Qihao Ye,Xiaochuan Tian,Yuhua Zhu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 27 pages, 9 figures

点击查看摘要

Abstract:This paper develops a model-based framework for continuous-time policy evaluation (CTPE) in reinforcement learning, incorporating both Brownian and Lévy noise to model stochastic dynamics influenced by rare and extreme events. Our approach formulates the policy evaluation problem as solving a partial integro-differential equation (PIDE) for the value function with unknown coefficients. A key challenge in this setting is accurately recovering the unknown coefficients in the stochastic dynamics, particularly when driven by Lévy processes with heavy tail effects. To address this, we propose a robust numerical approach that effectively handles both unbiased and censored trajectory datasets. This method combines maximum likelihood estimation with an iterative tail correction mechanism, improving the stability and accuracy of coefficient recovery. Additionally, we establish a theoretical bound for the policy evaluation error based on coefficient recovery error. Through numerical experiments, we demonstrate the effectiveness and robustness of our method in recovering heavy-tailed Lévy dynamics and verify the theoretical error analysis in policy evaluation.

[LG-28] Identifying Obfuscated Code through Graph-Based Semantic Analysis of Binary Code

链接: https://arxiv.org/abs/2504.01481
作者: Roxane Cohen(LAMSADE),Robin David,Florian Yger(LITIS),Fabrice Rossi(CEREMADE)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The 13th International Conference on Complex Networks and their Applications, Dec 2024, Istabul, Turkey

点击查看摘要

Abstract:Protecting sensitive program content is a critical issue in various situations, ranging from legitimate use cases to unethical contexts. Obfuscation is one of the most used techniques to ensure such protection. Consequently, attackers must first detect and characterize obfuscation before launching any attack against it. This paper investigates the problem of function-level obfuscation detection using graph-based approaches, comparing algorithms, from elementary baselines to promising techniques like GNN (Graph Neural Networks), on different feature choices. We consider various obfuscation types and obfuscators, resulting in two complex datasets. Our findings demonstrate that GNNs need meaningful features that capture aspects of function semantics to outperform baselines. Our approach shows satisfactory results, especially in a challenging 11-class classification task and in a practical malware analysis example.

[LG-29] A Prefixed Patch Time Series Transformer for Two-Point Boundary Value Problems in Three-Body Problems

链接: https://arxiv.org/abs/2504.01464
作者: Akira Hatakeyama,Shota Ito,Toshihiko Yanase,Naoya Ozaki
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Two-point boundary value problems for cislunar trajectories present significant challenges in circler restricted three body problem, making traditional analytical methods like Lambert’s problem inapplicable. This study proposes a novel approach using a prefixed patch time series Transformer model that automates the solution of two-point boundary value problems from lunar flyby to arbitrary terminal conditions. Using prefix tokens of terminal conditions in our deep generative model enables solving boundary value problems in three-body dynamics. The training dataset consists of trajectories obtained through forward propagation rather than solving boundary value problems directly. The model demonstrates potential practical utility for preliminary trajectory design in cislunar mission scenarios.

[LG-30] LLM -VPRF: Large Language Model Based Vector Pseudo Relevance Feedback

链接: https://arxiv.org/abs/2504.01448
作者: Hang Li,Shengyao Zhuang,Bevan Koopman,Guido Zuccon
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vector Pseudo Relevance Feedback (VPRF) has shown promising results in improving BERT-based dense retrieval systems through iterative refinement of query representations. This paper investigates the generalizability of VPRF to Large Language Model (LLM) based dense retrievers. We introduce LLM-VPRF and evaluate its effectiveness across multiple benchmark datasets, analyzing how different LLMs impact the feedback mechanism. Our results demonstrate that VPRF’s benefits successfully extend to LLM architectures, establishing it as a robust technique for enhancing dense retrieval performance regardless of the underlying models. This work bridges the gap between VPRF with traditional BERT-based dense retrievers and modern LLMs, while providing insights into their future directions.

[LG-31] Solving Time-Fractional Partial Integro-Differential Equations Using Tensor Neural Networks

链接: https://arxiv.org/abs/2504.01440
作者: Zhongshuo Lin,Qingkui Ma,Hehu Xie,Xiaobo Yin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel machine learning method based on adaptive tensor neural network subspace to solve linear time-fractional diffusion-wave equations and nonlinear time-fractional partial integro-differential equations. In this framework, the tensor neural network and Gauss-Jacobi quadrature are effectively combined to construct a universal numerical scheme for the temporal Caputo derivative with orders spanning (0,1) and (1,2) . Specifically, in order to effectively utilize Gauss-Jacobi quadrature to discretize Caputo derivatives, we design the tensor neural network function multiplied by the function t^\mu where the power \mu is selected according to the parameters of the equations at hand. Finally, some numerical examples are provided to validate the efficiency and accuracy of the proposed tensor neural network-based machine learning method.

[LG-32] On the Role of Priors in Bayesian Causal Learning

链接: https://arxiv.org/abs/2504.01424
作者: Bernhard C. Geiger,Roman Kern
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, accepted for publication in IEEE Transactions on Artificial Intelligence

点击查看摘要

Abstract:In this work, we investigate causal learning of independent causal mechanisms from a Bayesian perspective. Confirming previous claims from the literature, we show in a didactically accessible manner that unlabeled data (i.e., cause realizations) do not improve the estimation of the parameters defining the mechanism. Furthermore, we observe the importance of choosing an appropriate prior for the cause and mechanism parameters, respectively. Specifically, we show that a factorized prior results in a factorized posterior, which resonates with Janzing and Schölkopf’s definition of independent causal mechanisms via the Kolmogorov complexity of the involved distributions and with the concept of parameter independence of Heckerman et al.

[LG-33] aching Robots to Handle Nuclear Waste: A Teleoperation-Based Learning Approach

链接: https://arxiv.org/abs/2504.01405
作者: Joong-Ku Lee,Hyeonseok Choi,Young Soo Park,Jee-Hwan Ryu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Waste Management Symposia 2025

点击查看摘要

Abstract:This paper presents a Learning from Teleoperation (LfT) framework that integrates human expertise with robotic precision to enable robots to autonomously perform skills learned from human operators. The proposed framework addresses challenges in nuclear waste handling tasks, which often involve repetitive and meticulous manipulation operations. By capturing operator movements and manipulation forces during teleoperation, the framework utilizes this data to train machine learning models capable of replicating and generalizing human skills. We validate the effectiveness of the LfT framework through its application to a power plug insertion task, selected as a representative scenario that is repetitive yet requires precise trajectory and force control. Experimental results highlight significant improvements in task efficiency, while reducing reliance on continuous operator involvement.

[LG-34] Cause or Trigger? From Philosophy to Causal Modeling

链接: https://arxiv.org/abs/2504.01398
作者: Kateřina Hlaváčková-Schindler,Rainer Wöß,Vera Pecorino,Philip Schindler
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Not much has been written about the role of triggers in the literature on causal reasoning, causal modeling, or philosophy. In this paper, we focus on describing triggers and causes in the metaphysical sense and on characterizations that differentiate them from each other. We carry out a philosophical analysis of these differences. From this, we formulate a definition that clearly differentiates triggers from causes and can be used for causal reasoning in natural sciences. We propose a mathematical model and the Cause-Trigger algorithm, which, based on given data to observable processes, is able to determine whether a process is a cause or a trigger of an effect. The possibility to distinguish triggers from causes directly from data makes the algorithm a useful tool in natural sciences using observational data, but also for real-world scenarios. For example, knowing the processes that trigger causes of a tropical storm could give politicians time to develop actions such as evacuation the population. Similarly, knowing the triggers of processes that cause global warming could help politicians focus on effective actions. We demonstrate our algorithm on the climatological data of two recent cyclones, Freddy and Zazu. The Cause-Trigger algorithm detects processes that trigger high wind speed in both storms during their cyclogenesis. The findings obtained agree with expert knowledge.

[LG-35] De Novo Molecular Design Enabled by Direct Preference Optimization and Curriculum Learning

链接: https://arxiv.org/abs/2504.01389
作者: Junyu Hou
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:De novo molecular design has extensive applications in drug discovery and materials science. The vast chemical space renders direct molecular searches computationally prohibitive, while traditional experimental screening is both time- and labor-intensive. Efficient molecular generation and screening methods are therefore essential for accelerating drug discovery and reducing costs. Although reinforcement learning (RL) has been applied to optimize molecular properties via reward mechanisms, its practical utility is limited by issues in training efficiency, convergence, and stability. To address these challenges, we adopt Direct Preference Optimization (DPO) from NLP, which uses molecular score-based sample pairs to maximize the likelihood difference between high- and low-quality molecules, effectively guiding the model toward better compounds. Moreover, integrating curriculum learning further boosts training efficiency and accelerates convergence. A systematic evaluation of the proposed method on the GuacaMol Benchmark yielded excellent scores. For instance, the method achieved a score of 0.883 on the Perindopril MPO task, representing a 6% improvement over competing models. And subsequent target protein binding experiments confirmed its practical efficacy. These results demonstrate the strong potential of DPO for molecular design tasks and highlight its effectiveness as a robust and efficient solution for data-driven drug discovery.

[LG-36] UniFault: A Fault Diagnosis Foundation Model from Bearing Data

链接: https://arxiv.org/abs/2504.01373
作者: Emadeldeen Eldele,Mohamed Ragab,Xu Qing,Edward,Zhenghua Chen,Min Wu,Xiaoli Li,Jay Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine fault diagnosis (FD) is a critical task for predictive maintenance, enabling early fault detection and preventing unexpected failures. Despite its importance, existing FD models are operation-specific with limited generalization across diverse datasets. Foundation models (FM) have demonstrated remarkable potential in both visual and language domains, achieving impressive generalization capabilities even with minimal data through few-shot or zero-shot learning. However, translating these advances to FD presents unique hurdles. Unlike the large-scale, cohesive datasets available for images and text, FD datasets are typically smaller and more heterogeneous, with significant variations in sampling frequencies and the number of channels across different systems and applications. This heterogeneity complicates the design of a universal architecture capable of effectively processing such diverse data while maintaining robust feature extraction and learning capabilities. In this paper, we introduce UniFault, a foundation model for fault diagnosis that systematically addresses these issues. Specifically, the model incorporates a comprehensive data harmonization pipeline featuring two key innovations. First, a unification scheme transforms multivariate inputs into standardized univariate sequences while retaining local inter-channel relationships. Second, a novel cross-domain temporal fusion strategy mitigates distribution shifts and enriches sample diversity and count, improving the model generalization across varying conditions. UniFault is pretrained on over 9 billion data points spanning diverse FD datasets, enabling superior few-shot performance. Extensive experiments on real-world FD datasets demonstrate that UniFault achieves SoTA performance, setting a new benchmark for fault diagnosis models and paving the way for more scalable and robust predictive maintenance solutions.

[LG-37] xML-workFlow: an end-to-end explainable scikit-learn workflow for rapid biomedical experimentation

链接: https://arxiv.org/abs/2504.01356
作者: Khoa A. Tran,John V. Pearson,Nicola Waddell
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Technical Note, 8 pages, 1 figure

点击查看摘要

Abstract:Motivation: Building and iterating machine learning models is often a resource-intensive process. In biomedical research, scientific codebases can lack scalability and are not easily transferable to work beyond what they were intended. xML-workFlow addresses this issue by providing a rapid, robust, and traceable end-to-end workflow that can be adapted to any ML project with minimal code rewriting. Results: We show a practical, end-to-end workflow that integrates scikit-learn, MLflow, and SHAP. This template significantly reduces the time and effort required to build and iterate on ML models, addressing the common challenges of scalability and reproducibility in biomedical research. Adapting our template may save bioinformaticians time in development and enables biomedical researchers to deploy ML projects. Availability and implementation: xML-workFlow is available at this https URL. Comments: Technical Note, 8 pages, 1 figure Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE) ACMclasses: D.2.3; D.2.11; I.2.5; J.3 Cite as: arXiv:2504.01356 [cs.LG] (or arXiv:2504.01356v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.01356 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-38] FlowMotion: Target-Predictive Flow Matching for Realistic Text-Driven Human Motion Generation

链接: https://arxiv.org/abs/2504.01338
作者: Manolo Canales Cuba,João Paulo Gois
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving highly diverse and perceptually consistent 3D character animations with natural motion and low computational costs remains a challenge in computer animation. Existing methods often struggle to provide the nuanced complexity of human movement, resulting in perceptual inconsistencies and motion artifacts. To tackle these issues, we introduce FlowMotion, a novel approach that leverages Conditional Flow Matching (CFM) for improved motion synthesis. FlowMotion incorporates an innovative training objective that more accurately predicts target motion, reducing the inherent jitter associated with CFM while enhancing stability, realism, and computational efficiency in generating animations. This direct prediction approach enhances the perceptual quality of animations by reducing erratic motion and aligning the training more closely with the dynamic characteristics of human movement. Our experimental results demonstrate that FlowMotion achieves higher balance between motion smoothness and generalization capability while maintaining the computational efficiency inherent in flow matching compared to state-of-the-art methods.

[LG-39] Inverse RL Scene Dynamics Learning for Nonlinear Predictive Control in Autonomous Vehicles

链接: https://arxiv.org/abs/2504.01336
作者: Sorin Grigorescu,Mihai Zaha
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 21 pages, 14 figures, journal paper

点击查看摘要

Abstract:This paper introduces the Deep Learning-based Nonlinear Model Predictive Controller with Scene Dynamics (DL-NMPC-SD) method for autonomous navigation. DL-NMPC-SD uses an a-priori nominal vehicle model in combination with a scene dynamics model learned from temporal range sensing information. The scene dynamics model is responsible for estimating the desired vehicle trajectory, as well as to adjust the true system model used by the underlying model predictive controller. We propose to encode the scene dynamics model within the layers of a deep neural network, which acts as a nonlinear approximator for the high order state-space of the operating conditions. The model is learned based on temporal sequences of range sensing observations and system states, both integrated by an Augmented Memory component. We use Inverse Reinforcement Learning and the Bellman optimality principle to train our learning controller with a modified version of the Deep Q-Learning algorithm, enabling us to estimate the desired state trajectory as an optimal action-value function. We have evaluated DL-NMPC-SD against the baseline Dynamic Window Approach (DWA), as well as against two state-of-the-art End2End and reinforcement learning methods, respectively. The performance has been measured in three experiments: i) in our GridSim virtual environment, ii) on indoor and outdoor navigation tasks using our RovisLab AMTU (Autonomous Mobile Test Unit) platform and iii) on a full scale autonomous test vehicle driving on public roads.

[LG-40] Flexible and Explainable Graph Analysis for EEG-based Alzheimers Disease Classification

链接: https://arxiv.org/abs/2504.01329
作者: Jing Wang,Jun-En Ding,Feng Liu,Elisa Kallioniemi,Shuqiang Wang,Wen-Xiang Tsai,Albert C. Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alzheimer’s Disease is a progressive neurological disorder that is one of the most common forms of dementia. It leads to a decline in memory, reasoning ability, and behavior, especially in older people. The cause of Alzheimer’s Disease is still under exploration and there is no all-inclusive theory that can explain the pathologies in each individual patient. Nevertheless, early intervention has been found to be effective in managing symptoms and slowing down the disease’s progression. Recent research has utilized electroencephalography (EEG) data to identify biomarkers that distinguish Alzheimer’s Disease patients from healthy individuals. Prior studies have used various machine learning methods, including deep learning and graph neural networks, to examine electroencephalography-based signals for identifying Alzheimer’s Disease patients. In our research, we proposed a Flexible and Explainable Gated Graph Convolutional Network (GGCN) with Multi-Objective Tree-Structured Parzen Estimator (MOTPE) hyperparameter tuning. This provides a flexible solution that efficiently identifies the optimal number of GGCN blocks to achieve the optimized precision, specificity, and recall outcomes, as well as the optimized area under the Receiver Operating Characteristic (AUC). Our findings demonstrated a high efficacy with an over 0.9 Receiver Operating Characteristic score, alongside precision, specificity, and recall scores in distinguishing health control with Alzheimer’s Disease patients in Moderate to Severe Dementia using the power spectrum density (PSD) of electroencephalography signals across various frequency bands. Moreover, our research enhanced the interpretability of the embedded adjacency matrices, revealing connectivity differences in frontal and parietal brain regions between Alzheimer’s patients and healthy individuals.

[LG-41] FLAMES: A Hybrid Spiking-State Space Model for Adaptive Memory Retention in Event-Based Learning

链接: https://arxiv.org/abs/2504.01257
作者: Biswadeep Chakraborty,Saibal Mukhopadhyay
类目: Machine Learning (cs.LG)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:We propose \textbfFLAMES (Fast Long-range Adaptive Memory for Event-based Systems), a novel hybrid framework integrating structured state-space dynamics with event-driven computation. At its core, the \textitSpike-Aware HiPPO (SA-HiPPO) mechanism dynamically adjusts memory retention based on inter-spike intervals, preserving both short- and long-range dependencies. To maintain computational efficiency, we introduce a normal-plus-low-rank (NPLR) decomposition, reducing complexity from \mathcalO(N^2) to \mathcalO(Nr) . FLAMES achieves state-of-the-art results on the Long Range Arena benchmark and event datasets like HAR-DVS and Celex-HAR. By bridging neuromorphic computing and structured sequence modeling, FLAMES enables scalable long-range reasoning in event-driven systems.

[LG-42] R2DN: Scalable Parameterization of Contracting and Lipschitz Recurrent Deep Networks

链接: https://arxiv.org/abs/2504.01250
作者: Nicholas H. Barbara,Ruigang Wang,Ian R. Manchester
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:This paper presents the Robust Recurrent Deep Network (R2DN), a scalable parameterization of robust recurrent neural networks for machine learning and data-driven control. We construct R2DNs as a feedback interconnection of a linear time-invariant system and a 1-Lipschitz deep feedforward network, and directly parameterize the weights so that our models are stable (contracting) and robust to small input perturbations (Lipschitz) by design. Our parameterization uses a structure similar to the previously-proposed recurrent equilibrium networks (RENs), but without the requirement to iteratively solve an equilibrium layer at each time-step. This speeds up model evaluation and backpropagation on GPUs, and makes it computationally feasible to scale up the network size, batch size, and input sequence length in comparison to RENs. We compare R2DNs to RENs on three representative problems in nonlinear system identification, observer design, and learning-based feedback control and find that training and inference are both up to an order of magnitude faster with similar test set performance, and that training/inference times scale more favorably with respect to model expressivity.

[LG-43] Explainable post-training bias mitigation with distribution-based fairness metrics

链接: https://arxiv.org/abs/2504.01223
作者: Ryan Franks,Alexey Miroshnikov
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 37 pages, 6 figures

点击查看摘要

Abstract:We develop a novel optimization framework with distribution-based fairness constraints for efficiently producing demographically blind, explainable models across a wide range of fairness levels. This is accomplished through post-processing, avoiding the need for retraining. Our framework, which is based on stochastic gradient descent, can be applied to a wide range of model types, with a particular emphasis on the post-processing of gradient-boosted decision trees. Additionally, we design a broad class of interpretable global bias metrics compatible with our method by building on previous work. We empirically test our methodology on a variety of datasets and compare it to other methods.

[LG-44] AutoML Benchmark with shorter time constraints and early stopping ICLR2025

链接: https://arxiv.org/abs/2504.01222
作者: Israel Campero Jurado,Pieter Gijsbers,Joaquin Vanschoren
类目: Machine Learning (cs.LG)
*备注: Workshop on the Future of Machine Learning Data Practices and Repositories, ICLR 2025

点击查看摘要

Abstract:Automated Machine Learning (AutoML) automatically builds machine learning (ML) models on data. The de facto standard for evaluating new AutoML frameworks for tabular data is the AutoML Benchmark (AMLB). AMLB proposed to evaluate AutoML frameworks using 1- and 4-hour time budgets across 104 tasks. We argue that shorter time constraints should be considered for the benchmark because of their practical value, such as when models need to be retrained with high frequency, and to make AMLB more accessible. This work considers two ways in which to reduce the overall computation used in the benchmark: smaller time constraints and the use of early stopping. We conduct evaluations of 11 AutoML frameworks on 104 tasks with different time constraints and find the relative ranking of AutoML frameworks is fairly consistent across time constraints, but that using early-stopping leads to a greater variety in model performance.

[LG-45] Gradient-free Continual Learning

链接: https://arxiv.org/abs/2504.01219
作者: Grzegorz Rypeść
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning (CL) presents a fundamental challenge in training neural networks on sequential tasks without experiencing catastrophic forgetting. Traditionally, the dominant approach in CL has been gradient-based optimization, where updates to the network parameters are performed using stochastic gradient descent (SGD) or its variants. However, a major limitation arises when previous data is no longer accessible, as is often assumed in CL settings. In such cases, there is no gradient information available for past data, leading to uncontrolled parameter changes and consequently severe forgetting of previously learned tasks. By shifting focus from data availability to gradient availability, this work opens up new avenues for addressing forgetting in CL. We explore the hypothesis that gradient-free optimization methods can provide a robust alternative to conventional gradient-based continual learning approaches. We discuss the theoretical underpinnings of such method, analyze their potential advantages and limitations, and present empirical evidence supporting their effectiveness. By reconsidering the fundamental cause of forgetting, this work aims to contribute a fresh perspective to the field of continual learning and inspire novel research directions.

[LG-46] Cooper: A Library for Constrained Optimization in Deep Learning

链接: https://arxiv.org/abs/2504.01212
作者: Jose Gallego-Posada,Juan Ramirez,Meraj Hashemizadeh,Simon Lacoste-Julien
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注:

点击查看摘要

Abstract:Cooper is an open-source package for solving constrained optimization problems involving deep learning models. Cooper implements several Lagrangian-based first-order update schemes, making it easy to combine constrained optimization algorithms with high-level features of PyTorch such as automatic differentiation, and specialized deep learning architectures and optimizers. Although Cooper is specifically designed for deep learning applications where gradients are estimated based on mini-batches, it is suitable for general non-convex continuous constrained optimization. Cooper’s source code is available at this https URL.

[LG-47] Global explainability of a deep abstaining classifier

链接: https://arxiv.org/abs/2504.01202
作者: Sayera Dhaubhadel(1 and 2),Jamaludin Mohd-Yusof(1),Benjamin H. McMahon(1),Trilce Estrada(2),Kumkum Ganguly(1),Adam Spannaus(3),John P. Gounley(3),Xiao-Cheng Wu(4),Eric B. Durbin(5),Heidi A. Hanson(3),Tanmoy Bhattacharya(1) ((1) Los Alamos National Laboratory, (2) University of New Mexico, (3) Oak Ridge National Laboratory, (4) Louisiana Tumor Registry, (5) Kentucky Cancer Registry)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a global explainability method to characterize sources of errors in the histology prediction task of our real-world multitask convolutional neural network (MTCNN)-based deep abstaining classifier (DAC), for automated annotation of cancer pathology reports from NCI-SEER registries. Our classifier was trained and evaluated on 1.04 million hand-annotated samples and makes simultaneous predictions of cancer site, subsite, histology, laterality, and behavior for each report. The DAC framework enables the model to abstain on ambiguous reports and/or confusing classes to achieve a target accuracy on the retained (non-abstained) samples, but at the cost of decreased coverage. Requiring 97% accuracy on the histology task caused our model to retain only 22% of all samples, mostly the less ambiguous and common classes. Local explainability with the GradInp technique provided a computationally efficient way of obtaining contextual reasoning for thousands of individual predictions. Our method, involving dimensionality reduction of approximately 13000 aggregated local explanations, enabled global identification of sources of errors as hierarchical complexity among classes, label noise, insufficient information, and conflicting evidence. This suggests several strategies such as exclusion criteria, focused annotation, and reduced penalties for errors involving hierarchically related classes to iteratively improve our DAC in this complex real-world implementation.

[LG-48] Value Iteration for Learning Concurrently Executable Robotic Control Tasks AAMAS2025

链接: https://arxiv.org/abs/2504.01174
作者: Sheikh A. Tahmid,Gennaro Notomista
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: To be published in AAMAS 2025 conference: this https URL

点击查看摘要

Abstract:Many modern robotic systems such as multi-robot systems and manipulators exhibit redundancy, a property owing to which they are capable of executing multiple tasks. This work proposes a novel method, based on the Reinforcement Learning (RL) paradigm, to train redundant robots to be able to execute multiple tasks concurrently. Our approach differs from typical multi-objective RL methods insofar as the learned tasks can be combined and executed in possibly time-varying prioritized stacks. We do so by first defining a notion of task independence between learned value functions. We then use our definition of task independence to propose a cost functional that encourages a policy, based on an approximated value function, to accomplish its control objective while minimally interfering with the execution of higher priority tasks. This allows us to train a set of control policies that can be executed simultaneously. We also introduce a version of fitted value iteration to learn to approximate our proposed cost functional efficiently. We demonstrate our approach on several scenarios and robotic systems.

[LG-49] Efficient n-body simulations using physics informed graph neural networks WWW

链接: https://arxiv.org/abs/2504.01169
作者: Víctor Ramos-Osuna,Alberto Díaz-Álvarez,Raúl Lara-Cabrera
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 3 tables, accepted in conference MAEB 2025 (more info at this https URL )

点击查看摘要

Abstract:This paper presents a novel approach for accelerating n-body simulations by integrating a physics-informed graph neural networks (GNN) with traditional numerical methods. Our method implements a leapfrog-based simulation engine to generate datasets from diverse astrophysical scenarios which are then transformed into graph representations. A custom-designed GNN is trained to predict particle accelerations with high precision. Experiments, conducted on 60 training and 6 testing simulations spanning from 3 to 500 bodies over 1000 time steps, demonstrate that the proposed model achieves extremely low prediction errors-loss values while maintaining robust long-term stability, with accumulated errors in position, velocity, and acceleration remaining insignificant. Furthermore, our method yields a modest speedup of approximately 17% over conventional simulation techniques. These results indicate that the integration of deep learning with traditional physical simulation methods offers a promising pathway to significantly enhance computational efficiency without compromising accuracy.

[LG-50] Performative Drift Resistant Classification Using Generative Domain Adversarial Networks

链接: https://arxiv.org/abs/2504.01135
作者: Maciej Makowski,Brandon Gower-Winter,Georg Krempl
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, 5 tables. Accepted at Symposium on Intelligent Data Analysis (IDA) 2025

点击查看摘要

Abstract:Performative Drift is a special type of Concept Drift that occurs when a model’s predictions influence the future instances the model will encounter. In these settings, retraining is not always feasible. In this work, we instead focus on drift understanding as a method for creating drift-resistant classifiers. To achieve this, we introduce the Generative Domain Adversarial Network (GDAN) which combines both Domain and Generative Adversarial Networks. Using GDAN, domain-invariant representations of incoming data are created and a generative network is used to reverse the effects of performative drift. Using semi-real and synthetic data generators, we empirically evaluate GDAN’s ability to provide drift-resistant classification. Initial results are promising with GDAN limiting performance degradation over several timesteps. Additionally, GDAN’s generative network can be used in tandem with other models to limit their performance degradation in the presence of performative drift. Lastly, we highlight the relationship between model retraining and the unpredictability of performative drift, providing deeper insights into the challenges faced when using traditional Concept Drift mitigation strategies in the performative setting.

[LG-51] Uncovering the Limitations of Query Performance Prediction: Failures Insights and Implications for Selective Query Processing

链接: https://arxiv.org/abs/2504.01101
作者: Adrian-Gabriel Chifu,Sébastien Déjean,Josiane Mothe,Moncef Garouani,Diego Ortiz,Md Zia Ullah
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:Query Performance Prediction (QPP) estimates retrieval systems effectiveness for a given query, offering valuable insights for search effectiveness and query processing. Despite extensive research, QPPs face critical challenges in generalizing across diverse retrieval paradigms and collections. This paper provides a comprehensive evaluation of state-of-the-art QPPs (e.g. NQC, UQC), LETOR-based features, and newly explored dense-based predictors. Using diverse sparse rankers (BM25, DFree without and with query expansion) and hybrid or dense (SPLADE and ColBert) rankers and diverse test collections ROBUST, GOV2, WT10G, and MS MARCO; we investigate the relationships between predicted and actual performance, with a focus on generalization and robustness. Results show significant variability in predictors accuracy, with collections as the main factor and rankers next. Some sparse predictors perform somehow on some collections (TREC ROBUST and GOV2) but do not generalise to other collections (WT10G and MS-MARCO). While some predictors show promise in specific scenarios, their overall limitations constrain their utility for applications. We show that QPP-driven selective query processing offers only marginal gains, emphasizing the need for improved predictors that generalize across collections, align with dense retrieval architectures and are useful for downstream applications.

[LG-52] MPCritic: A plug-and-play MPC architecture for reinforcement learning

链接: https://arxiv.org/abs/2504.01086
作者: Nathan P. Lawrence,Thomas Banker,Ali Mesbah
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Preprint for CDC 2025

点击查看摘要

Abstract:The reinforcement learning (RL) and model predictive control (MPC) communities have developed vast ecosystems of theoretical approaches and computational tools for solving optimal control problems. Given their conceptual similarities but differing strengths, there has been increasing interest in synergizing RL and MPC. However, existing approaches tend to be limited for various reasons, including computational cost of MPC in an RL algorithm and software hurdles towards seamless integration of MPC and RL tools. These challenges often result in the use of “simple” MPC schemes or RL algorithms, neglecting the state-of-the-art in both areas. This paper presents MPCritic, a machine learning-friendly architecture that interfaces seamlessly with MPC tools. MPCritic utilizes the loss landscape defined by a parameterized MPC problem, focusing on “soft” optimization over batched training steps; thereby updating the MPC parameters while avoiding costly minimization and parametric sensitivities. Since the MPC structure is preserved during training, an MPC agent can be readily used for online deployment, where robust constraint satisfaction is paramount. We demonstrate the versatility of MPCritic, in terms of MPC architectures and RL algorithms that it can accommodate, on classic control benchmarks.

[LG-53] Machine Learning for Identifying Potential Participants in Uruguayan Social Programs

链接: https://arxiv.org/abs/2504.01045
作者: Christian Beron Curti,Rodrigo Vargas Sainz,Yitong Tseo
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: in Spanish language

点击查看摘要

Abstract:This research project explores the optimization of the family selection process for participation in Uruguay’s Crece Contigo Family Support Program (PAF) through machine learning. An anonymized database of 15,436 previous referral cases was analyzed, focusing on pregnant women and children under four years of age. The main objective was to develop a predictive algorithm capable of determining whether a family meets the conditions for acceptance into the program. The implementation of this model seeks to streamline the evaluation process and allow for more efficient resource allocation, allocating more team time to direct support. The study included an exhaustive data analysis and the implementation of various machine learning models, including Neural Networks (NN), XGBoost (XGB), LSTM, and ensemble models. Techniques to address class imbalance, such as SMOTE and RUS, were applied, as well as decision threshold optimization to improve prediction accuracy and balance. The results demonstrate the potential of these techniques for efficient classification of families requiring assistance.

[LG-54] Carbon Footprint Evaluation of Code Generation through LLM as a Service

链接: https://arxiv.org/abs/2504.01036
作者: Tina Vartziotis,Maximilian Schmidt,George Dasoulas,Ippolyti Dellatolas,Stefano Attademo,Viet Dung Le,Anke Wiechmann,Tim Hoffmann,Michael Keckeisen,Sotirios Kotsopoulos
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Stuttgart Symposium, Springer

点击查看摘要

Abstract:Due to increased computing use, data centers consume and emit a lot of energy and carbon. These contributions are expected to rise as big data analytics, digitization, and large AI models grow and become major components of daily working routines. To reduce the environmental impact of software development, green (sustainable) coding and claims that AI models can improve energy efficiency have grown in popularity. Furthermore, in the automotive industry, where software increasingly governs vehicle performance, safety, and user experience, the principles of green coding and AI-driven efficiency could significantly contribute to reducing the sector’s environmental footprint. We present an overview of green coding and metrics to measure AI model sustainability awareness. This study introduces LLM as a service and uses a generative commercial AI language model, GitHub Copilot, to auto-generate code. Using sustainability metrics to quantify these AI models’ sustainability awareness, we define the code’s embodied and operational carbon.

[LG-55] Over-the-Air Edge Inference via End-to-End Metasurfaces-Integrated Artificial Neural Networks

链接: https://arxiv.org/abs/2504.00233
作者: Kyriakos Stylianopoulos,Paolo Di Lorenzo,George C. Alexandropoulos
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Information Theory (cs.IT)
*备注: Submitted for journal publication

点击查看摘要

Abstract:In the Edge Inference (EI) paradigm, where a Deep Neural Network (DNN) is split across the transceivers to wirelessly communicate goal-defined features in solving a computational task, the wireless medium has been commonly treated as a source of noise. In this paper, motivated by the emerging technologies of Reconfigurable Intelligent Surfaces (RISs) and Stacked Intelligent Metasurfaces (SIM) that offer programmable propagation of wireless signals, either through controllable reflections or diffractions, we optimize the RIS/SIM-enabled smart wireless environment as a means of over-the-air computing, resembling the operations of DNN layers. We propose a framework of Metasurfaces-Integrated Neural Networks (MINNs) for EI, presenting its modeling, training through a backpropagation variation for fading channels, and deployment aspects. The overall end-to-end DNN architecture is general enough to admit RIS and SIM devices, through controllable reconfiguration before each transmission or fixed configurations after training, while both channel-aware and channel-agnostic transceivers are considered. Our numerical evaluation showcases metasurfaces to be instrumental in performing image classification under link budgets that impede conventional communications or metasurface-free systems. It is demonstrated that our MINN framework can significantly simplify EI requirements, achieving near-optimal performance with 50~ dB lower testing signal-to-noise ratio compared to training, even without transceiver channel knowledge.

[LG-56] A Randomized Zeroth-Order Hierarchical Framework for Heterogeneous Federated Learning

链接: https://arxiv.org/abs/2504.01839
作者: Yuyang Qiu,Kibaek Kim,Farzad Yousefian
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heterogeneity in federated learning (FL) is a critical and challenging aspect that significantly impacts model performance and convergence. In this paper, we propose a novel framework by formulating heterogeneous FL as a hierarchical optimization problem. This new framework captures both local and global training process through a bilevel formulation and is capable of the following: (i) addressing client heterogeneity through a personalized learning framework; (ii) capturing pre-training process on server’s side; (iii) updating global model through nonstandard aggregation; (iv) allowing for nonidentical local steps; and (v) capturing clients’ local constraints. We design and analyze an implicit zeroth-order FL method (ZO-HFL), provided with nonasymptotic convergence guarantees for both the server-agent and the individual client-agents, and asymptotic guarantees for both the server-agent and client-agents in an almost sure sense. Notably, our method does not rely on standard assumptions in heterogeneous FL, such as the bounded gradient dissimilarity condition. We implement our method on image classification tasks and compare with other methods under different heterogeneous settings.

[LG-57] Autonomous optical navigation for DESTINY: Enhancing misalignment robustness in flyby observations with a rotating telescope

链接: https://arxiv.org/abs/2504.01835
作者: Takayuki Hosonuma,Takeshi Miyabara,Naoya Ozaki,Ko Ishibashi,Yuta Suzaki,Peng Hong,Masayuki Ohta,Takeshi Takashima
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: 19 pages, 25 figures, submitted to Acta Astronautica

点击查看摘要

Abstract:DESTINY+ is an upcoming JAXA Epsilon medium-class mission to flyby multiple asteroids including Phaethon. As an asteroid flyby observation instrument, a telescope mechanically capable of single-axis rotation, named TCAP, is mounted on the spacecraft to track and observe the target asteroids during flyby. As in past flyby missions utilizing rotating telescopes, TCAP is also used as a navigation camera for autonomous optical navigation during the closest-approach phase. To mitigate the degradation of the navigation accuracy, past missions performed calibration of the navigation camera’s alignment before starting optical navigation. However, such calibration requires significant operational time to complete and imposes constraints on the operation sequence. From the above background, the DESTINY+ team has studied the possibility of reducing operational costs by allowing TCAP alignment errors to remain. This paper describes an autonomous optical navigation algorithm robust to the misalignment of rotating telescopes, proposed in this context. In the proposed method, the misalignment of the telescope is estimated simultaneously with the spacecraft’s orbit relative to the flyby target. To deal with the nonlinearity between the misalignment and the observation value, the proposed method utilizes the unscented Kalman filter, instead of the extended Kalman filter widely used in past studies. The proposed method was evaluated with numerical simulations on a PC and with hardware-in-the-loop simulation, taking the Phaethon flyby in the DESTINY+ mission as an example. The validation results suggest that the proposed method can mitigate the misalignment-induced degradation of the optical navigation accuracy with reasonable computational costs suited for onboard computers.

[LG-58] KD2M: An unifying framework for feature knowledge distillation

链接: https://arxiv.org/abs/2504.01757
作者: Eduardo Fernandes Montesuma
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 2 figures, 1 table, under review

点击查看摘要

Abstract:Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher, towards a student neural net. This process is often done by matching the networks’ predictions (i.e., their output), but, recently several works have proposed to match the distributions of neural nets’ activations (i.e., their features), a process known as \emphdistribution matching. In this paper, we propose an unifying framework, Knowledge Distillation through Distribution Matching (KD ^2 M), which formalizes this strategy. Our contributions are threefold. We i) provide an overview of distribution metrics used in distribution matching, ii) benchmark on computer vision datasets, and iii) derive new theoretical results for KD.

[LG-59] A Causal Inference Framework for Data Rich Environments

链接: https://arxiv.org/abs/2504.01702
作者: Alberto Abadie,Anish Agarwal,Devavrat Shah
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We propose a formal model for counterfactual estimation with unobserved confounding in “data-rich” settings, i.e., where there are a large number of units and a large number of measurements per unit. Our model provides a bridge between the structural causal model view of causal inference common in the graphical models literature with that of the latent factor model view common in the potential outcomes literature. We show how classic models for potential outcomes and treatment assignments fit within our framework. We provide an identification argument for the average treatment effect, the average treatment effect on the treated, and the average treatment effect on the untreated. For any estimator that has a fast enough estimation error rate for a certain nuisance parameter, we establish it is consistent for these various causal parameters. We then show principal component regression is one such estimator that leads to consistent estimation, and we analyze the minimal smoothness required of the potential outcomes function for consistency.

[LG-60] Sparse Gaussian Neural Processes

链接: https://arxiv.org/abs/2504.01650
作者: Tommy Rochussen,Vincent Fortuin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Proceedings of the 7th Symposium on Advances in Approximate Bayesian Inference, PMLR, 2025. 25 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Despite significant recent advances in probabilistic meta-learning, it is common for practitioners to avoid using deep learning models due to a comparative lack of interpretability. Instead, many practitioners simply use non-meta-models such as Gaussian processes with interpretable priors, and conduct the tedious procedure of training their model from scratch for each task they encounter. While this is justifiable for tasks with a limited number of data points, the cubic computational cost of exact Gaussian process inference renders this prohibitive when each task has many observations. To remedy this, we introduce a family of models that meta-learn sparse Gaussian process inference. Not only does this enable rapid prediction on new tasks with sparse Gaussian processes, but since our models have clear interpretations as members of the neural process family, it also allows manual elicitation of priors in a neural process for the first time. In meta-learning regimes for which the number of observed tasks is small or for which expert domain knowledge is available, this offers a crucial advantage.

[LG-61] Density estimation via mixture discrepancy and moments

链接: https://arxiv.org/abs/2504.01570
作者: Zhengyang Lei,Sihong Shao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:With the aim of generalizing histogram statistics to higher dimensional cases, density estimation via discrepancy based sequential partition (DSP) has been proposed [D. Li, K. Yang, W. Wong, Advances in Neural Information Processing Systems (2016) 1099-1107] to learn an adaptive piecewise constant approximation defined on a binary sequential partition of the underlying domain, where the star discrepancy is adopted to measure the uniformity of particle distribution. However, the calculation of the star discrepancy is NP-hard and it does not satisfy the reflection invariance and rotation invariance either. To this end, we use the mixture discrepancy and the comparison of moments as a replacement of the star discrepancy, leading to the density estimation via mixture discrepancy based sequential partition (DSP-mix) and density estimation via moments based sequential partition (MSP), respectively. Both DSP-mix and MSP are computationally tractable and exhibit the reflection and rotation invariance. Numerical experiments in reconstructing the d -D mixture of Gaussians and Betas with d=2, 3, \dots, 6 demonstrate that DSP-mix and MSP both run approximately ten times faster than DSP while maintaining the same accuracy.

[LG-62] Incorporating Coupling Knowledge into Echo State Networks for Learning Spatiotemporally Chaotic Dynamics

链接: https://arxiv.org/abs/2504.01532
作者: Kuei-Jan Chu,Nozomi Akashi,Akihiro Yamamoto
类目: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG)
*备注: 16 pages, 12 figures

点击查看摘要

Abstract:Machine learning methods have shown promise in learning chaotic dynamical systems, enabling model-free short-term prediction and attractor reconstruction. However, when applied to large-scale, spatiotemporally chaotic systems, purely data-driven machine learning methods often suffer from inefficiencies, as they require a large learning model size and a massive amount of training data to achieve acceptable performance. To address this challenge, we incorporate the spatial coupling structure of the target system as an inductive bias in the network design. Specifically, we introduce physics-guided clustered echo state networks, leveraging the efficiency of the echo state networks as a base model. Experimental results on benchmark chaotic systems demonstrate that our physics-informed method outperforms existing echo state network models in learning the target chaotic systems. Additionally, our models exhibit robustness to noise in training data and remain effective even when prior coupling knowledge is imperfect. This approach has the potential to enhance other machine learning methods.

[LG-63] Multi-convex Programming for Discrete Latent Factor Models Prototyping

链接: https://arxiv.org/abs/2504.01431
作者: Hao Zhu,Shengchao Yan,Jasper Hoffmann,Joschka Boedecker
类目: Optimization and Control (math.OC); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete latent factor models (DLFMs) are widely used in various domains such as machine learning, economics, neuroscience, psychology, etc. Currently, fitting a DLFM to some dataset relies on a customized solver for individual models, which requires lots of effort to implement and is limited to the targeted specific instance of DLFMs. In this paper, we propose a generic framework based on CVXPY, which allows users to specify and solve the fitting problem of a wide range of DLFMs, including both regression and classification models, within a very short script. Our framework is flexible and inherently supports the integration of regularization terms and constraints on the DLFM parameters and latent factors, such that the users can easily prototype the DLFM structure according to their dataset and application scenario. We introduce our open-source Python implementation and illustrate the framework in several examples.

[LG-64] Initial Conditions from Galaxies: Machine-Learning Subgrid Correction to Standard Reconstruction

链接: https://arxiv.org/abs/2504.01092
作者: Liam Parker,Adrian E. Bayer,Uros Seljak
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:We present a hybrid method for reconstructing the primordial density from late-time halos and galaxies. Our approach involves two steps: (1) apply standard Baryon Acoustic Oscillation (BAO) reconstruction to recover the large-scale features in the primordial density field and (2) train a deep learning model to learn small-scale corrections on partitioned subgrids of the full volume. At inference, this correction is then convolved across the full survey volume, enabling scaling to large survey volumes. We train our method on both mock halo catalogs and mock galaxy catalogs in both configuration and redshift space from the Quijote 1(h^-1,\mathrmGpc)^3 simulation suite. When evaluated on held-out simulations, our combined approach significantly improves the reconstruction cross-correlation coefficient with the true initial density field and remains robust to moderate model misspecification. Additionally, we show that models trained on 1(h^-1,\mathrmGpc)^3 can be applied to larger boxes–e.g., (3h^-1,\mathrmGpc)^3 --without retraining. Finally, we perform a Fisher analysis on our method’s recovery of the BAO peak, and find that it significantly improves the error on the acoustic scale relative to standard BAO reconstruction. Ultimately, this method robustly captures nonlinearities and bias without sacrificing large-scale accuracy, and its flexibility to handle arbitrarily large volumes without escalating computational requirements makes it especially promising for large-volume surveys like DESI.

[LG-65] Denoising guarantees for optimized sampling schemes in compressed sensing

链接: https://arxiv.org/abs/2504.01046
作者: Yaniv Plan,Matthew S. Scott,Xia Sheng,Ozgur Yilmaz
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR)
*备注: 29 pages, 4 figures. Submitted for review to the SIAM Journal on Mathematics of Data Science (SIMODS). Author roles: Authors listed in alphabetic order. MS was primarily responsible for developing the theory, the sparsity-based numerics, and writing the paper. XS was primarily responsible for training the generative model and creating and presenting the related numerics

点击查看摘要

Abstract:Compressed sensing with subsampled unitary matrices benefits from \emphoptimized sampling schemes, which feature improved theoretical guarantees and empirical performance relative to uniform subsampling. We provide, in a first of its kind in compressed sensing, theoretical guarantees showing that the error caused by the measurement noise vanishes with an increasing number of measurements for optimized sampling schemes, assuming that the noise is Gaussian. We moreover provide similar guarantees for measurements sampled with-replacement with arbitrary probability weights. All our results hold on prior sets contained in a union of low-dimensional subspaces. Finally, we demonstrate that this denoising behavior appears in empirical experiments with a rate that closely matches our theoretical guarantees when the prior set is the range of a generative ReLU neural network and when it is the set of sparse vectors.

[LG-66] Estimating Unbounded Density Ratios: Applications in Error Control under Covariate Shift

链接: https://arxiv.org/abs/2504.01031
作者: Shuntuo Xu,Zhou Yu,Jian Huang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The density ratio is an important metric for evaluating the relative likelihood of two probability distributions, with extensive applications in statistics and machine learning. However, existing estimation theories for density ratios often depend on stringent regularity conditions, mainly focusing on density ratio functions with bounded domains and ranges. In this paper, we study density ratio estimators using loss functions based on least squares and logistic regression. We establish upper bounds on estimation errors with standard minimax optimal rates, up to logarithmic factors. Our results accommodate density ratio functions with unbounded domains and ranges. We apply our results to nonparametric regression and conditional flow models under covariate shift and identify the tail properties of the density ratio as crucial for error control across domains affected by covariate shift. We provide sufficient conditions under which loss correction is unnecessary and demonstrate effective generalization capabilities of a source estimator to any suitable target domain. Our simulation experiments support these theoretical findings, indicating that the source estimator can outperform those derived from loss correction methods, even when the true density ratio is known.

[LG-67] Fair Sufficient Representation Learning

链接: https://arxiv.org/abs/2504.01030
作者: Xueyu Zhou,Chun Yin IP,Jian Huang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages, 11 figures, and 6 tables (1 in the main text, 5 in the appendix)

点击查看摘要

Abstract:The main objective of fair statistical modeling and machine learning is to minimize or eliminate biases that may arise from the data or the model itself, ensuring that predictions and decisions are not unjustly influenced by sensitive attributes such as race, gender, age, or other protected characteristics. In this paper, we introduce a Fair Sufficient Representation Learning (FSRL) method that balances sufficiency and fairness. Sufficiency ensures that the representation should capture all necessary information about the target variables, while fairness requires that the learned representation remains independent of sensitive attributes. FSRL is based on a convex combination of an objective function for learning a sufficient representation and an objective function that ensures fairness. Our approach manages fairness and sufficiency at the representation level, offering a novel perspective on fair representation learning. We implement this method using distance covariance, which is effective for characterizing independence between random variables. We establish the convergence properties of the learned representations. Experiments conducted on healthcase and text datasets with diverse structures demonstrate that FSRL achieves a superior trade-off between fairness and accuracy compared to existing approaches.

信息检索

[IR-0] Is Less Really More? Fake News Detection with Limited Information

链接: https://arxiv.org/abs/2504.01922
作者: Zhaoyang Cao,John Nguyen,Reza Zafarani
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The threat that online fake news and misinformation pose to democracy, justice, public confidence, and especially to vulnerable populations, has led to a sharp increase in the need for fake news detection and intervention. Whether multi-modal or pure text-based, most fake news detection methods depend on textual analysis of entire articles. However, these fake news detection methods come with certain limitations. For instance, fake news detection methods that rely on full text can be computationally inefficient, demand large amounts of training data to achieve competitive accuracy, and may lack robustness across different datasets. This is because fake news datasets have strong variations in terms of the level and types of information they provide; where some can include large paragraphs of text with images and metadata, others can be a few short sentences. Perhaps if one could only use minimal information to detect fake news, fake news detection methods could become more robust and resilient to the lack of information. We aim to overcome these limitations by detecting fake news using systematically selected, limited information that is both effective and capable of delivering robust, promising performance. We propose a framework called SLIM Systematically-selected Limited Information) for fake news detection. In SLIM, we quantify the amount of information by introducing information-theoretic measures. SLIM leverages limited information to achieve performance in fake news detection comparable to that of state-of-the-art obtained using the full text. Furthermore, by combining various types of limited information, SLIM can perform even better while significantly reducing the quantity of information required for training compared to state-of-the-art language model-based fake news detection techniques.

[IR-1] Extending MovieLens-32M to Provide New Evaluation Objectives

链接: https://arxiv.org/abs/2504.01863
作者: Mark D. Smucker,Houmaan Chamani
类目: Information Retrieval (cs.IR)
*备注: Our extension to MovieLens-32M is available for researchers at this https URL

点击查看摘要

Abstract:Offline evaluation of recommender systems has traditionally treated the problem as a machine learning problem. In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don’t like, each user’s ratings are split into test and train sets, and the evaluation task becomes to predict the held out test data using the training data. This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them. This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie. As a resource available for download, we offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. Our primary objective is to predict the movies that a user would be interested in watching, i.e. predict their watchlist. To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools. Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools. In contrast, when we rank the runs by users’ interest in watching movies, we find that recommending popular movies as a recommendation algorithm becomes one of the worst performing runs. It appears that by asking users to assess their personal recommendations, we can alleviate the popularity bias issues created by using information retrieval effectiveness measures for the evaluation of recommender systems.

[IR-2] Comment Staytime Prediction with LLM -enhanced Comment Understanding WWW2025

链接: https://arxiv.org/abs/2504.01602
作者: Changshuo Zhang,Zihan Lin,Shukai Liu,Yongqi Liu,Han Li
类目: Information Retrieval (cs.IR)
*备注: Accepted by WWW 2025 Industry Track

点击查看摘要

Abstract:In modern online streaming platforms, the comments section plays a critical role in enhancing the overall user experience. Understanding user behavior within the comments section is essential for comprehensive user interest modeling. A key factor of user engagement is staytime, which refers to the amount of time that users browse and post comments. Existing watchtime prediction methods struggle to adapt to staytime prediction, overlooking interactions with individual comments and their interrelation. In this paper, we present a micro-video recommendation dataset with video comments (named as KuaiComt) which is collected from Kuaishou platform. correspondingly, we propose a practical framework for comment staytime prediction with LLM-enhanced Comment Understanding (LCU). Our framework leverages the strong text comprehension capabilities of large language models (LLMs) to understand textual information of comments, while also incorporating fine-grained comment ranking signals as auxiliary tasks. The framework is two-staged: first, the LLM is fine-tuned using domain-specific tasks to bridge the video and the comments; second, we incorporate the LLM outputs into the prediction model and design two comment ranking auxiliary tasks to better understand user preference. Extensive offline experiments demonstrate the effectiveness of our framework, showing significant improvements on the task of comment staytime prediction. Additionally, online A/B testing further validates the practical benefits on industrial scenario. Our dataset KuaiComt (this https URL) and code for LCU (this https URL) are fully released.

[IR-3] st-Time Alignment for Tracking User Interest Shifts in Sequential Recommendation

链接: https://arxiv.org/abs/2504.01489
作者: Changshuo Zhang,Xiao Zhang,Teng Shi,Jun Xu,Ji-Rong Wen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential recommendation is essential in modern recommender systems, aiming to predict the next item a user may interact with based on their historical behaviors. However, real-world scenarios are often dynamic and subject to shifts in user interests. Conventional sequential recommendation models are typically trained on static historical data, limiting their ability to adapt to such shifts and resulting in significant performance degradation during testing. Recently, Test-Time Training (TTT) has emerged as a promising paradigm, enabling pre-trained models to dynamically adapt to test data by leveraging unlabeled examples during testing. However, applying TTT to effectively track and address user interest shifts in recommender systems remains an open and challenging problem. Key challenges include how to capture temporal information effectively and explicitly identifying shifts in user interests during the testing phase. To address these issues, we propose T ^2 ARec, a novel model leveraging state space model for TTT by introducing two Test-Time Alignment modules tailored for sequential recommendation, effectively capturing the distribution shifts in user interest patterns over time. Specifically, T ^2 ARec aligns absolute time intervals with model-adaptive learning intervals to capture temporal dynamics and introduce an interest state alignment mechanism to effectively and explicitly identify the user interest shifts with theoretical guarantees. These two alignment modules enable efficient and incremental updates to model parameters in a self-supervised manner during testing, enhancing predictions for online recommendation. Extensive evaluations on three benchmark datasets demonstrate that T ^2 ARec achieves state-of-the-art performance and robustly mitigates the challenges posed by user interest shifts.

[IR-4] GeoRAG : A Question-Answering Approach from a Geographical Perspective

链接: https://arxiv.org/abs/2504.01458
作者: Jian Wang,Zhuo Zhao,Zheng Jie Wang,Bo Da Cheng,Lei Nie,Wen Luo,Zhao Yuan Yu,Ling Wang Yuan
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Geographic Question Answering (GeoQA) addresses natural language queries in geographic domains to fulfill complex user demands and improve information retrieval efficiency. Traditional QA systems, however, suffer from limited comprehension, low retrieval accuracy, weak interactivity, and inadequate handling of complex tasks, hindering precise information acquisition. This study presents GeoRAG, a knowledge-enhanced QA framework integrating domain-specific fine-tuning and prompt engineering with Retrieval-Augmented Generation (RAG) technology to enhance geographic knowledge retrieval accuracy and user interaction. The methodology involves four components: (1) A structured geographic knowledge base constructed from 3267 corpora (research papers, monographs, and technical reports), categorized via a multi-agent approach into seven dimensions: semantic understanding, spatial location, geometric morphology, attribute characteristics, feature relationships, evolutionary processes, and operational mechanisms. This yielded 145234 classified entries and 875432 multi-dimensional QA pairs. (2) A multi-label text classifier based on BERT-Base-Chinese, trained to analyze query types through geographic dimension classification. (3) A retrieval evaluator leveraging QA pair data to assess query-document relevance, optimizing retrieval precision. (4) GeoPrompt templates engineered to dynamically integrate user queries with retrieved information, enhancing response quality through dimension-specific prompting. Comparative experiments demonstrate GeoRAG’s superior performance over conventional RAG across multiple base models, validating its generalizability. This work advances geographic AI by proposing a novel paradigm for deploying large language models in domain-specific contexts, with implications for improving GeoQA systems scalability and accuracy in real-world applications.

[IR-5] Real-time Ad retrieval via LLM -generative Commercial Intention for Sponsored Search Advertising

链接: https://arxiv.org/abs/2504.01304
作者: Tongtong Liu,Zhaohui Wang,Meiyue Qin,Zenghui Lu,Xudong Chen,Yuekui Yang,Peng Shu
类目: Information Retrieval (cs.IR)
*备注: 13pages,5 figures

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with retrieval systems has shown promising potential in retrieving documents (docs) or advertisements (ads) for a given query. Existing LLM-based retrieval methods generate numeric or content-based DocIDs to retrieve docs/ads. However, the one-to-few mapping between numeric IDs and docs, along with the time-consuming content extraction, leads to semantic inefficiency and limits scalability in large-scale corpora. In this paper, we propose the Real-time Ad REtrieval (RARE) framework, which leverages LLM-generated text called Commercial Intentions (CIs) as an intermediate semantic representation to directly retrieve ads for queries in real-time. These CIs are generated by a customized LLM injected with commercial knowledge, enhancing its domain relevance. Each CI corresponds to multiple ads, yielding a lightweight and scalable set of CIs. RARE has been implemented in a real-world online system, handling daily search volumes in the hundreds of millions. The online implementation has yielded significant benefits: a 5.04% increase in consumption, a 6.37% rise in Gross Merchandise Volume (GMV), a 1.28% enhancement in click-through rate (CTR) and a 5.29% increase in shallow conversions. Extensive offline experiments show RARE’s superiority over ten competitive baselines in four major categories.

[IR-6] Migrating a Job Search Relevance Function AAAI2025

链接: https://arxiv.org/abs/2504.01284
作者: Bennett Mountain,Gabriel Womark,Ritvik Kharkar
类目: Information Retrieval (cs.IR)
*备注: Accepted at AAAI 2025 Computational Jobs Workshop

点击查看摘要

Abstract:In this paper, we describe the migration of a homebrewed C++ search engine to OpenSearch, aimed at preserving and improving search performance with minimal impact on business metrics. To facilitate the migration, we froze our job corpus and executed queries in low inventory locations to capture a representative mixture of high- and low-quality search results. These query-job pairs were labeled by crowd-sourced annotators using a custom rubric designed to reflect relevance and user satisfaction. Leveraging Bayesian optimization, we fine-tuned a new retrieval algorithm on OpenSearch, replicating key components of the original engine’s logic while introducing new functionality where necessary. Through extensive online testing, we demonstrated that the new system performed on par with the original, showing improvements in specific engagement metrics, with negligible effects on revenue.

[IR-7] Information Retrieval for Climate Impact

链接: https://arxiv.org/abs/2504.01162
作者: Maarten de Rijke,Bart van den Hurk,Flora Salim,Alaa Al Khourdajie,Nan Bai,Renato Calzone,Declan Curran,Getnet Demil,Lesley Frew,Noah Gießing,Mukesh Kumar Gupta,Maria Heuss,Sanaa Hobeichi,David Huard,Jingwei Kang,Ana Lucic,Tanwi Mallick,Shruti Nath,Andrew Okem,Barbara Pernici,Thilina Rajapakse,Hira Saleem,Harry Scells,Nicole Schneider,Damiano Spina,Yuanyuan Tian,Edmund Totin,Andrew Trotman,Ramamurthy Valavandan,Dereje Workneh,Yangxinyu Xie
类目: Information Retrieval (cs.IR)
*备注: Report on the MANILA24 Workshop

点击查看摘要

Abstract:The purpose of the MANILA24 Workshop on information retrieval for climate impact was to bring together researchers from academia, industry, governments, and NGOs to identify and discuss core research problems in information retrieval to assess climate change impacts. The workshop aimed to foster collaboration by bringing communities together that have so far not been very well connected – information retrieval, natural language processing, systematic reviews, impact assessments, and climate science. The workshop brought together a diverse set of researchers and practitioners interested in contributing to the development of a technical research agenda for information retrieval to assess climate change impacts.

[IR-8] Beyond Quacking: Deep Integration of Language Models and RAG into DuckDB

链接: https://arxiv.org/abs/2504.01157
作者: Anas Dorbani,Sunny Yasser,Jimmy Lin,Amine Mhedhbi
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Knowledge-intensive analytical applications retrieve context from both structured tabular data and unstructured, text-free documents for effective decision-making. Large language models (LLMs) have made it significantly easier to prototype such retrieval and reasoning data pipelines. However, implementing these pipelines efficiently still demands significant effort and has several challenges. This often involves orchestrating heterogeneous data systems, managing data movement, and handling low-level implementation details, e.g., LLM context management. To address these challenges, we introduce FlockMTL: an extension for DBMSs that deeply integrates LLM capabilities and retrieval-augmented generation (RAG). FlockMTL includes model-driven scalar and aggregate functions, enabling chained predictions through tuple-level mappings and reductions. Drawing inspiration from the relational model, FlockMTL incorporates: (i) cost-based optimizations, which seamlessly apply techniques such as batching and caching; and (ii) resource independence, enabled through novel SQL DDL abstractions: PROMPT and MODEL, introduced as first-class schema objects alongside TABLE. FlockMTL streamlines the development of knowledge-intensive analytical applications, and its optimizations ease the implementation burden. Subjects: Databases (cs.DB); Information Retrieval (cs.IR) Cite as: arXiv:2504.01157 [cs.DB] (or arXiv:2504.01157v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2504.01157 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-04-03

目录

概览 (2025-04-03)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载