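This post lists the latest papers retrieved from arXiv.org on 2025-04-18. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically every day at around 12:00.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.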

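Contents

Overview (2025-04-18)

444 papers were updated today, including:

  • Natural Language Processing: 112 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 138 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 121 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 117 papers (Machine Learning (cs.LG))

Natural Language Processing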

[NLP-0] SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs

[Quick read]: This paper addresses the lack of semantic information in generative approaches to cross-modal retrieval (CMR), in particular the weak semantic understanding in identifier construction and generation. To address this, it proposes a unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE). The key is a Structured natural language IDentifier (SID), which aligns target identifiers with generative models optimized for natural language comprehension and generation, combined with a Generative Semantic Verification (GSV) strategy for fine-grained target discrimination. According to the authors, SemCORE is also the first generative cross-modal retrieval framework to consider text-to-image and image-to-text retrieval together. Experiments show that it outperforms existing generative methods on several benchmark datasets, with an average gain of 8.65 points in Recall@1 for text-to-image retrieval.

Link: https://arxiv.org/abs/2504.13172
Authors: Haoxuan Li, Yi Bin, Yunshan Ma, Guoqing Wang, Yang Yang, See-Kiong Ng, Tat-Seng Chua
Affiliations: University of Electronic Science and Technology of China (Chengdu, China); Tongji University (Shanghai, China); Singapore Management University (Singapore); National University of Singapore (Singapore)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still face semantic information insufficiency in both identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash the semantic understanding capabilities in generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
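
The Recall@1 figure quoted above is a standard retrieval metric: the share of queries whose correct target is ranked first. The snippet below is a minimal sketch of that metric, assuming exactly one relevant target per query. The function name `recall_at_k` and the toy query and image ids are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, str], k: int = 1) -> float:
    """Fraction of queries whose relevant target appears in the top-k results."""
    if not ranked:
        return 0.0
    hits = sum(1 for query, results in ranked.items() if relevant[query] in results[:k])
    return hits / len(ranked)


# Hypothetical toy data: query id -> ranked image ids, plus the one correct image per query.
ranked = {
    "q1": ["img3", "img1", "img2"],
    "q2": ["img2", "img5", "img4"],
    "q3": ["img9", "img7", "img8"],
    "q4": ["img6", "img4", "img0"],
}
relevant = {"q1": "img3", "q2": "img5", "q3": "img7", "q4": "img6"}

r1 = recall_at_k(ranked, relevant, k=1)
r2 = recall_at_k(ranked, relevant, k=2)
print(f"Recall@1 = {100 * r1:.1f}, Recall@2 = {100 * r2:.1f}")
```

Retrieval papers usually report this value on a 0 to 100 scale, which is why gains are quoted in "points".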

[NLP-1] Sleep-time Compute: Beyond Inference Scaling at Test-time

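[Quick read]: This paper addresses the high latency and inference cost that large language models (LLMs) incur when test-time compute is scaled up to solve difficult problems. It proposes sleep-time compute, which lets a model "think" offline about a context before any user query arrives. By anticipating likely queries and pre-computing useful quantities, the method can substantially reduce the compute needed at test time. On Stateful GSM-Symbolic and Stateful AIME, sleep-time compute cuts the test-time compute needed to reach the same accuracy by about 5x, and scaling sleep-time compute further raises accuracy by up to 13% on Stateful GSM-Symbolic and up to 18% on Stateful AIME. The paper also introduces Multi-Query GSM-Symbolic, which has multiple related queries per context; amortizing sleep-time compute across those queries lowers the average cost per query by 2.5x. Further analysis finds that the predictability of the user query correlates well with the efficacy of sleep-time compute. Finally, the paper presents a case study applying the method to a realistic agentic SWE task.

Link: https://arxiv.org/abs/2504.13171
Authors: Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
Affiliations: Letta; University of California, Berkeley
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code and data released at: this https URL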

Abstract:Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to “think” offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
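
The amortization argument can be illustrated with a little arithmetic. The sketch below assumes a deliberately simple cost model in which one offline pass per context is shared by every query on that context and all compute is priced equally. The function `average_cost_per_query` and every number in it are hypothetical; the paper's own cost accounting may differ.

```python
def average_cost_per_query(sleep_cost: float, test_cost: float, num_queries: int) -> float:
    """Average cost per query when one sleep-time pass is shared by all queries on a context."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + test_cost


# Hypothetical costs in arbitrary units (not taken from the paper).
baseline_test_cost = 10.0  # test-time cost per query without sleep-time compute
sleep_cost = 12.0          # one-off offline cost per context
reduced_test_cost = 2.0    # test-time cost per query after sleep-time compute

for n in (1, 2, 4, 10):
    avg = average_cost_per_query(sleep_cost, reduced_test_cost, n)
    print(f"{n} queries per context: average cost {avg:.2f}, baseline {baseline_test_cost:.2f}")
```

Under these made-up numbers a single query per context costs more than the baseline, and the offline pass only pays off once several related queries share the same context, which is the situation Multi-Query GSM-Symbolic is built to measure.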

[NLP-2] CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

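[Quick read]: This paper addresses the challenge of optimizing the pre-training data mixture. A well-designed mixture can substantially improve pre-training performance, but widely used web-scale datasets such as Common Crawl lack explicit domain labels, and manually curating labeled datasets such as The Pile is labor-intensive. How to automatically discover and optimize a data mixture for large-scale language model pre-training therefore remains an open problem.

To address this, the paper proposes CLIMB (CLustering-based Iterative Data Mixture Bootstrapping), an automated framework. Its key idea is to embed and cluster large-scale datasets in a semantic space and then iteratively search for the optimal mixture using a small proxy model and a predictor, so that data mixtures are discovered, evaluated, and refined without manual domain labels. When continuously trained on 400B tokens with the resulting mixture, a 1B-parameter model exceeds the state-of-the-art Llama-3.2-1B by 2.0%, and optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. The paper also releases ClimbLab and ClimbMix as resources for further research.

Link: https://arxiv.org/abs/2504.13161
Authors: Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan (Celine) Lin, Jan Kautz, Pavlo Molchanov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 9 figures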

Abstract:Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: this https URL

[NLP-3] MIB: A Mechanistic Interpretability Benchmark

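[Quick read]: This paper asks how to tell whether new mechanistic interpretability methods achieve real improvements. In pursuit of meaningful and lasting evaluation standards, it proposes MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components, and the connections between them, most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector and locate model features for a causal variable relevant to the task. Using MIB, the authors find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, the supervised distributed alignment search (DAS) method performs best, while sparse autoencoder (SAE) features are not better than neurons. These findings suggest that MIB enables meaningful comparisons between methods and increase confidence that the field has made real progress.

Link: https://arxiv.org/abs/2504.13151
Authors: Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: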

Abstract:How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.

[NLP-4] Antidistillation Sampling

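[Quick read]: This paper addresses the vulnerability that frontier models generating extended reasoning traces produce rich token sequences that can facilitate model distillation. Its key proposal is antidistillation sampling, which strategically modifies the model's next-token probability distribution to poison the reasoning traces, making them significantly less effective for distillation while preserving the model's practical utility.

Link: https://arxiv.org/abs/2504.13146
Authors: Yash Savani, Asher Trockman, Zhili Feng, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
Affiliations: Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: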

Abstract:Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model’s next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model’s practical utility. For further details, see this https URL.

[NLP-5] Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

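[Quick read]: This paper addresses the problem of generating text that conforms to syntactic or semantic constraints, which arises across a wide range of language model (LM) applications. Imposing such constraints can be framed as probabilistic conditioning, but the resulting distribution can differ substantially from the LM's base distribution, and exact sampling from it is generally intractable. The key contribution is an architecture for controlled LM generation based on sequential Monte Carlo (SMC). It allows domain- and problem-specific constraints to be incorporated flexibly at inference time and reallocates computation efficiently in light of new information during generation. Experiments on four challenging domains (Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis) show that, with little overhead, the approach allows small open-source language models to outperform models over 8x larger as well as closed-source, fine-tuned ones. The performance gains are driven by better approximation to the posterior distribution. The system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to tackle a broad variety of controlled generation problems.

Link: https://arxiv.org/abs/2504.13139
Authors: João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O’Donnell
Affiliations: MIT; ETH Zürich; McGill; Mila; Johns Hopkins; Yale; ISI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages, 4 figures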

Abstract:A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution – which can differ substantially from the LM’s base distribution – is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains – Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis – we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8x larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
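
To make the propose, weight, resample loop behind SMC concrete, here is a minimal bootstrap-style sketch. It is not the paper's system and does not use the probabilistic programming language of Lew et al. (2023). The toy context-free "language model" (`VOCAB`, `LM_PROBS`), the constraint `constraint_ok`, and the function `smc_generate` are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import random
from typing import List, Sequence

VOCAB = ["a", "b", "c"]
LM_PROBS = [0.6, 0.3, 0.1]  # toy context-free "language model" (an assumption)


def constraint_ok(tokens: Sequence[str]) -> bool:
    """Toy constraint: no token may be immediately repeated."""
    return all(x != y for x, y in zip(tokens, tokens[1:]))


def smc_generate(num_particles: int, length: int, rng: random.Random) -> List[List[str]]:
    """Bootstrap SMC: propose from the LM, weight by the constraint, resample."""
    particles: List[List[str]] = [[] for _ in range(num_particles)]
    for _ in range(length):
        # Propose: extend every particle with one token sampled from the LM.
        for particle in particles:
            particle.append(rng.choices(VOCAB, weights=LM_PROBS)[0])
        # Weight: potential is 1 if the prefix still satisfies the constraint, else 0.
        weights = [1.0 if constraint_ok(p) else 0.0 for p in particles]
        if sum(weights) == 0.0:
            raise RuntimeError("all particles violated the constraint; use more particles")
        # Resample: shift computation from dead particles to surviving ones.
        chosen = rng.choices(particles, weights=weights, k=num_particles)
        particles = [list(p) for p in chosen]  # copy so duplicates do not alias
    return particles


rng = random.Random(0)
samples = smc_generate(num_particles=200, length=6, rng=rng)
print("".join(samples[0]))
```

Resampling after every step is what reallocates computation: particles that have already violated the constraint get zero weight and are replaced by copies of the ones that can still succeed. A real system would use softer potentials and better proposals than this 0/1 check.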

[NLP-6] Energy-Based Reward Models for Robust Language Model Alignment

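[Quick read]: This paper addresses the difficulty existing reward models (RMs) have in capturing complex human preferences and generalizing to unseen data. It proposes the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework. The key to EBRM is modeling the reward distribution explicitly. Through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization, it captures uncertainty in human preferences and mitigates the impact of noisy or misaligned annotations, improving RM robustness and generalization without retraining.

Link: https://arxiv.org/abs/2504.13134
Authors: Anamika Lochab, Ruqi Zhang
Affiliations: Department of Computer Science, Purdue University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: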

Abstract:Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results demonstrate our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.

[NLP-7] FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

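[Quick read]: This paper addresses the difficulty of building benchmark datasets for information retrieval (IR), in particular the lack of high-quality evaluation benchmarks for fast-growing, recent, and niche topics. It proposes FreshStack, a reusable framework that automatically builds retrieval evaluation benchmarks from community-asked questions and answers. The key steps are: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. Experiments show that existing retrieval models, applied out of the box, significantly underperform oracle approaches, leaving plenty of headroom to improve IR quality. In addition, on two of the five topics rerankers do not clearly improve first-stage retrieval accuracy, which points to directions for improving retrieval models. FreshStack offers a realistic, scalable, and uncontaminated way to construct IR and retrieval-augmented generation (RAG) evaluation benchmarks.

Link: https://arxiv.org/abs/2504.13128
Authors: Nandan Thakur, Jimmy Lin, Sam Havens, Michael Carbin, Omar Khattab, Andrew Drozdov
Affiliations: University of Waterloo; Databricks; San Francisco
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: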

Abstract:We introduce FreshStack, a reusable framework for automatically building information retrieval (IR) evaluation benchmarks from community-asked questions and answers. FreshStack conducts the following steps: (1) automatic corpus collection from code and technical documentation, (2) nugget generation from community-asked questions and answers, and (3) nugget-level support, retrieving documents using a fusion of retrieval techniques and hybrid architectures. We use FreshStack to build five datasets on fast-growing, recent, and niche topics to ensure the tasks are sufficiently challenging. On FreshStack, existing retrieval models, when applied out-of-the-box, significantly underperform oracle approaches on all five topics, denoting plenty of headroom to improve IR quality. In addition, we identify cases where rerankers do not clearly improve first-stage retrieval accuracy (two out of five topics). We hope that FreshStack will facilitate future work toward constructing realistic, scalable, and uncontaminated IR and RAG evaluation benchmarks. FreshStack datasets are available at: this https URL.
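
The abstract mentions "a fusion of retrieval techniques" without naming the fusion rule. As one possible illustration, the sketch below uses reciprocal rank fusion (RRF), a common way to merge ranked lists from different retrievers. RRF is swapped in here as an assumption and is not confirmed to be what FreshStack uses; the function name, the constant `k = 60`, and the document ids are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a lexical retriever and a dense retriever for one query.
lexical = ["d1", "d2", "d3"]
dense = ["d2", "d3", "d1"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

Because RRF only looks at ranks, it can combine retrievers whose raw scores are on incompatible scales, which is one reason it is a popular default for hybrid retrieval.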

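[NLP-8] LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard

[Quick read]: This paper explores the application of large language models (LLMs) to financial tasks. Using the Open FinLLM Leaderboard as a benchmark, it fine-tunes foundation models to study how to improve LLM performance in the financial domain. The key is a set of fine-tuning techniques, including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL), used to strengthen the models' financial capabilities. The fine-tuned models show substantial gains across a wide range of financial tasks, and the paper also measures the data scaling law in the financial domain.

Link: https://arxiv.org/abs/2504.13125
Authors: Varun Rao, Youran Sun, Mahendra Kumar, Tejas Mutneja, Agastya Mukherjee, Haizhao Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: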

Abstract:This paper investigates the application of large language models (LLMs) to financial tasks. We fine-tuned foundation models using the Open FinLLM Leaderboard as a benchmark. Building on Qwen2.5 and Deepseek-R1, we employed techniques including supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning (RL) to enhance their financial capabilities. The fine-tuned models demonstrated substantial performance gains across a wide range of financial tasks. Moreover, we measured the data scaling law in the financial domain. Our work demonstrates the potential of large language models (LLMs) in financial applications.

[NLP-9] Probing and Inducing Combinational Creativity in Vision-Language Models

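[Quick read]: This paper addresses how to evaluate and improve the combinational creativity of Vision-Language Models (VLMs). It notes that it remains unsettled whether the outputs of models such as GPT-4V and DALLE-3 reflect a genuine ability to generate novel ideas by combining existing concepts, or only sophisticated pattern matching of training data. The key contribution is the Identification-Explanation-Implication (IEI) framework, which decomposes the creative process into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. The framework is validated with CreativeMashup, a high-quality dataset. Experiments show that in comprehension tasks the best VLMs have surpassed average human performance but still fall short of expert-level understanding, and that in generation tasks incorporating the IEI framework significantly improves the creative quality of VLM outputs. The work provides both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

Link: https://arxiv.org/abs/2504.13120
Authors: Yongqian Peng, Yuxi Ma, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu, Zilong Zheng
Affiliations: Institute for Artificial Intelligence, Peking University; Yuanpei College, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI; Center on Frontiers of Computing Studies, School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page: this https URL. The first two authors contributed equally.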

Abstract:The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity–defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts–or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs from the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLMs outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

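[NLP-10] Tackling Social Bias against the Poor: A Dataset and Taxonomy on Aporophobia NAACL2025

[Quick read]: This paper addresses the difficulty of identifying and tracking aporophobia, the societal bias against people living in poverty, on social media; this bias is a major obstacle to designing, approving, and implementing poverty-mitigation policies. Working with non-profits and governmental organizations, the authors build and manually annotate a corpus of English tweets from five world regions to comprehensively characterize social media discourse related to bias and discrimination against the poor. Based on the annotated data, they devise a taxonomy of aporophobic attitudes and actions, and they train several classifiers to identify the main challenges for automatic detection of aporophobia. The work lays the groundwork for identifying, tracking, and mitigating aporophobic views on social media at scale.

Link: https://arxiv.org/abs/2504.13085
Authors: Georgina Curto, Svetlana Kiritchenko, Muhammad Hammad Fahim Siddiqui, Isar Nejadgholi, Kathleen C. Fraser
Affiliations: United Nations University Institute in Macau; National Research Council Canada; University of Ottawa
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: In Findings of the Association for Computational Linguistics: NAACL 2025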

Abstract:Eradicating poverty is the first goal in the United Nations Sustainable Development Goals. However, aporophobia – the societal bias against people living in poverty – constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.

[NLP-11] Retrieval-Augmented Generation with Conflicting Evidence

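[Quick read]: This paper addresses several challenges that large language model (LLM) agents face when using retrieval-augmented generation (RAG): ambiguous user queries, potentially conflicting information from multiple sources, and inaccurate information from noisy or irrelevant documents that must be suppressed. Prior work has generally studied these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. This paper instead considers multiple factors simultaneously.

The key contributions are: (i) RAMDocs, a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. Experiments show that MADAM-RAG improves over strong RAG baselines by up to 11.40% on AmbigDocs and by up to 15.80% (absolute) on FaithEval. RAMDocs also poses a challenge for existing RAG baselines; MADAM-RAG begins to address these conflicting factors, but a substantial gap remains, especially as the imbalance between supporting evidence and misinformation increases.

Link: https://arxiv.org/abs/2504.13079
Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Affiliations: University of North Carolina at Chapel Hill
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Our data and code is available at: this https URL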

Abstract:Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs – which requires presenting all valid answers for ambiguous queries – improving over strong RAG baselines by up to 11.40% and on FaithEval – which requires suppressing misinformation – where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.

[NLP-12] Accuracy is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models

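[Quick read]: This paper explores the relationship between the accuracy of deep learning (DL) models and their agreement with domain experts when classifying crash narratives. It compares five DL models (including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier) and four large language models (LLMs: GPT-4, LLaMA 3, Qwen, and Claude) against expert-labeled data and finds a counterintuitive trend: models with higher technical accuracy often agree less with experts, whereas LLMs show greater expert alignment despite relatively lower accuracy. The key is to quantify agreement with Cohen's Kappa and to interpret model behavior with Principal Component Analysis (PCA) and SHAP-based explainability techniques, which indicate that expert-aligned models rely more on contextual and temporal language cues than on location-specific keywords. The study stresses that accuracy alone is insufficient for evaluating models in safety-critical natural language processing (NLP) applications, recommends expert agreement as a complementary metric in evaluation frameworks, and highlights the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.

Link: https://arxiv.org/abs/2504.13068
Authors: Sudesh Ramesh Bhagat, Ibne Farabi Shihab, Anuj Sharma
Affiliations: Department of Civil, Construction and Environmental Engineering, Iowa State University, Ames, IA, USA; Department of Computer Science, Iowa State University, Ames, IA, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: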

Abstract:This study explores the relationship between deep learning (DL) model accuracy and expert agreement in the classification of crash narratives. We evaluate five DL models – including BERT variants, the Universal Sentence Encoder (USE), and a zero-shot classifier – against expert-labeled data and narrative text. The analysis is further extended to four large language models (LLMs): GPT-4, LLaMA 3, Qwen, and Claude. Our results reveal a counterintuitive trend: models with higher technical accuracy often exhibit lower agreement with domain experts, whereas LLMs demonstrate greater expert alignment despite relatively lower accuracy scores. To quantify and interpret model-expert agreement, we employ Cohen’s Kappa, Principal Component Analysis (PCA), and SHAP-based explainability techniques. Findings indicate that expert-aligned models tend to rely more on contextual and temporal language cues, rather than location-specific keywords. These results underscore that accuracy alone is insufficient for evaluating models in safety-critical NLP applications. We advocate for incorporating expert agreement as a complementary metric in model evaluation frameworks and highlight the promise of LLMs as interpretable, scalable tools for crash analysis pipelines.
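
Cohen's Kappa corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement implied by each rater's label frequencies. The sketch below implements that textbook formula. The function name `cohens_kappa` and the toy labels are hypothetical, and the example is a generic illustration rather than a reproduction of the paper's analysis.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters always use one identical label (handled by assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: an imbalanced task where a model always predicts "no".
expert = ["no"] * 9 + ["yes"]
always_no = ["no"] * 10
accuracy = sum(x == y for x, y in zip(expert, always_no)) / len(expert)
kappa_imbalanced = cohens_kappa(expert, always_no)

# Hypothetical balanced case with 8 of 10 matching labels.
expert_bal = ["yes"] * 5 + ["no"] * 5
model_bal = ["yes"] * 4 + ["no"] * 5 + ["yes"]
kappa_balanced = cohens_kappa(expert_bal, model_bal)

print(accuracy, kappa_imbalanced, round(kappa_balanced, 3))
```

In the imbalanced toy case the raw agreement with the expert labels is 0.9 while kappa is 0, which shows in miniature why an agreement statistic can tell a different story from accuracy.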

[NLP-13] RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins CVPR2025

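[Quick read]: This paper addresses two obstacles to developing dual-arm coordination and complex object manipulation in robotics: the scarcity of diverse, high-quality demonstration data and the lack of real-world-aligned evaluation benchmarks. It proposes RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform. RoboTwin creates varied digital twins of objects from single 2D images to build realistic, interactive scenarios, and it introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. The framework offers a comprehensive benchmark with both simulated and real-world data, enabling better alignment between simulated training and real-world performance. In validation on the open-source COBOT Magic Robot platform, policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples improve success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.

Link: https://arxiv.org/abs/2504.13059
Authors: Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, Ping Luo
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: CVPR 2025 Highlight. 22 pages. Project page: this https URL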

Abstract:In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.

[NLP-14] Aspect-Based Summarization with Self-Aspect Retrieval Enhanced Generation

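[Quick read]: This paper addresses two main challenges in aspect-based summarization: the resource constraints and limited generalizability of traditional approaches, and the tendency of large language models used without training to rely excessively on prompt engineering and to run into token limits and hallucination, especially with in-context learning. It proposes a new framework, Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, the framework uses an embedding-driven retrieval mechanism to identify the text segments relevant to a given aspect, which extracts the pertinent content while avoiding unnecessary details and so mitigates the token-limit problem. It also optimizes token usage by deleting unrelated parts of the text and ensures that the model generates output strictly based on the given aspect.

Link: https://arxiv.org/abs/2504.13054
Authors: Yichao Feng, Shuai Zhao, Yueqiu Li, Luwei Xiao, Xiaobao Wu, Anh Tuan Luu
Affiliations: College of Computing and Data Science, Nanyang Technological University; School of Humanities, Nanyang Technological University; School of Computer Science and Technology, East China Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: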

Abstract:Aspect-based summarization aims to generate summaries tailored to specific aspects, addressing the resource constraints and limited generalizability of traditional summarization approaches. Recently, large language models have shown promise in this task without the need for training. However, they rely excessively on prompt engineering and face token limits and hallucination challenges, especially with in-context learning. To address these challenges, in this paper, we propose a novel framework for aspect-based summarization: Self-Aspect Retrieval Enhanced Summary Generation. Rather than relying solely on in-context learning, given an aspect, we employ an embedding-driven retrieval mechanism to identify its relevant text segments. This approach extracts the pertinent content while avoiding unnecessary details, thereby mitigating the challenge of token limits. Moreover, our framework optimizes token usage by deleting unrelated parts of the text and ensuring that the model generates output strictly based on the given aspect. With extensive experiments on benchmark datasets, we demonstrate that our framework not only achieves superior performance but also effectively mitigates the token limitation problem.
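
The retrieval step can be pictured as ranking text segments by similarity to the aspect and keeping only the top few before prompting the model. The sketch below assumes a bag-of-words count vector as a stand-in for a learned embedding, so it is a simplified illustration and not the paper's encoder or pipeline. The names `embed`, `cosine`, and `retrieve_segments` and the sample segments are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (not a neural encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_segments(aspect: str, segments: List[str], top_k: int = 2) -> List[str]:
    """Return the top_k segments most similar to the aspect description."""
    query = embed(aspect)
    ranked = sorted(segments, key=lambda s: cosine(query, embed(s)), reverse=True)
    return ranked[:top_k]


# Hypothetical review text split into segments.
segments = [
    "The battery lasts two days on a single charge.",
    "The screen is bright and sharp outdoors.",
    "Charging the battery takes about one hour.",
    "Shipping was slow and the box arrived damaged.",
]
selected = retrieve_segments("battery charge", segments, top_k=2)
print(selected)
```

Only the selected segments would then be placed in the summarization prompt, which is how retrieval keeps the input within the token budget and focused on the requested aspect.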

[NLP-15] How Large Language Models Are Changing MOOC Essay Answers: A Comparison of Pre- and Post-LLM Responses

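[Quick Read]: This paper quantifies the impact of the rise of Large Language Models (LLMs) on online education, in particular how the release of ChatGPT affected the length, style, and topical content of student essays. It analyzes a multi-year dataset of student essays from a free university-level MOOC on AI ethics, which includes essays submitted both before and after ChatGPT's release. The study finds that ChatGPT's release coincided with significant changes in essay length and style, and observes changes in the prevalence of keywords related to AI and LLMs, while topic modeling shows that the core themes discussed in the essays did not change fundamentally. The key to the approach is combining a multi-year dataset with quantitative analysis to systematically assess how LLMs affect student writing behavior in online education.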

Link: https://arxiv.org/abs/2504.13038
Authors: Leo Leppänen,Lili Aunimo,Arto Hellas,Jukka K. Nurminen,Linda Mannila
Affiliations: University of Helsinki; Haaga-Helia University of Applied Sciences; Aalto University
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 10 pages, 4 figures

Abstract:The release of ChatGPT in late 2022 caused a flurry of activity and concern in the academic and educational communities. Some see the tool’s ability to generate human-like text that passes at least cursory inspections for factual accuracy "often enough" as a golden age of information retrieval and computer-assisted learning. Some, on the other hand, worry the tool may lead to unprecedented levels of academic dishonesty and cheating. In this work, we quantify some of the effects of the emergence of Large Language Models (LLMs) on online education by analyzing a multi-year dataset of student essay responses from a free university-level MOOC on AI ethics. Our dataset includes essays submitted both before and after ChatGPT’s release. We find that the launch of ChatGPT coincided with significant changes in both the length and style of student essays, mirroring observations in other contexts such as academic publishing. We also observe – as expected based on related public discourse – changes in prevalence of key content words related to AI and LLMs, but not necessarily the general themes or topics discussed in the student essays as identified through (dynamic) topic modeling.

[NLP-16] ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology Using Whole Slide Images

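[Quick Read]: This paper addresses the limited applicability of Large Language Models (LLMs) in pathology caused by their lack of understanding of the full clinical context. Existing multimodal LLMs trained on patch-level data, with the limited information available in public datasets, struggle to understand complex clinical scenarios in pathology. The proposed solution is to develop Multimodal Large Language Models (MLLMs) at the level of Whole Slide Images (WSIs). Specifically, the authors introduce ChatEXAONEPath, an expert-level MLLM, build a retrieval-based data generation pipeline from 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA), and design an AI-based evaluation protocol for comprehensively understanding medical context from multimodal information. Experiments show that the model can diagnose given histopathology images with an acceptance rate of 62.9% (on 1,134 WSI-report pairs) and can understand pan-cancer WSIs and clinical context across cancer types, giving it the potential to assist clinicians.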

Link: https://arxiv.org/abs/2504.13023
Authors: Sangwook Kim,Soonyoung Lee,Jongseong Jang
Affiliations: LG AI Research
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have made significant progress in developing large language models (LLMs) in the medical domain, which can answer expert-level questions and demonstrate the potential to assist clinicians in real-world clinical scenarios. Studies have also witnessed the importance of integrating various modalities with the existing LLMs for a better understanding of complex clinical contexts, which are innately multi-faceted by nature. Although studies have demonstrated the ability of multimodal LLMs in histopathology to answer questions from given images, they lack in understanding of thorough clinical context due to the patch-level data with limited information from public datasets. Thus, developing WSI-level MLLMs is significant in terms of the scalability and applicability of MLLMs in histopathology. In this study, we introduce an expert-level MLLM for histopathology using WSIs, dubbed as ChatEXAONEPath. We present a retrieval-based data generation pipeline using 10,094 pairs of WSIs and histopathology reports from The Cancer Genome Atlas (TCGA). We also showcase an AI-based evaluation protocol for a comprehensive understanding of the medical context from given multimodal information and evaluate generated answers compared to the original histopathology reports. We demonstrate the ability of diagnosing the given histopathology images using ChatEXAONEPath with the acceptance rate of 62.9% from 1,134 pairs of WSIs and reports. Our proposed model can understand pan-cancer WSIs and clinical context from various cancer types. We argue that our proposed model has the potential to assist clinicians by comprehensively understanding complex morphology of WSIs for cancer diagnosis through the integration of multiple modalities.

[NLP-17] SHA256 at SemEval-2025 Task 4: Selective Amnesia – Constrained Unlearning for Large Language Models via Knowledge Isolation SEMEVAL

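[Quick Read]: This paper addresses the problem that Large Language Models (LLMs) often memorize sensitive information during training, which poses risks when models are deployed publicly. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capability. The paper presents a solution to the SemEval-2025 Task 4 targeted unlearning task: a two-stage method that combines causal mediation analysis with layer-specific optimization. Systematic causal tracing experiments on OLMo architectures (1B and 7B parameters) show that the first few transformer layers (layers 0-5) play a key role in storing subject-attribute associations within MLP modules. Building on this insight, the authors develop a constrained optimization approach that freezes the upper layers and applies a novel joint loss to the lower layers, maximizing forget-set loss through output-token cross-entropy penalties while minimizing retain-set deviation through adaptive regularization. The method placed 2nd in the 1B model track while maintaining 88% of baseline MMLU accuracy, supporting causal-informed layer optimization as a promising paradigm for efficient, precise unlearning.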

Link: https://arxiv.org/abs/2504.12996
Authors: Saransh Agrawal,Kuan-Hao Huang
Affiliations: Texas A&M University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, In Proceedings of The 19th International Workshop on Semantic Evaluation (SemEval), 2025

Abstract:Large language models (LLMs) frequently memorize sensitive information during training, posing risks when deploying publicly accessible models. Current machine unlearning methods struggle to selectively remove specific data associations without degrading overall model capabilities. This paper presents our solution to SemEval-2025 Task 4 on targeted unlearning, which introduces a two-stage methodology that combines causal mediation analysis with layer-specific optimization. Through systematic causal tracing experiments on OLMo architectures (1B and 7B parameters), we identify the critical role of the first few transformer layers (layers 0-5) in storing subject-attribute associations within MLP modules. Building on this insight, we develop a constrained optimization approach that freezes upper layers while applying a novel joint loss function to lower layers-simultaneously maximizing forget set loss via output token cross-entropy penalties and minimizing retain set deviation through adaptive regularization. Our method achieves 2nd place in the 1B model track, demonstrating strong task performance while maintaining 88% of baseline MMLU accuracy. These results establish causal-informed layer optimization as a promising paradigm for efficient, precise unlearning in LLMs, offering a significant step forward in addressing data privacy concerns in AI systems.

[NLP-18] Accommodate Knowledge Conflicts in Retrieval-augmented LLMs: Towards Reliable Response Generation in the Wild

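[Quick Read]: This paper addresses knowledge conflicts faced by Large Language Models (LLMs) in information retrieval systems, especially in response generation (RG), where misinformation, bias, or outdated knowledge creates contradictions between the model's internal memory and retrieved external information. Such conflicts reduce response reliability and introduce uncertainty in decision-making. The key contribution is Swin-VIB, a new framework that integrates a pipeline of Variational Information Bottleneck (VIB) models into the adaptive augmentation of retrieved information and into guiding LLM preference during response generation. Experiments validate the approach; notably, accuracy on single-choice tasks improves by at least 7.54% over competitive baselines.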

Link: https://arxiv.org/abs/2504.12982
Authors: Jiatai Wang,Zhiwei Xu,Di Jin,Xuewen Yang,Tao Li
Affiliations: College of Computer Science, Nankai University, Tianjin, China; Haihe Lab of ITAI, Tianjin, China; Meta AI; InnoPeak Technology, Inc, Palo Alto, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The proliferation of large language models (LLMs) has significantly advanced information retrieval systems, particularly in response generation (RG). Unfortunately, LLMs often face knowledge conflicts between internal memory and retrieved external information, arising from misinformation, biases, or outdated knowledge. These conflicts undermine response reliability and introduce uncertainty in decision-making. In this work, we analyze how LLMs navigate knowledge conflicts from an information-theoretic perspective and reveal that when conflicting and supplementary information exhibit significant differences, LLMs confidently resolve their preferences. However, when the distinction is ambiguous, LLMs experience heightened uncertainty. Based on this insight, we propose Swin-VIB, a novel framework that integrates a pipeline of variational information bottleneck models into adaptive augmentation of retrieved information and guiding LLM preference in response generation. Extensive experiments on single-choice, open-ended question-answering (QA), and retrieval augmented generation (RAG) validate our theoretical findings and demonstrate the efficacy of Swin-VIB. Notably, our method improves single-choice task accuracy by at least 7.54% over competitive baselines.

[NLP-19] A Phenomenological Approach to Analyzing User Queries in IT Systems Using Heidegger’s Fundamental Ontology

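[Quick Read]: This paper addresses the limitation that conventional IT systems are confined to categorical analysis and therefore struggle to reveal deeper ontological patterns in query processing, which makes them prone to logical traps when parsing complex interactions such as metaphor usage in IT contexts. The proposed solution is a novel research analytical IT system grounded in Martin Heidegger's Fundamental Ontology. The system uses two modally distinct, descriptively complete languages, a categorical language of beings (das Seiende) for processing user input and an existential language of Being (das Sein) for internal analysis, bridged by a phenomenological reduction module. This design lets the system identify recursive and self-referential structures and provide actionable insights in categorical terms, going beyond the limits of existing systems.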

Link: https://arxiv.org/abs/2504.12977
Authors: Maksim Vishnevskiy
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 12 pages, no figures

Abstract:This paper presents a novel research analytical IT system grounded in Martin Heidegger’s Fundamental Ontology, distinguishing between beings (das Seiende) and Being (das Sein). The system employs two modally distinct, descriptively complete languages: a categorical language of beings for processing user inputs and an existential language of Being for internal analysis. These languages are bridged via a phenomenological reduction module, enabling the system to analyze user queries (including questions, answers, and dialogues among IT specialists), identify recursive and self-referential structures, and provide actionable insights in categorical terms. Unlike contemporary systems limited to categorical analysis, this approach leverages Heidegger’s phenomenological existential analysis to uncover deeper ontological patterns in query processing, aiding in resolving logical traps in complex interactions, such as metaphor usage in IT contexts. The path to full realization involves formalizing the language of Being by a research team based on Heidegger’s Fundamental Ontology; given the existing completeness of the language of beings, this reduces the system’s computability to completeness, paving the way for a universal query analysis tool. The paper presents the system’s architecture, operational principles, technical implementation, use cases–including a case based on real IT specialist dialogues–comparative evaluation with existing tools, and its advantages and limitations.

[NLP-20] Sparks of Science: Hypothesis Generation Using Structured Paper Data

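[Quick Read]: This paper addresses the difficulty that current foundation models have in producing scientific ideas that are both novel and feasible in Scientific Hypothesis Generation (SHG). It introduces HypoGen, the first dataset that frames SHG as a Natural Language Generation (NLG) task, containing approximately 5500 structured problem-hypothesis pairs. HypoGen organizes its data with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal, and it integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. The key idea is to frame hypothesis generation as conditional language modelling: the model is fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning and is given only the Bit at inference, which improves the novelty, feasibility, and overall quality of the generated hypotheses.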

Link: https://arxiv.org/abs/2504.12976
Authors: Charles O’Neill,Tirthankar Ghosal,Roberta Răileanu,Mike Walmsley,Thang Bui,Kevin Schawinski,Ioana Ciucă
Affiliations: University of Oxford; Oak Ridge National Laboratory; University College London; University of Toronto; Australian National University; Modulos AG; Stanford University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures. Comments welcome

Abstract:Generating novel and creative scientific hypotheses is a cornerstone in achieving Artificial General Intelligence. Large language and reasoning models have the potential to aid in the systematic creation, selection, and validation of scientifically informed hypotheses. However, current foundation models often struggle to produce scientific ideas that are both novel and feasible. One reason is the lack of a dedicated dataset that frames Scientific Hypothesis Generation (SHG) as a Natural Language Generation (NLG) task. In this paper, we introduce HypoGen, the first dataset of approximately 5500 structured problem-hypothesis pairs extracted from top-tier computer science conferences structured with a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Spark is the key insight or conceptual leap, and the Flip is the resulting counterproposal. HypoGen uniquely integrates an explicit Chain-of-Reasoning component that reflects the intellectual process from Bit to Flip. We demonstrate that framing hypothesis generation as conditional language modelling, with the model fine-tuned on Bit-Flip-Spark and the Chain-of-Reasoning (and where, at inference, we only provide the Bit), leads to improvements in the overall quality of the hypotheses. Our evaluation employs automated metrics and LLM judge rankings for overall quality assessment. We show that by fine-tuning on our HypoGen dataset we improve the novelty, feasibility, and overall quality of the generated hypotheses. The HypoGen dataset is publicly available at this http URL.
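
One way to picture the Bit-Flip-Spark schema and the "only the Bit at inference" setup is the sketch below. The field names, the `to_training_pair` helper, the prompt layout, and the sample record are all hypothetical illustrations; the abstract does not specify how HypoGen serializes its records.

```python
from dataclasses import dataclass


@dataclass
class HypothesisRecord:
    bit: str        # the conventional assumption
    spark: str      # the key insight or conceptual leap
    flip: str       # the resulting counterproposal
    reasoning: str  # chain of reasoning leading from Bit to Flip


def to_training_pair(rec: HypothesisRecord) -> tuple[str, str]:
    """Build a (prompt, target) pair: the prompt holds only the Bit,
    the target holds everything the model should learn to generate."""
    prompt = f"Bit: {rec.bit}\n"
    target = f"Spark: {rec.spark}\nReasoning: {rec.reasoning}\nFlip: {rec.flip}"
    return prompt, target


# Made-up record, purely to show the shape of the data.
example = HypothesisRecord(
    bit="Long documents must be summarized from their full text.",
    spark="Retrieve only the passages relevant to the query.",
    flip="Summarize from a retrieved subset instead of the full text.",
    reasoning="Full text exceeds the context budget; most of it is irrelevant.",
)
prompt, target = to_training_pair(example)
```

At inference time only the `prompt` string would be supplied, matching the conditional language modelling framing described in the abstract.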

[NLP-21] Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization

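[Quick Read]: This paper addresses the performance bottleneck of long-context language models in large-scale multi-document summarization, in particular that they are not effective at their claimed context windows. It proposes a hybrid method that combines retrieval-augmented systems with the long context windows of recent language models. The key step is estimating the optimal retrieval length, which depends on the retriever, the summarizer, and the dataset. The authors first use a panel of large language models (LLMs) to generate silver references on a random subset of the dataset, then use these references to determine the optimal context length for a given RAG system configuration. Experiments show the method is effective across model classes and sizes, and the paper compares it against length estimates from long-context benchmarks such as RULER and HELMET. The analysis also indicates that the estimation strategy works for very long-context LMs and generalizes to new classes of LMs.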

Link: https://arxiv.org/abs/2504.12972
Authors: Adithya Pratapa,Teruko Mitamura
Affiliations: Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in long-context reasoning abilities of language models led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. To this end, retrieval-augmented systems provide an efficient and effective alternative. However, their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with long-context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
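
The estimation procedure in the abstract amounts to a search over candidate retrieval lengths scored against silver references. The sketch below shows that idea under stated assumptions: `estimate_context_length`, the candidate grid, and the `summarize` and `score` callables are hypothetical placeholders, and the toy stand-ins at the bottom exist only to make the example executable.

```python
from typing import Callable, Sequence


def estimate_context_length(
    candidate_lengths: Sequence[int],
    sample_ids: Sequence[str],
    summarize: Callable[[str, int], str],
    silver_refs: dict[str, list[str]],
    score: Callable[[str, list[str]], float],
) -> int:
    """Return the candidate length with the best mean score on the sampled subset."""
    best_length, best_mean = candidate_lengths[0], float("-inf")
    for length in candidate_lengths:
        scores = [
            score(summarize(doc_id, length), silver_refs[doc_id])
            for doc_id in sample_ids
        ]
        mean = sum(scores) / len(scores)
        if mean > best_mean:
            best_length, best_mean = length, mean
    return best_length


# Toy stand-ins (assumptions): a real setup would call a retriever plus
# summarizer, and compare against LLM-generated silver references.
def toy_summarize(doc_id: str, length: int) -> str:
    return str(length)


def toy_score(summary: str, refs: list[str]) -> float:
    return -abs(int(summary) - 16000)  # pretend quality peaks at 16k tokens


toy_refs = {"doc1": ["ref"], "doc2": ["ref"]}
best = estimate_context_length(
    [8000, 16000, 32000], ["doc1", "doc2"], toy_summarize, toy_refs, toy_score
)
```

Because the search runs only on a sampled subset, its cost stays small relative to summarizing the full dataset at every candidate length.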

[NLP-22] Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

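[Quick Read]: This paper addresses the high computational cost and inefficiency that Large Language Models (LLMs) incur when iterative reasoning strategies are used to improve performance on complex reasoning tasks. The key proposal is "retrials without feedback", a simple but powerful mechanism that lets LLMs retry a problem after an incorrect answer is identified, without explicit self-reflection or verbalized feedback. Compared with conventional iterative refinement, this simplifies the refinement process. The findings indicate that simple retrial-based approaches often outperform more sophisticated reasoning frameworks, challenging the assumption that more intricate methods inherently perform better and offering new ideas for designing more efficient reasoning systems.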

Link: https://arxiv.org/abs/2504.12951
Authors: Nearchos Potamitis,Akhil Arora
Affiliations: Aarhus University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 16 figures, 1 table. arXiv admin note: text overlap with arXiv:2405.06691

Abstract:Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
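
A retrial loop of the kind the abstract describes might look like the sketch below. It is a simplified illustration, not the paper's code: `solve_with_retrials`, `attempt`, `is_correct`, and `max_trials` are hypothetical names, and the scripted answers stand in for independent LLM calls.

```python
from typing import Callable, Optional


def solve_with_retrials(
    attempt: Callable[[], str],
    is_correct: Callable[[str], bool],
    max_trials: int = 5,
) -> tuple[Optional[str], int]:
    """Retry from scratch until an answer passes the check.

    No feedback about earlier failures is passed to the next attempt;
    each call to attempt() is independent.
    """
    for trial in range(1, max_trials + 1):
        answer = attempt()
        if is_correct(answer):
            return answer, trial
    return None, max_trials


# Toy stand-in for a model: a scripted sequence of answers (assumption).
scripted = iter(["3", "5", "4"])
result = solve_with_retrials(lambda: next(scripted), lambda a: a == "4")
```

The contrast with reflection-style methods is that nothing beyond a pass/fail signal is needed, so no extra tokens are spent on generating or reading critiques.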

[NLP-23] ConExion: Concept Extraction with Large Language Models

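[Quick Read]: This paper addresses the task of extracting all domain-relevant concepts from a document, rather than only the keyphrases that summarize its important information. This task is more challenging because the goal is to cover every concept related to the specific domain. The key approach is to use pre-trained large language models (LLMs) for concept extraction, and to explore prompts for unsupervised concept extraction. Experiments on two widely used benchmark datasets show improved F1 scores over state-of-the-art techniques, demonstrating the effectiveness of LLMs for concept extraction. The source code and datasets are publicly available.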

Link: https://arxiv.org/abs/2504.12915
Authors: Ebrahim Norouzi,Sven Hertling,Harald Sack
Affiliations: FIZ Karlsruhe – Leibniz Institute for Information Infrastructure; Karlsruhe Institute of Technology
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles a more challenging task of extracting all present concepts related to the specific domain, not just the important ones. Through comprehensive evaluations of two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at this https URL.

[NLP-24] MAIN: Mutual Alignment Is Necessary for instruction tuning

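[Quick Read]: This paper addresses data quality problems in large-scale instruction tuning caused by overlooking the alignment between instructions and responses. It hypothesizes that high-quality instruction-response pairs are defined not by the individual quality of each component but by how well the two align with each other. To address this, the paper proposes the Mutual Alignment Framework (MAIN), which ensures coherence between instruction and response through mutual constraints. Models such as LLaMA and Mistral fine-tuned within this framework outperform traditional methods across multiple benchmarks, underscoring the importance of instruction-response alignment for scalable, high-quality instruction tuning.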

Link: https://arxiv.org/abs/2504.12913
Authors: Fanyi Yang,Jianfeng Liu,Xin Zhang,Haoyu Liu,Xixin Cao,Yuefeng Zhan,Hao Sun,Weiwei Deng,Feng Sun,Qi Zhang
Affiliations: Peking University; Microsoft Corporation
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Instruction tuning has enabled large language models (LLMs) to achieve remarkable performance, but its success heavily depends on the availability of large-scale, high-quality instruction-response pairs. However, current methods for scaling up data generation often overlook a crucial aspect: the alignment between instructions and responses. We hypothesize that high-quality instruction-response pairs are not defined by the individual quality of each component, but by the extent of their alignment with each other. To address this, we propose a Mutual Alignment Framework (MAIN) that ensures coherence between the instruction and response through mutual constraints. Experiments demonstrate that models such as LLaMA and Mistral, fine-tuned within this framework, outperform traditional methods across multiple benchmarks. This approach underscores the critical role of instruction-response alignment in enabling scalable and high-quality instruction tuning for LLMs.

[NLP-25] Benchmarking Multi-National Value Alignment for Large Language Models

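[Quick Read]: This paper addresses the failure of existing value evaluations of Large Language Models (LLMs) to capture the diversity of national values, and notes that current benchmarks based on spectrum tests with manually designed questionnaires do not scale easily. To address these challenges, the paper introduces NaVAB (National Values Alignment Benchmark), a comprehensive benchmark for evaluating how well LLMs align with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB's key component is its national value extraction pipeline, which includes a modeling procedure that uses instruction tagging to process raw data sources, a screening process that filters value-related topics, and a generation process with a Conflict Reduction mechanism. Extensive experiments on LLMs across countries show that NaVAB helps identify misaligned scenarios and can be combined with alignment techniques to reduce value concerns by aligning LLMs' values with those of the target country.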

Link: https://arxiv.org/abs/2504.12911
Authors: Chengyi Ju,Weijie Shi,Chengzhong Liu,Jiaming Ji,Jipeng Zhang,Ruiyuan Zhang,Jia Zhu,Jiajie Xu,Yaodong Yang,Sirui Han,Yike Guo
Affiliations: The Hong Kong University of Science and Technology; Peking University; Zhejiang Normal University; Soochow University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Do Large Language Models (LLMs) hold positions that conflict with your country’s values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights into assisting in the identification of misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs’ values with the target country.

[NLP-26] Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models

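[Quick Read]: This paper addresses the poor generalizability that results when Large Language Models (LLMs) exploit dataset biases during inference. Existing debiasing methods based on prior knowledge, and automatic debiasing methods based on in-context learning, have limited effectiveness because dataset biases are diverse and bias suppression through in-context learning is insufficient. The key solution is an Information Gain-Guided Causal Intervention Debiasing (IGCIDB) framework that combines causal mechanisms with information theory. The framework first uses an information gain-guided causal intervention method to automatically balance the distribution of the instruction-tuning dataset, then trains LLMs on the debiased dataset with standard supervised fine-tuning, improving generalizability.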

Link: https://arxiv.org/abs/2504.12898
Authors: Zhouhao Sun,Xiao Ding,Li Du,Yunpeng Xu,Yixuan Ma,Yang Zhao,Bing Qin,Ting Liu
Affiliations: Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China; Beijing Academy of Artificial Intelligence, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to the poor generalizability of LLMs. However, due to the diversity of dataset biases and the insufficient nature of bias suppression based on in-context learning, the effectiveness of previous prior knowledge-based debiasing methods and in-context learning based automatic debiasing methods is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (IGCIDB) framework. This framework first utilizes an information gain-guided causal intervention method to automatically and autonomously balance the distribution of instruction-tuning dataset. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that IGCIDB can effectively debias LLM to improve its generalizability across different tasks.

[NLP-27] Are AI agents the new machine translation frontier? Challenges and opportunities of single- and multi-agent systems for multilingual digital communication

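[Quick Read]: This paper explores the potential of AI agents in machine translation (MT), in particular the role of single-agent and multi-agent systems in improving multilingual digital communication. It targets the limitations of traditional MT in complex scenarios that require high accuracy, domain-specific knowledge, and contextual awareness. The key idea is a multi-agent system in which several specialized AI agents collaborate in a structured manner, handling translation, adequacy review, fluency review, and final editing respectively. A pilot study in legal MT is used to demonstrate the feasibility of multi-agent workflows, and the findings suggest that such systems may improve domain adaptability and contextual awareness, with translation quality superior to traditional MT or single-agent systems.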

Link: https://arxiv.org/abs/2504.12891
Authors: Vicent Briva-Iglesias
Affiliations: Dublin City University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The rapid evolution of artificial intelligence (AI) has introduced AI agents as a disruptive paradigm across various industries, yet their application in machine translation (MT) remains underexplored. This paper describes and analyses the potential of single- and multi-agent systems for MT, reflecting on how they could enhance multilingual digital communication. While single-agent systems are well-suited for simpler translation tasks, multi-agent systems, which involve multiple specialized AI agents collaborating in a structured manner, may offer a promising solution for complex scenarios requiring high accuracy, domain-specific knowledge, and contextual awareness. To demonstrate the feasibility of multi-agent workflows in MT, we are conducting a pilot study in legal MT. The study employs a multi-agent system involving four specialized AI agents for (i) translation, (ii) adequacy review, (iii) fluency review, and (iv) final editing. Our findings suggest that multi-agent systems may have the potential to significantly improve domain-adaptability and contextual awareness, with superior translation quality to traditional MT or single-agent systems. This paper also sets the stage for future research into multi-agent applications in MT, integration into professional translation workflows, and shares a demo of the system analyzed in the paper.

[NLP-28] ViClaim: A Multilingual Multilabel Dataset for Automatic Claim Detection in Videos

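[Quick Read]: This paper addresses the gap that, as video content grows in influence as a medium for both communication and misinformation, existing misinformation detection tools pay little attention to spoken text in multilingual, multi-topic settings. The key contribution is the ViClaim dataset, which contains 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics, with each sentence labeled as fact-check-worthy, fact-non-check-worthy, or opinion. The authors also built a custom annotation tool to support the complex annotation process. Experiments with state-of-the-art multilingual language models show strong cross-validation performance (macro F1 up to 0.896) but reveal challenges in generalizing to unseen topics, particularly for distinct domains. The work highlights the complexity of claim detection in video transcripts and provides a foundation for misinformation detection in video-based communication.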

Link: https://arxiv.org/abs/2504.12882
Authors: Patrick Giedemann,Pius von Däniken,Jan Deriu,Alvaro Rodrigo,Anselmo Peñas,Mark Cieliebak
Affiliations: Zurich University of Applied Sciences; UNED NLP & IR Group, Spain
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The growing influence of video content as a medium for communication and misinformation underscores the urgent need for effective tools to analyze claims in multilingual and multi-topic settings. Existing efforts in misinformation detection largely focus on written text, leaving a significant gap in addressing the complexity of spoken text in video transcripts. We introduce ViClaim, a dataset of 1,798 annotated video transcripts across three languages (English, German, Spanish) and six topics. Each sentence in the transcripts is labeled with three claim-related categories: fact-check-worthy, fact-non-check-worthy, or opinion. We developed a custom annotation tool to facilitate the highly complex annotation process. Experiments with state-of-the-art multilingual language models demonstrate strong performance in cross-validation (macro F1 up to 0.896) but reveal challenges in generalization to unseen topics, particularly for distinct domains. Our findings highlight the complexity of claim detection in video transcripts. ViClaim offers a robust foundation for advancing misinformation detection in video-based communication, addressing a critical gap in multimodal analysis.
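
For readers unfamiliar with the macro F1 metric quoted in the abstract, here is a small sketch of how it can be computed in a multilabel setting, where each sentence carries a set of labels. The `macro_f1` helper and the four toy sentences are illustrative assumptions; they are not the paper's evaluation code or data.

```python
def macro_f1(gold: list[set[str]], pred: list[set[str]], labels: list[str]) -> float:
    """Unweighted mean of per-label binary F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


LABELS = ["fact-check-worthy", "fact-non-check-worthy", "opinion"]

# Toy annotations for four sentences (made up for illustration).
gold = [{"fact-check-worthy"}, {"fact-non-check-worthy"}, {"opinion"},
        {"fact-check-worthy", "opinion"}]
pred = [{"fact-check-worthy"}, {"opinion"}, {"opinion"},
        {"fact-check-worthy"}]
score = macro_f1(gold, pred, LABELS)
```

Because every label counts equally in the average, a label that is never predicted correctly pulls the score down sharply even when it is rare.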

[NLP-29] Building Russian Benchmark for Evaluation of Information Retrieval Models

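[Quick Read]: This paper addresses zero-shot evaluation of Information Retrieval (IR) models for Russian by building RusBEIR, a comprehensive benchmark. RusBEIR comprises 17 datasets from various domains, integrating adapted, translated, and newly created datasets, and enables systematic comparison of lexical and neural models. The study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models such as mE5-large and BGE-M3 perform better on most datasets but face challenges in long-document retrieval because of input size constraints. RusBEIR offers a unified, open-source framework to advance research in Russian-language information retrieval.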

Link: https://arxiv.org/abs/2504.12879
Authors: Grigory Kovalev,Mikhail Tikhomirov,Evgeny Kozhevnikov,Max Kornilov,Natalia Loukachevitch
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:We introduce RusBEIR, a comprehensive benchmark designed for zero-shot evaluation of information retrieval (IR) models in the Russian language. Comprising 17 datasets from various domains, it integrates adapted, translated, and newly created datasets, enabling systematic comparison of lexical and neural models. Our study highlights the importance of preprocessing for lexical models in morphologically rich languages and confirms BM25 as a strong baseline for full-document retrieval. Neural models, such as mE5-large and BGE-M3, demonstrate superior performance on most datasets, but face challenges with long-document retrieval due to input size constraints. RusBEIR offers a unified, open-source framework that promotes research in Russian-language information retrieval.
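
Since the abstract leans on BM25 as its lexical baseline, a compact sketch of Okapi BM25 scoring may help. This is one common variant of the formula with typical default parameters (`k1=1.5`, `b=0.75`), not the benchmark's own implementation; inputs are assumed to be already tokenized and lemmatized, which is the preprocessing step the abstract highlights for morphologically rich languages.

```python
import math
from collections import Counter


def bm25_scores(
    query: list[str], docs: list[list[str]], k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus of pre-lemmatized tokens (illustrative only).
docs = [["cat", "sleep"], ["dog", "bark"], ["cat", "and", "dog"]]
scores = bm25_scores(["cat"], docs)
```

In this toy corpus the first document outranks the third because both contain the query term once but the first is shorter, which is the length normalization controlled by `b`.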

[NLP-30] Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

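[Quick Read]: This paper addresses the limitations of existing multilingual long-context benchmarks, which are mostly based on the "needle-in-a-haystack" test. These benchmarks focus on a model's ability to locate specific information within irrelevant text, overlook its ability to reason over long contexts, and are susceptible to data leakage and short-circuiting. To address this, the paper introduces MLRBench, a new synthetic benchmark that goes beyond surface-level retrieval with tasks that assess multi-hop inference, aggregation, and epistemic reasoning. The key design goal is a multilingual benchmark that is parallel, resistant to leakage, and scalable to arbitrary context lengths. Experiments reveal a pronounced gap between high- and low-resource languages and show that, in multilingual settings, LLMs effectively use less than 30% of their claimed context length. Off-the-shelf Retrieval Augmented Generation alleviates this to some extent but does not solve the long-context problem.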

Link: https://arxiv.org/abs/2504.12845
Authors: Amey Hengle,Prasoon Bajpai,Soham Dan,Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi; Microsoft
Categories: Computation and Language (cs.CL)
Comments: 33 Pages in Total - 23 (Main Manuscript) + 10 (Appendix)

Abstract:Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model’s ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model’s capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage, short-circuiting, and risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.

[NLP-31] SMARTe: Slot-based Method for Accountable Relational Triple extraction

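[Quick read]: This paper addresses the opacity of existing Relational Triple Extraction (RTE) methods, which often rely on complex preprocessing to induce specific interactions, so that model behavior may not fully align with its theoretical foundations. The paper proposes SMARTe (Slot-based Method for Accountable Relational Triple extraction). Its key idea is to introduce intrinsic interpretability through a slot attention mechanism and to reframe the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, so every prediction can be traced back to the learned slot representations and to the tokens that contribute to each predicted triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models; evaluations on the NYT and WebNLG datasets indicate that adding interpretability does not compromise performance. The paper also presents qualitative analyses of SMARTe's explanations using attention heatmaps mapped to the corresponding tokens, and closes with a discussion of findings and directions for future research.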

Link: https://arxiv.org/abs/2504.12816
Authors: Xue Wen Tan, Stanley Kok
Affiliations: Asian Institute of Digital Finance, National University of Singapore; School of Computing, National University of Singapore
Categories: Computation and Language (cs.CL)

Abstract:Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research.

[NLP-32] Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

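[Quick read]: This paper asks how to evaluate large language models (LLMs) in art-related settings, specifically in writing art critiques and in reasoning about mental states (Theory of Mind, ToM) in art-related situations. The authors build a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories and guide the model through step-by-step prompting to generate critiques. They also design new ToM tasks that go beyond standard false-belief tests to assess more complex, socially embedded reasoning. The key to the approach is careful prompt design: a step-by-step prompting process for critique generation, and ToM scenarios involving interpretation, emotion, and moral tension. Together these probe whether, under specific guidance, LLM behavior comes closer to resembling genuine understanding.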

Link: https://arxiv.org/abs/2504.12805
Authors: Takaya Arita, Wenxian Zheng, Reiji Suzuki, Fuminori Akiba
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 30 pages, 13 figures, 1 table

Abstract:This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll’s evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox–the idea that LLMs can produce expert-like output without genuine understanding–they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.

[NLP-33] Towards Lossless Token Pruning in Late-Interaction Retrieval Models SIGIR2025

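[Quick read]: This paper addresses the large memory footprint of late-interaction neural information retrieval (neural IR) models such as ColBERT, which must store contextual representations for every document token. Existing methods prune document tokens using heuristics or statistical techniques, but cannot guarantee that the pruned tokens have no effect on the retrieval score. The key contribution is a principled approach that defines how to prune tokens without affecting the score between a document and a query. To this end, the authors introduce three regularization losses that induce high pruning ratios, together with two pruning strategies. Experiments show that the method preserves ColBERT's performance while using only 30% of the tokens.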

Link: https://arxiv.org/abs/2504.12778
Authors: Yuxuan Zong, Benjamin Piwowarski
Affiliations: Sorbonne Université, CNRS, ISIR
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at SIGIR 2025 Full Paper Track

Abstract:Late interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a huge memory space to store the contextual representation for all the document tokens. Some works have proposed using either heuristics or statistical-based techniques to prune tokens from each document. This however doesn’t guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses, that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in and out-domain), showing that we can preserve ColBERT’s performance while using only 30% of the tokens.
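
As a rough illustration of what "pruning without impacting the score" means, the sketch below assumes the usual ColBERT-style MaxSim scoring, where each query token takes its best-matching document token and the maxima are summed. The helper names and the toy vectors are hypothetical, and the check is done for one fixed query only; it is not the paper's regularization losses or pruning strategies, which aim at a guarantee across queries.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: each query token picks its best document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def used_token_indices(query_vecs, doc_vecs):
    """Indices of document tokens that win the max for at least one query token."""
    used = set()
    for q in query_vecs:
        sims = [dot(q, d) for d in doc_vecs]
        used.add(sims.index(max(sims)))
    return used

# Toy 2-d embeddings (made up for illustration).
doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.1]]
query = [[1.0, 0.0], [0.0, 2.0]]

keep = sorted(used_token_indices(query, doc))
pruned_doc = [doc[i] for i in keep]

# Tokens that never win a max can be dropped without changing this query's score.
print(keep, maxsim_score(query, doc), maxsim_score(query, pruned_doc))
```

Here half of the document tokens are removed and the score for this query is unchanged; the harder problem the paper targets is making such removals safe for queries that are not known in advance.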

[NLP-34] Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration

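[Quick read]: This paper addresses two obstacles in geometry problem solving (GPS): the lack of accurate step-by-step solution data, and the tendency of Multimodal Large Language Models (MLLMs) to hallucinate during reasoning. It proposes GeoGen, a pipeline that automatically generates step-wise reasoning paths for geometry diagrams and uses precise symbolic reasoning to produce large-scale, high-quality question-answer pairs. The key to the solution is to train a second model, GeoLogic, a Large Language Model (LLM), on the synthetic data generated by GeoGen. GeoLogic acts as a bridge between natural language and symbolic systems, allowing symbolic tools to help verify MLLM outputs, which strengthens logical reasoning and reduces hallucinations. Experiments show consistent improvements for MLLMs on geometric reasoning benchmarks. The authors attribute the gain to combining the strengths of LLMs and symbolic systems, which yields a more reliable and interpretable approach to GPS.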

Link: https://arxiv.org/abs/2504.12773
Authors: Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Affiliations: University of Science and Technology of China; iFLYTEK Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Codes are available at this https URL.

[NLP-35] Out of Sight Out of Mind Out of Sight Out of Mind: Measuring Bias in Language Models Against Overlooked Marginalized Groups in Regional Contexts

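[Quick read]: This paper addresses bias and stereotyping against minorities in language models (LMs), focusing on marginalized groups and low-resource languages that existing research has overlooked. Its key contribution is the first systematic study of offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US, together with an investigation of how low-resource languages and dialects affect the study of bias. The study measures significantly higher bias when using the Egyptian Arabic dialect than when using Modern Standard Arabic, which exposes limitations of current bias metrics. It also highlights intersectional bias across dimensions such as gender identity, sexual orientation, and race. The paper's central proposal is therefore to broaden bias research to overlooked marginalized groups and low-resource languages in order to support the development of more inclusive LMs.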

Link: https://arxiv.org/abs/2504.12767
Authors: Fatma Elsafoury, David Hartmann
Affiliations: Fraunhofer FOKUS Institute (Berlin, Germany); Weizenbaum Institute (Berlin, Germany); Technische Universität Berlin (Berlin, Germany)
Categories: Computation and Language (cs.CL)

Abstract:We know that language models (LMs) form biases and stereotypes of minorities, leading to unfair treatment of members of these groups, thanks to research mainly in the US and the broader English-speaking world. As the negative behavior of these models has severe consequences for society and individuals, industry and academia are actively developing methods to reduce the bias in LMs. However, there are many under-represented groups and languages that have been overlooked so far. This includes marginalized groups that are specific to individual countries and regions in the English-speaking and Western world, but crucially also almost all marginalized groups in the rest of the world. The UN estimates that between 600 million and 1.2 billion people worldwide are members of marginalized groups and in need of special protection. If we want to develop inclusive LMs that work for everyone, we have to broaden our understanding to include overlooked marginalized groups and low-resource languages and dialects. In this work, we contribute to this effort with the first study investigating offensive stereotyping bias in 23 LMs for 270 marginalized groups from Egypt, the remaining 21 Arab countries, Germany, the UK, and the US. Additionally, we investigate the impact of low-resource languages and dialects on the study of bias in LMs, demonstrating the limitations of current bias metrics, as we measure significantly higher bias when using the Egyptian Arabic dialect versus Modern Standard Arabic. Our results show that LMs indeed exhibit higher bias against many marginalized groups in comparison to dominant groups. However, this is not the case for Arabic LMs, where the bias is high against both marginalized and dominant groups in relation to religion and ethnicity. Our results also show higher intersectional bias against Non-binary, LGBTQIA+ and Black women.

[NLP-36] Chinese-Vicuna: A Chinese Instruction-following Llama-based Model

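[Quick read]: This paper addresses the gap in Chinese instruction-following capability, particularly for deployment in low-resource environments. The key to the solution is fine-tuning Meta's LLaMA architecture with Low-Rank Adaptation (LoRA) to obtain Chinese-Vicuna, a resource-efficient, open Chinese language model. The model combines hybrid datasets (BELLE and Guanaco) with 4-bit quantization (QLoRA), supports low-cost deployment on consumer GPUs (e.g., RTX-2080Ti), and can be adapted to specific domains such as healthcare and law. Its modular design and open-source ecosystem further improve flexibility and accessibility, positioning it as a versatile foundation for Chinese large language model (LLM) applications.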

Link: https://arxiv.org/abs/2504.12737
Authors: Chenghao Fan, Zhenyi Lu, Jie Tian
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Chinese-Vicuna Technique Report

Abstract:Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta’s LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna’s modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.

[NLP-37] Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

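[Quick read]: This paper addresses challenges in Unified Structured Knowledge Reasoning (USKR) for answering natural language questions (NLQs): existing methods struggle to exploit knowledge transfer across different SKR tasks or to align with the pre-training of large language models (LLMs). The key innovation is a new USKR framework named Pandora, which uses Python's Pandas API to build a unified knowledge representation that aligns better with LLM pre-training. Pandora uses an LLM to generate textual reasoning steps and executable Python code, and draws demonstrations from a memory of training examples covering multiple SKR tasks, which facilitates knowledge transfer. Experiments on four benchmarks spanning three SKR tasks show that the method outperforms existing unified frameworks and is competitive with task-specific methods.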

Link: https://arxiv.org/abs/2504.12734
Authors: Yongrui Chen, Junhao He, Linbo Fu, Shenyu Zhang, Rihui Jin, Xinbang Dai, Jiaqi Li, Dehai Min, Nan Hu, Yuxin Zhang, Guilin Qi, Yi Huang, Tongtong Wu
Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; China Mobile Research; Monash University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods either rely on employing task-specific strategies or custom-defined representations, which struggle to leverage the knowledge transfer between different SKR tasks or align with the prior of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named Pandora, which takes advantage of Python’s Pandas API to construct a unified knowledge representation for alignment with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples that cover various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that Pandora outperforms existing unified frameworks and competes effectively with task-specific methods.

[NLP-38] KODIS: A Multicultural Dispute Resolution Dialogue Corpus

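[Quick read]: This paper builds a corpus of cross-cultural dispute dialogues to study the relationship between culture and conflict. Participants engage in a typical customer service dispute scenario, designed by experts to evoke strong emotions and conflict, and the corpus collects thousands of dialogues from over 75 countries. It provides a rich set of dispositional, process, and outcome measures. Initial analysis supports theories of how anger expressions lead to escalatory spirals and highlights cultural differences in emotional expression. The key to the contribution is a multicultural data collection framework paired with a theory-driven experimental design.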

Link: https://arxiv.org/abs/2504.12723
Authors: James Hale, Sushrita Rakshit, Kushal Chawla, Jeanne M. Brett, Jonathan Gratch
Affiliations: University of Southern California; University of Michigan; Capital One; Northwestern University
Categories: Computation and Language (cs.CL)

Abstract:We present KODIS, a dyadic dispute resolution corpus containing thousands of dialogues from over 75 countries. Motivated by a theoretical model of culture and conflict, participants engage in a typical customer service dispute designed by experts to evoke strong emotions and conflict. The corpus contains a rich set of dispositional, process, and outcome measures. The initial analysis supports theories of how anger expressions lead to escalatory spirals and highlights cultural differences in emotional expression. We make this corpus and data collection framework available to the community.

[NLP-39] Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

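[Quick read]: This paper addresses the frequent hallucinations of large language models (LLMs), that is, generated content that deviates from factual accuracy or from the provided context. Hallucinations are hard to diagnose because of the complex interplay of underlying causes. The key contribution is a subsequence association framework for systematically tracing and understanding hallucinations. Its core insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analysis, the paper shows that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. Building on this, the authors propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show that the method outperforms standard attribution techniques at identifying hallucination causes and is consistent with evidence from the model's training corpus. The work offers a unified perspective on hallucinations and a robust framework for tracing and analyzing them.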

Link: https://arxiv.org/abs/2504.12691
Authors: Yiyou Sun, Yu Gai, Lijie Chen, Abhilasha Ravichander, Yejin Choi, Dawn Song
Affiliations: Unknown
Categories: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) frequently generate hallucinations (content that deviates from factual accuracy or provided context), posing challenges for diagnosis due to the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model’s training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
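
The idea of tracing causes by comparing hallucination probabilities across randomized contexts can be sketched roughly as follows. This is a simplified illustration, not the paper's algorithm: it scores single tokens rather than subsequences, uses random token dropping as the randomization, and replaces the LLM with a hypothetical stand-in function (`toy_model_hallucinates`) that "hallucinates" whenever one trigger word is present.

```python
import random

def toy_model_hallucinates(tokens):
    # Hypothetical stand-in for querying an LLM and checking its answer.
    return "famous" in tokens

def trace_tokens(prompt_tokens, hallucinates, n_samples=200, keep_prob=0.5, seed=0):
    """Score each token by P(hallucination | token kept) - P(hallucination | token dropped)."""
    rng = random.Random(seed)
    present = {t: [0, 0] for t in prompt_tokens}  # [hallucinated count, total count]
    absent = {t: [0, 0] for t in prompt_tokens}
    for _ in range(n_samples):
        # Randomized context: keep each token independently with probability keep_prob.
        kept = [t for t in prompt_tokens if rng.random() < keep_prob]
        h = 1 if hallucinates(kept) else 0
        for t in prompt_tokens:
            bucket = present if t in kept else absent
            bucket[t][0] += h
            bucket[t][1] += 1
    scores = {}
    for t in prompt_tokens:
        if present[t][1] == 0 or absent[t][1] == 0:
            scores[t] = 0.0  # not enough evidence on one side
        else:
            scores[t] = present[t][0] / present[t][1] - absent[t][0] / absent[t][1]
    return scores

# Tokens are assumed unique in this toy prompt.
prompt = ["name", "a", "famous", "singer", "born", "in", "1990"]
scores = trace_tokens(prompt, toy_model_hallucinates)
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy setting the trigger word receives the highest score because the hallucination rate differs sharply between contexts that contain it and contexts that do not; with a real model the signal would be noisier and would involve multi-token subsequences.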

[NLP-40] Data-efficient LLM Fine-tuning for Code Generation

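[Quick read]: This paper addresses the performance gap between open-source and closed-source code LLMs. Existing approaches typically close the gap by generating large amounts of synthetic data for fine-tuning, which is inefficient. The solution has two parts. First, a data selection strategy prioritizes data complexity while ensuring that the sampled subset matches the distribution of the original dataset, so that high-quality data is selected effectively. Second, the tokenization process is optimized with a "dynamic pack" technique that reduces padding tokens and lowers compute consumption. Experiments show that the method improves model performance and also markedly improves training efficiency.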

Link: https://arxiv.org/abs/2504.12687
Authors: Weijie Lv, Xuan Xia, Sheng-Jun Huang
Affiliations: Nanjing University of Aeronautics and Astronautics; Shenzhen Institute of Artificial Intelligence and Robotics for Society
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2408.02193

Abstract:Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order to improve the effectiveness and efficiency of training for code-based LLMs. By prioritizing data complexity and ensuring that the sampled subset aligns with the distribution of the original dataset, our sampling strategy effectively selects high-quality data. Additionally, we optimize the tokenization process through a “dynamic pack” technique, which minimizes padding tokens and reduces computational resource consumption. Experimental results show that when training on 40% of the OSS-Instruct dataset, the DeepSeek-Coder-Base-6.7B model achieves an average performance of 66.9%, surpassing the 66.1% performance with the full dataset. Moreover, training time is reduced from 47 minutes to 34 minutes, and the peak GPU memory decreases from 61.47 GB to 42.72 GB during a single epoch. Similar improvements are observed with the CodeLlama-Python-7B model on the Evol-Instruct dataset. By optimizing both data selection and tokenization, our approach not only improves model performance but also improves training efficiency.
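
To make the padding argument concrete, the sketch below compares padding every sequence to the maximum length with packing several sequences into one row. The abstract does not spell out how "dynamic pack" works, so the first-fit-decreasing heuristic, the function names, and the sequence lengths here are all assumptions chosen for illustration.

```python
def padding_when_padded_individually(lengths, max_len):
    """Padding tokens if every sequence gets its own row of max_len tokens."""
    return sum(max_len - n for n in lengths)

def pack_first_fit_decreasing(lengths, max_len):
    """Greedy packing: longest sequences first, each into the first row with room."""
    bins = []  # each bin is a list of sequence lengths whose sum is <= max_len
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError("sequence longer than max_len")
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])  # no existing row had room
    return bins

def padding_when_packed(bins, max_len):
    """Padding tokens left over after packing."""
    return sum(max_len - sum(b) for b in bins)

lengths = [900, 300, 120, 700, 60, 512, 200]  # made-up token counts
max_len = 1024

bins = pack_first_fit_decreasing(lengths, max_len)
print(bins)
print(padding_when_padded_individually(lengths, max_len), padding_when_packed(bins, max_len))
```

With these numbers the seven sequences fit into three rows instead of seven. A real training pipeline would also need attention masks or position resets so that packed sequences do not attend to each other; that part is not shown.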

[NLP-41] WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

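[Quick read]: This paper addresses the extraction of structured data from web pages at scale, a task on which existing methods perform poorly. The key to the solution is the BardeenAgent framework, which converts a web agent's execution into repeatable programs and exploits the regular structure of HTML by constructing a generalizable CSS selector that captures all relevant items on a page. On the WebLists benchmark it reaches 66% recall, more than double the performance of state-of-the-art web agents.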

Link: https://arxiv.org/abs/2504.12682
Authors: Arth Bohra, Manvel Saroyan, Danil Melkozerov, Vahe Karufanyan, Gabriel Maher, Pascal Weinberger, Artem Harutyunyan, Giovanni Campagna
Affiliations: University of California Berkeley; Bardeen, Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Most recent web agent research has focused on navigation and transaction tasks, with little emphasis on extracting structured data at scale. We present WebLists, a benchmark of 200 data-extraction tasks across four common business and enterprise use-cases. Each task requires an agent to navigate to a webpage, configure it appropriately, and extract complete datasets with well-defined schemas. We show that both LLMs with search capabilities and SOTA web agents struggle with these tasks, with a recall of 3% and 31%, respectively, despite higher performance on question-answering tasks. To address this challenge, we propose BardeenAgent, a novel framework that enables web agents to convert their execution into repeatable programs, and replay them at scale across pages with similar structure. BardeenAgent is also the first LLM agent to take advantage of the regular structure of HTML. In particular, BardeenAgent constructs a generalizable CSS selector to capture all relevant items on the page, then fits the operations to extract the data. On the WebLists benchmark, BardeenAgent achieves 66% recall overall, more than doubling the performance of SOTA web agents, and reducing cost per output row by 3x.
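
One way to picture a "generalizable CSS selector" is to start from selectors for two example items and relax the parts that differ. The sketch below is only an assumption about what such generalization could look like; `generalize_selectors` and the example selectors are hypothetical, and BardeenAgent's actual procedure is not described in the abstract.

```python
import re

# Matches a positional qualifier such as ":nth-child(3)".
NTH = re.compile(r":nth-child\(\d+\)")

def generalize_selectors(a, b):
    """Merge two child-combinator selectors, dropping :nth-child where they disagree."""
    steps_a, steps_b = a.split(" > "), b.split(" > ")
    if len(steps_a) != len(steps_b):
        raise ValueError("examples must have the same depth")
    out = []
    for sa, sb in zip(steps_a, steps_b):
        if sa == sb:
            out.append(sa)  # identical step: keep as-is
        elif NTH.sub("", sa) == NTH.sub("", sb):
            out.append(NTH.sub("", sa))  # differs only by position: generalize
        else:
            raise ValueError(f"cannot generalize {sa!r} and {sb!r}")
    return " > ".join(out)

first_item = "ul.results > li:nth-child(1) > a.title"
third_item = "ul.results > li:nth-child(3) > a.title"
selector = generalize_selectors(first_item, third_item)
print(selector)
```

The merged selector matches every list item's title link rather than just the two examples, which is the property that lets one recorded extraction be replayed across all rows and across pages with the same layout.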

[NLP-42] GRAIL: Gradient-Based Adaptive Unlearning for Privacy and Copyright in LLMs IJCNN2025

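[Quick read]: This paper addresses the problem that Large Language Models (LLMs) learn sensitive information during training, which raises social and legal concerns under principles such as the "Right to be forgotten." Retraining an entire model to remove undesired knowledge is costly and impractical, and existing single-domain unlearning methods struggle in multi-domain scenarios, where interwoven knowledge across domains leads to excessive knowledge removal or degraded performance. The key contribution is GRAIL (GRadient-based AdaptIve unLearning), a framework that uses gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge in each domain while preserving critical parameters. This allows sensitive information in large-scale pre-trained language models to be managed effectively.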

Link: https://arxiv.org/abs/2504.12681
Authors: Kun-Woo Kim, Ji-Hoon Park, Ju-Min Han, Seong-Whan Lee
Affiliations: Korea University, Seoul, South Korea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCNN 2025

Abstract:Large Language Models (LLMs) trained on extensive datasets often learn sensitive information, which raises significant social and legal concerns under principles such as the “Right to be forgotten.” Retraining entire models from scratch to remove undesired information is both costly and impractical. Furthermore, existing single-domain unlearning methods fail to address multi-domain scenarios, where knowledge is interwoven across domains such as privacy and copyright, creating overlapping representations that lead to excessive knowledge removal or degraded performance. To tackle these issues, we propose GRAIL (GRadient-based AdaptIve unLearning), a novel multi-domain unlearning framework. GRAIL leverages gradient information from multiple domains to precisely distinguish the unlearning scope from the retention scope, and applies an adaptive parameter-wise localization strategy to selectively remove targeted knowledge while preserving critical parameters for each domain. Experimental results on unlearning benchmarks show that GRAIL achieves unlearning success on par with the existing approaches, while also demonstrating up to 17% stronger knowledge retention success compared to the previous state-of-art method. Our findings establish a new paradigm for effectively managing and regulating sensitive information in large-scale pre-trained language models.

[NLP-43] ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

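[Quick read]: This paper addresses a weakness of abstractive compression models in retrieval-augmented generation (RAG): retrieved documents often contain irrelevant or factually incorrect information despite having high relevance scores, and as a result the compressor tends to omit information that is essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this, the paper proposes Abstractive Compression Robust against Noise (ACoRN), whose key idea is to improve compressor robustness through two new training steps. First, offline data augmentation makes the model more robust to two distinct types of retrieval noise. Second, because a language-model-based compressor cannot fully use information from multiple retrieved documents and exhibits positional bias, the compressor is fine-tuned to generate summaries centered on the key information that directly supports the correct answer. Experiments show that T5-large trained with ACoRN improves exact match (EM) and F1 scores while preserving the answer string, which can serve as direct evidence. It performs particularly well on datasets with many accuracy-reducing documents, which makes it useful in real-world scenarios.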

Link: https://arxiv.org/abs/2504.12673
Authors: Singon Kim, Gunho Jung, Seong-Whan Lee
Affiliations: Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.

[NLP-44] Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment

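[Quick read]: This paper addresses the challenge of personalization when aligning language models with human preferences, in particular doing so without excessive computational cost. Existing methods rely on reward signals and additional annotated data, which limits their scalability and their adaptability to diverse human values. The paper introduces Persona-judge, a new discriminative paradigm that enables training-free personalized alignment with unseen preferences. The key is to use the model's intrinsic preference judgment capability rather than optimizing policy parameters through external reward feedback. Concretely, a draft model generates candidate tokens conditioned on a given preference, while a judge model that embodies another preference cross-validates whether the predicted tokens should be accepted. Experiments show that, by relying on the model's inherent preference evaluation, the method offers a scalable and computationally efficient route to personalized alignment and lays the groundwork for more adaptive customized alignment.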

Link: https://arxiv.org/abs/2504.12663
Authors: Xiaotian Zhang, Ruizhe Chen, Yang Feng, Zuozhu Liu
Affiliations: Zhejiang University; Angelalign Technology Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates the predicted tokens whether to be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment.
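
The draft-then-judge step can be pictured with a toy next-token example. The sketch below is an assumption-heavy illustration: the two "models" are hand-written probability tables, and the acceptance rule (accept the draft's most preferred token that the judge assigns at least `threshold` probability, otherwise fall back to the judge's top token) is a hypothetical stand-in, since the abstract does not give the actual criterion.

```python
def draft_then_judge(draft_probs, judge_probs, threshold=0.2):
    """Pick one token: the draft proposes, the judge accepts or rejects."""
    # Try the draft model's candidates from most to least preferred.
    for token in sorted(draft_probs, key=draft_probs.get, reverse=True):
        if judge_probs.get(token, 0.0) >= threshold:
            return token  # judge accepts this candidate
    # No candidate accepted: fall back to the judge's own top token.
    return max(judge_probs, key=judge_probs.get)

# Hypothetical next-token distributions under two different preferences.
draft = {"certainly": 0.5, "sure": 0.3, "yo": 0.2}       # e.g. "be formal"
judge = {"certainly": 0.05, "sure": 0.6, "hello": 0.35}  # e.g. "be concise"

chosen = draft_then_judge(draft, judge)
fallback = draft_then_judge({"yo": 1.0}, {"hello": 0.9, "yo": 0.1})
print(chosen, fallback)
```

No parameters are updated anywhere in this loop, which is the sense in which such a scheme is training-free: the two preferences are combined purely at decoding time.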

[NLP-45] VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization

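[Quick read]: This paper addresses safety risks in Vision-Language Models (VLMs) that arise from their multimodal complexity, in particular subtle threats created by combining vision and language that conventional safeguards cannot reach. The key idea is multimodal reasoning-driven prompt rewriting as a way to improve VLM safety. The paper introduces VLMGuard-R1, a proactive framework in which a reasoning-guided rewriter dynamically interprets text-image interactions and refines user inputs into safety-reinforcing prompts, without modifying the core parameters of the VLM. Central to the approach is a three-stage reasoning pipeline that synthesizes the training dataset, so that the rewriter learns to infer subtle threats and to produce tailored, actionable responses rather than generic refusals. Experiments across three benchmarks with five VLMs show that VLMGuard-R1 outperforms four baselines, with a 43.59% increase in average safety across the five models on the SIUO benchmark.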

Link: https://arxiv.org/abs/2504.12661
Authors: Menglan Chen, Xianghe Pang, Jingjing Dong, WenHao Wang, Yaxin Du, Siheng Chen
Affiliations: Shanghai Jiao Tong University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Aligning Vision-Language Models (VLMs) with safety standards is essential to mitigate risks arising from their multimodal complexity, where integrating vision and language unveils subtle threats beyond the reach of conventional safeguards. Inspired by the insight that reasoning across modalities is key to preempting intricate vulnerabilities, we propose a novel direction for VLM safety: multimodal reasoning-driven prompt rewriting. To this end, we introduce VLMGuard-R1, a proactive framework that refines user inputs through a reasoning-guided rewriter, dynamically interpreting text-image interactions to deliver refined prompts that bolster safety across diverse VLM architectures without altering their core parameters. To achieve this, we devise a three-stage reasoning pipeline to synthesize a dataset that trains the rewriter to infer subtle threats, enabling tailored, actionable responses over generic refusals. Extensive experiments across three benchmarks with five VLMs reveal that VLMGuard-R1 outperforms four baselines. In particular, VLMGuard-R1 achieves a remarkable 43.59% increase in average safety across five models on the SIUO benchmark.

[NLP-46] Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation

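[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have with long-context reasoning, which stems both from the quadratic growth of computational complexity with sequence length and from the scarcity and cost of annotated long-context data. Almost no open-source work systematically ablates long-context data, and no openly available instruction-tuning dataset has contexts longer than 100K tokens. To fill this gap, the paper proposes a new post-training synthetic data generation strategy that efficiently extends the context window of LLMs while preserving general task performance. The method scales to arbitrarily long contexts, unconstrained by the length of available real-world data, and thus addresses the scarcity of raw long-context data. The key is a step-by-step rotary position embedding (RoPE) scaling training strategy, with which the model performs well at context lengths of up to 1M tokens on the RULER benchmark and InfiniteBench while remaining robust on general language tasks.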

Link: https://arxiv.org/abs/2504.12637
Authors: Linda He, Jue Wang, Maurice Weber, Shang Zhu, Ben Athiwaratkun, Ce Zhang
Affiliations: Harvard University; Together AI; University of Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 5 figures

Abstract:Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.
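
For readers unfamiliar with RoPE scaling, the sketch below shows one common variant, linear position interpolation, applied in stages. The abstract does not say which scaling rule or schedule the authors use, so the interpolation formula, the 4096-token starting length, and the doubling schedule are all assumptions for illustration. In RoPE, position p rotates the i-th dimension pair by the angle p * base^(-2i/d); dividing positions by a scale factor keeps angles at longer lengths inside the range seen during the original training.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 compresses positions (linear interpolation)."""
    return [
        (position / scale) * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]

original_len = 4096  # assumed pre-extension context length
stages = [original_len * 2 ** k for k in range(1, 4)]  # hypothetical schedule: 8K, 16K, 32K

reference = rope_angles(original_len, head_dim=8)
matches = []
for target_len in stages:
    scale = target_len / original_len
    stretched = rope_angles(target_len, head_dim=8, scale=scale)
    # With linear interpolation, the last position of the extended window
    # gets the same angles as the last position of the original window.
    matches.append(all(math.isclose(a, b) for a, b in zip(stretched, reference)))

print(stages, matches)
```

Extending in stages rather than in one jump means the model is fine-tuned at each intermediate scale before the next increase; whether the paper uses this particular rule or another RoPE variant is not stated in the abstract.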

[NLP-47] Towards Characterizing Subjectivity of Individuals through Modeling Value Conflicts and Trade-offs

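[Quick read]: This paper addresses the under-studied question of whether large language models (LLMs) can model individual-level subjectivity and reason about moral judgments. It focuses on using LLMs to characterize the subjectivity of social media users and infer their moral judgments. The key is a framework called SOLAR (Subjective Ground with Value Abstraction), which observes value conflicts and trade-offs in user-generated text to better represent an individual's subjective ground. SOLAR improves overall inference results and also offers interpretable insight into individuals' value preferences, which helps explain their judgments.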

Link: https://arxiv.org/abs/2504.12633
Authors: Younghun Lee, Dan Goldwasser
Affiliations: Department of Computer Science, Purdue University
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:Large Language Models (LLMs) not only have solved complex reasoning problems but also exhibit remarkable performance in tasks that require subjective decision making. Existing studies suggest that LLM generations can be subjectively grounded to some extent, yet exploring whether LLMs can account for individual-level subjectivity has not been sufficiently studied. In this paper, we characterize subjectivity of individuals on social media and infer their moral judgments using LLMs. We propose a framework, SOLAR (Subjective Ground with Value Abstraction), that observes value conflicts and trade-offs in the user-generated texts to better represent subjective ground of individuals. Empirical results show that our framework improves overall inference results as well as performance on controversial situations. Additionally, we qualitatively show that SOLAR provides explanations about individuals’ value preferences, which can further account for their judgments.

[NLP-48] GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning

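[Quick read]: This paper addresses the failure of existing benchmarks to jointly assess both dimensions of human-like geometric reasoning, namely identifying and applying geometric principles, in multimodal large language models (MLLMs) on geometry problem-solving (GPS). The key contribution is GeoSense, the first comprehensive bilingual benchmark that systematically evaluates the geometric reasoning of MLLMs through the lens of geometric principles. GeoSense comprises a five-level hierarchical framework of geometric principles covering plane and solid geometry, a finely annotated dataset of 1,789 problems, and a new evaluation strategy. Experiments show that identifying and applying geometric principles remains a bottleneck for leading models, which underscores GeoSense's potential to guide future work on geometric reasoning in MLLMs.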

Link: https://arxiv.org/abs/2504.12597
Authors: Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, Bo Zheng
Affiliations: Alibaba Group (Beijing, China); Beihang University (Beijing, China)
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 8 figures

Abstract:Geometry problem-solving (GPS), a challenging task requiring both visual comprehension and symbolic reasoning, effectively measures the reasoning capabilities of multimodal large language models (MLLMs). Humans exhibit strong reasoning ability in this task through accurate identification and adaptive application of geometric principles within visual contexts. However, existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in MLLMs, remaining a critical gap in assessing their ability to tackle GPS. To this end, we introduce GeoSense, the first comprehensive bilingual benchmark designed to systematically evaluate the geometric reasoning abilities of MLLMs through the lens of geometric principles. GeoSense features a five-level hierarchical framework of geometric principles spanning plane and solid geometry, an intricately annotated dataset of 1,789 problems, and an innovative evaluation strategy. Through extensive experiments on GeoSense with various open-source and closed-source MLLMs, we observe that Gemini-2.0-pro-flash performs best, achieving an overall score of 65.3. Our in-depth analysis reveals that the identification and application of geometric principles remain a bottleneck for leading MLLMs, jointly hindering their reasoning abilities. These findings underscore GeoSense’s potential to guide future advancements in MLLMs’ geometric reasoning capabilities, paving the way for more robust and human-like reasoning in artificial intelligence.

[NLP-49] Simplifying Graph Transformers

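[Quick read]: This paper addresses the challenges of migrating the standard Transformer to graph learning. Most advanced Graph Transformers integrate message passing or sophisticated attention mechanisms, and the resulting architectural complexity prevents easy adoption of Transformer training advances. The key is three simple modifications: (1) simplified L₂ attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. These changes make the plain Transformer applicable to graphs without major architectural changes. Significant performance gains on a variety of graph datasets support their effectiveness, and evaluation on an expressiveness benchmark shows noteworthy realized expressiveness on graph isomorphism.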

Link: https://arxiv.org/abs/2504.12588
Authors: Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H.S. Torr, Mark Coates
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Transformers have attained outstanding performance across various modalities, employing scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers are designed with major architectural differences, either integrating message-passing or incorporating sophisticated attention mechanisms. These complexities prevent the easy adoption of Transformer training advances. We propose three simple modifications to the plain Transformer to render it applicable to graphs without introducing major architectural distortions. Specifically, we advocate for the use of (1) simplified L_2 attention to measure the magnitude closeness of tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a relative positional encoding bias with a shared encoder. Significant performance gains across a variety of graph datasets justify the effectiveness of our proposed modifications. Furthermore, empirical evaluation on the expressiveness benchmark reveals noteworthy realized expressiveness in the graph isomorphism.
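
As a rough illustration of why distance-based attention captures "magnitude closeness", the sketch below compares scaled-dot-product weights with weights from negative squared Euclidean distance. This is one common form of L2 attention; the paper's "simplified" variant may differ, and both function names are assumptions made for this example.

```python
import math

def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention_weights(query, keys):
    """Standard scaled-dot-product scores: q.k / sqrt(d)."""
    d = len(query)
    return _softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys])

def l2_attention_weights(query, keys):
    """Assumed L2 form: scores are negative squared distances -||q - k||^2."""
    return _softmax([-sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys])

query = [1.0, 0.0]
keys = [[1.0, 0.0], [10.0, 0.0]]  # same direction, very different magnitude

print(dot_product_attention_weights(query, keys))  # favours the large-magnitude key
print(l2_attention_weights(query, keys))           # favours the key closest to the query
```

Dot-product scores grow with key magnitude, so the second key dominates; the distance-based scores instead reward the key whose magnitude matches the query.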

[NLP-50] Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

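[Quick read]: This paper addresses inappropriate responses from large language models (LLMs) on deterministic tasks (such as counting or forming acronyms) caused by reliance on an implicit prior distribution. The paper shows that, in at least some cases, LLMs already compute the information needed to solve these tasks correctly, and that specific interventions let the model access that information. The key interventions are prompting the model to rely less on its prior knowledge, and using mechanistic interpretability techniques to localize and modulate the prior's influence inside the model. Lightweight fine-tuning of the layers associated with the prior, on prior-dominated tasks, substantially improves performance, and errors after fine-tuning are no longer correlated with the prior. This suggests that adjusting how much LLMs rely on their priors may improve performance in some settings, particularly where hallucinations arise from the prior probability of token sequences.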

Link: https://arxiv.org/abs/2504.12585
Authors: Liyi Zhang, Veniamin Veselovsky, R. Thomas McCoy, Thomas L. Griffiths
Affiliations: Princeton University; Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks – such as counting or forming acronyms – because the implicit prior distribution they have learned over sequences of tokens influences their responses. In this work, we show that, in at least some cases, LLMs actually compute the information needed to perform these tasks correctly, and we identify some interventions that can allow them to access this information to improve their performance. First, we show that simply prompting the language model to not rely on its prior knowledge leads to dramatic improvements in prior-dominated tasks. We then use mechanistic interpretability techniques to localize the prior within the LLM and manipulate the extent to which that prior influences its responses. Specifically, we show that it is possible to identify layers of the underlying neural network that correlate with the prior probability of a response and that lightweight finetuning of these layers with basic prompts on prior-dominated tasks achieves high performance on held-out answers. These results suggest that the information required to produce a correct response is contained within the representations of the problems formed by the models. Furthermore, we show that this finetuning is significantly more effective for prior-dominated tasks, and that the error after finetuning is no longer correlated with the prior. Our results suggest that it may be possible to define effective methods for manipulating the extent to which LLMs rely upon their priors in solving problems, potentially increasing their performance in settings where LLMs hallucinate for reasons related to the prior probability of token sequences.

[NLP-51] Provable Secure Steganography Based on Adaptive Dynamic Sampling

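[Quick read]: This paper addresses the limited practicality of existing Provably Secure Steganography (PSS) methods in black-box scenarios, where they require explicit access to the generative model's distribution. The key is a provably secure steganography scheme that needs no explicit access to model distributions for either sender or receiver. It introduces a dynamic sampling strategy that lets the generative model embed secret messages through multiple sampling choices without disrupting normal generation, giving a black-box method with efficiency and capacity comparable to white-box methods while eliminating the degradation of steganography in model-generated outputs.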

Link: https://arxiv.org/abs/2504.12579
Authors: Kaiyi Pang
Affiliations: Tsinghua University
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:The security of private communication is increasingly at risk due to widespread surveillance. Steganography, a technique for embedding secret messages within innocuous carriers, enables covert communication over monitored channels. Provably Secure Steganography (PSS) is state of the art for making stego carriers indistinguishable from normal ones by ensuring computational indistinguishability between stego and cover distributions. However, current PSS methods often require explicit access to the distribution of generative model for both sender and receiver, limiting their practicality in black box scenarios. In this paper, we propose a provably secure steganography scheme that does not require access to explicit model distributions for both sender and receiver. Our method incorporates a dynamic sampling strategy, enabling generative models to embed secret messages within multiple sampling choices without disrupting the normal generation process of the model. Extensive evaluations of three real world datasets and three LLMs demonstrate that our blackbox method is comparable with existing white-box steganography methods in terms of efficiency and capacity while eliminating the degradation of steganography in model generated outputs.

[NLP-52] MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

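[Quick read]: This paper addresses domain adaptation with synthetic data, in particular the low diversity of existing synthetic data, which limits its downstream usefulness. The key is MetaSynth, a method that uses meta-prompting: one language model orchestrates multiple "expert" LLM agents that collaboratively generate data, substantially increasing diversity. Using only 25 million tokens of MetaSynth-generated data, the authors adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, finance and biomedicine, while preserving general-task performance. Measured by seven automated metrics, the diversity of MetaSynth data approaches that of real pre-training corpora, and continual pre-training on it outperforms the base model.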

Link: https://arxiv.org/abs/2504.12563
Authors: Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 17 figures. Preprint

Abstract:Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple “expert” LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying In-Context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing any real data, is sufficient for effective domain adaptation when using MetaSynth.

[NLP-53] ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition

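[Quick read]: This paper addresses overfitting, high cost, and bias in traditional evaluation methods for large language models (LLMs). It proposes ZeroSumEval, a competition-based evaluation protocol whose key idea is to use zero-sum games to build dynamic benchmarks that resist saturation. ZeroSumEval uses a diverse suite of games (security challenges, classic games, knowledge tests, and persuasion challenges) to assess capabilities such as strategic reasoning, planning, knowledge application, and creativity, within a standardized and extensible framework. Experiments show that although some frontier models handle common games and question answering well, they struggle with games that require creating novel and challenging questions, and they generally fail at tasks requiring creativity.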

Link: https://arxiv.org/abs/2504.12562
Authors: Haidar Khan, Hisham A. Alyahya, Yazeed Alnumay, M Saiful Bari, Bülent Yener
Affiliations: Meta; Saudi Data & AI Authority; Cohere; Rensselaer Polytechnic Institute
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations - methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar’s Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with 7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at this https URL.

[NLP-54] CDF-RAG: Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation

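[Quick read]: This paper addresses a limitation of existing Retrieval-Augmented Generation (RAG) frameworks on knowledge-intensive tasks: because retrieval is driven by semantic similarity and correlation, they struggle to distinguish true causal relationships from spurious associations. Responses may therefore be factually grounded yet fail to establish cause-and-effect mechanisms, producing incomplete or misleading insights. The paper proposes Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG). The key is to iteratively refine queries, retrieve structured causal graphs, support multi-hop causal reasoning across interconnected knowledge sources, and validate outputs against causal pathways so that generations are logically coherent and factually grounded.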

Link: https://arxiv.org/abs/2504.12560
Authors: Elahe Khatibi, Ziyu Wang, Amir M. Rahmani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has significantly enhanced large language models (LLMs) in knowledge-intensive tasks by incorporating external knowledge retrieval. However, existing RAG frameworks primarily rely on semantic similarity and correlation-driven retrieval, limiting their ability to distinguish true causal relationships from spurious associations. This results in responses that may be factually grounded but fail to establish cause-and-effect mechanisms, leading to incomplete or misleading insights. To address this issue, we introduce Causal Dynamic Feedback for Adaptive Retrieval-Augmented Generation (CDF-RAG), a framework designed to improve causal consistency, factual accuracy, and explainability in generative reasoning. CDF-RAG iteratively refines queries, retrieves structured causal graphs, and enables multi-hop causal reasoning across interconnected knowledge sources. Additionally, it validates responses against causal pathways, ensuring logically coherent and factually grounded outputs. We evaluate CDF-RAG on four diverse datasets, demonstrating its ability to improve response accuracy and causal correctness over existing RAG-based methods. Our code is publicly available at this https URL elakhatibi/CDF-RAG.

[NLP-55] Benchmarking LLM-based Relevance Judgment Methods

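[Quick read]: This paper addresses the lack of comparison among methods that use large language models (LLMs) to automate the evaluation of information retrieval systems. Prior work has mainly replicated graded human relevance judgments through different prompting strategies, with little attention to alternative assessment methods or comprehensive comparisons. The paper systematically compares several LLM-based relevance assessment methods: binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based methods (document-agnostic and document-dependent). The key is that, beyond the traditional comparison of system rankings using Kendall correlations, it also examines how well LLM judgments align with human preferences inferred from relevance grades. Extensive experiments are run on multiple datasets, and the release includes relevance judgments generated by an open-source model (Llama3.2b) and a commercial model (gpt-4o). All code, data, and resources are publicly available.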

Link: https://arxiv.org/abs/2504.12558
Authors: Negar Arabzadeh, Charles L. A. Clarke
Affiliations: University of Waterloo
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods: document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks 2019, 2020 and 2021 as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at this https URL.
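
The system-ranking comparison mentioned in the abstract can be sketched with a plain Kendall tau computation. The abstract does not say which Kendall variant is used; the version below is tau-a (no tie correction), and the example scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau_a(scores_a, scores_b):
    """Kendall tau-a between two score lists for the same systems (ties count as neither)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical effectiveness scores for four retrieval systems, once computed
# from human judgments and once from LLM judgments.
human_scores = [0.41, 0.47, 0.52, 0.60]
llm_scores = [0.39, 0.45, 0.58, 0.55]

print(kendall_tau_a(human_scores, llm_scores))  # one swapped pair out of six
```

A value near 1 means the LLM judgments rank systems almost the same way the human judgments do, which is the property this style of evaluation cares about.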

[NLP-56] ELAB: Extensive LLM Alignment Benchmark in Persian Language

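[Quick read]: This paper addresses the gaps in existing evaluation frameworks for large language models (LLMs) with respect to Persian linguistic and cultural contexts, in particular alignment along the ethical dimensions of safety, fairness, and social norms. The key is a comprehensive evaluation framework and a benchmark built from three types of data: (i) translated data, (ii) new synthetically generated data, and (iii) new naturally collected data. The authors translate existing datasets (Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust), create new Persian datasets (ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa), and collect GuardBench-fa to reflect Persian cultural norms. Together these enable a systematic evaluation of Persian LLMs on safety, fairness, and social norms. The unified framework offers a new approach to culturally grounded alignment evaluation, and a public leaderboard reports how different models perform.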

Link: https://arxiv.org/abs/2504.12553
Authors: Zahra Pourbahman, Fatemeh Rajabi, Mohammadhossein Sadeghi, Omid Ghahroodi, Somaye Bakhshaei, Arash Amini, Reza Kazemi, Mahdieh Soleymani Baghshah
Affiliations: MCILAB, Sharif University of Technology
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses the gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. This benchmark creates three types of Persian-language benchmarks: (i) translated data, (ii) new data generated synthetically, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect extensive dataset as GuardBench-fa to consider Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across the three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms at: this https URL.

[NLP-57] Memorization: A Close Look at Books

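[Quick read]: This paper examines the extent to which entire books can be extracted from large language models (LLMs). Using the Llama 3 70B family of models and the "prefix-prompting" technique, the authors auto-regressively reconstruct one entire book (Alice's Adventures in Wonderland) with very high similarity from only the first 500 tokens. They also obtain high extraction rates on several other books piece-wise, but these successes do not extend uniformly to all books. Extraction rates correlate with book popularity, which likely reflects duplication in the training data. The paper also confirms that mitigations are undone in the instruction-tuned Llama 3.1, and finds that this comes from changes to a tiny fraction of weights concentrated in the lower transformer blocks. The key contribution is a framework for studying how fine-tuning affects the retrieval of verbatim memorized content in aligned LLMs, which exposes the limits of current regurgitation mitigation strategies.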

Link: https://arxiv.org/abs/2504.12549
Authors: Iris Ma, Ian Domingo, Alberto Krone-Martins, Pierre Baldi, Cristina V. Lopes
Affiliations: School of Information and Computer Sciences, University of California, Irvine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:To what extent can entire books be extracted from LLMs? Using the Llama 3 70B family of models, and the “prefix-prompting” extraction technique, we were able to auto-regressively reconstruct, with a very high level of similarity, one entire book (Alice’s Adventures in Wonderland) from just the first 500 tokens. We were also able to obtain high extraction rates on several other books, piece-wise. However, these successes do not extend uniformly to all books. We show that extraction rates of books correlate with book popularity and thus, likely duplication in the training data. We also confirm the undoing of mitigations in the instruction-tuned Llama 3.1, following recent work (Nasr et al., 2025). We further find that this undoing comes from changes to only a tiny fraction of weights concentrated primarily in the lower transformer blocks. Our results provide evidence of the limits of current regurgitation mitigation strategies and introduce a framework for studying how fine-tuning affects the retrieval of verbatim memorization in aligned LLMs.
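
The auto-regressive reconstruction described above can be pictured as a loop that repeatedly feeds the most recent tokens back to the model as a prompt. The sketch below is only an illustration: `generate` is a hypothetical callable standing in for a model, the window and step sizes are arbitrary, and `difflib` similarity is an assumed stand-in for whatever metric the authors use.

```python
from difflib import SequenceMatcher

def extract_by_prefix(generate, prefix, target_len, window=500, step=50):
    """Grow `prefix` autoregressively, prompting with the last `window` tokens each round.

    `generate(context, n)` is a hypothetical model call returning up to n new tokens.
    """
    tokens = list(prefix)
    while len(tokens) < target_len:
        continuation = generate(tokens[-window:], step)
        if not continuation:  # the model stopped producing text
            break
        tokens.extend(continuation)
    return tokens[:target_len]

def similarity(reconstructed, original):
    """Sequence similarity in [0, 1]; 1.0 means a verbatim reconstruction."""
    return SequenceMatcher(None, reconstructed, original, autojunk=False).ratio()

# Toy stand-in for a model that has memorized a "book" of 2000 distinct token ids.
book = list(range(2000))

def memorizing_model(context, n):
    next_index = context[-1] + 1
    return book[next_index:next_index + n]

reconstructed = extract_by_prefix(memorizing_model, book[:500], len(book))
print(similarity(reconstructed, book))
```

With a real model the continuation would drift whenever memorization fails, and the similarity score would drop accordingly.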

[NLP-58] Knowledge Acquisition on Mass-shooting Events via LLMs for AI-Driven Justice

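[Quick read]: This paper addresses the inefficient and insufficiently automated extraction of key information from text about mass-shooting events, in support of legal and investigative work. The key is a purpose-built dataset together with named entity recognition (NER) driven by large language models (LLMs) with few-shot prompting, which extracts and organizes key entities such as offenders, victims, locations, and criminal instruments from diverse sources including news articles, police reports, and social media. Experiments show that GPT-4o performs best on mass-shooting NER, while o1-mini is a competitive, resource-efficient alternative. Increasing the number of shots improves all models, with larger gains for GPT-4o and o1-mini, indicating their stronger adaptability to few-shot learning.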

Link: https://arxiv.org/abs/2504.12545
Authors: Benign John Ihugba, Afsana Nasrin, Ling Wu, Lin Li, Lijun Qian, Xishuang Dong
Affiliations: Prairie View A&M University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mass-shooting events pose a significant challenge to public safety, generating large volumes of unstructured textual data that hinder effective investigations and the formulation of public policy. Despite the urgency, few prior studies have effectively automated the extraction of key information from these events to support legal and investigative efforts. This paper presented the first dataset designed for knowledge acquisition on mass-shooting events through the application of named entity recognition (NER) techniques. It focuses on identifying key entities such as offenders, victims, locations, and criminal instruments, that are vital for legal and investigative purposes. The NER process is powered by Large Language Models (LLMs) using few-shot prompting, facilitating the efficient extraction and organization of critical information from diverse sources, including news articles, police reports, and social media. Experimental results on real-world mass-shooting corpora demonstrate that GPT-4o is the most effective model for mass-shooting NER, achieving the highest Micro Precision, Micro Recall, and Micro F1-scores. Meanwhile, o1-mini delivers competitive performance, making it a resource-efficient alternative for less complex NER tasks. It is also observed that increasing the shot count enhances the performance of all models, but the gains are more substantial for GPT-4o and o1-mini, highlighting their superior adaptability to few-shot learning scenarios.
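
For readers unfamiliar with the micro-averaged metrics reported here, the following sketch pools true positives, false positives, and false negatives over all documents before computing precision, recall, and F1. The entity labels and example annotations are invented for illustration and do not come from the paper's dataset.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over per-document entity sets.

    Each document is a set of (entity_type, entity_text) tuples; counts are
    pooled across documents before the ratios are taken.
    """
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(gold, predicted):
        gold_doc, pred_doc = set(gold_doc), set(pred_doc)
        tp += len(gold_doc & pred_doc)
        fp += len(pred_doc - gold_doc)
        fn += len(gold_doc - pred_doc)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical annotations for two short documents.
gold = [
    {("OFFENDER", "John Doe"), ("LOCATION", "Springfield")},
    {("VICTIM", "Jane Roe")},
]
predicted = [
    {("OFFENDER", "John Doe")},
    {("VICTIM", "Jane Roe"), ("LOCATION", "Ohio")},
]

print(micro_prf(gold, predicted))  # 2 true positives, 1 false positive, 1 false negative
```

Because counts are pooled, frequent entity types weigh more heavily in micro scores than they would in a macro average.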

[NLP-59] MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models

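[Quick read]: This paper addresses the difficulty of deploying long-context language models caused by high GPU memory demands during inference. It proposes Memory-efficient Offloaded Mini-sequence Inference (MOM), whose key idea is to partition critical layers into smaller "mini-sequences" and integrate seamlessly with KV cache offloading, greatly reducing memory use. Experiments show that MOM reduces peak memory usage by over 50% on average and extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, with identical outputs and no loss of accuracy. Compared with traditional chunked prefill, MOM achieves a 35% greater context length extension and drastically reduces prefill memory consumption, removing prefill as the dominant memory bottleneck during inference. The authors argue this shifts future optimization from the prefill stage to decode-stage residual KV cache efficiency.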

Link: https://arxiv.org/abs/2504.12526
Authors: Junyang Zhang, Tianyi Zhu, Cheng Luo, Anima Anandkumar
Affiliations: California Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to COLM

Abstract:Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller “mini-sequences” and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.

[NLP-60] Memorization vs. Reasoning: Updating LLMs with New Knowledge

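[Quick read]: This paper addresses the difficulty of updating the knowledge of large language models (LLMs) as real-world information evolves. Existing methods mainly target entity substitutions and fail to capture complex real-world dynamics. The paper introduces Knowledge Update Playground (KUP), an automatic pipeline that simulates realistic knowledge updates reflected in an evidence corpus, with an evaluation framework of direct and indirect probes that tests both memorization of updated facts and reasoning over them for any update-learning method. It also presents Memory Conditioned Training (MCT), a lightweight method that conditions tokens in the update corpus on self-generated "memory" tokens during training, encouraging LLMs to surface and reason over newly memorized knowledge at inference. Experiments show that the KUP benchmark is highly challenging and that MCT significantly outperforms prior Continued Pre-training (CPT) baselines, improving direct probing results by up to 25.4%. The key is thus KUP as a comprehensive simulation environment for knowledge updates, combined with MCT to strengthen memorization of and reasoning over new knowledge.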

Link: https://arxiv.org/abs/2504.12523
Authors: Aochong Oliver Li, Tanya Goyal
Affiliations: Cornell University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpora. KUP’s evaluation framework includes direct and indirect probes to both test memorization of updated facts and reasoning over them, for any update learning methods. Next, we present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated “memory” tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two strong LLMs show that (1) KUP benchmark is highly challenging, with the best CPT models achieving 2% in indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines, improving direct probing (memorization) results by up to 25.4%.

[NLP-61] Evaluating the Diversity and Quality of LLM Generated Content ICLR2025

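[Quick read]: The problem this paper addresses is that preference-tuning techniques (RLHF methods such as PPO and GRPO, which apply reinforcement learning to human preferences, and alternatives such as DPO) improve model quality but reduce output diversity, which creates a dilemma for applications that need diverse outputs.
The key to the solution is a new framework for measuring effective semantic diversity, meaning diversity among outputs that meet a quality threshold, rather than the diversity of form that traditional metrics capture. Experiments on open-ended tasks show that although preference-tuned models (especially those trained with reinforcement learning) have lower lexical and syntactic diversity, they achieve higher effective semantic diversity; this comes not from greater diversity among high-quality outputs but from producing more high-quality outputs overall. The study also distinguishes diversity of form from diversity of content, and shows that smaller models generate unique content more efficiently under a fixed sampling budget, which sheds light on the relationship between model scale and diversity.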

Link: https://arxiv.org/abs/2504.12522
Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani
Affiliations: University of Pennsylvania; New York University; UC Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025 Third Workshop on Deep Learning for Code

Abstract:Recent work suggests that preference-tuning techniques–including Reinforcement Learning from Human Preferences (RLHF) methods like PPO and GRPO, as well as alternatives like DPO–reduce diversity, creating a dilemma given that such models are widely deployed in applications requiring diverse outputs. To address this, we introduce a framework for measuring effective semantic diversity–diversity among outputs that meet quality thresholds–which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: although preference-tuned models–especially those trained via RL–exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models, not from increasing diversity among high-quality outputs, but from generating more high-quality outputs overall. We discover that preference tuning reduces syntactic diversity while preserving semantic diversity–revealing a distinction between diversity in form and diversity in content that traditional metrics often overlook. Our analysis further shows that smaller models are consistently more parameter-efficient at generating unique content within a fixed sampling budget, offering insights into the relationship between model scaling and diversity. These findings have important implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
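
One way to picture "diversity among outputs that meet quality thresholds" is to count distinct meanings only among samples that pass a quality bar. The sketch below does exactly that; the paper's actual metric is not given in the abstract, so the function, the semantic labels, and the quality scores are all illustrative assumptions.

```python
def effective_semantic_diversity(samples, quality_threshold):
    """Fraction of the sampling budget that yields a distinct, acceptable meaning.

    `samples` is a list of (semantic_label, quality_score) pairs, where the label
    is a hypothetical cluster id for the output's meaning.
    """
    if not samples:
        return 0.0
    acceptable_meanings = {label for label, quality in samples if quality >= quality_threshold}
    return len(acceptable_meanings) / len(samples)

# Toy data: the base model is varied but mostly low quality, while the
# preference-tuned model repeats itself more but usually clears the bar.
base_model = [("a", 0.2), ("b", 0.3), ("c", 0.9), ("d", 0.1)]
tuned_model = [("a", 0.9), ("a", 0.8), ("b", 0.9), ("b", 0.7)]

print(effective_semantic_diversity(base_model, 0.5))   # one acceptable meaning out of four samples
print(effective_semantic_diversity(tuned_model, 0.5))  # two acceptable meanings out of four samples
```

In this toy case the tuned model has fewer distinct meanings overall yet scores higher, mirroring the abstract's point that the gain comes from producing more high-quality outputs.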

[NLP-62] BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

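[Quick read]: This paper addresses how to evaluate the ability of agents to browse the web for information. Traditional benchmarks struggle to measure agent performance on complex retrieval of hard-to-find information. The paper proposes BrowseComp, a simple yet challenging benchmark of 1,266 questions that require persistently navigating the internet to find hard-to-find, entangled information. The key is a design that is both difficult and verifiable: predicted answers are short and easy to check against reference answers, so the benchmark focuses on the core capability of persistence and creativity in finding information.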

Link: https://arxiv.org/abs/2504.12516
Authors: Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, Amelia Glaese
Affiliations: OpenAI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present BrowseComp, a simple yet challenging benchmark for measuring the ability for agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at this https URL.

[NLP-63] Beyond Text: Characterizing Domain Expert Needs in Document Research

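[Quick Read]: This paper asks how well current text-based NLP systems model document research as domain experts actually conceptualize and perform it, and whether those systems capture the complexity and social context that experts rely on. The authors interview sixteen experts across two domains and find that their document research processes are idiosyncratic and iterative, and depend heavily on a document's social context. Existing approaches that treat the document as an object in its own right, rather than as a mere container for text, better reflect the experts' priorities, although they are often less accessible outside their research communities. The paper calls on the NLP community to consider the role of the document more carefully when building useful tools that are accessible, personalizable, iterative, and socially aware.

Link: https://arxiv.org/abs/2504.12495
Authors: Sireesh Gururaja,Nupoor Gandhi,Jeremiah Milbauer,Emma Strubell
Affiliations: Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: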

Abstract:Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare them to the current state of NLP systems. We find that our participants’ processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition to its content; existing approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants’ priorities, though they are often less accessible outside their research communities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.

[NLP-64] Accelerating Clinical NLP at Scale with a Hybrid Framework with Reduced GPU Demands: A Case Study in Dementia Identification

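[Quick Read]: This paper addresses the high computational cost of clinical natural language processing (clinical NLP) on large-scale clinical text, which limits its wider adoption. It proposes a hybrid NLP framework that combines rule-based filtering, a Support Vector Machine (SVM) classifier, and a BERT-based model to improve efficiency substantially while maintaining accuracy. The key is the integration of these techniques: the framework draws on the strengths of rule-based methods and classical machine learning models while exploiting the representational power of deep learning, which makes large-scale clinical text analysis both efficient and accurate.

Link: https://arxiv.org/abs/2504.12494
Authors: Jianlin Shi,Qiwei Gan,Elizabeth Hanchrow,Annie Bowles,John Stanley,Adam P. Bress,Jordana B. Cohen,Patrick R. Alba
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: This manuscript has been submitted to the AMIA 2025 Annual Symposium (this https URL)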

Abstract:Clinical natural language processing (NLP) is increasingly in demand in both clinical research and operational practice. However, most state-of-the-art solutions are transformer-based and require high computational resources, limiting their accessibility. We propose a hybrid NLP framework that integrates rule-based filtering, a Support Vector Machine (SVM) classifier, and a BERT-based model to improve efficiency while maintaining accuracy. We applied this framework in a dementia identification case study involving 4.9 million veterans with incident hypertension, analyzing 2.1 billion clinical notes. At the patient level, our method achieved a precision of 0.90, a recall of 0.84, and an F1-score of 0.87. Additionally, this NLP approach identified over three times as many dementia cases as structured data methods. All processing was completed in approximately two weeks using a single machine with dual A40 GPUs. This study demonstrates the feasibility of hybrid NLP solutions for large-scale clinical text analysis, making state-of-the-art methods more accessible to healthcare organizations with limited computational resources.
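
The abstract does not say how the three stages are wired together, so the following is only a minimal sketch of one plausible cascade: a cheap keyword filter runs first, a fast scorer (standing in for the SVM) decides the confident cases, and only uncertain notes reach the expensive model (standing in for BERT). The keyword list, the `low`/`high` thresholds, and the callables are all hypothetical. The sketch also checks that the reported F1 follows from the reported precision and recall.

```python
from typing import Callable, Iterable, List

# Hypothetical keyword list; the paper's actual rules are not given in the abstract.
KEYWORDS = ("dementia", "alzheimer", "memory loss")


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def rule_filter(note: str) -> bool:
    """Stage 1: keep only notes that mention at least one keyword."""
    text = note.lower()
    return any(keyword in text for keyword in KEYWORDS)


def cascade(
    notes: Iterable[str],
    cheap_model: Callable[[str], float],
    expensive_model: Callable[[str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> List[bool]:
    """Label notes, calling the expensive model only when the cheap score is uncertain."""
    labels = []
    for note in notes:
        if not rule_filter(note):
            labels.append(False)  # filtered out by rules, no model call needed
            continue
        score = cheap_model(note)  # stand-in for an SVM probability
        if score >= high:
            labels.append(True)
        elif score <= low:
            labels.append(False)
        else:
            labels.append(expensive_model(note))  # stand-in for a BERT classifier
    return labels


# The reported patient-level metrics are mutually consistent:
print(round(f1_score(0.90, 0.84), 2))  # 0.87
```

The point of such a design is that GPU time is spent only on the notes the cheaper stages cannot settle, which is one way a pipeline could get through billions of notes on a single machine.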

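[NLP-65] Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

[Quick Read]: This paper addresses the limited ability of common pre-training metrics, such as perplexity, to predict downstream fine-tuning performance at a fixed model size, a gap that hinders effective model selection and development. It reformulates the selection of pre-training checkpoints as a pairwise classification problem: predicting which of two large language models (LLMs) that differ in their pre-training will perform better after supervised fine-tuning (SFT). The key contribution is a set of new unsupervised and supervised proxy metrics derived from pre-training that reduce the relative performance prediction error rate by over 50%, laying the groundwork for more efficient pre-training schemes optimized for different downstream tasks.

Link: https://arxiv.org/abs/2504.12491
Authors: Hansi Zeng,Kai Hui,Honglei Zhuang,Zhen Qin,Zhenrui Yue,Hamed Zamani,Dana Alon
Affiliations: Google DeepMind; University of Illinois Urbana-Champaign; University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments: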

Abstract:While metrics available during pre-training, such as perplexity, correlate well with model performance in scaling-law studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and development. To address this gap, we formulate the task of selecting pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs, differing in their pre-training, will perform better after supervised fine-tuning (SFT). We construct a dataset using 50 1B-parameter LLM variants with systematically varied pre-training configurations, e.g., objectives or data, and evaluate them on diverse downstream tasks after SFT. We first conduct a study and demonstrate that the conventional perplexity is a misleading indicator. As such, we introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%. Despite the inherent complexity of this task, we demonstrate the practical utility of our proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
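
To make the pairwise formulation concrete, the sketch below scores a proxy metric by how often it picks the wrong winner among all pairs of models. It is an illustration under assumptions, not the paper's evaluation code: the function name, the model names, and all numbers are hypothetical, and the proxy is assumed to be oriented so that higher is better (a perplexity-style metric would need to be negated first).

```python
from itertools import combinations
from typing import Dict


def pairwise_error_rate(proxy: Dict[str, float], downstream: Dict[str, float]) -> float:
    """Fraction of model pairs where the proxy predicts the wrong winner after SFT."""
    errors = 0
    total = 0
    for a, b in combinations(sorted(proxy), 2):
        if downstream[a] == downstream[b]:
            continue  # no ground-truth winner for this pair
        total += 1
        predicted_a_wins = proxy[a] > proxy[b]
        actual_a_wins = downstream[a] > downstream[b]
        if predicted_a_wins != actual_a_wins:
            errors += 1
    return errors / total if total else 0.0


# Hypothetical proxy values and post-SFT scores for three model variants.
proxy_scores = {"m1": 0.2, "m2": 0.5, "m3": 0.9}
sft_scores = {"m1": 0.30, "m2": 0.45, "m3": 0.40}
print(pairwise_error_rate(proxy_scores, sft_scores))
```

In this toy example the proxy orders two of the three pairs correctly and misorders (m2, m3), so one pair in three is an error.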

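[NLP-66] Towards Conversational AI for Human-Machine Collaborative MLOps

[Quick Read]: This paper addresses the poor accessibility of complex MLOps (Machine Learning Operations) platforms such as Kubeflow by building a conversational agent system based on a Large Language Model (LLM), lowering the barrier to entry and improving human-machine collaboration. The key solution is the Swarm Agent architecture, which integrates several specialized agents: a KubeFlow Pipelines (KFP) Agent for ML workflow orchestration, a MinIO Agent for data management, and a Retrieval-Augmented Generation (RAG) Agent for domain-specific knowledge integration. Through a hierarchical, modular design and context-aware processing, users can discover, execute, and monitor ML pipelines, manage datasets and artifacts, and access relevant documentation in natural language. The approach makes complex systems more accessible while remaining flexible enough to extend to other platforms.

Link: https://arxiv.org/abs/2504.12477
Authors: George Fatouros,Georgios Makridis,George Kousiouris,John Soldatos,Anargyros Tsadimas,Dimosthenis Kyriazis
Affiliations: Innov-Acts Ltd; University of Piraeus; Harokopio University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 5 figures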

Abstract:This paper presents a Large Language Model (LLM) based conversational agent system designed to enhance human-machine collaboration in Machine Learning Operations (MLOps). We introduce the Swarm Agent, an extensible architecture that integrates specialized agents to create and manage ML workflows through natural language interactions. The system leverages a hierarchical, modular design incorporating a KubeFlow Pipelines (KFP) Agent for ML pipeline orchestration, a MinIO Agent for data management, and a Retrieval-Augmented Generation (RAG) Agent for domain-specific knowledge integration. Through iterative reasoning loops and context-aware processing, the system enables users with varying technical backgrounds to discover, execute, and monitor ML pipelines; manage datasets and artifacts; and access relevant documentation, all via intuitive conversational interfaces. Our approach addresses the accessibility gap in complex MLOps platforms like Kubeflow, making advanced ML tools broadly accessible while maintaining the flexibility to extend to other platforms. The paper describes the architecture, implementation details, and demonstrates how this conversational MLOps assistant reduces complexity and lowers barriers to entry for users across diverse technical skill levels.

[NLP-67] Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex

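[Quick Read]: This paper addresses a challenge specific to representation learning on text-attributed graphs (TAGs): capturing both the semantic richness of the text attached to each node and the structural dependencies of the graph. Graph neural networks (GNNs) handle topological information well but cannot process unstructured text, while large language models (LLMs) are proficient at text understanding but typically unaware of graph structure. The paper proposes BiGTex (Bidirectional Graph Text), an architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit applies mutual attention between textual and structural representations, so that information flows in both directions: text influences structure and structure guides the interpretation of text. The architecture is trained with parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Experiments show that BiGTex achieves state-of-the-art performance in node classification and generalizes well to link prediction.

Link: https://arxiv.org/abs/2504.12474
Authors: Azadeh Beiranvand,Seyed Mehdi Vahidipour
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures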

Abstract:Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model’s success.

[NLP-68] SLURG: Investigating the Feasibility of Generating Synthetic Online Fallacious Discourse

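[Quick Read]: This paper studies fallacy identification for the automatic detection of manipulation on social media, and in particular how logical fallacies appear in real-world settings such as internet forums. The authors find that misinformation and misguided intent are prevalent on discussion boards centered on the Ukrainian-Russian conflict, which narrows the domain of the task. Although automatic fallacy detection has attracted attention recently, most existing datasets use unregulated fallacy taxonomies or are limited to formal language domains such as political debates or news reports, whereas online discourse often features non-standardized and diverse language that those domains do not capture. To address these limitations, the paper presents Shady Linguistic Utterance Replication-Generation (SLURG), which uses large language models (LLMs), specifically DeepHermes-3-Mistral-24B, to generate synthetic fallacious forum-style comments. The key findings are that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts improve their ability to mimic the vocabulary diversity of online forums.

Link: https://arxiv.org/abs/2504.12466
Authors: Cal Blanco,Gavin Dsouza,Hugo Lin,Chelsey Rush
Affiliations: University of California, Santa Cruz
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 11 figures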

Abstract:In our paper we explore the definition and extrapolation of fallacies as they pertain to the automatic detection of manipulation on social media. In particular, we explore how these logical fallacies might appear in the real world, i.e., in internet forums. We discovered a prevalence of misinformation and misguided intention in discussion boards specifically centered around the Ukrainian-Russian conflict, which serves to narrow the domain of our task. Although automatic fallacy detection has gained attention recently, most datasets use unregulated fallacy taxonomies or are limited to formal linguistic domains like political debates or news reports. Online discourse, however, often features non-standardized and diverse language not captured in these domains. We present Shady Linguistic Utterance Replication-Generation (SLURG) to address these limitations, exploring the feasibility of generating synthetic fallacious forum-style comments using large language models (LLMs), specifically DeepHermes-3-Mistral-24B. Our findings indicate that LLMs can replicate the syntactic patterns of real data and that high-quality few-shot prompts enhance LLMs’ ability to mimic the vocabulary diversity of online forums.

[NLP-69] On Linear Representations and Pretraining Data Frequency in Language Models ICLR2025

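[Quick Read]: This paper investigates what causes linear representations to form in language models (LMs). It focuses on the relationship between pretraining data frequency and linear representations of factual relations, and finds that for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. The key finding is that, in OLMo-7B and GPT-J, a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1,000 and 2,000 times, respectively, during pretraining. The authors also train a regression model on measurements of linear representation quality in fully trained LMs to predict how often a term appeared in pretraining, which offers a new method for estimating properties of the otherwise unknown training data of closed-data models. The results suggest that the strength of linear representations carries signal about the pretraining corpus and may open new avenues for controlling and improving model behavior.

Link: https://arxiv.org/abs/2504.12459
Authors: Jack Merullo,Noah A. Smith,Sarah Wiegreffe,Yanai Elazar
Affiliations: Brown University; Allen Institute for AI (Ai2); University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025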

Abstract:Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data’s effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded ‘linearly’ in the representations, but what factors cause these representations to form? We study the connection between pretraining data frequency and models’ linear representations of factual relations. We find evidence that the formation of linear representations is strongly connected to pretraining term frequencies; specifically for subject-relation-object fact triplets, both subject-object co-occurrence frequency and in-context learning accuracy for the relation are highly correlated with linear representations. This is the case across all phases of pretraining. In OLMo-7B and GPT-J, we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1k and 2k times, respectively, regardless of when these occurrences happen during pretraining. Finally, we train a regression model on measurements of linear representation quality in fully-trained LMs that can predict how often a term was seen in pretraining. Our model achieves low error even on inputs from a different model with a different pretraining dataset, providing a new method for estimating properties of the otherwise-unknown training data of closed-data models. We conclude that the strength of linear representations in LMs contains signal about the models’ pretraining corpora that may provide new avenues for controlling and improving model behavior: particularly, manipulating the models’ training data to meet specific frequency thresholds.

[NLP-70] Position: The Most Expensive Part of an LLM should be its Training Data

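[Quick Read]: This position paper addresses the severe underestimation of the cost of producing training data in Large Language Model (LLM) development. Although training an LLM demands enormous computational and engineering resources, the human labor behind the training data is rarely acknowledged or compensated. The paper proposes a way to assign a monetary value to that labor and argues that compensation for training data producers should be the most expensive part of producing an LLM. Studying 64 LLMs, it finds that even under highly conservative wage estimates, the cost of producing the training datasets is 10 to 1,000 times larger than the cost of training the models themselves, which represents a significant financial liability for LLM providers. The paper also discusses research directions that could enable fairer compensation practices in the future.

Link: https://arxiv.org/abs/2504.12427
Authors: Nikhil Kandpal,Colin Raffel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures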

Abstract:Training a state-of-the-art Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands. Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models’ training data. Every LLM is built on an unfathomable amount of human effort: trillions of carefully written words sourced from books, academic papers, codebases, social media, and more. This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers for their work. To support this position, we study 64 LLMs released between 2016 and 2024, estimating what it would cost to pay people to produce their training datasets from scratch. Even under highly conservative estimates of wage rates, the costs of these models’ training datasets are 10-1000 times larger than the costs to train the models themselves, representing a significant financial liability for LLM providers. In the face of the massive gap between the value of training data and the lack of compensation for its creation, we highlight and discuss research directions that could enable fairer practices in the future.
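
The kind of estimate described here can be sketched as a back-of-the-envelope calculation: convert tokens to words, words to writing hours, and hours to wages. The abstract does not give the paper's actual formula or inputs, so the function and every number below are hypothetical placeholders chosen only to show the arithmetic.

```python
def data_production_cost(
    tokens: float,
    words_per_token: float,
    words_per_hour: float,
    hourly_wage: float,
) -> float:
    """Estimated cost of paying people to write a corpus from scratch."""
    words = tokens * words_per_token
    hours = words / words_per_hour
    return hours * hourly_wage


# All inputs are hypothetical and are not taken from the paper.
cost = data_production_cost(
    tokens=1e12,           # size of the training corpus in tokens
    words_per_token=0.75,  # rough token-to-word conversion
    words_per_hour=2400,   # 40 words per minute of writing
    hourly_wage=5.0,       # a deliberately low wage
)
hypothetical_training_cost = 1e7  # placeholder compute bill in the same currency
print(f"data cost: {cost:.3e}, ratio to training cost: {cost / hypothetical_training_cost:.0f}x")
```

Because corpus size enters linearly, an estimate of this shape grows with every additional token, which is why a ratio between data cost and training cost can become large even with a low assumed wage.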

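[NLP-71] A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment

[Quick Read]: This paper examines the robustness and reliability of relevance judgments generated by Large Language Models (LLMs) for information retrieval (IR) tasks. Its key contribution is a systematic evaluation of how prompt sensitivity affects those judgments. The authors collect prompts from 15 human experts and 15 LLMs for three tasks (binary, graded, and pairwise), compare the LLM-generated labels with the official TREC human labels using Cohen's kappa and pairwise agreement measures, and analyze how prompt variation affects agreement with human labels. They also compare human-written and LLM-written prompts, analyze differences among LLMs acting as judges, and compare the collected prompts with the standard UMBRELA prompt used by Bing and the TREC 2024 RAG Track. All data and prompts are released to support future research on LLM-based relevance evaluation.

Link: https://arxiv.org/abs/2504.12408
Authors: Negar Arabzadeh,Charles L. A. Clarke
Affiliations: University of Waterloo, Waterloo, Ontario, Canada
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: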

Abstract:Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks (binary, graded, and pairwise), yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning Datasets (2020 and 2021). We compare LLM-generated labels with TREC official human labels using Cohen’s kappa and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among different LLMs as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at this https URL.
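
Cohen's kappa, the agreement statistic named in the abstract, corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The labels are made up for illustration and are not from the TREC data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical binary relevance labels for eight query/document pairs.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 1, 0, 0, 1, 1, 0]
print(cohens_kappa(human, llm))  # 0.5
```

Here the two label sets agree on 6 of 8 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. The prompt counts in the abstract can be checked the same way: (15 + 15) sources times 3 tasks gives 90 prompts, and dropping 6 sources leaves 24 times 3, or 72.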

[NLP-72] A Method for Handling Negative Similarities in Explainable Graph Spectral Clustering of Text Documents – Extended Version CCS

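[Quick Read]: This paper addresses Graph Spectral Clustering (GSC) in the presence of negative similarities, which arise from document embeddings such as doc2vec and GloVe that differ from the traditional Term Vector Space. It discusses solutions for both combinatorial and normalized Laplacians and experimentally compares the advantages and disadvantages of six solutions proposed in the literature and in this research. The key findings are that GloVe embeddings frequently cause normalized-Laplacian-based GSC to fail, and that methods which cure negative similarities improve accuracy for both combinatorial and normalized Laplacian based GSC while also making the authors' explanation methods applicable to GloVe embeddings.

Link: https://arxiv.org/abs/2504.12360
Authors: Mieczysław A. Kłopotek,Sławomir T. Wierzchoń,Bartłomiej Starosta,Dariusz Czerski,Piotr Borkowski
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 1 figure, 17 pages, this is an extended version of a paper accepted for the 25th International Conference on Computational Science (ICCS), 7-9 July 2025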

Abstract:This paper investigates the problem of Graph Spectral Clustering with negative similarities, resulting from document embeddings different from the traditional Term Vector Space (like doc2vec, GloVe, etc.). Solutions for combinatorial Laplacians and normalized Laplacians are discussed. An experimental investigation shows the advantages and disadvantages of 6 different solutions proposed in the literature and in this research. The research demonstrates that GloVe embeddings frequently cause failures of normalized Laplacian based GSC due to negative similarities. Furthermore, application of methods curing similarity negativity leads to accuracy improvement for both combinatorial and normalized Laplacian based GSC. It also leads to applicability for GloVe embeddings of explanation methods developed originally by the authors for Term Vector Space embeddings.
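
One way negative similarities can break the normalized Laplacian is through node degrees: with degree d_i defined as the sum of similarities in row i, a non-positive d_i leaves D^(-1/2) undefined. The sketch below illustrates that mechanism on a hypothetical 3x3 cosine-similarity matrix and applies a simple affine shift, s -> (s + 1) / 2, as a remedy. Both the failure mechanism shown and the shift are assumptions made for illustration; the abstract does not list the six solutions the paper compares.

```python
import math
from typing import List

Matrix = List[List[float]]


def degrees(S: Matrix) -> List[float]:
    """Row sums of the similarity matrix, ignoring the diagonal."""
    return [sum(v for j, v in enumerate(row) if j != i) for i, row in enumerate(S)]


def combinatorial_laplacian(S: Matrix) -> Matrix:
    """L = D - S, with the diagonal of S treated as zero."""
    d = degrees(S)
    n = len(S)
    return [[d[i] if i == j else -S[i][j] for j in range(n)] for i in range(n)]


def normalized_laplacian(S: Matrix) -> Matrix:
    """L_sym = I - D^(-1/2) S D^(-1/2); requires strictly positive degrees."""
    d = degrees(S)
    if any(x <= 0 for x in d):
        raise ValueError("non-positive degree: D^(-1/2) is undefined")
    n = len(S)
    return [
        [1.0 if i == j else -S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
        for i in range(n)
    ]


def shift_to_nonnegative(S: Matrix) -> Matrix:
    """Map cosine similarities from [-1, 1] to [0, 1] (illustrative remedy only)."""
    n = len(S)
    return [[0.0 if i == j else (S[i][j] + 1) / 2 for j in range(n)] for i in range(n)]


# Hypothetical cosine similarities between three document embeddings.
S = [[0.0, 0.6, -0.8], [0.6, 0.0, -0.1], [-0.8, -0.1, 0.0]]
try:
    normalized_laplacian(S)
except ValueError as err:
    print("raw similarities:", err)
print("degrees after shift:", degrees(shift_to_nonnegative(S)))
```

The combinatorial Laplacian can still be formed from the raw matrix, but its degrees may be negative, so the usual positive semi-definiteness guarantee no longer holds.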

[NLP-73] Replicating ReLM Results: Validating Large Language Models with ReLM

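[Quick Read]: This paper concerns the use of formal languages to evaluate and control Large Language Models (LLMs) with respect to memorization, bias, and zero-shot performance. Current approaches for evaluating these behaviors are often slow, imprecise, costly, or introduce biases of their own, yet such evaluation remains necessary because of how much these behaviors matter when LLMs are put into production. The paper's contribution is to reproduce key results from the original ReLM paper and to expand on the approach and its applications, with an emphasis on its relevance to the field of systems for machine learning and on the potential of formal-language methods to make evaluation faster and more precise.

Link: https://arxiv.org/abs/2504.12357
Authors: Reece Adamson,Erin Song
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL)
Comments: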

Abstract:Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, but are necessary due to the importance of this behavior when productionizing LLMs. This project reproduces key results from the original ReLM paper and expounds on the approach and applications with an emphasis on the relevance to the field of systems for machine learning.

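[NLP-74] Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media

[Quick Read]: This paper addresses drug overdose as a global health problem, particularly overdose driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods have limitations, whereas social media offers real-time insight into self-reported substance use and overdose symptoms. The key solution is an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and their associated overdose symptoms. Using a hybrid annotation strategy that combines large language models (LLMs) with human annotators, the study applies traditional machine learning models, neural networks, and advanced transformer-based models. The framework reaches 98% accuracy in multi-class classification and 97% in multi-label classification, outperforming baseline models by up to 8%. These results indicate the potential of AI to support public health surveillance and personalized intervention strategies.

Link: https://arxiv.org/abs/2504.12355
Authors: Muhammad Ahmad,Muhammad Waqas,Ildar Batyrshin,Grigori Sidorov
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: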

Abstract:Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.

[NLP-75] A Large-Language Model Framework for Relative Timeline Extraction from PubMed Case Reports

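[Quick Read]: This paper addresses the poor capture of the timing of clinical events, which is needed for analyses such as characterizing patient trajectories, process tracing, forecasting, and causal reasoning. Structured electronic health records capture only a few of the critical data elements, and clinical reports lack temporal localization of events in structured form. The paper proposes a system that transforms case reports into textual time series, that is, structured pairs of events and their timestamps. The key to the solution is using a large language model (LLM) to annotate events and timestamps; its event recall (0.80) and temporal concordance (0.95) are evaluated to validate event identification and temporal localization, providing a benchmark for temporal analytics on the PubMed open-access corpus.

Link: https://arxiv.org/abs/2504.12350
Authors: Jing Wang,Jeremy C Weiss
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: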

Abstract:Timing of clinical events is central to characterization of patient trajectories, enabling analyses such as process tracing, forecasting, and causal reasoning. However, structured electronic health records capture few data elements critical to these tasks, while clinical reports lack temporal localization of events in structured form. We present a system that transforms case reports into textual time series: structured pairs of textual events and timestamps. We contrast manual and large language model (LLM) annotations (n=320 and n=390 respectively) of ten randomly-sampled PubMed open-access (PMOA) case reports (N=152,974) and assess inter-LLM agreement (n=3,103; N=93). We find that the LLMs have moderate event recall (O1-preview: 0.80) but high temporal concordance among identified events (O1-preview: 0.95). By establishing the task, annotation, and assessment systems, and by demonstrating high concordance, this work may serve as a benchmark for leveraging the PMOA corpus for temporal analytics.
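
A textual time series can be represented as a mapping from event text to a timestamp relative to some anchor. The sketch below, under stated assumptions, computes two scores in the spirit of those reported: event recall as the share of reference events that were recovered, and temporal concordance as the share of recovered event pairs whose relative order matches the reference. Exact string matching, the pairwise-order definition of concordance, and all events and hours are assumptions for illustration; the abstract does not define the paper's matching or scoring procedure.

```python
from itertools import combinations
from typing import Dict


def event_recall(reference: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Share of reference events that also appear in the predicted series."""
    if not reference:
        return 0.0
    return len(set(reference) & set(predicted)) / len(reference)


def temporal_concordance(reference: Dict[str, float], predicted: Dict[str, float]) -> float:
    """Share of shared event pairs whose relative order agrees with the reference."""
    shared = sorted(set(reference) & set(predicted))
    agree = 0
    total = 0
    for a, b in combinations(shared, 2):
        ref_diff = reference[a] - reference[b]
        pred_diff = predicted[a] - predicted[b]
        if ref_diff == 0:
            continue  # simultaneous in the reference: no order to check
        total += 1
        if ref_diff * pred_diff > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical case report: hours relative to admission.
manual = {"fever": -72, "admission": 0, "ct scan": 2, "surgery": 24, "discharge": 168}
llm = {"fever": -48, "admission": 0, "surgery": 12, "ct scan": 20, "rash": 5}
print(event_recall(manual, llm), temporal_concordance(manual, llm))
```

In this toy case four of five reference events are recovered, and five of the six shared pairs keep their order; only the CT scan and surgery are swapped.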

[NLP-76] Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination

[Quick Read]: This paper evaluates the mathematical reasoning of Large Language Models (LLMs) and explores their potential in high-stakes educational assessment. Using the Finnish matriculation examination as the benchmark, it compares the performance of LLMs at different stages of their development. The key observation is a substantial improvement in mathematical performance as the models evolved: some models eventually achieved near-perfect or perfect scores, matching top student performance, which illustrates their potential to support educational assessment at scale.

Link: https://arxiv.org/abs/2504.12347
Authors: Mika Setälä,Pieta Sikström,Ville Heilala,Tommi Kärkkäinen
Affiliations: University of Jyväskylä
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential to also support educational assessments at scale.

[NLP-77] Reimagining Urban Science: Scaling Causal Inference with Large Language Models

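[Quick Read]: This Perspective addresses three challenges in urban causal research: inefficient and biased hypothesis generation, barriers posed by multimodal data complexity, and the methodological fragility of causal experimentation. The key proposal is AutoUrbanCI, a conceptual framework driven by Large Language Models (LLMs) and composed of four modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. The framework uses AI to strengthen urban causal analysis while emphasizing human-AI collaboration to improve rigor and transparency and to safeguard equity and accountability.

Link: https://arxiv.org/abs/2504.12345
Authors: Yutong Xia,Ao Qu,Yunhan Zheng,Yihong Tang,Dingyi Zhuang,Yuxuan Liang,Cathy Wu,Roger Zimmermann,Jinhua Zhao
Affiliations: MIT; NUS; McGill University; HKUST (Guangzhou)
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: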

Abstract:Urban causal research is essential for understanding the complex dynamics of cities and informing evidence-based policies. However, it is challenged by the inefficiency and bias of hypothesis generation, barriers to multimodal data complexity, and the methodological fragility of causal experimentation. Recent advances in large language models (LLMs) present an opportunity to rethink how urban causal analysis is conducted. This Perspective examines current urban causal research by analyzing taxonomies that categorize research topics, data sources, and methodological approaches to identify structural gaps. We then introduce an LLM-driven conceptual framework, AutoUrbanCI, composed of four distinct modular agents responsible for hypothesis generation, data engineering, experiment design and execution, and results interpretation with policy recommendations. We propose evaluation criteria for rigor and transparency and reflect on implications for human-AI collaboration, equity, and accountability. We call for a new research agenda that embraces AI-augmented workflows not as replacements for human expertise but as tools to broaden participation, improve reproducibility, and unlock more inclusive forms of urban causal reasoning.

[NLP-78] Propaganda via AI? A Study on Semantic Backdoors in Large Language Models

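[Quick Read]: This paper addresses semantic backdoor attacks on Large Language Models (LLMs). Such attacks embed covert triggers at the conceptual level (for example, ideological stances or cultural references) to systematically manipulate model outputs. Traditional defenses miss them because the triggers rely on meaning-based cues rather than lexical anomalies. The key contribution is RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), a black-box detection framework that combines semantic entropy with cross-model consistency analysis. It probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs, while cross-model comparison separates model-specific anomalies from corpus-wide biases. The framework uncovers previously undetected semantic backdoors and provides empirical support for concept-level auditing of deployed language models. Code and data are open-sourced.

Link: https://arxiv.org/abs/2504.12344
Authors: Nay Myat Min,Long H. Pham,Yige Li,Jun Sun
Affiliations: Singapore Management University, Singapore
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 1 figure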

Abstract:Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors: covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for “Response Anomaly Vigilance for uncovering semantic backdoors”), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at this https URL.
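
The two signals described here can be sketched in a few lines: semantic entropy measures how spread out a model's sampled responses are across meaning clusters, and a cross-model check flags a model only when it is unusually uniform while its peers are not. This is a minimal illustration, not RAVEN itself: cluster labels are assumed to be precomputed (the paper obtains them via bidirectional entailment), and the function names, thresholds, and data are hypothetical.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List, Sequence


def semantic_entropy(cluster_ids: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over clusters."""
    if not cluster_ids:
        raise ValueError("need at least one sampled response")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def flag_uniform_models(
    entropy_by_model: Dict[str, float],
    low: float = 0.2,   # hypothetical threshold for "anomalously uniform"
    high: float = 0.6,  # hypothetical threshold for "peers are diverse"
) -> List[str]:
    """Flag models with low entropy on a prompt where peer models stay diverse."""
    flagged = []
    for name, entropy in entropy_by_model.items():
        peers = [v for other, v in entropy_by_model.items() if other != name]
        if peers and entropy < low and statistics.median(peers) > high:
            flagged.append(name)
    return flagged


# Hypothetical cluster labels for ten sampled responses per model on one prompt.
samples = {
    "model_a": ["pro"] * 10,
    "model_b": ["pro"] * 4 + ["con"] * 3 + ["neutral"] * 3,
    "model_c": ["pro"] * 5 + ["con"] * 5,
}
entropies = {name: semantic_entropy(ids) for name, ids in samples.items()}
print(flag_uniform_models(entropies))
```

Comparing against peers matters because if every model answers a prompt uniformly, the uniformity more likely reflects a corpus-wide bias than a backdoor in any one model.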

[NLP-79] Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation

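[Quick Read]: This paper addresses the lack of a benchmark dataset for evaluating retrieval-augmented Large Language Models (LLMs) in the biopharmaceutical domain, and proposes a new evaluation method to compensate for the shortcomings of traditional question-answering (QA) metrics in open-ended retrieval-augmented QA. The key contribution is the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored to the biopharmaceutical domain, available in English, French, German, and Chinese, together with a citation-based classification method for evaluating a model's Query and Reference Understanding Capability (QRUC). Experiments show a significant gap in the biopharmaceutical QRUC of mainstream LLMs, which needs further improvement.

Link: https://arxiv.org/abs/2504.12342
Authors: Hanmeng Zhong,Linqing Chen,Weilei Wang,Wentao Wu
Affiliations: PatSnap Co., LTD.
Categories: Computation and Language (cs.CL)
Comments: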

Abstract:Recently, the application of retrieval-augmented Large Language Models (LLMs) in specific domains has gained significant attention, especially in biopharmaceuticals. However, in this context, there is no benchmark specifically designed for biopharmaceuticals to evaluate LLMs. In this paper, we introduce the Biopharmaceuticals Retrieval-Augmented Generation Evaluation (BRAGE), the first benchmark tailored for evaluating LLMs’ Query and Reference Understanding Capability (QRUC) in the biopharmaceutical domain, available in English, French, German and Chinese. In addition, traditional Question-Answering (QA) metrics like accuracy and exact match fall short in open-ended retrieval-augmented QA scenarios. To address this, we propose a citation-based classification method to evaluate the QRUC of LLMs to understand the relationship between queries and references. We apply this method to evaluate the mainstream LLMs on BRAGE. Experimental results show that there is a significant gap in the biopharmaceutical QRUC of mainstream LLMs, and their QRUC needs to be improved.

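[NLP-80] Streamlining Biomedical Research with Specialized LLMs

[Quick Read]: This paper addresses cross-modal information integration and efficient human-computer interaction, with the aim of improving research efficiency and decision support in the biomedical and pharmaceutical domains. The key to the solution is combining state-of-the-art, domain-specific large language models with advanced information retrieval techniques in a system whose components interact seamlessly. A robust question-answering model is used to cross-validate outputs, producing more precise, high-quality responses enriched with relevant data, images, tables, and other modalities. This integrated approach significantly improves the quality of dialogue generation and provides a platform for real-time, high-fidelity interaction that gives users simultaneous access to a wide range of literature and data.

Link: https://arxiv.org/abs/2504.12341
Authors: Linqing Chen,Weilei Wang,Yubin Xia,Wentao Wu,Peng Xu,Zilong Bai,Jie Fang,Chaobo Xu,Ran Hu,Licong Xu,Haoran Hua,Jing Sun,Hanmeng Zhong,Jin Liu,Tian Qiu,Haowen Liu,Meng Hu,Xiuwen Li,Fei Gao,Yong Gu,Tao Shi,Chaochao Wang,Jianping Lu,Cheng Sun,Yixin Wang,Shengjie Yang,Yuancheng Li,Lu Jin,Lisha Zhang,Fu Bian,Zhongkai Ye,Lidong Pei,Changyang Tu
Affiliations: PatSnap Co., LTD.
Categories: Computation and Language (cs.CL)
Comments: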

点击查看摘要

Abstract:In this paper, we propose a novel system that integrates state-of-the-art, domain-specific large language models with advanced information retrieval techniques to deliver comprehensive and context-aware responses. Our approach facilitates seamless interaction among diverse components, enabling cross-validation of outputs to produce accurate, high-quality responses enriched with relevant data, images, tables, and other modalities. We demonstrate the system’s capability to enhance response precision by leveraging a robust question-answering model, significantly improving the quality of dialogue generation. The system provides an accessible platform for real-time, high-fidelity interactions, allowing users to benefit from efficient human-computer interaction, precise retrieval, and simultaneous access to a wide range of literature and data. This dramatically improves the research efficiency of professionals in the biomedical and pharmaceutical domains and facilitates faster, more informed decision-making throughout the R&D process. Furthermore, the system proposed in this paper is available at this https URL.

[NLP-81] GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

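Quick read: This paper addresses three core challenges in existing text-to-speech (TTS) systems based on large language models (LLMs): 1) irreversible loss of acoustic characteristics caused by quantizing speech prompts; 2) heavy dependence on precisely aligned speech-text pairs, which limits real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension while optimizing for speech token generation. To address them, the paper proposes GOAT-TTS, an LLM-based TTS generation approach whose key innovation is a dual-branch architecture: 1) a modality-alignment branch that combines a speech encoder and a projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without relying on transcripts; and 2) a speech-generation branch that predicts speech tokens through modular fine-tuning of the top-k layers while freezing the bottom-k layers, preserving foundational linguistic knowledge. Multi-token prediction is also introduced to support real-time streaming TTS synthesis. Experiments show that GOAT-TTS performs comparably to state-of-the-art TTS models and validate the efficacy of synthesized dialect speech data.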

Link: https://arxiv.org/abs/2504.12339
Authors: Yaodong Song, Hongjie Chen, Jie Lian, Yuxin Zhang, Guangmin Xia, Zehan Li, Genliang Zhao, Jian Kang, Yongxiang Li, Jie Li
Affiliations: Institute of Artificial Intelligence (TeleAI), China Telecom (中国电信), Beijing
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions between three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs that limit real-world deployment; and 3) catastrophic forgetting of the LLM’s native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.

[NLP-82] Paging Dr. GPT: Extracting Information from Clinical Notes to Enhance Patient Predictions

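Quick read: This paper asks how information in unstructured clinical notes from electronic medical records can be used to improve patient-level mortality prediction. The key to the solution is a transparent framework that takes the answers GPT-4o-mini (ChatGPT) gives to simple clinical questions about a patient's discharge summary and uses them as input features in logistic regression models. The study finds that GPT-based models alone can outperform models trained on standard tabular data, and that combining structured and unstructured sources improves prediction further, increasing AUC by an average of 5.1 percentage points and positive predictive value by 29.9 percent for the highest-risk decile. The results show the value of integrating large language models (LLMs) into clinical prediction tasks and their broader potential in any domain where text data remains underutilized.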

Link: https://arxiv.org/abs/2504.12338
Authors: David Anderson, Michaela Anderson, Margret Bjarnadottir, Stephen Mahar, Shriyan Reyya
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Paper and Online Supplement combined into one PDF. 26 pages. 2 figures

Abstract:There is a long history of building predictive models in healthcare using tabular data from electronic medical records. However, these models fail to extract the information found in unstructured clinical notes, which document diagnosis, treatment, progress, medications, and care plans. In this study, we investigate how answers generated by GPT-4o-mini (ChatGPT) to simple clinical questions about patients, when given access to the patient’s discharge summary, can support patient-level mortality prediction. Using data from 14,011 first-time admissions to the Coronary Care or Cardiovascular Intensive Care Units in the MIMIC-IV Note dataset, we implement a transparent framework that uses GPT responses as input features in logistic regression models. Our findings demonstrate that GPT-based models alone can outperform models trained on standard tabular data, and that combining both sources of information yields even greater predictive power, increasing AUC by an average of 5.1 percentage points and increasing positive predictive value by 29.9 percent for the highest-risk decile. These results highlight the value of integrating large language models (LLMs) into clinical prediction tasks and underscore the broader potential for using LLMs in any domain where unstructured text data remains an underutilized resource.
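
The abstract describes using GPT responses as input features in logistic regression. A minimal sketch of that idea follows, assuming yes/no answers encoded as 0/1 and a plain gradient-descent fit. The two questions, the toy data, and the helper names `fit_logistic` and `predict_risk` are hypothetical illustrations and are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_risk(w, b, x) -> float:
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical features: [answer to question 1, answer to question 2],
# where 1 means the LLM answered "yes" for that discharge summary.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]  # toy outcome labels, not real patient data

w, b = fit_logistic(X, y)
```

Because each feature is the answer to a readable question, the fitted coefficients can be inspected directly, which is one plausible reading of "transparent" in the abstract.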

[NLP-83] “It Listens Better Than My Therapist”: Exploring Social Media Discourse on LLMs as Mental Health Tool

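Quick read: This paper explores how users use, perceive, and experience large language models (LLMs) as informal mental health support tools. It analyzes more than 10,000 TikTok comments using a self-developed tiered coding schema and supervised classification models to identify user experiences, attitudes, and recurring themes. The key contribution is quantifying users' views on LLMs for mental health, surfacing both benefits (accessibility, emotional support, and perceived therapeutic value) and concerns (privacy, generic responses, and the lack of professional oversight). The study also stresses the need for further clinical and ethical scrutiny to ensure that AI is used safely and effectively for mental health support.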

Link: https://arxiv.org/abs/2504.12337
Authors: Anna-Carolina Haensch
Affiliations: LMU Munich (慕尼黑大学); University of Maryland, College Park (马里兰大学帕克分校)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: This study does not endorse or encourage the use of AI tools as substitutes for professional mental health support. The findings are presented for research purposes only, and any interpretation should take into account the limitations and potential risks of relying on AI in mental health contexts

Abstract:The emergence of generative AI chatbots such as ChatGPT has prompted growing public and academic interest in their role as informal mental health support tools. While early rule-based systems have been around for several years, large language models (LLMs) offer new capabilities in conversational fluency, empathy simulation, and availability. This study explores how users engage with LLMs as mental health tools by analyzing over 10,000 TikTok comments from videos referencing LLMs as mental health tools. Using a self-developed tiered coding schema and supervised classification models, we identify user experiences, attitudes, and recurring themes. Results show that nearly 20% of comments reflect personal use, with these users expressing overwhelmingly positive attitudes. Commonly cited benefits include accessibility, emotional support, and perceived therapeutic value. However, concerns around privacy, generic responses, and the lack of professional oversight remain prominent. It is important to note that the user feedback does not indicate which therapeutic framework, if any, the LLM-generated output aligns with. While the findings underscore the growing relevance of AI in everyday practices, they also highlight the urgent need for clinical and ethical scrutiny in the use of AI for mental health support.

[NLP-84] You’ve Changed: Detecting Modification of Black-Box Large Language Models

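Quick read: This paper addresses the difficulty developers face in detecting behavior changes in large language models (LLMs) served through an API. The key to the solution is monitoring LLM behavior by comparing the distributions of linguistic and psycholinguistic features of generated text. Specifically, a statistical test determines whether the feature distributions of two text samples are equivalent, which lets developers identify when an LLM has changed. The paper also explores using the method to detect prompt injection attacks. The approach enables frequent monitoring of LLM behavior and avoids computationally expensive benchmark evaluations.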

Link: https://arxiv.org/abs/2504.12335
Authors: Alden Dima, James Foulds, Shimei Pan, Philip Feldman
Affiliations: University of Maryland, Baltimore County (马里兰大学巴尔的摩县校区)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 4 figures

Abstract:Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitor LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta’s Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
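
The monitoring idea in the abstract (compare the distribution of a simple text feature across two samples with a statistical test) can be sketched as follows. The abstract names neither the features nor the test, so both choices here are assumptions: mean word length as the feature and a two-sided permutation test on the difference of means. The function names are hypothetical.

```python
import random
from statistics import mean


def mean_word_length(text: str) -> float:
    """One simple linguistic feature: average word length of a text."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def model_changed(old_texts, new_texts, alpha=0.05) -> bool:
    """Flag a change when the feature distributions differ significantly."""
    old = [mean_word_length(t) for t in old_texts]
    new = [mean_word_length(t) for t in new_texts]
    return permutation_p_value(old, new) < alpha
```

In practice one would sample responses to a fixed prompt set on a schedule and compare each new batch against a stored baseline; a real deployment would likely track many features and correct for multiple comparisons.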

[NLP-85] QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

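Quick read: This paper addresses the challenges large language models (LLMs) face in specialized biomedical tasks, in particular the performance degradation caused by the complexity of medical reasoning and the sensitivity of clinical data. Existing models often struggle with intricate medical terminology and tasks that require accurate clinical insight, and their performance drops further when they are quantized for resource-constrained deployment. To address this, the paper proposes Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. Its key idea is to use Tree of Thought (ToT) reasoning to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers, which substantially improves INT4-quantized models on the MedQA-USMLE dataset. Accuracy rises from 34% to 50% for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b. The paper also proposes a ToT-based data distillation method; compared with traditional distillation, it achieves an improvement of 86.27% while using only 3.9% of the data. The work is the first to show the potential of ToT for complex biomedical tasks, and lays a foundation for deploying high-performing quantized LLMs in resource-limited medical settings.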

Link: https://arxiv.org/abs/2504.12334
Authors: Zongxian Yang, Jiayu Qian, Zhi-An Huang, Kay Chen Tan
Affiliations: City University of Hong Kong (Dongguan) (香港城市大学东莞校区); The Hong Kong Polytechnic University (香港理工大学)
Subjects: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQA-USMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. In addition, we propose an effective data distillation method based on ToT. Compared to the traditional distillation method, we achieve an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLMs in resource-limited medical settings.

[NLP-86] Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games

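Quick read: This paper addresses the particular challenge of evaluating open-ended responses in serious games, where correctness is often subjective. It investigates how reliably five small-scale large language models (LLMs) assess player responses in En-join, a serious game that simulates decision-making within energy communities. The key to the approach is a systematic comparison of the models across different evaluation scenarios using traditional binary classification metrics (accuracy, true positive rate, and true negative rate), which reveals each model's trade-offs between sensitivity, specificity, and overall performance. The results show that some models excel at identifying correct responses while others struggle with false positives or inconsistent evaluations, underscoring the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. The work offers insights into the trustworthiness of AI-driven assessment tools and into how different LLM architectures handle subjective evaluation tasks.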

Link: https://arxiv.org/abs/2504.12333
Authors: Andrés Isaza-Giraldo, Paulo Bala, Lucas Pereira
Affiliations: ITI/LARSyS; Técnico Lisboa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 2nd HEAL Workshop at CHI Conference on Human Factors in Computing Systems. April 26, 2025. Yokohama, Japan

Abstract:The evaluation of open-ended responses in serious games presents a unique challenge, as correctness is often subjective. Large Language Models (LLMs) are increasingly being explored as evaluators in such contexts, yet their accuracy and consistency remain uncertain, particularly for smaller models intended for local execution. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. By leveraging traditional binary classification metrics (including accuracy, true positive rate, and true negative rate), we systematically compare these models across different evaluation scenarios. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance. We demonstrate that while some models excel at identifying correct responses, others struggle with false positives or inconsistent evaluations. The findings highlight the need for context-aware evaluation frameworks and careful model selection when deploying LLMs as evaluators. This work contributes to the broader discourse on the trustworthiness of AI-driven assessment tools, offering insights into how different LLM architectures handle subjective evaluation tasks.
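
The three metrics named in the abstract have standard definitions: accuracy is the share of all judgments that match the reference label, the true positive rate (sensitivity) is the share of correct responses the evaluator accepts, and the true negative rate (specificity) is the share of incorrect responses it rejects. A minimal sketch, with hypothetical labels standing in for the human reference and the LLM evaluator's verdicts:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, true positive rate, and true negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical data: 1 = response is correct, 0 = response is incorrect.
human_labels = [1, 1, 1, 0, 0]
llm_verdicts = [1, 1, 0, 0, 1]
metrics = binary_metrics(human_labels, llm_verdicts)
```

An evaluator that accepts almost everything scores a high TPR and a low TNR, which is why the two rates are reported alongside accuracy.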

[NLP-87] Can the capability of Large Language Models be described by human ability? A Meta Study

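Quick read: This paper asks to what extent the capabilities of large language models (LLMs) approximate human abilities, and how the two relate. The key to the approach is collecting performance data from more than 80 models across 37 evaluation benchmarks, categorizing the benchmarks into 6 primary human abilities and 11 sub-abilities, clustering the models' performance rankings, and comparing the clusters with the classification based on human abilities. The analysis reveals how LLM capabilities vary with parameter scale and shows that some abilities that are interrelated in humans appear nearly uncorrelated in LLMs.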

Link: https://arxiv.org/abs/2504.12332
Authors: Mingrui Zan, Yunquan Zhang, Boyang Zhang, Fangming Liu, Daning Cheng
Affiliations: University of Chinese Academy of Sciences; Institute of Computing Technology (计算技术研究所); PengCheng Lab (鹏城实验室)
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Users of Large Language Models (LLMs) often perceive these models as intelligent entities with human-like capabilities. However, the extent to which LLMs’ capabilities truly approximate human abilities remains a topic of debate. In this paper, to characterize the capabilities of LLMs in relation to human capabilities, we collected performance data from over 80 models across 37 evaluation benchmarks. The evaluation benchmarks are categorized into 6 primary abilities and 11 sub-abilities from the perspective of human ability. We then clustered the performance rankings into several categories and compared these clustering results with classifications based on human ability aspects. Our findings lead to the following conclusions: 1. We have confirmed that certain capabilities of LLMs with fewer than 10 billion parameters can indeed be described using human ability metrics; 2. While some abilities are considered interrelated in humans, they appear nearly uncorrelated in LLMs; 3. The capabilities possessed by LLMs vary significantly with the parameter scale of the model.

[NLP-88] Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation

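Quick read: This paper addresses span-level emotion-cause-category triplet extraction, focusing on the difficulties existing methods have with redundant information retrieval and with determining emotion categories that are expressed implicitly or ambiguously. The key to the solution is a fine-grained framework based on instruction tuning and data augmentation. It fine-tunes large language models with task-specific triplet extraction instructions via low-rank adaptation, and uses a prompt-based data augmentation strategy to alleviate data scarcity. Experiments show that the method significantly outperforms existing baselines on this task, with an improvement of at least 12.8%, demonstrating its effectiveness and robustness.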

Link: https://arxiv.org/abs/2504.12331
Authors: Xiangju Li, Dong Yang, Xiaogang Zhu, Faliang Huang, Peng Zhang, Zhongying Zhao
Affiliations: Shandong University of Science and Technology (山东科技大学); Nanchang Normal University (南昌师范大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method’s effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at this https URL.

[NLP-89] HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

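Quick read: This paper addresses the fundamental limitations of conventional single-agent Retrieval-Augmented Generation (RAG) on complex queries, especially those that require coordinated reasoning across heterogeneous data ecosystems. It proposes HM-RAG (Hierarchical Multi-agent Multimodal RAG), a framework that introduces collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The key to the solution is the specialized agents in its three-tiered architecture: a Decomposition Agent splits complex queries into contextually coherent sub-tasks using semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents perform parallel, modality-specific retrieval through plug-and-play modules for vector, graph, and web-based databases; and a Decision Agent integrates multi-source answers by consistency voting and resolves discrepancies in retrieval results through Expert Model Refinement. By combining textual, graph-relational, and web-derived evidence, HM-RAG improves answer accuracy by 12.95% and question classification accuracy by 3.56% over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks, and sets state-of-the-art results in zero-shot settings on both datasets. Its modular design allows new data modalities to be integrated while maintaining strict data governance.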

Link: https://arxiv.org/abs/2504.12330
Authors: Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
Affiliations: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at this https URL.
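
The Decision Agent step (consistency voting over multi-source answers, with a refinement fallback when sources disagree) can be illustrated with a small sketch. This is not the paper's implementation: the agreement threshold, the string normalization, and the `refine` callback standing in for Expert Model Refinement are all assumptions made for illustration.

```python
from collections import Counter


def consistency_vote(answers, min_agreement=2, refine=None):
    """Pick the answer most sources agree on, else defer to a refiner.

    answers: dict mapping a source name (e.g. "vector") to its answer string.
    refine:  optional callable used when no answer reaches min_agreement;
             a hypothetical stand-in for an expert-model refinement step.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers.values()]
    top, count = Counter(normalized).most_common(1)[0]
    if count >= min_agreement:
        return top
    if refine is not None:
        return refine(answers)
    return None


# Two of three hypothetical retrieval sources agree after normalization.
agreed = consistency_vote({"vector": "B", "graph": "b ", "web": "C"})

# No agreement: fall back to the refiner (here it simply trusts "graph").
fallback = consistency_vote(
    {"vector": "A", "graph": "B", "web": "C"},
    refine=lambda ans: ans["graph"],
)
```

Exact string matching is the simplest possible notion of agreement; free-form answers would need a semantic comparison instead.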

[NLP-90] Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

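Quick read: This paper addresses two problems with using post-training to enhance reasoning performance: costly training pipelines and inefficient, overly long outputs. It proposes Speculative Thinking, a training-free framework in which a large model guides a smaller one through reflection steps during inference, collaborating at the reasoning level. The method builds on two observations: first, reasoning-supportive tokens such as “wait” often appear after structural delimiters such as "\n\n" and signal reflection or continuation; second, larger models control reflective behavior better, reducing unnecessary backtracking and improving reasoning quality. By strategically delegating reflective steps to the more capable model, the framework significantly improves the accuracy of reasoning models, shortens the average output length of the 1.5B model on MATH500 by 15.7%, and also improves the performance of non-reasoning models.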

Link: https://arxiv.org/abs/2504.12329
Authors: Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han
Affiliations: Case Western Reserve University (凯斯西储大学); Carnegie Mellon University (卡内基梅隆大学)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as “wait” frequently appear after structural delimiters like “\n\n”, serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model’s accuracy on MATH500 increases from 83.2% to 89.4%, marking a substantial improvement of 6.2%. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, representing a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, achieving a relative improvement of 7.8%.
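
The abstract mixes percentage-point differences and relative changes, so a quick arithmetic check helps when comparing the figures. The sketch below only recomputes the numbers quoted in the abstract; the helper names are hypothetical.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute difference, in percentage points for accuracy values."""
    return after - before


def relative_change_pct(before: float, after: float) -> float:
    """Change relative to the starting value, as a percentage."""
    return (after - before) / before * 100.0


# 1.5B model on MATH500 with the 32B model's assistance.
acc_gain_small = round(point_gain(83.2, 89.4), 1)            # 6.2 points
len_change = round(relative_change_pct(5439, 4583), 1)       # -15.7 percent

# Qwen-2.5-7B-Instruct on the same benchmark.
acc_gain_qwen = round(point_gain(74.0, 81.8), 1)             # 7.8 points
acc_rel_qwen = round(relative_change_pct(74.0, 81.8), 1)     # 10.5 percent
```

On these figures, the 6.2 and 7.8 values are differences in percentage points, while the 15.7% length reduction is a relative change; measured relative to the 74.0% baseline, the Qwen gain works out to about 10.5%.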

[NLP-91] A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future

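Quick read: This paper systematically surveys research on Reward Models (RMs) for improving Large Language Models (LLMs), covering preference collection, reward modeling, and usage. It analyzes the existing challenges and potential research directions in depth, and summarizes evaluation benchmarks to guide practical applications of RMs. The central idea is that an RM serves as a proxy for human preferences and provides signals that guide LLM behavior; the survey combines this analysis with practical guidance to support the use of RMs in a wider range of tasks.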

Link: https://arxiv.org/abs/2504.12328
Authors: Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou
Affiliations: Peking University (北京大学); Fudan University (复旦大学); Huazhong University of Science and Technology (华中科技大学); Ant Group (蚂蚁集团)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reward Model (RM) has demonstrated impressive potential for enhancing Large Language Models (LLM), as RM can serve as a proxy for human preferences, providing signals to guide LLMs’ behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub at this https URL.

[NLP-92] Word Embeddings Track Social Group Changes Across 70 Years in China

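Quick read: This paper asks how language reflects changes in the official representation of social groups during revolutionary social transformations, particularly in a non-Western context. The key to the approach is applying diachronic word embeddings at multiple temporal resolutions in a large-scale computational analysis of Chinese state-controlled media from 1950 to 2019. The analysis reveals how representations of social groups, including economic status, ethnicity, and gender, evolve and how those changes track historical transformations. The work deepens our understanding of how officially sanctioned discourse encodes social structure through language and highlights the importance of non-Western perspectives in computational social science.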

Link: https://arxiv.org/abs/2504.12327
Authors: Yuxi Ma, Yongqian Peng, Yixin Zhu
Affiliations: Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院); Yuanpei College, Peking University (北京大学元培学院)
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:

Abstract:Language encodes societal beliefs about social groups through word patterns. While computational methods like word embeddings enable quantitative analysis of these patterns, studies have primarily examined gradual shifts in Western contexts. We present the first large-scale computational analysis of Chinese state-controlled media (1950-2019) to examine how revolutionary social transformations are reflected in official linguistic representations of social groups. Using diachronic word embeddings at multiple temporal resolutions, we find that Chinese representations differ significantly from Western counterparts, particularly regarding economic status, ethnicity, and gender. These representations show distinct evolutionary dynamics: while stereotypes of ethnicity, age, and body type remain remarkably stable across political upheavals, representations of gender and economic classes undergo dramatic shifts tracking historical transformations. This work advances our understanding of how officially sanctioned discourse encodes social structure through language while highlighting the importance of non-Western perspectives in computational social science.
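
One common way to use diachronic word embeddings is to train a separate embedding space per time slice and track how strongly a group word associates with an attribute word in each slice. The sketch below assumes that setup; the periods, words, and two-dimensional vectors are toy placeholders, and the abstract does not state that the paper measures association this way.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def association_over_time(embeddings_by_period, group_word, attribute_word):
    """Similarity between two words, computed within each period's own space."""
    return {
        period: cosine(vectors[group_word], vectors[attribute_word])
        for period, vectors in sorted(embeddings_by_period.items())
    }


# Toy, hand-made vectors (hypothetical; real embeddings have hundreds of dims).
toy = {
    "1950s": {"worker": [1.0, 0.0], "respected": [1.0, 0.0]},
    "2010s": {"worker": [1.0, 0.0], "respected": [0.0, 1.0]},
}
trend = association_over_time(toy, "worker", "respected")
```

Because each similarity is computed inside a single period's space, the scores can be compared across periods without first aligning the spaces to one another.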

[NLP-93] Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

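Quick read: Clinical case reports and discharge summaries are complete and accurate but are finalized only after the patient encounter, whereas complementary structured data streams are available sooner but are incomplete. To train models and algorithms on data that is more complete and temporally fine-grained, the paper builds a pipeline that uses large language models (LLMs) to phenotype, extract, and annotate time-localized findings in case reports. Applying the pipeline with LLMs such as O1-preview and Llama 3.3 70B Instruct yields an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports. Validation shows high recovery of clinical findings (event match rate) and strong temporal ordering (concordance), which characterizes the ability of LLMs to time-localize clinical findings and points to avenues of improvement, notably multimodal integration.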

Link: https://arxiv.org/abs/2504.12326
Authors: Shahriar Noroozizadeh, Jeremy C. Weiss
Affiliations: Machine Learning Department (机器学习系); School of Computer Science (计算机科学学院); Heinz College of Information Systems and Public Policy (海因茨信息系统与公共政策学院); Carnegie Mellon University (卡内基梅隆大学); National Library of Medicine (国家医学图书馆); National Institutes of Health (国立卫生研究院)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the Pubmed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from I2B2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: O1-preview–0.755, Llama 3.3 70B Instruct–0.753) and strong temporal ordering (concordance: O1-preview–0.932, Llama 3.3 70B Instruct–0.932). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
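
The two validation metrics can be illustrated with a toy timeline. The definitions below are assumptions for illustration, since the abstract does not give formulas: the event match rate is taken as the share of reference events that the extraction recovers (by exact name), and concordance as the share of matched event pairs whose predicted time order agrees with the reference order. Event names and times (in hours) are hypothetical.

```python
from itertools import combinations


def event_match_rate(reference, predicted) -> float:
    """Fraction of reference events that also appear in the prediction."""
    if not reference:
        return 0.0
    return sum(1 for event in reference if event in predicted) / len(reference)


def concordance(reference, predicted) -> float:
    """Fraction of matched event pairs ordered the same way in both timelines.

    Pairs that are tied in the reference are skipped; a pair tied only in the
    prediction counts as a disagreement.
    """
    shared = [event for event in reference if event in predicted]
    agree = total = 0
    for a, b in combinations(shared, 2):
        ref_diff = reference[a] - reference[b]
        if ref_diff == 0:
            continue
        total += 1
        if ref_diff * (predicted[a] - predicted[b]) > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical timelines: event name -> time in hours relative to admission.
expert = {"fever": -24, "intubation": 0, "antibiotics": 2, "discharge": 240}
extracted = {"fever": -12, "intubation": 0, "antibiotics": -1, "rash": 5}

match_rate = event_match_rate(expert, extracted)
order_agreement = concordance(expert, extracted)
```

Here three of four expert events are recovered, and two of the three matched pairs keep their order, because the extraction places antibiotics before intubation.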

[NLP-94] LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media

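Quick read: This paper addresses the growing difficulty of analyzing and understanding online discourse as content on social media platforms expands. The key to the solution is LLMTaxo, a framework that uses large language models (LLMs) to automatically construct a taxonomy of factual claims from social media by generating topics at multiple levels of granularity, helping stakeholders navigate the social media landscape more effectively. The framework is implemented with different models on three distinct datasets and assessed with specially designed taxonomy evaluation metrics. The results indicate that LLMTaxo categorizes factual claims effectively and that certain models perform better on specific datasets.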

Link: https://arxiv.org/abs/2504.12325
Authors: Haiqi Zhang, Zhengyuan Zhu, Zeyu Zhang, Chengkai Li
Affiliations: University of Texas at Arlington (德克萨斯大学阿灵顿分校)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:With the vast expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of taxonomy of factual claims from social media by generating topics from multi-level granularities. This approach aids stakeholders in more effectively navigating the social media landscapes. We implement this framework with different models across three distinct datasets and introduce specially designed taxonomy evaluation metrics for a comprehensive assessment. With the evaluations from both human evaluators and GPT-4, the results indicate that LLMTaxo effectively categorizes factual claims from social media, and reveals that certain models perform better on specific datasets.

[NLP-95] Cross-Document Cross-Lingual Natural Language Inference via RST-enhanced Graph Fusion and Interpretability Prediction

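Quick read: This paper addresses Cross-Document Cross-Lingual Natural Language Inference (CDCL-NLI), a largely unexplored problem. Existing NLI work focuses mainly on single-document or single-language settings, and the paper extends traditional NLI capabilities to multi-document, multilingual scenarios. Its key contributions are a high-quality CDCL-NLI dataset of 1,110 instances spanning 26 languages, and a method that integrates RST-enhanced graph fusion with interpretability prediction. The method applies Rhetorical Structure Theory (RST) on a Relation-aware Graph Attention Network (RGAT) for cross-document context modeling, and uses a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. For interpretability, an EDU-level attribution framework generates extractive explanations. Extensive experiments show significant improvements over traditional NLI models such as DocNLI and R2F and over large language models such as Llama3 and GPT-4o.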

Link: https://arxiv.org/abs/2504.12324
Authors: Mengying Yuan, Wangzi Xuan, Fei Li
Affiliations: Wuhan University (武汉大学); School of Cyber Science and Engineering (网络空间安全学院); Hubei (湖北); China (中国)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural Language Inference (NLI) is a fundamental task in both natural language processing and information retrieval. While NLI has developed many sub-directions such as sentence-level NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI (CDCL-NLI) remains largely unexplored. In this paper, we propose a novel paradigm for CDCL-NLI that extends traditional NLI capabilities to multi-document, multilingual scenarios. To support this task, we construct a high-quality CDCL-NLI dataset including 1,110 instances and spanning 26 languages. To build a baseline for this task, we also propose an innovative method that integrates RST-enhanced graph fusion and interpretability prediction. Our method employs RST (Rhetorical Structure Theory) on RGAT (Relation-aware Graph Attention Network) for cross-document context modeling, coupled with a structure-aware semantic alignment mechanism based on lexical chains for cross-lingual understanding. For NLI interpretability, we develop an EDU-level attribution framework that generates extractive explanations. Extensive experiments demonstrate our approach’s superior performance, achieving significant improvements over both traditional NLI models such as DocNLI and R2F, as well as LLMs like Llama3 and GPT-4o. Our work sheds light on the study of NLI and will bring research interest on cross-document cross-lingual context understanding, semantic retrieval and interpretability inference. Our dataset and code are available at this https URL (CDCL-NLI-Link) for peer review.

[NLP-96] The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation

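Quick read: This paper investigates how the Retrieval-Augmented Generation (RAG) framework affects the fairness of Large Language Models (LLMs), and proposes ways to mitigate the unfairness that RAG can exacerbate in small-scale LLMs. The study finds that when the model scale is smaller than 8B, integrating RAG often worsens unfairness. To address this, the paper proposes two methods, FairFT and FairFilter. FairFT aligns the retriever with the LLM in terms of fairness so that it retrieves documents that promote fairer outputs; FairFilter adds a fairness filtering mechanism that removes biased content after retrieval. Validation on real-world datasets shows that both methods improve fairness while maintaining performance.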

Link: https://arxiv.org/abs/2504.12323
Authors: Zheng Zhang, Ning Li, Qi Liu, Rui Li, Weibo Gao, Qingyang Mao, Zhenya Huang, Baosheng Yu, Dacheng Tao
Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, Anhui 230027, China; Nanyang Technological University, Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. By referencing this external knowledge, RAG effectively reduces the generation of factually incorrect content and addresses hallucination issues within LLMs. Recently, there has been growing attention to improving the performance and efficiency of RAG systems from various perspectives. While these advancements have yielded significant results, the application of RAG in domains with considerable societal implications raises a critical question about fairness: What impact does the introduction of the RAG paradigm have on the fairness of LLMs? To address this question, we conduct extensive experiments by varying the LLMs, retrievers, and retrieval sources. Our experimental analysis reveals that the scale of the LLMs plays a significant role in influencing fairness outcomes within the RAG framework. When the model scale is smaller than 8B, the integration of retrieval mechanisms often exacerbates unfairness in small-scale LLMs (e.g., LLaMA3.2-1B, Mistral-7B, and LLaMA3-8B). To mitigate the fairness issues introduced by RAG for small-scale LLMs, we propose two approaches, FairFT and FairFilter. Specifically, in FairFT, we align the retriever with the LLM in terms of fairness, enabling it to retrieve documents that facilitate fairer model outputs. In FairFilter, we propose a fairness filtering mechanism to filter out biased content after retrieval. Finally, we validate our proposed approaches on real-world datasets, demonstrating their effectiveness in improving fairness while maintaining performance.
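
The post-retrieval filtering idea behind FairFilter can be pictured with a very small sketch. This is not the authors' implementation: the function `filter_retrieved`, the `threshold` parameter, and the toy scorer `toy_bias_score` are hypothetical names introduced here, and a real system would presumably use a learned bias scorer rather than a string check.

```python
from typing import Callable, List


def filter_retrieved(
    docs: List[str],
    bias_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only retrieved documents whose bias score is below the threshold."""
    return [doc for doc in docs if bias_score(doc) < threshold]


def toy_bias_score(doc: str) -> float:
    """Hypothetical stand-in scorer: flags documents carrying a '[biased]' marker."""
    return 1.0 if "[biased]" in doc else 0.0


docs = ["neutral passage", "[biased] passage", "another neutral passage"]
kept = filter_retrieved(docs, toy_bias_score)
print(kept)
```

Placing the filter between retriever and generator leaves both components unchanged, which is one plausible reason to prefer filtering when retraining the retriever is not an option.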

[NLP-97] A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis

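Quick read: This paper tackles the heavy reliance on Large Language Models (LLMs) when enhancing small language models through data synthesis and distillation. Although large LLMs are strong at data synthesis, their high computational cost, environmental inefficiency, and potential biases limit practical use. Small language models are more accessible and sustainable, but individually they usually struggle to generate high-quality, diverse, and reliable data. The key solution is GRA, a collaborative framework of multiple small language models that simulates a peer-review process by splitting data synthesis into three specialized roles (Generator, Reviewer, and Adjudicator), which enables iterative refinement and quality control and reaches data quality comparable to, or better than, distillation from a single large LLM. Experiments across multiple benchmarks show that GRA-produced data matches or exceeds the quality of outputs from a single large LLM (e.g., Qwen-2.5-72B-Instruct), challenging the necessity of monolithic large models for high-quality data synthesis and advocating strategic coordination of small models instead.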

Link: https://arxiv.org/abs/2504.12322
Authors: Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Conghui He, Lijun Wu
Affiliations: Shanghai AI Laboratory; Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none

Abstract:While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LLMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at this https URL.
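
The Generator, Reviewer, and Adjudicator roles described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the GRA code: the function `synthesize_sample`, the boolean accept/reject votes, the `max_rounds` limit, and the rule of feeding a rejected draft back to the generator are all hypothetical choices made for this sketch, and the three callables stand in for calls to small LLMs.

```python
from typing import Callable, List, Optional


def synthesize_sample(
    generate: Callable[[str], str],
    reviewers: List[Callable[[str], bool]],
    adjudicate: Callable[[str, List[bool]], bool],
    seed: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return an accepted sample, or None if every round ends in rejection."""
    for _ in range(max_rounds):
        candidate = generate(seed)                       # Generator proposes a sample
        votes = [review(candidate) for review in reviewers]  # Reviewers critique it
        if all(votes):
            return candidate                             # unanimous approval
        if any(votes) and adjudicate(candidate, votes):
            return candidate                             # Adjudicator resolves a split vote
        seed = candidate                                 # rejected draft is refined next round
    return None


# Toy stand-ins for the three roles (hypothetical, for illustration only).
result = synthesize_sample(
    generate=lambda s: s + "!",
    reviewers=[lambda c: len(c) > 3, lambda c: c.endswith("!!")],
    adjudicate=lambda c, votes: False,
    seed="ab",
)
print(result)
```

Keeping the roles as plain callables makes it easy to swap in different small models per role, which matches the spirit of role specialization without committing to any particular model API.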

[NLP-98] AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks

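Quick read: This paper addresses the lack of effective, explainable defenses against jailbreaks, in which malicious input causes a language model to deviate from its intended behavior. Current defenses mainly classify the input prompt as adversarial or prevent the model from generating harmful outputs, but they struggle to explain why a jailbreak is malicious, which has led to a wide variety of closed-box approaches. The key solution is AttentionDefense, a novel, explainable, and cheaper defense that uses system-prompt attention from Small Language Models (SLMs) to characterize adversarial prompts. The work suggests that the attention mechanism is an integral component in understanding how language models respond to malicious input that is not captured by the semantic meaning of text embeddings. Experiments show that SLM-based AttentionDefense matches or exceeds text embedding-based classifiers at jailbreak detection, and that it remains robust on a dataset of newly generated jailbreak variants where existing approaches degrade. AttentionDefense has the computational requirements of a small language model yet reaches the performance of an LLM detector, which makes it suitable for practical use.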

Link: https://arxiv.org/abs/2504.12321
Authors: Charlotte Siska, Anush Sankaran
Affiliations: Security AI Research; Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:In the past few years, Language Models (LMs) have shown par-human capabilities in several domains. Despite their practical applications and exceeding user consumption, they are susceptible to jailbreaks when malicious input exploits the LM’s weaknesses, causing it to deviate from its intended behavior. Current defensive strategies either classify the input prompt as adversarial or prevent LMs from generating harmful outputs. However, it is challenging to explain the reason behind the malicious nature of the jailbreak, which results in a wide variety of closed-box approaches. In this research, we propose and demonstrate that system-prompt attention from Small Language Models (SLMs) can be used to characterize adversarial prompts, providing a novel, explainable, and cheaper defense approach called AttentionDefense. Our research suggests that the attention mechanism is an integral component in understanding and explaining how LMs respond to malicious input that is not captured in the semantic meaning of text embeddings. The proposed AttentionDefense is evaluated against existing jailbreak benchmark datasets. Ablation studies show that SLM-based AttentionDefense has equivalent or better jailbreak detection performance compared to text embedding-based classifiers and GPT-4 zero-shot detectors. To further validate the efficacy of the proposed approach, we generate a dataset of novel jailbreak variants of the existing benchmark dataset using a closed-loop LLM-based multi-agent system. We demonstrate that the proposed AttentionDefense approach performs robustly on this novel jailbreak dataset while existing approaches suffer in performance. Additionally, for practical purposes AttentionDefense is an ideal solution as it has the computation requirements of a small LM but the performance of an LLM detector.

[NLP-99] Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

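Quick read: This paper asks whether the performance of Large Language Models (LLMs) on creative tasks has improved over time, and how consistent their creative output is. It systematically evaluates 14 widely used LLMs, including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek, on two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). The study finds no evidence of increased creative performance over the past 18-24 months. On the AUT, all models performed better on average than the average human, yet only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks, and the same model given the same prompt showed substantial output variability. These results point to the need for more nuanced evaluation frameworks and underline the importance of model selection, prompt design, and repeated assessment.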

Link: https://arxiv.org/abs/2504.12320
Authors: Jennifer Haase, Paul H. P. Hanel, Sebastian Pokutta
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 19 pages + Appendix, 13 figures

Abstract:Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs – including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek – across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.

[NLP-100] Specialized text classification: an approach to classifying Open Banking transactions

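Quick read: This paper addresses the under-explored area of custom NLP applications built on domain-specific text corpora for enriching bank transaction descriptions, in particular the challenges of training a language model on French banking data. It builds a language-based Open Banking transaction classification system intended to improve understanding of customer behavior and, on that basis, to prevent fraud, reduce risk, and offer more tailored services. The key to the solution is combining language-specific techniques with domain knowledge: the system covers data collection, labeling, preprocessing, modeling, and evaluation for French banking data, and is reported to deliver better performance and efficiency than generic approaches.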

Link: https://arxiv.org/abs/2504.12319
Authors: Duc Tuyen TA, Wajdi Ben Saad, Ji Young Oh
Affiliations: Data Science Team, Oney Bank, France
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP)
Comments: none

Abstract:With the introduction of the PSD2 regulation in the EU which established the Open Banking framework, a new window of opportunities has opened for banks and fintechs to explore and enrich bank transaction descriptions with the aim of building a better understanding of customer behavior, while using this understanding to prevent fraud, reduce risks and offer more competitive and tailored services. And although the usage of natural language processing models and techniques has seen incredible progress in various applications and domains over the past few years, custom applications based on domain-specific text corpus remain unaddressed especially in the banking sector. In this paper, we introduce a language-based Open Banking transaction classification system with a focus on the French market and French language text. The system encompasses data collection, labeling, preprocessing, modeling, and evaluation stages. Unlike previous studies that focus on general classification approaches, this system is specifically tailored to address the challenges posed by training a language model with a specialized text corpus (Banking data in the French context). By incorporating language-specific techniques and domain knowledge, the proposed system demonstrates enhanced performance and efficiency compared to generic approaches.
Journal reference: 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT). Related DOI: https://doi.org/10.1109/CSIT61576.2023.10324203

[NLP-101] ChatGPT as Linguistic Equalizer? Quantifying LLM-Driven Lexical Shifts in Academic Writing

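Quick read: This paper examines whether ChatGPT mitigates the linguistic barriers that non-native English speakers (NNES) face in academic writing and thereby fosters equity in global academia. It quantifies shifts in lexical complexity across the abstracts of 2.8 million articles from OpenAlex (2020-2024) and uses a difference-in-differences (DID) design to identify causal effects. The key is to measure lexical complexity with the Measure of Textual Lexical Diversity (MTLD) while controlling for article-level factors, authorship patterns, and venue norms. The results show that ChatGPT significantly increases lexical complexity in NNES-authored abstracts, most markedly in preprints, technology- and biology-related fields, and lower-tier journals. The authors present this as causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.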

Link: https://arxiv.org/abs/2504.12317
Authors: Dingkang Lin, Naixuan Zhao, Dan Tian, Jiang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 2 figures

Abstract:The advent of ChatGPT has profoundly reshaped scientific research practices, particularly in academic writing, where non-native English-speakers (NNES) historically face linguistic barriers. This study investigates whether ChatGPT mitigates these barriers and fosters equity by analyzing lexical complexity shifts across 2.8 million articles from OpenAlex (2020-2024). Using the Measure of Textual Lexical Diversity (MTLD) to quantify vocabulary sophistication and a difference-in-differences (DID) design to identify causal effects, we demonstrate that ChatGPT significantly enhances lexical complexity in NNES-authored abstracts, even after controlling for article-level controls, authorship patterns, and venue norms. Notably, the impact is most pronounced in preprint papers, technology- and biology-related fields and lower-tier journals. These findings provide causal evidence that ChatGPT reduces linguistic disparities and promotes equity in global academia.
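
The difference-in-differences idea mentioned above can be shown in its simplest two-group, two-period form. The sketch below is only an illustration: the function `did_estimate` is a hypothetical helper, the MTLD-like scores are made-up numbers rather than data from the paper, and the paper's actual design adds controls (article-level factors, authorship patterns, venue norms) that a plain difference of means does not capture.

```python
from statistics import mean
from typing import Sequence


def did_estimate(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up lexical diversity scores, for illustration only.
effect = did_estimate(
    treated_pre=[70, 72],    # e.g. NNES abstracts before ChatGPT
    treated_post=[78, 80],   # NNES abstracts after
    control_pre=[80, 82],    # comparison group before
    control_post=[82, 84],   # comparison group after
)
print(effect)
```

Subtracting the control group's change removes any trend shared by both groups, so the remaining difference is attributed to the treatment, provided the two groups would otherwise have moved in parallel.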

[NLP-102] Data Metabolism: An Efficient Data Design Schema For Vision Language Model ICLR2025

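Quick read: This paper addresses data-related challenges in training Visual Language Models (VLMs). It proposes a data-centric framework built on the concept of Data Metabolism: starting from a standard model architecture, data curation and iteration form a closed loop that continuously improves model performance. The key is to systematize the data processing pipeline and build a user-specific data flywheel, which makes it possible to train compact yet strong VLMs. As a demonstration, the released Capybara-VL model performs well on typical multimodal tasks; despite its relatively compact size, it surpasses several open-source models up to 10 times larger and is on par with several leading proprietary models. The results suggest that the framework improves data efficiency and supports the development of smaller, more efficient VLMs.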

Link: https://arxiv.org/abs/2504.12316
Authors: Jingyuan Zhang, Hongzhi Zhang, Zhou Haonan, Chenxi Sun, Xingguang Ji, Jiakang Wang, Fanheng Kong, Yahui Liu, Qi Wang, Fuzheng Zhang
Affiliations: Kuaishou Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: To be presented at ICLR 2025, First Workshop on Open Science for Foundation Models

Abstract:Data curation plays a crucial role in training powerful Visual Language Models (VLMs). In this work, we introduce the concept of Data Metabolism and present our data-centric framework to build VLMs throughout the development lifecycle. Starting from a standard model architecture, we discuss and provide insights into two crucial development steps: data curation and iteration, forming a closed-loop system that continuously improves model performance. We show a detailed codebook on how to process existing massive datasets and build a user-specific data flywheel. As a demonstration, we release a VLM, named Capybara-VL, which excels in typical multimodal tasks (e.g., visual question answering, scientific reasoning, and text-rich tasks). Despite its relatively compact size, Capybara-VL surpasses several open-source models that are up to 10 times larger in size. Moreover, it achieves results that are on par with those of several leading proprietary models, demonstrating its remarkable competitiveness. These results highlight the power of our data-centric framework and the potential of training smaller and more efficient VLMs.

[NLP-103] Capybara-OMNI: An Efficient Paradigm for Building Omni-Modal Language Models

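Quick read: This paper addresses the high computational cost and long build time of Multimodal Large Language Models (MLLMs). Its key solution is Capybara-OMNI, an MLLM trained in a lightweight and efficient manner that supports understanding of text, image, video, and audio. The paper details the framework design, data construction, and training recipe for building a competitive model step by step, and provides dedicated benchmarks for verifying understanding across modalities. To improve multimodal instruction following and conversational ability, it also discusses how to train a chat version on top of the understanding model so that it better matches user habits, for example in real-time interaction with humans. The Capybara-OMNI model and its chat version are released publicly on GitHub, including model weights, part of the training data, and inference code.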

Link: https://arxiv.org/abs/2504.12315
Authors: Xingguang Ji, Jiakang Wang, Hongzhi Zhang, Jingyuan Zhang, Haonan Zhou, Chenxi Sun, Yahui Liu, Qi Wang, Fuzheng Zhang
Affiliations: Kuaishou Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:With the development of Multimodal Large Language Models (MLLMs), numerous outstanding accomplishments have emerged within the open-source community. Due to the complexity of creating and training multimodal data pairs, it is still a computationally expensive and time-consuming process to build powerful MLLMs. In this work, we introduce Capybara-OMNI, an MLLM that trains in a lightweight and efficient manner and supports understanding text, image, video, and audio modalities. We present in detail the framework design, the data construction, and the training recipe, to develop an MLLM step-by-step to obtain competitive performance. We also provide exclusive benchmarks utilized in our experiments to show how to properly verify understanding capabilities across different modalities. Results show that by following our guidance, we can efficiently build an MLLM that achieves competitive performance among models of the same scale on various multimodal benchmarks. Additionally, to enhance the multimodal instruction following and conversational capabilities of the model, we further discuss how to train the chat version upon an MLLM understanding model, which is more in line with user habits for tasks like real-time interaction with humans. We publicly disclose the Capybara-OMNI model, along with its chat-based version. The disclosure includes the model weights, a portion of the training data, and the inference code, which are made available on GitHub.

[NLP-104] How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

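Quick read: This paper targets hallucination in Large Language Models (LLMs) used in scientific domains, especially molecular understanding and analysis, where it leads to errors in drug design and utilization. It first analyzes the sources of hallucination in molecular comprehension tasks, in particular the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination efficiently, it introduces Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. It also proposes a Hallucination Reduction Post-processing (HRPP) stage to alleviate molecular hallucination; experiments show that HRPP is effective on both decoder-only and encoder-decoder molecular LLMs. The findings offer insight into mitigating hallucination and improving the reliability of LLMs in scientific applications.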

Link: https://arxiv.org/abs/2504.12314
Authors: Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan
Affiliations: Shenzhen Graduate School, Peking University; International Digital Economy Academy (IDEA); Pengcheng Laboratory, Shenzhen, China; Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages

Abstract:Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing stage (HRPP) is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.

[NLP-105] Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

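Quick read: This paper addresses the question of how personality traits shape the outcomes of Conversational Recommender Systems (CRSs). Large Language Models (LLMs) make CRS interactions more natural and dynamic, but the specific effect of personality traits on user behavior and recommendation outcomes is still poorly understood. The paper proposes PerCRS, an LLM-based personality-aware user simulation for CRSs. Its core is a user agent with customizable personality traits and preferences paired with a system agent that has persuasion capability, which together simulate realistic CRS interactions; a multi-aspect evaluation is used to ensure robustness. Experiments show that state-of-the-art LLMs can generate diverse user responses aligned with specified personality traits, prompting the CRS to adjust its recommendation strategies dynamically and providing empirical evidence on how personality traits affect conversational recommendation outcomes.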

Link: https://arxiv.org/abs/2504.12313
Authors: Xiaoyan Zhao, Yang Deng, Wenjie Wang, Hongzhan Lin, Hong Cheng, Rui Zhang, See-Kiong Ng, Tat-Seng Chua
Affiliations: Chinese University of Hong Kong; National University of Singapore; Singapore Management University; Hong Kong Baptist University; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: none

Abstract:Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent induces customizable personality traits and preferences, while the system agent possesses the persuasion capability to simulate realistic interaction in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.

[NLP-106] Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

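Quick read: This paper addresses two problems: existing datasets for evaluating the logical reasoning of Large Language Models (LLMs) are overly simplistic, unnatural, or contextually constrained, and manual data collection and labeling suffer from fallacy-type imbalance and labor-intensive annotation. It offers two solutions. The first is SmartyPat-Bench, a new benchmark derived from real-world, high-quality Reddit posts containing subtle logical fallacies, with more detailed annotations. The second is SmartyPat, an automated framework that uses logic programming-based oracles (Prolog rules) to systematically generate logically fallacious statements, which LLMs then refine into fluent natural-language sentences while preserving a precise fallacy representation. Experiments show that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. The study also finds that excessive reasoning steps hinder fallacy detection accuracy, whereas structured reasoning improves fallacy categorization.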

Link: https://arxiv.org/abs/2504.12312
Authors: Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, Yuekang Li
Affiliations: University of New South Wales, Australia; Fudan University, China; Carnegie Mellon University, USA
Categories: Computation and Language (cs.CL)
Comments: none

Abstract:Large Language Models (LLMs) have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.

[NLP-107] Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

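Quick read: This paper addresses the representation collapse that occurs when multiple pre-trained prompts are naively aggregated for a downstream task: mutual interference among the prompts undermines their collective potential. The key solution is HGPrompt, an adaptive framework for multi-source prompt transfer that learns optimal ensemble weights by jointly optimizing two objectives, transferability and stability. Specifically, it introduces an information-theoretic metric to evaluate the transferability of prompt-induced features on the target task, and a novel Gradient Alignment Regularization that mitigates gradient conflicts among prompts, enabling stable and coherent knowledge transfer from multiple sources while suppressing interference. Experiments on the large-scale VTAB benchmark show that HGPrompt achieves state-of-the-art performance.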

Link: https://arxiv.org/abs/2504.12311
Authors: Enming Zhang, Liwen Cao, Yanru Wu, Zijie Zhao, Guan Wang, Yang Li
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Southeast University; Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL)
Comments: none

Abstract:Prompt tuning has emerged as a lightweight adaptation strategy for adapting foundation models to downstream tasks, particularly in resource-constrained systems. As pre-trained prompts have become valuable intellectual assets, combining multiple source prompts offers a promising approach to enhance generalization to new tasks by leveraging complementary knowledge from diverse sources. However, naive aggregation of these prompts often leads to representation collapse due to mutual interference, undermining their collective potential. To address these challenges, we propose HGPrompt, an adaptive framework for multi-source prompt transfer that learns optimal ensemble weights by jointly optimizing dual objectives: transferability and stability. Specifically, we first introduce an information-theoretic metric to evaluate the transferability of prompt-induced features on the target task, capturing the intrinsic alignment between the feature representations. Additionally, we propose a novel Gradient Alignment Regularization to mitigate gradient conflicts among prompts, enabling stable and coherent knowledge transfer from multiple sources while suppressing interference. Extensive experiments on the large-scale VTAB benchmark demonstrate that HGPrompt achieves state-of-the-art performance, validating its effectiveness in multi-source prompt transfer.
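
To make the notion of gradient conflict among prompts concrete, the sketch below scores how strongly a set of gradient vectors point in opposing directions. It is an assumption-laden illustration, not the paper's Gradient Alignment Regularization: the names `cosine` and `conflict_penalty`, and the choice of summing `max(0, -cos)` over all pairs, are hypothetical and chosen only because they are simple to reason about.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def conflict_penalty(grads: Sequence[Sequence[float]]) -> float:
    """Sum of max(0, -cosine) over all gradient pairs (an illustrative penalty)."""
    return sum(max(0.0, -cosine(g, h)) for g, h in combinations(grads, 2))


# Toy per-prompt gradients: the third one opposes the first two.
penalty = conflict_penalty([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
print(penalty)
```

A penalty of this shape is zero when gradients are aligned or orthogonal and grows only when they oppose each other, so adding it to a training loss would discourage ensemble weights that pull the model in contradictory directions.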

[NLP-108] Large Language Model-Based Knowledge Graph System Construction for Sustainable Development Goals: An AI-Based Speculative Design Perspective

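Quick read: This paper addresses the slow progress toward the Sustainable Development Goals (SDGs) and seeks innovative strategies to accelerate it. The key is an AI-powered knowledge graph system that uses official SDG texts, Elsevier's keyword dataset, and TED Talk transcripts, and combines AI-speculative design, large language models, and retrieval-augmented generation to analyze SDG interconnections, discover potential new goals, and visualize them. The approach reveals associations among existing goals (a strong association between Goal 10 and Goal 16, and minimal coverage of Goal 6), surfaces new central nodes through simulated dialogue, which supports divergent thinking and goal clarity, and proposes six potential new goals centered on equity, resilience, and technology-driven inclusion. The speculative-AI framework offers policymakers a fresh perspective and lays groundwork for future multimodal and cross-system SDG applications.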

Link: https://arxiv.org/abs/2504.12309
Authors: Yi-De Lin, Guan-Ze Liao
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: none

Abstract:From 2000 to 2015, the UN’s Millennium Development Goals guided global priorities. The subsequent Sustainable Development Goals (SDGs) adopted a more dynamic approach, with annual indicator updates. As 2030 nears and progress lags, innovative acceleration strategies are critical. This study develops an AI-powered knowledge graph system to analyze SDG interconnections, discover potential new goals, and visualize them online. Using official SDG texts, Elsevier’s keyword dataset, and 1,127 TED Talk transcripts (2020-2023), a pilot on 269 talks from 2023 applies AI-speculative design, large language models, and retrieval-augmented generation. Key findings include: (1) Heatmap analysis reveals strong associations between Goal 10 and Goal 16, and minimal coverage of Goal 6. (2) In the knowledge graph, simulated dialogue over time reveals new central nodes, showing how richer data supports divergent thinking and goal clarity. (3) Six potential new goals are proposed, centered on equity, resilience, and technology-driven inclusion. This speculative-AI framework offers fresh insights for policymakers and lays groundwork for future multimodal and cross-system SDG applications.

[NLP-109] Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability

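Quick read: This paper addresses the limitations of Named Entity Recognition (NER)-based approaches to privacy masking, that is, the anonymization and de-anonymization of personally identifiable information (PII). These limitations include content sensitivity, phrasing variability, and format or syntax variations. The key contribution is a curated dataset of 17,000 unique, semi-synthetic sentences covering 16 types of PII from multiple jurisdictions including India, the U.K., and the U.S., with PII-bearing sentences generated along five NER detection feature dimensions plus an adversarial context. Using this dataset, the paper evaluates the PII masking quality of existing models (such as Piiranha and Starpii), exposes the privacy risk implied by their wide use, and stresses the need for better performance measurement and for contextual disclosure in model cards.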

Link: https://arxiv.org/abs/2504.12308
Authors: Devansh Singh, Sundaraparipurnan Narayanan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR)
Comments: none

Abstract:Privacy Masking is a critical concept under data privacy involving anonymization and de-anonymization of personally identifiable information (PII). Privacy masking techniques rely on Named Entity Recognition (NER) approaches under NLP support in identifying and classifying named entities in each text. NER approaches, however, have several limitations including (a) content sensitivity including ambiguous, polysemic, context dependent or domain specific content, (b) phrasing variabilities including nicknames and aliases, informal expressions, alternative representations, emerging expressions, evolving naming conventions and (c) formats or syntax variations, typos, misspellings. However, there are a couple of PII datasets that have been widely used by researchers and the open-source community to train models on PII detection or masking. These datasets have been used to train models including Piiranha and Starpii, which have been downloaded over 300k and 580k times on HuggingFace. We examine the quality of the PII masking by these models given the limitations of the datasets and of the NER approaches. We curate a dataset of 17K unique, semi-synthetic sentences containing 16 types of PII by compiling information from across multiple jurisdictions including India, U.K. and U.S. We generate sentences (using language models) containing these PII at five different NER detection feature dimensions, namely (1) Basic Entity Recognition, (2) Contextual Entity Disambiguation, (3) NER in Noisy Real-World Data, (4) Evolving Novel Entities Detection and (5) Cross-Lingual or multi-lingual NER, and 1 in adversarial context. We present the results and exhibit the privacy exposure caused by such model use (considering the extent of lifetime downloads of these models). We conclude by highlighting the gaps in measuring performance of the models and the need for contextual disclosure in model cards for such models.

[NLP-110] TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

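Quick read: This paper addresses the challenges of Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) in digitizing multilingual medieval and early modern manuscripts spanning several periods. The key is TRIDIS, an open-source corpus that aggregates multiple openly licensed legacy collections with detailed metadata. The paper also proposes a strategy for out-of-domain test splits driven by outlier detection in a joint embedding space, and reports preliminary baselines with TrOCR and MiniCPM2.5 that compare random and outlier-based test partitions, with the aim of stimulating joint, robust HTR and NER research on medieval and early modern textual heritage.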

链接: https://arxiv.org/abs/2503.22714
作者: Sergio Torres Aguilar
机构: University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注: 6 pages, 3 figures, 2 tables

点击查看摘要

Abstract:This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
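
The outlier-driven test split in item (iii) can be illustrated with a minimal sketch. The abstract does not say which outlier detector or embedding model is used, so the centroid-distance rule, the function name `outlier_test_split`, and the `test_fraction` parameter below are all assumptions made for illustration only.

```python
import math


def outlier_test_split(embeddings, test_fraction=0.2):
    """Split sample indices so the test set holds the most atypical samples.

    Assumption: "outlier" means "far from the centroid of all embeddings".
    The paper may use a different detector.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    distances = [math.dist(vec, centroid) for vec in embeddings]
    n_test = max(1, round(n * test_fraction))
    # Rank samples from most to least distant from the centroid.
    ranked = sorted(range(n), key=lambda i: distances[i], reverse=True)
    test_idx = sorted(ranked[:n_test])
    train_idx = sorted(ranked[n_test:])
    return train_idx, test_idx
```

A random split would instead shuffle the indices; comparing model scores on the two kinds of split gives a rough sense of how much performance drops on out-of-domain material.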

[NLP-111] EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

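[Quick read]: This paper addresses the difficulty Text-to-Speech (TTS) models have in controlling the emotional expression of generated speech: existing TTS models struggle to offer fine-grained, free-form natural language control over emotion. It proposes EmoVoice, a novel emotion-controllable TTS model with two key innovations: 1) it uses large language models (LLMs) to enable fine-grained, freestyle natural language emotion control; 2) it introduces a phoneme boost variant, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques, in which the model outputs phoneme tokens and audio tokens in parallel to improve content consistency. The paper also builds EmoVoice-DB, a high-quality 40-hour English emotion dataset, to support training and evaluation. With these components, EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set and the Chinese Secap test set.

Link: https://arxiv.org/abs/2504.12867
Authors: Guanrou Yang,Chen Yang,Qian Chen,Ziyang Ma,Wenxi Chen,Wen Wang,Tianrui Wang,Yifan Yang,Zhikang Niu,Wenrui Liu,Fan Yu,Zhihao Du,Zhifu Gao,ShiLiang Zhang,Xie Chen
Affiliations: Shanghai Jiao Tong University; Tongyi Speech Lab, Hangzhou; Tianjin University, Tianjin; Zhejiang University, Hangzhou
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: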

Abstract:Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at this https URL. Dataset, code, and checkpoints will be released.

Computer Vision

[CV-0] Perception Encoder: The best visual embeddings are not at the output of the network

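[Quick read]: This paper asks how a single vision-language contrastive learning approach can produce strong, general-purpose embeddings for many downstream tasks such as classification, retrieval, question answering, and dense prediction. Traditional vision encoders rely on different pretraining objectives, each tailored to specific tasks. The paper finds that after scaling a carefully tuned image pretraining recipe and refining it with a robust video data engine, contrastive vision-language training alone is enough. The key to the solution is the Perception Encoder (PE) together with two alignment methods, language alignment for multimodal language modeling and spatial alignment for dense prediction, which draw out the useful embeddings hidden in the intermediate layers of the network.

Link: https://arxiv.org/abs/2504.13181
Authors: Daniel Bolya,Po-Yao Huang,Peize Sun,Jang Hyun Cho,Andrea Madotto,Chen Wei,Tengyu Ma,Jiale Zhi,Jathushan Rajasegaran,Hanoona Rasheed,Junke Wang,Marco Monteiro,Hu Xu,Shiyu Dong,Nikhila Ravi,Daniel Li,Piotr Dollár,Christoph Feichtenhofer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Initial Submission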

Abstract:We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods, language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video QA; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
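
For readers unfamiliar with contrastive vision-language training, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss over a batch of paired image and text embeddings. The abstract does not give PE's exact objective, so the loss form, the function names, and the `temperature` default are assumptions for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th text."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row is classified against all texts.
    i2t = sum(_cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column is classified against all images.
    t2i = sum(_cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

The loss is near zero when each image embedding is closest to its own caption and grows when pairs are mismatched.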

[CV-1] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

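[Quick read]: This paper addresses the lack of scientific transparency in vision-language model research that results from many high-performing models being closed-source. Distilling from black-box models to label training data yields strong benchmark results, but without knowing the details of the teacher model and its data sources, scientific progress is hard to measure. The paper's key contribution is a fully open and reproducible framework for transparent research in image and video understanding. It analyzes standard training pipelines without distillation from proprietary models and explores large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To fill these gaps, the authors release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. They also introduce PLM-VideoBench, an evaluation suite focused on reasoning about the "what", "where", "when", and "how" of a video. Data, training recipes, code, and models are provided so the work is fully reproducible.

Link: https://arxiv.org/abs/2504.13180
Authors: Jang Hyun Cho,Andrea Madotto,Effrosyni Mavroudi,Triantafyllos Afouras,Tushar Nagarajan,Muhammad Maaz,Yale Song,Tengyu Ma,Shuming Hu,Suyog Jain,Miguel Martin,Huiyu Wang,Hanoona Rasheed,Peize Sun,Po-Yao Huang,Daniel Bolya,Nikhila Ravi,Shashank Jain,Tammy Stark,Shane Moon,Babak Damavandi,Vivian Lee,Andrew Westbury,Salman Khan,Philipp Krähenbühl,Piotr Dollár,Lorenzo Torresani,Kristen Grauman,Christoph Feichtenhofer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical report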

Abstract:Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a video. We make our work fully reproducible by providing data, training recipes, code, and models.

[CV-2] ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation ICRA2025

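[Quick read]: This paper tackles object 6D pose estimation, a key challenge for robotic manipulation, and in particular the poor generalization of methods that combine visual and tactile (visuotactile) information, which stems from the limited availability of visuotactile data. It proposes ViTa-Zero, a zero-shot visuotactile pose estimation framework. The key innovation is to use a visual model as the backbone and to perform feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, the gripper-object interaction is modeled as a spring-mass system in which tactile sensors induce attractive forces and proprioception generates repulsive forces. Experiments on a real robot show that the method works across representative visual backbones and manipulation scenarios (grasping, object picking, and bimanual handover). Compared with purely visual models, ViTa-Zero overcomes some drastic failure modes when tracking the in-hand object pose, with an average increase of 55% in AUC of ADD-S and 60% in ADD, and an 80% lower position error compared to FoundationPose.

Link: https://arxiv.org/abs/2504.13179
Authors: Hongyu Li,James Akl,Srinath Sridhar,Tye Brady,Taskin Padir
Affiliations: Amazon Fulfillment Technologies & Robotics; Brown University; Northeastern University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025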

Abstract:Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
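
The ADD and ADD-S numbers quoted above are standard pose-error metrics. As a hedged illustration, the sketch below follows their commonly used definitions (mean distance between corresponding model points, and mean closest-point distance for symmetric objects); the function names are hypothetical, and the paper's exact evaluation protocol, including how AUC is computed, is not described in the abstract.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to 3D points."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_metric(points, r_pred, t_pred, r_gt, t_gt):
    """ADD: mean distance between corresponding model points under both poses."""
    pred = transform(points, r_pred, t_pred)
    gt = transform(points, r_gt, t_gt)
    return sum(math.dist(a, b) for a, b in zip(pred, gt)) / len(points)


def add_s_metric(points, r_pred, t_pred, r_gt, t_gt):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point."""
    pred = transform(points, r_pred, t_pred)
    gt = transform(points, r_gt, t_gt)
    return sum(min(math.dist(g, p) for p in pred) for g in gt) / len(gt)
```

For a symmetric object, a pose that is flipped by 180 degrees can have a large ADD but a zero ADD-S, which is why both metrics are usually reported.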

[CV-3] Single-Shot Shape and Reflectance with Spatial Polarization Multiplexing

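[Quick read]: This paper addresses the problem of reconstructing object shape and reflectance from a single polarimetric image and applies the result to dynamic surface recovery. Conventional single-pattern structured light enables single-shot shape reconstruction, but reflectance is hard to recover because the incident light is not sampled angularly and the projected pattern is entangled with the surface texture. The key solution is spatial polarization multiplexing (SPM): a spatially multiplexed polarization pattern is designed so that it can be robustly and uniquely decoded for shape reconstruction by quantizing the AoLP values; at the same time, a constrained de Bruijn sequence is used to project differently polarized light within a local region, which separates specular and diffuse reflections for BRDF estimation. The polarization pattern is invisible to the naked eye and retains the natural surface appearance, which matters for accurate appearance modeling and for interaction with people. Experiments show that the method recovers shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image and can be applied to dynamic surfaces.

Link: https://arxiv.org/abs/2504.13177
Authors: Tomoki Ichikawa,Ryo Kawahara,Ko Nishino
Affiliations: Graduate School of Informatics, Kyoto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: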

Abstract:We propose spatial polarization multiplexing (SPM) for reconstructing object shape and reflectance from a single polarimetric image and demonstrate its application to dynamic surface recovery. Although single-pattern structured light enables single-shot shape reconstruction, the reflectance is challenging to recover due to the lack of angular sampling of incident light and the entanglement of the projected pattern and the surface color texture. We design a spatially multiplexed pattern of polarization that can be robustly and uniquely decoded for shape reconstruction by quantizing the AoLP values. At the same time, our spatial-multiplexing enables single-shot ellipsometry of linear polarization by projecting differently polarized light within a local region, which separates the specular and diffuse reflections for BRDF estimation. We achieve this spatial polarization multiplexing with a constrained de Bruijn sequence. Unlike single-pattern structured light with intensity and color, our polarization pattern is invisible to the naked eye and retains the natural surface appearance which is essential for accurate appearance modeling and also interaction with people. We experimentally validate our method on real data. The results show that our method can recover the shape, the Mueller matrix, and the BRDF from a single-shot polarimetric image. We also demonstrate the application of our method to dynamic surfaces.
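
To show why a de Bruijn sequence makes a projected pattern uniquely decodable, here is a minimal sketch of the standard (unconstrained) construction together with a window lookup. The additional constraints the paper places on its sequence are not given in the abstract, and treating each symbol as one quantized AoLP level is an assumption made for illustration.

```python
def de_bruijn(k, n):
    """Return a cyclic de Bruijn sequence B(k, n) over symbols 0..k-1.

    Every length-n window appears exactly once when the sequence is read cyclically.
    """
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def window_lookup(sequence, n):
    """Map every cyclic window of length n to the position where it starts."""
    size = len(sequence)
    return {
        tuple(sequence[(i + j) % size] for j in range(n)): i
        for i in range(size)
    }
```

With, say, four quantized polarization angles and windows of three stripes, observing any three neighboring stripes identifies the position in the projected pattern, which is the property that structured-light decoding relies on.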

[CV-4] IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design

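[Quick read]: This paper addresses multi-condition controllable, high-fidelity garment generation for personalized fashion design and digital apparel applications. Existing methods are usually limited to single-condition inputs, whereas IMAGGarment-1 uses a two-stage training strategy: the first stage jointly encodes garment silhouette and color with a mixed attention module and a color adapter to model global appearance; the second stage injects user-defined logos and spatial constraints through an adaptive appearance-aware module to enhance local details. The key is to model global appearance and local details separately while still enabling unified, controllable generation through end-to-end inference. To support the task, the authors release GarmentBench, a dataset of over 180K garment samples.

Link: https://arxiv.org/abs/2504.13176
Authors: Fei Shen,Jian Yu,Cong Wang,Xin Jiang,Xiaoyu Du,Jinhui Tang
Affiliations: Nanjing University of Science and Technology; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: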

Abstract:This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability performance. The code and model are available at this https URL.

[CV-5] Generate but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling

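[Quick read]: This paper addresses visual hallucinations in Vision-Language Models (VLMs), where a model describes nonexistent objects, actions, or concepts, which poses significant risks in safety-critical applications. Existing mitigation methods fall into two groups: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. Generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requires multiple models, and tends to reject outputs rather than refine them.

The key solution is REVERSE, a unified framework that combines hallucination-aware training with on-the-fly self-verification. Using a new hallucination-verification dataset of over 1.3M semi-synthetic samples and a novel inference-time retrospective resampling technique, the approach lets VLMs detect hallucinations during generation and revise them dynamically. Experiments show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. The dataset, model, and code are publicly available.

Link: https://arxiv.org/abs/2504.13169
Authors: Tsung-Han Wu,Heekyung Lee,Jiaxin Ge,Joseph E. Gonzalez,Trevor Darrell,David M. Chan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Project Page: this https URL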

Abstract:Vision-Language Models (VLMs) excel at visual understanding but often suffer from visual hallucinations, where they generate descriptions of nonexistent objects, actions, or concepts, posing significant risks in safety-critical applications. Existing hallucination mitigation methods typically follow one of two paradigms: generation adjustment, which modifies decoding behavior to align text with visual inputs, and post-hoc verification, where external models assess and correct outputs. While effective, generation adjustment methods often rely on heuristics and lack correction mechanisms, while post-hoc verification is complicated, typically requiring multiple models and tending to reject outputs rather than refine them. In this work, we introduce REVERSE, a unified framework that integrates hallucination-aware training with on-the-fly self-verification. By leveraging a new hallucination-verification dataset containing over 1.3M semi-synthetic samples, along with a novel inference-time retrospective resampling technique, our approach enables VLMs to both detect hallucinations during generation and dynamically revise those hallucinations. Our evaluations show that REVERSE achieves state-of-the-art hallucination reduction, outperforming the best existing methods by up to 12% on CHAIR-MSCOCO and 28% on HaloQuest. Our dataset, model, and code are available at: this https URL.

[CV-6] ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos CVPR2025

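[Quick read]: This paper addresses photorealistic scene and human reconstruction from a single monocular in-the-wild video, which is central to perceiving a human-centric 3D world. Prior methods require pre-calibrated camera and human poses and days of training. The paper proposes a novel unified framework that performs camera tracking, human pose estimation, and human-scene reconstruction simultaneously and online. Its key components are: 3D Gaussian Splatting to learn Gaussian primitives for humans and scenes efficiently; reconstruction-based camera tracking and human pose estimation modules that disentangle pose and appearance; occlusion-aware human silhouette rendering and monocular geometric priors to learn the spatial correlation between human and scene accurately; and a human deformation module that faithfully reconstructs details and improves generalization to out-of-distribution poses. Experiments show performance superior to or on par with existing methods in camera tracking, human pose estimation, novel view synthesis, and runtime.

Link: https://arxiv.org/abs/2504.13167
Authors: Zetong Zhang,Manuel Kaufmann,Lixin Xue,Jie Song,Martin R. Oswald
Affiliations: ETH Zürich; HKUST(GZ); HKUST; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025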

Abstract:Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to reconstruct the details and enhance generalizability to out-of-distribution poses faithfully. Aiming to learn the spatial correlation between human and scene accurately, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at this https URL.

[CV-7] Personalized Text-to-Image Generation with Auto-Regressive Models

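[Quick read]: This paper studies personalized image synthesis in text-to-image generation, and in particular the potential of auto-regressive models for this task. Diffusion models dominate the area, while auto-regressive models, despite their unified architecture for text and image modeling, remain underexplored. The key solution is a two-stage training strategy: first optimize the text embeddings, then fine-tune the transformer layers. Experiments show that the method achieves subject fidelity and prompt following comparable to leading diffusion-based personalization methods, which supports the effectiveness of auto-regressive models for personalized image generation and suggests a new direction for future research.

Link: https://arxiv.org/abs/2504.13162
Authors: Kaiyue Sun,Xian Liu,Yao Teng,Xihui Liu
Affiliations: The University of Hong Kong; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL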

Abstract:Personalized image synthesis has emerged as a pivotal application in text-to-image generation, enabling the creation of images featuring specific subjects in diverse contexts. While diffusion models have dominated this domain, auto-regressive models, with their unified architecture for text and image modeling, remain underexplored for personalized image generation. This paper investigates the potential of optimizing auto-regressive models for personalized image synthesis, leveraging their inherent multimodal capabilities to perform this task. We propose a two-stage training strategy that combines optimization of text embeddings and fine-tuning of transformer layers. Our experiments on the auto-regressive model demonstrate that this method achieves comparable subject fidelity and prompt following to the leading diffusion-based personalization methods. The results highlight the effectiveness of auto-regressive models in personalized image generation, offering a new direction for future research in this area.

[CV-8] Digital Twin Generation from Visual Data: A Survey

[Quick read]: This survey reviews recent progress in generating digital twins from videos. The core question is how to build digital twins efficiently and accurately for robotics applications, media content creation, and design and construction work. The paper analyzes approaches including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, and sets out their advantages and limitations. It argues that the way forward is to combine these techniques while overcoming challenges such as occlusions, lighting variations, and scalability, so that digital twins can be applied more widely in practice.

Link: https://arxiv.org/abs/2504.13159
Authors: Andrew Melnik,Benjamin Alt,Giang Nguyen,Artur Wilkowski,Maciej Stefańczyk,Qirui Wu,Sinan Harms,Helge Rhodin,Manolis Savva,Michael Beetz
Affiliations: University of Bremen; Warsaw University of Technology; Bielefeld University; Simon Fraser University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This survey explores recent developments in generating digital twins from videos. Such digital twins can be used for robotics applications, media content creation, or design and construction work. We analyze various approaches, including 3D Gaussian Splatting, generative in-painting, semantic segmentation, and foundation models, highlighting their advantages and limitations. Additionally, we discuss challenges such as occlusions, lighting variations, and scalability, as well as potential future research directions. This survey aims to provide a comprehensive overview of state-of-the-art methodologies and their implications for real-world applications. Awesome list: this https URL

[CV-9] AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis CVPR2025

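[Quick read]: This paper addresses geometric reconstruction from images captured from a mixture of ground and aerial views, where current learning-based methods fail to handle the extreme viewpoint variation between aerial-ground image pairs. The authors hypothesize that the lack of high-quality, co-registered aerial-ground training data is a key reason. To overcome this, they propose a scalable framework that combines pseudo-synthetic renderings from city-wide 3D meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real crowd-sourced images improve visual fidelity for ground-level views where mesh-based renderings lack detail, which bridges the domain gap between real images and pseudo-synthetic renderings. Fine-tuning several state-of-the-art algorithms on this hybrid dataset yields significant improvements on real-world, zero-shot aerial-ground tasks.

Link: https://arxiv.org/abs/2504.13157
Authors: Khiem Vuong,Anurag Ghosh,Deva Ramanan,Srinivasa Narasimhan,Shubham Tulsiani
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Appearing in CVPR 2025. Project page: this https URL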

Abstract:We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
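
The "within 5 degrees of camera rotation error" figure can be made concrete with a small sketch. It uses the standard geodesic angle between two rotation matrices; the helper names and the way pairs are aggregated are assumptions, since the abstract does not spell out the evaluation code.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(r_pred^T r_gt) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def accuracy_within(pairs, threshold_deg=5.0):
    """Fraction of (predicted, ground-truth) rotation pairs under the error threshold."""
    hits = sum(
        1 for r_pred, r_gt in pairs
        if rotation_error_deg(r_pred, r_gt) <= threshold_deg
    )
    return hits / len(pairs)


def rot_z(deg):
    """Rotation about the z-axis, used here only to build example inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Under this metric, the reported improvement means the share of aerial-ground pairs counted as hits by `accuracy_within` rises from under 5% to nearly 56% after fine-tuning.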

[CV-10] Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs

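[Quick read]: This paper addresses open-vocabulary understanding at the intersection of natural language and 3D geometry, and the limits of existing methods in multi-view consistency and efficiency. Most existing methods iteratively optimize over per-view 2D semantic feature maps, which is inefficient and produces inconsistent 3D semantics across views. The paper proposes a training-free framework that builds a superpoint graph directly from Gaussian primitives. The graph partitions the scene into spatially compact, semantically coherent regions, which form view-consistent 3D entities and give open-vocabulary understanding a structured foundation. The key innovation is an efficient reprojection strategy, built on the graph structure, that lifts 2D semantic features onto the superpoints and avoids costly multi-view iterative training. The resulting representation has strong 3D semantic coherence and supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Experiments show state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction more than 30 times faster than existing methods.

Link: https://arxiv.org/abs/2504.13153
Authors: Shaohui Dai,Yansong Qu,Zheyan Li,Xinyang Li,Shengchuan Zhang,Liujuan Cao
Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: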

Abstract:Bridging natural language and 3D geometry is a crucial step toward flexible, language-driven scene understanding. While recent advances in 3D Gaussian Splatting (3DGS) have enabled fast and high-quality scene reconstruction, research has also explored incorporating open-vocabulary understanding into 3DGS. However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. The superpoint graph partitions the scene into spatially compact and semantically coherent regions, forming view-consistent 3D entities and providing a structured foundation for open-vocabulary understanding. Based on the graph structure, we design an efficient reprojection strategy that lifts 2D semantic features onto the superpoints, avoiding costly multi-view iterative training. The resulting representation ensures strong 3D semantic coherence and naturally supports hierarchical understanding, enabling both coarse- and fine-grained open-vocabulary perception within a unified semantic field. Extensive experiments demonstrate that our method achieves state-of-the-art open-vocabulary segmentation performance, with semantic field reconstruction completed over 30× faster. Our code will be available at this https URL.

[CV-11] St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World

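[Quick read]: This paper addresses the fact that dynamic 3D reconstruction and point tracking in videos are usually treated as separate tasks, and proposes a unified solution. The key idea is to predict two appropriately defined pointmaps for a pair of frames, both expressed at the same moment and in the same world coordinate frame, which captures static and dynamic scene geometry while maintaining 3D correspondences and so enables simultaneous 3D reconstruction and tracking from RGB inputs. Chaining these predictions through the video sequence with respect to a reference frame naturally yields long-range correspondences, combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground-truth supervision, the approach uses a novel adaptation scheme based on a reprojection loss. The authors also establish a new benchmark for world-frame reconstruction and tracking and use it to show the effectiveness and efficiency of the data-driven framework.

Link: https://arxiv.org/abs/2504.13152
Authors: Haiwen Feng,Junyi Zhang,Qianqian Wang,Yufei Ye,Pengcheng Yu,Michael J. Black,Trevor Darrell,Angjoo Kanazawa
Affiliations: UC Berkeley; Max Planck Institute for Intelligent Systems; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL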

Abstract:Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
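
As a rough illustration of what a reprojection loss measures, the sketch below projects predicted 3D points through a pinhole camera and averages the pixel distance to observed 2D locations. This is a generic formulation, not the paper's adaptation scheme: the intrinsics `fx`, `fy`, `cx`, `cy`, the source of the 2D observations, and the distance used are all assumptions.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates (pinhole model)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_loss(points_cam, pixels, fx, fy, cx, cy):
    """Mean pixel distance between projected 3D points and their 2D observations."""
    errors = [
        math.dist(project(p, fx, fy, cx, cy), uv)
        for p, uv in zip(points_cam, pixels)
    ]
    return sum(errors) / len(errors)
```

Because the loss needs only 2D observations and camera parameters, it can supervise 3D predictions on videos that have no 4D ground truth.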

[CV-12] Readable Twins of Unreadable Models

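[Quick read]: This paper addresses how to make an unreadable deep learning model (DLM) explainable. The key solution is to create a "readable twin" of the model, in the form of an imprecise information flow model (IIFM), that represents the behavior and decision process of the otherwise opaque DLM. The paper describes the complete procedure for switching from the DLM to the IIFM and illustrates it with a deep learning classifier for handwritten digit recognition on the MNIST data set. The core idea is that the IIFM offers an understandable view of the model's complex internal decision mechanism.

Link: https://arxiv.org/abs/2504.13150
Authors: Krzysztof Pancerz,Piotr Kulicki,Michał Kalisz,Andrzej Burda,Maciej Stanisławski,Jaromir Sarzyński
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Based on the abstract accepted for ISFS 2025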

Abstract:Creating responsible artificial intelligence (AI) systems is an important issue in contemporary research and development of works on AI. One of the characteristics of responsible AI systems is their explainability. In the paper, we are interested in explainable deep learning (XDL) systems. On the basis of the creation of digital twins of physical objects, we introduce the idea of creating readable twins (in the form of imprecise information flow models) for unreadable deep learning models. The complete procedure for switching from the deep learning model (DLM) to the imprecise information flow model (IIFM) is presented. The proposed approach is illustrated with an example of a deep learning classification model for image recognition of handwritten digits from the MNIST data set.

[CV-13] Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark

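[Quick read]: This paper addresses how to evaluate instruction-based image editing models across instructions of varying complexity. It proposes Complex-Edit, a comprehensive benchmark that uses GPT-4o to collect diverse editing instructions automatically at scale and a structured "Chain-of-Edit" pipeline to compose atomic edits into complex instructions. The key contribution is a benchmark that covers instructions of different complexity, paired with a suite of metrics and a VLM-based auto-evaluation pipeline that supports large-scale assessment. The study finds a significant gap between open-source and proprietary models that widens with instruction complexity; shows that higher complexity mainly hurts the retention of key elements from the input image and the overall aesthetic quality; finds that decomposing a complex instruction into sequential atomic steps degrades performance; and shows that a simple Best-of-N selection strategy improves results. It also observes a "curse of synthetic data": when synthetic data is used in training, edited images look increasingly synthetic as instruction complexity rises.

Link: https://arxiv.org/abs/2504.13143
Authors: Siwei Yang,Mude Hui,Bingchen Zhao,Yuyin Zhou,Nataniel Ruiz,Cihang Xie
Affiliations: University of California, Santa Cruz; University of Edinburgh; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL , Dataset: this https URL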

Abstract:We introduce Complex-Edit, a comprehensive benchmark designed to systematically evaluate instruction-based image editing models across instructions of varying complexity. To develop this benchmark, we harness GPT-4o to automatically collect a diverse set of editing instructions at scale. Our approach follows a well-structured “Chain-of-Edit” pipeline: we first generate individual atomic editing tasks independently and then integrate them to form cohesive, complex instructions. Additionally, we introduce a suite of metrics to assess various aspects of editing performance, along with a VLM-based auto-evaluation pipeline that supports large-scale assessments. Our benchmark yields several notable insights: 1) Open-source models significantly underperform relative to proprietary, closed-source models, with the performance gap widening as instruction complexity increases; 2) Increased instructional complexity primarily impairs the models' ability to retain key elements from the input images and to preserve the overall aesthetic quality; 3) Decomposing a complex instruction into a sequence of atomic steps, executed in a step-by-step manner, substantially degrades performance across multiple metrics; 4) A straightforward Best-of-N selection strategy improves results for both direct editing and the step-by-step sequential approach; and 5) We observe a “curse of synthetic data”: when synthetic data is involved in model training, the edited images from such models tend to appear increasingly synthetic as the complexity of the editing instructions rises, a phenomenon that intriguingly also manifests in the latest GPT-4o outputs.
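
Best-of-N selection, mentioned in insight 4, can be sketched in a few lines. The callables `edit_fn` and `score_fn` are hypothetical stand-ins for an image editing model and an automatic scorer (for example a VLM-based judge); their signatures are assumptions and do not correspond to any released API.

```python
import random


def best_of_n(image, instruction, edit_fn, score_fn, n=4, seed=0):
    """Generate n candidate edits with different seeds and keep the highest-scoring one.

    edit_fn(image, instruction, sample_seed) -> candidate (hypothetical signature)
    score_fn(image, instruction, candidate) -> float, higher is better (hypothetical)
    """
    rng = random.Random(seed)
    candidates = [
        edit_fn(image, instruction, rng.randrange(2**31)) for _ in range(n)
    ]
    scores = [score_fn(image, instruction, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

The cost scales linearly with `n`, so the strategy trades extra inference for quality without changing the editing model itself.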

[CV-14] PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition CVPR

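[Quick Read]: This paper addresses the opacity of decision-making in deep learning models for human action recognition (HAR), which stems from their black-box nature and matters most in real-world applications that require transparency and interpretability. Existing explainable AI (XAI) methods for video rely mainly on feature attribution or static textual concepts, and both struggle to capture the motion dynamics and temporal dependencies needed for action understanding. The key contribution is Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts to make video action recognition interpretable. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR uses human skeleton poses, which focus on body movement itself, to provide robust and easily understood explanations of motion dynamics. The core of the solution is two types of pose-based concepts: static pose concepts for spatial configurations in individual frames and dynamic pose concepts for motion patterns across multiple frames. Meaningful concepts are discovered automatically by clustering video pose sequences, with no manual annotation. The method achieves high classification performance while offering human-understandable insight into the model's reasoning, and it supports test-time interventions for debugging and improving model behavior.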

Link: https://arxiv.org/abs/2504.13140
Authors: Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi
Affiliations: Kyung Hee University, Republic of Korea; Korea University, Republic of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by CVPRW 2025

Abstract:Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model’s reasoning process, enabling test-time interventions for debugging and improving model behavior.
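To make the idea of a test-time intervention concrete, here is a minimal concept bottleneck sketch. It is not PCBEAR's implementation: the three pose concepts, the class weights, and the activation values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical pose concepts and class weights, chosen only for illustration.
CONCEPTS = ["arm raised", "knees bent", "leg swinging"]
CLASS_WEIGHTS: Dict[str, List[float]] = {
    "wave": [1.0, 0.0, 0.0],
    "squat": [0.0, 1.0, 0.0],
    "kick": [0.0, 0.0, 1.0],
}


def predict(concept_scores: List[float]) -> str:
    """Classify from concept activations only, via one linear layer."""
    def class_score(name: str) -> float:
        return sum(w * c for w, c in zip(CLASS_WEIGHTS[name], concept_scores))
    return max(CLASS_WEIGHTS, key=class_score)


def intervene(concept_scores: List[float], concept: str, value: float) -> List[float]:
    """Return a copy of the activations with one concept overridden by a human."""
    edited = list(concept_scores)
    edited[CONCEPTS.index(concept)] = value
    return edited


scores = [0.2, 0.7, 0.6]  # suppose the bottleneck wrongly fires "knees bent"
before = predict(scores)  # "squat"
after = predict(intervene(scores, "knees bent", 0.0))  # "kick"
```

Because the classifier sees only the concept activations, correcting one activation changes the prediction in a way a person can trace.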

[CV-15] Science-T2I: Addressing Scientific Illusions in Image Synthesis CVPR2025

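[Quick Read]: This paper addresses the shortcomings of generative AI in the scientific consistency and realism of synthesized images. The proposed solution has three parts. First, the authors build Science-T2I, an expert-annotated adversarial dataset containing 20k adversarial image pairs and 9k prompts that cover a wide range of scientific knowledge categories. Second, they develop SciScore, an end-to-end reward model that refines the scientific assessment of generated images by augmenting both the scientific comprehension and the visual capabilities of a pre-trained CLIP model. Third, building on SciScore, they propose a two-stage fine-tuning framework that combines supervised fine-tuning with masked online fine-tuning to incorporate scientific knowledge into existing generative models. Experiments show that the framework sets new standards for evaluating the scientific realism of generated content: SciScore performs close to human level, and applying the fine-tuning method to FLUX improves its SciScore performance by more than 50%.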

Link: https://arxiv.org/abs/2504.13129
Authors: Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie
Affiliations: New York University; University of Washington; University of Pennsylvania; University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025. Code, docs, weight, benchmark and training data are all available at this https URL

Abstract:We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.

[CV-16] Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

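[Quick Read]: This paper addresses the scarcity and saturation of high-quality image-text pairs for vision-language model pre-training. Existing training paradigms rely heavily on meticulously curated data, and as models and data scales grow rapidly, the limited supply of such data has become a bottleneck for further progress. To address this, the paper proposes a scalable caption generation approach that uses large-scale, low-hallucination synthetic captions as an alternative or complement to real-world data. The key is a novel generation pipeline in which a continuous DPO methodology markedly reduces hallucinations, yielding captions that are high-quality, low-hallucination, and knowledge-rich. Experiments show that the synthetic captions bring pre-training advantages and improve performance on both vision-language tasks and text-to-image generation.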

Link: https://arxiv.org/abs/2504.13123
Authors: Xinsong Zhang, Yarong Zeng, Xinting Huang, Hu Hu, Runquan Xie, Han Hu, Zhanhui Kang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.

[CV-17] VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

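[Quick Read]: This paper addresses the misalignment with human intuition and the video hallucination that Large Video Models (LVMs) exhibit in video understanding tasks. To tackle these challenges, it introduces VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. The key is to strengthen text-video preference alignment at three levels: i) the instance level, aligning overall video content with responses; ii) the temporal level, aligning video temporal semantics with event descriptions; and iii) the perceptive level, aligning spatial objects with language tokens. Because no dataset existed for fine-grained video-language preference alignment, the authors construct VistaDPO-7k, a dataset of 7.2K question-answer pairs annotated with chosen and rejected responses, together with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Experiments show that VistaDPO significantly improves existing LVMs and effectively mitigates video-language misalignment and hallucination on video hallucination, video question answering, and captioning tasks.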

Link: https://arxiv.org/abs/2504.13122
Authors: Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code and Data: this https URL

Abstract:Large Video Models (LVMs) built upon Large Language Models (LLMs) have shown promise in video understanding but often suffer from misalignment with human intuition and video hallucination issues. To address these challenges, we introduce VistaDPO, a novel framework for Video Hierarchical Spatial-Temporal Direct Preference Optimization. VistaDPO enhances text-video preference alignment across three hierarchical levels: i) Instance Level, aligning overall video content with responses; ii) Temporal Level, aligning video temporal semantics with event descriptions; and iii) Perceptive Level, aligning spatial objects with language tokens. Given the lack of datasets for fine-grained video-language preference alignment, we construct VistaDPO-7k, a dataset of 7.2K QA pairs annotated with chosen and rejected responses, along with spatial-temporal grounding information such as timestamps, keyframes, and bounding boxes. Extensive experiments on benchmarks such as Video Hallucination, Video QA, and Captioning performance tasks demonstrate that VistaDPO significantly improves the performance of existing LVMs, effectively mitigating video-language misalignment and hallucination. The code and data are available at this https URL.
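For readers new to Direct Preference Optimization, the standard per-pair DPO loss is sketched below. This is the generic formulation, not VistaDPO's hierarchical objective. The function names and the default `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed log-probabilities of whole responses.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

The loss is log 2 when the policy and reference agree, and it falls as the policy shifts probability toward the chosen response relative to the rejected one.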

[CV-18] UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

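[Quick Read]: This paper addresses the problem that existing inversion and editing methods designed for diffusion models are often ineffective or inapplicable to flow matching models. The straight-line, non-crossing trajectories of flow matching models pose challenges for diffusion-based approaches but also open the door to novel solutions. The key is a predictor-corrector-based framework with two core components, Uni-Inv and Uni-Edit. Uni-Inv is an effective inversion method designed for accurate reconstruction. Building on it, the authors extend the concept of delayed injection to flow models and propose Uni-Edit, a region-aware and robust image editing method. The approach is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while strongly preserving edit-irrelevant regions. Experiments confirm the superiority and generalizability of Uni-Inv and Uni-Edit.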

Link: https://arxiv.org/abs/2504.13109
Authors: Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, Renjie Liao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: this https URL

[CV-19] RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity

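[Quick Read]: This paper addresses the challenges of label ambiguity, occlusion, and background blending when detecting green fruits in complex orchard environments. Using a custom dataset with both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruit) annotations, it evaluates two object detection models, RF-DETR and YOLOv12. RF-DETR's key strength is its DINOv2 backbone and deformable attention, which excel at global context modeling and effectively identify partially occluded or ambiguous green fruits. YOLOv12 uses CNN-based attention to enhance local feature extraction, which suits computationally efficient edge deployment. The central finding is RF-DETR's advantage in complex spatial scenes: it reaches an mAP@50 of 0.9464 in single-class detection and 0.8298 in multi-class detection. Its fast convergence on dynamic visual data makes it particularly suitable for precision agriculture applications that require high detection accuracy.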

Link: https://arxiv.org/abs/2504.13099
Authors: Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
Affiliations: Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study conducts a detailed comparison of RF-DETR object detection base model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruits) annotations to assess model performance under dynamic real-world conditions. RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP50) of 0.9464 in single-class detection, proving its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Training dynamics analysis highlighted RF-DETR’s swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR’s effectiveness for precision agricultural applications, with YOLOv12 suited for fast-response scenarios. Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
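The two metrics quoted above differ only in how strictly a predicted box must overlap the ground truth. The sketch below shows the underlying intersection-over-union (IoU) test. It assumes the common COCO-style convention of thresholds from 0.50 to 0.95 in steps of 0.05 for mAP@50:95, and the helper names are illustrative rather than taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct at a threshold if its IoU reaches it."""
    return iou(pred, truth) >= threshold


# Thresholds averaged by mAP@50:95 (assumed COCO-style): 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS: List[float] = [t / 100 for t in range(50, 100, 5)]
```

A loosely fitting box can count as correct at IoU 0.5 yet fail at 0.75, which is one reason a model can lead on mAP@50 while another leads on mAP@50:95.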

[CV-20] EventVAD: Training-Free Event-Aware Video Anomaly Detection

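[Quick Read]: This paper addresses two main challenges in video anomaly detection (VAD). First, supervised methods depend on large amounts of in-domain training data and generalize poorly. Second, training-free methods can detect anomalies using the world knowledge of large language models (LLMs) but fall short in localizing fine-grained visual transitions and diverse events. To address these issues, the paper proposes EventVAD, an event-aware video anomaly detection framework. The key is to combine a dynamic graph architecture with multimodal LLMs through temporal-event reasoning: the framework captures event-aware video features, detects event boundaries with adaptive noise filtering and signal ratio thresholding, and guides the multimodal LLMs with a hierarchical prompting strategy to reason before making the final decision. On the UCF-Crime and XD-Violence datasets, this approach achieves state-of-the-art (SOTA) performance in the training-free setting.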

Link: https://arxiv.org/abs/2504.13092
Authors: Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li
Affiliations: Peking University; Tsinghua University; The University of Sheffield; University of Science and Technology Beijing; Institute of Computing Technology, Chinese Academy of Science; Queen’s University Belfast
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a large amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.

[CV-21] Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off

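[Quick Read]: This paper addresses Virtual Try-Off (VTOFF), the task of extracting standardized garment images from photos of clothed individuals. The key contribution is TryOffDiff, a diffusion-based model built on a latent diffusion framework with SigLIP image conditioning, which effectively captures garment properties such as texture, shape, and pattern. With class-specific embeddings, TryOffDiff supports multi-garment VTOFF, the first method of its kind. When paired with Virtual Try-On (VTON) models, TryOffDiff also reduces unwanted attribute transfer between people, such as skin color, and so improves Person-to-Person Virtual Try-On (p2p-VTON).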

Link: https://arxiv.org/abs/2504.13078
Authors: Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer
Affiliations: Machine Learning Group, CITEC, Bielefeld University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Computer vision is transforming fashion through Virtual Try-On (VTON) and Virtual Try-Off (VTOFF). VTON generates images of a person in a specified garment using a target photo and a standardized garment image, while a more challenging variant, Person-to-Person Virtual Try-On (p2p-VTON), uses a photo of another person wearing the garment. VTOFF, on the other hand, extracts standardized garment images from clothed individuals. We introduce TryOffDiff, a diffusion-based VTOFF model. Built on a latent diffusion framework with SigLIP image conditioning, it effectively captures garment properties like texture, shape, and patterns. TryOffDiff achieves state-of-the-art results on VITON-HD and strong performance on DressCode dataset, covering upper-body, lower-body, and dresses. Enhanced with class-specific embeddings, it pioneers multi-garment VTOFF, the first of its kind. When paired with VTON models, it improves p2p-VTON by minimizing unwanted attribute transfer, such as skin color. Code is available at: this https URL

[CV-22] Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data

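[Quick Read]: This paper aims to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, in particular source-free domain adaptation (SFDA) and person re-identification (ReID). It proposes a novel dual-region augmentation method whose key idea is to apply random noise perturbations to foreground objects and to spatially shuffle background patches. These structured transformations increase the diversity of the training data and thereby improve generalization and robustness. Experiments show that the method outperforms existing approaches on the PACS dataset for SFDA and on the Market-1501 and DukeMTMC-reID datasets for ReID, supporting its effectiveness for cross-domain generalization and for reducing dependence on manually annotated data.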

Link: https://arxiv.org/abs/2504.13077
Authors: Prasanna Reddy Pulakurthi, Majid Rabbani, Celso M. de Melo, Sohail A. Dianat, Raghuveer M. Rao
Affiliations: Rochester Institute of Technology; DEVCOM U.S. Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 2 figures, Accepted to SPIE DSC 2025 Conference: Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications III

Abstract:This paper introduces a novel dual-region augmentation approach designed to reduce reliance on large-scale labeled datasets while improving model robustness and adaptability across diverse computer vision tasks, including source-free domain adaptation (SFDA) and person re-identification (ReID). Our method performs targeted data transformations by applying random noise perturbations to foreground objects and spatially shuffling background patches. This effectively increases the diversity of the training data, improving model robustness and generalization. Evaluations on the PACS dataset for SFDA demonstrate that our augmentation strategy consistently outperforms existing methods, achieving significant accuracy improvements in both single-target and multi-target adaptation settings. By augmenting training data through structured transformations, our method enables model generalization across domains, providing a scalable solution for reducing reliance on manually annotated datasets. Furthermore, experiments on Market-1501 and DukeMTMC-reID datasets validate the effectiveness of our approach for person ReID, surpassing traditional augmentation techniques.
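The two transformations described above (noise on the foreground, shuffling of background patches) can be sketched on a tiny grayscale image. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract does not say how the foreground mask is obtained, so `fg_mask` is taken as given, and the patch size, Gaussian noise model, and seed are hypothetical choices.

```python
import random
from typing import List

Image = List[List[float]]
Mask = List[List[int]]  # 1 = foreground pixel, 0 = background pixel


def dual_region_augment(image: Image, fg_mask: Mask, patch: int = 2,
                        noise_std: float = 0.1, seed: int = 0) -> Image:
    """Add noise to foreground pixels and shuffle pure-background patches."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]

    # Foreground: perturb every masked pixel with Gaussian noise.
    for y in range(height):
        for x in range(width):
            if fg_mask[y][x]:
                out[y][x] += rng.gauss(0.0, noise_std)

    # Background: find patches that contain no foreground pixel at all.
    # Patches that mix foreground and background are left in place.
    corners = [
        (y, x)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
        if not any(fg_mask[y + dy][x + dx]
                   for dy in range(patch) for dx in range(patch))
    ]
    # Permute the contents of those patches among themselves.
    sources = corners[:]
    rng.shuffle(sources)
    for (ty, tx), (sy, sx) in zip(corners, sources):
        for dy in range(patch):
            for dx in range(patch):
                out[ty + dy][tx + dx] = image[sy + dy][sx + dx]
    return out
```

Shuffling only permutes existing background content, so the object stays where it was while its surroundings change.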

[CV-23] SkyReels-V2: Infinite-length Film Generative Model

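[Quick Read]: This paper addresses several long-standing challenges in video generation: motion dynamics are compromised to improve temporal visual quality; video duration is limited to 5-10 seconds in order to prioritize resolution; and shot-aware generation is inadequate because general-purpose multimodal large language models (MLLMs) cannot interpret cinematic grammar such as shot composition, actor expressions, and camera motion. These limitations hinder realistic long-form synthesis and professional film-style generation. To address them, the paper proposes SkyReels-V2, an infinite-length film generative model that integrates an MLLM, multi-stage pretraining, reinforcement learning, and a diffusion forcing framework. First, the authors design a comprehensive structural representation of video that combines general descriptions from the MLLM with detailed shot language from sub-expert models, and, aided by human annotation, train a unified video captioner, SkyCaptioner-V1. Second, they establish progressive-resolution pretraining for basic video generation, followed by four post-training stages: initial concept-balanced supervised fine-tuning, motion-specific reinforcement learning that targets dynamic artifacts, a diffusion forcing framework with non-decreasing noise schedules that enables long-video synthesis, and a final high-quality supervised fine-tuning stage that refines visual fidelity. All code and models are publicly available.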

Link: https://arxiv.org/abs/2504.13074
Authors: Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Juncheng Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, Weiming Xiong, Wei Wang, Nuo Pang, Kang Kang, Zhiheng Xu, Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li, Zhengcong Fei, Yang Li, Yahui Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 10 figures

Abstract:Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs’ inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at this https URL.

[CV-24] HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

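[Quick Read]: This paper addresses two main problems of existing scene-level 3D generation methods: limited object categories and insufficient editing flexibility for interactive applications. It proposes HiScene, a novel hierarchical framework whose core idea is to treat a scene as a hierarchical "object" under isometric views, where a room is a complex object that can be further decomposed into manipulable items. To ensure the completeness and spatial alignment of each decomposed instance, the method uses a video-diffusion-based amodal completion technique and introduces shape prior injection to maintain spatial coherence within the scene. Experiments show that the method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.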

Link: https://arxiv.org/abs/2504.13072
Authors: Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, Zhaopeng Cui
Affiliations: Zhejiang University; ByteDance
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Project webpage: this https URL

Abstract:Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical “objects” under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.

[CV-25] EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance CVPR2025

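[Quick Read]: This paper addresses the heavy reliance of echocardiography on experienced sonographers, which probe guidance systems for acquiring standard plane images aim to reduce. The key solution is EchoWorld, a motion-aware world modeling framework that encodes anatomical knowledge and motion-induced visual dynamics while effectively leveraging past visual-motion sequences to improve guidance precision. At its core is a pre-training strategy inspired by world modeling principles, followed by a motion-aware attention mechanism introduced at the fine-tuning stage to integrate historical visual-motion data, enabling precise and adaptive probe guidance.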

Link: https://arxiv.org/abs/2504.13065
Authors: Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang
Affiliations: Tsinghua University; PLA General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at this https URL.

[CV-26] ArtistAuditor: Auditing Artist Style Pirate in Text-to-Image Generation Models

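[Quick Read]: This paper addresses copyright infringement of artworks by text-to-image generation models, especially in the case where the artwork or the model has already been published online, so that traditional strategies that modify the original artwork or retrain the model are not feasible. It proposes ArtistAuditor, a novel method for detecting whether a text-to-image model has been fine-tuned on the works of a specific artist. The key is that ArtistAuditor uses a style extractor to obtain multi-granularity style representations, treats artworks as samplings of an artist's style, and then queries a trained discriminator to reach the auditing decision. Experiments on six combinations of models and datasets report high AUC values (0.937), supporting its effectiveness.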

Link: https://arxiv.org/abs/2504.13061
Authors: Linkang Du, Zheng Zhu, Min Chen, Zhou Su, Shouling Ji, Peng Cheng, Jiming Chen, Zhikun Zhang
Affiliations: Xi’an Jiaotong University (XJTU); Zhejiang University (ZJU); The Chinese University of Hong Kong; Vrije Universiteit Amsterdam (VU); Hangzhou Dianzi University (HDNU)
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: To appear in the ACM Web Conference 2025, Sydney, Australia

Abstract:Text-to-image models based on diffusion processes, such as DALL-E, Stable Diffusion, and Midjourney, are capable of transforming texts into detailed images and have widespread applications in art and design. As such, amateur users can easily imitate professional-level paintings by collecting an artist’s work and fine-tuning the model, leading to concerns about artworks’ copyright infringement. To tackle these issues, previous studies either add visually imperceptible perturbation to the artwork to change its underlying styles (perturbation-based methods) or embed post-training detectable watermarks in the artwork (watermark-based methods). However, when the artwork or the model has been published online, i.e., modification to the original artwork or model retraining is not feasible, these strategies might not be viable. To this end, we propose a novel method for data-use auditing in the text-to-image generation model. The general idea of ArtistAuditor is to identify if a suspicious model has been finetuned using the artworks of specific artists by analyzing the features related to the style. Concretely, ArtistAuditor employs a style extractor to obtain the multi-granularity style representations and treats artworks as samplings of an artist’s style. Then, ArtistAuditor queries a trained discriminator to gain the auditing decisions. The experimental results on six combinations of models and datasets show that ArtistAuditor can achieve high AUC values (0.937). By studying ArtistAuditor’s transferability and core modules, we provide valuable insights into the practical implementation. Finally, we demonstrate the effectiveness of ArtistAuditor in real-world cases by an online platform Scenario. ArtistAuditor is open-sourced at this https URL.
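The AUC figure quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. Below is a minimal sketch of that pairwise definition. The function name and the example scores are illustrative, not taken from ArtistAuditor.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Illustrative discriminator scores: models fine-tuned on the artist (positives)
# versus models that were not (negatives).
example_auc = roc_auc([0.9, 0.4], [0.5, 0.1])  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the metric does not depend on any particular decision threshold.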

[CV-27] Imaging for All-Day Wearable Smart Glasses

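[Quick Read]: This paper addresses the imaging system design challenges that all-day wearable smart glasses face under constraints of volume, weight, fashionability, and social acceptability. Taking into account that smart glasses are worn in arbitrary environments while the wearer moves and performs everyday activities, it analyzes the fundamental limits of imaging from smart glasses and discusses their impact on achievable image quality and camera module size, in comparison with related devices such as mobile phones. The key solution is a novel distributed imaging approach that replaces the standard monolithic camera design with a distributed one, reducing the size of the individual camera modules to fit the compact form factor of smart glasses. The approach is demonstrated in experiments using synthetic data and images captured with prototype devices.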

Link: https://arxiv.org/abs/2504.13060
Authors: Michael Goesele, Daniel Andersen, Yujia Chen, Simon Green, Eddy Ilg, Chao Li, Johnson Liu, Grace Kuo, Logan Wan, Richard Newcombe
Affiliations: Meta Reality Labs Research; University of Technology Nuremberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, smart glasses technology has rapidly advanced, opening up entirely new areas for mobile computing. We expect future smart glasses will need to be all-day wearable, adopting a small form factor to meet the requirements of volume, weight, fashionability and social acceptability, which puts significant constraints on the space of possible solutions. Additional challenges arise due to the fact that smart glasses are worn in arbitrary environments while their wearer moves and performs everyday activities. In this paper, we systematically analyze the space of imaging from smart glasses and derive several fundamental limits that govern this imaging domain. We discuss the impact of these limits on achievable image quality and camera module size, comparing in particular to related devices such as mobile phones. We then propose a novel distributed imaging approach that allows the size of the individual camera modules to be minimized when compared to a standard monolithic camera design. Finally, we demonstrate the properties of this novel approach in a series of experiments using synthetic data as well as images captured with two different prototype implementations.

[CV-28] NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

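[Quick Read]: This paper addresses two main challenges for vision-language models (VLMs) trained with reinforcement learning (RL): how to enhance policy exploration so that test-time compute scales more effectively, and how to overcome imperfect visual perception, which affects the subsequent reasoning process. It proposes NoisyRollout, a simple yet effective RL method. The key is to mix trajectories from clean and moderately distorted images, which introduces targeted diversity in visual perception and reasoning patterns and strengthens exploration through a vision-oriented inductive bias. NoisyRollout also uses a noise annealing schedule that gradually reduces distortion strength during training, so that the model benefits from noisy signals early on while training remains stable and scalable in later stages.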

Link: https://arxiv.org/abs/2504.13055
Authors: Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh
Affiliations: National University of Singapore; Sea AI Lab, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report


Abstract:Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to more effectively scale test-time compute remains underexplored in VLMs. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. To this end, we propose NoisyRollout, a simple yet effective RL approach that mixes trajectories from both clean and moderately distorted images to introduce targeted diversity in visual perception and the resulting reasoning patterns. Without additional training cost, NoisyRollout enhances the exploration capabilities of VLMs by incorporating a vision-oriented inductive bias. Furthermore, NoisyRollout employs a noise annealing schedule that gradually reduces distortion strength over training, ensuring benefit from noisy signals early while maintaining training stability and scalability in later stages. With just 2.1K training samples, NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models on 5 out-of-domain benchmarks spanning both reasoning and perception tasks, while preserving comparable or even better in-domain performance.
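
The noise annealing idea in the abstract above can be illustrated with a small sketch. The abstract does not state the shape of the schedule, so the cosine decay, the function names, the default strengths, and the 50/50 clean-to-noisy split below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def annealed_noise_strength(step, total_steps, initial_strength=0.5, final_strength=0.0):
    """Cosine decay of distortion strength over training (hypothetical schedule)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return final_strength + (initial_strength - final_strength) * cosine


def split_rollouts(num_rollouts, noisy_fraction=0.5):
    """Return (n_clean, n_noisy): how many rollouts use clean vs. distorted images."""
    n_noisy = int(round(num_rollouts * noisy_fraction))
    return num_rollouts - n_noisy, n_noisy
```

In a training loop, one would presumably call `annealed_noise_strength(step, total_steps)` once per update and distort the images for the noisy share of rollouts with that strength, so that late updates see almost clean inputs only.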

[CV-29] Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image Classification

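[Quick read]: This paper addresses several challenges in hyperspectral image classification: high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often cause overfitting and limit generalization. To adapt to ground-object distributions and extract image features more efficiently, without introducing excessive parameters and while skipping redundant information, it proposes EKGNet, built on an improved 3D-DenseNet. The core is a context-aware mapping network coupled with a dynamic kernel generation module. The mapping module translates the global context of the hyperspectral input into instructions for combining base convolutional kernels, and the dynamic kernels are composed of K groups of base convolutions, analogous to K experts that each specialize in fundamental patterns along different dimensions. This tightly coupled system lets the model shift its focus according to the input instead of relying on the fixed receptive field of a single static kernel, which increases representational capacity without increasing network depth or width. The method outperforms mainstream hyperspectral image classification approaches on the IN, UP, and KSC datasets.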

Link: https://arxiv.org/abs/2504.13045
Authors: Guandong Li, Mengxia Ye
Affiliations: aiFLYTEK; Aegon THTF
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2503.23472


Abstract:Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes EKGNet based on an improved 3D-DenseNet model, consisting of a context-aware mapping network and a dynamic kernel generation module. The context-aware mapping module translates global contextual information of hyperspectral inputs into instructions for combining base convolutional kernels, while the dynamic kernels are composed of K groups of base convolutions, analogous to K different types of experts specializing in fundamental patterns across various dimensions. The mapping module and dynamic kernel generation mechanism form a tightly coupled system - the former generates meaningful combination weights based on inputs, while the latter constructs an adaptive expert convolution system using these weights. This dynamic approach enables the model to focus more flexibly on key spatial structures when processing different regions, rather than relying on the fixed receptive field of a single static convolutional kernel. EKGNet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
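
The coupling between the mapping module and the K base kernels can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: global average pooling as the context summary, a single linear layer (`mapping_weights`) as the mapping network, and a softmax over the K experts are all choices made here, and every name is hypothetical. The paper's actual architecture may differ.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def dynamic_kernel(x, base_kernels, mapping_weights):
    """Combine K base kernels using weights predicted from global context.

    x:               (C, D, H, W) hyperspectral feature cube
    base_kernels:    (K, kd, kh, kw) K "expert" kernels
    mapping_weights: (K, C) linear map from pooled context to expert logits
    """
    context = x.mean(axis=(1, 2, 3))                     # global average pooling -> (C,)
    alphas = softmax(mapping_weights @ context)          # (K,) combination weights
    kernel = np.tensordot(alphas, base_kernels, axes=1)  # weighted sum over the K experts
    return kernel, alphas
```

Because the combined kernel has the same shape as one base kernel, the convolution itself costs the same as a static one; only the small mapping step is added per input.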

[CV-30] Event-Enhanced Blurry Video Super-Resolution AAAI2025

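[Quick read]: This paper tackles blurry video super-resolution (BVSR), which aims to generate sharp high-resolution videos from low-resolution, blurry inputs. Current BVSR methods often fail to restore fine details at high resolution, mainly because insufficient motion information leads to artifacts during deconvolution and because low-resolution frames lack high-frequency details. To address this, the paper introduces event signals into BVSR and proposes a novel event-enhanced network, Ev-DeblurVSR. The key components are (1) a reciprocal feature deblurring module that fuses frame and event information, using motion information from intra-frame events to deblur frame features while using global scene context from the frames to enhance event features, and (2) a hybrid deformable alignment module that improves temporal consistency by exploiting the complementary motion information of inter-frame events and optical flow for motion estimation during deformable alignment. Extensive experiments show state-of-the-art results on both synthetic and real-world datasets; on real data the method is 2.59 dB more accurate and 7.28× faster than FMA-Net, the recent best BVSR baseline.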

Link: https://arxiv.org/abs/2504.13042
Authors: Dachun Kai, Yueyi Zhang, Jin Wang, Zeyu Xiao, Zhiwei Xiong, Xiaoyan Sun
Affiliations: Beijing Institute of Technology; Chinese Academy of Sciences; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2025. Project page: this https URL


Abstract:In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is +2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net. Code: this https URL.

[CV-31] Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

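[Quick read]: This paper addresses the difficulty of achieving both search accuracy and efficiency in partially relevant video retrieval (PRVR). In PRVR, incorporating more diverse context representations at varying temporal scales for each video improves accuracy but increases computational and memory costs. To resolve this tension, the paper proposes a prototypical PRVR framework whose key idea is to encode the diverse contexts within a video into a fixed number of prototypes. Several strategies then strengthen text association and video understanding within the prototypes, and an orthogonal objective ensures that the prototypes capture a diverse range of content. To keep the prototypes searchable by text queries while accurately encoding video context, the framework uses cross-modal and uni-modal reconstruction tasks: the cross-modal task aligns the prototypes with textual features in a shared space, and the uni-modal task preserves all video contexts during encoding. A video mixing technique additionally provides weak guidance to further align prototypes with their associated textual representations. Evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the approach without sacrificing efficiency.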

Link: https://arxiv.org/abs/2504.13035
Authors: WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo
Affiliations: Sungkyunkwan University; NAVER Cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
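
One common way to express an orthogonality objective over a set of prototypes is to penalize the off-diagonal entries of their cosine-similarity matrix. The sketch below assumes that formulation; the abstract does not give the exact loss, and the function name is hypothetical.

```python
import numpy as np


def orthogonality_penalty(prototypes):
    """Sum of squared off-diagonal cosine similarities between prototypes.

    prototypes: (N, D) array, one row per prototype. The penalty is zero
    when all prototypes are mutually orthogonal and grows as they overlap.
    """
    norms = np.linalg.norm(prototypes, axis=1, keepdims=True)
    unit = prototypes / np.clip(norms, 1e-12, None)   # L2-normalize each row
    gram = unit @ unit.T                              # pairwise cosine similarities
    off_diag = gram - np.eye(len(prototypes))         # remove self-similarity
    return float(np.sum(off_diag ** 2))
```

Minimizing such a term alongside the retrieval loss pushes the fixed set of prototypes to cover different content instead of collapsing onto the same dominant context.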

[CV-32] TTRD3: Texture Transfer Residual Denoising Dual Diffusion Model for Remote Sensing Image Super-Resolution

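[Quick read]: This paper addresses three key challenges in remote sensing image super-resolution (RSISR): (1) the difficulty of extracting multi-scale features from spatially heterogeneous remote sensing scenes; (2) limited prior information, which causes semantic inconsistency in reconstructions; and (3) an imbalanced trade-off between geometric accuracy and visual quality. It proposes the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: a Multi-scale Feature Aggregation Block (MFAB) that uses parallel heterogeneous convolutional kernels for multi-scale feature extraction; a Sparse Texture Transfer Guidance (STTG) module that transfers high-resolution texture priors from reference images of similar scenes; and a Residual Denoising Dual Diffusion Model (RDDM) framework that combines residual diffusion for deterministic reconstruction with noise diffusion for diverse generation. Experiments on multi-source remote sensing datasets show that TTRD3 outperforms state-of-the-art methods, with a 1.43% improvement in LPIPS and a 3.67% improvement in FID over the best-performing baselines.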

Link: https://arxiv.org/abs/2504.13026
Authors: Yide Liu, Haijiang Sun, Xiaowen Zhang, Qiaoyuan Liu, Zhouchang Chen, Chongzhuo Xiao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3’s superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: this https URL.

[CV-33] Riemannian Patch Assignment Gradient Flows

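[Quick read]: This paper addresses metric data labeling on graphs. Labelings are obtained by regularizing initial local labelings through the dynamic interaction of labels and label assignments across the graph. The key is that this interaction is encoded entirely by a dictionary of competing labeled patches and mediated by patch assignment variables. Maximal consistency of patch assignments is achieved by geometric numerical integration of a Riemannian ascent flow, which arises as a critical point of a Lagrangian action functional. Experiments illustrate properties of the approach, including uncertainty quantification of label assignments.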

Link: https://arxiv.org/abs/2504.13024
Authors: Daniel Gonzalez-Alvarado, Fabio Schlindwein, Jonas Cassel, Laura Steingruber, Stefania Petra, Christoph Schnörr
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This paper introduces patch assignment flows for metric data labeling on graphs. Labelings are determined by regularizing initial local labelings through the dynamic interaction of both labels and label assignments across the graph, entirely encoded by a dictionary of competing labeled patches and mediated by patch assignment variables. Maximal consistency of patch assignments is achieved by geometric numerical integration of a Riemannian ascent flow, as critical point of a Lagrangian action functional. Experiments illustrate properties of the approach, including uncertainty quantification of label assignments.
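
To give a feel for what geometric integration on the probability simplex looks like, the sketch below implements one multiplicative (geometric Euler) step of a plain, label-level assignment flow. It is deliberately simpler than the paper's method: there is no patch dictionary, and the "fitness" is assumed to be a neighborhood average of the current assignments. The function name and the averaging matrix are hypothetical.

```python
import numpy as np


def geometric_euler_step(W, averaging, step_size=0.5):
    """One multiplicative update that keeps every row of W on the simplex.

    W:         (n_nodes, n_labels) assignment matrix, rows positive and summing to 1
    averaging: (n_nodes, n_nodes) row-stochastic neighborhood averaging matrix
    """
    S = averaging @ W                      # neighborhood-averaged assignments (assumed fitness)
    W_new = W * np.exp(step_size * S)      # multiplicative update preserves positivity
    return W_new / W_new.sum(axis=1, keepdims=True)  # renormalize each row
```

Iterating the step drives each row toward a single label while neighboring nodes pull each other toward a consistent decision; the intermediate rows can be read as soft label probabilities, which is one way such flows expose uncertainty.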

[CV-34] CompGS++: Compressed Gaussian Splatting for Static and Dynamic Scene Representation

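[Quick read]: This paper addresses the large data volume that Gaussian splatting incurs in high-fidelity 3D scene modeling because of primitive redundancy, and proposes an efficient compression method for both static and dynamic scenes that suits transmission over existing Internet infrastructure. It presents Compressed Gaussian Splatting (CompGS++), a framework that uses compact Gaussian primitives to achieve accurate 3D modeling at a substantially reduced size. The solution has two core parts: (1) a comprehensive prediction paradigm that removes redundancy between primitives, consisting of a spatial primitive prediction module that reduces spatial redundancy and a temporal primitive prediction module that reduces temporal redundancy in dynamic scenes; and (2) a rate-constrained optimization module that jointly minimizes reconstruction error and rate consumption, which removes parameter redundancy within primitives and makes the scene representation more compact overall. On multiple benchmark datasets CompGS++ significantly outperforms existing methods, achieving superior compression while preserving accurate scene modeling.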

Link: https://arxiv.org/abs/2504.13022
Authors: Xiangrui Liu, Xinju Wu, Shiqi Wang, Zhu Li, Sam Kwong
Affiliations: Department of Computer Science, City University of Hong Kong; Department of Computer Science and Electrical Engineering, University of Missouri–Kansas City; Department of Computing and Decision Science, Lingnan University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to a journal


Abstract:Gaussian splatting demonstrates proficiency for 3D scene modeling but suffers from substantial data volume due to inherent primitive redundancy. To enable future photorealistic 3D immersive visual communication applications, significant compression is essential for transmission over the existing Internet infrastructure. Hence, we propose Compressed Gaussian Splatting (CompGS++), a novel framework that leverages compact Gaussian primitives to achieve accurate 3D modeling with substantial size reduction for both static and dynamic scenes. Our design is based on the principle of eliminating redundancy both between and within primitives. Specifically, we develop a comprehensive prediction paradigm to address inter-primitive redundancy through spatial and temporal primitive prediction modules. The spatial primitive prediction module establishes predictive relationships for scene primitives and enables most primitives to be encoded as compact residuals, substantially reducing the spatial redundancy. We further devise a temporal primitive prediction module to handle dynamic scenes, which exploits primitive correlations across timestamps to effectively reduce temporal redundancy. Moreover, we devise a rate-constrained optimization module that jointly minimizes reconstruction error and rate consumption. This module effectively eliminates parameter redundancy within primitives and enhances the overall compactness of scene representations. Comprehensive evaluations across multiple benchmark datasets demonstrate that CompGS++ significantly outperforms existing methods, achieving superior compression performance while preserving accurate scene modeling. Our implementation will be made publicly available on GitHub to facilitate further research.
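
The two ideas in the abstract, coding primitives as residuals from a prediction and trading distortion against rate, can be shown with a toy example. Everything below is an assumption for illustration: uniform scalar quantization, a zeroth-order entropy estimate in place of a learned entropy model, a fixed anchor assignment, and all function names. It is not the CompGS++ codec.

```python
import numpy as np


def encode_residuals(primitives, anchor_ids, anchors, step=0.05):
    """Quantize each primitive as an integer residual from its assigned anchor."""
    residuals = primitives - anchors[anchor_ids]
    return np.round(residuals / step).astype(np.int64)


def decode_residuals(symbols, anchor_ids, anchors, step=0.05):
    """Reconstruct primitives from anchors plus dequantized residuals."""
    return anchors[anchor_ids] + symbols * step


def empirical_bits(symbols):
    """Zeroth-order entropy estimate of the symbol stream, in bits."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(counts * np.log2(p)).sum())


def rate_distortion_loss(original, reconstructed, bits, lam=1e-3):
    """Distortion (MSE) plus lambda times rate: the quantity a codec trades off."""
    distortion = float(np.mean((original - reconstructed) ** 2))
    return distortion + lam * bits
```

Good predictions make most residuals quantize to zero, which lowers the entropy of the symbol stream; `lam` sets how much reconstruction error is accepted per saved bit.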

[CV-35] Pose and Facial Expression Transfer by using StyleGAN

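[Quick read]: This paper addresses the transfer of pose and expression between face images. The key is an architecture with two encoders and a mapping network that projects the source and target inputs into the latent space of StyleGAN2, whose generator then produces an output image carrying the pose and expression of the source face on the target identity. Training is self-supervised from video sequences of many individuals and requires no manual labeling. The model enables the synthesis of random identities with controllable pose and expression and achieves close-to-real-time performance.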

Link: https://arxiv.org/abs/2504.13021
Authors: Petr Jahoda, Jan Cech
Affiliations: Faculty of Electrical Engineering, Czech Technical University in Prague
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at CVWW 2024. Presented in Terme Olimia, Slovenia


Abstract:We propose a method to transfer pose and expression between face images. Given a source and target face portrait, the model produces an output image in which the pose and expression of the source face image are transferred onto the target identity. The architecture consists of two encoders and a mapping network that projects the two inputs into the latent space of StyleGAN2, which finally generates the output. The training is self-supervised from video sequences of many individuals. Manual labeling is not required. Our model enables the synthesis of random identities with controllable pose and expression. Close-to-real-time performance is achieved.

[CV-36] Hierarchical Feature Learning for Medical Point Clouds via State Space Model

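[Quick read]: This paper addresses the underexplored potential of medical point cloud understanding for disease diagnosis and treatment. Existing work focuses on point cloud modeling for general shape analysis, and little research targets medical point clouds. The paper proposes a hierarchical feature learning framework based on state space models (SSMs). The input point cloud is down-sampled into multiple levels by farthest point sampling (FPS), and at each level a series of k-nearest neighbor (KNN) queries aggregates multi-scale structural information. Coordinate-order and inside-out scanning strategies serialize the irregular points efficiently, and point features are computed progressively through vanilla and group Point SSM blocks to capture both local patterns and long-range dependencies. On MedPointS, a large-scale medical point cloud dataset built by the authors, the method achieves superior performance on anatomy classification, completion, and segmentation.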

Link: https://arxiv.org/abs/2504.13015
Authors: Guoqing Zhang, Jingyun Yang, Yang Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures


Abstract:Deep learning-based point cloud modeling has been widely investigated as an indispensable component of general shape analysis. Recently, transformer and state space model (SSM) have shown promising capacities in point cloud learning. However, limited research has been conducted on medical point clouds, which have great potential in disease diagnosis and treatment. This paper presents an SSM-based hierarchical feature learning framework for medical point cloud understanding. Specifically, we down-sample input into multiple levels through the farthest point sampling. At each level, we perform a series of k-nearest neighbor (KNN) queries to aggregate multi-scale structural information. To assist SSM in processing point clouds, we introduce coordinate-order and inside-out scanning strategies for efficient serialization of irregular points. Point features are calculated progressively from short neighbor sequences and long point sequences through vanilla and group Point SSM blocks, to capture both local patterns and long-range dependencies. To evaluate the proposed method, we build a large-scale medical point cloud dataset named MedPointS for anatomy classification, completion, and segmentation. Extensive experiments conducted on MedPointS demonstrate that our method achieves superior performance across all tasks. The dataset is available at this https URL. Code is merged to a public medical imaging platform: this https URL.
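
The two building blocks named in the abstract, farthest point sampling and k-nearest-neighbor queries, are standard and can be sketched directly. The function names and the brute-force distance computation are choices made here for clarity; a production pipeline would more likely use a GPU kernel or a spatial index.

```python
import numpy as np


def farthest_point_sampling(points, n_samples, start_index=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    points: (N, 3) array. Returns the indices of the n_samples selected points.
    """
    n = points.shape[0]
    if not 0 < n_samples <= n:
        raise ValueError("n_samples must be in 1..N")
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = start_index
    min_dist = np.full(n, np.inf)  # squared distance to the nearest selected point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(min_dist))
    return selected


def knn_indices(points, centers, k):
    """For each center, return the indices of its k nearest points (brute force)."""
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]
```

Calling `farthest_point_sampling` with decreasing `n_samples` yields the multi-level hierarchy, and `knn_indices(points, points[selected], k)` with several values of `k` gives the short neighbor sequences at different scales.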

[CV-37] GSAC: Leveraging Gaussian Splatting for Photorealistic Avatar Creation with Unity Integration

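[Quick read]: This paper addresses the high cost, long creation time, and limited utility in virtual applications of existing photorealistic avatar creation techniques. Manual methods such as MetaHuman require extensive time and expertise, while automatic NeRF-based pipelines often lack efficiency and detailed facial expression fidelity and cannot render fast enough for real-time applications. The paper introduces an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that takes monocular video as input and produces a scalable, efficient photorealistic avatar directly compatible with the Unity game engine. Its core contributions are a novel Gaussian splatting technique with customized preprocessing, which supports in-the-wild monocular video capture, detailed facial expression reconstruction, and embedding within a fully rigged avatar model, and a Unity-integrated Gaussian Splatting Avatar Editor that offers a user-friendly environment for VR/AR application development. Experiments validate the preprocessing pipeline for standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity.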

Link: https://arxiv.org/abs/2504.12999
Authors: Rendong Zhang, Alexandra Watkins, Nilanjan Sarkar
Affiliations: Dept. of Computer Science, Vanderbilt University; Dept. of Mechanical Engineering, Vanderbilt University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Photorealistic avatars have become essential for immersive applications in virtual reality (VR) and augmented reality (AR), enabling lifelike interactions in areas such as training simulations, telemedicine, and virtual collaboration. These avatars bridge the gap between the physical and digital worlds, improving the user experience through realistic human representation. However, existing avatar creation techniques face significant challenges, including high costs, long creation times, and limited utility in virtual applications. Manual methods, such as MetaHuman, require extensive time and expertise, while automatic approaches, such as NeRF-based pipelines, often lack efficiency and detailed facial expression fidelity, and cannot be rendered at a speed sufficient for real-time applications. By involving several cutting-edge modern techniques, we introduce an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that leverages monocular video input to create a scalable and efficient photorealistic avatar directly compatible with the Unity game engine. Our pipeline incorporates a novel Gaussian splatting technique with customized preprocessing that enables the use of “in the wild” monocular video capture, detailed facial expression reconstruction and embedding within a fully rigged avatar model. Additionally, we present a Unity-integrated Gaussian Splatting Avatar Editor, offering a user-friendly environment for VR/AR application development. Experimental results validate the effectiveness of our preprocessing pipeline in standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity, highlighting the scalability and practicality of our approach.

[CV-38] All-in-One Transferring Image Compression from Human Perception to Multi-Machine Perception

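[Quick read]: This paper addresses the efficient transfer of Learned Image Compression (LIC) models from human perception to machine perception. Existing approaches typically adapt LIC to downstream tasks one task at a time, which is inefficient, lacks task interaction, and produces multiple task-specific bitstreams. The key solution is an asymmetric adaptor framework that supports multi-task adaptation within a single model: a shared adaptor learns general semantic features, and task-specific adaptors preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, the method achieves strong performance across multiple tasks while maintaining compression efficiency.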

Link: https://arxiv.org/abs/2504.12997
Authors: Jiancheng Zhao, Xiang Ji, Zhuoxiao Li, Zunian Wan, Weihang Ran, Mingze Ma, Muyao Niu, Yifan Zhan, Cheng-Ching Tseng, Yinqiang Zheng
Affiliations: The University of Tokyo, Tokyo, Japan; Peking University, Beijing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures


Abstract:Efficiently transferring Learned Image Compression (LIC) models from human perception to machine perception is an emerging challenge in vision-centric representation learning. Existing approaches typically adapt LIC to downstream tasks in a single-task manner, which is inefficient, lacks task interaction, and results in multiple task-specific bitstreams. To address these limitations, we propose an asymmetric adaptor framework that supports multi-task adaptation within a single model. Our method introduces a shared adaptor to learn general semantic features and task-specific adaptors to preserve task-level distinctions. With only lightweight plug-in modules and a frozen base codec, our method achieves strong performance across multiple tasks while maintaining compression efficiency. Experiments on the PASCAL-Context benchmark demonstrate that our method outperforms both Fully Fine-Tuned and other Parameter Efficient Fine-Tuned (PEFT) baselines, validating the effectiveness of multi-vision transferring.

[CV-39] Enhancing Cocoa Pod Disease Classification via Transfer Learning and Ensemble Methods: Toward Robust Predictive Modeling

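[Quick read]: This paper addresses image-based classification of cocoa pod diseases, with the aim of remaining robust to variations in lighting, orientation, and disease severity. It combines transfer learning with three ensemble learning strategies: Bagging, Boosting, and Stacking. Pre-trained convolutional neural networks (VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception) are fine-tuned and used as base learners on a balanced, augmented dataset of 6,000 images covering three categories: Black Pod Rot, Pod Borer, and Healthy. Bagging performed best, with a reported test accuracy of 100%, ahead of Boosting (97%) and Stacking (92%). The authors conclude that combining transfer learning with ensemble techniques improves model generalization and reliability, a promising direction for precision agriculture and automated crop disease management.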

Link: https://arxiv.org/abs/2504.12992
Authors: Devina Anduyan, Nyza Cabillo, Navy Gultiano, Mark Phil Pacot
Affiliations: Caraga State University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This study presents an ensemble-based approach for cocoa pod disease classification by integrating transfer learning with three ensemble learning strategies: Bagging, Boosting, and Stacking. Pre-trained convolutional neural networks, including VGG16, VGG19, ResNet50, ResNet101, InceptionV3, and Xception, were fine-tuned and employed as base learners to detect three disease categories: Black Pod Rot, Pod Borer, and Healthy. A balanced dataset of 6,000 cocoa pod images was curated and augmented to ensure robustness against variations in lighting, orientation, and disease severity. The performance of each ensemble method was evaluated using accuracy, precision, recall, and F1-score. Experimental results show that Bagging consistently achieved superior classification performance with a test accuracy of 100%, outperforming Boosting (97%) and Stacking (92%). The findings confirm that combining transfer learning with ensemble techniques improves model generalization and reliability, making it a promising direction for precision agriculture and automated crop disease management.
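
Bagging has two parts: each base learner is trained on a bootstrap resample of the training set, and predictions are aggregated across learners. The sketch below shows both parts in isolation. It assumes soft voting (averaging class probabilities), which the abstract does not specify, and the function names are hypothetical; the CNN training itself is omitted.

```python
import numpy as np


def bootstrap_indices(n_samples, n_models, seed=0):
    """Draw one with-replacement resample of the training indices per base learner."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_models)]


def bagging_predict(prob_list):
    """Average class probabilities over base learners and take the argmax.

    prob_list: list of (n_images, n_classes) arrays, one per base learner.
    Returns (predicted_labels, mean_probabilities).
    """
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs.argmax(axis=1), mean_probs
```

Averaging reduces the variance of individual learners, which is the usual explanation for why bagging helps when base models overfit differently to their resamples.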

[CV-40] MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection

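[Quick read]: This paper addresses data scarcity in industrial anomaly detection, where real defect images are rare and unpredictable. Existing synthesis strategies based on naive cut-and-paste or inpainting ignore the physical causes of defects and therefore produce inconsistent, low-fidelity anomalies that limit generalization to real-world scenes. The paper proposes a synthesis and optimization pipeline whose key idea is to generate synthetic anomalies under the guidance of math-physics models, refine them with a two-stage Coarse-to-Fine approach, and train with a bi-level optimization strategy driven by a Synthesis Quality Estimator (SQE). The first stage (npcF) enforces PDE-based consistency to obtain a globally coherent anomaly structure, and the second stage (npcF++) improves local fidelity using wavelet transforms and boundary synergy blocks; SQE-driven weighting gives high-quality synthetic samples greater emphasis during training. On the MVTec AD, VisA, and BTAD benchmarks, the proposed MaPhC2F and BiSQAD achieve state-of-the-art image-AUROC and pixel-AUROC.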

Link: https://arxiv.org/abs/2504.12970
Authors: Long Qian, Bingke Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Anomaly detection is a crucial task in computer vision, yet collecting real-world defect images is inherently difficult due to the rarity and unpredictability of anomalies. Consequently, researchers have turned to synthetic methods for training data augmentation. However, existing synthetic strategies (e.g., naive cut-and-paste or inpainting) overlook the underlying physical causes of defects, leading to inconsistent, low-fidelity anomalies that hamper model generalization to real-world complexities. In this thesis, we introduced a novel pipeline that generates synthetic anomalies through Math-Physics model guidance, refines them via a Coarse-to-Fine approach and employs a bi-level optimization strategy with a Synthesis Quality Estimator(SQE). By incorporating physical modeling of cracks, corrosion, and deformation, our method produces realistic defect masks, which are subsequently enhanced in two phases. The first stage (npcF) enforces a PDE-based consistency to achieve a globally coherent anomaly structure, while the second stage (npcF++) further improves local fidelity using wavelet transforms and boundary synergy blocks. Additionally, we leverage SQE-driven weighting, ensuring that high-quality synthetic samples receive greater emphasis during training. To validate our approach, we conducted comprehensive experiments on three widely adopted industrial anomaly detection benchmarks: MVTec AD, VisA, and BTAD. Across these datasets, the proposed pipeline achieves state-of-the-art (SOTA) results in both image-AUROC and pixel-AUROC, confirming the effectiveness of our MaPhC2F and BiSQAD.

[CV-41] Vision and Language Integration for Domain Generalization

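[Quick read]: This paper addresses a key challenge in domain generalization: learning a domain-invariant feature space from source domains despite domain gaps, so that the model generalizes robustly to unknown target domains. It argues that a reliable common image feature space is hard to find because images lack suitable basic units. To address this, it proposes VLCA (Vision-Language Cross-Alignment), which combines language space and vision space and connects multiple image domains by using the semantic space as a bridge.

The key is to exploit the semantic completeness of language and the intuitiveness of images. In language space, the semantic representation of relations between categories is captured through word vector distance. In vision space, the common pattern of sample features from the same class is explored through low-rank approximation. Finally, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the method.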

Link: https://arxiv.org/abs/2504.12966
Authors: Yanmei Wang, Xiyao Liu, Fupeng Chu, Zhi Han
Affiliations: SIA (Shenyang Institute of Automation)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:Domain generalization aims at training on source domains to uncover a domain-invariant feature space, allowing the model to perform robust generalization ability on unknown target domains. However, due to domain gaps, it is hard to find reliable common image feature space, and the reason for that is the lack of suitable basic units for images. Different from image in vision space, language has comprehensive expression elements that can effectively convey semantics. Inspired by the semantic completeness of language and intuitiveness of image, we propose VLCA, which combine language space and vision space, and connect the multiple image domains by using semantic space as the bridge domain. Specifically, in language space, by taking advantage of the completeness of language basic units, we tend to capture the semantic representation of the relations between categories through word vector distance. Then, in vision space, by taking advantage of the intuitiveness of image features, the common pattern of sample features with the same class is explored through low-rank approximation. In the end, the language representation is aligned with the vision representation through the multimodal space of text and image. Experiments demonstrate the effectiveness of the proposed method.
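
The two primitives mentioned in the abstract, word vector distance in language space and low-rank approximation in vision space, can be sketched with NumPy. The truncated SVD and the cosine distance used below are standard choices assumed here for illustration; the abstract does not say which distance or factorization the authors use, and the function names are hypothetical.

```python
import numpy as np


def low_rank_approximation(features, rank):
    """Best rank-`rank` approximation of a feature matrix via truncated SVD.

    features: (n_samples, dim) features of samples from one class.
    """
    U, s, Vt = np.linalg.svd(features, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def cosine_distance_matrix(word_vectors):
    """Pairwise cosine distances between class-name word vectors (one per row)."""
    unit = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T
```

The low-rank part keeps the directions shared by same-class samples and discards sample-specific variation, while the distance matrix encodes how close categories are semantically.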

[CV-42] Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction CVPR2025

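[Quick read]: This paper addresses the underexplored problem of temporal fusion in vision-based 3D semantic occupancy prediction (VisionOcc). It introduces GDFusion, a temporal fusion method that systematically examines the whole VisionOcc pipeline and identifies three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues make distinct contributions in different modules of the VisionOcc framework. To fuse temporal signals across heterogeneous representations, the paper proposes a novel fusion strategy that reinterprets the formulation of vanilla RNNs: gradient descent on features unifies the integration of diverse temporal information and embeds the proposed cues into the network. Experiments on nuScenes show that GDFusion significantly outperforms established baselines; on the Occ3D benchmark it improves mIoU by 1.4%-4.8% and reduces memory consumption by 27%-72%.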

Link: https://arxiv.org/abs/2504.12959
Authors: Dubing Chen, Huan Zheng, Jin Fang, Xingping Dong, Xianfei Li, Wenlong Liao, Tao He, Pai Peng, Jianbing Shen
Affiliations: SKL-IOTSC, CIS, University of Macau; Wuhan University; COWAROBOT Co. Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025


Abstract:We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). GDFusion opens up the underexplored aspects of temporal fusion within the VisionOcc framework, focusing on both temporal cues and fusion strategies. It systematically examines the entire VisionOcc pipeline, identifying three fundamental yet previously overlooked temporal cues: scene-level consistency, motion calibration, and geometric complementation. These cues capture diverse facets of temporal evolution and make distinct contributions across various modules in the VisionOcc framework. To effectively fuse temporal signals across heterogeneous representations, we propose a novel fusion strategy by reinterpreting the formulation of vanilla RNNs. This reinterpretation leverages gradient descent on features to unify the integration of diverse temporal information, seamlessly embedding the proposed temporal cues into the network. Extensive experiments on nuScenes demonstrate that GDFusion significantly outperforms established baselines. Notably, on Occ3D benchmark, it achieves 1.4%-4.8% mIoU improvements and reduces memory consumption by 27%-72%.
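
The link between a recurrent update and gradient descent on features can be seen in a toy case. Assume the loss is the quadratic 0.5 * ||h - x_t||^2 between the history feature h and the current feature x_t; one gradient step on h then yields h - lr * (h - x_t) = (1 - lr) * h + lr * x_t, which is a simple recurrent blend of past and present. The loss, the names, and the learning rate below are assumptions for illustration; the paper defines its own objectives for each temporal cue.

```python
import numpy as np


def gd_fusion_step(history, current, lr=0.3):
    """One gradient step on L(h) = 0.5 * ||h - current||^2, starting at h = history."""
    grad = history - current          # dL/dh evaluated at h = history
    return history - lr * grad        # equals (1 - lr) * history + lr * current


def fuse_sequence(frames, lr=0.3):
    """Fold a sequence of per-frame features into one temporally fused feature."""
    state = frames[0].copy()
    for frame in frames[1:]:
        state = gd_fusion_step(state, frame, lr)
    return state
```

Swapping in a different loss changes what "agreeing with the current frame" means, which is how a single update rule can, in principle, cover heterogeneous temporal signals.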

[CV-43] Disentangling Polysemantic Channels in Convolutional Neural Networks CVPR2025

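[Quick read]: This paper addresses the difficulty of interpreting polysemantic channels in convolutional neural networks (CNNs). Such channels encode several distinct concepts, which makes the network's internal mechanisms hard to explain. The paper proposes an algorithm that disentangles a specific kind of polysemantic channel into multiple channels, each responding to a single concept. The key is that different concepts within the same channel exhibit distinct activation patterns in the previous layer; the algorithm uses this to restructure the network's weights. Disentangling these features improves the interpretability of CNNs and benefits explanatory techniques such as feature visualization.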

Link: https://arxiv.org/abs/2504.12939
Authors: Robin Hesse, Jonas Fischer, Simone Schaub-Meyer, Stefan Roth
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2025 Workshop on Mechanistic Interpretability for Vision (MIV). Code: this https URL


Abstract:Mechanistic interpretability is concerned with analyzing individual components in a (convolutional) neural network (CNN) and how they form larger circuits representing decision mechanisms. These investigations are challenging since CNNs frequently learn polysemantic channels that encode distinct concepts, making them hard to interpret. To address this, we propose an algorithm to disentangle a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures weights in a CNN, utilizing that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs, ultimately improving explanatory techniques such as feature visualizations.
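
A toy version of the weight restructuring helps make the idea concrete. Suppose a channel computes a weighted sum of previous-layer channels followed by ReLU, and suppose two concepts drive it through disjoint sets of previous-layer channels. Splitting the weight vector along that partition gives two channels that each respond to one concept. The partition mask is taken as given here, whereas the paper derives it from activation patterns; the 1x1, bias-free setting and all names are assumptions.

```python
import numpy as np


def relu(z):
    """Elementwise rectified linear unit."""
    return np.maximum(z, 0.0)


def split_channel(weights, mask_a):
    """Split one channel's incoming weights into two channels.

    weights: (C_in,) incoming weights of the polysemantic channel
    mask_a:  (C_in,) booleans, True where the input channel belongs to concept A
    Returns (w_a, w_b) with w_a + w_b == weights.
    """
    mask_a = np.asarray(mask_a, dtype=bool)
    w_a = np.where(mask_a, weights, 0.0)   # keeps only concept-A inputs
    w_b = np.where(mask_a, 0.0, weights)   # keeps only concept-B inputs
    return w_a, w_b
```

With an input that activates only concept A's channels, the original channel and the new A channel fire identically while the B channel stays silent, so each new channel can be visualized on its own.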

[CV-44] Efficient Masked Image Compression with Position-Indexed Self-Attention

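[Quick read]: This paper addresses redundant computation in existing semantically structured image compression methods. These methods structure the bitstream only after encoding, so the coding process still operates on the entire image even though much of the encoded information will never be transmitted. Moreover, traditional image compression requires a two-dimensional image as input, so even when unimportant regions are set to zero by a semantic mask, they still take part in subsequent computations. The paper proposes an image compression method based on a position-indexed self-attention mechanism whose key idea is to encode and decode only the visible parts of the masked image, which significantly reduces computational cost.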

Link: https://arxiv.org/abs/2504.12923
Authors: Chengjie Dai, Tiantian Song, Hui Tang, Fangdong Chen, Bowei Yang, Guanghua Song
Affiliations: Zhejiang University; University College London; Hikvision Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

点击查看摘要

Abstract:In recent years, image compression for high-level vision tasks has attracted considerable attention from researchers. Given that object information in images plays a far more crucial role in downstream tasks than background information, some studies have proposed semantically structuring the bitstream to selectively transmit and reconstruct only the information required by these tasks. However, such methods structure the bitstream after encoding, meaning that the coding process still relies on the entire image, even though much of the encoded information will not be transmitted. This leads to redundant computations. Traditional image compression methods require a two-dimensional image as input, and even if the unimportant regions of the image are set to zero by applying a semantic mask, these regions still participate in subsequent computations as part of the image. To address such limitations, we propose an image compression method based on a position-indexed self-attention mechanism that encodes and decodes only the visible parts of the masked image. Compared to existing semantic-structured compression methods, our approach can significantly reduce computational costs.
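
The idea of processing only the visible parts of a masked image can be illustrated with a toy gather/scatter step. This is a minimal sketch, not the paper's codec: the helper names `gather_visible` and `scatter_back`, the string "patches", and the upper-casing that stands in for encoding and decoding are all hypothetical. It shows only how keeping each patch's position index lets masked patches be skipped entirely and the result be put back in place afterwards.

```python
def gather_visible(patches, mask):
    """Keep only the patches whose mask entry is truthy, with their positions."""
    positions = [i for i, keep in enumerate(mask) if keep]
    tokens = [patches[i] for i in positions]
    return positions, tokens


def scatter_back(positions, tokens, total, fill=None):
    """Place processed tokens back at their original positions."""
    out = [fill] * total
    for pos, tok in zip(positions, tokens):
        out[pos] = tok
    return out


# Hypothetical 2x2 patch grid flattened row by row; 1 = visible, 0 = masked.
patches = ["sky", "dog", "grass", "ball"]
mask = [0, 1, 0, 1]

positions, tokens = gather_visible(patches, mask)
# A real codec would run its encoder and decoder on `tokens` only;
# upper-casing stands in for that processing here.
processed = [t.upper() for t in tokens]
restored = scatter_back(positions, processed, total=len(patches))
```

In this sketch the work done scales with the number of visible patches rather than with the full image, which is the source of the savings the abstract describes.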

[CV-45] Real-time High-fidelity Gaussian Human Avatars with Position-based Interpolation of Spatially Distributed MLPs CVPR2025

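[Quick Read]: This paper addresses two main problems in existing methods for reconstructing high-fidelity Gaussian human avatars: a single multilayer perceptron (MLP) struggles to capture pose-dependent appearance details, while methods built on computationally intensive neural networks can reconstruct high-fidelity appearance but their rendering drops below real time. The paper proposes a new Gaussian human avatar representation that preserves high-fidelity pose-dependent appearance details and still renders in real time.

The key to the approach is a set of spatially distributed MLPs placed explicitly at different positions on the human body. The parameters of each Gaussian are obtained by distance-weighted interpolation of the outputs of its nearby MLPs. To keep interpolation from forcing Gaussian properties to vary smoothly, the paper defines a set of Gaussian offset bases for each Gaussian, and a linear combination of these bases represents the offset of the Gaussian's properties from its neutral properties. The MLPs are trained to output the coefficients of these bases, so although the coefficients vary smoothly, the bases are learned without constraints. Smoothly varying coefficients combined with freely learned bases can still produce distinctly different property offsets, which lets the model learn high-frequency spatial signals. In addition, control points constrain the Gaussians to a surface layer rather than letting them spread irregularly inside the body, which improves generalization to novel poses. Experiments show that, compared with the state-of-the-art method, the approach achieves finer appearance detail and significantly faster rendering under novel views and novel poses.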

Link: https://arxiv.org/abs/2504.12909
Authors: Youyi Zhan, Tianjia Shao, Yin Yang, Kun Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University; University of Utah
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Many works have succeeded in reconstructing Gaussian human avatars from multi-view videos. However, they either struggle to capture pose-dependent appearance details with a single MLP, or rely on a computationally intensive neural network to reconstruct high-fidelity appearance but with rendering performance degraded to non-real-time. We propose a novel Gaussian human avatar representation that can reconstruct high-fidelity pose-dependence appearance with details and meanwhile can be rendered in real time. Our Gaussian avatar is empowered by spatially distributed MLPs which are explicitly located on different positions on human body. The parameters stored in each Gaussian are obtained by interpolating from the outputs of its nearby MLPs based on their distances. To avoid undesired smooth Gaussian property changing during interpolation, for each Gaussian we define a set of Gaussian offset basis, and a linear combination of basis represents the Gaussian property offsets relative to the neutral properties. Then we propose to let the MLPs output a set of coefficients corresponding to the basis. In this way, although Gaussian coefficients are derived from interpolation and change smoothly, the Gaussian offset basis is learned freely without constraints. The smoothly varying coefficients combined with freely learned basis can still produce distinctly different Gaussian property offsets, allowing the ability to learn high-frequency spatial signals. We further use control points to constrain the Gaussians distributed on a surface layer rather than allowing them to be irregularly distributed inside the body, to help the human avatar generalize better when animated under novel poses. Compared to the state-of-the-art method, our method achieves better appearance quality with finer details while the rendering speed is significantly faster under novel views and novel poses.
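
The interpolation-plus-basis scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: inverse-distance weighting is assumed here (the abstract says only that interpolation is "based on their distances"), the MLP outputs are given as fixed lists rather than computed by networks, and the function names are hypothetical.

```python
import math


def interpolate_coefficients(gaussian_pos, mlp_positions, mlp_outputs, eps=1e-8):
    """Blend the coefficient vectors of nearby MLPs with inverse-distance weights.

    Inverse-distance weighting is an assumption made for this sketch.
    """
    weights = [1.0 / (math.dist(gaussian_pos, p) + eps) for p in mlp_positions]
    total = sum(weights)
    k = len(mlp_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, mlp_outputs)) / total
        for j in range(k)
    ]


def property_offset(coeffs, basis):
    """Linear combination of per-Gaussian offset basis vectors."""
    dim = len(basis[0])
    return [sum(c * b[d] for c, b in zip(coeffs, basis)) for d in range(dim)]


# Two hypothetical MLPs on the body, each emitting two coefficients.
mlp_positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mlp_outputs = [[1.0, 0.0], [0.0, 1.0]]
# Per-Gaussian basis: two freely chosen 3D offset directions.
basis = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]]

coeffs = interpolate_coefficients((1.0, 0.0, 0.0), mlp_positions, mlp_outputs)
offset = property_offset(coeffs, basis)
```

Because each Gaussian carries its own basis, two neighbouring Gaussians with almost identical interpolated coefficients can still receive very different offsets, which is the property the summary highlights.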

[CV-46] Taccel: Scaling Up Vision-based Tactile Robotics via High-performance GPU Simulation

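[Quick Read]: This paper addresses the challenges that vision-based tactile sensors (VBTSs) pose for robotic applications because of their complex physical characteristics and visual signal processing requirements, and in particular the lack of efficient and accurate simulation tools, which has limited the scale and scope of tactile robotics research. The key solution is Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects accurately, running 18 times faster than real time across thousands of parallel environments. Unlike earlier simulators that run slower than real time with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals, and supports flexible robot-sensor configurations through user-friendly APIs. The paper validates the platform on object recognition, robotic grasping, and articulated object manipulation, and reports successful sim-to-real transfer, positioning Taccel as a tool for scaling up tactile robotics research and development.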

Link: https://arxiv.org/abs/2504.12908
Authors: Yuyang Li, Wenxin Du, Chang Yu, Puhao Li, Zihang Zhao, Tengyu Liu, Chenfanfu Jiang, Yixin Zhu, Siyuan Huang
Affiliations: Institute for AI, Peking University; State Key Lab of General AI, Beijing Institute for General AI; AIVC Laboratory, University of California, Los Angeles
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 7 figures

Abstract:Tactile sensing is crucial for achieving human-level robotic capabilities in manipulation tasks. VBTSs have emerged as a promising solution, offering high spatial resolution and cost-effectiveness by sensing contact through camera-captured deformation patterns of elastic gel pads. However, these sensors’ complex physical characteristics and visual signal processing requirements present unique challenges for robotic applications. The lack of efficient and accurate simulation tools for VBTS has significantly limited the scale and scope of tactile robotics research. Here we present Taccel, a high-performance simulation platform that integrates IPC and ABD to model robots, tactile sensors, and objects with both accuracy and unprecedented speed, achieving an 18-fold acceleration over real-time across thousands of parallel environments. Unlike previous simulators that operate at sub-real-time speeds with limited parallelization, Taccel provides precise physics simulation and realistic tactile signals while supporting flexible robot-sensor configurations through user-friendly APIs. Through extensive validation in object recognition, robotic grasping, and articulated object manipulation, we demonstrate precise simulation and successful sim-to-real transfer. These capabilities position Taccel as a powerful tool for scaling up tactile robotics research and development. By enabling large-scale simulation and experimentation with tactile sensing, Taccel accelerates the development of more capable robotic systems, potentially transforming how robots interact with and understand their physical environment.

[CV-47] Second-order Optimization of Gaussian Splats with Importance Sampling

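[Quick Read]: 3D Gaussian Splatting (3DGS) offers high rendering quality and fast inference for novel view synthesis, but it relies mainly on first-order optimizers such as Adam, which leads to long training times; this paper addresses that limitation. The key solution is a second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), tailored specifically to Gaussian Splatting. The core insight is that the Jacobian in 3DGS is highly sparse, because each Gaussian affects only a limited number of pixels. The paper exploits this sparsity with a matrix-free, GPU-parallelized LM optimization. It also proposes sampling strategies for the camera views and the loss function, and consequently the normal equation, which greatly reduce computational complexity. An effective heuristic for choosing the learning rate avoids the expensive computation of line search and improves the convergence rate of the second-order approximation. As a result, the method is 3 times faster than standard LM and about 6 times faster than Adam when the Gaussian count is low, while remaining competitive at moderate counts.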

Link: https://arxiv.org/abs/2504.12905
Authors: Hamza Pehlivan, Andrea Boscolo Camiletto, Lin Geng Foo, Marc Habermann, Christian Theobalt
Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a 3× speedup over standard LM and outperforms Adam by ~6× when the Gaussian count is low while remaining competitive for moderate counts. Project Page: this https URL
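
A matrix-free Levenberg-Marquardt step solved with Conjugate Gradient can be sketched on a toy problem. This is a generic illustration, not the paper's GPU implementation: it omits the view and loss sampling and the learning-rate heuristic, and the line-fitting problem and the names `lm_step`, `jvp`, and `vjp` are assumptions made for the example. The point is that the damped normal equation (JᵀJ + λI)δ = −Jᵀr is solved using only Jacobian-vector products, so J is never stored.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_A, b, iters=50, tol=1e-20):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Ap = apply_A(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def lm_step(params, residual_fn, jvp, vjp, damping):
    """One LM update: solve (J^T J + damping * I) delta = -J^T r with CG."""
    r = residual_fn(params)
    g = vjp(params, r)  # J^T r

    def apply_A(v):
        JtJv = vjp(params, jvp(params, v))  # J^T (J v), no explicit matrix
        return [a + damping * b for a, b in zip(JtJv, v)]

    delta = conjugate_gradient(apply_A, [-gi for gi in g])
    return [p + d for p, d in zip(params, delta)]


# Toy problem (assumption): fit y = a * x + b to points on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def residual_fn(p):
    return [p[0] * x + p[1] - y for x, y in zip(xs, ys)]


def jvp(p, v):  # J v
    return [v[0] * x + v[1] for x in xs]


def vjp(p, u):  # J^T u
    return [sum(ui * x for ui, x in zip(u, xs)), sum(u)]


params = [0.0, 0.0]
for _ in range(10):
    params = lm_step(params, residual_fn, jvp, vjp, damping=1e-3)
```

In a sparse setting such as the one the abstract describes, `jvp` and `vjp` would touch only the nonzero entries of the Jacobian, which is where a matrix-free formulation pays off.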

[CV-48] Tree-NeRV: A Tree-Structured Neural Representation for Efficient Non-Uniform Video Encoding

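[Quick Read]: This paper addresses the failure of existing methods based on Implicit Neural Representations for Videos (NeRV) to fully exploit temporal redundancy. These methods rely on uniform sampling along the temporal axis, which leads to suboptimal rate-distortion (RD) performance. To overcome this limitation, the paper proposes Tree-NeRV, a tree-structured feature representation for efficient and adaptive video encoding. The key is to organize the feature representations in a Binary Search Tree (BST), which enables non-uniform sampling along the temporal axis, and to use an optimization-driven sampling strategy that dynamically allocates higher sampling density to regions with greater temporal variation, yielding more efficient compression and better reconstruction quality.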

Link: https://arxiv.org/abs/2504.12899
Authors: Jiancheng Zhao, Yifan Zhan, Qingtian Zhu, Mingze Ma, Muyao Niu, Zunian Wan, Xiang Ji, Yinqiang Zheng
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 14 figures

Abstract:Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. However, existing NeRV-based methods do not fully exploit temporal redundancy, as they rely on uniform sampling along the temporal axis, leading to suboptimal rate-distortion (RD) performance. To address this limitation, we propose Tree-NeRV, a novel tree-structured feature representation for efficient and adaptive video encoding. Unlike conventional approaches, Tree-NeRV organizes feature representations within a Binary Search Tree (BST), enabling non-uniform sampling along the temporal axis. Additionally, we introduce an optimization-driven sampling strategy, dynamically allocating higher sampling density to regions with greater temporal variation. Extensive experiments demonstrate that Tree-NeRV achieves superior compression efficiency and reconstruction quality, outperforming prior uniform sampling-based methods. Code will be released.
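
A BST keyed by timestamp is one way to store features at non-uniform positions along the temporal axis. The sketch below is an illustration under assumptions, not the Tree-NeRV implementation: the tree is unbalanced, the features are plain lists, and linear interpolation between the two nearest stored timestamps is an assumed query rule. All names are hypothetical.

```python
class Node:
    def __init__(self, t, feature):
        self.t, self.feature = t, feature
        self.left = self.right = None


def insert(root, t, feature):
    """Insert a (timestamp, feature) pair into an unbalanced BST."""
    if root is None:
        return Node(t, feature)
    if t < root.t:
        root.left = insert(root.left, t, feature)
    elif t > root.t:
        root.right = insert(root.right, t, feature)
    else:
        root.feature = feature
    return root


def neighbors(root, t):
    """Return the nodes with the nearest timestamps <= t and >= t."""
    lo = hi = None
    node = root
    while node is not None:
        if t < node.t:
            hi, node = node, node.left
        elif t > node.t:
            lo, node = node, node.right
        else:
            return node, node
    return lo, hi


def query(root, t):
    """Feature at time t, linearly interpolated between stored neighbors."""
    lo, hi = neighbors(root, t)
    if lo is None:
        return hi.feature
    if hi is None or lo is hi:
        return lo.feature
    w = (t - lo.t) / (hi.t - lo.t)
    return [(1 - w) * a + w * b for a, b in zip(lo.feature, hi.feature)]


# Denser keys around t = 0.5 to 0.6, where the hypothetical video changes fastest.
root = None
for t, feat in [(0.5, [10.0]), (0.0, [0.0]), (1.0, [30.0]), (0.6, [20.0])]:
    root = insert(root, t, feat)

slow_region = query(root, 0.25)
fast_region = query(root, 0.55)
```

Adding a key is a single insertion, so sampling density can be raised locally without re-indexing the rest of the sequence.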

[CV-49] SC3EF: A Joint Self-Correlation and Cross-Correspondence Estimation Framework for Visible and Thermal Image Registration

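[Quick Read]: This paper addresses accurate registration of cross-modal visible and thermal (RGB-T) images, which is challenging because of the considerable modality differences. The key contribution is a joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF) that combines local representative features with global contextual cues to generate cross-modal correspondences. Specifically, the authors design a convolution-transformer-based pipeline that extracts local representative features and encodes global intra-modality correlations, which are used to estimate inter-modality correspondences between unaligned visible and thermal images. After merging the local and global correspondence estimates, a hierarchical optical flow estimation decoder progressively refines the dense correspondence maps to improve registration accuracy. Experiments show that the method outperforms current state-of-the-art methods on representative RGB-T datasets and generalizes competitively to challenging scenarios such as large parallax, severe occlusion, and adverse weather, as well as to other cross-modal datasets (e.g., RGB-N and RGB-D).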

Link: https://arxiv.org/abs/2504.12869
Authors: Xi Tong, Xing Luo, Jiangxin Yang, Yanpeng Cao
Affiliations: State Key Laboratory of Fluid Power and Mechatronic Systems and Laboratory of Advanced Manufacturing Technology of Zhejiang Province, School of Mechanical Engineering, Zhejiang University; Section of Visual Computing and Creative Media, School of Performance, Visualization, and Fine Arts, Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multispectral imaging plays a critical role in a range of intelligent transportation applications, including advanced driver assistance systems (ADAS), traffic monitoring, and night vision. However, accurate visible and thermal (RGB-T) image registration poses a significant challenge due to the considerable modality differences. In this paper, we present a novel joint Self-Correlation and Cross-Correspondence Estimation Framework (SC3EF), leveraging both local representative features and global contextual cues to effectively generate RGB-T correspondences. For this purpose, we design a convolution-transformer-based pipeline to extract local representative features and encode global correlations of intra-modality for inter-modality correspondence estimation between unaligned visible and thermal images. After merging the local and global correspondence estimation results, we further employ a hierarchical optical flow estimation decoder to progressively refine the estimated dense correspondence maps. Extensive experiments demonstrate the effectiveness of our proposed method, outperforming the current state-of-the-art (SOTA) methods on representative RGB-T datasets. Furthermore, it also shows competitive generalization capabilities across challenging scenarios, including large parallax, severe occlusions, adverse weather, and other cross-modal datasets (e.g., RGB-N and RGB-D).

[CV-50] Computer-Aided Design of Personalized Occlusal Positioning Splints Using Multimodal 3D Data

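[Quick Read]: This paper addresses the design and precision assessment of customized dental appliances, specifically occlusal positioning splints used to treat dysfunctions of the stomatognathic system. The key to the solution is a computer-aided design method based on a transformation matrix that represents the therapeutic change in mandibular position. A specialist defines this matrix on a virtual patient model reconstructed from intraoral scans, cone-beam computed tomography (CBCT), 3D facial scans, and digitized plaster models. The paper also introduces a virtual embossing mechanism that resolves surface conflicts, so that the splint accurately reproduces occlusal conditions in the therapeutic position. It shows how transformation matrices can be acquired through clinical tools and intraoral devices, and evaluates the accuracy of the designed and printed splints with profile and surface deviation analysis. The method enables reproducible, patient-specific splint fabrication and opens new possibilities in diagnostics, multimodal image registration, and quantification of occlusal discrepancies.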

Link: https://arxiv.org/abs/2504.12868
Authors: Agnieszka Anna Tomaka, Leszek Luchowski, Michał Tarnawski, Dariusz Pojda
Affiliations: Institute of Theoretical and Applied Informatics, Polish Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contemporary digital technology has a pivotal role in the design of customized medical appliances, including occlusal splints used in the treatment of stomatognathic system dysfunctions. We present an approach to computer-aided design and precision assessment of positioning occlusal splints, bridging clinical concepts with current digital dental practice. In our model, a 3D splint is generated based on a transformation matrix that represents the therapeutic change in mandibular position, defined by a specialist using a virtual patient model reconstructed from intraoral scans, CBCT, 3D facial scans and plaster model digitisation. The paper introduces a novel method for generating splints that accurately reproduce occlusal conditions in the therapeutic position, including a mechanism for resolving surface conflicts through virtual embossing. We demonstrate how transformation matrices can be acquired through clinical tools and intraoral devices, and evaluate the accuracy of the designed and printed splints using profile and surface deviation analysis. The proposed method enables reproducible, patient-specific splint fabrication and opens new possibilities in diagnostics, multimodal image registration and quantification of occlusal discrepancies.

[CV-51] 3D-PNAS: 3D Industrial Surface Anomaly Synthesis with Perlin Noise

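[Quick Read]: This paper addresses the underuse of 3D data in industrial anomaly detection, and in particular the difficulty of applying pretrained vision foundation models when real defect samples are scarce. While 2D anomaly generation has advanced significantly, 3D anomaly generation remains largely unexplored, which limits the potential of 3D data in industrial surface quality inspection. The key solution is 3D-PNAS, a novel yet simple method based on Perlin noise and surface parameterization. It generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Parameters such as noise scale, perturbation strength, and the number of octaves give fine-grained control over the generated anomalies, so the method can create diverse defect patterns while keeping the anomalies geometrically plausible and adapted to the surface characteristics of different object types.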

Link: https://arxiv.org/abs/2504.12856
Authors: Yifeng Cheng, Juan Du
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Guangzhou, China
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Large pretrained vision foundation models have shown significant potential in various vision tasks. However, for industrial anomaly detection, the scarcity of real defect samples poses a critical challenge in leveraging these models. While 2D anomaly generation has significantly advanced with established generative models, the adoption of 3D sensors in industrial manufacturing has made leveraging 3D data for surface quality inspection an emerging trend. In contrast to 2D techniques, 3D anomaly generation remains largely unexplored, limiting the potential of 3D data in industrial quality inspection. To address this gap, we propose a novel yet simple 3D anomaly generation method, 3D-PNAS, based on Perlin noise and surface parameterization. Our method generates realistic 3D surface anomalies by projecting the point cloud onto a 2D plane, sampling multi-scale noise values from a Perlin noise field, and perturbing the point cloud along its normal direction. Through comprehensive visualization experiments, we demonstrate how key parameters - including noise scale, perturbation strength, and octaves, provide fine-grained control over the generated anomalies, enabling the creation of diverse defect patterns from pronounced deformations to subtle surface variations. Additionally, our cross-category experiments show that the method produces consistent yet geometrically plausible anomalies across different object types, adapting to their specific surface characteristics. We also provide a comprehensive codebase and visualization toolkit to facilitate future research.
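
The three steps named in the abstract (project to 2D, sample multi-scale Perlin noise, displace along the normal) can be sketched as follows. This is a minimal illustration, not the released 3D-PNAS code: the surface is assumed to be a flat patch so that each point's (x, y) serves directly as its 2D parameterization, the noise is a compact 2D gradient-noise implementation with eight unit gradients, and the function names and default parameter values are hypothetical.

```python
import math
import random


def make_perlin(seed=0):
    """Build a 2D Perlin-style gradient noise function (values roughly in [-1, 1])."""
    rng = random.Random(seed)
    perm = list(range(256))
    rng.shuffle(perm)
    perm += perm
    grads = [(math.cos(2 * math.pi * i / 8), math.sin(2 * math.pi * i / 8))
             for i in range(8)]

    def fade(t):
        return t * t * t * (t * (t * 6 - 15) + 10)

    def lerp(a, b, t):
        return a + t * (b - a)

    def corner(ix, iy, x, y):
        # Dot product of the lattice gradient with the offset to (x, y).
        gx, gy = grads[perm[perm[ix & 255] + (iy & 255)] % 8]
        return gx * (x - ix) + gy * (y - iy)

    def noise(x, y):
        x0, y0 = math.floor(x), math.floor(y)
        u, v = fade(x - x0), fade(y - y0)
        top = lerp(corner(x0, y0, x, y), corner(x0 + 1, y0, x, y), u)
        bottom = lerp(corner(x0, y0 + 1, x, y), corner(x0 + 1, y0 + 1, x, y), u)
        return lerp(top, bottom, v)

    return noise


def fractal_noise(noise, u, v, scale=1.0, octaves=3):
    """Sum several octaves: each doubles the frequency and halves the amplitude."""
    total, amp, freq = 0.0, 1.0, scale
    for _ in range(octaves):
        total += amp * noise(u * freq, v * freq)
        amp *= 0.5
        freq *= 2.0
    return total


def perturb(points, normals, noise, scale=1.0, strength=0.05, octaves=3):
    """Displace each point along its normal by a noise-driven amount."""
    out = []
    for (x, y, z), (nx, ny, nz) in zip(points, normals):
        # Assumption: (x, y) is used directly as the 2D parameterization.
        d = strength * fractal_noise(noise, x, y, scale, octaves)
        out.append((x + d * nx, y + d * ny, z + d * nz))
    return out


noise = make_perlin(seed=0)
points = [(2.0, 3.0, 0.0), (0.3, 0.7, 0.0), (1.4, 2.6, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 3
bumpy = perturb(points, normals, noise, scale=1.0, strength=0.05, octaves=3)
```

Raising `strength` deepens the defect, raising `scale` makes it finer-grained, and adding octaves layers small detail on top of the large-scale shape, which mirrors the three control parameters discussed in the abstract.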

[CV-52] High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion

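[Quick Read]: This paper addresses two main problems of existing image inpainting methods based on generative adversarial network (GAN) inversion. First, they ignore the hard constraint that the unmasked regions of the input and output should be identical, which creates a gap between GAN inversion and image inpainting and degrades performance. Second, they use only a single modality of the input image and neglect other auxiliary cues that could improve results. To address these problems, the paper proposes a new GAN inversion approach, MMInvertFill. Its key components are a multimodal guided encoder with a pre-modulation module and a GAN generator with an FW+ latent space. A gated mask-aware attention module enhances multi-scale structures, the pre-modulation encodes these structures into style vectors, and the FW+ latent space narrows the gap between GAN inversion and image inpainting, resulting in higher-quality inpainting.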

Link: https://arxiv.org/abs/2504.12844
Authors: Libo Zhang, Yongsheng Yu, Jiali Yao, Heng Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCV. arXiv admin note: text overlap with arXiv:2208.11850

Abstract:Generative Adversarial Network (GAN) inversion have demonstrated excellent performance in image inpainting that aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate the realistic regions for missing holes. Despite excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with FW+ latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the FW+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-arts and it supports the completion of out-of-domain images effectively.

[CV-53] ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

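[Quick Read]: This paper addresses efficiency and accuracy in time series classification (TSC). The key to the solution is the adaptive law-based transformation (ALT) algorithm, provided as an open-source Python package. ALT uses variable-length shifted time windows to transform raw time series data into a linearly separable feature space, which lets it capture patterns at different temporal scales. Compared with its predecessor, the linear law-based transformation (LLT), the adaptive approach improves classification performance while keeping computational overhead low, and benchmarks on real-world datasets demonstrate its utility for TSC tasks in physics and related domains.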

Link: https://arxiv.org/abs/2504.12841
Authors: Balázs P. Halmos, Balázs Hajós, Vince Á. Molnár, Marcell T. Kurbucz, Antal Jakovác
Affiliations: Faculty of Engineering and Natural Sciences, Tampere University; Department of Computational Sciences, Wigner Research Centre for Physics; Faculty of Science, Eötvös Loránd University; Institute for Global Prosperity, The Bartlett, University College London; Department of Statistics, Corvinus University of Budapest
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Mathematical Software (cs.MS); Machine Learning (stat.ML)
Comments: 16 pages, 4 figures

Abstract:We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

[CV-54] Image-Editing Specialists: An RLAIF Approach for Diffusion Models

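[Quick Read]: This paper addresses two key challenges in instruction-based image editing: preserving the structure of the input image and aligning semantically with the user prompt. Rather than relying on extensive human annotation or a large curated dataset, it proposes an online reinforcement learning framework that trains specialized instruction-based image-editing diffusion models aligned with human preferences. The resulting models make precise, structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas, and they capture fine nuances of the desired edit through a visual prompt, which simplifies the user's work: only 5 reference images depicting a concept are needed for training. Experiments show that the models can perform intricate edits in complex scenes after just 10 training steps. The method is also applied to robotics, where targeted sim-to-real image edits make simulated environments more visually realistic and more useful as proxies for real-world settings.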

Link: https://arxiv.org/abs/2504.12833
Authors: Elior Benarous, Yilun Du, Heng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users’ efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.

[CV-55] UncAD: Towards Safe End-to-end Autonomous Driving via Online Map Uncertainty

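[Quick Read]: This paper addresses a safety risk in existing end-to-end autonomous driving methods: the perception module models the online map deterministically, so erroneous perception information can be passed on and compromise planning safety. The paper proposes a new paradigm, UncAD. It first estimates the uncertainty of the online map in the perception module, then uses that uncertainty to guide the motion prediction and planning modules to produce multi-modal trajectories. Finally, an uncertainty-collision-aware planning selection strategy based on the online map uncertainty evaluates the candidates and selects the best trajectory. Experiments on the nuScenes dataset show that integrating UncAD into several state-of-the-art end-to-end methods, with only a 1.9% increase in parameters, reduces collision rates by up to 26% and the drivable area conflict rate by up to 42%.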

Link: https://arxiv.org/abs/2504.12826
Authors: Pengxuan Yang, Yupeng Zheng, Qichao Zhang, Kefei Zhu, Zebin Xing, Qiao Lin, Yun-Fu Liu, Zhiguo Su, Dongbin Zhao
Affiliations: Key Laboratory of Safety Intelligent Mining in Non-coal Open-pit Mines, National Mine Safety Administration, Guangzhou, Guangdong, China; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; EACON, Fujian, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at this https URL.

[CV-56] TwoSquared: 4D Generation from 2D Image Pairs

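[Quick Read]: This paper addresses the open challenge of 4D dynamic object generation: producing a physically plausible 4D sequence from only two 2D RGB images that show the beginning and end of an action. The key is to decompose the 4D generation problem into two steps: first, an image-to-3D module built on an existing generative model trained on high-quality 3D assets; second, a physically inspired deformation module that predicts the intermediate movements. The method requires no templates or object-class-specific prior knowledge and accepts in-the-wild images as input, producing texture-consistent and geometry-consistent 4D sequences.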

Link: https://arxiv.org/abs/2504.12825
Authors: Lu Sang, Zehranaz Canfes, Dongliang Cao, Riccardo Marin, Florian Bernard, Daniel Cremers
Affiliations: Technical University of Munich; Munich Center of Machine Learning; University of Bonn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the astonishing progress in generative AI, 4D dynamic object generation remains an open challenge. With limited high-quality training data and heavy computing requirements, the combination of hallucinating unseen geometry together with unseen movement poses great challenges to generative models. In this work, we propose TwoSquared as a method to obtain a 4D physically plausible sequence starting from only two 2D RGB images corresponding to the beginning and end of the action. Instead of directly solving the 4D generation problem, TwoSquared decomposes the problem into two steps: 1) an image-to-3D module generation based on the existing generative model trained on high-quality 3D assets, and 2) a physically inspired deformation module to predict intermediate movements. To this end, our method does not require templates or object-class-specific prior knowledge and can take in-the-wild images as input. In our experiments, we demonstrate that TwoSquared is capable of producing texture-consistent and geometry-consistent 4D sequences only given 2D images.

[CV-57] Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks

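[Quick Read]: This paper addresses a key challenge in scene understanding for automated driving: identifying the relevant objects in a traffic scene while accounting for the spatial-temporal relationships across the whole scene. Earlier work used shallow machine learning models that analyze only single relation chains between object pairs and disregard the broader scene context. The key contribution is a new graph neural network (GNN) architecture that processes entire graph structures, here Qualitative Explainable Graphs (QXGs), to identify relevant objects. By combining qualitative representations with deep learning, the approach outperforms baseline methods on the nuScenes dataset enriched with DriveLM's human-annotated relevance labels, handles the inherent class imbalance, and captures the complete spatial-temporal relationships between all objects in the scene.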

Link: https://arxiv.org/abs/2504.12817
Authors: Nassim Belmecheri, Arnaud Gotlieb, Nadjib Lazaar, Helge Spieker
Affiliations: Simula Research Laboratory; Université Paris-Saclay
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Workshop “Advancing Automated Driving in Highly Interactive Scenarios through Behavior Prediction, Trustworthy AI, and Remote Operations” @ 36th IEEE Intelligent Vehicles Symposium (IV)

Abstract:This paper investigates the integration of graph neural networks (GNNs) with Qualitative Explainable Graphs (QXGs) for scene understanding in automated driving. Scene understanding is the basis for any further reactive or proactive decision-making. Scene understanding and related reasoning is inherently an explanation task: why is another traffic participant doing something, what or who caused their actions? While previous work demonstrated QXGs’ effectiveness using shallow machine learning models, these approaches were limited to analysing single relation chains between object pairs, disregarding the broader scene context. We propose a novel GNN architecture that processes entire graph structures to identify relevant objects in traffic scenes. We evaluate our method on the nuScenes dataset enriched with DriveLM’s human-annotated relevance labels. Experimental results show that our GNN-based approach achieves superior performance compared to baseline methods. The model effectively handles the inherent class imbalance in relevant object identification tasks while considering the complete spatial-temporal relationships between all objects in the scene. Our work demonstrates the potential of combining qualitative representations with deep learning approaches for explainable scene understanding in autonomous driving systems.

[CV-58] AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

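[Quick Read]: This paper addresses aliasing, projection artifacts, and view inconsistencies in 3D Gaussian Splatting (3DGS) for 3D reconstruction, which stem mainly from the simplification of treating splats as 2D entities. The key solution is to evaluate Gaussians fully in 3D throughout the 3DGS pipeline. Specifically, the paper introduces an adaptive 3D smoothing filter to mitigate aliasing, a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum, and tile-based culling promoted to 3D with screen-space planes, which accelerates rendering and reduces sorting costs for hierarchical rasterization. The method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches on out-of-distribution views, while removing aliasing, distortions, and popping artifacts for real-time, artifact-free rendering.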

Link: https://arxiv.org/abs/2504.12811
Authors: Michael Steiner, Thomas Köhler, Lukas Radl, Felix Windisch, Dieter Schmalstieg, Markus Steinberger
Affiliations: Graz University of Technology; University of Stuttgart
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it still faces challenges such as aliasing, projection artifacts, and view inconsistencies, primarily due to the simplification of treating splats as 2D entities. We argue that incorporating full 3D evaluation of Gaussians throughout the 3DGS pipeline can effectively address these issues while preserving rasterization efficiency. Specifically, we introduce an adaptive 3D smoothing filter to mitigate aliasing and present a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum. Furthermore, we promote tile-based culling to 3D with screen-space planes, accelerating rendering and reducing sorting costs for hierarchical rasterization. Our method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches for out-of-distribution views. Our qualitative evaluations further demonstrate the effective removal of aliasing, distortions, and popping artifacts, ensuring real-time, artifact-free rendering.

[CV-59] Saliency-Aware Diffusion Reconstruction for Effective Invisible Watermark Removal

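[Quick Read]: Motivated by the limited robustness of existing watermark embedding techniques, this paper proposes Saliency-Aware Diffusion Reconstruction (SADRE), a framework for removing invisible watermarks from web images. The key is to combine adaptive noise injection, region-specific perturbations, and diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations under the guidance of saliency masks, then uses a reverse diffusion process for high-fidelity image restoration, with noise levels adapted to the watermark strength. This removes the watermark while preserving essential image features, and the framework comes with theoretical stability guarantees and is robust across diverse scenarios.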

Link: https://arxiv.org/abs/2504.12809
Authors: Inzamamul Alam,Md Tanvir Islam,Simon S. Woo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted at The Web Conference 2025

Abstract:As digital content becomes increasingly ubiquitous, the need for robust watermark removal techniques has grown due to the inadequacy of existing embedding techniques, which lack robustness. This paper introduces a novel Saliency-Aware Diffusion Reconstruction (SADRE) framework for watermark elimination on the web, combining adaptive noise injection, region-specific perturbations, and advanced diffusion-based reconstruction. SADRE disrupts embedded watermarks by injecting targeted noise into latent representations guided by saliency masks while preserving essential image features. A reverse diffusion process ensures high-fidelity image restoration, leveraging adaptive noise levels determined by watermark strength. Our framework is theoretically grounded with stability guarantees and achieves robust watermark removal across diverse scenarios. Empirical evaluations on state-of-the-art (SOTA) watermarking techniques demonstrate SADRE's superiority in balancing watermark disruption and image quality. SADRE sets a new benchmark for watermark elimination, offering a flexible and reliable solution for real-world web content. Code is available at this https URL.

[CV-60] Hybrid Dense-UNet201 Optimization for Pap Smear Image Segmentation Using Spider Monkey Optimization

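[Quick Read]: This paper addresses the challenges of Pap smear image segmentation for cervical cancer diagnosis, in particular the difficulty traditional segmentation models have with complex cellular structures and image variations. The key to the solution is a hybrid Dense-UNet201 optimization approach that uses a pretrained DenseNet201 as the encoder of the U-Net architecture and tunes it with a modified spider monkey optimization (SMO) algorithm. Notably, SMO is modified to handle categorical and discrete parameters, which improves model performance. On the SIPaKMeD dataset the method reaches a segmentation accuracy of 96.16%, an IoU of 91.63%, and a Dice coefficient of 95.63%, outperforming baselines such as U-Net, Res-UNet50, and Efficient-UNetB0.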

Link: https://arxiv.org/abs/2504.12807
Authors: Ach Khozaimi,Isnani Darti,Syaiful Anam,Wuryansari Muharini Kusumawinahyu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pap smear image segmentation is crucial for cervical cancer diagnosis. However, traditional segmentation models often struggle with complex cellular structures and variations in pap smear images. This study proposes a hybrid Dense-UNet201 optimization approach that integrates a pretrained DenseNet201 as the encoder for the U-Net architecture and optimizes it using the spider monkey optimization (SMO) algorithm. The Dense-UNet201 model excelled at feature extraction. The SMO was modified to handle categorical and discrete parameters. The SIPaKMeD dataset was used in this study and evaluated using key performance metrics, including loss, accuracy, Intersection over Union (IoU), and Dice coefficient. The experimental results showed that Dense-UNet201 outperformed U-Net, Res-UNet50, and Efficient-UNetB0. SMO Dense-UNet201 achieved a segmentation accuracy of 96.16%, an IoU of 91.63%, and a Dice coefficient score of 95.63%. These findings underscore the effectiveness of image preprocessing, pretrained models, and metaheuristic optimization in improving medical image analysis and provide new insights into cervical cell segmentation methods.
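The IoU and Dice figures quoted above are standard overlap metrics between a predicted mask and a ground-truth mask. As a minimal sketch (not the authors' evaluation code), they can be computed for flat binary masks as follows; the helper name `iou_and_dice` and the empty-mask convention are assumptions made for illustration.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two flat binary masks given as sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    iou = intersection / union if union else 1.0
    total = pred_area + target_area
    dice = 2 * intersection / total if total else 1.0
    return iou, dice


# Toy masks: 2 overlapping pixels, 3 predicted, 3 in the ground truth.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)
```

For any pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; this matches the ordering of the reported 91.63% IoU and 95.63% Dice.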

[CV-61] Sign-In to the Lottery: Reparameterizing Sparse Training From Scratch

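[Quick Read]: This paper tackles the performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training, a major obstacle to efficient deep learning. Building on the Lottery Ticket Hypothesis, the paper notes that PaI hinges on finding a problem-specific parameter initialization and shows that determining the correct parameter signs is sufficient for this, although those signs remain elusive to PaI. To address this, it proposes Sign-In, which uses a dynamic reparameterization that provably induces sign flips. These sign flips are complementary to the ones dense-to-sparse training can accomplish, making Sign-In an orthogonal method. Experiments and theory indicate that Sign-In improves PaI performance, and they also expose the main open challenge in closing the gap between PaI and dense-to-sparse training.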

Link: https://arxiv.org/abs/2504.12801
Authors: Advait Gadhikar,Tom Jacobs,Chao Zhou,Rebekka Burkholz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 9 figures

Abstract:The performance gap between training sparse neural networks from scratch (PaI) and dense-to-sparse training presents a major roadblock for efficient deep learning. According to the Lottery Ticket Hypothesis, PaI hinges on finding a problem-specific parameter initialization. As we show, to this end, determining correct parameter signs is sufficient. Yet, they remain elusive to PaI. To address this issue, we propose Sign-In, which employs a dynamic reparameterization that provably induces sign flips. Such sign flips are complementary to the ones that dense-to-sparse training can accomplish, rendering Sign-In an orthogonal method. While our experiments and theory suggest performance improvements of PaI, they also carve out the main open challenge to close the gap between PaI and dense-to-sparse training.

[CV-62] CAGE-GS: High-fidelity Cage Based 3D Gaussian Splatting Deformation

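[Quick Read]: This paper addresses how to deform a 3D Gaussian Splatting (3DGS) scene in a user-friendly way to create new scenes while preserving the fine details of the original. The key is CAGE-GS, a cage-based 3DGS deformation method that learns a deformation cage from the target shape to guide the geometric transformation of the source scene, so that it aligns seamlessly with a user-defined target shape. To overcome the difficulty of preserving textural appearance caused by the complexity of covariance parameters, the method updates the covariance parameters of each Gaussian with a Jacobian matrix-based strategy, keeping textures consistent after deformation. The approach accommodates many target shape representations, and extensive experiments show that it significantly outperforms existing techniques in both efficiency and deformation quality.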

Link: https://arxiv.org/abs/2504.12800
Authors: Yifei Tong,Runze Tian,Xiao Han,Dingyao Liu,Fenggen Yu,Yan Zhang
Affiliations: Nanjing University; Simon Fraser University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As 3D Gaussian Splatting (3DGS) gains popularity as a 3D representation of real scenes, enabling user-friendly deformation to create novel scenes while preserving fine details from the original 3DGS has attracted significant research attention. We introduce CAGE-GS, a cage-based 3DGS deformation method that seamlessly aligns a source 3DGS scene with a user-defined target shape. Our approach learns a deformation cage from the target, which guides the geometric transformation of the source scene. While the cages effectively control structural alignment, preserving the textural appearance of 3DGS remains challenging due to the complexity of covariance parameters. To address this, we employ a Jacobian matrix-based strategy to update the covariance parameters of each Gaussian, ensuring texture fidelity post-deformation. Our method is highly flexible, accommodating various target shape representations, including texts, images, point clouds, meshes and 3DGS models. Extensive experiments and ablation studies on both public datasets and newly proposed scenes demonstrate that our method significantly outperforms existing techniques in both efficiency and deformation quality.
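The abstract only says that covariances are updated with a "Jacobian matrix-based strategy". A common way to push a Gaussian through a smooth deformation is the first-order rule Σ' = J Σ Jᵀ, where J is the Jacobian of the deformation at the Gaussian's center. The sketch below illustrates that rule in plain Python; whether CAGE-GS uses exactly this form is an assumption, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def deform_covariance(cov, jacobian):
    """First-order covariance update under a local linear map: J @ cov @ J^T."""
    return matmul(matmul(jacobian, cov), transpose(jacobian))


# Toy example: an axis-aligned Gaussian stretched 2x along x and squashed 0.5x along z.
cov = [[1.0, 0.0, 0.0],
       [0.0, 4.0, 0.0],
       [0.0, 0.0, 9.0]]
jacobian = [[2.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 0.5]]
new_cov = deform_covariance(cov, jacobian)
```

Because the result has the form J Σ Jᵀ, it stays symmetric and positive semi-definite, which is what a Gaussian covariance requires.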

[CV-63] TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors

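[Quick Read]: This paper addresses the "transparency-depth dilemma" in 3D reconstruction of transparent surfaces: methods such as 3D Gaussian Splatting (3DGS) pursue photorealistic rendering through standard α-blending at the expense of geometric precision, which leads to considerable depth estimation errors for transparent materials. To tackle this, the paper proposes Transparent Surface Gaussian Splatting (TSGS), a framework whose key idea is to separate geometry learning from appearance refinement. In the geometry learning stage, TSGS uses specular-suppressed inputs to represent surfaces accurately; in the appearance refinement stage, it improves visual fidelity through anisotropic specular modeling while keeping the established opacity fixed to preserve geometric accuracy. To strengthen depth inference, TSGS also introduces a sliding-window first-surface depth extraction method that analyzes the α-blending weights to locate the most likely surface position and computes a robust weighted average depth. Experiments on the TransLab dataset show that TSGS significantly surpasses current leading methods, with more accurate geometric reconstruction and more realistic rendering of transparent objects.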

Link: https://arxiv.org/abs/2504.12799
Authors: Mingwei Li,Pu Pang,Hehe Fan,Hua Huang,Yi Yang
Affiliations: Zhejiang University, Hangzhou, Zhejiang, China; Zhongguancun Academy, Beijing, China; Xi’an Jiaotong University, Xi’an, Shaanxi, China; Beijing Normal University, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Reconstructing transparent surfaces is essential for tasks such as robotic manipulation in labs, yet it poses a significant challenge for 3D reconstruction techniques like 3D Gaussian Splatting (3DGS). These methods often encounter a transparency-depth dilemma, where the pursuit of photorealistic rendering through standard α-blending undermines geometric precision, resulting in considerable depth estimation errors for transparent materials. To address this issue, we introduce Transparent Surface Gaussian Splatting (TSGS), a new framework that separates geometry learning from appearance refinement. In the geometry learning stage, TSGS focuses on geometry by using specular-suppressed inputs to accurately represent surfaces. In the second stage, TSGS improves visual fidelity through anisotropic specular modeling, crucially maintaining the established opacity to ensure geometric accuracy. To enhance depth inference, TSGS employs a first-surface depth extraction method. This technique uses a sliding window over α-blending weights to pinpoint the most likely surface location and calculates a robust weighted average depth. To evaluate the transparent surface reconstruction task under realistic conditions, we collect a TransLab dataset that includes complex transparent laboratory glassware. Extensive experiments on TransLab show that TSGS achieves accurate geometric reconstruction and realistic rendering of transparent objects simultaneously within the efficient 3DGS framework. Specifically, TSGS significantly surpasses current leading methods, achieving a 37.3% reduction in chamfer distance and an 8.0% improvement in F1 score compared to the top baseline. The code and dataset will be released at this https URL.
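To make the first-surface depth idea concrete, here is a rough sketch of one way a sliding window over blending weights could pick out the front surface along a single ray. It is not the TSGS implementation: the window size, the `min_mass` threshold, the "first window that carries enough weight" rule, and all function names are assumptions chosen for illustration. Only the front-to-back blending weight formula is standard.

```python
def blending_weights(alphas):
    """Front-to-back alpha-blending weights: w_i = alpha_i * prod_{j<i}(1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def first_surface_depth(depths, alphas, window=2, min_mass=0.3):
    """Weighted average depth inside the first window whose weight sum reaches min_mass.

    Falls back to the heaviest window if no window reaches the (hypothetical) threshold.
    """
    if not depths or len(depths) != len(alphas):
        raise ValueError("depths and alphas must be non-empty and equally long")
    weights = blending_weights(alphas)
    window = min(window, len(weights))
    sums = [sum(weights[i:i + window]) for i in range(len(weights) - window + 1)]
    heaviest = max(range(len(sums)), key=sums.__getitem__)
    start = next((i for i, s in enumerate(sums) if s >= min_mass), heaviest)
    w = weights[start:start + window]
    d = depths[start:start + window]
    total = sum(w)
    return d[0] if total == 0 else sum(wi * di for wi, di in zip(w, d)) / total


# Toy ray: a semi-transparent glass wall near depth 1, an opaque background near depth 3.
depths = [1.0, 1.1, 3.0, 3.1]
alphas = [0.3, 0.3, 0.9, 0.9]
surface_depth = first_surface_depth(depths, alphas)
weights = blending_weights(alphas)
blended_depth = sum(w * d for w, d in zip(weights, depths)) / sum(weights)
```

In this toy ray the plain α-blended depth lands between the glass and the background, while the windowed estimate stays on the glass, which is the failure mode the abstract calls the transparency-depth dilemma.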

[CV-64] EarthGPT-X: Enabling MLLMs to Flexibly and Comprehensively Understand Multi-Source Remote Sensing Imagery

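[Quick Read]: This paper addresses two main challenges in applying natural-image multi-modal large language models (MLLMs) to remote sensing (RS). First, RS imagery contains abundant geospatial information that differs markedly from natural images, so natural spatial models are hard to transfer directly. Second, current RS MLLMs are limited by overly narrow interpretation levels and interaction modes, which restricts their use in real-world scenarios. To address these challenges, the paper proposes EarthGPT-X, a spatial MLLM that comprehensively understands multi-source RS imagery (such as optical, synthetic aperture radar (SAR), and infrared), offers multi-level insight from global to local, and supports flexible multi-grained interaction. EarthGPT-X also unifies two critical spatial tasks, referring and grounding, within a visual prompting framework.

The key to the solution lies in three strategies: first, a multi-modal content integration method that strengthens the interplay between images, visual prompts, and text instructions; second, a cross-domain one-stage fusion training strategy that uses the large language model (LLM) as a unified interface for multi-source multi-task learning; and third, a pixel perception module that unifies the referring and grounding tasks within a single framework. Experiments demonstrate the superior performance of EarthGPT-X on multi-grained tasks and its flexibility in multi-modal interaction, showing significant progress for MLLMs in the RS field.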

Link: https://arxiv.org/abs/2504.12795
Authors: Wei Zhang,Miaoxin Cai,Yaqian Ning,Tong Zhang,Yin Zhuang,He Chen,Jun Li,Xuerui Mao
Affiliations: Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology, Beijing 100081, China; School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing 314003, China; National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing 100081, China; School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China; School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in the visual-language area have developed natural multi-modal large language models (MLLMs) for spatial reasoning through visual prompting. However, because remote sensing (RS) imagery contains abundant geospatial information that differs from natural images, it is challenging to effectively adapt natural spatial models to the RS domain. Moreover, current RS MLLMs are limited by overly narrow interpretation levels and interaction modes, hindering their applicability in real-world scenarios. To address those challenges, a spatial MLLM named EarthGPT-X is proposed, enabling a comprehensive understanding of multi-source RS imagery, such as optical, synthetic aperture radar (SAR), and infrared. EarthGPT-X offers zoom-in and zoom-out insight, and possesses flexible multi-grained interactive abilities. Moreover, EarthGPT-X unifies two types of critical spatial tasks (i.e., referring and grounding) into a visual prompting framework. To achieve these versatile capabilities, several key strategies are developed. The first is the multi-modal content integration method, which enhances the interplay between images, visual prompts, and text instructions. Subsequently, a cross-domain one-stage fusion training strategy is proposed, utilizing the large language model (LLM) as a unified interface for multi-source multi-task learning. Furthermore, by incorporating a pixel perception module, the referring and grounding tasks are seamlessly unified within a single framework. In addition, the experiments conducted demonstrate the superiority of the proposed EarthGPT-X in multi-grained tasks and its impressive flexibility in multi-modal interaction, revealing significant advancements of MLLM in the RS field.

[CV-65] ARAP-GS: Drag-driven As-Rigid-As-Possible 3D Gaussian Splatting Editing with Diffusion Prior

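[Quick Read]: This paper addresses the challenges of drag-driven editing of the 3D Gaussian Splatting (3DGS) representation, in particular how to achieve complex geometric deformation while preserving shape coherence and visual continuity. Few studies have explored drag-driven editing for 3DGS, and the difficulty lies in deforming 3D Gaussians flexibly and intuitively.

The key to the solution is ARAP-GS, a framework based on As-Rigid-As-Possible (ARAP) deformation that is the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, the authors incorporate an advanced diffusion prior for image super-resolution into the iterative optimization, which improves visual quality and maintains multi-view consistency. Experiments show that the method performs well across diverse 3D scenes and outperforms current methods, and it is efficient: editing a scene takes only 10 to 20 minutes on a single RTX 3090 GPU.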

Link: https://arxiv.org/abs/2504.12788
Authors: Xiao Han,Runze Tian,Yifei Tong,Fenggen Yu,Dingyao Liu,Yan Zhang
Affiliations: Nanjing University; Simon Fraser University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Drag-driven editing has become popular among designers for its ability to modify complex geometric structures through simple and intuitive manipulation, allowing users to adjust and reshape content with minimal technical skill. This drag operation has been incorporated into numerous methods to facilitate the editing of 2D images and 3D meshes in design. However, few studies have explored drag-driven editing for the widely-used 3D Gaussian Splatting (3DGS) representation, as deforming 3DGS while preserving shape coherence and visual continuity remains challenging. In this paper, we introduce ARAP-GS, a drag-driven 3DGS editing framework based on As-Rigid-As-Possible (ARAP) deformation. Unlike previous 3DGS editing methods, we are the first to apply ARAP deformation directly to 3D Gaussians, enabling flexible, drag-driven geometric transformations. To preserve scene appearance after deformation, we incorporate an advanced diffusion prior for image super-resolution within our iterative optimization process. This approach enhances visual quality while maintaining multi-view consistency in the edited results. Experiments show that ARAP-GS outperforms current methods across diverse 3D scenes, demonstrating its effectiveness and superiority for drag-driven 3DGS editing. Additionally, our method is highly efficient, requiring only 10 to 20 minutes to edit a scene on a single RTX 3090 GPU.
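For readers unfamiliar with ARAP, the classical as-rigid-as-possible energy over a set of points with local neighborhoods is written below. This is the textbook formulation from mesh deformation, not necessarily the exact objective used by ARAP-GS: how the paper defines the neighborhoods $\mathcal{N}(i)$ and weights $w_{ij}$ over Gaussians is not stated in the abstract, so those symbols are assumptions.

```latex
E\bigl(\{\mathbf{p}'_i\}, \{\mathbf{R}_i\}\bigr)
  = \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}
    \bigl\| (\mathbf{p}'_i - \mathbf{p}'_j) - \mathbf{R}_i (\mathbf{p}_i - \mathbf{p}_j) \bigr\|^2
```

Here $\mathbf{p}_i$ are the original positions, $\mathbf{p}'_i$ the deformed positions, and $\mathbf{R}_i$ a per-point rotation. A drag operation typically enters as a constraint that fixes some of the $\mathbf{p}'_i$, and the remaining positions and rotations are solved so that every local neighborhood moves as rigidly as possible.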

[CV-66] Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts

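[Quick Read]: This paper addresses the generation of harmful or inappropriate content by deployed text-to-image models and proposes ANT (Automatically guiding deNoising Trajectories), a finetuning framework. Existing finetuning-based concept erasure methods have notable limitations: anchor-free methods may disrupt sampling trajectories and cause visual artifacts, while anchor-based methods depend on a heuristic choice of anchor concepts. The key insight behind ANT is that reversing the condition direction of classifier-free guidance during the mid-to-late denoising stages allows precise content modification without sacrificing early-stage structural integrity. This motivates a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, an augmentation-enhanced weight saliency map pinpoints the critical parameters that contribute most to the unwanted concept; for multi-concept erasure, the objective serves as a versatile plug-and-play solution that significantly boosts performance. Experiments show that ANT achieves state-of-the-art results in both single- and multi-concept erasure, producing high-quality, safe outputs while preserving generative fidelity.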

Link: https://arxiv.org/abs/2504.12782
Authors: Leyang Li,Shilin Lu,Yan Ren,Adams Wai-Kin Kong
Affiliations: Nanyang Technological University, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Ensuring the ethical deployment of text-to-image models requires effective techniques to prevent the generation of harmful or inappropriate content. While concept erasure methods offer a promising solution, existing finetuning-based approaches suffer from notable limitations. Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. ANT is built on a key insight: reversing the condition direction of classifier-free guidance during mid-to-late denoising stages enables precise content modification without sacrificing early-stage structural integrity. This inspires a trajectory-aware objective that preserves the integrity of the early-stage score function field, which steers samples toward the natural image manifold, without relying on heuristic anchor concept selection. For single-concept erasure, we propose an augmentation-enhanced weight saliency map to precisely identify the critical parameters that most significantly contribute to the unwanted concept, enabling more thorough and efficient erasure. For multi-concept erasure, our objective function offers a versatile plug-and-play solution that significantly boosts performance. Extensive experiments demonstrate that ANT achieves state-of-the-art results in both single and multi-concept erasure, delivering high-quality, safe outputs without compromising the generative fidelity. Code is available at this https URL
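The insight about reversing the condition direction can be illustrated at the level of a single guidance step. Standard classifier-free guidance combines the unconditional and conditional noise predictions as ε = ε_u + s·(ε_c − ε_u); flipping the sign of the second term pushes the sample away from the condition instead. The sketch below shows only that sign flip. It is not the ANT finetuning objective, and the `progress` convention and the `reverse_after` threshold are hypothetical parameters introduced for illustration.

```python
def guided_noise(eps_uncond, eps_cond, scale, progress, reverse_after=0.4):
    """Classifier-free guidance whose condition direction is reversed late in sampling.

    progress runs from 0.0 (start of denoising) to 1.0 (end); reverse_after is an
    assumed switch point marking the start of the "mid-to-late" stage.
    """
    direction = [c - u for c, u in zip(eps_cond, eps_uncond)]
    sign = -1.0 if progress >= reverse_after else 1.0
    return [u + sign * scale * d for u, d in zip(eps_uncond, direction)]


# Toy 2-dimensional noise predictions.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
early = guided_noise(eps_uncond, eps_cond, scale=2.0, progress=0.1)
late = guided_noise(eps_uncond, eps_cond, scale=2.0, progress=0.8)
```

Early steps follow the usual guidance and keep the global layout, while later steps steer away from the conditioned concept; per the abstract, ANT turns this observation into a training objective rather than applying it at sampling time.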

[CV-67] Stronger Steadier Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

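[Quick Read]: This paper addresses the fact that existing Domain Generalized Semantic Segmentation (DGSS) methods overlook that visual cues are fragile while the underlying geometry is stable, and proposes integrating depth information to improve the geometric consistency and generalization of Vision Foundation Models (VFMs). The key is the DepthForge framework, which combines visual cues from a frozen DINOv2 or EVA02 with depth cues from a frozen Depth Anything V2 and adds depth-aware learnable tokens in every VFM layer to continuously decouple domain-invariant visual and spatial information, strengthening the model's depth awareness. A depth refinement decoder is also developed to adaptively refine multi-layer VFM features and the depth-aware tokens. Experiments show that the method significantly outperforms alternatives across various DGSS settings and unseen target datasets, with especially strong results under extreme conditions such as night and snow.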

Link: https://arxiv.org/abs/2504.12753
Authors: Siyu Chen,Ting Han,Changshe Zhang,Xin Luo,Meiliu Wu,Guorong Cai,Jinhe Su
Affiliations: Jimei University; Sun Yat-sen University; Xidian University; University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at this https URL.

[CV-68] LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

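[Quick Read]: This paper addresses the need for deeper logical anomaly analysis in industrial anomaly detection, that is, identifying and explaining unexpected relationships among objects, counts, and spatial configurations. Existing approaches usually rely on large-scale external reasoning modules or elaborate pipeline designs, which hinders practical deployment and interpretability. To overcome these limitations, the paper introduces a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection with logical reasoning. The key to the solution is LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. It follows a two-stage training paradigm: Supervised Fine-Tuning (SFT) first provides fine-grained visual understanding, and Group Relative Policy Optimization (GRPO) then refines logical anomaly detection and enforces coherent, human-readable reasoning. The reward signals come from detection accuracy and the structural quality of the outputs, so no chain of thought (CoT) reasoning data needs to be built. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, despite its much smaller size, matches Qwen2.5-VL-72B in accuracy and F1 score and produces more concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines while offering transparent, interpretable insight into logical anomaly detection. Code and data will be released.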

Link: https://arxiv.org/abs/2504.12749
Authors: Weijia Li,Guanglei Chu,Jiong Chen,Guo-Sen Xie,Caifeng Shan,Fang Zhao
Affiliations: Nanjing University; China Mobile (Suzhou) Software Technology Co., Ltd; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
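The GRPO stage described above scores several sampled outputs for the same input and normalizes each reward against its own group, which removes the need for a separate value model. The sketch below shows that normalization together with a toy reward combining detection accuracy and output structure. The `<think>`/`<answer>` template, the `format_weight` of 0.5, and the function names are hypothetical; the abstract does not specify the actual reward design.

```python
import re
import statistics

# Hypothetical output template; the paper's real format is not given in the abstract.
OUTPUT_PATTERN = re.compile(r"<think>.+</think><answer>.+</answer>", re.DOTALL)


def reward(predicted, truth, output_text, format_weight=0.5):
    """Detection-accuracy reward plus an assumed bonus for well-structured output."""
    accuracy = 1.0 if predicted == truth else 0.0
    structure = 1.0 if OUTPUT_PATTERN.fullmatch(output_text) else 0.0
    return accuracy + format_weight * structure


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses for one image whose true label is "anomaly".
truth = "anomaly"
samples = [
    ("anomaly", "<think>two screws are missing</think><answer>anomaly</answer>"),
    ("normal", "<think>everything looks fine</think><answer>normal</answer>"),
    ("anomaly", "anomaly"),
    ("normal", "normal"),
]
rewards = [reward(pred, truth, text) for pred, text in samples]
advantages = group_relative_advantages(rewards)
```

Responses that are both correct and well structured receive the largest positive advantage, so the policy is pushed toward readable rationales without any hand-written reasoning traces.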

[CV-69] Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

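[Quick Read]: This paper addresses the privacy risks created by advances in diffusion models and personalization techniques, in particular the serious threat posed by the ability to recreate a person's portrait from a few publicly available images. It observes that existing anti-personalization methods perturb each image independently and ignore the inherently multi-image nature of personalization, missing the opportunity to exploit inter-image relationships for stronger protection. The paper therefore advocates a group-level perspective and proposes Cross-image Anti-Personalization (CAP), a framework whose key idea is to enforce style consistency across the perturbed images to strengthen resistance to personalization. It further designs a dynamic ratio adjustment strategy that adaptively balances the influence of the consistency loss over the attack iterations. Experiments on the CelebHQ and VGGFace2 benchmarks show that CAP substantially improves on existing methods.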

Link: https://arxiv.org/abs/2504.12747
Authors: Guanyu Wang,Kailong Wang,Yihao Huang,Mingyi Zhou,Zhang Qing cnwatcher,Geguang Pu,Li Li
Affiliations: Beihang University; Huazhong University of Science and Technology; Nanyang Technological University; ByteDance; East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

[CV-70] Mask Image Watermarking

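[Quick Read]: This paper addresses two key problems in image watermarking, global watermark embedding and extraction and local watermark embedding and extraction, while maintaining high robustness and visual quality. The key to the proposed MaskMark framework is a simple masking mechanism. In the MaskMark-D variant, a mask applied at the decoding stage guides the decoder to focus on selected regions for local watermark extraction, and a localization module reduces interference from irrelevant content. The MaskMark-ED variant additionally brings the mask into the encoding stage, guiding the encoder to embed the watermark in designated regions for stronger local robustness. This design delivers high-performance global and local watermark extraction and localization at a much lower computational cost, while preserving the visual quality of the watermarked images.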

Link: https://arxiv.org/abs/2504.12739
Authors: Runyi Hu,Jie Zhang,Shiqian Zhao,Nils Lukas,Jiwei Li,Qing Guo,Han Qiu,Tianwei Zhang
Affiliations: Nanyang Technological University; CFAR and IHPC, A*STAR, Singapore; MBZUAI; Zhejiang University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 18 figures

Abstract:We present MaskMark, a simple, efficient and flexible framework for image watermarking. MaskMark has two variants: MaskMark-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection, and MaskMark-ED, which focuses on local watermark embedding and extraction with enhanced robustness in small regions, enabling localized image protection. Built upon the classical Encoder-Distortion-Decoder training paradigm, MaskMark-D introduces a simple masking mechanism during the decoding stage to support both global and local watermark extraction. A mask is applied to the watermarked image before extraction, allowing the decoder to focus on selected regions and learn local extraction. A localization module is also integrated into the decoder to identify watermark regions during inference, reducing interference from irrelevant content and improving accuracy. MaskMark-ED extends this design by incorporating the mask into the encoding stage as well, guiding the encoder to embed the watermark in designated local regions for enhanced robustness. Comprehensive experiments show that MaskMark achieves state-of-the-art performance in global watermark extraction, local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. MaskMark is also flexible: by adjusting the distortion layer, it can adapt to different robustness requirements with just a few steps of fine-tuning. Moreover, our approach is efficient and easy to optimize, requiring only 20 hours on a single A6000 GPU with just 1/15 the computational cost of WAM.

[CV-71] Post-pre-training for Modality Alignment in Vision-Language Foundation Models CVPR2025

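[Quick Read]: This paper addresses the modality gap in multi-modal feature spaces, that is, the gap between image and text feature clusters, which limits downstream task performance. Prior work tries to reduce the gap by modifying pre-training or fine-tuning, but it tends to suffer from heavy training costs or degraded zero-shot performance. The proposed solution is CLIP-Refine, a post-pre-training method that sits between pre-training and fine-tuning. Its key components are two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA makes the image and text features follow a shared prior distribution by minimizing their distance to random reference vectors sampled from that prior. HyCD updates the model with hybrid soft labels that combine ground-truth image-text pair labels with the outputs of the pre-trained CLIP model, so that the model retains past knowledge while learning new knowledge to align features. Experiments show that CLIP-Refine mitigates the modality gap and improves zero-shot performance.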

Link: https://arxiv.org/abs/2504.12717
Authors: Shin’ya Yamaguchi,Dewei Feng,Sekitoshi Kanai,Kazuki Adachi,Daiki Chijiwa
Affiliations: NTT; Kyoto University; MIT; YNU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025; Code: this https URL

Abstract:Contrastive language image pre-training (CLIP) is an essential component of building modern vision-language foundation models. While CLIP demonstrates remarkable zero-shot performance on downstream tasks, the multi-modal feature spaces still suffer from a modality gap, which is a gap between image and text feature clusters and limits downstream task performance. Although existing works attempt to address the modality gap by modifying pre-training or fine-tuning, they struggle with heavy training costs with large datasets or degradations of zero-shot performance. This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. CLIP-Refine aims to align the feature space with 1 epoch training on small image-text datasets without zero-shot performance degradations. To this end, we introduce two techniques: random feature alignment (RaFA) and hybrid contrastive-distillation (HyCD). RaFA aligns the image and text features to follow a shared prior distribution by minimizing the distance to random reference vectors sampled from the prior. HyCD updates the model with hybrid soft labels generated by combining ground-truth image-text pair labels and outputs from the pre-trained CLIP model. This contributes to achieving both maintaining the past knowledge and learning new knowledge to align features. Our extensive experiments with multiple classification and retrieval tasks show that CLIP-Refine succeeds in mitigating the modality gap and improving the zero-shot performance.
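The two techniques can be sketched in a few lines. The code below is a loose illustration under stated assumptions, not the CLIP-Refine implementation: it assumes a standard Gaussian prior, a single reference vector shared by a paired image and text feature, a squared Euclidean distance, and a fixed mixing weight `alpha` for the hybrid labels. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_soft_labels(teacher_logits, true_index, alpha=0.5):
    """Mix the one-hot ground-truth pair label with the pre-trained model's softmax output."""
    teacher = softmax(teacher_logits)
    return [alpha * (1.0 if i == true_index else 0.0) + (1.0 - alpha) * p
            for i, p in enumerate(teacher)]


def rafa_loss(image_feat, text_feat, rng):
    """Mean squared distance of both features to one reference vector drawn from N(0, I)."""
    reference = [rng.gauss(0.0, 1.0) for _ in image_feat]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return 0.5 * (sq_dist(image_feat, reference) + sq_dist(text_feat, reference)) / len(reference)


rng = random.Random(0)
loss = rafa_loss([0.2, -0.1, 0.4], [0.3, 0.0, 0.5], rng)
labels = hybrid_soft_labels([2.0, 1.0, 0.0], true_index=0, alpha=0.5)
```

The hybrid labels remain a valid probability distribution, so they can be dropped into a cross-entropy or KL objective; the teacher term is what discourages the refined model from drifting away from its zero-shot behavior.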

[CV-72] NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results CVPR

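[Quick Read]: This paper addresses the challenge of removing raindrops under varying lighting and focus conditions; its central goal is to establish a new benchmark that covers multiple raindrop degradation types (day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused). The key to the solution is the collected real-world Raindrop Clarity dataset, which is used to develop and evaluate a range of algorithms, with the challenge format encouraging participants to submit high-quality solutions. In the final phase, 32 teams submitted valid solutions that achieved state-of-the-art performance on the dataset.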

Link: https://arxiv.org/abs/2504.12711
Authors: Xin Li,Yeying Jin,Xin Jin,Zongwei Wu,Bingchen Li,Yufei Wang,Wenhan Yang,Yu Li,Zhibo Chen,Bihan Wen,Robby T. Tan,Radu Timofte,Qiyu Rong,Hongyuan Jing,Mengmeng Zhang,Jinglong Li,Xiangyu Lu,Yi Ren,Yuting Liu,Meng Zhang,Xiang Chen,Qiyuan Guan,Jiangxin Dong,Jinshan Pan,Conglin Gou,Qirui Yang,Fangpu Zhang,Yunlong Lin,Sixiang Chen,Guoxi Huang,Ruirui Lin,Yan Zhang,Jingyu Yang,Huanjing Yue,Jiyuan Chen,Qiaosi Yi,Hongjun Wang,Chenxi Xie,Shuai Li,Yuhui Wu,Kaiyi Ma,Jiakui Hu,Juncheng Li,Liwen Pan,Guangwei Gao,Wenjie Li,Zhenyu Jin,Heng Guo,Zhanyu Ma,Yubo Wang,Jinghua Wang,Wangzhi Xing,Anjusree Karnavar,Diqi Chen,Mohammad Aminul Islam,Hao Yang,Ruikun Zhang,Liyuan Pan,Qianhao Luo,Xin Cao,Han Zhou,Yan Min,Wei Dong,Jun Chen,Taoyi Wu,Weijia Dou,Yu Wang,Shengjie Zhao,Yongcheng Huang,Xingyu Han,Anyan Huang,Hongtao Wu,Hong Wang,Yefeng Zheng,Abhijeet Kumar,Aman Kumar,Marcos V. Conde,Paula Garrido,Daniel Feijoo,Juan C. Benito,Guanglu Dong,Xin Lin,Siyuan Liu,Tianheng Zheng,Jiayu Zhong,Shouyi Wang,Xiangtai Li,Lanqing Guo,Lu Qi,Chao Ren,Shuaibo Wang,Shilong Zhang,Wanyu Zhou,Yunze Wu,Qinzhong Tan,Jieyuan Pei,Zhuoxuan Li,Jiayu Wang,Haoyu Bian,Haoran Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Challenge Report of CVPR NTIRE 2025; 26 pages; Methods from 32 teams

Abstract:This paper reviews the NTIRE 2025 Challenge on Day and Night Raindrop Removal for Dual-Focused Images. This challenge received a wide range of impressive solutions, which are developed and evaluated using our collected real-world Raindrop Clarity dataset. Unlike existing deraining datasets, our Raindrop Clarity dataset is more diverse and challenging in degradation types and contents, which includes day raindrop-focused, day background-focused, night raindrop-focused, and night background-focused degradations. This dataset is divided into three subsets for competition: 14,139 images for training, 240 images for validation, and 731 images for testing. The primary objective of this challenge is to establish a new and powerful benchmark for the task of removing raindrops under varying lighting and focus conditions. There are a total of 361 participants in the competition, and 32 teams submitting valid solutions and fact sheets for the final testing phase. These submissions achieved state-of-the-art (SOTA) performance on the Raindrop Clarity dataset. The project can be found at this https URL.

[CV-73] Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

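[Quick Read]: This paper addresses how large-scale unlabeled data can improve 3D perception models for autonomous driving. The key to the solution is a self-supervised pre-training framework that learns effective 3D representations from massive unlabeled heterogeneous datasets, combined with a prompt-adapter-based domain adaptation strategy that reduces dataset bias. The approach significantly improves performance on downstream tasks (3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction) and shows steady gains as the training data volume grows, demonstrating the potential to keep improving 3D perception models for autonomous driving.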

Link: https://arxiv.org/abs/2504.12709
Authors: Shumin Wang,Zhuoran Yang,Lidian Wang,Zhipeng Tang,Heng Li,Lehan Pan,Sha Zhang,Jie Peng,Jianmin Ji,Yanyong Zhang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The significant achievements of pre-trained models leveraging large volumes of data in the fields of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.

[CV-74] SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding

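[Quick Read]: This paper addresses the significant challenges that existing image editing methods face in spatial reasoning, precise region segmentation, and semantic consistency, particularly in complex scenes. The key to the solution is SmartFreeEdit, an end-to-end framework that combines a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing from natural language instructions alone. Its main technical contributions are: (1) region-aware tokens and a mask embedding paradigm that improve spatial understanding of complex scenes; (2) a reasoning segmentation pipeline that optimizes editing-mask generation from natural language instructions; and (3) a hypergraph-augmented inpainting module that preserves structural integrity and semantic coherence during complex edits while overcoming the limitations of local generation methods. Experiments show that SmartFreeEdit surpasses current state-of-the-art methods on multiple evaluation metrics, mitigates the over-focus on local information, and improves global consistency.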

Link: https://arxiv.org/abs/2504.12704
Authors: Qianqian Sun,Jixiang Luo,Dell Zhang,Xuelong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. However, conventional methods still face significant challenges, particularly in spatial reasoning, precise region segmentation, and maintaining semantic consistency, especially in complex scenes. To overcome these challenges, we introduce SmartFreeEdit, a novel end-to-end framework that integrates a multimodal large language model (MLLM) with a hypergraph-enhanced inpainting architecture, enabling precise, mask-free image editing guided exclusively by natural language instructions. The key innovations of SmartFreeEdit include: (1) the introduction of region-aware tokens and a mask embedding paradigm that enhance the spatial understanding of complex scenes; (2) a reasoning segmentation pipeline designed to optimize the generation of editing masks based on natural language instructions; and (3) a hypergraph-augmented inpainting module that ensures the preservation of both structural integrity and semantic coherence during complex edits, overcoming the limitations of local-based image generation. Extensive experiments on the Reason-Edit benchmark demonstrate that SmartFreeEdit surpasses current state-of-the-art methods across multiple evaluation metrics, including segmentation accuracy, instruction adherence, and visual quality preservation, while addressing the issue of local information focus and improving global consistency in the edited image. Our project will be available at this https URL.

[CV-75] Unsupervised Cross-Domain 3D Human Pose Estimation via Pseudo-Label-Guided Global Transforms

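[Quick Read]: This paper addresses the performance drop of existing 3D human pose estimation methods in cross-scenario inference, which stems from domain shift between source and target domains in characteristics such as camera viewpoint, position, posture, and body size. The paper notes that camera viewpoint and location strongly affect the global position of human poses and thereby contribute to the domain gap. To address this, it proposes a framework whose key idea is to explicitly apply global transformations between pose positions in the camera coordinate systems of the source and target domains. Specifically, a Pseudo-Label Generation Module first produces pseudo-3D poses from the 2D poses of the target dataset; a Global Transformation Module then uses a human-centered coordinate system as a bridge to align the positional orientations of poses across domains; and a Pose Augmentor handles variations in posture and body size to further improve generalization. The process is iterative, so progressively refined pseudo-labels keep improving the guidance for domain adaptation. Evaluated on cross-dataset benchmarks including Human3.6M, MPI-INF-3DHP, and 3DPW, the method outperforms state-of-the-art approaches and even the target-trained model.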

Link: https://arxiv.org/abs/2504.12699
Authors: Jingjing Liu,Zhiyong Wang,Xinyu Fan,Amirhossein Dadashzadeh,Honghai Liu,Majid Mirmehdi
Affiliations: School of Computer Science, University of Bristol; State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen; School of Aerospace Engineering, Xiamen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, including appendix. This work has been submitted to the IEEE for possible publication

Abstract:Existing 3D human pose estimation methods often suffer in performance when applied to cross-scenario inference, due to domain shifts in characteristics such as camera viewpoint, position, posture, and body size. Among these factors, camera viewpoints and locations have been shown to contribute significantly to the domain gap by influencing the global positions of human poses. To address this, we propose a novel framework that explicitly conducts global transformations between pose positions in the camera coordinate systems of source and target domains. We start with a Pseudo-Label Generation Module that is applied to the 2D poses of the target dataset to generate pseudo-3D poses. Then, a Global Transformation Module leverages a human-centered coordinate system as a novel bridging mechanism to seamlessly align the positional orientations of poses across disparate domains, ensuring consistent spatial referencing. To further enhance generalization, a Pose Augmentor is incorporated to address variations in human posture and body size. This process is iterative, allowing refined pseudo-labels to progressively improve guidance for domain adaptation. Our method is evaluated on various cross-dataset benchmarks, including Human3.6M, MPI-INF-3DHP, and 3DPW. The proposed method outperforms state-of-the-art approaches and even outperforms the target-trained model.
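
The abstract above mentions a human-centered coordinate system as the bridge between domains. As a rough illustration only, the sketch below shows the simplest ingredient such a system could have: expressing every joint relative to a root joint, so that the same pose seen at two camera-space locations becomes identical. The function name, the `root_index` parameter, and the toy joints are assumptions for illustration; this is not the paper's Global Transformation Module, which also deals with orientation.

```python
def to_root_centered(pose, root_index=0):
    """Translate a 3D pose so that the root joint sits at the origin.

    pose: list of (x, y, z) joint positions in camera coordinates.
    root_index: hypothetical index of the root joint (e.g., the pelvis).
    """
    rx, ry, rz = pose[root_index]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in pose]


# The same toy pose placed at two different camera-space locations.
pose_a = [(0.0, 0.0, 4.0), (0.0, -0.5, 4.0), (0.25, -0.5, 4.25)]
pose_b = [(x + 2.0, y + 1.0, z - 1.5) for (x, y, z) in pose_a]

centered_a = to_root_centered(pose_a)
centered_b = to_root_centered(pose_b)
```

After centering, the global offset between the two copies disappears, which is the kind of positional gap the abstract attributes to camera viewpoint and location.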

[CV-76] Collaborative Perception Datasets for Autonomous Driving: A Review

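[Quick Read]: This paper addresses the inefficient use of resources and the difficulty of standardizing model evaluation in collaborative perception, both caused by the lack of a systematic summary and comparative analysis of datasets. The key to the solution is a comprehensive, multi-dimensional review and comparison of existing collaborative perception datasets: it categorizes them by cooperation paradigm, examines data sources and scenarios, analyzes sensor modalities and supported tasks, and uses a detailed cross-dimensional comparison to clarify each dataset's characteristics and scope of use. The paper also outlines key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models, and provides a continuously updated online repository to support related research.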

Link: https://arxiv.org/abs/2504.12696
Authors: Naibang Wang,Deyong Shang,Yan Gong,Xiaoxi Hu,Ziying Song,Lei Yang,Yuhan Huang,Xiaoyu Wang,Jianli Lu
Affiliations: School of Mechanical and Electrical Engineering, China University of Mining and Technology (Beijing); State Key Laboratory of Robotics and System, Harbin Institute of Technology; State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; School of Mechanical and Aerospace Engineering, Nanyang Technological University; School of Mechatronics Engineering, Harbin Institute of Technology; Department of Electronic & Electrical Engineering, University of Bath
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures, journal

Abstract:Collaborative perception has attracted growing interest from academia and industry due to its potential to enhance perception accuracy, safety, and robustness in autonomous driving through multi-agent information fusion. With the advancement of Vehicle-to-Everything (V2X) communication, numerous collaborative perception datasets have emerged, varying in cooperation paradigms, sensor configurations, data sources, and application scenarios. However, the absence of systematic summarization and comparative analysis hinders effective resource utilization and standardization of model evaluation. As the first comprehensive review focused on collaborative perception datasets, this work reviews and compares existing resources from a multi-dimensional perspective. We categorize datasets based on cooperation paradigms, examine their data sources and scenarios, and analyze sensor modalities and supported tasks. A detailed comparative analysis is conducted across multiple dimensions. We also outline key challenges and future directions, including dataset scalability, diversity, domain adaptation, standardization, privacy, and the integration of large language models. To support ongoing research, we provide a continuously updated online repository of collaborative perception datasets and related literature: this https URL.

[CV-77] HSS-IAD: A Heterogeneous Same-Sort Industrial Anomaly Detection Dataset ICME2025

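[Quick Read]: This paper addresses the limitations of existing Industrial Anomaly Detection (IAD) datasets for the practical application of Multi-class Unsupervised Anomaly Detection (MUAD) methods. Specifically, current IAD datasets have class distributions that do not match real factory production, fail to cover multiple structures or appearances, and contain unrealistic defects, which calls into question the effectiveness of MUAD methods in real industrial settings. To address this, the paper introduces a new dataset, Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD). It contains 8,580 images of metallic-like industrial parts with precise anomaly annotations; the parts vary in structure and appearance, and their subtle defects closely resemble the base materials. Foreground images are also provided to support synthetic anomaly generation. Evaluating popular IAD methods under multi-class and class-separated settings shows the dataset's potential to bridge the gap between existing datasets and real factory conditions.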

Link: https://arxiv.org/abs/2504.12689
Authors: Qishan Wang,Shuyong Gao,Junjie Hu,Jiawen Yu,Xuan Tong,You Li,Wenqiang Zhang
Affiliations: Academy for Engineering and Technology, Fudan University; School of Computer Science, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE ICME 2025

Abstract:Multi-class Unsupervised Anomaly Detection algorithms (MUAD) are receiving increasing attention due to their relatively low deployment costs and improved training efficiency. However, the real-world effectiveness of MUAD methods is questioned due to limitations in current Industrial Anomaly Detection (IAD) datasets. These datasets contain numerous classes that are unlikely to be produced by the same factory and fail to cover multiple structures or appearances. Additionally, the defects do not reflect real-world characteristics. Therefore, we introduce the Heterogeneous Same-Sort Industrial Anomaly Detection (HSS-IAD) dataset, which contains 8,580 images of metallic-like industrial parts and precise anomaly annotations. These parts exhibit variations in structure and appearance, with subtle defects that closely resemble the base materials. We also provide foreground images for synthetic anomaly generation. Finally, we evaluate popular IAD methods on this dataset under multi-class and class-separated settings, demonstrating its potential to bridge the gap between existing datasets and real factory conditions. The dataset is available at this https URL.

[CV-78] SOPHY: Generating Simulation-Ready Objects with Physical Materials

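[Quick Read]: This paper addresses how to generate physics-aware 3D shapes by jointly modeling shape, texture, and material properties, so that the results can be used in simulation and in interactive, dynamic environments. To this end it proposes SOPHY, a generative model. The key is a dataset of 3D objects annotated with detailed physical material attributes, together with an efficient material annotation pipeline, which makes it possible to jointly synthesize shape, texture, and material properties related to physics-grounded dynamics. In addition, jointly modeling shape and material properties improves the realism and fidelity of the generated shapes and improves performance on generative geometry evaluation metrics.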

Link: https://arxiv.org/abs/2504.12684
Authors: Junyi Cao,Evangelos Kalogerakis
Affiliations: Technical University of Crete
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We present SOPHY, a generative model for 3D physics-aware shape synthesis. Unlike existing 3D generative models that focus solely on static geometry or 4D models that produce physics-agnostic animations, our approach jointly synthesizes shape, texture, and material properties related to physics-grounded dynamics, making the generated objects ready for simulations and interactive, dynamic environments. To train our model, we introduce a dataset of 3D objects annotated with detailed physical material attributes, along with an annotation pipeline for efficient material annotation. Our method enables applications such as text-driven generation of interactive, physics-aware 3D objects and single-image reconstruction of physically plausible shapes. Furthermore, our experiments demonstrate that jointly modeling shape and material properties enhances the realism and fidelity of generated shapes, improving performance on generative geometry evaluation metrics.

[CV-79] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

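[Quick Read]: This paper addresses how pretrained models, especially large vision-language models, can acquire the ability to perceive and perform high-level spatial reasoning from sequential visual observations, a process whose mechanism remains unclear. The key to the solution is Embodied-R, a collaborative framework that combines large-scale vision-language models for perception with small-scale language models for reasoning, and uses reinforcement learning (RL) with a novel reward system that accounts for the logical consistency between thinking and answer, achieving slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R matches state-of-the-art multimodal reasoning models on both in-distribution and out-of-distribution embodied spatial reasoning tasks.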

Link: https://arxiv.org/abs/2504.12680
Authors: Baining Zhao,Ziyou Wang,Jianjie Fang,Chen Gao,Fanhang Man,Jinqiang Cui,Xin Wang,Xinlei Chen,Yong Li,Wenwu Zhu
Affiliations: Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures

Abstract:Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.

[CV-80] TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

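[Quick Read]: This paper addresses the shortage of trajectory data across operating systems and applications that hinders the building of generalized graphical user interface (GUI) agents, a shortage caused mainly by the high cost of manual annotation. The key to the solution is the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. Specifically, the authors crawl online GUI tutorials (such as videos and articles) and process them into GUI agent trajectory data, producing the GUI-Net dataset with 143K trajectories across five operating systems and more than 200 applications. Fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net yields the TongUI agents, which show marked improvements on commonly used grounding and navigation benchmarks and outperform baseline agents by about 10% on several of them, supporting the effectiveness of the GUI-Net dataset and the TongUI framework.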

Link: https://arxiv.org/abs/2504.12679
Authors: Bofei Zhang,Zirui Shang,Zhi Gao,Wang Zhang,Rui Xie,Xiaojian Ma,Tao Yuan,Xinxiao Wu,Song-Chun Zhu,Qing Li
Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI; Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; School of Intelligence Science and Technology, Peking University; Shanghai Jiao Tong University; Department of Automation, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net. The resulting agents show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks, which shows the effectiveness of the GUI-Net dataset and underscores the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.

[CV-81] Accurate Tracking of Arabidopsis Root Cortex Cell Nuclei in 3D Time-Lapse Microscopy Images Based on Genetic Algorithm

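[Quick Read]: This paper addresses the insufficient accuracy of conventional cell tracking software (such as TrackMate) when tracking cell nuclei in live imaging data of plant root tips where cells are densely arranged. The key to the solution is an accurate tracking method based on a genetic algorithm (GA) that exploits knowledge of Arabidopsis root cellular patterns and the spatial relationships among volumes. It follows a coarse-to-fine strategy: relatively easy line-level tracking of cell nuclei comes first, followed by the harder nucleus-level tracking, which relies on the known linear arrangement of cell files and the spatial relationships between nuclei. With minor manual correction, the method accurately tracks nuclei in Arabidopsis root tips.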

Link: https://arxiv.org/abs/2504.12676
Authors: Yu Song,Tatsuaki Goh,Yinhao Li,Jiahua Dong,Shunsuke Miyashima,Yutaro Iwamoto,Yohei Kondo,Keiji Nakajima,Yen-wei Chen
Affiliations: College of Information Science and Engineering, Ritsumeikan University; Graduate School of Science and Technology, Nara Institute of Science and Technology; Bioresource Engineering Laboratory, Ishikawa Prefectural University; Department of Engineering Informatics, Osaka Electro-Communication University; Exploratory Research Center on Life and Living Systems, National Institutes of Natural Sciences; College of Computer Science and Technology, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Arabidopsis is a widely used model plant to gain basic knowledge on plant physiology and development. Live imaging is an important technique to visualize and quantify elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software, TrackMate, adopts a tracking-by-detection approach, which applies Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these components do not perform sufficiently well when cells are densely arranged. To alleviate the problems mentioned above, we propose an accurate tracking method based on a genetic algorithm (GA) using knowledge of Arabidopsis root cellular patterns and spatial relationships among volumes. Our method can be described as a coarse-to-fine method, in which we first conduct relatively easy line-level tracking of cell nuclei, then perform complicated nuclear tracking based on the known linear arrangement of cell files and the spatial relationships between nuclei. Our method has been evaluated on a long-time live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in the field of time-lapse microscopy in the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.
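
The abstract refers to linking detections between frames as a Linear Assignment Problem (LAP). The sketch below is a toy illustration of that idea only, assuming equal numbers of detections in both frames and solving the assignment by brute force over permutations; real trackers use efficient solvers and handle appearing or disappearing objects. The function name and coordinates are hypothetical, and this is neither TrackMate's implementation nor the paper's GA-based method.

```python
from itertools import permutations
from math import dist


def link_frames(prev_pts, next_pts):
    """Match each point in prev_pts to one point in next_pts.

    Returns (perm, cost), where prev point i is linked to next point perm[i]
    and cost is the minimal total Euclidean distance over all one-to-one
    matchings. Brute force: only practical for a handful of points.
    """
    n = len(prev_pts)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        cost = sum(dist(prev_pts[i], next_pts[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_perm, best_cost


# Three nuclei along a line; the next frame lists them in a different order.
prev_pts = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
next_pts = [(21.0, 0.0, 0.0), (1.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
assignment, total = link_frames(prev_pts, next_pts)
```

When nuclei sit much closer together than they move between frames, several matchings have nearly the same cost, which gives some intuition for why distance-only linking can struggle in densely packed tissue.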

[CV-82] Two Tasks, One Goal: Uniting Motion and Planning for Excellent End To End Autonomous Driving Performance

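[Quick Read]: This paper addresses a limitation of earlier end-to-end autonomous driving methods, which decouple the planning and motion tasks and therefore miss the benefit that planning could gain from the out-of-distribution data encountered in motion tasks. To address this, it proposes TTOG, a two-stage trajectory generation framework. The first stage generates a diverse set of trajectory candidates, and the second stage refines them using vehicle state information. Because the states of surrounding vehicles are not observable, a state estimator trained on ego-vehicle data is extended to other vehicles, and an equivariant context-sharing scene adapter (ECSA) improves the generalization of scene representations across agents. Experiments show that TTOG achieves state-of-the-art performance on both planning and motion tasks.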

Link: https://arxiv.org/abs/2504.12667
Authors: Lin Liu,Ziying Song,Hongyu Pan,Lei Yang,Caiyan Jia
Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University; Horizon Robotics; School of Information Technology and Electrical Engineering, The University of Queensland; School of Vehicle and Mobility, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end autonomous driving has made impressive progress in recent years. Earlier end-to-end autonomous driving approaches often decouple planning and motion tasks, treating them as separate modules. This separation overlooks the potential benefits that planning can gain from learning out-of-distribution data encountered in motion tasks. However, unifying these tasks poses significant challenges, such as constructing shared contextual representations and handling the unobservability of other vehicles' states. To address these challenges, we propose TTOG, a novel two-stage trajectory generation framework. In the first stage, a diverse set of trajectory candidates is generated, while the second stage focuses on refining these candidates through vehicle state information. To mitigate the issue of unavailable surrounding vehicle states, TTOG employs a state estimator trained on self-vehicle data and subsequently extended to other vehicles. Furthermore, we introduce ECSA (equivariant context-sharing scene adapter) to enhance the generalization of scene representations across different agents. Experimental results demonstrate that TTOG achieves state-of-the-art performance across both planning and motion tasks. Notably, on the challenging open-loop nuScenes dataset, TTOG reduces the L2 distance by 36.06%. Furthermore, on the closed-loop Bench2Drive dataset, our approach achieves a 22% improvement in the driving score (DS), significantly outperforming existing baselines.

[CV-83] AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification

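[Quick Read]: This paper addresses how to substantially reduce the computational complexity and parameter count of convolutional neural networks (CNNs) while keeping classification accuracy high. The key to the solution is AdaptoVision, a CNN architecture that uses enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections to reach competitive performance on several benchmark and medical image datasets without relying on pretrained weights. These architectural choices improve feature extraction efficiency and generalization, making the model well suited to real-time and resource-constrained deployment.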

Link: https://arxiv.org/abs/2504.12652
Authors: Md. Sanaullah Chowdhury,Lameya Sabrin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces AdaptoVision, a novel convolutional neural network (CNN) architecture designed to efficiently balance computational complexity and classification accuracy. By leveraging enhanced residual units, depth-wise separable convolutions, and hierarchical skip connections, AdaptoVision significantly reduces parameter count and computational requirements while preserving competitive performance across various benchmark and medical image datasets. Extensive experimentation demonstrates that AdaptoVision achieves state-of-the-art results on the BreakHis dataset and comparable accuracy on other benchmarks, notably 95.3% on CIFAR-10 and 85.77% on CIFAR-100, without relying on any pretrained weights. The model's streamlined architecture and strategic simplifications promote effective feature extraction and robust generalization, making it particularly suitable for deployment in real-time and resource-constrained environments.
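
To make the parameter savings of depth-wise separable convolutions concrete, the sketch below counts weights for a single layer in both forms. It is generic arithmetic, not AdaptoVision's actual layer configuration; the channel sizes and kernel size are arbitrary assumptions, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 128 input channels, 256 output channels, 3 x 3 kernel.
std = standard_conv_params(128, 256, 3)
sep = depthwise_separable_params(128, 256, 3)
ratio = sep / std  # equals 1/c_out + 1/k**2
```

For this example the separable form needs 33,920 weights instead of 294,912, about 11.5% of the standard layer.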

[CV-84] Quantum Computing Supported Adversarial Attack-Resilient Autonomous Vehicle Perception Module for Traffic Sign Classification

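[Quick Read]: This paper addresses the misclassification that adversarial attacks can cause in deep learning (DL) image classification models used in autonomous vehicle (AV) perception modules. Adversarial attacks can lead a DL model to output wrong predictions, for example a perception module misclassifying traffic signs, with potentially severe consequences. The key to the solution is to build hybrid classical-quantum deep learning (HCQ-DL) models and compare them with classical deep learning (C-DL) models. Transfer learning models (AlexNet and VGG-16) serve as feature extractors, and over 1000 quantum circuits are tested against three well-known untargeted adversarial attacks: projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA). With no attack, the HCQ-DL models keep accuracy above 95%. Under adversarial attacks, especially PGD, they clearly outperform the C-DL models: the AlexNet-based HCQ-DL model reaches 85% accuracy under PGD, whereas the C-DL models fall below 21%. The key, therefore, is combining classical transfer learning with quantum computing to improve traffic sign classification accuracy under adversarial conditions.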

Link: https://arxiv.org/abs/2504.12644
Authors: Reek Majumder,Mashrur Chowdhury,Sakib Mahmud Khan,Zadid Khan,Fahim Ahmad,Frank Ngeni,Gurcan Comert,Judith Mwakalonge,Dimitra Michalaka
Affiliations: Clemson University; The MITRE Corporation; Walmart; University of South Carolina; North Carolina A&T State University; The Citadel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:

Abstract:Deep learning (DL)-based image classification models are essential for autonomous vehicle (AV) perception modules since incorrect categorization might have severe repercussions. Adversarial attacks are widely studied cyberattacks that can lead DL models to predict inaccurate output, such as incorrectly classified traffic signs by the perception module of an autonomous vehicle. In this study, we create and compare hybrid classical-quantum deep learning (HCQ-DL) models with classical deep learning (C-DL) models to demonstrate robustness against adversarial attacks for perception modules. We used transfer learning models, AlexNet and VGG-16, as feature extractors before feeding the features into the quantum system. We tested over 1000 quantum circuits in our HCQ-DL models for projected gradient descent (PGD), fast gradient sign attack (FGSA), and gradient attack (GA), which are three well-known untargeted adversarial approaches. We evaluated the performance of all models during adversarial attacks and no-attack scenarios. Our HCQ-DL models maintain accuracy above 95% during a no-attack scenario and above 91% for GA and FGSA attacks, which is higher than C-DL models. During the PGD attack, our AlexNet-based HCQ-DL model maintained an accuracy of 85% compared to C-DL models that achieved accuracies below 21%. Our results highlight that the HCQ-DL models provide improved accuracy for traffic sign classification under adversarial settings compared to their classical counterparts.
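
As a rough illustration of the fast gradient sign attack named in the abstract, the sketch below perturbs the input of a toy logistic-regression classifier by a step of size `eps` in the direction of the sign of the loss gradient. The weights, input, and `eps` are made-up values, and the model is a stand-in chosen so the gradient can be written by hand; it has nothing to do with the paper's AlexNet, VGG-16, or quantum circuits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One untargeted fast-gradient-sign step on the input x.

    For binary cross-entropy with a logistic model, the gradient of the
    loss with respect to x is (p - y) * w.
    """
    p = predict(w, b, x)
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical model and a correctly classified input with true label 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.25], 1
x_adv = fgsm(w, b, x, y, eps=0.5)
p_clean = predict(w, b, x)
p_adv = predict(w, b, x_adv)
```

In this toy case the clean input is classified as class 1 and the perturbed input falls on the other side of the decision boundary, which is the failure mode the paper's robustness comparison is about.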

[CV-85] RoPETR: Improving Temporal Camera-Only 3D Detection by Integrating Enhanced Rotary Position Embedding

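[Quick Read]: This paper addresses the velocity estimation bottleneck of the StreamPETR framework on the NuScenes dataset, a critical factor in its overall NuScenes Detection Score (NDS). Although StreamPETR performs well at 3D bounding box detection, weak velocity estimation limits further gains. To address this, the paper proposes a customized positional embedding strategy that strengthens temporal modeling. With a ViT-L backbone, the improved method achieves an NDS of 70.86% on the NuScenes test set, which the report describes as state of the art and a new benchmark for camera-only 3D object detection.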

Link: https://arxiv.org/abs/2504.12643
Authors: Hang Ji,Tao Ni,Xufeng Huang,Tao Luo,Xin Zhan,Junbo Chen
Affiliations: Udeer.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This technical report introduces a targeted improvement to the StreamPETR framework, specifically aimed at enhancing velocity estimation, a critical factor influencing the overall NuScenes Detection Score (NDS). While StreamPETR exhibits strong 3D bounding box detection performance, as reflected by its high mean Average Precision, our analysis identified velocity estimation as a substantial bottleneck when evaluated on the NuScenes dataset. To overcome this limitation, we propose a customized positional embedding strategy tailored to enhance temporal modeling capabilities. Experimental evaluations conducted on the NuScenes test set demonstrate that our improved approach achieves a state-of-the-art NDS of 70.86% using the ViT-L backbone, setting a new benchmark for camera-only 3D object detection.
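
The title refers to rotary position embedding (RoPE). The sketch below implements the standard RoPE formulation known from language models, not the enhanced variant this report proposes (the abstract gives no details of it). It rotates each pair of features by an angle proportional to the position, so that the dot product between a rotated query and key depends only on their position offset. The vectors and positions are arbitrary assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply standard rotary position embedding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.25, 2.0]
k = [0.5, -1.0, 1.5, 0.25]

# Same offset (2) at different absolute positions gives the same score.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
```

This relative-position property is one reason rotary embeddings are a natural candidate for temporal modeling across frames.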

[CV-86] Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

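[Quick Read]: This paper addresses the computational bottleneck that large frame counts cause in long video generation, and aims to improve the visual quality of existing video diffusion models. The key to the solution is FramePack, a neural network structure that compresses input frames so that the transformer context length is fixed regardless of video length. A large number of frames can then be processed with a computation bottleneck similar to image diffusion, and training batch sizes become much larger (comparable to image diffusion training). The paper also proposes an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, existing video diffusion models can be fine-tuned with FramePack, and their visual quality may improve because next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.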

Link: https://arxiv.org/abs/2504.12626
Authors: Lvmin Zhang,Maneesh Agrawala
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
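
The abstract says only that input frames are compressed so the context length does not grow with video length; it does not give the schedule. The sketch below shows one way such a bound can arise, under the assumption (not taken from the abstract) that each older frame keeps half as many tokens as the next more recent one. The function name and the `tokens_per_frame` value are hypothetical.

```python
def context_length(num_frames, tokens_per_frame=1536):
    """Total tokens if the i-th most recent frame keeps tokens_per_frame // 2**i.

    Assumed halving schedule for illustration only.
    """
    return sum(tokens_per_frame // (2 ** i) for i in range(num_frames))


lengths = {n: context_length(n) for n in (1, 4, 16, 64, 256)}
```

Under this assumed schedule the total stops growing once the oldest frames compress to zero tokens, and it always stays below twice the per-frame budget, because the halving series is geometric.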

[CV-87] SAM-Based Building Change Detection with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping

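[Quick Read]: This paper addresses challenges in building change detection for urban development, disaster assessment, and military reconnaissance, in particular the limits of foundation models such as the Segment Anything Model (SAM) on this task due to the domain gap. In addition, existing adapter-based fine-tuning methods struggle with imbalanced building distributions, so they detect subtle changes poorly and extract inaccurate edges, and offset estimation for bi-temporal image alignment is easily disturbed by background noise, which further harms detection accuracy and edge recognition. To address these challenges, the paper proposes FAEWNet, a SAM-based network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping. Its key components are a Distribution-Aware Fourier Aggregated Adapter, which guides SAM to focus on specific ground objects in remote sensing scenes, mitigates the domain gap, and accounts for the distribution of changed buildings; and a novel flow module, which refines building edge extraction and enhances the perception of changed buildings, improving detection accuracy.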

Link: https://arxiv.org/abs/2504.12619
Authors: Yun-Cheng Li,Sen Lei,Yi-Tao Zhao,Heng-Chao Li,Jun Li,Antonio Plaza
Affiliations: School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China; School of Computer Science, China University of Geosciences, Wuhan 430074, China; Department of Technology of Computers and Communications, University of Extremadura, Caceres 10003, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Building change detection remains challenging for urban development, disaster assessment, and military reconnaissance. While foundation models like the Segment Anything Model (SAM) show strong segmentation capabilities, SAM is limited in the task of building change detection due to domain gap issues. Existing adapter-based fine-tuning approaches face challenges with imbalanced building distribution, resulting in poor detection of subtle changes and inaccurate edge extraction. Additionally, bi-temporal misalignment in change detection, typically addressed by optical flow, remains vulnerable to background noise. This affects the detection of building changes and compromises both detection accuracy and edge recognition. To tackle these challenges, we propose a new SAM-Based Network with Distribution-Aware Fourier Adaptation and Edge-Constrained Warping (FAEWNet) for building change detection. FAEWNet utilizes the SAM encoder to extract rich visual features from remote sensing images. To guide SAM in focusing on specific ground objects in remote sensing scenes, we propose a Distribution-Aware Fourier Aggregated Adapter to aggregate task-oriented changed information. This adapter not only effectively addresses the domain gap issue, but also pays attention to the distribution of changed buildings. Furthermore, to mitigate noise interference and misalignment in height offset estimation, we design a novel flow module that refines building edge extraction and enhances the perception of changed buildings. Our state-of-the-art results on the LEVIR-CD, S2Looking and WHU-CD datasets highlight the effectiveness of FAEWNet. The code is available at this https URL.

[CV-88] Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation

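[Quick Read]: This paper addresses robust scene graph generation, whose core challenge is the domain shift between clean and corrupted images. Existing scene graph generation (SGG) methods degrade markedly on corrupted images because visual features are compromised (for example by corruption interference or occlusion). To address this, the paper proposes Robo-SGG, whose key idea is to use layout information to make SGG methods more robust on corrupted images. Specifically, Layout-Oriented Normalization and Restitution uses Instance Normalization (IN) to filter out domain-specific features and then recovers the invariant structural features, i.e., the positional and semantic relationships among objects. A Layout-Embedded Encoder (LEE) further enriches the robust positional and semantic features of objects and predicates. The module is designed as a plug-and-play component that can be integrated into any baseline SGG model. Experiments show that combining a state-of-the-art method with Robo-SGG yields relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 on the PredCls, SGCls, and SGDet tasks of the VG-C dataset, respectively, and sets a new state of the art on corrupted scene graph generation benchmarks.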

Link: https://arxiv.org/abs/2504.12606
Authors: Changsheng Lv, Mengshi Qi, Zijian Fu, Huadong Ma
Affiliations: State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we introduce a novel method named Robo-SGG, i.e., Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation. Compared to the existing SGG setting, the robust scene graph generation aims to perform inference on a diverse range of corrupted images, with the core challenge being the domain shift between the clean and corrupted images. Existing SGG methods suffer from degraded performance due to compromised visual features e.g., corruption interference or occlusions. To obtain robust visual features, we exploit the layout information, which is domain-invariant, to enhance the efficacy of existing SGG methods on corrupted images. Specifically, we employ Instance Normalization(IN) to filter out the domain-specific feature and recover the unchangeable structural features, i.e., the positional and semantic relationships among objects by the proposed Layout-Oriented Restitution. Additionally, we propose a Layout-Embedded Encoder (LEE) that augments the existing object and predicate encoders within the SGG framework, enriching the robust positional and semantic features of objects and predicates. Note that our proposed Robo-SGG module is designed as a plug-and-play component, which can be easily integrated into any baseline SGG model. Extensive experiments demonstrate that by integrating the state-of-the-art method into our proposed Robo-SGG, we achieve relative improvements of 5.6%, 8.0%, and 6.5% in mR@50 for PredCls, SGCls, and SGDet tasks on the VG-C dataset, respectively, and achieve new state-of-the-art performance in corruption scene graph generation benchmark (VG-C and GQA-C). We will release our source code and model.
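
As a rough illustration of why Instance Normalization can filter out domain-specific statistics, the sketch below normalizes a single feature channel and applies it to a clean map and to a copy with a simulated brightness and contrast shift. The function name `instance_norm` and the toy corruption are assumptions made for this example; the paper's Layout-Oriented Restitution step is not reproduced here.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel of one instance (a 2D list) to zero mean, unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]


clean = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical corruption: an affine intensity shift of the same feature map.
corrupted = [[2.0 * v + 10.0 for v in row] for row in clean]

norm_clean = instance_norm(clean)
norm_corrupted = instance_norm(corrupted)

# After normalization the two maps are nearly identical: the affine shift,
# a stand-in for domain-specific style, has been removed.
for row_a, row_b in zip(norm_clean, norm_corrupted):
    print([round(a - b, 6) for a, b in zip(row_a, row_b)])
```

Normalization of this kind also discards some discriminative information, which is presumably why the abstract pairs it with a restitution step.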

[CV-89] AdaQual-Diff: Diffusion-Based Image Restoration via Adaptive Quality Prompting

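[Quick read]: This paper tackles the restoration of images with complex real-world degradations, where conventional methods rely on indirect cues and struggle to adapt to the specific mixture and severity of degradations. The key solution is AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Its core technique is an Adaptive Quality Prompting mechanism that establishes a mathematical relationship between regional quality scores from DeQAScore and the optimal guidance complexity, and adjusts the prompt structure according to the measured degradation severity: low-quality regions receive computationally intensive, structurally complex prompts with precise restoration directives, while high-quality regions receive only minimal prompts focused on preservation rather than intervention. By allocating computational resources in proportion to degradation severity, the method forms a spatially varying guidance field that steers the diffusion process, giving fine-grained control over restoration intensity without additional parameters or inference iterations. Experiments show that AdaQual-Diff produces visually superior restorations on a range of synthetic and real-world datasets.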

Link: https://arxiv.org/abs/2504.12605
Authors: Xin Su, Chen Wu, Yu Zhang, Chen Lyu, Zhuoran Zheng
Affiliations: Fuzhou University; University of Science and Technology of China; University of the Chinese Academy of Sciences; Shandong Normal University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Restoring images afflicted by complex real-world degradations remains challenging, as conventional methods often fail to adapt to the unique mixture and severity of artifacts present. This stems from a reliance on indirect cues which poorly capture the true perceptual quality deficit. To address this fundamental limitation, we introduce AdaQual-Diff, a diffusion-based framework that integrates perceptual quality assessment directly into the generative restoration process. Our approach establishes a mathematical relationship between regional quality scores from DeQAScore and optimal guidance complexity, implemented through an Adaptive Quality Prompting mechanism. This mechanism systematically modulates prompt structure according to measured degradation severity: regions with lower perceptual quality receive computationally intensive, structurally complex prompts with precise restoration directives, while higher quality regions receive minimal prompts focused on preservation rather than intervention. The technical core of our method lies in the dynamic allocation of computational resources proportional to degradation severity, creating a spatially-varying guidance field that directs the diffusion process with mathematical precision. By combining this quality-guided approach with content-specific conditioning, our framework achieves fine-grained control over regional restoration intensity without requiring additional parameters or inference iterations. Experimental results demonstrate that AdaQual-Diff achieves visually superior restorations across diverse synthetic and real-world datasets.

[CV-90] 3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation

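[Quick read]: This paper addresses two key challenges in semi-supervised 3D referring expression segmentation (3D-RES): inefficient use of high-quality pseudo-labels and wasted information in low-quality pseudo-labels. Existing semi-supervised methods usually filter pseudo-labels with a high confidence threshold, which discards potentially valuable pseudo-labels and limits how much the model can benefit from abundant unlabeled data. The paper proposes 3DResT, the first semi-supervised learning framework for 3D-RES, with two new designs: Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS selects high-quality pseudo-labels and merges them into the labeled dataset to strengthen the labeled supervision signal, while QDW keeps low-quality pseudo-labels but assigns them dynamically lower weights, so that their useful information is extracted rather than discarded. With only 1% labeled data, 3DResT improves mIoU by 8.34 points over the fully supervised method.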

Link: https://arxiv.org/abs/2504.12599
Authors: Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei
Affiliations: Beijing Key Laboratory of Advanced Information Science and Network and Institute of Information Science, Beijing Jiaotong University; University of Illinois Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Referring Expression Segmentation (3D-RES) typically requires extensive instance-level annotations, which are time-consuming and costly. Semi-supervised learning (SSL) mitigates this by using limited labeled data alongside abundant unlabeled data, improving performance while reducing annotation costs. SSL typically uses a teacher-student paradigm in which the teacher generates pseudo-labels, filtered by high confidence, to guide the student. However, in the context of 3D-RES, where each label corresponds to a single mask and labeled data is scarce, existing SSL methods treat high-quality pseudo-labels merely as auxiliary supervision, which limits the model’s learning potential. The reliance on high-confidence thresholds for filtering often results in potentially valuable pseudo-labels being discarded, restricting the model’s ability to leverage the abundant unlabeled data. Therefore, we identify two critical challenges in semi-supervised 3D-RES, namely, inefficient utilization of high-quality pseudo-labels and wastage of useful information from low-quality pseudo-labels. In this paper, we introduce the first semi-supervised learning framework for 3D-RES, presenting a robust baseline method named 3DResT. To address these challenges, we propose two novel designs called Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW). TSCS aids in the selection of high-quality pseudo-labels, integrating them into the labeled dataset to strengthen the labeled supervision signals. QDW preserves low-quality pseudo-labels by dynamically assigning them lower weights, allowing for the effective extraction of useful information rather than discarding them. Extensive experiments conducted on the widely used benchmark demonstrate the effectiveness of our method. Notably, with only 1% labeled data, 3DResT achieves an mIoU improvement of 8.34 points compared to the fully supervised method.
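
The two ideas in this abstract, promoting consistent pseudo-labels and down-weighting the rest instead of discarding them, can be sketched in a few lines. Everything specific here is an assumption made for illustration: quality is measured as the IoU between teacher and student masks, the promotion threshold is 0.8, and the weight equals the quality score. The paper's actual TSCS and QDW rules may differ.

```python
def mask_iou(a, b):
    """IoU between two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0


def split_and_weight(teacher_masks, student_masks, promote_at=0.8):
    """Promote consistent pseudo-labels; keep the rest with a quality weight."""
    promoted, weighted = [], []
    for idx, (t_mask, s_mask) in enumerate(zip(teacher_masks, student_masks)):
        quality = mask_iou(t_mask, s_mask)  # assumed quality proxy
        if quality >= promote_at:
            promoted.append(idx)            # treated like labeled data
        else:
            weighted.append((idx, quality))  # kept, but with a lower weight
    return promoted, weighted


def weighted_unlabeled_loss(losses, weighted):
    """Weighted mean of per-sample losses over the retained pseudo-labels."""
    total_weight = sum(w for _, w in weighted)
    if total_weight == 0:
        return 0.0
    return sum(losses[i] * w for i, w in weighted) / total_weight


teacher = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0]]
student = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
promoted, weighted = split_and_weight(teacher, student)
print(promoted, weighted)
print(weighted_unlabeled_loss([0.2, 0.9, 1.5], weighted))
```

A hard confidence threshold corresponds to setting every weight to 0 or 1; the soft weights let partially correct masks still contribute a gradient.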

[CV-91] CM3AE: A Unified RGB Frame and Event-Voxel/-Frame Pre-training Framework

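[Quick read]: This paper addresses the difficulty of establishing strong connections between event camera data and RGB frames during pre-training, which limits applicability in multi-modal fusion scenarios. It proposes CM3AE, a pre-training framework for RGB-Event perception. The key components are a multi-modal fusion reconstruction module, which reconstructs the original image from fused multi-modal features and thereby explicitly strengthens the model's ability to aggregate complementary cross-modal information, and a multi-modal contrastive learning strategy, which aligns cross-modal feature representations in a shared latent space and improves multi-modal understanding and the capture of global dependencies.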

Link: https://arxiv.org/abs/2504.12576
Authors: Wentao Wu, Xiao Wang, Chenglong Li, Bo Jiang, Jin Tang, Bin Luo, Qi Liu
Affiliations: Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University; School of Artificial Intelligence, Anhui University; School of Computer Science and Technology, Anhui University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Event cameras have attracted increasing attention in recent years due to their advantages in high dynamic range, high temporal resolution, low power consumption, and low latency. Some researchers have begun exploring pre-training directly on event data. Nevertheless, these efforts often fail to establish strong connections with RGB frames, limiting their applicability in multi-modal fusion scenarios. To address these issues, we propose a novel CM3AE pre-training framework for the RGB-Event perception. This framework accepts multi-modalities/views of data as input, including RGB images, event images, and event voxels, providing robust support for both event-based and RGB-event fusion based downstream tasks. Specifically, we design a multi-modal fusion reconstruction module that reconstructs the original image from fused multi-modal features, explicitly enhancing the model’s ability to aggregate cross-modal complementary information. Additionally, we employ a multi-modal contrastive learning strategy to align cross-modal feature representations in a shared latent space, which effectively enhances the model’s capability for multi-modal understanding and capturing global dependencies. We construct a large-scale dataset containing 2,535,759 RGB-Event data pairs for the pre-training. Extensive experiments on five downstream tasks fully demonstrated the effectiveness of CM3AE. Source code and pre-trained models will be released on this https URL.
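
Aligning cross-modal features in a shared latent space is commonly done with an InfoNCE-style contrastive loss. The sketch below is one such loss written in plain Python; the choice of InfoNCE, the temperature of 0.07, and the function name `info_nce` are assumptions, since the abstract does not state the exact objective.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(rgb_embs, event_embs, temperature=0.07):
    """Mean InfoNCE loss where rgb_embs[i] and event_embs[i] form the positive pair."""
    rgb = [_normalize(v) for v in rgb_embs]
    events = [_normalize(v) for v in event_embs]
    total = 0.0
    for i, anchor in enumerate(rgb):
        # Cosine similarity of this RGB embedding to every event embedding.
        logits = [sum(a * e for a, e in zip(anchor, ev)) / temperature for ev in events]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(rgb)


rgb_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_events = [[0.9, 0.1], [0.1, 0.9]]
shuffled_events = [[0.1, 0.9], [0.9, 0.1]]

# Matching pairs give a much lower loss than mismatched pairs.
print(info_nce(rgb_batch, aligned_events))
print(info_nce(rgb_batch, shuffled_events))
```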

[CV-92] Prompt-Driven and Training-Free Forgetting Approach and Dataset for Large Language Models

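[Quick read]: As diffusion models spread in image generation, demand for privacy-compliant unlearning is growing, and this paper addresses the challenge of selective unlearning in that setting. Because diffusion models are high-dimensional and have complex feature representations, existing methods struggle to remove sensitive information while keeping non-sensitive regions consistent. The key contribution is an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, together with the ForgetMe dataset and the Entangled evaluation metric. Entangled quantifies unlearning effectiveness by assessing similarity and consistency between the target and background regions, and supports both paired and unpaired image data, enabling unsupervised evaluation. ForgetMe covers diverse real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. Selective unlearning on this dataset is carried out with LoRA fine-tuning of Stable Diffusion, which validates both ForgetMe and Entangled and establishes them as benchmarks for selective unlearning. The work offers a scalable and adaptable solution for advancing privacy-preserving generative AI.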

Link: https://arxiv.org/abs/2504.12574
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
Affiliations: Universiti Malaya; Kunming University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.

[CV-93] Parsimonious Dataset Construction for Laparoscopic Cholecystectomy Structure Segmentation

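[Quick read]: This paper addresses the high cost of data labeling in medicine, which hinders deep learning applications. To build a high-quality yet affordable laparoscopic cholecystectomy video dataset for semantic segmentation, the paper applies active learning to surgical video frame selection. The key idea is to fold dataset construction into the learning pipeline of Deep Neural Networks (DNNs): the network selects the most informative samples from newly collected data for annotation, and those samples are added to the training set, gradually improving performance and generalization. Among the informativeness measures evaluated, selection based on deep feature distances performed best on this task. With half of the data selected by active learning, the DNNs reach a mean Intersection over Union (mIoU) of 0.4349 on critical anatomies and surgical instruments, almost the same as the model trained on the full dataset (0.4374 mIoU).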

Link: https://arxiv.org/abs/2504.12573
Authors: Yuning Zhou, Henry Badgery, Matthew Read, James Bailey, Catherine Davey
Affiliations: Department of Biomedical Engineering, The University of Melbourne; School of Computing and Information Systems, The University of Melbourne; Department of HPB/UGI Surgery, St Vincent’s Hospital; Department of Surgery, The University of Melbourne
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: IEEE EMBS ISC Australia 2022

Abstract:Labeling has always been expensive in the medical context, which has hindered related deep learning applications. Our work introduces active learning in surgical video frame selection to construct a high-quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Networks (DNNs) learning pipeline to include the dataset construction workflow, which means DNNs trained on the existing dataset identify the most informative data among the newly collected data. At the same time, the DNNs’ performance and generalization ability improve over time as the newly selected and annotated data are added to the training data. We assessed different data informativeness measurements and found that deep feature distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance (0.4349 mean Intersection over Union, mIoU) as the same DNNs trained on the full dataset (0.4374 mIoU) on the critical anatomies and surgical instruments.
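
One common way to turn deep feature distances into a selection rule is greedy farthest-first (k-center) sampling: repeatedly pick the unlabeled frame whose feature vector is farthest from everything already covered. The sketch below assumes that rule and uses toy 2D features. The abstract does not say which distance-based criterion the authors used, so this is an illustration rather than their procedure.

```python
import math


def farthest_first_selection(labeled_feats, pool_feats, budget):
    """Greedily pick up to `budget` pool indices that are far from all covered features."""
    # Distance from each pool sample to its nearest already-labeled sample.
    min_dist = [min(math.dist(p, l) for l in labeled_feats) for p in pool_feats]
    remaining = set(range(len(pool_feats)))
    chosen = []
    while remaining and len(chosen) < budget:
        idx = max(sorted(remaining), key=lambda i: min_dist[i])
        chosen.append(idx)
        remaining.discard(idx)
        # The new pick now also counts as covered, so refresh the distances.
        for i in remaining:
            min_dist[i] = min(min_dist[i], math.dist(pool_feats[i], pool_feats[idx]))
    return chosen


labeled = [(0.0, 0.0)]
pool = [(0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (-4.0, 0.0)]
print(farthest_first_selection(labeled, pool, budget=2))  # → [2, 3]
```

Near-duplicate frames, which are frequent in surgical video, end up close to an already selected frame and are therefore skipped, which is the intuition behind annotating only a subset.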

[CV-94] Contour Field based Elliptical Shape Prior for the Segment Anything Model

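[Quick read]: This paper addresses the problem that existing deep learning segmentation methods, such as the Segment Anything Model (SAM), struggle to efficiently produce segmentations with elliptical shapes. It integrates an elliptical shape prior into SAM-based segmentation. The key is to build a parameterized elliptical contour field with variational methods and to use a dual algorithm to fuse image features with the elliptical prior and a spatial regularization prior, which markedly improves segmentation accuracy. Concretely, SAM is decomposed into four mathematical sub-problems, and a new network structure is designed so that SAM's segmentation output consists of elliptical regions. Experiments on specific image datasets show an improvement over the original SAM.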

Link: https://arxiv.org/abs/2504.12556
Authors: Xinyu Zhao, Jun Liu, Faqiang Wang, Li Cui, Yuping Duan
Affiliations: Laboratory of Mathematics and Complex Systems (Ministry of Education of China), School of Mathematical Sciences, Beijing Normal University, Beijing, 100875, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

[CV-95] Privacy-Preserving Operating Room Workflow Analysis using Digital Twins

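[Quick read]: This paper addresses the fact that privacy concerns limit the use of computer vision for automatically recognizing perioperative events in operating room (OR) videos, which hampers OR optimization. The key solution is a two-stage pipeline for privacy-preserving OR video analysis and event detection. The first stage uses vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) from conventional RGB video. The second stage uses the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps, to detect OR events. The approach preserves privacy while matching, and sometimes exceeding, the event detection performance of models based on raw RGB video, and it facilitates sharing de-identified data across institutions and may improve model generalizability.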

Link: https://arxiv.org/abs/2504.12552
Authors: Alejandra Perez, Han Zhang, Yu-Chun Ku, Lalithkumar Seenivasan, Roger Soberanis, Jose L. Porras, Richard Day, Jeff Jopling, Peter Najjar, Mathias Unberath
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Purpose: The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. The use of computer vision approaches for the automatic recognition of perioperative events enables identification of bottlenecks for OR optimization. However, privacy concerns limit the use of computer vision for automated event detection from OR videos, which makes privacy-preserving approaches needed for OR workflow analysis. Methods: We propose a two-stage pipeline for privacy-preserving OR video analysis and event detection. In the first stage, we leverage vision foundation models for depth estimation and semantic segmentation to generate de-identified Digital Twins (DT) of the OR from conventional RGB videos. In the second stage, we employ the SafeOR model, a fused two-stream approach that processes segmentation masks and depth maps for OR event detection. We evaluate this method on an internal dataset of 38 simulated surgical trials with five event classes. Results: Our results indicate that this DT-based approach to the OR event detection model achieves performance on par and sometimes even better than raw RGB video-based models on detecting OR events. Conclusion: DTs enable privacy-preserving OR workflow analysis, facilitating the sharing of de-identified data across institutions and they can potentially enhance model generalizability by mitigating domain-specific appearance differences.

[CV-96] Post-Hurricane Debris Segmentation Using Fine-Tuned Foundational Vision Models

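[Quick read]: This paper addresses the poor cross-region generalization of post-hurricane debris segmentation. Environmental and imaging conditions vary across regions and change the visual signature of debris, and training data is scarce, so existing methods do not transfer well. The key solution is to fine-tune pre-trained foundation vision models, achieving robust performance with a relatively small, high-quality dataset. The study introduces an open-source dataset of about 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike, and uses aggregation of labels from multiple annotators together with visual prompt engineering to reduce human bias and improve data quality. The resulting fine-tuned model, fCLIPSeg, reaches a Dice score of 0.70 on Hurricane Ida data that was entirely excluded from training, with virtually no false positives in debris-free areas. It is presented as the first event-agnostic debris segmentation model that needs only standard RGB imagery at deployment, making it suitable for rapid, large-scale post-disaster impact assessment and recovery planning.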

Link: https://arxiv.org/abs/2504.12542
Authors: Kooshan Amini, Yuhao Liu, Jamie Ellen Padgett, Guha Balakrishnan, Ashok Veeraraghavan
Affiliations: Department of Civil and Environmental Engineering, Rice University; Department of Electrical and Computer Engineering, Rice University; Ken Kennedy Institute, Rice University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures

Abstract:Timely and accurate detection of hurricane debris is critical for effective disaster response and community resilience. While post-disaster aerial imagery is readily available, robust debris segmentation solutions applicable across multiple disaster regions remain limited. Developing a generalized solution is challenging due to varying environmental and imaging conditions that alter debris’ visual signatures across different regions, further compounded by the scarcity of training data. This study addresses these challenges by fine-tuning pre-trained foundational vision models, achieving robust performance with a relatively small, high-quality dataset. Specifically, this work introduces an open-source dataset comprising approximately 1,200 manually annotated aerial RGB images from Hurricanes Ian, Ida, and Ike. To mitigate human biases and enhance data quality, labels from multiple annotators are strategically aggregated and visual prompt engineering is employed. The resulting fine-tuned model, named fCLIPSeg, achieves a Dice score of 0.70 on data from Hurricane Ida – a disaster event entirely excluded during training – with virtually no false positives in debris-free areas. This work presents the first event-agnostic debris segmentation model requiring only standard RGB imagery during deployment, making it well-suited for rapid, large-scale post-disaster impact assessments and recovery planning.
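
For readers unfamiliar with the Dice score reported above, the sketch below computes it for binary masks, alongside IoU for comparison. Returning 1.0 when both masks are empty is a convention assumed here so that a debris-free tile with no predicted debris counts as a perfect match; the paper may handle that case differently.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flat binary masks of 0/1."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


def iou_score(pred, target):
    """IoU = |A ∩ B| / |A ∪ B| for flat binary masks of 0/1."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return 1.0 if union == 0 else inter / union


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(dice_score(pred, target))  # 2 * 2 / (3 + 3)
print(iou_score(pred, target))   # 2 / 4
```

For a single pair of masks the two metrics are related by Dice = 2·IoU / (1 + IoU), so a Dice of 0.70 corresponds to an IoU of roughly 0.54.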

[CV-97] UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control

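[Quick read]: This paper addresses the challenge of generating natural and physically plausible character motion under long-horizon control, especially with diverse guidance signals. Prior approaches combine a high-level diffusion-based motion planner with a low-level physics controller, but suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. The paper proposes UniPhys, a diffusion-based behavior cloning framework whose key idea is to unify motion planning and control in a single model, enabling flexible and expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. UniPhys is trained with the Diffusion Forcing paradigm to cope with accumulated prediction errors over long sequences and with discrepancies introduced by the physics simulator, so that it can robustly generate physically plausible long-horizon motion. This design lets UniPhys generalize to a wide range of control signals, including unseen ones, without task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness.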

Link: https://arxiv.org/abs/2504.12540
Authors: Yan Wu, Korrawe Karunratanakul, Zhengyi Luo, Siyu Tang
Affiliations: ETH Zurich; Carnegie Mellon University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Generating natural and physically plausible character motion remains challenging, particularly for long-horizon control with diverse guidance signals. While prior work combines high-level diffusion-based motion planners with low-level physics controllers, these systems suffer from domain gaps that degrade motion quality and require task-specific fine-tuning. To tackle this problem, we introduce UniPhys, a diffusion-based behavior cloning framework that unifies motion planning and control into a single model. UniPhys enables flexible, expressive character motion conditioned on multi-modal inputs such as text, trajectories, and goals. To address accumulated prediction errors over long sequences, UniPhys is trained with the Diffusion Forcing paradigm, learning to denoise noisy motion histories and handle discrepancies introduced by the physics simulator. This design allows UniPhys to robustly generate physically plausible, long-horizon motions. Through guided sampling, UniPhys generalizes to a wide range of control signals, including unseen ones, without requiring task-specific fine-tuning. Experiments show that UniPhys outperforms prior methods in motion naturalness, generalization, and robustness across diverse control tasks.

[CV-98] Decision-based AI Visual Navigation for Cardiac Ultrasounds

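[Quick read]: This paper addresses the fact that diagnosing cardiac disease with echocardiography depends on expert sonographers and high-end devices, which makes it hard to offer outside hospitals. It proposes an AI-based navigation system whose key component is a decision model for identifying the inferior vena cava (IVC) of the heart. The model is trained offline on cardiac ultrasound videos, uses binary classification to determine whether the IVC is present in a given video, and is combined with a novel localization algorithm that annotates the spatial location of the IVC in real time. The approach performs well on high-quality hospital ultrasound videos and also achieves zero-shot performance on lower-quality videos from the more affordable Butterfly iQ handheld device, helping to extend ultrasound diagnostics beyond hospital settings.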

Link: https://arxiv.org/abs/2504.12535
Authors: Andy Dimnaku, Dominic Yurk, Zhiyuan Gao, Arun Padmanabhan, Mandar Aras, Yaser Abu-Mostafa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Ultrasound imaging of the heart (echocardiography) is widely used to diagnose cardiac diseases. However, obtaining an echocardiogram requires an expert sonographer and a high-quality ultrasound imaging device, which are generally only available in hospitals. Recently, AI-based navigation models and algorithms have been used to aid novice sonographers in acquiring the standardized cardiac views necessary to visualize potential disease pathologies. These navigation systems typically rely on directional guidance to predict the necessary rotation of the ultrasound probe. This paper demonstrates a novel AI navigation system that builds on a decision model for identifying the inferior vena cava (IVC) of the heart. The decision model is trained offline using cardiac ultrasound videos and employs binary classification to determine whether the IVC is present in a given ultrasound video. The underlying model integrates a novel localization algorithm that leverages the learned feature representations to annotate the spatial location of the IVC in real-time. Our model demonstrates strong localization performance on traditional high-quality hospital ultrasound videos, as well as impressive zero-shot performance on lower-quality ultrasound videos from a more affordable Butterfly iQ handheld ultrasound machine. This capability facilitates the expansion of ultrasound diagnostics beyond hospital settings. Currently, the guidance system is undergoing clinical trials and is available on the Butterfly iQ app.

[CV-99] Event Quality Score (EQS): Assessing the Realism of Simulated Event Camera Streams via Distances in Latent Space CVPR

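[Quick read]: This paper addresses the limited adoption of event cameras in deep learning-driven computer vision, which stems from the scarcity of high-quality labeled data. Simulators have been proposed to generate synthetic event data for training detection and estimation models, but because event cameras differ fundamentally in sensor design from conventional frame-based cameras, existing simulators struggle to reproduce data captured by real event cameras, which limits the practical value of simulated data. The key solution is the Event Quality Score (EQS), a metric built on activations of the RVT architecture that assesses the quality of simulated event data. Sim-to-real experiments on the DSEC driving dataset show that a higher EQS implies better generalization to real-world data after training on simulated events. Optimizing for EQS can therefore narrow the gap between simulation and reality and support the development of more realistic event camera simulators.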

Link: https://arxiv.org/abs/2504.12515
Authors: Kaustav Chanda, Aayush Atul Verma, Arpitsinh Vaghela, Yezhou Yang, Bharatesh Chakravarthi
Affiliations: Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); Fifth International Workshop on Event-Based Vision

Abstract:Event cameras promise a paradigm shift in vision sensing with their low latency, high dynamic range, and asynchronous nature of events. Unfortunately, the scarcity of high-quality labeled datasets hinders their widespread adoption in deep learning-driven computer vision. To mitigate this, several simulators have been proposed to generate synthetic event data for training models for detection and estimation tasks. However, the fundamentally different sensor design of event cameras compared to traditional frame-based cameras poses a challenge for accurate simulation. As a result, most simulated data fail to mimic data captured by real event cameras. Inspired by existing work on using deep features for image comparison, we introduce event quality score (EQS), a quality metric that utilizes activations of the RVT architecture. Through sim-to-real experiments on the DSEC driving dataset, it is shown that a higher EQS implies improved generalization to real-world data after training on simulated events. Thus, optimizing for EQS can lead to developing more realistic event camera simulators, effectively reducing the simulation gap. EQS is available at this https URL.

[CV-100] AdaVid: Adaptive Video-Language Pretraining CVPR

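[Quick read]: This paper addresses the challenge of deploying contrastive video-language pretrained models on compute-constrained edge devices, which stems from the high computational demands of existing video encoders and from their usual restriction to short clips of 4 to 64 frames. The key solution is AdaVid, a flexible architectural framework for learning efficient video encoders that can adapt their computational footprint at inference time to the available resources. At its core is an adaptive transformer block, inspired by Matryoshka Representation Learning, that lets the model adjust its hidden embedding dimension during inference. With this mechanism, AdaVid-EgoVLP matches standard EgoVLP on short video-language benchmarks at half the compute, outperforms it at equal compute, and handles longer videos through a lightweight hierarchical network.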

Link: https://arxiv.org/abs/2504.12513
Authors: Chaitanya Patel, Juan Carlos Niebles, Ehsan Adeli
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPRW 2025. Project Page: this https URL

Abstract:Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.
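
The Matryoshka idea that AdaVid builds on is that the first d coordinates of an embedding should already form a usable lower-dimensional embedding. The sketch below shows only that nested-prefix property on hand-made output vectors, truncating and re-normalizing before a cosine similarity. AdaVid applies the idea to hidden dimensions inside its transformer blocks, which this toy example does not attempt to model, and the vectors and function names are hypothetical.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm else prefix


def cosine_at_dim(a, b, dim):
    """Cosine similarity using only the first `dim` coordinates of each vector."""
    ua = truncate_and_normalize(a, dim)
    ub = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(ua, ub))


video_emb = [0.9, 0.1, 0.3, -0.2]  # hypothetical video embedding
text_emb = [0.8, 0.2, -0.1, 0.4]   # hypothetical narration embedding

# Cheaper (dim=2) and full (dim=4) similarity from the same stored vectors.
for dim in (2, 4):
    print(dim, round(cosine_at_dim(video_emb, text_emb, dim), 4))
```

A device with a tight compute budget can run at the smaller dimension and a better-provisioned one at the full dimension, without training separate models.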

[CV-101] Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis

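[Quick read]: This paper addresses the evaluation of interpretability when Multimodal Large Language Models (MLLMs) are used for visual perception tasks in the context of Human-Computer Interaction (HCI), psychology, and cognitive science. Unlike earlier work that mainly uses advanced deep learning models to predict complexity metrics from visual content, it proposes a new annotation-free analytical framework for assessing the utility of MLLMs as cognitive assistants in HCI tasks, with visual perception as a case study. The key is to take established principles from psychology and cognitive science as guiding principles for comparing and interpreting visual content, which makes it possible to benchmark MLLMs across explainability principles relevant to visual perception and to explore their potential for improving human reasoning and uncovering biases in existing human-annotated perception datasets.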

Link: https://arxiv.org/abs/2504.12511
Authors: Shravan Chaudhari, Trilokya Akula, Yoon Kim, Tom Blake
Affiliations: Amazon.com, Inc.; Johns Hopkins University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we advance the study of AI-augmented reasoning in the context of Human-Computer Interaction (HCI), psychology and cognitive science, focusing on the critical task of visual perception. Specifically, we investigate the applicability of Multimodal Large Language Models (MLLMs) in this domain. To this end, we leverage established principles and explanations from psychology and cognitive science related to complexity in human visual perception. We use them as guiding principles for the MLLMs to compare and interpret visual content. Our study aims to benchmark MLLMs across various explainability principles relevant to visual perception. Unlike recent approaches that primarily employ advanced deep learning models to predict complexity metrics from visual content, our work does not seek merely to develop a new predictive model. Instead, we propose a novel annotation-free analytical framework to assess the utility of MLLMs as cognitive assistants for HCI tasks, using visual perception as a case study. The primary goal is to pave the way for principled study in quantifying and evaluating the interpretability of MLLMs for applications in improving human reasoning capability and uncovering biases in existing perception datasets annotated by humans.

[CV-102] MobilePoser: Real-Time Full-Body Pose Estimation and 3D Human Translation from IMUs in Mobile Consumer Devices

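[Quick read]: This paper addresses the challenges of full-body motion capture with the low-cost inertial measurement units (IMUs) found in consumer devices such as phones, watches, and earbuds, including degraded online performance, poor temporal consistency, and loss of global translation due to sensor noise and drift. The key solution is MobilePoser, a system that performs kinematic pose estimation with a multi-stage deep neural network followed by a physics-based motion optimizer, achieving state-of-the-art accuracy while remaining lightweight. The paper also demonstrates its potential in areas such as health and wellness, gaming, and indoor navigation.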

Link: https://arxiv.org/abs/2504.12492
Authors: Vasco Xu, Chenfeng Gao, Henry Hoffmann, Karan Ahuja
Affiliations: University of Chicago; Northwestern University
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:There has been a continued trend towards minimizing instrumentation for full-body motion capture, going from specialized rooms and equipment, to arrays of worn sensors and recently sparse inertial pose capture methods. However, as these techniques migrate towards lower-fidelity IMUs on ubiquitous commodity devices, like phones, watches, and earbuds, challenges arise including compromised online performance, temporal consistency, and loss of global translation due to sensor noise and drift. Addressing these challenges, we introduce MobilePoser, a real-time system for full-body pose and global translation estimation using any available subset of IMUs already present in these consumer devices. MobilePoser employs a multi-stage deep neural network for kinematic pose estimation followed by a physics-based motion optimizer, achieving state-of-the-art accuracy while remaining lightweight. We conclude with a series of demonstrative applications to illustrate the unique potential of MobilePoser across a variety of fields, such as health and wellness, gaming, and indoor navigation to name a few.

[CV-103] DG-MVP: 3D Domain Generalization via Multiple Views of Point Clouds for Classification

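【Quick read】: This paper addresses 3D point cloud domain generalization, that is, how to make a model perform well on point cloud domains it has not seen. Existing methods rely mainly on point-based backbones, which discard a large number of point features through max-pooling; this is especially wasteful for point clouds that already suffer from missing points and occlusion. The paper instead uses multiple 2D projections of a 3D point cloud to alleviate the missing-point problem and extracts features with a simple yet effective convolution-based model. The key idea is that multi-view projection preserves more of the point cloud information while convolution extracts features efficiently, which improves cross-domain generalization. Experiments on the PointDA-10 and Sim-to-Real benchmarks show that the method outperforms several baselines and transfers well from the synthetic domain to the real-world domain.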

Link: https://arxiv.org/abs/2504.12456
Authors: Huantao Ren,Minmin Yang,Senem Velipasalar
Affiliations: Syracuse University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks have achieved significant success in 3D point cloud classification while relying on large-scale, annotated point cloud datasets, which are labor-intensive to build. Compared to capturing data with LiDAR sensors and then performing annotation, it is relatively easier to sample point clouds from CAD models. Yet, data sampled from CAD models is regular, and does not suffer from occlusion and missing points, which are very common for LiDAR data, creating a large domain shift. Therefore, it is critical to develop methods that can generalize well across different point cloud domains. In this paper, we focus on the 3D point cloud domain generalization problem. Existing 3D domain generalization methods employ point-based backbones to extract point cloud features. Yet, by analyzing point utilization of point-based methods and observing the geometry of point clouds from different domains, we have found that a large number of point features are discarded by point-based methods through the max-pooling operation. This is a significant waste especially considering the fact that domain generalization is more challenging than supervised learning, and point clouds are already affected by missing points and occlusion to begin with. To address these issues, we propose a novel method for 3D point cloud domain generalization, which can generalize to unseen domains of point clouds. Our proposed method employs multiple 2D projections of a 3D point cloud to alleviate the issue of missing points and involves a simple yet effective convolution-based model to extract features. The experiments, performed on the PointDA-10 and Sim-to-Real benchmarks, demonstrate the effectiveness of our proposed method, which outperforms different baselines, and can transfer well from synthetic domain to real-world domain.
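
The idea of turning a 3D point cloud into several 2D views can be illustrated with a small sketch. The abstract does not specify the projection scheme, so the three axis-aligned views, the `resolution` parameter, and the function name `project_depth_views` below are assumptions for illustration only, not the authors' pipeline.

```python
import numpy as np

def project_depth_views(points, resolution=32):
    """Project an (N, 3) point cloud onto the three axis-aligned planes.

    Returns an array of shape (3, resolution, resolution). Each pixel holds
    the largest normalized coordinate along the projection axis among the
    points that land in it, and 0 where no point lands (hypothetical scheme).
    """
    points = np.asarray(points, dtype=np.float64)
    # Normalize into the unit cube so every view shares the same pixel grid.
    mins = points.min(axis=0)
    span = np.maximum(points.max(axis=0) - mins, 1e-9)
    unit = (points - mins) / span
    pixels = np.minimum((unit * resolution).astype(int), resolution - 1)

    views = np.zeros((3, resolution, resolution))
    for axis in range(3):
        u, v = [a for a in range(3) if a != axis]
        # np.maximum.at keeps the maximum when several points share a pixel.
        np.maximum.at(views[axis], (pixels[:, u], pixels[:, v]), unit[:, axis])
    return views
```

Each resulting image could then be fed to an ordinary 2D convolutional network, which is the kind of convolution-based feature extractor the abstract refers to.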

[CV-104] 3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

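【Quick read】: This paper addresses the limited transferability of existing zero-shot 3D point cloud segmentation methods, both from seen to unseen classes and from semantic space to visual space. It introduces the 3D-PointZshotS framework, which uses latent geometric prototypes (LGPs) to improve feature generation and alignment. Specifically, LGPs are integrated into the generator through a cross-attention mechanism to enrich semantic features with fine-grained geometric detail, and a self-consistency loss makes the features robust to point-wise perturbations. In addition, visual and semantic features are re-represented in a shared space, which narrows the semantic-visual gap and eases knowledge transfer to unseen classes. On ScanNet, SemanticKITTI, and S3DIS, the method outperforms four baselines in harmonic mIoU.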

Link: https://arxiv.org/abs/2504.12442
Authors: Minmin Yang,Huantao Ren,Senem Velipasalar
Affiliations: Electrical Engineering and Computer Science Dept., Syracuse University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available on GitHub at this https URL.
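
The abstract reports results in harmonic mIoU without defining it. A minimal sketch, assuming the usual convention in generalized zero-shot segmentation (the harmonic mean of the mIoU over seen classes and the mIoU over unseen classes); the function and argument names are illustrative:

```python
def harmonic_miou(miou_seen, miou_unseen):
    """Harmonic mean of seen-class and unseen-class mIoU (assumed definition)."""
    total = miou_seen + miou_unseen
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2.0 * miou_seen * miou_unseen / total
```

Under this definition a model cannot score well by excelling on seen classes alone: `harmonic_miou(0.6, 0.2)` is about 0.3, well below the arithmetic mean of 0.4.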

[CV-105] Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

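【Quick read】: This paper addresses the severe overfitting and computational constraints that arise when adapting Vision-Language Models (VLMs) to new domains with few labeled samples. State-of-the-art methods such as low-rank reparameterization mitigate these issues but often generalize poorly and require extensive hyperparameter tuning. The paper proposes a Sparse Optimization (SO) framework that uses high sparsity to dynamically adjust very few parameters, rather than constraining updates to a fixed subspace as low-rank methods do. SO introduces two paradigms: "local sparsity and global density", which updates a minimal subset of parameters per iteration while preserving overall model expressiveness, and "local randomness and global importance", which sparsifies the gradient by random selection while pruning the first moment by importance. The combination significantly reduces overfitting and gives stable adaptation in low-data regimes. Experiments on 11 diverse datasets show state-of-the-art few-shot adaptation performance with lower memory overhead.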

Link: https://arxiv.org/abs/2504.12436
Authors: Nairouz Mrabah,Nicolas Richet,Ismail Ben Ayed,Éric Granger
Affiliations: LIVIA, ILLS, Department of Systems Engineering, ÉTS Montreal, Québec, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for "local sparsity and global density", which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for "local randomness and global importance", which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
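
The two paradigms can be made concrete with a rough sketch of a single momentum-style update. This is not the authors' optimizer: the function name `sparse_step`, the hyperparameters, the use of one `keep_ratio` for both steps, and the choice of magnitude as the importance score are all assumptions made for illustration.

```python
import numpy as np

def sparse_step(param, grad, moment, rng, keep_ratio=0.1, lr=1e-3, beta=0.9):
    """One illustrative sparse update; every hyperparameter is an assumption."""
    k = max(1, int(keep_ratio * param.size))
    # Local randomness: keep a random subset of gradient entries this iteration.
    random_mask = np.zeros(param.size, dtype=bool)
    random_mask[rng.choice(param.size, size=k, replace=False)] = True
    sparse_grad = np.where(random_mask.reshape(param.shape), grad, 0.0)
    # Accumulate the first moment from the sparsified gradient.
    moment = beta * moment + (1.0 - beta) * sparse_grad
    # Global importance: keep only the k largest-magnitude moment entries
    # (using magnitude as the importance score is an assumption of this sketch).
    threshold = np.sort(np.abs(moment), axis=None)[-k]
    moment = np.where(np.abs(moment) >= threshold, moment, 0.0)
    return param - lr * moment, moment
```

Because a different random subset is drawn at each call, few parameters move per iteration, yet over many iterations any parameter can be updated, which is one way to read "local sparsity and global density".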

[CV-106] NTIRE 2025 Challenge on Event-Based Image Deblurring: Methods and Results

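【Quick read】: This paper reports on the challenge on Event-Based Image Deblurring, whose goal is to design an event-based method that achieves high-quality image deblurring, with performance measured quantitatively by Peak Signal-to-Noise Ratio (PSNR). The task centers on using both event data and conventional images as inputs for single-image deblurring. No restrictions are placed on computational complexity or model size, in order to encourage innovative solutions.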

Link: https://arxiv.org/abs/2504.12401
Authors: Lei Sun,Andrea Alfarano,Peiqi Duan,Shaolin Su,Kaiwei Wang,Boxin Shi,Radu Timofte,Danda Pani Paudel,Luc Van Gool,Qinglin Liu,Wei Yu,Xiaoqian Lv,Lu Yang,Shuigen Wang,Shengping Zhang,Xiangyang Ji,Long Bao,Yuqiang Yang,Jinao Song,Ziyi Wang,Shuang Wen,Heng Sun,Kean Liu,Mingchen Zhong,Senyan Xu,Zhijing Sun,Jiaying Zhu,Chengjie Ge,Xingbo Wang,Yidi Liu,Xin Lu,Xueyang Fu,Zheng-Jun Zha,Dawei Fan,Dafeng Zhang,Yong Yang,Siru Zhang,Qinghua Yang,Hao Kang,Huiyuan Fu,Heng Zhang,Hongyuan Yu,Zhijuan Huang,Shuoyan Wei,Feng Li,Runmin Cong,Weiqi Luo,Mingyun Lin,Chenxu Jiang,Hongyi Liu,Lei Yu,Weilun Li,Jiajun Zhai,Tingting Lin,Shuang Ma,Sai Zhou,Zhanwen Liu,Yang Wang,Eiffel Chong,Nuwan Bandara,Thivya Kandappu,Archan Misra,Yihang Chen,Zhan Li,Weijun Yuan,Wenzhuo Wang,Boyang Yao,Zhanglu Chen,Yijing Sun,Tianjiao Wan,Zijian Gao,Qisheng Xu,Kele Xu,Yukun Zhang,Yu He,Xiaoyan Xie,Tao Fu,Yashu Gautamkumar Patel,Vihar Ramesh Jain,Divesh Basina,Rishik Ashili,Manish Kumar Manjhi,Sourav Kumar,Prinon Benny,Himanshu Ghunawat,B Sri Sairam Gautam,Anett Varghese,Abhishek Yadav
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents an overview of the NTIRE 2025 First Challenge on Event-Based Image Deblurring, detailing the proposed methodologies and corresponding results. The primary goal of the challenge is to design an event-based method that achieves high-quality image deblurring, with performance quantitatively assessed using Peak Signal-to-Noise Ratio (PSNR). Notably, there are no restrictions on computational complexity or model size. The task focuses on leveraging both events and images as inputs for single-image deblurring. A total of 199 participants registered, among whom 15 teams successfully submitted valid results, offering valuable insights into the current state of event-based image deblurring. We anticipate that this challenge will drive further advancements in event-based vision research.
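
For readers unfamiliar with the metric, PSNR follows the standard definition 10 * log10(MAX^2 / MSE). The sketch below assumes 8-bit images (peak value 255) and plain per-pixel averaging; the challenge's exact evaluation protocol (color space, cropping, averaging over images) is not stated in the abstract and is not reproduced here.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher values mean the deblurred output is closer to the sharp reference.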

[CV-107] InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework

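【Quick read】: This paper addresses two problems in subject customization: learning-based approaches, which rely mainly on U-Net architectures, generalize poorly and compromise image quality, while optimization-based approaches require subject-specific fine-tuning that inevitably degrades textual controllability. The paper proposes InstantCharacter, a scalable character customization framework built on a foundation diffusion transformer. Its key innovation is a scalable adapter with stacked transformer encoders that processes open-domain character features and interacts with the latent space of modern diffusion transformers, enabling high-fidelity character personalization while preserving textual controllability and identity consistency. To train the framework, the authors build a large-scale character dataset with samples on the order of 10 million, organized into paired and unpaired subsets so that identity consistency and textual editability are optimized simultaneously.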

Link: https://arxiv.org/abs/2504.12395
Authors: Jiale Tao,Yanbing Zhang,Qixun Wang,Yiji Cheng,Haofan Wang,Xu Bai,Zhengguang Zhou,Ruihuang Li,Linqing Wang,Chunyu Wang,Qin Lin,Qinglin Lu
Affiliations: Hunyuan, Tencent; InstantX Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech Report. Code is available at this https URL

Abstract:Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at this https URL.

[CV-108] WORLDMEM: Long-term Consistent World Simulation with Memory

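【Quick read】: This paper addresses the difficulty of maintaining long-term consistency in world simulation, in particular 3D spatial consistency, when the temporal context window is limited. It proposes WorldMem, a framework built around a memory bank whose memory units store memory frames and states (such as poses and timestamps). A state-based memory attention mechanism extracts relevant information from the memory frames, so the method can accurately reconstruct previously observed scenes even across large viewpoint or temporal gaps. By including timestamps in the states, WorldMem models not only a static world but also its evolution over time, supporting both perception and interaction within the simulated world. Experiments validate the effectiveness of the approach.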

Link: https://arxiv.org/abs/2504.12369
Authors: Zeqi Xiao,Yushi Lan,Yifan Zhou,Wenqi Ouyang,Shuai Yang,Yanhong Zeng,Xingang Pan
Affiliations: S-Lab, Nanyang Technological University; Wangxuan Institute of Computer Technology, Peking University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.

[CV-109] Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping

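【Quick read】: This paper addresses the fact that advanced machine learning and deep learning algorithms for land cover mapping from Earth Observation data usually overlook important geospatial metadata, which limits their scalability and accuracy at regional, continental, and global scales. It proposes BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a framework whose lightweight multi-layer perceptron architecture integrates both fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information into land cover classification. The model needs only the fine-grained information at inference time, which lets it disentangle region-specific from region-agnostic land cover features while remaining computationally efficient. Experiments show that jointly using fine- and coarse-grained spatial information yields the largest gains in land cover mapping performance.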

Link: https://arxiv.org/abs/2504.12368
Authors: Babak Ghassemi,Cassio Fraga-Dantas,Raffaele Gaetano,Dino Ienco,Omid Ghorbanzadeh,Emma Izquierdo-Verdiguier,Francesco Vuolo
Affiliations: University of Natural Resources and Life Sciences, Vienna, Department of Ecosystem Management, Climate and Biodiversity, Institute of Geomatics, Peter Jordan Str. 82, Vienna, 1190, Austria; TETIS, University of Montpellier, AgroParisTech, CIRAD/CNRS/INRAE, Montpellier, 34093, France; INRIA, Université de Montpellier, Antenne INRIA de l’Université de Montpellier, Montpellier, 34090, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Land use and land cover mapping from Earth Observation (EO) data is a critical tool for sustainable land and resource management. While advanced machine learning and deep learning algorithms excel at analyzing EO imagery data, they often overlook crucial geospatial metadata information that could enhance scalability and accuracy across regional, continental, and global scales. To address this limitation, we propose BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a novel deep learning framework that integrates multi-scale geospatial information into the land cover classification process. By simultaneously leveraging fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information, our lightweight multi-layer perceptron architecture learns from both during training but only requires fine-grained information for inference, allowing it to disentangle region-specific from region-agnostic land cover features while maintaining computational efficiency. To assess the quality of our framework, we use an open-access in-situ dataset and adopt several competing classification approaches commonly considered for large-scale land cover mapping. We evaluated all approaches through two scenarios: an extrapolation scenario in which training data encompasses samples from all biogeographical regions, and a leave-one-region-out scenario where one region is excluded from training. We also explore the spatial representation learned by our model, highlighting a connection between its internal manifold and the geographical information used during training. Our results demonstrate that integrating geospatial information improves land cover mapping performance, with the most substantial gains achieved by jointly leveraging both fine- and coarse-grained spatial information.

[CV-110] DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging

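【Quick read】: This paper addresses the high parameter redundancy and storage cost caused by the many fine-tuned variants of text-to-image (T2I) generation models. Conventional model merging mixes styles through static linear interpolation in parameter space, but this ignores the fact that, in T2I, many distinct models cover different styles, which can lead to incompatibility and confusion in the merged model. The paper proposes a style-promptable image generation pipeline controlled by style vectors and, on top of it, a score-distillation-based model merging paradigm (DMM) that compresses multiple models into a single versatile T2I model. The key is a reformulation of the goals and evaluation protocols of the model merging task so that controllable, arbitrary-style generation becomes possible.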

Link: https://arxiv.org/abs/2504.12364
Authors: Tianhui Song,Weixin Feng,Shuai Wang,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang
Affiliations: Nanjing University; Alibaba Group; Shanghai AI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The success of text-to-image (T2I) generation models has spurred a proliferation of model checkpoints fine-tuned from the same base model on various specialized datasets. This overwhelming production of specialized models brings new challenges of high parameter redundancy and huge storage cost, thereby necessitating the development of effective methods to consolidate and unify the capabilities of diverse powerful models into a single one. A common practice in model merging adopts static linear interpolation in the parameter space to achieve the goal of style mixing. However, it neglects a feature of the T2I generation task: numerous distinct models cover sundry styles, which may lead to incompatibility and confusion in the merged model. To address this issue, we introduce a style-promptable image generation pipeline which can accurately generate arbitrary-style images under the control of style vectors. Based on this design, we propose the score distillation based model merging paradigm (DMM), compressing multiple models into a single versatile T2I model. Moreover, we rethink and reformulate the model merging task in the context of T2I generation, by presenting new merging goals and evaluation protocols. Our experiments demonstrate that DMM can compactly reorganize the knowledge from multiple teacher models and achieve controllable arbitrary-style generation.
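
For context, the "static linear interpolation in the parameter space" that the abstract criticizes can be sketched as a weighted average of checkpoints. This illustrates the baseline only, not DMM; the dictionary-of-arrays checkpoint format and the name `merge_linear` are assumptions.

```python
import numpy as np

def merge_linear(state_dicts, weights):
    """Weighted average of checkpoints that share the same parameter names."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per checkpoint")
    total = float(sum(weights))
    merged = {}
    for name in state_dicts[0]:
        # Normalize the weights so the result stays on the same scale.
        merged[name] = sum((w / total) * sd[name]
                           for sd, w in zip(state_dicts, weights))
    return merged
```

The weights are fixed once and applied to every parameter, so the merged model has a single blended style; it cannot switch styles at generation time, which is the limitation the style-vector design targets.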

[CV-111] NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement: Methods and Results CVPR

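【Quick read】: This paper reviews a challenge on quality assessment and enhancement for short-form user-generated content (UGC). The challenge has two tracks. Track 1 targets lightweight, efficient video quality assessment (VQA) models, with an emphasis on removing the reliance on model ensembles, redundant weights, and other computationally expensive components. Track 2 introduces KwaiSR, a new short-form UGC dataset tailored for single image super-resolution that combines synthetic and real-world data. By improving model design and dataset construction, the challenge aims to improve user experience while reducing computational cost on short-form video platforms such as Kwai and TikTok.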

Link: https://arxiv.org/abs/2504.13131
Authors: Xin Li,Kun Yuan,Bingchen Li,Fengbin Guan,Yizhen Shao,Zihao Yu,Xijun Wang,Yiting Lu,Wei Luo,Suhang Yao,Ming Sun,Chao Zhou,Zhibo Chen,Radu Timofte,Yabin Zhang,Ao-Xiang Zhang,Tianwu Zhi,Jianzhao Liu,Yang Li,Jingwen Xu,Yiting Liao,Yushen Zuo,Mingyang Wu,Renjie Li,Shengyun Zhong,Zhengzhong Tu,Yufan Liu,Xiangguang Chen,Zuowei Cao,Minhao Tang,Shan Liu,Kexin Zhang,Jingfen Xie,Yan Wang,Kai Chen,Shijie Zhao,Yunchen Zhang,Xiangkai Xu,Hong Gao,Ji Shi,Yiming Bao,Xiugang Dong,Xiangsheng Zhou,Yaofeng Tu,Ying Liang,Yiwen Wang,Xinning Chai,Yuxuan Zhang,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song,Wei Sun,Kang Fu,Linhan Cao,Dandan Zhu,Kaiwei Zhang,Yucheng Zhu,Zicheng Zhang,Menghan Hu,Xiongkuo Min,Guangtao Zhai,Zhi Jin,Jiawei Wu,Wei Wang,Wenjian Zhang,Yuhai Lan,Gaoxiong Yi,Hengyuan Na,Wang Luo,Di Wu,MingYin Bai,Jiawang Du,Zilong Lu,Zhenyu Jiang,Hui Zeng,Ziguan Cui,Zongliang Gan,Guijin Tang,Xinglin Xie,Kehuan Song,Xiaoqiang Lu,Licheng Jiao,Fang Liu,Xu Liu,Puhua Chen,Ha Thu Nguyen,Katrien De Moor,Seyed Ali Amirshahi,Mohamed-Chaker Larabi,Qi Tang,Linfeng He,Zhiyong Gao,Zixuan Gao,Guohua Zhang,Zhiye Huang,Yi Deng,Qingmiao Jiang,Lu Chen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Challenge Report of NTIRE 2025; Methods from 18 Teams; Accepted by CVPR Workshop; 21 pages

Abstract:This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating reliance on model ensembles, redundant weights, and other computationally expensive components in the previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at this https URL ChallengeCVPR-NTIRE2025.

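[CV-112] Towards Cardiac MRI Foundation Models: Comprehensive Visual-Tabular Representations for Whole-Heart Assessment and Beyond

【Quick read】: The problem this paper addresses is that Cardiac Magnetic Resonance (CMR) imaging, although the gold standard for non-invasive assessment of cardiac structure and function, cannot by itself capture patient-level health factors such as demographic, metabolic, and lifestyle factors, which limits a full understanding of an individual's cardiac health and disease risk. Existing multi-modal methods also tend to rely on limited spatio-temporal data and focus on isolated clinical tasks, so they do not provide a comprehensive representation of cardiac health.

The key to the solution is ViTa, a step toward foundation models that builds a comprehensive cardiac representation from multi-modal data and gives a precise interpretation of individual disease risk. Using data from 42,000 UK Biobank participants, ViTa combines 3D+T cine stacks from short-axis and long-axis views to capture the complete cardiac cycle, and fuses these imaging data with detailed patient-level factors for context-aware insights. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional task-specific models toward a universal, patient-specific understanding of cardiac health, showing potential for clinical utility and scalability in cardiac analysis.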

Link: https://arxiv.org/abs/2504.13037
Authors: Yundi Zhang,Paul Hager,Che Liu,Suprosanna Shit,Chen Chen,Daniel Rueckert,Jiazhen Pan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cardiac magnetic resonance imaging is the gold standard for non-invasive cardiac assessment, offering rich spatio-temporal views of the cardiac anatomy and physiology. Patient-level health factors, such as demographic, metabolic, and lifestyle factors, are known to substantially influence cardiovascular health and disease risk, yet remain uncaptured by CMR alone. To holistically understand cardiac health and to enable the best possible interpretation of an individual’s disease risk, CMR and patient-level factors must be jointly exploited within an integrated framework. Recent multi-modal approaches have begun to bridge this gap, yet they often rely on limited spatio-temporal data and focus on isolated clinical tasks, thereby hindering the development of a comprehensive representation for cardiac health evaluation. To overcome these limitations, we introduce ViTa, a step toward foundation models that delivers a comprehensive representation of the heart and a precise interpretation of individual disease risk. Leveraging data from 42,000 UK Biobank participants, ViTa integrates 3D+T cine stacks from short-axis and long-axis views, enabling a complete capture of the cardiac cycle. These imaging data are then fused with detailed tabular patient-level factors, enabling context-aware insights. This multi-modal paradigm supports a wide spectrum of downstream tasks, including cardiac phenotype and physiological feature prediction, segmentation, and classification of cardiac and metabolic diseases within a single unified framework. By learning a shared latent representation that bridges rich imaging features and patient context, ViTa moves beyond traditional, task-specific models toward a universal, patient-specific understanding of cardiac health, highlighting its potential to advance clinical utility and scalability in cardiac analysis.

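[CV-113] TUMLS: Trustful Fully Unsupervised Multi-Level Segmentation for Whole Slide Images of Histology

【Quick read】: This paper addresses obstacles to the practical use of AI in histopathology: the labor-intensive annotation of Whole Slide Images (WSIs), high computational demands, and trust concerns caused by the absence of uncertainty estimation in predictions. It proposes Trustful Fully Unsupervised Multi-Level Segmentation (TUMLS). TUMLS uses an autoencoder (AE) as a feature extractor to identify tissue types in low-resolution training data, selects representative patches from each identified group based on an uncertainty measure, and then performs unsupervised nuclei segmentation on those patches at higher resolution without using any machine learning algorithm. The approach fits into clinicians' workflows by turning the examination of an entire WSI into a review of concise, interpretable cross-level insights, which speeds up the work while keeping it transparent.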

Link: https://arxiv.org/abs/2504.12718
Authors: Walid Rehamnia,Alexandra Getmanskaya,Evgeniy Vasilyev,Vadim Turlapov
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 15 figures, 3 tables, 42 references

Abstract:Digital pathology, augmented by artificial intelligence (AI), holds significant promise for improving the workflow of pathologists. However, challenges such as the labor-intensive annotation of whole slide images (WSIs), high computational demands, and trust concerns arising from the absence of uncertainty estimation in predictions hinder the practical application of current AI methodologies in histopathology. To address these issues, we present a novel trustful fully unsupervised multi-level segmentation methodology (TUMLS) for WSIs. TUMLS adopts an autoencoder (AE) as a feature extractor to identify the different tissue types within low-resolution training data. It selects representative patches from each identified group based on an uncertainty measure and then performs unsupervised nuclei segmentation in their respective higher-resolution space without using any ML algorithms. Crucially, this solution integrates seamlessly into clinicians' workflows, transforming the examination of a whole WSI into a review of concise, interpretable cross-level insights. This integration significantly enhances and accelerates the workflow while ensuring transparency. We evaluated our approach using the UPENN-GBM dataset, where the AE achieved a mean squared error (MSE) of 0.0016. Additionally, nucleus segmentation is assessed on the MoNuSeg dataset, outperforming all unsupervised approaches with an F1 score of 77.46% and a Jaccard score of 63.35%. These results demonstrate the efficacy of TUMLS in advancing the field of digital pathology.

[CV-114] Regist3R: Incremental Registration with Stereo Foundation Model

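【Quick read】: This paper addresses two main challenges in multi-view 3D reconstruction: efficient and scalable large-scale reconstruction from unordered image collections, and the high computational cost and cumulative error that global alignment causes in existing methods when scaling to many views. The proposed solution is Regist3R, a stereo foundation model tailored for incremental reconstruction. By adopting an incremental reconstruction paradigm on top of a pointmap-based foundation model, Regist3R enables large-scale 3D reconstruction from unordered, many-view image collections, significantly improves computational efficiency, and performs well in real-world scenarios.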

Link: https://arxiv.org/abs/2504.12356
Authors: Sidun Liu,Wenyu Li,Peng Qiao,Yong Dou
Affiliations: National University of Defence Technology
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

Abstract:Multi-view 3D reconstruction has remained an essential yet challenging problem in the field of computer vision. While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. Regist3R leverages an incremental reconstruction paradigm, enabling large-scale 3D reconstructions from unordered and many-view image collections. We evaluate Regist3R on public datasets for camera pose estimation and 3D reconstruction. Our experiments demonstrate that Regist3R achieves performance comparable to optimization-based methods while significantly improving computational efficiency, and outperforms existing multi-view reconstruction models. Furthermore, to assess its performance in real-world applications, we introduce a challenging oblique aerial dataset which has long spatial spans and hundreds of views. The results highlight the effectiveness of Regist3R. We also demonstrate the first attempt to reconstruct large-scale scenes encompassing thousands of views through pointmap-based foundation models, showcasing its potential for practical applications in large-scale 3D reconstruction tasks, including urban modeling, aerial mapping, and beyond.

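Artificial Intelligence

[AI-0] It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

【Quick read】: This paper asks how to design more efficient and effective deep learning architectures to strengthen foundation models. The key idea is to reconceptualize neural architectures (such as Transformers, Titans, and modern linear recurrent neural networks) as associative memory modules that learn a key-value mapping through an internal objective called attentional bias. The authors observe that most existing sequence models use only dot-product similarity or L2 regression as their attentional bias, and they propose alternative configurations together with effective approximations that stabilize training. They also reinterpret forgetting mechanisms in modern architectures as a form of retention regularization and propose a new set of forget gates for sequence models. On this basis they present Miras, a framework that designs architectures through four choices (associative memory architecture, attentional bias objective, retention gate, and memory learning algorithm), and develop three new sequence models: Moneta, Yaad, and Memora. Experiments show that different design choices within Miras yield models with different strengths in language modeling, commonsense reasoning, and recall-intensive tasks, in some cases outperforming Transformers and other modern linear recurrent models.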

Link: https://arxiv.org/abs/2504.13173
Authors: Ali Behrouz,Meisam Razaviyayn,Peilin Zhong,Vahab Mirrokni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity, or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices of: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall intensive tasks, even outperforming Transformers and other modern linear recurrent models.
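
To make the associative-memory view concrete, here is a minimal sketch of a linear memory trained online with the L2 regression objective the abstract mentions. It is not one of the Miras models: the matrix-valued memory, the scalar `retain` factor standing in for a retention gate, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

def memory_update(memory, key, value, lr=0.1, retain=1.0):
    """One gradient step on ||memory @ key - value||^2 with a decay factor.

    `retain` < 1 shrinks the old memory before writing, a simple stand-in
    for a retention (forget) gate; this scalar form is an assumption.
    """
    error = memory @ key - value
    grad = 2.0 * np.outer(error, key)  # gradient of the squared error w.r.t. memory
    return retain * memory - lr * grad
```

With an empty memory, a unit-norm key, and `lr=0.5`, a single step stores the pair exactly, so `memory @ key` returns the value; smaller steps or `retain < 1` trade exact recall for gradual writing and forgetting.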

[AI-1] RUKA: Rethinking the Design of Humanoid Hands with Learning

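【Quick read】: This paper addresses the way dexterous robotic manipulation is held back by hardware trade-offs between precision, compactness, strength, and affordability. Existing control methods force compromises on hand design and application, whereas learning-based approaches offer a chance to rethink these trade-offs, especially for tendon-driven actuation and low-cost materials. The key solution is RUKA, a compact, affordable, and capable tendon-driven humanoid hand made from 3D-printed parts and off-the-shelf components, with 15 underactuated degrees of freedom that enable diverse human-like grasps. To handle the control challenges, joint-to-actuator and fingertip-to-actuator models are learned from motion-capture data. Evaluations show better reachability, durability, and strength than other robotic hands.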

Link: https://arxiv.org/abs/2504.13165
Authors: Anya Zorin,Irmak Guzey,Billy Yan,Aadhithya Iyer,Lisa Kondrich,Nikhil X. Bhattasali,Lerrel Pinto
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Website at this https URL

Abstract:Dexterous manipulation is a fundamental capability for robotic systems, yet progress has been limited by hardware trade-offs between precision, compactness, strength, and affordability. Existing control methods impose compromises on hand designs and applications. However, learning-based approaches present opportunities to rethink these trade-offs, particularly to address challenges with tendon-driven actuation and low-cost materials. This work presents RUKA, a tendon-driven humanoid hand that is compact, affordable, and capable. Made from 3D-printed parts and off-the-shelf components, RUKA has 5 fingers with 15 underactuated degrees of freedom enabling diverse human-like grasps. Its tendon-driven actuation allows powerful grasping in a compact, human-sized form factor. To address control challenges, we learn joint-to-actuator and fingertip-to-actuator models from motion-capture data collected by the MANUS glove, leveraging the hand’s morphological accuracy. Extensive evaluations demonstrate RUKA’s superior reachability, durability, and strength compared to other robotic hands. Teleoperation tasks further showcase RUKA’s dexterous movements. The open-source design and assembly instructions of RUKA, code, and data are available at this https URL.

[AI-2] Exploring Expert Failures Improves LLM Agent Tuning

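[Quick read]: This paper addresses the poor performance of large language models (LLMs) acting as agents on complex subtasks. Rejection Sampling Fine-Tuning (RFT), which imitates successful expert trajectories and then iteratively fine-tunes on successful self-generated trajectories, works well in simple scenarios but remains limited on complex, out-of-distribution (OOD) subtasks. The key observation is that failed expert trajectories contain valuable guidance, such as plans and key actions, that can significantly improve exploration efficiency and skill acquisition. Building on this, the authors propose Exploring Expert Failures (EEF), which identifies beneficial actions in failed expert trajectories and adds them to the training data while carefully excluding potentially harmful actions so that they do not contaminate learning. Using these actions, EEF solves some previously unsolved subtasks and improves overall agent performance. On WebShop, EEF reaches a 62% win rate, ahead of RFT (53.6%) and GPT-4 (35.6%), and according to the authors it is the first method to exceed a score of 0.81 on WebShop and 81 on SciWorld.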

Link: https://arxiv.org/abs/2504.13145
Authors: Li-Cheng Lan, Andrew Bai, Minhao Cheng, Ruochen Wang, Cho-Jui Hsieh, Tianyi Zhou
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have shown tremendous potential as agents, excelling at tasks that require multiple rounds of reasoning and interactions. Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for finetuning LLMs as agents: it first imitates expert-generated successful trajectories and further improves agentic skills through iterative fine-tuning on successful, self-generated trajectories. However, since the expert (e.g., GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler scenarios, many complex subtasks remain unsolved and persistently out-of-distribution (OOD). Upon investigating these challenging subtasks, we discovered that previously failed expert trajectories can often provide valuable guidance, e.g., plans and key actions, that can significantly improve agent exploration efficiency and acquisition of critical skills. Motivated by these observations, we propose Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories and integrates them into the training dataset. Potentially harmful actions are meticulously excluded to prevent contamination of the model learning process. By leveraging the beneficial actions in expert failures, EEF successfully solves some previously unsolvable subtasks and improves agent tuning performance. Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT (53.6%) and GPT-4 (35.6%), and to the best of our knowledge, setting a new state-of-the-art as the first method to surpass a score of 0.81 in WebShop and exceed 81 in SciWorld.

[AI-3] A Multi-task Learning Balanced Attention Convolutional Neural Network Model for Few-shot Underwater Acoustic Target Recognition

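[Quick read]: This paper addresses the challenges that scarce reference samples and complex environmental interference pose for underwater acoustic target recognition (UATR). It proposes a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The key is to combine a channel attention mechanism with a multi-task learning strategy, building a shared feature extractor and multi-task classifiers that jointly optimize target classification and feature reconstruction. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise, which improves performance in few-shot settings.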

Link: https://arxiv.org/abs/2504.13102
Authors: Wei Huang, Shumeng Sun, Junpeng Lu, Zhenpeng Xu, Zhengyang Xiu, Hao Zhang
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Underwater acoustic target recognition (UATR) is of great significance for the protection of marine diversity and national defense security. The development of deep learning provides new opportunities for UATR, but faces challenges brought by the scarcity of reference samples and complex environmental interference. To address these issues, we propose a multi-task balanced channel attention convolutional neural network (MT-BCA-CNN). The method integrates a channel attention mechanism with a multi-task learning strategy, constructing a shared feature extractor and multi-task classifiers to jointly optimize target classification and feature reconstruction tasks. The channel attention mechanism dynamically enhances discriminative acoustic features such as harmonic structures while suppressing noise. Experiments on the Watkins Marine Life Dataset demonstrate that MT-BCA-CNN achieves 97% classification accuracy and 95% F1-score in 27-class few-shot scenarios, significantly outperforming traditional CNN and ACNN models, as well as popular state-of-the-art UATR methods. Ablation studies confirm the synergistic benefits of multi-task learning and attention mechanisms, while a dynamic weighting adjustment strategy effectively balances task contributions. This work provides an efficient solution for few-shot underwater acoustic recognition, advancing research in marine bioacoustics and sonar signal processing.

[AI-4] An Empirically Grounded Identifiability Theory Will Accelerate Self-Supervised Learning Research

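[Quick read]: This paper addresses the gap between theory and practice in self-supervised learning (SSL). Under the Platonic Representation Hypothesis (PRH), different SSL methods are suggested to converge to the same Platonic ideal representation, but this phenomenon lacks a precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), the paper shows that the PRH can emerge in SSL, yet current IT cannot explain SSL's empirical success. The authors therefore propose expanding IT into Singular Identifiability Theory (SITh), a broader theoretical framework covering the entire SSL pipeline. SITh is intended to give deeper insight into the implicit data assumptions of SSL and to move the field toward more interpretable and generalizable representations. Three key directions for future research are highlighted: 1) the training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.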

Link: https://arxiv.org/abs/2504.13101
Authors: Patrik Reizinger, Randall Balestriero, David Klindt, Wieland Brendel
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Self-Supervised Learning (SSL) powers many current AI systems. As research interest and investment grow, the SSL design space continues to expand. The Platonic view of SSL, following the Platonic Representation Hypothesis (PRH), suggests that despite different methods and engineering approaches, all representations converge to the same Platonic ideal. However, this phenomenon lacks precise theoretical explanation. By synthesizing evidence from Identifiability Theory (IT), we show that the PRH can emerge in SSL. However, current IT cannot explain SSL’s empirical success. To bridge the gap between theory and practice, we propose expanding IT into what we term Singular Identifiability Theory (SITh), a broader theoretical framework encompassing the entire SSL pipeline. SITh would allow deeper insights into the implicit data assumptions in SSL and advance the field towards learning more interpretable and generalizable representations. We highlight three critical directions for future research: 1) training dynamics and convergence properties of SSL; 2) the impact of finite samples, batch size, and data diversity; and 3) the role of inductive biases in architecture, augmentations, initialization schemes, and optimizers.

[AI-5] InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning SIGIR2025

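[Quick read]: This paper addresses two challenges in applying retrieval-augmented generation (RAG) to task planning: enlargability and transferability. It proposes InstructRAG, a solution built on a multi-agent meta-reinforcement learning framework. Its key components are a graph that organizes past instruction paths (sequences of correct actions), an RL-Agent that uses reinforcement learning to expand graph coverage for enlargability, and an ML-Agent that uses meta-learning to improve task generalization for transferability. The two agents are trained end-to-end to optimize overall planning performance, and experiments on four widely used task planning datasets show up to a 19.2% improvement over the best existing approach.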

Link: https://arxiv.org/abs/2504.13032
Authors: Zheng Wang, Shu Xian Teo, Jun Jie Chew, Wei Shi
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: This paper has been accepted by SIGIR 2025

Abstract:Recent advancements in large language models (LLMs) have enabled their use as agents for planning complex tasks. Existing methods typically rely on a thought-action-observation (TAO) process to enhance LLM performance, but these approaches are often constrained by the LLMs’ limited knowledge of complex tasks. Retrieval-augmented generation (RAG) offers new opportunities by leveraging external databases to ground generation in retrieved information. In this paper, we identify two key challenges (enlargability and transferability) in applying RAG to task planning. We propose InstructRAG, a novel solution within a multi-agent meta-reinforcement learning framework, to address these challenges. InstructRAG includes a graph to organize past instruction paths (sequences of correct actions), an RL-Agent with Reinforcement Learning to expand graph coverage for enlargability, and an ML-Agent with Meta-Learning to improve task generalization for transferability. The two agents are trained end-to-end to optimize overall planning performance. Our experiments on four widely used task planning datasets demonstrate that InstructRAG significantly enhances performance and adapts efficiently to new tasks, achieving up to a 19.2% improvement over the best existing approach.

[AI-6] A Virtual Machine for Arbitrary Low-Precision GPGPU Computation in LLM Serving

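[Quick read]: This paper addresses the efficiency bottleneck in serving large language models (LLMs), which demands substantial memory bandwidth and computational throughput. Existing approaches to generating low-precision kernels are limited to weight bit widths that are powers of two and perform suboptimally because high-level GPU programming abstractions restrict critical optimizations such as fine-grained register management and optimized memory access patterns. The key solution is a virtual machine (VM) for general-purpose GPU (GPGPU) computing that supports low-precision data types of arbitrary bit width while preserving GPU programmability. The VM offers a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and broad support for low-precision data types, and its programs are compiled with automatic vectorization and instruction selection. In the reported experiments, the VM achieves speedups of 1.75x, 2.61x, 1.29x, and 1.03x over Triton, Ladder, QuantLLM, and Marlin, respectively.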

Link: https://arxiv.org/abs/2504.12984
Authors: Yaoyao Ding, Bohan Hou, Xiao Zhang, Allan Lin, Tianqi Chen, Cody Yu Hao, Yida Wang, Gennady Pekhimenko
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 18 pages, 15 figures

Abstract:Serving Large Language Models (LLMs) is critical for AI-powered applications but demands substantial computational resources, particularly in memory bandwidth and computational throughput. Low-precision computation has emerged as a key technique to improve efficiency while reducing resource consumption. Existing approaches for generating low-precision kernels are limited to weight bit widths that are powers of two and suffer from suboptimal performance due to high-level GPU programming abstractions. These abstractions restrict critical optimizations, such as fine-grained register management and optimized memory access patterns, which are essential for efficient low-precision computations. In this paper, we introduce a virtual machine (VM) designed for General-Purpose GPU (GPGPU) computing, enabling support for low-precision data types with arbitrary bit widths while maintaining GPU programmability. The proposed VM features a thread-block-level programming model, a hierarchical memory space, a novel algebraic layout system, and extensive support for diverse low-precision data types. VM programs are compiled into highly efficient GPU programs with automatic vectorization and instruction selection. Extensive experiments demonstrate that our VM efficiently supports a full spectrum of low-precision data types, and outperforms state-of-the-art low-precision kernels on their supported types. Compared to existing compilers like Triton and Ladder, as well as hand-optimized kernels such as QuantLLM and Marlin, our VM achieves performance improvements of 1.75x, 2.61x, 1.29x and 1.03x, respectively.
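
To illustrate why bit widths that are not powers of two are awkward to store, here is a small CPU-side sketch of packing unsigned values of arbitrary width into bytes. It is only an illustration of the storage problem and says nothing about how the paper's VM lays out data on a GPU. The function names `pack_bits` and `unpack_bits` and the little-endian bit order are assumptions made for this example.

```python
def pack_bits(values, bits):
    """Pack unsigned integers of width `bits` into bytes (little-endian bit order)."""
    acc = 0
    for i, val in enumerate(values):
        if not 0 <= val < (1 << bits):
            raise ValueError(f"{val} does not fit in {bits} bits")
        acc |= val << (i * bits)          # place value i at bit offset i * bits
    n_bytes = (len(values) * bits + 7) // 8  # round up to whole bytes
    return acc.to_bytes(n_bytes, "little")

def unpack_bits(data, bits, count):
    """Recover `count` unsigned integers of width `bits` from packed bytes."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

weights = [5, 0, 7, 3, 1, 6, 2, 4]   # eight 3-bit values
packed = pack_bits(weights, 3)       # 24 bits, stored in 3 bytes
restored = unpack_bits(packed, 3, len(weights))
```

Eight 3-bit values occupy 3 bytes instead of 8, but individual values straddle byte boundaries, which is the kind of layout detail a kernel for such types has to handle.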

[AI-7] Transferrable Surrogates in Expressive Neural Architecture Search Spaces

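[Quick read]: This paper addresses a tension in neural architecture search (NAS): exploring expressive, broad search spaces that enable architectural innovation while evaluating architectures efficiently enough to search such spaces. The key is to train surrogate models for highly expressive NAS search spaces based on context-free grammars. Surrogates trained on zero-cost-proxy metrics and neural graph features (GRAF), or obtained by fine-tuning an off-the-shelf language model (LM), predict architecture performance well. They can be used to filter out bad architectures, which speeds up search and improves final performance, and they can even serve directly as the search objective for large speed-ups.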

Link: https://arxiv.org/abs/2504.12971
Authors: Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Project page at: this https URL

Abstract:Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate models trained either using zero-cost-proxy metrics and neural graph features (GRAF) or by fine-tuning an off-the-shelf LM have high predictive power for the performance of architectures both within and across datasets, ii) these surrogates can be used to filter out bad architectures when searching on novel datasets, thereby significantly speeding up search and achieving better final performances, and iii) the surrogates can be further used directly as the search objective for huge speed-ups.

[AI-8] QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

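[Quick read]: This paper addresses credit assignment, a fundamental challenge in multi-agent reinforcement learning (MARL). Existing methods mainly rely on value decomposition under the centralized training with decentralized execution paradigm, using neural networks to approximate the nonlinear relationship between individual Q-values and the global Q-value. These methods suffer from imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. The paper proposes a new algorithm, QLLM, whose key idea is to use large language models (LLMs) to construct credit assignment functions automatically. Specifically, it introduces the concept of TFCAF, which represents credit allocation as a direct and expressive nonlinear functional formulation, and it employs a custom coder-evaluator framework to guide LLMs in generating, verifying, and refining executable code, which significantly mitigates hallucination and shallow reasoning during inference. Experiments on several standard MARL benchmarks show that the method consistently outperforms state-of-the-art baselines, generalizes well, and is compatible with a wide range of MARL algorithms that use mixing networks.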

Link: https://arxiv.org/abs/2504.12961
Authors: Zhouyang Jiang, Bin Zhang, Airong Wei, Zhiwei Xu
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 9 pages, 7 figures

Abstract:Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, QLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of TFCAF is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed coder-evaluator framework is further employed to guide the generation, verification, and refinement of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios.

[AI-9] A Numerical Gradient Inversion Attack in Variational Quantum Neural-Networks

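[Quick read]: This paper studies the recovery of training data from the gradients of Variational Quantum Neural Networks (VQNNs). Because the number of local minima in the VQNN loss landscape grows exponentially with the number of qubits, recovering information from model gradients is harder than for classical neural networks (NNs). The paper presents a numerical scheme that successfully reconstructs real-world input training data from the gradients of trainable VQNNs. The key is a gradient inversion method that combines gradient estimation by the finite difference method with adaptive low-pass filtering, further optimized with a Kalman filter for efficient convergence. Experiments show that the algorithm can invert even batch-trained data, provided the VQNN model is sufficiently over-parameterized.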

Link: https://arxiv.org/abs/2504.12806
Authors: Georgios Papadopoulos, Shaltiel Eloul, Yash Satsangi, Jamie Heredge, Niraj Kumar, Chun-Fu Chen, Marco Pistoia
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 9 pages, 17 figures

Abstract:The loss landscape of Variational Quantum Neural Networks (VQNNs) is characterized by local minima that grow exponentially with increasing qubits. Because of this, it is more challenging to recover information from model gradients during training compared to classical Neural Networks (NNs). In this paper we present a numerical scheme that successfully reconstructs real-world, practical input training data from trainable VQNNs' gradients. Our scheme is based on gradient inversion that works by combining gradient estimation with the finite difference method and adaptive low-pass filtering. The scheme is further optimized with a Kalman filter to obtain efficient convergence. Our experiments show that our algorithm can invert even batch-trained data, provided the VQNN model is sufficiently over-parameterized.
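
The finite difference step mentioned in the abstract can be sketched as follows. This is a generic central-difference gradient estimator, not the paper's attack: `toy_loss` is a hypothetical stand-in for a circuit's output, and the adaptive low-pass filtering and Kalman filter stages are not reproduced here.

```python
import math

def finite_difference_grad(f, x, eps=1e-5):
    """Estimate the gradient of f at x with central differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps      # nudge coordinate i forward
        down[i] -= eps    # and backward
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def toy_loss(theta):
    # Hypothetical smooth trigonometric function, chosen only because
    # its exact gradient is easy to write down for comparison.
    return math.cos(theta[0]) * math.sin(theta[1])

g = finite_difference_grad(toy_loss, [0.3, 1.1])
```

Central differences have error of order `eps ** 2`, so for a smooth function the estimate agrees closely with the analytic gradient. The cost is two function evaluations per coordinate.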

[AI-10] Enhancing Explainability and Reliable Decision-Making in Particle Swarm Optimization through Communication Topologies

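[Quick read]: This paper addresses the low reliability of Particle Swarm Optimization (PSO) in complex system optimization caused by unclear configurations and hyperparameter settings. The key is to analyze how different communication topologies (Ring, Star, and Von Neumann) affect PSO's convergence and search behavior. Using IOHxplainer, an explainable benchmarking tool, the study examines how these topologies shape information flow, diversity, and convergence speed, clarifying the balance between exploration and exploitation. Visualization and statistical analysis improve the interpretability of PSO's decisions and yield practical guidelines for choosing a topology for a given optimization task, making swarm-based optimization more transparent, robust, and trustworthy.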

Link: https://arxiv.org/abs/2504.12803
Authors: Nitin Gupta, Indu Bala, Bapi Dutta, Luis Martínez, Anupam Yadav
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Swarm intelligence effectively optimizes complex systems across fields like engineering and healthcare, yet algorithm solutions often suffer from low reliability due to unclear configurations and hyperparameters. This study analyzes Particle Swarm Optimization (PSO), focusing on how different communication topologies (Ring, Star, and Von Neumann) affect convergence and search behaviors. Using an adapted IOHxplainer, an explainable benchmarking tool, we investigate how these topologies influence information flow, diversity, and convergence speed, clarifying the balance between exploration and exploitation. Through visualization and statistical analysis, the research enhances interpretability of PSO's decisions and provides practical guidelines for choosing suitable topologies for specific optimization tasks. Ultimately, this contributes to making swarm-based optimization more transparent, robust, and trustworthy.
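
For readers unfamiliar with the three topologies, the sketch below builds each particle's neighbor list under common textbook definitions. These definitions are assumptions: in particular, "Star" is taken here to mean a fully connected (global-best) swarm, whereas some authors use the term for a hub-and-spoke layout, and the Von Neumann grid is assumed to wrap around at the edges. The function names are hypothetical.

```python
def ring_neighbors(n):
    """Each particle sees its two adjacent particles on a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def star_neighbors(n):
    """Assumed reading of 'Star': every particle sees all others (gbest)."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def von_neumann_neighbors(rows, cols):
    """Each particle sees up, down, left, right on a wrap-around grid."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            nbrs[i] = [
                ((r - 1) % rows) * cols + c,   # up
                ((r + 1) % rows) * cols + c,   # down
                r * cols + (c - 1) % cols,     # left
                r * cols + (c + 1) % cols,     # right
            ]
    return nbrs

ring = ring_neighbors(5)
star = star_neighbors(4)
grid = von_neumann_neighbors(3, 3)
```

In a PSO loop, a particle's social term would use the best position found among its neighbors rather than the swarm-wide best. Sparse neighborhoods such as the ring spread information slowly and tend to preserve diversity, while a fully connected swarm converges faster.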

[AI-11] Multi-Agent Reinforcement Learning Simulation for Environmental Policy Synthesis AAMAS’25

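[Quick read]: This paper addresses the challenges that deep uncertainty, complex system dynamics, and competing stakeholder interests pose for climate policy development. The core problem is that climate simulation methods such as Earth System Models can evaluate candidate policies but are rarely used to synthesize policy pathways directly, and traditional optimization approaches struggle with non-linear dynamics, heterogeneous agents, and comprehensive uncertainty quantification. The paper proposes a framework that augments climate simulations with Multi-Agent Reinforcement Learning (MARL) to address these limitations. It identifies key challenges at the interface between climate simulation and MARL, including reward definition, scalability with increasing numbers of agents and larger state spaces, uncertainty propagation across linked systems, and solution validation, as well as the challenge of making MARL-derived solutions interpretable and useful for policy-makers. The framework lays a foundation for more sophisticated climate policy exploration while acknowledging important limitations and areas for future research.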

Link: https://arxiv.org/abs/2504.12777
Authors: James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Published in AAMAS'25 Blue Sky Ideas Track

Abstract:Climate policy development faces significant challenges due to deep uncertainty, complex system dynamics, and competing stakeholder interests. Climate simulation methods, such as Earth System Models, have become valuable tools for policy exploration. However, their typical use is for evaluating potential policies, rather than directly synthesizing them. The problem can be inverted to optimize for policy pathways, but the traditional optimization approaches often struggle with non-linear dynamics, heterogeneous agents, and comprehensive uncertainty quantification. We propose a framework for augmenting climate simulations with Multi-Agent Reinforcement Learning (MARL) to address these limitations. We identify key challenges at the interface between climate simulations and the application of MARL in the context of policy synthesis, including reward definition, scalability with increasing agents and state spaces, uncertainty propagation across linked systems, and solution validation. Additionally, we discuss challenges in making MARL-derived solutions interpretable and useful for policy-makers. Our framework provides a foundation for more sophisticated climate policy exploration while acknowledging important limitations and areas for future research.

[AI-12] MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System

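[Quick read]: This paper starts from the observation that, although AI model capabilities are advancing rapidly, these systems remain largely confined to data silos, and each new integration requires custom logic that is difficult to scale. The Model Context Protocol (MCP) addresses this by defining a universal, open standard for securely connecting AI applications (MCP clients) to data sources (MCP servers). MCP's flexibility, however, introduces new risks, including malicious tool servers and compromised data integrity. The paper presents MCP Guardian, a framework that strengthens MCP-based communication with authentication, rate limiting, logging, tracing, and Web Application Firewall (WAF) scanning. Empirical testing indicates that it mitigates attacks and provides robust oversight with minimal overhead, underscoring the value of a defense-in-depth approach for safer and more transparent innovation in AI-driven environments.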

Link: https://arxiv.org/abs/2504.12757
Authors: Sonu Kumar, Anubhav Girdhar, Ritesh Patil, Divyansh Tripathi
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:As agentic AI gains mainstream adoption, the industry invests heavily in model capabilities, achieving rapid leaps in reasoning and quality. However, these systems remain largely confined to data silos, and each new integration requires custom logic that is difficult to scale. The Model Context Protocol (MCP) addresses this challenge by defining a universal, open standard for securely connecting AI-based applications (MCP clients) to data sources (MCP servers). However, the flexibility of the MCP introduces new risks, including malicious tool servers and compromised data integrity. We present MCP Guardian, a framework that strengthens MCP-based communication with authentication, rate-limiting, logging, tracing, and Web Application Firewall (WAF) scanning. Through real-world scenarios and empirical testing, we demonstrate how MCP Guardian effectively mitigates attacks and ensures robust oversight with minimal overhead. Our approach fosters secure, scalable data access for AI assistants, underscoring the importance of a defense-in-depth approach that enables safer and more transparent innovation in AI-driven environments.
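
Of the safeguards listed, rate limiting is the easiest to show in a few lines. The sketch below is a generic token-bucket limiter and is not taken from MCP Guardian; the class name `TokenBucket`, its parameters, and the explicit `now` argument (used instead of a wall clock so the example is deterministic) are all assumptions.

```python
class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0          # spend one token for this request
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
```

Here the first two calls drain the bucket, the third arrives before a full token has been refilled and is rejected, and the fourth succeeds after enough time has passed. A proxy in front of a tool server could keep one such bucket per client identity.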

[AI-13] Trajectory Adaptation using Large Language Models

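[Quick read]: This paper addresses how to adapt robot trajectories flexibly according to human instructions in new situations, with the aim of more intuitive and scalable human-robot interaction. The key is a language-based framework built on pre-trained large language models (LLMs) that generates code as a policy to adapt trajectories produced by generic motion planners or learned from human demonstrations. This supports more complex and flexible instructions than current methods and a broader range of commands, including numerical inputs. Unlike state-of-the-art feature-based sequence-to-sequence models, the method requires no task-specific training and offers greater interpretability and more effective feedback mechanisms.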

Link: https://arxiv.org/abs/2504.12755
Authors: Anurag Maurya, Tashmoy Ghosh, Ravi Prakash
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to CoRL LangRob workshop 2024

Abstract:Adapting robot trajectories based on human instructions as per new situations is essential for achieving more intuitive and scalable human-robot interactions. This work proposes a flexible language-based framework to adapt generic robotic trajectories produced by off-the-shelf motion planners such as RRT and A-star, or learned from human demonstrations. We utilize pre-trained LLMs to adapt trajectory waypoints by generating code as a policy for dense robot manipulation, enabling more complex and flexible instructions than current methods. This approach allows us to incorporate a broader range of commands, including numerical inputs. Compared to state-of-the-art feature-based sequence-to-sequence models which require training, our method does not require task-specific training and offers greater interpretability and more effective feedback mechanisms. We validate our approach through simulation experiments on a robotic manipulator, an aerial vehicle, and a ground robot in the Pybullet and Gazebo simulation environments, demonstrating that LLMs can successfully adapt trajectories to complex human instructions.

[AI-14] GPMFS: Global Foundation and Personalized Optimization for Multi-Label Feature Selection

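[Quick read]: This paper addresses the performance bottleneck that the curse of dimensionality creates in high-dimensional multi-label learning. Existing multi-label feature selection methods mostly select global features shared across all labels and overlook the personalized needs of individual labels, which may limit their ability to capture label-specific discriminative information. The paper proposes GPMFS (Global Foundation and Personalized Optimization for Multi-Label Feature Selection). The key is to first identify global features by exploiting label correlations and then adaptively supplement each label with a personalized subset of discriminative features using a threshold-controlled strategy. The method achieves superior performance while maintaining strong interpretability and robustness, and it provides insight into label-specific strength across multi-label datasets, supporting the necessity and applicability of personalized feature selection.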

Link: https://arxiv.org/abs/2504.12740
Authors: Yifan Cao, Zhilong Mi, Ziqiao Yin, Binghui Guo, Jin Dong
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As artificial intelligence methods are increasingly applied to complex task scenarios, high-dimensional multi-label learning has emerged as a prominent research focus. At present, the curse of dimensionality remains one of the major bottlenecks in high-dimensional multi-label learning, which can be effectively addressed through multi-label feature selection methods. However, existing multi-label feature selection methods mostly focus on identifying global features shared across all labels, which overlooks personalized characteristics and specific requirements of individual labels. This global-only perspective may limit the ability to capture label-specific discriminative information, thereby affecting overall performance. In this paper, we propose a novel method called GPMFS (Global Foundation and Personalized Optimization for Multi-Label Feature Selection). GPMFS first identifies global features by exploiting label correlations, then adaptively supplements each label with a personalized subset of discriminative features using a threshold-controlled strategy. Experiments on multiple real-world datasets demonstrate that GPMFS achieves superior performance while maintaining strong interpretability and robustness. Furthermore, GPMFS provides insights into the label-specific strength across different multi-label datasets, thereby demonstrating the necessity and potential applicability of personalized feature selection approaches.

[AI-15] The Athenian Academy: A Seven-Layer Architecture Model for Multi-Agent Systems

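[Quick read]: This paper aims to systematically address challenges facing multi-agent systems (MAS) in AI art creation, such as collaboration efficiency, role allocation, environmental adaptation, and task parallelism. It proposes the seven-layer "Academy of Athens" framework, which divides MAS into seven layers: multi-agent collaboration; single-agent multi-role playing; single-agent multi-scene traversal; single-agent multi-capability incarnation; different single agents using the same large model to achieve the same target agent; a single agent using different large models to achieve the same target agent; and multi-agent synthesis of the same target agent. The key is a structured approach that improves task collaboration, cross-scene adaptation, and model fusion, validated experimentally in art creation. Future work will explore technologies such as meta-learning and federated learning to optimize collaboration mechanisms, strengthen model stability, and ensure system security.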

Link: https://arxiv.org/abs/2504.12735
Authors: Lidong Zhai, Zhijie Qiu, Xizhong Guo, Jiaqi Li
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes the “Academy of Athens” multi-agent seven-layer framework, aimed at systematically addressing challenges in multi-agent systems (MAS) within artificial intelligence (AI) art creation, such as collaboration efficiency, role allocation, environmental adaptation, and task parallelism. The framework divides MAS into seven layers: multi-agent collaboration, single-agent multi-role playing, single-agent multi-scene traversal, single-agent multi-capability incarnation, different single agents using the same large model to achieve the same target agent, single-agent using different large models to achieve the same target agent, and multi-agent synthesis of the same target agent. Through experimental validation in art creation, the framework demonstrates its unique advantages in task collaboration, cross-scene adaptation, and model fusion. This paper further discusses current challenges such as collaboration mechanism optimization, model stability, and system security, proposing future exploration through technologies like meta-learning and federated learning. The framework provides a structured methodology for multi-agent collaboration in AI art creation and promotes innovative applications in the art field.

[AI-16] SimUSER: Simulating User Behavior with Large Language Models for Recommender System Evaluation

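[Quick read]: This paper addresses the gap between offline metrics and online behavior in recommender system evaluation. Given the scarcity and limits (for example, privacy issues) of real user data, it proposes SimUSER, an agent framework that serves as a believable and cost-effective proxy for human users. The key is to identify self-consistent personas from historical data and to equip simulated users with persona, memory, perception, and brain modules so that they interact with the recommender system as real users would. SimUSER aligns more closely with genuine humans than prior work at both micro and macro levels. The paper also explores the effects of thumbnails on click rates, the exposure effect, and the impact of reviews on user engagement, and it refines recommender system parameters based on offline A/B test results, improving user engagement in the real world.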

Link: https://arxiv.org/abs/2504.12722
Authors: Nicolas Bougie, Narimasa Watanabe
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommender systems play a central role in numerous real-life applications, yet evaluating their performance remains a significant challenge due to the gap between offline metrics and online behaviors. Given the scarcity and limits (e.g., privacy issues) of real user data, we introduce SimUSER, an agent framework that serves as believable and cost-effective human proxies. SimUSER first identifies self-consistent personas from historical data, enriching user profiles with unique backgrounds and personalities. Then, central to this evaluation are users equipped with persona, memory, perception, and brain modules, engaging in interactions with the recommender system. SimUSER exhibits closer alignment with genuine humans than prior work, both at micro and macro levels. Additionally, we conduct insightful experiments to explore the effects of thumbnails on click rates, the exposure effect, and the impact of reviews on user engagement. Finally, we refine recommender system parameters based on offline A/B test results, resulting in improved user engagement in the real world.

[AI-17] TimeCapsule: Solving the Jigsaw Puzzle of Long-Term Time Series Forecasting with Compressed Predictive Representations

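[Quick read]: This paper addresses a tension in Long-term Time Series Forecasting (LTSF): deep learning models have grown increasingly complex, yet simpler architectures such as linear models or MLPs (multilayer perceptrons) often perform better. The paper revisits and organizes techniques commonly used in advanced LTSF models, such as redundancy reduction and multi-scale modeling, to use deep learning more efficiently. Its key contribution is TimeCapsule, a model built on the principle of high-dimensional information compression that unifies these techniques in a generalized yet simplified framework. TimeCapsule represents a time series as a 3D tensor with temporal, variate, and level dimensions and uses mode production to capture multi-mode dependencies while compressing dimensionality. It also performs an internal forecast within the compressed representation domain, supported by the Joint-Embedding Predictive Architecture (JEPA), to monitor the learning of predictive representations. Experiments on challenging benchmarks show that TimeCapsule can achieve state-of-the-art performance.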

Link: https://arxiv.org/abs/2504.12721
Authors: Yihang Lu, Yangyang Xu, Qitao Qing, Xianwei Meng
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Recent deep learning models for Long-term Time Series Forecasting (LTSF) often emphasize complex, handcrafted designs, while simpler architectures like linear models or MLPs have often outperformed these intricate solutions. In this paper, we revisit and organize the core ideas behind several key techniques, such as redundancy reduction and multi-scale modeling, which are frequently employed in advanced LTSF models. Our goal is to streamline these ideas for more efficient deep learning utilization. To this end, we introduce TimeCapsule, a model built around the principle of high-dimensional information compression that unifies these techniques in a generalized yet simplified framework. Specifically, we model time series as a 3D tensor, incorporating temporal, variate, and level dimensions, and leverage mode production to capture multi-mode dependencies while achieving dimensionality compression. We propose an internal forecast within the compressed representation domain, supported by the Joint-Embedding Predictive Architecture (JEPA), to monitor the learning of predictive representations. Extensive experiments on challenging benchmarks demonstrate the versatility of our method, showing that TimeCapsule can achieve state-of-the-art performance.
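
The abstract's "mode production" is read here as the tensor mode product, which multiplies a matrix along one axis of a tensor. The sketch below applies it to the temporal axis of a small T x V x L tensor stored as nested lists. It is only an illustration under that reading: in the paper the transforms are presumably learned, whereas `W` here is a fixed averaging matrix, and the function name `mode0_product` is hypothetical.

```python
def mode0_product(tensor, W):
    """Multiply a T x V x L tensor by a K x T matrix W along the first axis."""
    T = len(tensor)
    V = len(tensor[0])
    L = len(tensor[0][0])
    return [
        [
            [sum(W[a][t] * tensor[t][v][l] for t in range(T)) for l in range(L)]
            for v in range(V)
        ]
        for a in range(len(W))
    ]

# 4 time steps, 2 variates, 1 level.
x = [
    [[1.0], [10.0]],
    [[3.0], [30.0]],
    [[5.0], [50.0]],
    [[7.0], [70.0]],
]
# Fixed 2 x 4 matrix that averages adjacent pairs of time steps.
W = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
compressed = mode0_product(x, W)   # shape 2 x 2 x 1
```

The temporal axis shrinks from 4 to 2 while the variate and level axes are untouched. Applying analogous products along the other two axes would compress those dimensions in turn.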

[AI-18] Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination ICML2025

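[Quick read]: This paper addresses how to give AI systems zero-shot coordination (ZSC) ability, that is, the ability to adapt to a new partner in a cooperative task. Specialized models trained by reinforcement learning on a single task do not generalize to new tasks, even highly similar ones. The paper proposes a new paradigm, Cross-Environment Cooperation (CEC), whose key idea is that reinforcement learning with a single partner over a distribution of environments teaches general cooperative skills that support coordination with many new partners on many new problems. Two Jax-based procedural generators create billions of solvable coordination challenges. CEC outperforms competitive baselines quantitatively and qualitatively when collaborating with real people, and the findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms that are effective with different partners.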

Link: https://arxiv.org/abs/2504.12714
Authors: Kunal Jha, Wilka Carvalho, Yancheng Liang, Simon S. Du, Max Kleiman-Weiner, Natasha Jaques
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CogSci 2025, In-review for ICML 2025

Abstract:Zero-shot coordination (ZSC), the ability to adapt to a new partner in a cooperative task, is a critical component of human-compatible AI. While prior work has focused on training agents to cooperate on a single task, these specialized models do not generalize to new tasks, even if they are highly similar. Here, we study how reinforcement learning on a distribution of environments with a single partner enables learning general cooperative skills that support ZSC with many new partners on many new problems. We introduce two Jax-based, procedural generators that create billions of solvable coordination challenges. We develop a new paradigm called Cross-Environment Cooperation (CEC), and show that it outperforms competitive baselines quantitatively and qualitatively when collaborating with real people. Our findings suggest that learning to collaborate across many unique scenarios encourages agents to develop general norms, which prove effective for collaboration with different partners. Together, our results suggest a new route toward designing generalist cooperative agents capable of interacting with humans without requiring human data.

[AI-19] The Chronicles of Foundation AI for Forensics of Multi-Agent Provenance

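[Quick read]: This paper addresses the problem of tracking multi-agent content generation across the temporal dimension of a multi-agent generative chain. As generative AI moves toward autonomous agents that collaborate on complex tasks, the provenance of generated content becomes complicated: in collective creation, contributions are continuously revised, extended, or overwritten, leaving little trace of earlier contributions. The paper proposes a post hoc attribution system based on symbolic chronicles that recovers the generation history from the content alone, without relying on internal memory states or external meta-information. Its key idea is that symbolic chronicles hold signed and time-stamped records, analogous to the chain of custody in forensic science. The system runs as a feedback loop: each generation step updates the chronicle of prior interactions and synchronises it with the synthetic content during generation. The work aims to develop an accountable form of collaborative AI within evolving cyber ecosystems.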

Link: https://arxiv.org/abs/2504.12612
Authors: Ching-Chun Chang,Isao Echizen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
Comments:

Abstract:Provenance is the chronology of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi-agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi-agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta-information. At its core lies the notion of symbolic chronicles, representing signed and time-stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.

[AI-20] Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration

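[Quick read]: This paper addresses training dexterous robot manipulation skills from videos of human demonstrations. Conventional approaches collect large amounts of labelled data through wearables or teleoperation, while learning directly from unlabelled videos of human-object interaction is hindered by the lack of explicit action labels and by the morphological differences between robot and human hands. The paper proposes Human2Sim2Robot, a real-to-sim-to-real framework that uses a single RGB-D video. Its key idea is to extract two task-specific components: (1) the object pose trajectory, which defines an embodiment-agnostic reward function; and (2) the pre-manipulation hand pose, which initializes and guides exploration during reinforcement learning. The approach needs no wearables, teleoperation, or large-scale datasets, avoids task-specific reward design, and significantly improves performance on grasping, non-prehensile manipulation, and multi-step tasks.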

Link: https://arxiv.org/abs/2504.12609
Authors: Tyler Ga Wei Lum,Olivia Y. Lee,C. Karen Liu,Jeannette Bohg
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures

Abstract:Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels from videos and morphological differences between robot and human hands. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the human-robot embodiment gap without relying on wearables, teleoperation, or large-scale data collection typically necessary for imitation learning methods. From the demonstration, we extract two task-specific components: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward function, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. We found that these two components are highly effective for learning the desired task, eliminating the need for task-specific reward shaping and tuning. We demonstrate that Human2Sim2Robot outperforms object-aware open-loop trajectory replay by 55% and imitation learning with data augmentation by 68% across grasping, non-prehensile manipulation, and multi-step tasks. Project Site: this https URL
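
The idea of an object-centric, embodiment-agnostic reward can be illustrated with a small sketch. The abstract only states that the object pose trajectory defines the reward; the exponential form, the `scale` parameter, the position-only distance, and the toy trajectory below are all assumptions made for illustration.

```python
import math

def object_tracking_reward(object_pos, demo_trajectory, step, scale=10.0):
    """Reward in (0, 1] that peaks when the object matches the demonstrated position.

    Only the object's position enters the reward, so the same function applies
    to any hand or gripper. Orientation is omitted here for brevity.
    """
    target = demo_trajectory[min(step, len(demo_trajectory) - 1)]
    distance = math.dist(object_pos, target)
    return math.exp(-scale * distance)

# Hypothetical object positions (metres) extracted from one human video.
demo = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.1), (0.1, 0.0, 0.2)]
on_track = object_tracking_reward((0.0, 0.0, 0.1), demo, step=1)
off_track = object_tracking_reward((0.3, 0.0, 0.1), demo, step=1)
```

Because the reward never looks at the hand, a policy is free to reach the demonstrated object motion with whatever grasp suits the robot's morphology.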

[AI-21] Code Copycat Conundrum: Demystifying Repetition in LLM-based Code Generation

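[Quick read]: This paper addresses code repetition, a widespread problem in code generated by Large Language Models (LLMs): models tend to produce structurally redundant code, which reduces both efficiency and readability. The paper proposes DeRep, a rule-based technique whose key idea is to detect and mitigate repetition in generated code to improve its quality. Quantitative and qualitative analyses show that repetition is pervasive and appears at several granularities (character, statement, and block level), and the authors summarize 20 repetition patterns. Experiments show that DeRep clearly outperforms baselines at reducing repetition (average improvements of 91.3%, 93.5%, and 79.9% on the rep-3, rep-line, and sim-line metrics) and at improving code quality (Pass@1 up 208.3% over greedy search), and that it further strengthens existing repetition mitigation methods.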

Link: https://arxiv.org/abs/2504.12608
Authors: Mingwei Liu,Juntao Li,Ying Wang,Xueying Du,Zuoyu Ou,Qiuyuan Chen,Bingxu An,Zhao Wei,Yong Xu,Fangming Zou,Xin Peng,Yiling Lou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite recent advances in Large Language Models (LLMs) for code generation, the quality of LLM-generated code still faces significant challenges. One significant issue is code repetition, which refers to the model’s tendency to generate structurally redundant code, resulting in inefficiencies and reduced readability. To address this, we conduct the first empirical study to investigate the prevalence and nature of repetition across 19 state-of-the-art code LLMs using three widely-used benchmarks. Our study includes both quantitative and qualitative analyses, revealing that repetition is pervasive and manifests at various granularities and extents, including character, statement, and block levels. We further summarize a taxonomy of 20 repetition patterns. Building on our findings, we propose DeRep, a rule-based technique designed to detect and mitigate repetition in generated code. We evaluate DeRep using both open-source benchmarks and in an industrial setting. Our results demonstrate that DeRep significantly outperforms baselines in reducing repetition (with average improvements of 91.3%, 93.5%, and 79.9% in rep-3, rep-line, and sim-line metrics) and enhancing code quality (with a Pass@1 increase of 208.3% over greedy search). Furthermore, integrating DeRep improves the performance of existing repetition mitigation methods, with Pass@1 improvements ranging from 53.7% to 215.7%.
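
To make the notion of measuring and mitigating repetition concrete, here is a minimal sketch. The abstract does not define rep-3 or list DeRep's rules, so `rep_n` uses one common n-gram repetition ratio that may differ from the paper's metric, and `collapse_repeated_lines` is a single hypothetical rule rather than DeRep itself.

```python
def rep_n(tokens, n=3):
    """Share of n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def collapse_repeated_lines(code, max_repeats=1):
    """Keep at most `max_repeats` consecutive copies of an identical non-blank line."""
    kept, previous, run = [], None, 0
    for line in code.splitlines():
        key = line.strip()
        # Extend the current run only when a non-blank line repeats its predecessor.
        run = run + 1 if key and key == previous else 1
        previous = key
        if run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)

generated = "x = 1\nprint(x)\nprint(x)\nprint(x)\nreturn x"
cleaned = collapse_repeated_lines(generated)
```

A rule like this only handles statement-level repetition; the character-level and block-level cases mentioned in the abstract would need their own rules.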

[AI-22] Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients ICME2025

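[Quick read]: This paper addresses model bias in Federated Learning (FL) caused by clients that may report inaccurate training data volumes. Existing methods usually aggregate models by averaging them with weights proportional to each client's data volume, but the server cannot easily verify that the reported volumes are truthful. The paper proposes a secure Federated Data quantity-aware weighted averaging method (FedDua). Its key idea is to let the server accurately predict each client's actual training data volume by analyzing the local model gradients the client uploads. The method integrates seamlessly into any FL algorithm that involves server-side model aggregation and improves global model performance by 3.17% on average when clients misreport their data volumes.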

Link: https://arxiv.org/abs/2504.12577
Authors: Leming Wu,Yaochu Jin,Kuangrong Hao,Han Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: The paper has been accepted by ICME 2025

Abstract:Federated learning (FL) enables collaborative training of deep learning models without requiring data to leave local clients, thereby preserving client privacy. The aggregation process on the server plays a critical role in the performance of the resulting FL model. The most commonly used aggregation method is weighted averaging based on the amount of data from each client, which is thought to reflect each client’s contribution. However, this method is prone to model bias, as dishonest clients might report inaccurate training data volumes to the server, which is hard to verify. To address this issue, we propose a novel secure Federated Data quantity-aware weighted averaging method (FedDua). It enables FL servers to accurately predict the amount of training data from each client based on the local model gradients they upload. Furthermore, it can be seamlessly integrated into any FL algorithm that involves server-side model aggregation. Extensive experiments on three benchmarking datasets demonstrate that FedDua improves the global model performance by an average of 3.17% compared to four popular FL aggregation methods in the presence of inaccurate client data volume declarations.
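
The vulnerability described above can be seen in a few lines. This sketch shows plain data-size-weighted averaging and how one inflated report drags the aggregate; it does not implement FedDua's gradient-based volume prediction, and the client updates and sizes are made-up numbers.

```python
def weighted_average(client_updates, data_sizes):
    """Average per-client parameter vectors, weighting each by its reported data size."""
    total = sum(data_sizes)
    dim = len(client_updates[0])
    return [
        sum(update[d] * size for update, size in zip(client_updates, data_sizes)) / total
        for d in range(dim)
    ]

# Three hypothetical clients, each actually holding 100 samples; the third is dishonest.
updates = [[1.0, 0.0], [1.0, 0.0], [-5.0, 4.0]]
honest = weighted_average(updates, [100, 100, 100])
inflated = weighted_average(updates, [100, 100, 1800])  # third client claims 1800
```

With honest reports the third client contributes a third of the aggregate; by claiming 1800 samples it controls 90% of it, which is the bias a server-side estimate of the true volume is meant to remove.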

[AI-23] TraCeS: Trajectory Based Credit Assignment From Sparse Safety Feedback

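[Quick read]: This paper addresses safe reinforcement learning (Safe RL) when the true safety definition is unknown and hard to specify (for example, the cost functions and budgets in safety constraints), so that an effective safe policy has to be learned from sparsely labelled data. The key idea is a safety model that performs credit assignment to estimate the impact of each decision step on overall safety, trained on a dataset of diverse trajectories with binary safety labels (whether each trajectory is safe). The model learns a separate safety score for each timestep, which lets the authors reformulate the safe RL problem and derive an effective algorithm for optimizing a policy that is both safe and rewarding. Empirical results confirm that the approach is effective at satisfying unknown safety definitions and applies to a variety of continuous control tasks.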

Link: https://arxiv.org/abs/2504.12557
Authors: Siow Meng Low,Akshat Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In safe reinforcement learning (RL), auxiliary safety costs are used to align the agent to safe decision making. In practice, safety constraints, including cost functions and budgets, are unknown or hard to specify, as it requires anticipation of all possible unsafe behaviors. We therefore address a general setting where the true safety definition is unknown, and has to be learned from sparsely labeled data. Our key contributions are: first, we design a safety model that performs credit assignment to estimate each decision step’s impact on the overall safety using a dataset of diverse trajectories and their corresponding binary safety labels (i.e., whether the corresponding trajectory is safe/unsafe). Second, we illustrate the architecture of our safety model to demonstrate its ability to learn a separate safety score for each timestep. Third, we reformulate the safe RL problem using the proposed safety model and derive an effective algorithm to optimize a safe yet rewarding policy. Finally, our empirical results corroborate our findings and show that this approach is effective in satisfying unknown safety definitions, and scalable to various continuous control tasks.

[AI-24] Anonymous Public Announcements

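[Quick read]: This paper studies the logical properties of anonymous public announcements and their expressive power within epistemic logic. It first shows that, when no assumption is made about the announcer's intentions, the logic with an anonymous public announcement operator reduces to standard epistemic logic, which makes this case relatively simple. When there is common knowledge of the announcer's intention to stay anonymous, the problem becomes more complex and more interesting, and resembles the notion of a safe announcement from the Russian Cards puzzle. The key is to analyze how anonymity affects the spread of information and the knowledge states of the recipients, and to show that the very intention to stay anonymous can reveal additional information. The main results are formal expressivity results and axiomatic completeness proofs for the key logical languages.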

Link: https://arxiv.org/abs/2504.12546
Authors: Thomas Ågotnes,Rustam Galimullin,Ken Satoh,Satoshi Tojo
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:We formalise the notion of an anonymous public announcement in the tradition of public announcement logic. Such announcements can be seen as in-between a public announcement from “the outside” (an announcement of $\phi$) and a public announcement by one of the agents (an announcement of $K_a\phi$): we get more information than just $\phi$, but not (necessarily) about exactly who made it. Even if such an announcement is prima facie anonymous, depending on the background knowledge of the agents it might reveal the identity of the announcer: if I post something on a message board, the information might reveal who I am even if I don't sign my name. Furthermore, like in the Russian Cards puzzle, if we assume that the announcer's intention was to stay anonymous, that in fact might reveal more information. In this paper we first look at the case when no assumptions about intentions are made, in which case the logic with an anonymous public announcement operator is reducible to epistemic logic. We then look at the case when we assume common knowledge of the intention to stay anonymous, which is both more complex and more interesting: in several ways it boils down to the notion of a “safe” announcement (again, similarly to Russian Cards). Main results include formal expressivity results and axiomatic completeness for key logical languages.
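
The "in-between" position of an anonymous announcement can be illustrated on a tiny Kripke model. This sketch assumes one natural reading, namely that an anonymous announcement of p amounts to announcing "some agent knows p"; the paper's exact semantics may differ. The four worlds, the two partitions, and all names are invented for the example, and the update only filters worlds without restricting the partitions afterwards.

```python
WORLDS = {"w1", "w2", "w3", "w4"}
P_TRUE = {"w1", "w2", "w3"}          # worlds where the proposition p holds
CELLS = {                            # each agent's information partition
    "a": [{"w1"}, {"w2", "w3", "w4"}],
    "b": [{"w2"}, {"w1", "w3", "w4"}],
}

def knows(agent, truth_set, world):
    """The agent knows the proposition at `world` if it holds throughout the agent's cell."""
    cell = next(c for c in CELLS[agent] if world in c)
    return cell <= truth_set

def announce(worlds, holds):
    """Public announcement update: keep only the worlds where the announced formula holds."""
    return {w for w in worlds if holds(w)}

after_outside = announce(WORLDS, lambda w: w in P_TRUE)             # announce p
after_agent_a = announce(WORLDS, lambda w: knows("a", P_TRUE, w))   # announce K_a p
after_anonymous = announce(                                         # announce "someone knows p"
    WORLDS, lambda w: any(knows(agent, P_TRUE, w) for agent in CELLS)
)
```

In this model the anonymous announcement leaves `{"w1", "w2"}`: it rules out `w3`, where p is true but nobody knows it, yet keeps `w2`, which an announcement signed by agent `a` would eliminate.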

[AI-25] Generalization through variance: how noise shapes inductive biases in diffusion models ICLR2025

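[Quick read]: This paper addresses how diffusion models generalize beyond their training set, and in particular why these models perform well when the target distribution does not exactly match the training distribution. The key is to explain the mechanism of "generalization through variance", which stems from a property of the denoising score matching (DSM) objective: its target is not the true score function of the training distribution but a noisy quantity that equals the true score only in expectation. Using a mathematical theory based on a physics-inspired path integral approach, the paper analyzes the distributions learned by several paradigmatic under- and overparameterized diffusion models. It finds that diffusion models tend to learn a distribution that resembles the training distribution but with the "gaps" filled in, and attributes this inductive bias to the covariance structure of the noisy target used during training. The paper also examines how this inductive bias interacts with feature-related inductive biases.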

Link: https://arxiv.org/abs/2504.12532
Authors: John J. Vastola
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025

Abstract:How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train diffusion models is the score function of the training distribution; and the networks usually used to learn the score function are expressive enough to learn this score to high accuracy. We claim that a certain feature of the DSM objective – the fact that its target is not the training distribution’s score, but a noisy quantity only equal to it in expectation – strongly impacts whether and to what extent diffusion models generalize. In this paper, we develop a mathematical theory that partly explains this ‘generalization through variance’ phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with ‘gaps’ filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training. We also characterize how this inductive bias interacts with feature-related inductive biases.
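
The statement that the DSM target equals the score only in expectation can be checked numerically in one dimension. The sketch below assumes Gaussian noising of a finite training set with a fixed `sigma`; the two training points and the query point are arbitrary illustrative values, not anything from the paper.

```python
import math

def dsm_targets(x, train_points, sigma):
    """Per-sample DSM targets at x: one value for each training point that could have produced x."""
    return [(xi - x) / sigma ** 2 for xi in train_points]

def noised_score(x, train_points, sigma):
    """Score of the training set blurred by N(0, sigma^2), i.e. the posterior mean of the targets."""
    log_w = [-(x - xi) ** 2 / (2 * sigma ** 2) for xi in train_points]
    m = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(l - m) for l in log_w]
    z = sum(w)
    return sum(wi / z * t for wi, t in zip(w, dsm_targets(x, train_points, sigma)))

train = [-1.0, 1.0]
sigma = 0.5
x = 0.2
targets = dsm_targets(x, train, sigma)   # individual targets disagree strongly
score = noised_score(x, train, sigma)    # their posterior-weighted mean
```

At this query point the two per-sample targets have opposite signs, while their weighted mean is the smooth score; the spread between them is the target variance that the paper argues shapes what the model learns.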

[AI-26] Is Trust Correlated With Explainability in AI? A Meta-Analysis

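[Quick read]: The core question of this paper is whether the common assumption that explainability in Artificial Intelligence (AI) systems necessarily increases user trust actually holds, and what the real relationship between explainability and trust is. The key to the approach is a meta-analysis that pools data from 90 studies and finds a statistically significant but moderate positive correlation between the explainability of AI systems and user trust. Explainability therefore helps build trust, but it is neither the sole nor the dominant factor. Beyond its academic contribution to Explainable AI (XAI), the study highlights socio-technical implications for promoting accountability and user trust in critical domains such as healthcare and justice, and calls for addressing algorithmic bias and ethical transparency to achieve equitable and sustainable AI adoption. Rather than pursuing short-term trust alone, it emphasizes fostering authentic and enduring trustworthiness in AI systems.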

Link: https://arxiv.org/abs/2504.12529
Authors: Zahra Atf,Peter R. Lewis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9 pages, 1 figure

Abstract:This study critically examines the commonly held assumption that explicability in artificial intelligence (AI) systems inherently boosts user trust. Utilizing a meta-analytical approach, we conducted a comprehensive examination of the existing literature to explore the relationship between AI explainability and trust. Our analysis, incorporating data from 90 studies, reveals a statistically significant but moderate positive correlation between the explainability of AI systems and the trust they engender among users. This indicates that while explainability contributes to building trust, it is not the sole or predominant factor in this equation. In addition to academic contributions to the field of Explainable AI (XAI), this research highlights its broader socio-technical implications, particularly in promoting accountability and fostering user trust in critical domains such as healthcare and justice. By addressing challenges like algorithmic bias and ethical transparency, the study underscores the need for equitable and sustainable AI adoption. Rather than focusing solely on immediate trust, we emphasize the normative importance of fostering authentic and enduring trustworthiness in AI systems.
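
For readers unfamiliar with how a meta-analysis pools correlations, here is a minimal sketch using Fisher's z-transform with fixed-effect weights. The abstract does not say which pooling model the authors used (a random-effects model is also common), and the `(r, n)` pairs below are hypothetical values, not data from the 90 studies.

```python
import math

def pooled_correlation(studies):
    """Fixed-effect pooled correlation via Fisher's z-transform.

    `studies` is a list of (r, n) pairs: a correlation and its sample size.
    Each z = atanh(r) is weighted by n - 3, the inverse of its sampling variance.
    """
    weights = [n - 3 for _, n in studies]
    z_mean = sum(w * math.atanh(r) for w, (r, _) in zip(weights, studies)) / sum(weights)
    se = 1.0 / math.sqrt(sum(weights))
    ci = (math.tanh(z_mean - 1.96 * se), math.tanh(z_mean + 1.96 * se))
    return math.tanh(z_mean), ci

# Hypothetical (r, n) pairs for illustration only.
toy_studies = [(0.10, 50), (0.30, 120), (0.25, 80)]
r_pooled, (ci_low, ci_high) = pooled_correlation(toy_studies)
```

A pooled estimate whose confidence interval excludes zero but sits well below 1 is the kind of result the abstract summarizes as "statistically significant but moderate".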

[AI-27] Continual Learning Strategies for 3D Engineering Regression Problems: A Benchmarking Study

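[Quick read]: This paper addresses two main problems facing machine learning in engineering: datasets are limited while methods are computationally expensive, and models must keep learning without forgetting existing knowledge as engineering data evolves with new designs and constraints. The key to the approach is Continual Learning (CL), which lets a model learn from sequential data while mitigating catastrophic forgetting, that is, the loss of previously learned mappings. The authors evaluate several CL methods on five engineering datasets and construct nine new engineering CL benchmarks to measure how well the methods reduce forgetting and improve generalization. In several benchmarks the Replay strategy matches the performance of retraining from scratch while cutting training time by nearly half, which demonstrates its potential for real-world engineering workflows.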

Link: https://arxiv.org/abs/2504.12503
Authors: Kaira M. Samuel,Faez Ahmed
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Engineering problems that apply machine learning often involve computationally intensive methods but rely on limited datasets. As engineering data evolves with new designs and constraints, models must incorporate new knowledge over time. However, high computational costs make retraining models from scratch infeasible. Continual learning (CL) offers a promising solution by enabling models to learn from sequential data while mitigating catastrophic forgetting, where a model forgets previously learned mappings. This work introduces CL to engineering design by benchmarking several CL methods on representative regression tasks. We apply these strategies to five engineering datasets and construct nine new engineering CL benchmarks to evaluate their ability to address forgetting and improve generalization. Preliminary results show that applying existing CL methods to these tasks improves performance over naive baselines. In particular, the Replay strategy achieved performance comparable to retraining in several benchmarks while reducing training time by nearly half, demonstrating its potential for real-world engineering workflows. The code and datasets used in this work will be available at: this https URL.
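
The Replay strategy mentioned above is commonly implemented as a small memory of past samples that is mixed into each new training batch. The abstract does not describe the benchmark's implementation, so the following is a generic sketch: the reservoir-sampling buffer, the class and method names, and the toy task data are all assumptions.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Store the sample so that every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_batch, replay_size):
        """Return the new batch plus up to `replay_size` stored samples from earlier tasks."""
        k = min(replay_size, len(self.samples))
        return list(new_batch) + self.rng.sample(self.samples, k)

buffer = ReplayBuffer(capacity=4)
task_a = [("design_a", i) for i in range(10)]
for sample in task_a:
    buffer.add(sample)

task_b_batch = [("design_b", i) for i in range(3)]
batch = buffer.mixed_batch(task_b_batch, replay_size=2)
```

Training on `batch` rather than on `task_b_batch` alone keeps a trickle of earlier designs in every update, which is far cheaper than retraining on all past data.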

[AI-28] Heuristic Recognition and Rapid Response to Unfamiliar Events Outside of Agent Design Scope

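[Quick read]: This paper asks how an agent in an open world can respond reasonably to unfamiliar situations outside its original design scope, and how it can recognize such situations quickly and reliably enough to determine reasonable, adaptive behavior. The key to the proposed approach is to combine domain-general meta-knowledge (in the form of appraisals inspired by human cognition) with metareasoning. This combination has the potential to provide fast, adaptive responses to unfamiliar situations and thus to better meet the performance characteristics required of open-world, general agents.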

Link: https://arxiv.org/abs/2504.12497
Authors: Robert E. Wray,Steven J. Jones,John E. Laird
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures. Submitted to AGI25 conference

Abstract:Regardless of past learning, an agent in an open world will face unfamiliar situations and events outside of prior experience, existing models, or policies. Further, the agent will sometimes lack relevant knowledge and/or sufficient time to assess the situation, generate and evaluate options, and pursue a robustly considered course of action. How can an agent respond reasonably to situations that are outside of its original design scope? How can it recognize such situations sufficiently quickly and reliably to determine reasonable, adaptive courses of action? We identify key characteristics needed for solutions, evaluate the state-of-the-art by these requirements, and outline a proposed, novel approach that combines domain-general meta-knowledge (in the form of appraisals inspired by human cognition) and metareasoning. It has the potential to provide fast, adaptive responses to unfamiliar situations, more fully meeting the performance characteristics required for open-world, general agents.

[AI-29] Co-Writing with AI on Human Terms: Aligning Research with User Demands Across the Writing Process

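[Quick read]: This paper asks how, when writers use generative AI tools, their creative autonomy can be balanced against tool assistance, and seeks a systematic understanding of how AI support affects each stage of the writing process and shapes writers' agency. Research on this question is still limited and leaves a knowledge gap.

The key to the approach is a design framework grounded in the cognitive writing process that maps AI writing support strategies onto the four core cognitive processes of writing: planning, translating, reviewing, and monitoring. From a systematic review of 109 Human-Computer Interaction (HCI) papers and interviews with 15 writers from different domains, the paper identifies four main AI writing support strategies: structured guidance, guided exploration, active co-writing, and critical feedback. The study also shows that writers' desired level of AI intervention varies with the writing stage, the domain, and personal goals, and stresses that design should be user-centred and attend to writers' concerns about ownership and their expectations of the interaction with AI. Finally, the paper offers practical design guidance for human-centred tools for co-writing with AI.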

Link: https://arxiv.org/abs/2504.12488
Authors: Mohi Reza,Jeb Thomas-Mitchell,Peter Dushniku,Nathan Laundry,Joseph Jay Williams,Anastasia Kuzminykh
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:As generative AI tools like ChatGPT become integral to everyday writing, critical questions arise about how to preserve writers’ sense of agency and ownership when using these tools. Yet, a systematic understanding of how AI assistance affects different aspects of the writing process - and how this shapes writers’ agency - remains underexplored. To address this gap, we conducted a systematic review of 109 HCI papers using the PRISMA approach. From this literature, we identify four overarching design strategies for AI writing support: structured guidance, guided exploration, active co-writing, and critical feedback - mapped across the four key cognitive processes in writing: planning, translating, reviewing, and monitoring. We complement this analysis with interviews of 15 writers across diverse domains. Our findings reveal that writers’ desired levels of AI intervention vary across the writing process: content-focused writers (e.g., academics) prioritize ownership during planning, while form-focused writers (e.g., creatives) value control over translating and reviewing. Writers’ preferences are also shaped by contextual goals, values, and notions of originality and authorship. By examining when ownership matters, what writers want to own, and how AI interactions shape agency, we surface both alignment and gaps between research and user needs. Our findings offer actionable design guidance for developing human-centered writing tools for co-writing with AI, on human terms.

[AI-30] Agentic AI Optimisation (AAIO): what it is, how it works, why it matters, and how to deal with it

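[Quick read]: This paper addresses how to optimise seamless interaction between Agentic Artificial Intelligence (AAI) systems and online platforms, and proposes a new optimisation paradigm, Agentic AI Optimisation (AAIO). The key idea of AAIO is to define how autonomous AI agents and online platforms interact so that websites integrate effectively with AAI systems, much as Search Engine Optimisation (SEO) has shaped the discoverability of digital content. The paper stresses the mutual dependency between website optimisation and the success of AAI and describes the virtuous cycle that AAIO can create. It also discusses the governance, ethical, legal, and social implications (GELSI) of AAIO and calls for proactive regulatory frameworks to mitigate potential negative impacts. Finally, it argues that AAIO is an essential part of fundamental digital infrastructure in the era of autonomous digital agents and advocates equitable and inclusive access to its benefits.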

Link: https://arxiv.org/abs/2504.12482
Authors: Luciano Floridi,Carlotta Buttaboni,Emmie Hine,Jessica Morley,Claudio Novelli,Tyler Schroder
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The emergence of Agentic Artificial Intelligence (AAI) systems capable of independently initiating digital interactions necessitates a new optimisation paradigm designed explicitly for seamless agent-platform interactions. This article introduces Agentic AI Optimisation (AAIO) as an essential methodology for ensuring effective integration between websites and agentic AI systems. Like how Search Engine Optimisation (SEO) has shaped digital content discoverability, AAIO can define interactions between autonomous AI agents and online platforms. By examining the mutual interdependency between website optimisation and agentic AI success, the article highlights the virtuous cycle that AAIO can create. It further explores the governance, ethical, legal, and social implications (GELSI) of AAIO, emphasising the necessity of proactive regulatory frameworks to mitigate potential negative impacts. The article concludes by affirming AAIO’s essential role as part of a fundamental digital infrastructure in the era of autonomous digital agents, advocating for equitable and inclusive access to its benefits.

[AI-31] What do people expect from Artificial Intelligence? Public opinion on alignment in AI moderation from Germany and the United States

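[Quick read]: This paper examines how strongly the public supports functional features of, and social expectations for, generative AI systems, and how that support differs across countries. The key is an empirical study of public support in Germany and the United States for four alignment goals: accuracy and reliability, safety, bias mitigation, and the promotion of aspirational imaginaries. It also analyzes how individual experience with AI, attitudes toward free speech, political ideology, partisan affiliation, and gender shape these preferences. Support for all alignment features is significantly higher in the United States, while in Germany support for more normatively charged goals such as fairness and aspirational imaginaries is more cautious. The paper stresses the value of grounding AI alignment debates empirically in public attitudes and of bringing normative expectations into theoretical and policy discussions on the governance of AI-generated content, and it contributes to debates on AI governance and cross-national variation in public preferences.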

Link: https://arxiv.org/abs/2504.12476
Authors: Andreas Jungherr,Adrian Rauchfleisch
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in generative Artificial Intelligence have raised public awareness, shaping expectations and concerns about their societal implications. Central to these debates is the question of AI alignment – how well AI systems meet public expectations regarding safety, fairness, and social values. However, little is known about what people expect from AI-enabled systems and how these expectations differ across national contexts. We present evidence from two surveys of public preferences for key functional features of AI-enabled systems in Germany (n = 1800) and the United States (n = 1756). We examine support for four types of alignment in AI moderation: accuracy and reliability, safety, bias mitigation, and the promotion of aspirational imaginaries. U.S. respondents report significantly higher AI use and consistently greater support for all alignment features, reflecting broader technological openness and higher societal involvement with AI. In both countries, accuracy and safety enjoy the strongest support, while more normatively charged goals – like fairness and aspirational imaginaries – receive more cautious backing, particularly in Germany. We also explore how individual experience with AI, attitudes toward free speech, political ideology, partisan affiliation, and gender shape these preferences. AI use and free speech support explain more variation in Germany. In contrast, U.S. responses show greater attitudinal uniformity, suggesting that higher exposure to AI may consolidate public expectations. These findings contribute to debates on AI governance and cross-national variation in public preferences. More broadly, our study demonstrates the value of empirically grounding AI alignment debates in public attitudes and of explicitly developing normatively grounded expectations into theoretical and policy discussions on the governance of AI-generated content.

[AI-32] Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

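[Quick read]: This paper addresses the training instability and suboptimal performance that arise in Mixture of Experts (MoE) pretraining because the router receives only sparse backward updates. The key to the approach is a lightweight approximation called Default MoE: missing expert outputs are replaced with default outputs computed as an exponential moving average, so the MoE router receives a signal from every expert for each token. The router thus gets a dense gradient update while parameters are still activated sparsely, which significantly improves training performance without significant computational overhead.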

Link: https://arxiv.org/abs/2504.12463
Authors: Ashwinee Panda,Vatsal Baherwani,Zain Sarwar,Benjamin Therien,Supriyo Chakraborty,Tom Goldstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead. Code: this https URL.
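
The substitution described above can be sketched as a forward pass in plain Python. This is a simplified illustration, not the authors' implementation: experts are scalar functions, there is no autograd, router probabilities are not renormalized over the selected experts, and updating the moving average before it is used is an assumption. The class name, `decay` value, and toy experts are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

class DefaultMoELayer:
    """Top-k MoE forward pass where unselected experts contribute an EMA 'default' output."""

    def __init__(self, experts, top_k=1, decay=0.9):
        self.experts = experts                  # list of callables: float -> float
        self.top_k = top_k
        self.decay = decay
        self.defaults = [0.0] * len(experts)    # EMA of each expert's past outputs

    def forward(self, x, router_logits):
        probs = softmax(router_logits)
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:self.top_k]
        output = 0.0
        for i, p in enumerate(probs):
            if i in chosen:
                y = self.experts[i](x)          # only these experts are actually run
                self.defaults[i] = self.decay * self.defaults[i] + (1 - self.decay) * y
            else:
                y = self.defaults[i]            # cheap stand-in for the skipped expert
            output += p * y                     # every router probability reaches the output
        return output, chosen

layer = DefaultMoELayer([lambda x: 2 * x, lambda x: -x, lambda x: x + 1], top_k=1, decay=0.5)
out1, chosen1 = layer.forward(1.0, [2.0, 0.0, 0.0])   # routes to expert 0
out2, chosen2 = layer.forward(1.0, [0.0, 2.0, 0.0])   # routes to expert 1, reuses expert 0's EMA
```

Because every probability in `probs` multiplies some value in the output, a framework with automatic differentiation would send a gradient to every router logit, even though only `top_k` experts were evaluated.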

[AI-33] Deriving Equivalent Symbol-Based Decision Models from Feedforward Neural Networks

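[Quick read]: This paper addresses the challenges to trust and acceptance that arise from the black-box nature of Artificial Intelligence (AI) systems. Its central goal is to combine connectionist and symbolic approaches by deriving interpretable symbolic models, such as decision trees, from Feedforward Neural Networks (FNNs), which makes the networks more transparent and explainable while preserving their functionality.

The key to the approach is a systematic methodology that exploits the distributed representations in FNNs to identify symbolic components (fillers, roles, and their interrelationships). The method traces neuron activation values and input configurations across network layers and maps activations and their underlying inputs to decision tree edges, which yields symbolic structures that effectively capture the FNN's decision process. Iterative refinement of the subpaths for each hidden layer lets the method scale to deeper networks. The authors also built a prototype that uses Keras .h5 data and emulates TensorFlow within the Java JDK/JavaFX environment; it demonstrates the feasibility of extracting symbolic representations from neural networks, enhancing trust in AI systems, and promoting accountability.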

Link: https://arxiv.org/abs/2504.12446
Authors: Sebastian Seidel,Uwe M. Borghoff
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 19 figures

Abstract:Artificial intelligence (AI) has emerged as a transformative force across industries, driven by advances in deep learning and natural language processing, and fueled by large-scale data and computing resources. Despite its rapid adoption, the opacity of AI systems poses significant challenges to trust and acceptance. This work explores the intersection of connectionist and symbolic approaches to artificial intelligence, focusing on the derivation of interpretable symbolic models, such as decision trees, from feedforward neural networks (FNNs). Decision trees provide a transparent framework for elucidating the operations of neural networks while preserving their functionality. The derivation is presented in a step-by-step approach and illustrated with several examples. A systematic methodology is proposed to bridge neural and symbolic paradigms by exploiting distributed representations in FNNs to identify symbolic components, including fillers, roles, and their interrelationships. The process traces neuron activation values and input configurations across network layers, mapping activations and their underlying inputs to decision tree edges. The resulting symbolic structures effectively capture FNN decision processes and enable scalability to deeper networks through iterative refinement of subpaths for each hidden layer. To validate the theoretical framework, a prototype was developed using Keras .h5-data and emulating TensorFlow within the Java JDK/JavaFX environment. This prototype demonstrates the feasibility of extracting symbolic representations from neural networks, enhancing trust in AI systems, and promoting accountability.

[AI-34] Don't Just Translate, Agitate: Using Large Language Models as Devil's Advocates for AI Explanations

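[Quick read]: This paper addresses problems with the current research trend of using Large Language Models (LLMs) to translate the outputs of explainability techniques (such as feature-attribution weights) into natural language explanations. Although such translation may make explanations more accessible and readable, research indicates that it does not necessarily improve user understanding and may instead lead to overreliance on AI systems. Moreover, when LLMs summarize Explainable AI (XAI) outputs without surfacing model limitations, uncertainties, or inconsistencies, they risk reinforcing an illusion of interpretability rather than fostering meaningful transparency. The key proposal is that LLMs should not act merely as translators but as constructive agitators, or devil's advocates, that actively interrogate AI explanations by presenting alternative interpretations, potential biases, training data limitations, and cases where the model's reasoning breaks down. In this role, LLMs can help users engage critically with AI systems and their explanations and reduce overreliance caused by misinterpreted or specious explanations.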

Link: https://arxiv.org/abs/2504.12424
Authors: Ashley Suh,Kenneth Alperin,Harry Li,Steven R Gomez
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Presented at the Human-centered Explainable AI Workshop (HCXAI) @ CHI 2025

Abstract:This position paper highlights a growing trend in Explainable AI (XAI) research where Large Language Models (LLMs) are used to translate outputs from explainability techniques, like feature-attribution weights, into a natural language explanation. While this approach may improve accessibility or readability for users, recent findings suggest that translating into human-like explanations does not necessarily enhance user understanding and may instead lead to overreliance on AI systems. When LLMs summarize XAI outputs without surfacing model limitations, uncertainties, or inconsistencies, they risk reinforcing the illusion of interpretability rather than fostering meaningful transparency. We argue that - instead of merely translating XAI outputs - LLMs should serve as constructive agitators, or devil’s advocates, whose role is to actively interrogate AI explanations by presenting alternative interpretations, potential biases, training data limitations, and cases where the model’s reasoning may break down. In this role, LLMs can facilitate users in engaging critically with AI systems and generated explanations, with the potential to reduce overreliance caused by misinterpreted or specious explanations.

[AI-35] Mitigating LLM Hallucinations with Knowledge Graphs: A Case Study

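[Quick read]: This paper addresses the unreliability that hallucinations cause when Large Language Models (LLMs) are used in high-stakes domains such as cyber operations. The key to the approach is LinkQ, an open-source natural language interface that forces the LLM to query a Knowledge Graph (KG) for ground-truth data during question answering, which reduces hallucinations. At its core, the method combines the LLM with knowledge graph querying to make AI systems more reliable and trustworthy.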

链接: https://arxiv.org/abs/2504.12422
作者: Harry Li,Gabriel Appleby,Kenneth Alperin,Steven R Gomez,Ashley Suh
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at the Human-centered Explainable AI Workshop (HCXAI) @ CHI 2025

点击查看摘要

Abstract:High-stakes domains like cyber operations need responsible and trustworthy AI methods. While large language models (LLMs) are becoming increasingly popular in these domains, they still suffer from hallucinations. This research paper provides learning outcomes from a case study with LinkQ, an open-source natural language interface that was developed to combat hallucinations by forcing an LLM to query a knowledge graph (KG) for ground-truth data during question-answering (QA). We conduct a quantitative evaluation of LinkQ using a well-known KGQA dataset, showing that the system outperforms GPT-4 but still struggles with certain question categories - suggesting that alternative query construction strategies will need to be investigated in future LLM querying systems. We discuss a qualitative study of LinkQ with two domain experts using a real-world cybersecurity KG, outlining these experts’ feedback, suggestions, perceived limitations, and future opportunities for systems like LinkQ.

[AI-36] Interpretable AI-driven Guidelines for Type 2 Diabetes Treatment from Observational Data

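【Quick Read】: This paper addresses the lack of precise, structured, data-backed clinical guidance for treatment progression in type 2 diabetes. The study combines machine learning and optimization to extract, from non-randomized observational data, a subset that resembles a randomized controlled trial, which reduces confounding bias; it then builds AI-based decision tree models that give patients personalized treatment step-up recommendations. The key to the solution is to use machine learning to handle the biased observational data, to manually combine the models of each subgroup into an end-to-end prescription pipeline, and to prioritize stepping up to a more aggressive treatment, so that treatment outcomes improve while the models remain interpretable and efficient. Results show that the proposed AI-driven approach outperforms current clinical practice on both the internal and external test sets.

Link: https://arxiv.org/abs/2504.12417
Authors: Dewang Kumar Agarwal,Dimitris J. Bertsimas
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: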

Abstract:Objective: Create precise, structured, data-backed guidelines for type 2 diabetes treatment progression, suitable for clinical adoption. Research Design and Methods: Our training cohort was composed of patient (with type 2 diabetes) visits from Boston Medical Center (BMC) from 1998 to 2014. We divide visits into 4 groups based on the patient’s treatment regimen before the visit, and further divide them into subgroups based on the recommended treatment during the visit. Since each subgroup has observational data, which has confounding bias (sicker patients are prescribed more aggressive treatments), we used machine learning and optimization to remove some datapoints so that the remaining data resembles a randomized trial. On each subgroup, we train AI-backed tree-based models to prescribe treatment changes. Once we train these tree models, we manually combine the models for every group to create an end-to-end prescription pipeline for all patients in that group. In this process, we prioritize stepping up to a more aggressive treatment before considering less aggressive options. We tested this pipeline on unseen data from BMC, and an external dataset from Hartford healthcare (type 2 diabetes patient visits from January 2020 to May 2024). Results: The median HbA1c reduction achieved by our pipelines is 0.26% more than what the doctors achieved on the unseen BMC patients. For the Hartford cohort, our pipelines were better by 0.13%. Conclusions: This precise, interpretable, and efficient AI-backed approach to treatment progression in type 2 diabetes is predicted to outperform the current practice and can be deployed to improve patient outcomes.

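[AI-37] Activated LoRA: Fine-tuned LLMs for Intrinsics

【Quick Read】: This paper tries to solve the inefficiency of frequently switching between relevant LoRAs (Low-Rank Adaptations) in multi-turn conversations. The conventional approach must recompute the key-value (KV) cache of the entire history to match the new LoRA weights, which adds inference latency. The paper proposes Activated LoRA (aLoRA), whose key idea is to adapt weights only for the tokens that come after the aLoRA is invoked, rather than from the start of the sequence. This design lets aLoRA directly reuse the base model's KV cache, so it can be activated quickly without recomputation, which significantly improves inference efficiency. With this method, aLoRA can efficiently invoke highly specialized small models (called intrinsics) on specific parts of an input chain or conversation while keeping accuracy competitive with standard LoRA.

Link: https://arxiv.org/abs/2504.12397
Authors: Kristjan Greenewald,Luis Lastras,Thomas Parnell,Vraj Shah,Lucian Popa,Giulio Zizzo,Chulaka Gunasekara,Ambrish Rawat,David Cox
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2504.11704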

Abstract:Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is highly inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), which modifies the LoRA framework to only adapt weights for the tokens in the sequence after the aLoRA is invoked. This change crucially allows aLoRA to accept the base model’s KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the cache. This enables building what we call intrinsics, i.e. highly specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We use aLoRA to train a set of intrinsics models, demonstrating competitive accuracy with standard LoRA while achieving significant inference benefits.
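
The idea of adapting weights only for tokens after the invocation point can be illustrated with a minimal sketch. It assumes a single linear projection and plain Python lists in place of tensors; the names `project`, `W`, `A`, `B`, and `start` are hypothetical and do not come from the paper, and a real aLoRA would apply the update inside attention layers.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def project(tokens, w, a, b, start):
    """Project each token with w; add the low-rank update B(A x)
    only for positions at or after `start` (the invocation point)."""
    outputs = []
    for i, x in enumerate(tokens):
        y = matvec(w, x)
        if i >= start:
            delta = matvec(b, matvec(a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Hypothetical toy weights: identity base projection, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # down-projection, shape 1 x 2
B = [[0.5], [-0.5]]       # up-projection, shape 2 x 1
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

base = project(tokens, W, A, B, start=len(tokens))  # adapter never active
adapted = project(tokens, W, A, B, start=2)         # adapter active from token 2

# Positions before the invocation point are unchanged, so values cached
# for them under the base weights would remain valid in this toy setting.
print(adapted[:2] == base[:2])
```

In this sketch the outputs for the first two tokens are identical with and without the adapter, which is the property that would let a cache built by the base model be reused instead of recomputed.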

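[AI-38] Themisto: Jupyter-Based Runtime Benchmark ICLR2025

【Quick Read】: This paper asks how to measure the ability of Large Language Models (LLMs) to use runtime information for predicting code output and for code generation, and evaluates how current LLMs perform on these tasks. It builds a benchmark from a dataset of Jupyter notebook development trajectories and shows that existing LLMs fall short on such tasks. The key point is that runtime context is an important but significantly understudied area in the development of code-based models, and that runtime information should be incorporated effectively into models to improve their performance.

Link: https://arxiv.org/abs/2504.12365
Authors: Konstantin Grotov,Sergey Titov
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the third Deep Learning for Code (DL4C) workshop @ ICLR 2025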

Abstract:In this work, we present a benchmark that consists of Jupyter notebooks development trajectories and allows measuring how large language models (LLMs) can leverage runtime information for predicting code output and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that there exists a significantly understudied domain in the development of code-based models, which involves incorporating the runtime context.

[AI-39] Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models

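【Quick Read】: This paper targets two key issues: (1) identifying expert collaboration patterns, and (2) optimizing Mixture-of-Experts (MoE) large language models through expert pruning. For the first issue, it proposes a Hierarchical Sparse Dictionary Learning (HSDL) method that uncovers the collaboration patterns among experts; for the second, it introduces the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts. Experiments show that expert collaboration patterns are closely linked to specific input types and carry semantic significance across tasks, and that the pruning method improves overall performance by 2.5% on average, outperforming existing methods. These findings offer useful insights for improving the efficiency and interpretability of MoE LLMs.

Link: https://arxiv.org/abs/2504.12359
Authors: Yuanbo Tang,Yan Tang,Naifan Zhang,Meixuan Chen,Yang Li
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: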

Abstract:Mixture-of-Experts based large language models (MoE LLMs) have shown significant promise in multitask adaptability by dynamically routing inputs to specialized experts. Despite their success, the collaborative mechanisms among experts are still not well understood, limiting both the interpretability and optimization of these models. In this paper, we focus on two critical issues: (1) identifying expert collaboration patterns, and (2) optimizing MoE LLMs through expert pruning. To address the first issue, we propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. For the second issue, we introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts. Our extensive experiments demonstrate that expert collaboration patterns are closely linked to specific input types and exhibit semantic significance across various tasks. Moreover, pruning experiments show that our approach improves overall performance by 2.5% on average, outperforming existing methods. These findings offer valuable insights into enhancing the efficiency and interpretability of MoE LLMs, offering a clearer understanding of expert interactions and improving model optimization.

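[AI-40] Towards an AI Observatory for the Nuclear Sector: A tool for anticipatory governance

【Quick Read】: This paper addresses the fact that AI models are rapidly being embedded in the nuclear sector while the safety, security, and safeguards consequences of this embedding are not well understood. The key proposal is to build an anticipatory system of governance for the nuclear sector and to establish a global AI observatory as a means of operationalizing anticipatory governance. The design of the observatory and the governance system draws on work in science and technology studies, public policy, and foresight studies.

Link: https://arxiv.org/abs/2504.12358
Authors: Aditi Verma,Elizabeth Williams
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments: Presented at the Sociotechnical AI Governance Workshop at CHI 2025, Yokohama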

Abstract:AI models are rapidly becoming embedded in all aspects of nuclear energy research and work but the safety, security, and safeguards consequences of this embedding are not well understood. In this paper, we call for the creation of an anticipatory system of governance for AI in the nuclear sector as well as the creation of a global AI observatory as a means for operationalizing anticipatory governance. The paper explores the contours of the nuclear AI observatory and an anticipatory system of governance by drawing on work in science and technology studies, public policy, and foresight studies.

[AI-41] Prototype-Guided Diffusion for Digital Pathology: Achieving Foundation Model Performance with Minimal Clinical Data

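【Quick Read】: This paper asks how to maintain or improve the performance of digital pathology models while reducing reliance on large numbers of real patient samples, and explores the relationship between dataset size and model performance. It notes that although existing methods improve performance by adding data, there is limited transparency into what drives this correlation, which raises the question of whether simply adding more data is always necessary. The key solution is a Prototype-Guided Diffusion Model that generates high-fidelity synthetic pathology data at scale. The approach enables large-scale self-supervised learning, ensures that the generated data is biologically and diagnostically meaningful, and significantly reduces the need for real clinical data. Experiments show that self-supervised features trained with this approach achieve competitive performance while using roughly 60x to 760x less data than conventional approaches, with statistically comparable or better results across multiple evaluation metrics and tasks. A hybrid approach that combines real and synthetic data improves performance further. These findings highlight the potential of generative AI for creating digital pathology training data, greatly reducing reliance on extensive clinical datasets, and underline the efficiency of the proposed method.

Link: https://arxiv.org/abs/2504.12351
Authors: Ekaterina Redekop,Mara Pleasure,Vedrana Ivezic,Zichen Wang,Kimberly Flores,Anthony Sisk,William Speier,Corey Arnold
Affiliations: unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
Comments: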

Abstract:Foundation models in digital pathology use massive datasets to learn useful compact feature representations of complex histology images. However, there is limited transparency into what drives the correlation between dataset size and performance, raising the question of whether simply adding more data to increase performance is always necessary. In this study, we propose a prototype-guided diffusion model to generate high-fidelity synthetic pathology data at scale, enabling large-scale self-supervised learning and reducing reliance on real patient samples while preserving downstream performance. Using guidance from histological prototypes during sampling, our approach ensures biologically and diagnostically meaningful variations in the generated data. We demonstrate that self-supervised features trained on our synthetic dataset achieve competitive performance despite using ~60x-760x less data than models trained on large real-world datasets. Notably, models trained using our synthetic data showed statistically comparable or better performance across multiple evaluation metrics and tasks, even when compared to models trained on orders of magnitude larger datasets. Our hybrid approach, combining synthetic and real data, further enhanced performance, achieving top results in several evaluations. These findings underscore the potential of generative AI to create compelling training data for digital pathology, significantly reducing the reliance on extensive clinical datasets and highlighting the efficiency of our approach.

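[AI-42] AUTONAV: A Tool for Autonomous Navigation of Robots

【Quick Read】: This paper addresses mapping (Simultaneous Localization and Mapping, SLAM), localization, and path planning for autonomous robot navigation. The key to the solution is a tool called AUTONAV, whose modular architecture makes it easy to integrate and compare different algorithms for these tasks, thereby automating them. The paper presents the maps and path plans that AUTONAV generates in indoor simulation scenarios.

Link: https://arxiv.org/abs/2504.12318
Authors: Mir Md Sajid Sarwar,Sudip Samanta,Rajarshi Ray
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures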

Abstract:We present a tool AUTONAV that automates the mapping, localization, and path-planning tasks for autonomous navigation of robots. The modular architecture allows easy integration of various algorithms for these tasks for comparison. We present the generated maps and path-plans by AUTONAV in indoor simulation scenarios.

[AI-43] Design Topological Materials by Reinforcement Fine-Tuned Generative Model

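【Quick Read】: This paper addresses the scarcity of topological insulators (TIs) and topological crystalline insulators (TCIs), particularly those with a full band gap. Because traditional approaches that screen known materials for candidates are limited, the paper proposes designing new topological materials with a generative model. The key to the solution is applying Reinforcement Fine-Tuning (ReFT) to a pre-trained generative model so that the model's objectives align with the material design goals. This effectively improves the model's ability to generate TIs and TCIs while maintaining the stability of the generated materials, and it identifies a large number of new topological materials. A representative example is Ge₂Bi₂O₆, a TI with a full band gap of 0.26 eV, which ranks among the largest known in this category.

Link: https://arxiv.org/abs/2504.13048
Authors: Haosheng Xu,Dongheng Qian,Zhixuan Liu,Yadong Jiang,Jing Wang
Affiliations: unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: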

Abstract:Topological insulators (TIs) and topological crystalline insulators (TCIs) are materials with unconventional electronic properties, making their discovery highly valuable for practical applications. However, such materials, particularly those with a full band gap, remain scarce. Given the limitations of traditional approaches that scan known materials for candidates, we focus on the generation of new topological materials through a generative model. Specifically, we apply reinforcement fine-tuning (ReFT) to a pre-trained generative model, thereby aligning the model’s objectives with our material design goals. We demonstrate that ReFT is effective in enhancing the model’s ability to generate TIs and TCIs, with minimal compromise on the stability of the generated materials. Using the fine-tuned model, we successfully identify a large number of new topological materials, with Ge₂Bi₂O₆ serving as a representative example: a TI with a full band gap of 0.26 eV, ranking among the largest known in this category.

[AI-44] Post-processing improves accuracy of Artificial Intelligence weather forecasts

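【Quick Read】: This paper addresses the challenge that Artificial Intelligence (AI) weather models face in delivering reliable and unbiased forecasts: like traditional Numerical Weather Prediction (NWP) models, they exhibit systematic biases and reliability issues. The key step is to test whether the existing statistical post-processing system IMPROVER can be applied effectively to the AI-based Artificial Intelligence Forecasting System (AIFS), and to compare the results against ECMWF's traditional deterministic forecast (HRES) and ensemble forecast (ENS). The results show that, without any change to configuration or processing workflows, post-processing yields accuracy improvements for AIFS comparable to those for traditional NWP, in both expected-value and probabilistic outputs. The paper also shows that blending AIFS with NWP models further improves overall forecast skill, even when AIFS alone is not the most accurate component. The study thus demonstrates that statistical post-processing techniques developed for NWP apply directly to AI models, giving national meteorological centres a feasible path to integrate AI forecasts into existing workflows in a low-risk, incremental way.

Link: https://arxiv.org/abs/2504.12672
Authors: Belinda Trotta,Robert Johnson,Catherine de Burgh-Day,Debra Hudson,Esteban Abellan,James Canvin,Andrew Kelly,Daniel Mentiplay,Benjamin Owen,Jennifer Whelan
Affiliations: unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: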

Abstract:Artificial Intelligence (AI) weather models are now reaching operational-grade performance for some variables, but like traditional Numerical Weather Prediction (NWP) models, they exhibit systematic biases and reliability issues. We test the application of the Bureau of Meteorology’s existing statistical post-processing system, IMPROVER, to ECMWF’s deterministic Artificial Intelligence Forecasting System (AIFS), and compare results against post-processed outputs from the ECMWF HRES and ENS models. Without any modification to configuration or processing workflows, post-processing yields comparable accuracy improvements for AIFS as for traditional NWP forecasts, in both expected value and probabilistic outputs. We show that blending AIFS with NWP models improves overall forecast skill, even when AIFS alone is not the most accurate component. These findings show that statistical post-processing methods developed for NWP are directly applicable to AI models, enabling national meteorological centres to incorporate AI forecasts into existing workflows in a low-risk, incremental fashion.

[AI-45] WaterFlow: Learning Fast Robust Watermarks using Stable Diffusion

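【Quick Read】: This paper addresses the fundamental problem of embedding robust watermarks in images, a problem made more pressing by the rapid growth of generated imagery, where existing methods struggle to balance computational efficiency against robustness or perceptual quality. It proposes WaterFlow (WF), a fast and extremely robust method for high-fidelity visual watermarking. The key is to use a pretrained latent diffusion model to encode an arbitrary image into a latent space and to design a transformation based on a learned latent-dependent watermark. Specifically, invertible flow layers enhance the expressivity of the pretrained model's latent space, preserving image quality while permitting robust embedding and detection of the watermark. WaterFlow also achieves state-of-the-art performance in general robustness and in defending against difficult combination attacks.

Link: https://arxiv.org/abs/2504.12354
Authors: Vinay Shukla,Prachee Sharma,Ryan Rossi,Sungchul Kim,Tong Yu,Aditya Grover
Affiliations: unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: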

Abstract:The ability to embed watermarks in images is a fundamental problem of interest for computer vision, and is exacerbated by the rapid rise of generated imagery in recent times. Current state-of-the-art techniques suffer from computational and statistical challenges such as the slow execution speed for practical deployments. In addition, other works trade off fast watermarking speeds but suffer greatly in their robustness or perceptual quality. In this work, we propose WaterFlow (WF), a fast and extremely robust approach for high fidelity visual watermarking based on a learned latent-dependent watermark. Our approach utilizes a pretrained latent diffusion model to encode an arbitrary image into a latent space and produces a learned watermark that is then planted into the Fourier Domain of the latent. The transformation is specified via invertible flow layers that enhance the expressivity of the latent space of the pre-trained model to better preserve image quality while permitting robust and tractable detection. Most notably, WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks. We validate our findings on three widely used real and generated datasets: MS-COCO, DiffusionDB, and WikiArt.

[AI-46] Deep Generative Model-Based Generation of Synthetic Individual-Specific Brain MRI Segmentations

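【Quick Read】: This paper addresses the problem that existing methods for generating synthetic brain magnetic resonance imaging (MRI) scans for a specific individual require detailed structural or volumetric information about that individual's brain, which is often scarce, expensive, and difficult to obtain. To tackle this, it proposes a new approach, CSegSynth, that generates synthetic brain MRI segmentations, namely 3D white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), from an individual's easily obtainable demographic, interview, and cognitive test information. The key is a novel deep generative model, CSegSynth, which outperforms the conditional variational autoencoder (C-VAE), conditional generative adversarial network (C-GAN), and conditional latent diffusion model (C-LDM). Extensive evaluations demonstrate the quality of the synthetic segmentations, and in assessing individual-specific generation, the Pearson correlation coefficients between predicted and ground-truth WM, GM, and CSF volumes reach 0.80, 0.82, and 0.70, respectively.

Link: https://arxiv.org/abs/2504.12352
Authors: Ruijie Wang,Luca Rossetto,Susan Mérillat,Christina Röcke,Mike Martin,Abraham Bernstein
Affiliations: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: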

Abstract:To the best of our knowledge, all existing methods that can generate synthetic brain magnetic resonance imaging (MRI) scans for a specific individual require detailed structural or volumetric information about the individual’s brain. However, such brain information is often scarce, expensive, and difficult to obtain. In this paper, we propose the first approach capable of generating synthetic brain MRI segmentations – specifically, 3D white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) segmentations – for individuals using their easily obtainable and often readily available demographic, interview, and cognitive test information. Our approach features a novel deep generative model, CSegSynth, which outperforms existing prominent generative models, including conditional variational autoencoder (C-VAE), conditional generative adversarial network (C-GAN), and conditional latent diffusion model (C-LDM). We demonstrate the high quality of our synthetic segmentations through extensive evaluations. Also, in assessing the effectiveness of the individual-specific generation, we achieve superior volume prediction, with Pearson correlation coefficients reaching 0.80, 0.82, and 0.70 between the ground-truth WM, GM, and CSF volumes of test individuals and those volumes predicted based on generated individual-specific segmentations, respectively.
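
The volume evaluation mentioned in the abstract relies on the Pearson correlation coefficient. The following is a minimal sketch of that statistic only; the function name `pearson` and the volume values are hypothetical illustrations, not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical volumes (arbitrary units) for four test individuals.
true_volumes = [510.0, 480.0, 530.0, 495.0]
predicted_volumes = [505.0, 490.0, 520.0, 500.0]

r = pearson(true_volumes, predicted_volumes)
print(round(r, 3))
```

A coefficient near 1 means the predicted volumes rise and fall with the true ones; it does not by itself show that the predictions are unbiased, since a constant offset leaves the coefficient unchanged.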

Machine Learning

[LG-0] Aligning Constraint Generation with Design Intent in Parametric CAD

Link: https://arxiv.org/abs/2504.13178
Authors: Evan Casey,Tianyu Zhang,Shu Ishida,John Roger Thompson,Amir Khasahmadi,Joseph George Lambourne,Pradeep Kumar Jayaraman,Karl D.D. Willis
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We adapt alignment techniques from reasoning LLMs to the task of generating engineering sketch constraints found in computer-aided design (CAD) models. Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints (e.g. perpendicular, tangent) that define the relationships between them. For a design to be easily editable, the constraints must effectively capture design intent, ensuring the geometry updates predictably when parameters change. Although current approaches can generate CAD designs, an open challenge remains to align model outputs with design intent; we label this problem 'design alignment'. A critical first step towards aligning generative CAD models is to generate constraints which fully-constrain all geometric primitives, without over-constraining or distorting sketch geometry. Using alignment techniques to train an existing constraint generation model with feedback from a constraint solver, we are able to fully-constrain 93% of sketches compared to 34% when using a naïve supervised fine-tuning (SFT) baseline and only 8.9% without alignment. Our approach can be applied to any existing constraint generation model and sets the stage for further research bridging alignment strategies between the language and design domains.

[LG-1] Transfer Learning via Auxiliary Labels with Application to Cold-Hardiness Prediction

Link: https://arxiv.org/abs/2504.13142
Authors: Kristen Goebel,Paola Pesantez-Cabrera,Markus Keller,Alan Fern
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Cold temperatures can cause significant frost damage to fruit crops depending on their resilience, or cold hardiness, which changes throughout the dormancy season. This has led to the development of predictive cold-hardiness models, which help farmers decide when to deploy expensive frost-mitigation measures. Unfortunately, cold-hardiness data for model training is only available for some fruit cultivars due to the need for specialized equipment and expertise. Rather, farmers often do have years of phenological data (e.g. date of budbreak) that they regularly collect for their crops. In this work, we introduce a new transfer-learning framework, Transfer via Auxiliary Labels (TAL), that allows farmers to leverage the phenological data to produce more accurate cold-hardiness predictions, even when no cold-hardiness data is available for their specific crop. The framework assumes a set of source tasks (cultivars) where each has associated primary labels (cold hardiness) and auxiliary labels (phenology). However, the target task (new cultivar) is assumed to only have the auxiliary labels. The goal of TAL is to predict primary labels for the target task via transfer from the source tasks. Surprisingly, despite the vast literature on transfer learning, to our knowledge, the TAL formulation has not been previously addressed. Thus, we propose several new TAL approaches based on model selection and averaging that can leverage recent deep multi-task models for cold-hardiness prediction. Our results on real-world cold-hardiness and phenological data for multiple grape cultivars demonstrate that TAL can leverage the phenological data to improve cold-hardiness predictions in the absence of cold-hardiness data.

[LG-2] Predicting BVD Re-emergence in Irish Cattle From Highly Imbalanced Herd-Level Data Using Machine Learning Algorithms

Link: https://arxiv.org/abs/2504.13116
Authors: Niamh Mimnagh,Andrew Parnell,Conor McAloon,Jaden Carlson,Maria Guelbenzu,Jonas Brock,Damien Barrett,Guy McGrath,Jamie Tratalos,Rafael Moral
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Bovine Viral Diarrhoea (BVD) has been the focus of a successful eradication programme in Ireland, with the herd-level prevalence declining from 11.3% in 2013 to just 0.2% in 2023. As the country moves toward BVD freedom, the development of predictive models for targeted surveillance becomes increasingly important to mitigate the risk of disease re-emergence. In this study, we evaluate the performance of a range of machine learning algorithms, including binary classification and anomaly detection techniques, for predicting BVD-positive herds using highly imbalanced herd-level data. We conduct an extensive simulation study to assess model performance across varying sample sizes and class imbalance ratios, incorporating resampling, class weighting, and appropriate evaluation metrics (sensitivity, positive predictive value, F1-score and AUC values). Random forests and XGBoost models consistently outperformed other methods, with the random forest model achieving the highest sensitivity and AUC across scenarios, including real-world prediction of 2023 herd status, correctly identifying 219 of 250 positive herds while halving the number of herds that require testing compared to a blanket-testing strategy.
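
The evaluation metrics named in the abstract can be written down in a few lines. This is a minimal sketch: the function names are hypothetical, only the 219-of-250 figure comes from the abstract, and the toy counts used for the F1-score are invented for illustration because the abstract does not report false positives.

```python
def sensitivity(tp, fn):
    """Share of truly positive herds that the model flags (recall)."""
    return tp / (tp + fn)


def positive_predictive_value(tp, fp):
    """Share of flagged herds that are truly positive (precision)."""
    return tp / (tp + fp)


def f1_score(tp, fp, fn):
    """Harmonic mean of positive predictive value and sensitivity."""
    ppv = positive_predictive_value(tp, fp)
    sens = sensitivity(tp, fn)
    return 2 * ppv * sens / (ppv + sens)


# From the abstract: 219 of 250 positive herds correctly identified.
reported_sensitivity = sensitivity(tp=219, fn=250 - 219)

# Toy counts (assumed, not from the paper) to exercise the F1 formula.
toy_f1 = f1_score(tp=8, fp=2, fn=2)

print(reported_sensitivity, toy_f1)
```

With heavily imbalanced classes, accuracy alone would look high even for a model that flags nothing, which is why metrics built on true positives, such as these, are the more informative choice here.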

[LG-3] Quorum: Zero-Training Unsupervised Anomaly Detection using Quantum Autoencoders

Link: https://arxiv.org/abs/2504.13113
Authors: Jason Zev Ludmir,Sophia Rebello,Jacob Ruiz,Tirthak Patel
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments:

Abstract:Detecting mission-critical anomalous events and data is a crucial challenge across various industries, including finance, healthcare, and energy. Quantum computing has recently emerged as a powerful tool for tackling several machine learning tasks, but training quantum machine learning models remains challenging, particularly due to the difficulty of gradient calculation. The challenge is even greater for anomaly detection, where unsupervised learning methods are essential to ensure practical applicability. To address these issues, we propose Quorum, the first quantum anomaly detection framework designed for unsupervised learning that operates without requiring any training.

[LG-4] Hadamard product in deep learning: Introduction Advances and Challenges

Link: https://arxiv.org/abs/2504.13112
Authors: Grigorios G Chrysos,Yongtao Wu,Razvan Pascanu,Philip Torr,Volkan Cevher
Categories: Machine Learning (cs.LG)
Comments: Accepted in IEEE T-PAMI

Abstract:While convolution and self-attention mechanisms have dominated architectural design in deep learning, this survey examines a fundamental yet understudied primitive: the Hadamard product. Despite its widespread implementation across various applications, the Hadamard product has not been systematically analyzed as a core architectural primitive. We present the first comprehensive taxonomy of its applications in deep learning, identifying four principal domains: higher-order correlation, multimodal data fusion, dynamic representation modulation, and efficient pairwise operations. The Hadamard product’s ability to model nonlinear interactions with linear computational complexity makes it particularly valuable for resource-constrained deployments and edge computing scenarios. We demonstrate its natural applicability in multimodal fusion tasks, such as visual question answering, and its effectiveness in representation masking for applications including image inpainting and pruning. This systematic review not only consolidates existing knowledge about the Hadamard product’s role in deep learning architectures but also establishes a foundation for future architectural innovations. Our analysis reveals the Hadamard product as a versatile primitive that offers compelling trade-offs between computational efficiency and representational power, positioning it as a crucial component in the deep learning toolkit.
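
For readers unfamiliar with the operation, the Hadamard product is simply an element-wise multiplication of two equally shaped arrays. The sketch below, using plain Python lists, illustrates two of the uses the abstract names (representation masking and multimodal fusion); the vectors and variable names are hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors.
    Cost is one multiplication per element, i.e. linear in the length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


# Representation masking: a binary mask zeroes out selected features.
features = [0.5, 1.0, 2.0, 3.0]
mask = [1.0, 0.0, 1.0, 0.0]
masked = hadamard(features, mask)

# Multimodal fusion: multiply an image embedding with a text embedding,
# so each fused feature depends jointly on both modalities.
image_vec = [1.0, 2.0, 3.0]
text_vec = [0.5, 0.5, 2.0]
fused = hadamard(image_vec, text_vec)

print(masked, fused)
```

Each fused entry is a product of one image feature and one text feature, which is a simple form of the nonlinear interaction at linear cost that the survey highlights.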

[LG-5] Uncertainty-Aware Trajectory Prediction via Rule-Regularized Heteroscedastic Deep Classification

Link: https://arxiv.org/abs/2504.13111
Authors: Kumar Manas,Christian Schlauch,Adrian Paschke,Christian Wirth,Nadja Klein
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 17 Pages, 9 figures. Accepted to Robotics: Science and Systems (RSS), 2025

Abstract:Deep learning-based trajectory prediction models have demonstrated promising capabilities in capturing complex interactions. However, their out-of-distribution generalization remains a significant challenge, particularly due to unbalanced data and a lack of enough data and diversity to ensure robustness and calibration. To address this, we propose SHIFT (Spectral Heteroscedastic Informed Forecasting for Trajectories), a novel framework that uniquely combines well-calibrated uncertainty modeling with informative priors derived through automated rule extraction. SHIFT reformulates trajectory prediction as a classification task and employs heteroscedastic spectral-normalized Gaussian processes to effectively disentangle epistemic and aleatoric uncertainties. We learn informative priors from training labels, which are automatically generated from natural language driving rules, such as stop rules and drivability constraints, using a retrieval-augmented generation framework powered by a large language model. Extensive evaluations over the nuScenes dataset, including challenging low-data and cross-location scenarios, demonstrate that SHIFT outperforms state-of-the-art methods, achieving substantial gains in uncertainty calibration and displacement metrics. In particular, our model excels in complex scenarios, such as intersections, where uncertainty is inherently higher. Project page: this https URL.

[LG-6] An All-Atom Generative Model for Designing Protein Complexes

Link: https://arxiv.org/abs/2504.13075
Authors: Ruizhe Chen,Dongyu Xue,Xiangxin Zhou,Zaixiang Zheng,Xiangxiang Zeng,Quanquan Gu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Proteins typically exist in complexes, interacting with other proteins or biomolecules to perform their specific biological roles. Research on single-chain protein modeling has been extensively and deeply explored, with advancements seen in models like the series of ESM and AlphaFold. Despite these developments, the study and modeling of multi-chain proteins remain largely uncharted, though they are vital for understanding biological functions. Recognizing the importance of these interactions, we introduce APM (All-Atom Protein Generative Model), a model specifically designed for modeling multi-chain proteins. By integrating atom-level information and leveraging data on multi-chain proteins, APM is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins. Moreover, APM demonstrates versatility in downstream applications: it achieves enhanced performance through supervised fine-tuning (SFT) while also supporting zero-shot sampling in certain tasks, achieving state-of-the-art results. Code will be released at this https URL.

[LG-7] Inference-friendly Graph Compression for Graph Neural Networks

Link: https://arxiv.org/abs/2504.13034
Authors: Yangxin Fan,Haolai Che,Yinghui Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) have demonstrated promising performance in graph analysis. Nevertheless, the inference process of GNNs remains costly, hindering their applications for large graphs. This paper proposes inference-friendly graph compression (IFGC), a graph compression scheme to accelerate GNNs inference. Given a graph $G$ and a GNN $M$, an IFGC computes a small compressed graph $G_c$, to best preserve the inference results of $M$ over $G$, such that the result can be directly inferred by accessing $G_c$ with no or little decompression cost. (1) We characterize IFGC with a class of inference equivalence relation. The relation captures the node pairs in $G$ that are not distinguishable for GNN inference. (2) We introduce three practical specifications of IFGC for representative GNNs: structural preserving compression (SPGC), which computes $G_c$ that can be directly processed by GNN inference without decompression; $(\alpha, r)$-compression, that allows for a configurable trade-off between compression ratio and inference quality, and anchored compression that preserves inference results for specific nodes of interest. For each scheme, we introduce compression and inference algorithms with guarantees of efficiency and quality of the inferred results. We conduct extensive experiments on diverse sets of large-scale graphs, which verifies the effectiveness and efficiency of our graph compression approaches.

[LG-8] Chain-of-Thought Prompting for Out-of-Distribution Samples: A Latent-Variable Study

Link: https://arxiv.org/abs/2504.12991
Authors: Yu Wang, Fu-Chieh Chang, Pei-Yuan Wu
Categories: Machine Learning (cs.LG)

Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful technique to improve in-context learning (ICL) in large language models (LLMs) by breaking complex reasoning into intermediate steps. However, the ability of CoT to generalize under distribution shift remains poorly understood. In this work, we extend a latent-variable framework for CoT prompting and study its behavior on two prototypical out-of-distribution (OOD) scenarios: (i) the latent variables for CoT steps are permuted into novel combinations, and (ii) the latent variables are uniformly scaled by a factor. Our experiments demonstrate that CoT inference generalizes effectively to OOD samples whose latent variables closely resemble those seen during training, but its performance degrades as this similarity decreases. These findings provide foundational insights into the strengths and limitations of CoT prompting under OOD conditions and suggest directions for developing more resilient reasoning strategies in future LLMs.

[LG-9] Why Ask One When You Can Ask k? Two-Stage Learning-to-Defer to a Set of Experts

Link: https://arxiv.org/abs/2504.12988
Authors: Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Learning-to-Defer (L2D) enables decision-making systems to improve reliability by selectively deferring uncertain predictions to more competent agents. However, most existing approaches focus exclusively on single-agent deferral, which is often inadequate in high-stakes scenarios that require collective expertise. We propose Top-$k$ Learning-to-Defer, a generalization of the classical two-stage L2D framework that allocates each query to the $k$ most confident agents instead of a single one. To further enhance flexibility and cost-efficiency, we introduce Top-$k(x)$ Learning-to-Defer, an adaptive extension that learns the optimal number of agents to consult for each query, based on input complexity, agent competency distributions, and consultation costs. For both settings, we derive a novel surrogate loss and prove that it is Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent, ensuring convergence to the Bayes-optimal allocation. Notably, we show that the well-established model cascades paradigm arises as a restricted instance of our Top-$k$ and Top-$k(x)$ formulations. Extensive experiments across diverse benchmarks demonstrate the effectiveness of our framework on both classification and regression tasks.
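
The allocation step described above, sending a query to the $k$ most confident agents rather than one, can be sketched as a top-$k$ selection over per-agent confidence scores. This is a minimal illustration under the assumption that such scores are already available for a query; the function name `top_k_agents` and the example scores are hypothetical, and the paper's surrogate loss and learned $k(x)$ are not reproduced here.

```python
import heapq


def top_k_agents(confidences, k):
    """Return the indices of the k highest-confidence agents, best first."""
    return heapq.nlargest(k, range(len(confidences)), key=lambda j: confidences[j])


# Hypothetical confidence scores for four agents on one query.
scores = [0.2, 0.9, 0.5, 0.7]
chosen = top_k_agents(scores, k=2)
single = top_k_agents(scores, k=1)
print(chosen, single)
```

Setting `k=1` reduces the selection to classical single-agent deferral, which matches the abstract's framing of Top-$k$ as a generalization.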

[LG-10] RL-PINNs: Reinforcement Learning-Driven Adaptive Sampling for Efficient Training of PINNs

Link: https://arxiv.org/abs/2504.12949
Authors: Zhenao Song
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving partial differential equations (PDEs). However, their performance heavily relies on the strategy used to select training points. Conventional adaptive sampling methods, such as residual-based refinement, often require multi-round sampling and repeated retraining of PINNs, leading to computational inefficiency due to redundant points and costly gradient computations, particularly in high-dimensional or high-order derivative scenarios. To address these limitations, we propose RL-PINNs, a reinforcement learning (RL)-driven adaptive sampling framework that enables efficient training with only a single round of sampling. Our approach formulates adaptive sampling as a Markov decision process, where an RL agent dynamically selects optimal training points by maximizing a long-term utility metric. Critically, we replace gradient-dependent residual metrics with a computationally efficient function variation as the reward signal, eliminating the overhead of derivative calculations. Furthermore, we employ a delayed reward mechanism to prioritize long-term training stability over short-term gains. Extensive experiments across diverse PDE benchmarks, including low-regularity, nonlinear, high-dimensional, and high-order problems, demonstrate that RL-PINNs significantly outperform existing residual-driven adaptive methods in accuracy. Notably, RL-PINNs achieve this with negligible sampling overhead, making them scalable to high-dimensional and high-order problems.

[LG-11] IdentiARAT: Toward Automated Identification of Individual ARAT Items from Wearable Sensors

Link: https://arxiv.org/abs/2504.12921
Authors: Daniel Homm, Patrick Carqueville, Christian Eichhorn, Thomas Weikert, Thomas Menard, David A. Plecher, Chris Awai Easthope
Categories: Machine Learning (cs.LG)

Abstract: This study explores the potential of using wrist-worn inertial sensors to automate the labeling of ARAT (Action Research Arm Test) items. While the ARAT is commonly used to assess upper limb motor function, its limitations include subjectivity and the time it demands of clinical staff. By using IMU (Inertial Measurement Unit) sensors and MiniROCKET as a time series classification technique, this investigation aims to classify ARAT items based on sensor recordings. We test common preprocessing strategies to efficiently leverage the information contained in the data. Afterward, we use the best preprocessing to improve the classification. The dataset includes recordings of 45 participants performing various ARAT items. Results show that MiniROCKET offers a fast and reliable approach for classifying ARAT domains, although challenges remain in distinguishing between similar individual items. Future work may involve improving classification through more advanced machine-learning models and data enhancements.

[LG-12] Sliced-Wasserstein Distance-based Data Selection

Link: https://arxiv.org/abs/2504.12918
Authors: Julien Pallage, Antoine Lesage-Landry
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.21712

Abstract: We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is interesting for decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we open the first dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
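
For readers unfamiliar with the sliced-Wasserstein distance itself, the sketch below gives a standard Monte Carlo estimate between two equally sized point clouds: project both onto random unit directions, sort the projections, and average the one-dimensional transport cost. It shows only the distance, not the paper's anomaly filter or its two approximations; the function name, default parameters, and toy data are assumptions.

```python
import math
import random


def sliced_wasserstein(xs, ys, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    xs, ys: equal-length lists of points (tuples of floats, same dimension).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a direction uniformly on the unit sphere via a normalized Gaussian.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # In 1-D, optimal transport between equal-size samples matches sorted values.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(a * d for a, d in zip(y, direction)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)


# Toy data (assumption): a unit square and a copy shifted by (3, 4).
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in square]
same = sliced_wasserstein(square, square)
moved = sliced_wasserstein(square, shifted)
print(same, moved)
```

For a pure translation, every projected gap equals the shift projected onto the direction, so the estimate can never exceed the length of the shift (5 in this example).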

[LG-13] Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers

Link: https://arxiv.org/abs/2504.12916
Authors: Nischal Mainali, Lucas Teixeira
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
Comments: 10 pages, 7 figures

Abstract:Transformer models exhibit remarkable in-context learning (ICL), adapting to novel tasks from examples within their context, yet the underlying mechanisms remain largely mysterious. Here, we provide an exact analytical characterization of ICL emergence by deriving the closed-form stochastic gradient descent (SGD) dynamics for a simplified linear transformer performing regression tasks. Our analysis reveals key properties: (1) a natural separation of timescales directly governed by the input data’s covariance structure, leading to staged learning; (2) an exact description of how ICL develops, including fixed points corresponding to learned algorithms and conservation laws constraining the dynamics; and (3) surprisingly nonlinear learning behavior despite the model’s linearity. We hypothesize this phenomenology extends to non-linear models. To test this, we introduce theory-inspired macroscopic measures (spectral rank dynamics, subspace stability) and use them to provide mechanistic explanations for (1) the sudden emergence of ICL in attention-only networks and (2) delayed generalization (grokking) in modular arithmetic models. Our work offers an exact dynamical model for ICL and theoretically grounded tools for analyzing complex transformer training.

[LG-14] Mirror Mirror of the Flow: How Does Regularization Shape Implicit Bias?

Link: https://arxiv.org/abs/2504.12883
Authors: Tom Jacobs, Chao Zhou, Rebekka Burkholz
Categories: Machine Learning (cs.LG)
Comments: 26 pages, 16 figures

Abstract:Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.

[LG-15] Can Masked Autoencoders Also Listen to Birds?

Link: https://arxiv.org/abs/2504.12880
Authors: Lukas Rauch, Ilyass Moummad, René Heinrich, Alexis Joly, Bernhard Sick, Christoph Scholz
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:Masked Autoencoders (MAEs) pretrained on AudioSet fail to capture the fine-grained acoustic characteristics of specialized domains such as bioacoustic monitoring. Bird sound classification is critical for assessing environmental health, yet general-purpose models inadequately address its unique acoustic challenges. To address this, we introduce Bird-MAE, a domain-specialized MAE pretrained on the large-scale BirdSet dataset. We explore adjustments to pretraining, fine-tuning and utilizing frozen representations. Bird-MAE achieves state-of-the-art results across all BirdSet downstream tasks, substantially improving multi-label classification performance compared to the general-purpose Audio-MAE baseline. Additionally, we propose prototypical probing, a parameter-efficient method for leveraging MAEs’ frozen representations. Bird-MAE’s prototypical probes outperform linear probing by up to 37% in MAP and narrow the gap to fine-tuning to approximately 3% on average on BirdSet.

[LG-16] A Client-level Assessment of Collaborative Backdoor Poisoning in Non-IID Federated Learning

Link: https://arxiv.org/abs/2504.12875
Authors: Phung Lai, Guanxiong Liu, Hai Phan, Issa Khalil, Abdallah Khreishah, Xintao Wu
Categories: Machine Learning (cs.LG)

Abstract:Federated learning (FL) enables collaborative model training using decentralized private data from multiple clients. While FL has shown robustness against poisoning attacks with basic defenses, our research reveals new vulnerabilities stemming from non-independent and identically distributed (non-IID) data among clients. These vulnerabilities pose a substantial risk of model poisoning in real-world FL scenarios. To demonstrate such vulnerabilities, we develop a novel collaborative backdoor poisoning attack called CollaPois. In this attack, we distribute a single pre-trained model infected with a Trojan to a group of compromised clients. These clients then work together to produce malicious gradients, causing the FL model to consistently converge towards a low-loss region centered around the Trojan-infected model. Consequently, the impact of the Trojan is amplified, especially when the benign clients have diverse local data distributions and scattered local gradients. CollaPois stands out by achieving its goals while involving only a limited number of compromised clients, setting it apart from existing attacks. Also, CollaPois effectively avoids noticeable shifts or degradation in the FL model’s performance on legitimate data samples, allowing it to operate stealthily and evade detection by advanced robust FL algorithms. Thorough theoretical analysis and experiments conducted on various benchmark datasets demonstrate the superiority of CollaPois compared to state-of-the-art backdoor attacks. Notably, CollaPois bypasses existing backdoor defenses, especially in scenarios where clients possess diverse data distributions. Moreover, the results show that CollaPois remains effective even when involving a small number of compromised clients. Notably, clients whose local data is closely aligned with compromised clients experience higher risks of backdoor infections. 
Journal reference: 2025 International Conference on Distributed Computing Systems (ICDCS)

[LG-17] HHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification

Link: https://arxiv.org/abs/2504.12850
Authors: Khaled SH. Raslan, Almohammady S. Alsharkawy, K.R. Raslan
Categories: Machine Learning (cs.LG)

Abstract: Classifying imbalanced datasets remains a significant challenge in machine learning, particularly with big data where instances are unevenly distributed among classes, leading to class imbalance issues that impact classifier performance. While the Synthetic Minority Over-sampling Technique (SMOTE) addresses this challenge by generating new instances for the under-represented minority class, it faces obstacles in the form of noise and outliers during the creation of new samples. This paper proposes iHHO-SMOTe, an approach that addresses the limitations of SMOTE by first cleansing the data of noise points. This process involves employing feature selection using a random forest to identify the most valuable features, followed by applying the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to detect outliers based on the selected features. The identified outliers from the minority classes are then removed, creating a refined dataset for subsequent oversampling using the hybrid approach called iHHO-SMOTe. Comprehensive experiments across diverse datasets demonstrate the exceptional performance of the proposed model, with an AUC score exceeding 0.99, a high G-means score of 0.99 highlighting its robustness, and an outstanding F1-score consistently exceeding 0.967. These findings collectively establish Cleansed iHHO-SMOTe as a formidable contender in addressing imbalanced datasets, focusing on noise reduction and outlier handling for improved classification models.

[LG-18] FedX: Adaptive Model Decomposition and Quantization for IoT Federated Learning

Link: https://arxiv.org/abs/2504.12849
Authors: Phung Lai, Xiaopeng Jiang, Hai Phan, Cristian Borcea, Khang Tran, An Chen, Vijaya Datta Mayyuri, Ruoming Jin
Categories: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) allows collaborative training among multiple devices without data sharing, thus enabling privacy-sensitive applications on mobile or Internet of Things (IoT) devices, such as mobile health and asset tracking. However, designing an FL system with good model utility that works with low computation/communication overhead on heterogeneous, resource-constrained mobile/IoT devices is challenging. To address this problem, this paper proposes FedX, a novel adaptive model decomposition and quantization FL system for IoT. To balance utility with resource constraints on IoT devices, FedX decomposes a global FL model into different sub-networks with adaptive numbers of quantized bits for different devices. The key idea is that a device with fewer resources receives a smaller sub-network for lower overhead but utilizes a larger number of quantized bits for higher model utility, and vice versa. The quantization operations in FedX are done at the server to reduce the computational load on devices. FedX iteratively minimizes the losses in the devices’ local data and in the server’s public data using quantized sub-networks under a regularization term, and thus it maximizes the benefits of combining FL with model quantization through knowledge sharing among the server and devices in a cost-effective training process. Extensive experiments show that FedX significantly improves quantization times by up to 8.43X, on-device computation time by 1.5X, and total end-to-end training time by 1.36X, compared with baseline FL systems. We guarantee the global model convergence theoretically and validate local model convergence empirically, highlighting FedX’s optimization efficiency.
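
The abstract describes giving different devices sub-networks with different numbers of quantized bits. As a rough illustration of what the bit width controls, the sketch below applies uniform min-max quantization to a list of weights. This is a common generic scheme chosen for illustration; the abstract does not specify FedX's quantizer, and the names `quantize` and `max_error` and the sample weights are hypothetical.

```python
def quantize(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # a constant vector needs no quantization grid
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def max_error(weights, bits):
    """Largest absolute difference between a weight and its quantized value."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))


# Hypothetical weights for one layer.
layer = [-1.0, -0.3, 0.2, 1.0]
coarse = max_error(layer, bits=2)
fine = max_error(layer, bits=8)
print(coarse, fine)
```

The worst-case error is half a grid step, so each extra bit roughly halves it, which is the utility side of the trade-off between bit width and sub-network size that the abstract describes.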

[LG-19] Predicting Stock Prices using Permutation Decision Trees and Strategic Trailing

Link: https://arxiv.org/abs/2504.12828
Authors: Vishrut Ramraj, Nithin Nagaraj, Harikrishnan N B
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 7 figures

Abstract: In this paper, we explore the application of Permutation Decision Trees (PDT) and strategic trailing for predicting stock market movements and executing profitable trades in the Indian stock market. We focus on high-frequency data using 5-minute candlesticks for the top 50 stocks listed in the NIFTY 50 index. We implement a trading strategy that aims to buy stocks at lower prices and sell them at higher prices, capitalizing on short-term market fluctuations. Due to regulatory constraints in India, short selling is not considered in our strategy. The model incorporates various technical indicators and employs hyperparameters such as the trailing stop-loss value and support thresholds to manage risk effectively. Our results indicate that the proposed trading bot has the potential to outperform the market average and yield returns higher than the risk-free rate offered by 10-year Indian government bonds. We train and test on a 60-day dataset provided by Yahoo Finance, using 48 days for training and 12 days for testing. Our bot based on the permutation decision tree achieved a profit of 1.3468% over the 12-day testing period, whereas a bot based on LSTM returned 0.1238% and a bot based on RNN returned 0.3096% over the same period. All of the bots outperform the buy-and-hold strategy, which resulted in a loss of 2.2508%.
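
The trailing stop-loss mentioned among the hyperparameters can be illustrated with a generic rule: track the highest price seen since entry and exit a long position once the price falls a fixed fraction below that peak. The sketch below is an assumption about how such a rule is typically written, not the authors' implementation; the function name, the 3% trail, and the price series are hypothetical.

```python
def trailing_stop_exit(prices, trail_pct):
    """Return (index, price) at which a long position opened at prices[0] exits.

    The position is closed at the first price that is at least `trail_pct`
    below the running peak; otherwise it is closed at the last price.
    """
    peak = prices[0]
    for i, price in enumerate(prices):
        peak = max(peak, price)
        if price <= peak * (1.0 - trail_pct):
            return i, price
    return len(prices) - 1, prices[-1]


# Hypothetical 5-minute closing prices.
candles = [100, 102, 105, 104, 101, 99]
exit_index, exit_price = trailing_stop_exit(candles, trail_pct=0.03)
print(exit_index, exit_price)
```

Here the peak is 105, so the stop sits near 101.85 and the position closes at 101, locking in a small gain instead of riding the price down to 99.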

[LG-20] GraphOmni: A Comprehensive and Extendable Benchmark Framework for Large Language Models on Graph-theoretic Tasks

Link: https://arxiv.org/abs/2504.12764
Authors: Hao Xu, Xiangru Jian, Xinjian Zhao, Wei Pang, Chao Zhang, Suyuchen Wang, Qixin Zhang, Joao Monteiro, Qiuzhuang Sun, Tianshu Yu
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
Comments: 82 pages

Abstract: In this paper, we present GraphOmni, a comprehensive benchmark framework for systematically evaluating the graph reasoning capabilities of LLMs. By analyzing critical dimensions, including graph types, serialization formats, and prompt schemes, we provide extensive insights into the strengths and limitations of current LLMs. Our empirical findings emphasize that no single serialization or prompting strategy consistently outperforms others. Motivated by these insights, we propose a reinforcement learning-based approach that dynamically selects the best serialization-prompt pairings, resulting in significant accuracy improvements. GraphOmni’s modular and extensible design establishes a robust foundation for future research, facilitating advancements toward general-purpose graph reasoning models.

[LG-21] Decentralized Nonconvex Composite Federated Learning with Gradient Tracking and Momentum

Link: https://arxiv.org/abs/2504.12742
Authors: Yuan Zhou, Xinli Shi, Xuelong Li, Jiachen Zhong, Guanghui Wen, Jinde Cao
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)

Abstract: Decentralized Federated Learning (DFL) eliminates the reliance on the server-client architecture inherent in traditional federated learning, attracting significant research interest in recent years. Simultaneously, the objective functions in machine learning tasks are often nonconvex and frequently incorporate additional, potentially nonsmooth regularization terms to satisfy practical requirements, thereby forming nonconvex composite optimization problems. Employing DFL methods to solve such general optimization problems leads to the formulation of Decentralized Nonconvex Composite Federated Learning (DNCFL), a topic that remains largely underexplored. In this paper, we propose a novel DNCFL algorithm, termed DEPOSITUM. Built upon proximal stochastic gradient tracking, DEPOSITUM mitigates the impact of data heterogeneity by enabling clients to approximate the global gradient. The introduction of momentum in the proximal gradient descent step, replacing tracking variables, reduces the variance introduced by stochastic gradients. Additionally, DEPOSITUM supports local updates of client variables, significantly reducing communication costs. Theoretical analysis demonstrates that DEPOSITUM achieves an expected $\epsilon$-stationary point with an iteration complexity of $\mathcal{O}(1/\epsilon^2)$. The proximal gradient, consensus errors, and gradient estimation errors decrease at a sublinear rate of $\mathcal{O}(1/T)$. With appropriate parameter selection, the algorithm achieves network-independent linear speedup without requiring mega-batch sampling. Finally, we apply DEPOSITUM to the training of neural networks on real-world datasets, systematically examining the influence of various hyperparameters on its performance. Comparisons with other federated composite optimization algorithms validate the effectiveness of the proposed method.

[LG-22] Hierarchical Vector Quantized Graph Autoencoder with Annealing-Based Code Selection

Link: https://arxiv.org/abs/2504.12715
Authors: Long Zeng, Jianxiang Yu, Jiapeng Zhu, Qingsong Zhong, Xiang Li
Categories: Machine Learning (cs.LG)

Abstract: Graph self-supervised learning has gained significant attention recently. However, many existing approaches heavily depend on perturbations, and inappropriate perturbations may corrupt the graph’s inherent information. The Vector Quantized Variational Autoencoder (VQ-VAE) is a powerful autoencoder extensively used in fields such as computer vision; however, its application to graph data remains underexplored. In this paper, we provide an empirical analysis of vector quantization in the context of graph autoencoders, demonstrating its significant enhancement of the model’s capacity to capture graph topology. Furthermore, we identify two key challenges associated with vector quantization when applied to graph data: codebook underutilization and codebook space sparsity. For the first challenge, we propose an annealing-based encoding strategy that promotes broad code utilization in the early stages of training, gradually shifting focus toward the most effective codes as training progresses. For the second challenge, we introduce a hierarchical two-layer codebook that captures relationships between embeddings through clustering. The second layer codebook links similar codes, encouraging the model to learn closer embeddings for nodes with similar features and structural topology in the graph. Our proposed model outperforms 16 representative baseline methods in self-supervised link prediction and node classification tasks across multiple datasets.
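
One plausible reading of the annealing-based encoding strategy is temperature-controlled sampling over codebook entries: at high temperature many codes are chosen, and as the temperature decays the choice concentrates on the nearest code. The sketch below follows that reading and is an assumption rather than the paper's exact rule; the softmax-over-negative-distance form, the exponential schedule, and all names are hypothetical.

```python
import math
import random


def temperature_at(step, t0=1.0, decay=0.99):
    """Exponentially decayed temperature (hypothetical schedule)."""
    return t0 * decay ** step


def select_code(distances, temperature, rng):
    """Sample a codebook index with probability proportional to exp(-d / T)."""
    logits = [-d / temperature for d in distances]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - top) for logit in logits]
    return rng.choices(range(len(distances)), weights=weights, k=1)[0]


rng = random.Random(0)
# Hypothetical distances from one node embedding to three codebook vectors.
distances = [0.10, 0.20, 0.30]
early = {select_code(distances, temperature_at(0, t0=10.0), rng) for _ in range(200)}
late = {select_code(distances, 1e-6, rng) for _ in range(200)}
print(early, late)
```

At the high starting temperature the three codes are sampled almost uniformly, while at a near-zero temperature the selection collapses to the nearest code, as in ordinary vector quantization.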

[LG-23] Convergence and Implicit Bias of Gradient Descent on Continual Linear Classification ICLR2025

Link: https://arxiv.org/abs/2504.12712
Authors: Hyunji Jung, Hanseul Cho, Chulhee Yun
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 67 pages, 11 figures, accepted to ICLR 2025

Abstract:We study continual learning on multiple linear classification tasks by sequentially running gradient descent (GD) for a fixed budget of iterations per task. When all tasks are jointly linearly separable and are presented in a cyclic/random order, we show the directional convergence of the trained linear classifier to the joint (offline) max-margin solution. This is surprising because GD training on a single task is implicitly biased towards the individual max-margin solution for the task, and the direction of the joint max-margin solution can be largely different from these individual solutions. Additionally, when tasks are given in a cyclic order, we present a non-asymptotic analysis on cycle-averaged forgetting, revealing that (1) alignment between tasks is indeed closely tied to catastrophic forgetting and backward knowledge transfer and (2) the amount of forgetting vanishes to zero as the cycle repeats. Lastly, we analyze the case where the tasks are no longer jointly separable and show that the model trained in a cyclic order converges to the unique minimum of the joint loss function.

[LG-24] Physics Informed Constrained Learning of Dynamics from Static Data

Link: https://arxiv.org/abs/2504.12675
Authors: Pengtao Dang, Tingbo Guo, Sha Cao, Chi Zhang
Categories: Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Molecular Networks (q-bio.MN)
Comments: 39 pages, 10 figures

Abstract: A physics-informed neural network (PINN) models the dynamics of a system by integrating the governing physical laws into the architecture of a neural network. By enforcing physical laws as constraints, PINN overcomes challenges with data scarcity and potentially high dimensionality. Existing PINN frameworks rely on fully observed time-course data, the acquisition of which could be prohibitive for many systems. In this study, we developed a new PINN learning paradigm, namely Constrained Learning, that enables the approximation of first-order derivatives or motions using non-time course or partially observed data. Computational principles and a general mathematical formulation of Constrained Learning were developed. We further introduced MPOCtrL (Message Passing Optimization-based Constrained Learning), an optimization approach tailored for the Constrained Learning framework that strives to balance the fitting of physical models and observed data. Its code is available at this https URL. Experiments on synthetic and real-world data demonstrated that MPOCtrL can effectively detect the nonlinear dependency between observed data and the underlying physical properties of the system. In particular, on the task of metabolic flux analysis, MPOCtrL outperforms all existing data-driven flux estimators.

[LG-25] Predicting Drivers Perceived Risk: a Model Based on Semi-Supervised Learning Strategy

Link: https://arxiv.org/abs/2504.12665
Authors: Siwei Huang, Chenhao Yang, Chuan Hu
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 6 pages, 8 figures, 5 tables. Accepted to be presented at the 2025 36th IEEE Intelligent Vehicles Symposium (IV) (IV 2025)

Abstract: Drivers’ perception of risk determines their acceptance, trust, and use of Automated Driving Systems (ADSs). However, perceived risk is subjective and difficult to evaluate using existing methods. To address this issue, a driver’s subjective perceived risk (DSPR) model is proposed, regarding perceived risk as a dynamically triggered mechanism with anisotropy and attenuation. Twenty participants are recruited for a driver-in-the-loop experiment to report their real-time subjective risk ratings (SRRs) when experiencing various automated driving scenarios. A convolutional neural network and bidirectional long short-term memory network with temporal pattern attention (CNN-Bi-LSTM-TPA) is embedded into a semi-supervised learning strategy to predict SRRs, aiming to reduce data noise caused by the subjective randomness of participants. The results illustrate that DSPR achieves the highest prediction accuracy of 87.91% in predicting SRRs, compared to three state-of-the-art risk models. The semi-supervised strategy improves accuracy by 20.12%. In addition, the CNN-Bi-LSTM-TPA network achieves the highest accuracy among four different LSTM structures. This study offers an effective method for assessing drivers’ perceived risk, providing support for the safety enhancement of ADSs and the improvement of drivers’ trust.

[LG-26] Feature selection based on cluster assumption in PU learning GECCO2025

Link: https://arxiv.org/abs/2504.12651
Authors: Motonobu Uchikoshi, Youhei Akimoto
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at GECCO 2025

Abstract:Feature selection is essential for efficient data mining and sometimes encounters the positive-unlabeled (PU) learning scenario, where only a few positive labels are available, while most data remains unlabeled. In certain real-world PU learning tasks, data subjected to adequate feature selection often form clusters with concentrated positive labels. Conventional feature selection methods that treat unlabeled data as negative may fail to capture the statistical characteristics of positive data in such scenarios, leading to suboptimal performance. To address this, we propose a novel feature selection method based on the cluster assumption in PU learning, called FSCPU. FSCPU formulates the feature selection problem as a binary optimization task, with an objective function explicitly designed to incorporate the cluster assumption in the PU learning setting. Experiments on synthetic datasets demonstrate the effectiveness of FSCPU across various data conditions. Moreover, comparisons with 10 conventional algorithms on three open datasets show that FSCPU achieves competitive performance in downstream classification tasks, even when the cluster assumption does not strictly hold.

[LG-27] Uncertainty Quantification in Graph Neural Networks with Shallow Ensembles

Link: https://arxiv.org/abs/2504.12627
Authors: Tirtha Vinchurkar, Kareem Abdelmaqsoud, John R. Kitchin
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:Machine-learned potentials (MLPs) have revolutionized materials discovery by providing accurate and efficient predictions of molecular and material properties. Graph Neural Networks (GNNs) have emerged as a state-of-the-art approach due to their ability to capture complex atomic interactions. However, GNNs often produce unreliable predictions when encountering out-of-domain data and it is difficult to identify when that happens. To address this challenge, we explore Uncertainty Quantification (UQ) techniques, focusing on Direct Propagation of Shallow Ensembles (DPOSE) as a computationally efficient alternative to deep ensembles. By integrating DPOSE into the SchNet model, we assess its ability to provide reliable uncertainty estimates across diverse Density Functional Theory datasets, including QM9, OC20, and Gold Molecular Dynamics. Our findings often demonstrate that DPOSE successfully distinguishes between in-domain and out-of-domain samples, exhibiting higher uncertainty for unobserved molecule and material classes. This work highlights the potential of lightweight UQ methods in improving the robustness of GNN-based materials modeling and lays the foundation for future integration with active learning strategies.
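
To illustrate why a shallow ensemble is cheaper than a deep ensemble, the sketch below assumes that the ensemble members share one feature extractor and differ only in their final linear heads, so a single forward pass yields all member predictions. The spread of those predictions then serves as the uncertainty estimate. This is a simplified reading of the shallow-ensemble idea, not the DPOSE or SchNet code; the function name, head weights, and feature vectors are hypothetical.

```python
import statistics


def ensemble_predict(features, heads):
    """Return (mean, standard deviation) of the predictions of all heads.

    features: shared feature vector produced once by the backbone.
    heads: list of weight vectors, one linear output head per ensemble member.
    """
    predictions = [sum(w * f for w, f in zip(head, features)) for head in heads]
    return statistics.mean(predictions), statistics.stdev(predictions)


# Hypothetical heads that agree on the first feature and disagree on the second.
heads = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
in_domain = ensemble_predict([2.0, 0.0], heads)       # second feature inactive
out_of_domain = ensemble_predict([2.0, 10.0], heads)  # second feature far from zero
print(in_domain, out_of_domain)
```

Both inputs receive the same mean prediction, but the second one activates the direction in which the heads disagree, so its reported uncertainty is larger.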

[LG-28] Machine Learning Methods for Gene Regulatory Network Inference

Link: https://arxiv.org/abs/2504.12610
Authors: Akshata Hegde, Tom Nguyen, Jianlin Cheng
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments: 40 pages, 3 figures, 2 tables

Abstract:Gene Regulatory Networks (GRNs) are intricate biological systems that control gene expression and regulation in response to environmental and developmental cues. Advances in computational biology, coupled with high throughput sequencing technologies, have significantly improved the accuracy of GRN inference and modeling. Modern approaches increasingly leverage artificial intelligence (AI), particularly machine learning techniques including supervised, unsupervised, semi-supervised, and contrastive learning to analyze large scale omics data and uncover regulatory gene interactions. To support both the application of GRN inference in studying gene regulation and the development of novel machine learning methods, we present a comprehensive review of machine learning based GRN inference methodologies, along with the datasets and evaluation metrics commonly used. Special emphasis is placed on the emerging role of cutting edge deep learning techniques in enhancing inference performance. The potential future directions for improving GRN inference are also discussed.

[LG-29] Stochastic Gradient Descent in Non-Convex Problems: Asymptotic Convergence with Relaxed Step-Size via Stopping Time Methods

Link: https://arxiv.org/abs/2504.12601
Authors: Ruinan Jin, Difei Cheng, Hong Qiao, Xin Shi, Shaodong Liu, Bo Zhang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR)
Comments: 42 pages

Abstract:Stochastic Gradient Descent (SGD) is widely used in machine learning research. Previous convergence analyses of SGD under the vanishing step-size setting typically require Robbins-Monro conditions. However, in practice, a wider variety of step-size schemes are frequently employed, yet existing convergence results remain limited and often rely on strong assumptions. This paper bridges this gap by introducing a novel analytical framework based on a stopping-time method, enabling asymptotic convergence analysis of SGD under more relaxed step-size conditions and weaker assumptions. In the non-convex setting, we prove the almost sure convergence of SGD iterates for step-sizes $\{\epsilon_t\}_{t \geq 1}$ satisfying $\sum_{t=1}^{+\infty} \epsilon_t = +\infty$ and $\sum_{t=1}^{+\infty} \epsilon_t^p < +\infty$ for some $p > 2$. Compared with previous studies, our analysis eliminates the global Lipschitz continuity assumption on the loss function and relaxes the boundedness requirements for higher-order moments of stochastic gradients. Building upon the almost sure convergence results, we further establish $L_2$ convergence. These significantly relaxed assumptions make our theoretical results more general, thereby enhancing their applicability in practical scenarios.
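
As a worked illustration of the step-size condition stated in the abstract above (the polynomial schedule and the specific values of $\alpha$ and $p$ are assumptions chosen for this example, not taken from the paper), a schedule $\epsilon_t = t^{-\alpha}$ can satisfy the relaxed condition while violating the classical Robbins-Monro requirement $\sum_t \epsilon_t^2 < +\infty$:

```latex
% Illustrative polynomial schedule (assumed for this example only).
\[
\epsilon_t = t^{-\alpha}, \quad \frac{1}{p} < \alpha \le 1
\;\Longrightarrow\;
\sum_{t=1}^{+\infty} \epsilon_t = +\infty
\quad\text{and}\quad
\sum_{t=1}^{+\infty} \epsilon_t^{p} = \sum_{t=1}^{+\infty} t^{-\alpha p} < +\infty .
\]
% Concrete instance: \alpha = 0.4 and p = 3.
\[
\sum_{t=1}^{+\infty} t^{-0.4} = +\infty, \qquad
\sum_{t=1}^{+\infty} \epsilon_t^{3} = \sum_{t=1}^{+\infty} t^{-1.2} < +\infty, \qquad
\sum_{t=1}^{+\infty} \epsilon_t^{2} = \sum_{t=1}^{+\infty} t^{-0.8} = +\infty .
\]
```

Both lines follow from the standard fact that $\sum_t t^{-s}$ converges exactly when $s > 1$, so any $\alpha \in (1/p, 1/2]$ falls outside the square-summable regime yet inside the condition quoted in the abstract.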

[LG-30] Meta-Dependence in Conditional Independence Testing

Link: https://arxiv.org/abs/2504.12594
Authors: Bijan Mazaheri, Jiaqi Zhang, Caroline Uhler
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:Constraint-based causal discovery algorithms utilize many statistical tests for conditional independence to uncover networks of causal dependencies. These approaches to causal discovery rely on an assumed correspondence between the graphical properties of a causal structure and the conditional independence properties of observed variables, known as the causal Markov condition and faithfulness. Finite data yields an empirical distribution that is “close” to the actual distribution. Across these many possible empirical distributions, the correspondence to the graphical properties can break down for different conditional independencies, and multiple violations can occur at the same time. We study this “meta-dependence” between conditional independence properties using the following geometric intuition: each conditional independence property constrains the space of possible joint distributions to a manifold. The “meta-dependence” between conditional independences is informed by the position of these manifolds relative to the true probability distribution. We provide a simple-to-compute measure of this meta-dependence using information projections and consolidate our findings empirically using both synthetic and real-world data.

[LG-31] Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer

Link: https://arxiv.org/abs/2504.12589
Authors: Huaizhi Qu, Inyoung Choi, Zhen Tan, Song Wang, Sukwon Yun, Qi Long, Faizan Siddiqui, Kwonjoon Lee, Tianlong Chen
Subjects: Machine Learning (cs.LG)

Abstract:LLM ensembles are widely used for LLM judges. However, how to estimate their accuracy, especially in an efficient way, is unknown. In this paper, we present a principled maximum a posteriori (MAP) framework for an economical and precise estimation of the performance of LLM ensemble judgment. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, revising from the vanilla Binomial distribution. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling to balance accuracy with efficiency. Furthermore, we design a prior transfer mechanism that utilizes learned distributions on open-source datasets to improve estimation on a target dataset when only scarce annotations are available. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed distribution estimation of LLM ensemble judgment with minimum labeled samples. BetaConform is also validated empirically. For instance, with only 10 samples from the TruthfulQA dataset, for a Llama ensembled judge, BetaConform gauges its performance with error margin as small as 3.37%.

[LG-32] Software Engineering Principles for Fairer Systems: Experiments with GroupCART

Link: https://arxiv.org/abs/2504.12587
Authors: Kewen Peng, Hao Zhuo, Yicheng Yang, Tim Menzies
Subjects: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Discrimination-aware classification aims to make accurate predictions while satisfying fairness constraints. Traditional decision tree learners typically optimize for information gain in the target attribute alone, which can result in models that unfairly discriminate against protected social groups (e.g., gender, ethnicity). Motivated by these shortcomings, we propose GroupCART, a tree-based ensemble optimizer that avoids bias during model construction by optimizing not only for decreased entropy in the target attribute but also for increased entropy in protected attributes. Our experiments show that GroupCART achieves fairer models without data transformation and with minimal performance degradation. Furthermore, the method supports customizable weighting, offering a smooth and flexible trade-off between predictive performance and fairness based on user requirements. These results demonstrate that algorithmic bias in decision tree models can be mitigated through multi-task, fairness-aware learning. All code and datasets used in this study are available at: this https URL.

[LG-33] ChemKANs for Combustion Chemistry Modeling and Acceleration

Link: https://arxiv.org/abs/2504.12580
Authors: Benjamin C. Koenig, Suyong Kim, Sili Deng
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: B.C.K. and S.K. contributed equally to this work. 23 pages, 8 figures, and 1 table

Abstract:Efficient chemical kinetic model inference and application for combustion problems is challenging due to large ODE systems and widely separated time scales. Machine learning techniques have been proposed to streamline these models, though strong nonlinearity and numerical stiffness combined with noisy data sources make their application challenging. The recently developed Kolmogorov-Arnold Networks (KANs) and KAN ordinary differential equations (KAN-ODEs) have been demonstrated as powerful tools for scientific applications thanks to their rapid neural scaling, improved interpretability, and smooth activation functions. Here, we develop ChemKANs by augmenting the KAN-ODE framework with physical knowledge of the flow of information through the relevant kinetic and thermodynamic laws, as well as an elemental conservation loss term. This novel framework encodes strong inductive bias that enables streamlined training and higher accuracy predictions, while facilitating parameter sparsity through full sharing of information across all inputs and outputs. In a model inference investigation, we find that ChemKANs exhibit no overfitting or model degradation when tasked with extracting predictive models from data that is both sparse and noisy, a task that a standard DeepONet struggles to accomplish. Next, we find that a remarkably parameter-lean ChemKAN (only 344 parameters) can accurately represent hydrogen combustion chemistry, providing a 2x acceleration over the detailed chemistry in a solver that is generalizable to larger-scale turbulent flow simulations. These demonstrations indicate potential for ChemKANs in combustion physics and chemical kinetics, and demonstrate the scalability of generic KAN-ODEs in significantly larger and more numerically challenging problems than previously studied.

[LG-34] The Others: Naturally Isolating Out-of-Distribution Samples for Robust Open-Set Semi-Supervised Learning

Link: https://arxiv.org/abs/2504.12569
Authors: You Rim Choi, Subeom Park, Seojun Heo, Eunchung Noh, Hyung-Sin Kim
Subjects: Machine Learning (cs.LG)

Abstract:Open-Set Semi-Supervised Learning (OSSL) tackles the practical challenge of learning from unlabeled data that may include both in-distribution (ID) and unknown out-of-distribution (OOD) classes. However, existing OSSL methods form suboptimal feature spaces by either excluding OOD samples, interfering with them, or overtrusting their information during training. In this work, we introduce MagMatch, a novel framework that naturally isolates OOD samples through a prototype-based contrastive learning paradigm. Unlike conventional methods, MagMatch does not assign any prototypes to OOD samples; instead, it selectively aligns ID samples with class prototypes using an ID-Selective Magnetic (ISM) module, while allowing OOD samples - the “others” - to remain unaligned in the feature space. To support this process, we propose Selective Magnetic Alignment (SMA) loss for unlabeled data, which dynamically adjusts alignment based on sample confidence. Extensive experiments on diverse datasets demonstrate that MagMatch significantly outperforms existing methods in both closed-set classification accuracy and OOD detection AUROC, especially in generalizing to unseen OOD data.

[LG-35] Evolutionary Policy Optimization GECCO2025

Link: https://arxiv.org/abs/2504.12568
Authors: Zelal Su “Lain” Mustafaoglu, Keshav Pingali, Risto Miikkulainen
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Builds upon previous GECCO 2025 work

Abstract:A key challenge in reinforcement learning (RL) is managing the exploration-exploitation trade-off without sacrificing sample efficiency. Policy gradient (PG) methods excel in exploitation through fine-grained, gradient-based optimization but often struggle with exploration due to their focus on local search. In contrast, evolutionary computation (EC) methods excel in global exploration, but lack mechanisms for exploitation. To address these limitations, this paper proposes Evolutionary Policy Optimization (EPO), a hybrid algorithm that integrates neuroevolution with policy gradient methods for policy optimization. EPO leverages the exploration capabilities of EC and the exploitation strengths of PG, offering an efficient solution to the exploration-exploitation dilemma in RL. EPO is evaluated on the Atari Pong and Breakout benchmarks. Experimental results show that EPO improves both policy quality and sample efficiency compared to standard PG and EC methods, making it effective for tasks that require both exploration and local optimization.

[LG-36] Kernel Ridge Regression for Efficient Learning of High-Capacity Hopfield Networks

Link: https://arxiv.org/abs/2504.12561
Authors: Akira Tamamori
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 4 pages, 4 figures

Abstract:Hebbian learning limits Hopfield network capacity. While kernel methods like Kernel Logistic Regression (KLR) improve performance via iterative learning, we propose Kernel Ridge Regression (KRR) as an alternative. KRR learns dual variables non-iteratively via a closed-form solution, offering significant learning speed advantages. We show KRR achieves comparably high storage capacity (reaching ratio 1.5 shown) and noise robustness (recalling from around 80% corrupted patterns) as KLR, while drastically reducing training time, establishing KRR as an efficient method for building high-performance associative memories.
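
The closed-form learning step described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the RBF kernel, the kernel width `gamma`, the ridge term `lam`, the synchronous sign update, and all function names are hypothetical choices made for this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def fit_krr(patterns, gamma, lam=1e-3):
    # Non-iterative learning: dual variables alpha solve (K + lam * I) alpha = Y,
    # where the regression targets Y are the stored patterns themselves.
    K = rbf_kernel(patterns, patterns, gamma)
    return np.linalg.solve(K + lam * np.eye(len(patterns)), patterns)

def recall(state, patterns, alpha, gamma, steps=5):
    # Synchronous update s <- sign(k(s, X) @ alpha), repeated until a fixed point.
    s = state.astype(float)
    for _ in range(steps):
        k = rbf_kernel(s[None, :], patterns, gamma)   # shape (1, P)
        s_new = np.sign(k @ alpha).ravel()
        s_new[s_new == 0] = 1.0                       # break ties toward +1
        if np.array_equal(s_new, s):
            break
        s = s_new
    return s

rng = np.random.default_rng(0)
N, P = 64, 10                                         # neurons, stored patterns
patterns = rng.choice([-1.0, 1.0], size=(P, N))
gamma = 4.0 / N                                       # assumed kernel width
alpha = fit_krr(patterns, gamma)

noisy = patterns[0].copy()
flipped = rng.choice(N, size=5, replace=False)        # corrupt 5 of 64 bits
noisy[flipped] *= -1
restored = recall(noisy, patterns, alpha, gamma)
```

The only training cost here is a single linear solve of size P x P, which is the source of the speed advantage over iterative schemes that the abstract mentions.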

[LG-37] Fine Flood Forecasts: Incorporating local data into global models through fine-tuning

Link: https://arxiv.org/abs/2504.12559
Authors: Emil Ryd, Grey Nearing
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:Floods are the most common form of natural disaster and accurate flood forecasting is essential for early warning systems. Previous work has shown that machine learning (ML) models are a promising way to improve flood predictions when trained on large, geographically-diverse datasets. This requirement of global training can result in a loss of ownership for national forecasters who cannot easily adapt the models to improve performance in their region, preventing ML models from being operationally deployed. Furthermore, traditional hydrology research with physics-based models suggests that local data – which in many cases is only accessible to local agencies – is valuable for improving model performance. To address these concerns, we demonstrate a methodology of pre-training a model on a large, global dataset and then fine-tuning that model on data from individual basins. This results in performance increases, validating our hypothesis that there is extra information to be captured in local data. In particular, we show that performance increases are most significant in watersheds that underperform during global training. We provide a roadmap for national forecasters who wish to take ownership of global models using their own data, aiming to lower the barrier to operational deployment of ML-based hydrological forecast systems.

[LG-38] Reinforcement Learning from Human Feedback

Link: https://arxiv.org/abs/2504.12501
Authors: Nathan Lambert
Subjects: Machine Learning (cs.LG)
Comments: 123 pages. Web-native version at this https URL

Abstract:Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF – both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics – understudied research questions in synthetic data and evaluation – and open questions for the field.

[LG-39] Boosting Reservoir Computing with Brain-inspired Adaptive Dynamics

Link: https://arxiv.org/abs/2504.12480
Authors: Keshav Srinivasan, Dietmar Plenz, Michelle Girvan
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:Reservoir computers (RCs) provide a computationally efficient alternative to deep learning while also offering a framework for incorporating brain-inspired computational principles. By using an internal neural network with random, fixed connections - the ‘reservoir’ - and training only the output weights, RCs simplify the training process but remain sensitive to the choice of hyperparameters that govern activation functions and network architecture. Moreover, typical RC implementations overlook a critical aspect of neuronal dynamics: the balance between excitatory and inhibitory (E-I) signals, which is essential for robust brain function. We show that RCs characteristically perform best in balanced or slightly over-inhibited regimes, outperforming excitation-dominated ones. To reduce the need for precise hyperparameter tuning, we introduce a self-adapting mechanism that locally adjusts E/I balance to achieve target neuronal firing rates, improving performance by up to 130% in tasks like memory capacity and time series prediction compared with globally tuned RCs. Incorporating brain-inspired heterogeneity in target neuronal firing rates further reduces the need for fine-tuning hyperparameters and enables RCs to excel across linear and non-linear tasks. These results support a shift from static optimization to dynamic adaptation in reservoir design, demonstrating how brain-inspired mechanisms improve RC performance and robustness while deepening our understanding of neural computation.

[LG-40] You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models

Link: https://arxiv.org/abs/2504.12471
Authors: Shiwei Ding, Lan Zhang, Zhenlin Wang, Giuseppe Ateniese, Xiaoyong Yuan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)

Abstract:Fine-tuning plays a crucial role in adapting models to downstream tasks with minimal training efforts. However, the rapidly increasing size of foundation models poses a daunting challenge for accommodating foundation model fine-tuning in most commercial devices, which often have limited memory bandwidth. Techniques like model sharding and tensor parallelism address this issue by distributing computation across multiple devices to meet memory requirements. Nevertheless, these methods do not fully leverage their foundation nature in facilitating the fine-tuning process, resulting in high computational costs and imbalanced workloads. We introduce a novel Distributed Dynamic Fine-Tuning (D2FT) framework that strategically orchestrates operations across attention modules based on our observation that not all attention modules are necessary for forward and backward propagation in fine-tuning foundation models. Through three innovative selection strategies, D2FT significantly reduces the computational workload required for fine-tuning foundation models. Furthermore, D2FT addresses workload imbalances in distributed computing environments by optimizing these selection strategies via multiple knapsack optimization. Our experimental results demonstrate that the proposed D2FT framework reduces the training computational costs by 40% and training communication costs by 50% with only 1% to 2% accuracy drops on the CIFAR-10, CIFAR-100, and Stanford Cars datasets. Moreover, the results show that D2FT can be effectively extended to recent LoRA, a state-of-the-art parameter-efficient fine-tuning technique. With a 40% reduction in computational cost or a 50% reduction in communication cost, the top-1 accuracy of D2FT with LoRA drops by only 4% to 6% on the Stanford Cars dataset.

[LG-41] Geometric Generality of Transformer-Based Gröbner Basis Computation

Link: https://arxiv.org/abs/2504.12465
Authors: Yuta Kambe, Yota Maeda, Tristan Vaccon
Subjects: Machine Learning (cs.LG); Symbolic Computation (cs.SC); Algebraic Geometry (math.AG); Machine Learning (stat.ML)
Comments: 19 pages

Abstract:The intersection of deep learning and symbolic mathematics has seen rapid progress in recent years, exemplified by the work of Lample and Charton. They demonstrated that effective training of machine learning models for solving mathematical problems critically depends on high-quality, domain-specific datasets. In this paper, we address the computation of Gröbner basis using Transformers. While a dataset generation method tailored to Transformer-based Gröbner basis computation has previously been proposed, it lacked theoretical guarantees regarding the generality or quality of the generated datasets. In this work, we prove that datasets generated by the previously proposed algorithm are sufficiently general, enabling one to ensure that Transformers can learn a sufficiently diverse range of Gröbner bases. Moreover, we propose an extended and generalized algorithm to systematically construct datasets of ideal generators, further enhancing the training effectiveness of Transformer. Our results provide a rigorous geometric foundation for Transformers to address a mathematical problem, which is an answer to Lample and Charton’s idea of training on diverse or representative inputs.

[LG-42] M2FGB: A Min-Max Gradient Boosting Framework for Subgroup Fairness

Link: https://arxiv.org/abs/2504.12458
Authors: Jansen S. B. Pereira, Giovani Valdrighi, Marcos Medeiros Raimundo
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 7 figures

Abstract:In recent years, fairness in machine learning has emerged as a critical concern to ensure that developed and deployed predictive models do not have disadvantageous predictions for marginalized groups. It is essential to mitigate discrimination against individuals based on protected attributes such as gender and race. In this work, we consider applying subgroup justice concepts to gradient-boosting machines designed for supervised learning problems. Our approach expanded gradient-boosting methodologies to explore a broader range of objective functions, which combines conventional losses such as the ones from classification and regression and a min-max fairness term. We study relevant theoretical properties of the solution of the min-max optimization problem. The optimization process explored the primal-dual problems at each boosting round. This generic framework can be adapted to diverse fairness concepts. The proposed min-max primal-dual gradient boosting algorithm was theoretically shown to converge under mild conditions and empirically shown to be a powerful and flexible approach to address binary and subgroup fairness.

[LG-43] Can Moran Eigenvectors Improve Machine Learning of Spatial Data? Insights from Synthetic Data Validation

Link: https://arxiv.org/abs/2504.12450
Authors: Ziqi Li, Zhan Peng
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)

Abstract:Moran Eigenvector Spatial Filtering (ESF) approaches have shown promise in accounting for spatial effects in statistical models. Can this extend to machine learning? This paper examines the effectiveness of using Moran Eigenvectors as additional spatial features in machine learning models. We generate synthetic datasets with known processes involving spatially varying and nonlinear effects across two different geometries. Moran Eigenvectors calculated from different spatial weights matrices, with and without a priori eigenvector selection, are tested. We assess the performance of popular machine learning models, including Random Forests, LightGBM, XGBoost, and TabNet, and benchmark their accuracies in terms of cross-validated $R^2$ values against models that use only coordinates as features. We also extract coefficients and functions from the models using GeoShapley and compare them with the true processes. Results show that machine learning models using only location coordinates achieve better accuracies than eigenvector-based approaches across various experiments and datasets. Furthermore, we discuss that while these findings are relevant for spatial processes that exhibit positive spatial autocorrelation, they do not necessarily apply when modeling network autocorrelation and cases with negative spatial autocorrelation, where Moran Eigenvectors would still be useful.

[LG-44] Enhanced Battery Capacity Estimation in Data-Limited Scenarios through Swarm Learning

Link: https://arxiv.org/abs/2504.12444
Authors: Jiawei Zhang, Yu Zhang, Wei Xu, Yifei Zhang, Weiran Jiang, Qi Jiao, Yao Ren, Ziyou Song
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: This paper has been accepted for presentation at the 2025 IEEE Transportation Electrification Conference & Expo (ITEC)

Abstract:Data-driven methods have shown potential in electric-vehicle battery management tasks such as capacity estimation, but their deployment is bottlenecked by poor performance in data-limited scenarios. Sharing battery data among algorithm developers can enable accurate and generalizable data-driven models. However, an effective battery management framework that simultaneously ensures data privacy and fault tolerance is still lacking. This paper proposes a swarm battery management system that unites a decentralized swarm learning (SL) framework and credibility weight-based model merging mechanism to enhance battery capacity estimation in data-limited scenarios while ensuring data privacy and security. The effectiveness of the SL framework is validated on a dataset comprising 66 commercial LiNiCoAlO2 cells cycled under various operating conditions. Specifically, the capacity estimation performance is validated in four cases, including data-balanced, volume-biased, feature-biased, and quality-biased scenarios. Our results show that SL can enhance the estimation accuracy in all data-limited cases and achieve a similar level of accuracy with central learning where large amounts of data are available.

[LG-45] Learning Transferable Friction Models and LuGre Identification via Physics Informed Neural Networks

Link: https://arxiv.org/abs/2504.12441
Authors: Asutay Ozmen, João P. Hespanha, Katie Byl
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 8 figures, Submitted to 2025 64th IEEE Conference on Decision and Control (CDC)

Abstract:Accurately modeling friction in robotics remains a core challenge: robotics simulators like MuJoCo and PyBullet use simplified friction models or heuristics to balance computational efficiency with accuracy, and these simplifications and approximations can lead to substantial differences between simulated and physical performance. In this paper, we present a physics-informed friction estimation framework that enables the integration of well-established friction models with learnable components, requiring only minimal, generic measurement data. Our approach enforces physical consistency yet retains the flexibility to adapt to real-world complexities. We demonstrate, on an underactuated and nonlinear system, that the learned friction models, trained solely on small and noisy datasets, accurately simulate dynamic friction properties and reduce the sim-to-real gap. Crucially, we show that our approach enables the learned models to be transferable to systems they are not trained on. This ability to generalize across multiple systems streamlines friction modeling for complex, underactuated tasks, offering a scalable and interpretable path toward bridging the sim-to-real gap in robotics and control.

[LG-46] Standardization of Multi-Objective QUBOs

Link: https://arxiv.org/abs/2504.12419
Authors: Loong Kuan Lee, Thore Thassilo Gerlach, Nico Piatkowski
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantum Physics (quant-ph)
Comments: 7 pages, 3 figures

Abstract:Multi-objective optimization involving Quadratic Unconstrained Binary Optimization (QUBO) problems arises in various domains. A fundamental challenge in this context is the effective balancing of multiple objectives, each potentially operating on very different scales. This imbalance introduces complications such as the selection of appropriate weights when scalarizing multiple objectives into a single objective function. In this paper, we propose a novel technique for scaling QUBO objectives that uses an exact computation of the variance of each individual QUBO objective. By scaling each objective to have unit variance, we align all objectives onto a common scale, thereby allowing for more balanced solutions to be found when scalarizing the objectives with equal weights, as well as potentially assisting in the search or choice of weights during scalarization. Finally, we demonstrate its advantages through empirical evaluations on various multi-objective optimization problems. Our results are noteworthy since manually selecting scalarization weights is cumbersome, and reliable, efficient solutions are scarce.
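
The unit-variance scaling described in the abstract above can be sketched as follows. This sketch assumes the variance is taken over bit vectors drawn uniformly from {0,1}^n and uses a standard change of variables to ±1 spins. The paper's own derivation may differ, and the function names are hypothetical.

```python
import itertools
import numpy as np

def qubo_variance(Q):
    """Exact variance of f(x) = x^T Q x for x uniform on {0,1}^n (assumed distribution)."""
    Q = np.asarray(Q, dtype=float)
    a = np.diag(Q).copy()               # linear coefficients, since x_i^2 = x_i
    S = Q + Q.T
    np.fill_diagonal(S, 0.0)            # S[i, j] = coefficient of x_i * x_j for i != j
    # Substitute x_i = (1 + s_i) / 2 with s_i uniform on {-1, +1}.
    h = a / 2.0 + S.sum(axis=1) / 4.0   # coefficient of s_i
    J = np.triu(S, k=1) / 4.0           # coefficient of s_i * s_j for i < j
    # The monomials s_i and s_i * s_j are orthonormal with zero mean, so the
    # variance is the sum of squared coefficients.
    return float((h ** 2).sum() + (J ** 2).sum())

def standardize(Q):
    # Rescale the objective so that it has unit variance.
    return np.asarray(Q, dtype=float) / np.sqrt(qubo_variance(Q))

def brute_force_variance(Q):
    # Exponential-time check by enumerating all 2^n bit vectors (small n only).
    Q = np.asarray(Q, dtype=float)
    vals = [np.array(x) @ Q @ np.array(x)
            for x in itertools.product([0, 1], repeat=len(Q))]
    return float(np.var(vals))

rng = np.random.default_rng(1)
Q1 = rng.integers(-3, 4, size=(5, 5)).astype(float)
Q2 = 100.0 * rng.integers(-3, 4, size=(5, 5)).astype(float)  # very different scale
Q_sum = standardize(Q1) + standardize(Q2)                    # equal-weight scalarization
```

After standardization both objectives contribute on a comparable scale, so equal weights in `Q_sum` no longer let the larger-magnitude objective dominate.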

[LG-47] Diffusion Based Robust LiDAR Place Recognition ICRA2025

Link: https://arxiv.org/abs/2504.12412
Authors: Benjamin Krummenacher, Jonas Frey, Turcan Tuna, Olga Vysotska, Marco Hutter
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: accepted for ICRA 2025

Abstract:Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts inter and intra floors. In this paper, we focus on the global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution of potential global positions can provide multi-modal position distribution. We evaluate our approach across five real-world datasets and show the place recognition accuracy of 77% +/-2m on average while outperforming baselines at a factor of 2 in mean error.

[LG-48] Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time

Link: https://arxiv.org/abs/2504.13110
Authors: Margalit Glasgow, Denny Wu, Joan Bruna
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 70 pages

Abstract:We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle’s velocity in the mean-field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension $d$. We show that, due to a certain “self-concordance” property in these problems – where the local Hessian of a particle is bounded by a constant times the particle’s velocity – polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.

[LG-49] The Dissipation Theory of Aging: A Quantitative Analysis Using a Cellular Aging Map

Link: https://arxiv.org/abs/2504.13044
Authors: Farhan Khodaee, Rohola Zandie, Yufan Xia, Elazer R. Edelman
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)

Abstract:We propose a new theory for aging based on dynamical systems and provide a data-driven computational method to quantify the changes at the cellular level. We use ergodic theory to decompose the dynamics of changes during aging and show that aging is fundamentally a dissipative process within biological systems, akin to dynamical systems where dissipation occurs due to non-conservative forces. To quantify the dissipation dynamics, we employ a transformer-based machine learning algorithm to analyze gene expression data, incorporating age as a token to assess how age-related dissipation is reflected in the embedding space. By evaluating the dynamics of gene and age embeddings, we provide a cellular aging map (CAM) and identify patterns indicative of divergence in gene embedding space, nonlinear transitions, and entropy variations during aging for various tissues and cell types. Our results provide a novel perspective on aging as a dissipative process and introduce a computational framework that enables measuring age-related changes with molecular resolution.

[LG-50] Query Complexity of Classical and Quantum Channel Discrimination

Link: https://arxiv.org/abs/2504.12989
Authors: Theshani Nuradha, Mark M. Wilde
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 22 pages; see also the independent work “Sampling complexity of quantum channel discrimination” DOI https://doi.org/10.1088/1572-9494/adcb9e

Abstract:Quantum channel discrimination has been studied from an information-theoretic perspective, wherein one is interested in the optimal decay rate of error probabilities as a function of the number of unknown channel accesses. In this paper, we study the query complexity of quantum channel discrimination, wherein the goal is to determine the minimum number of channel uses needed to reach a desired error probability. To this end, we show that the query complexity of binary channel discrimination depends logarithmically on the inverse error probability and inversely on the negative logarithm of the (geometric and Holevo) channel fidelity. As a special case of these findings, we precisely characterize the query complexity of discriminating between two classical channels. We also provide lower and upper bounds on the query complexity of binary asymmetric channel discrimination and multiple quantum channel discrimination. For the former, the query complexity depends on the geometric Rényi and Petz Rényi channel divergences, while for the latter, it depends on the negative logarithm of (geometric and Uhlmann) channel fidelity. For multiple channel discrimination, the upper bound scales as the logarithm of the number of channels.

[LG-51] On the asymptotic behaviour of stochastic processes, with applications to supermartingale convergence, Dvoretzky’s approximation theorem, and stochastic quasi-Fejér monotonicity

Link: https://arxiv.org/abs/2504.12922
Authors: Morenikeji Neri, Nicholas Pischke, Thomas Powell
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Logic (math.LO); Probability (math.PR)
Comments: 41 pages

Abstract:We prove a novel and general result on the asymptotic behavior of stochastic processes which conform to a certain relaxed supermartingale condition. Our result provides quantitative information in the form of an explicit and effective construction of a rate of convergence for this process, both in mean and almost surely, that is moreover highly uniform in the sense that it only depends on very few data of the surrounding objects involved in the iteration. We then apply this result to derive new quantitative versions of well-known concepts and theorems from stochastic approximation, in particular providing effective rates for a variant of the Robbins-Siegmund theorem, Dvoretzky’s convergence theorem, as well as the convergence of stochastic quasi-Fejér monotone sequences, the latter of which is formulated in a novel and highly general metric context. We utilize the classic and widely studied Robbins-Monro procedure as a template to evaluate our quantitative results and their applicability in greater detail. We conclude by illustrating the breadth of potential further applications with a brief discussion of a variety of other well-known iterative procedures from stochastic approximation, covering a range of different applied scenarios to which our methods can be immediately applied. Throughout, we isolate and discuss special cases of our results which even allow for the construction of fast, and in particular linear, rates.

[LG-52] When do Random Forests work?

Link: https://arxiv.org/abs/2504.12860
Authors: C. Revelas, O. Boldea, B. J. M. Werker
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We study the effectiveness of randomizing split-directions in random forests. Prior literature has shown that, on the one hand, randomization can reduce variance through decorrelation, and, on the other hand, randomization regularizes and works in low signal-to-noise ratio (SNR) environments. First, we bring together and revisit decorrelation and regularization by presenting a systematic analysis of out-of-sample mean-squared error (MSE) for different SNR scenarios based on commonly-used data-generating processes. We find that variance reduction tends to increase with the SNR and forests outperform bagging when the SNR is low because, in low SNR cases, variance dominates bias for both methods. Second, we show that the effectiveness of randomization is a question that goes beyond the SNR. We present a simulation study with fixed and moderate SNR, in which we examine the effectiveness of randomization for other data characteristics. In particular, we find that (i) randomization can increase bias in the presence of fat tails in the distribution of covariates; (ii) in the presence of irrelevant covariates randomization is ineffective because bias dominates variance; and (iii) when covariates are mutually correlated randomization tends to be effective because variance dominates bias. Beyond randomization, we find that, for both bagging and random forests, bias can be significantly reduced in the presence of correlated covariates. This last finding goes beyond the prevailing view that averaging mostly works by variance reduction. Given that in practice covariates are often correlated, our findings on correlated covariates could open the way for a better understanding of why random forests work well in many applications.
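The decorrelation effect mentioned in this abstract can be illustrated with the textbook variance formula for an average of B identically distributed predictors, each with variance σ² and common pairwise correlation ρ: the variance of the average is ρσ² + (1 − ρ)σ²/B. The sketch below is not taken from the paper; the function name, parameter names, and numbers are illustrative assumptions.

```python
def ensemble_variance(sigma2: float, rho: float, n_trees: int) -> float:
    """Variance of the average of n_trees identically distributed predictors.

    Assumes each predictor has variance sigma2 and every pair has the same
    correlation rho (a textbook idealisation, not the paper's setting).
    """
    if n_trees < 1:
        raise ValueError("n_trees must be at least 1")
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Illustrative values only: lower correlation between trees (more
# randomization) lowers the variance floor rho * sigma2.
bagging_like = ensemble_variance(sigma2=1.0, rho=0.6, n_trees=100)
forest_like = ensemble_variance(sigma2=1.0, rho=0.2, n_trees=100)
```

The formula only covers the variance side. The abstract's point is that bias can move as well, which this idealisation does not capture.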

[LG-53] Universal Approximation with XL MIMO Systems: OTA Classification via Trainable Analog Combining

Link: https://arxiv.org/abs/2504.12758
Authors: Kyriakos Stylianopoulos, George C. Alexandropoulos
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Submitted to IEEE SPAWC 2025

Abstract:In this paper, we demonstrate that an eXtremely Large (XL) Multiple-Input Multiple-Output (MIMO) wireless system with appropriate analog combining components exhibits the properties of a universal function approximator, similar to a feedforward neural network. By treating the XL MIMO channel coefficients as the random nodes of a hidden layer, and the receiver’s analog combiner as a trainable output layer, we cast the end-to-end system into the Extreme Learning Machine (ELM) framework, leading to a novel formulation for Over-The-Air (OTA) edge inference that requires neither traditional digital processing nor pre-processing at the transmitter. Through theoretical analysis and numerical evaluation, we show that XL-MIMO-ELM enables near-instantaneous training and efficient classification, suggesting a paradigm shift in which beyond-massive MIMO systems act as neural networks alongside their profound communications role. Compared to deep learning approaches and conventional ELMs, the proposed framework achieves on-par performance with orders of magnitude lower complexity, making it highly attractive for ultra-low-power wireless devices.

[LG-54] A Two-Phase Perspective on Deep Learning Dynamics

Link: https://arxiv.org/abs/2504.12700
Authors: Robert de Mello Koch, Animik Ghosh
Subjects: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures

Abstract:We propose that learning in deep neural networks proceeds in two phases: a rapid curve fitting phase followed by a slower compression or coarse graining phase. This view is supported by the shared temporal structure of three phenomena: grokking, double descent and the information bottleneck, all of which exhibit a delayed onset of generalization well after training error reaches zero. We empirically show that the associated timescales align in two rather different settings. Mutual information between hidden layers and input data emerges as a natural progress measure, complementing circuit-based metrics such as local complexity and the linear mapping number. We argue that the second phase is not actively optimized by standard training algorithms and may be unnecessarily prolonged. Drawing on an analogy with the renormalization group, we suggest that this compression phase reflects a principled form of forgetting, critical for generalization.

[LG-55] Attractor-merging Crises and Intermittency in Reservoir Computing

Link: https://arxiv.org/abs/2504.12695
Authors: Tempei Kabayama, Motomasa Komuro, Yasuo Kuniyoshi, Kazuyuki Aihara, Kohei Nakajima
Subjects: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Dynamical Systems (math.DS)
Comments: 20 pages, 15 figures

Abstract:Reservoir computing can embed attractors into random neural networks (RNNs), generating a “mirror” of a target attractor because of its inherent symmetrical constraints. In these RNNs, we report that an attractor-merging crisis accompanied by intermittency emerges simply by adjusting the global parameter. We further reveal its underlying mechanism through a detailed analysis of the phase-space structure and demonstrate that this bifurcation scenario is intrinsic to a general class of RNNs, independent of training data.

[LG-56] Cluster weighted models with multivariate skewed distributions for functional data

Link: https://arxiv.org/abs/2504.12683
Authors: Cristina Anton, Roy Shivam Ram Shreshtth
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We propose a clustering method, funWeightClustSkew, based on mixtures of functional linear regression models and three skewed multivariate distributions: the variance-gamma distribution, the skew-t distribution, and the normal-inverse Gaussian distribution. Our approach follows the framework of the functional high dimensional data clustering (funHDDC) method, and we extend to functional data the cluster weighted models based on skewed distributions used for finite dimensional multivariate data. We consider several parsimonious models, and to estimate the parameters we construct an expectation maximization (EM) algorithm. We illustrate the performance of funWeightClustSkew for simulated data and for the Air Quality dataset.

[LG-57] Spectral Algorithms under Covariate Shift

Link: https://arxiv.org/abs/2504.12625
Authors: Jun Fan, Zheng-Chu Guo, Lei Shi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation into the convergence behavior of spectral algorithms under distribution shifts, specifically within the framework of reproducing kernel Hilbert spaces. Our study focuses on the case of covariate shift. In this scenario, the marginal distributions of the input data differ between the training and test datasets, while the conditional distribution of the output given the input remains unchanged. Under this setting, we analyze the generalization error of spectral algorithms and show that they achieve minimax optimality when the density ratios between the training and test distributions are uniformly bounded. However, we also identify a critical limitation: when the density ratios are unbounded, the spectral algorithms may become suboptimal. To address this limitation, we propose a weighted spectral algorithm that incorporates density ratio information into the learning process. Our theoretical analysis shows that this weighted approach achieves optimal capacity-independent convergence rates. Furthermore, by introducing a weight clipping technique, we demonstrate that the convergence rates of the weighted spectral algorithm can approach the optimal capacity-dependent convergence rates arbitrarily closely. This improvement resolves the suboptimality issue in unbounded density ratio scenarios and advances the state-of-the-art by refining existing theoretical results.
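As a rough illustration of the weight clipping idea in this abstract, the sketch below truncates density ratios at a fixed level and uses them in an importance-weighted squared loss. It is a minimal sketch under stated assumptions, not the paper's weighted spectral algorithm: the names `clipped_weights` and `weighted_squared_risk`, the clipping level, and the toy data are all hypothetical, and the density ratios are taken as given rather than estimated.

```python
from typing import Callable, List, Sequence


def clipped_weights(ratios: Sequence[float], clip: float) -> List[float]:
    """Truncate density ratios w(x) = p_test(x) / p_train(x) at the level clip."""
    if clip <= 0:
        raise ValueError("clip must be positive")
    return [min(r, clip) for r in ratios]


def weighted_squared_risk(
    predict: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Importance-weighted empirical squared loss over the training sample."""
    total = sum(w * (predict(x) - y) ** 2 for x, y, w in zip(xs, ys, weights))
    return total / len(xs)


# Toy data (assumed): the last training point has a very large density ratio.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 10.0]
ratios = [0.5, 1.0, 1.5, 50.0]

weights = clipped_weights(ratios, clip=5.0)
risk = weighted_squared_risk(lambda x: x, xs, ys, weights)
```

Clipping trades a small bias (large ratios are under-weighted) for a bounded influence of any single training point. That trade-off is the intuition behind using truncated weights when the ratios are unbounded.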

[LG-58] Featuremetric benchmarking: Quantum computer benchmarks based on circuit features

Link: https://arxiv.org/abs/2504.12575
Authors: Timothy Proctor, Anh Tran, Xingxin Liu, Aditya Dhumuntarao, Stefan Seritan, Alaina Green, Norbert M Linke
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Benchmarks that concisely summarize the performance of many-qubit quantum computers are essential for measuring progress towards the goal of useful quantum computation. In this work, we present a benchmarking framework that is based on quantifying how a quantum computer’s performance on quantum circuits varies as a function of features of those circuits, such as circuit depth, width, two-qubit gate density, problem input size, or algorithmic depth. Our featuremetric benchmarking framework generalizes volumetric benchmarking – a widely-used methodology that quantifies performance versus circuit width and depth – and we show that it enables richer and more faithful models of quantum computer performance. We demonstrate featuremetric benchmarking with example benchmarks run on IBM Q and IonQ systems of up to 27 qubits, and we show how to produce performance summaries from the data using Gaussian process regression. Our data analysis methods are also of interest in the special case of volumetric benchmarking, as they enable the creation of intuitive two-dimensional capability regions using data from few circuits.

[LG-59] Robust and Scalable Variational Bayes

Link: https://arxiv.org/abs/2504.12528
Authors: Carlos Misael Madrid Padilla, Shitao Fan, Lizhen Lin
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We propose a robust and scalable framework for variational Bayes (VB) that effectively handles outliers and contamination of arbitrary nature in large datasets. Our approach divides the dataset into disjoint subsets, computes the posterior for each subset, and applies VB approximation independently to these posteriors. The resulting variational posteriors with respect to the subsets are then aggregated using the geometric median of probability measures, computed with respect to the Wasserstein distance. This novel aggregation method yields the Variational Median Posterior (VM-Posterior) distribution. We rigorously demonstrate that the VM-Posterior preserves contraction properties akin to those of the true posterior, while accounting for approximation errors or the variational gap inherent in VB methods. We also provide a provable robustness guarantee for the VM-Posterior. Furthermore, we establish a variational Bernstein-von Mises theorem for both multivariate Gaussian distributions with general covariance structures and the mean-field variational family. To facilitate practical implementation, we adapt existing algorithms for computing the VM-Posterior and evaluate its performance through extensive numerical experiments. The results highlight its robustness and scalability, making it a reliable tool for Bayesian inference in the presence of complex, contaminated datasets.
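The median-based aggregation step can be illustrated in a deliberately simplified setting. For one-dimensional Gaussians, the 2-Wasserstein distance between N(m1, s1²) and N(m2, s2²) equals the Euclidean distance between (m1, s1) and (m2, s2). If each subset posterior is summarised by a (mean, standard deviation) pair and the median is restricted to Gaussians, the geometric median reduces to a Euclidean one, which Weiszfeld iterations approximate. This is a sketch under those assumptions, not the algorithm used in the paper; the function name and the toy summaries are hypothetical.

```python
import math
from typing import List, Sequence


def geometric_median(
    points: Sequence[Sequence[float]], n_iter: int = 200, eps: float = 1e-12
) -> List[float]:
    """Approximate the Euclidean geometric median with Weiszfeld iterations."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(n_iter):
        weight_sum = 0.0
        weighted = [0.0] * dim
        for p in points:
            # eps guards against division by zero when the iterate hits a data point.
            w = 1.0 / max(math.dist(current, p), eps)
            weight_sum += w
            for k in range(dim):
                weighted[k] += w * p[k]
        current = [v / weight_sum for v in weighted]
    return current


# Hypothetical (mean, std) summaries of five subset posteriors; the last
# subset is assumed to be contaminated.
subset_summaries = [(0.0, 1.0), (0.1, 1.1), (-0.1, 0.9), (0.05, 1.0), (8.0, 3.0)]
median = geometric_median(subset_summaries)
mean_of_means = sum(m for m, _ in subset_summaries) / len(subset_summaries)
```

In this toy example the plain average of the subset means is dragged toward the contaminated subset, while the geometric median stays with the majority cluster.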

[LG-60] Corner Gradient Descent

Link: https://arxiv.org/abs/2504.12519
Authors: Dmitry Yarotsky
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:We consider SGD-type optimization on infinite-dimensional quadratic problems with power law spectral conditions. It is well-known that on such problems deterministic GD has loss convergence rates $L_t=O(t^{-\zeta})$, which can be improved to $L_t=O(t^{-2\zeta})$ by using Heavy Ball with a non-stationary Jacobi-based schedule (and the latter rate is optimal among fixed schedules). However, in the mini-batch Stochastic GD setting, the sampling noise causes the Jacobi HB to diverge; accordingly no $O(t^{-2\zeta})$ algorithm is known. In this paper we show that rates up to $O(t^{-2\zeta})$ can be achieved by a generalized stationary SGD with infinite memory. We start by identifying generalized (S)GD algorithms with contours in the complex plane. We then show that contours that have a corner with external angle $\theta\pi$ accelerate the plain GD rate $O(t^{-\zeta})$ to $O(t^{-\theta\zeta})$. For deterministic GD, increasing $\theta$ allows to achieve rates arbitrarily close to $O(t^{-2\zeta})$. However, in Stochastic GD, increasing $\theta$ also amplifies the sampling noise, so in general $\theta$ needs to be optimized by balancing the acceleration and noise effects. We prove that the optimal rate is given by $\theta_{\max}=\min(2,\nu,\tfrac{2}{\zeta+1/\nu})$, where $\nu,\zeta$ are the exponents appearing in the capacity and source spectral conditions. Furthermore, using fast rational approximations of the power functions, we show that ideal corner algorithms can be efficiently approximated by finite-memory algorithms, and demonstrate their practical efficiency on a synthetic problem and MNIST.

[LG-61] A Survey on Archetypal Analysis

Link: https://arxiv.org/abs/2504.12392
Authors: Aleix Alcacer, Irene Epifanio, Sebastian Mair, Morten Mørup
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 20 pages, 13 figures, under review

Abstract:Archetypal analysis (AA) was originally proposed in 1994 by Adele Cutler and Leo Breiman as a computational procedure to extract the distinct aspects, called archetypes, in observations, with each observational record approximated as a mixture (i.e., convex combination) of these archetypes. AA thereby provides straightforward, interpretable, and explainable representations for feature extraction and dimensionality reduction, facilitating the understanding of the structure of high-dimensional data, with wide applications throughout the sciences. However, AA also faces challenges, particularly as the associated optimization problem is non-convex. This survey provides researchers and data mining practitioners with an overview of the methodologies and opportunities that AA has to offer, surveying the many applications of AA across disparate fields of science, as well as best practices for modeling data using AA and its limitations. The survey concludes by explaining important future research directions concerning AA.

[LG-62] Predictive control of blast furnace temperature in steelmaking with hybrid depth-infused quantum neural networks

Link: https://arxiv.org/abs/2504.12389
Authors: Nayoung Lee, Minsoo Shin, Asel Sagingalieva, Ayush Joshi Tripathi, Karan Pinto, Alexey Melnikov
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Accurate prediction and stabilization of blast furnace temperatures are crucial for optimizing the efficiency and productivity of steel production. Traditional methods often struggle with the complex and non-linear nature of the temperature fluctuations within blast furnaces. This paper proposes a novel approach that combines hybrid quantum machine learning with pulverized coal injection control to address these challenges. By integrating classical machine learning techniques with quantum computing algorithms, we aim to enhance predictive accuracy and achieve more stable temperature control. For this, we use a prediction-based optimization method. Our method leverages quantum-enhanced feature space exploration and the robustness of classical regression models to forecast temperature variations and optimize pulverized coal injection values. Our results demonstrate an improvement in prediction accuracy of over 25 percent, and our solution improved temperature stability to within ±7.6 degrees of the target range, compared with the earlier variation of ±50 degrees, highlighting the potential of hybrid quantum machine learning models in industrial steel production applications.

[LG-63] Resonances in reflective Hamiltonian Monte Carlo

Link: https://arxiv.org/abs/2504.12374
Authors: Namu Kroupa, Gábor Csányi, Will Handley
Subjects: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:In high dimensions, reflective Hamiltonian Monte Carlo with inexact reflections exhibits slow mixing when the particle ensemble is initialised from a Dirac delta distribution and the uniform distribution is targeted. By quantifying the instantaneous non-uniformity of the distribution with the Sinkhorn divergence, we elucidate the principal mechanisms underlying the mixing problems. In spheres and cubes, we show that the collective motion transitions between fluid-like and discretisation-dominated behaviour, with the critical step size scaling as a power law in the dimension. In both regimes, the particles can spontaneously unmix, leading to resonances in the particle density and the aforementioned problems. Additionally, low-dimensional toy models of the dynamics are constructed which reproduce the dominant features of the high-dimensional problem. Finally, the dynamics is contrasted with the exact Hamiltonian particle flow and tuning practices are discussed.

[LG-64] TransST: Transfer Learning Embedded Spatial Factor Modeling of Spatial Transcriptomics Data

Link: https://arxiv.org/abs/2504.12353
Authors: Shuo Shuo Liu, Shikun Wang, Yuxuan Chen, Anil K. Rustgi, Ming Yuan, Jianhua Hu
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Background: Spatial transcriptomics has emerged as a powerful tool in biomedical research because of its ability to capture both the spatial contexts and abundance of the complete RNA transcript profile in organs of interest. However, limitations of the technology such as the relatively low resolution and comparatively insufficient sequencing depth make it difficult to reliably extract real biological signals from these data. To alleviate this challenge, we propose a novel transfer learning framework, referred to as TransST, to adaptively leverage the cell-labeled information from external sources in inferring cell-level heterogeneity of target spatial transcriptomics data. Results: Applications in several real studies as well as a number of simulation settings show that our approach significantly improves on existing techniques. For example, in the breast cancer study, TransST successfully identifies five biologically meaningful cell clusters, including the two subgroups of cancer in situ and invasive cancer; in addition, among all the studied methods, only TransST is able to separate the adipose tissues from the connective tissues. Conclusions: In summary, the proposed method TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.

Information Retrieval

[IR-0] Should We Tailor the Talk? Understanding the Impact of Conversational Styles on Preference Elicitation in Conversational Recommender Systems

Link: https://arxiv.org/abs/2504.13095
Authors: Ivica Kostric, Krisztian Balog, Ujwal Gadiraju
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: To appear in: Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP '25), June 16–19, 2025, New York City, NY, USA

Abstract:Conversational recommender systems (CRSs) provide users with an interactive means to express preferences and receive real-time personalized recommendations. The success of these systems is heavily influenced by the preference elicitation process. While existing research mainly focuses on what questions to ask during preference elicitation, there is a notable gap in understanding what role broader interaction patterns including tone, pacing, and level of proactiveness play in supporting users in completing a given task. This study investigates the impact of different conversational styles on preference elicitation, task performance, and user satisfaction with CRSs. We conducted a controlled experiment in the context of scientific literature recommendation, contrasting two distinct conversational styles, high involvement (fast paced, direct, and proactive with frequent prompts) and high considerateness (polite and accommodating, prioritizing clarity and user comfort) alongside a flexible experimental condition where users could switch between the two. Our results indicate that adapting conversational strategies based on user expertise and allowing flexibility between styles can enhance both user satisfaction and the effectiveness of recommendations in CRSs. Overall, our findings hold important implications for the design of future CRSs.

[IR-1] CSMF: Cascaded Selective Mask Fine-Tuning for Multi-Objective Embedding-Based Retrieval SIGIR’25 SIGIR

Link: https://arxiv.org/abs/2504.12920
Authors: Hao Deng, Haibo Xing, Kanefumi Matsuyama, Moyu Zhang, Jinxin Hu, Hong Wen, Yu Zhang, Xiaoyi Zeng, Jing Zhang
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages, 8 figures, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13–18, 2025, Padua, Italy

Abstract:Multi-objective embedding-based retrieval (EBR) has become increasingly critical due to the growing complexity of user behaviors and commercial objectives. While traditional approaches often suffer from data sparsity and limited information sharing between objectives, recent methods utilizing a shared network alongside dedicated sub-networks for each objective partially address these limitations. However, such methods significantly increase the model parameters, leading to an increased retrieval latency and a limited ability to model causal relationships between objectives. To address these challenges, we propose the Cascaded Selective Mask Fine-Tuning (CSMF), a novel method that enhances both retrieval efficiency and serving performance for multi-objective EBR. The CSMF framework selectively masks model parameters to free up independent learning space for each objective, leveraging the cascading relationships between objectives during the sequential fine-tuning. Without increasing network parameters or online retrieval overhead, CSMF computes a linearly weighted fusion score for multiple objective probabilities while supporting flexible adjustment of each objective’s weight across various recommendation scenarios. Experimental results on real-world datasets demonstrate the superior performance of CSMF, and online experiments validate its significant practical value.
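The linearly weighted fusion score described in this abstract can be sketched as a weighted sum of per-objective probabilities. The code below only illustrates that scoring rule: the objective names, the probabilities, and the weights are hypothetical, and nothing here reflects how CSMF produces the probabilities or performs the fusion inside embedding-based retrieval.

```python
from typing import Dict


def fusion_score(probabilities: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linearly weighted fusion of per-objective probabilities for one item."""
    missing = set(weights) - set(probabilities)
    if missing:
        raise KeyError(f"no probability for objectives: {sorted(missing)}")
    return sum(weights[name] * probabilities[name] for name in weights)


# Hypothetical objectives and weights; a scenario can change the weights
# without retraining anything in this sketch.
item_probs = {"click": 0.30, "add_to_cart": 0.10, "purchase": 0.02}
scenario_weights = {"click": 1.0, "add_to_cart": 2.0, "purchase": 5.0}
score = fusion_score(item_probs, scenario_weights)
```

Because the weights enter only at scoring time, a different recommendation scenario can reuse the same probabilities with a different weight dictionary.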

[IR-2] FashionDPO: Fine-tune Fashion Outfit Generation Model using Direct Preference Optimization SIGIR’25

Link: https://arxiv.org/abs/2504.12900
Authors: Mingzhe Yu, Yunshan Ma, Lei Wu, Changshuo Wang, Xue Li, Lei Meng
Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)
Comments: Accepted by SIGIR’25

Abstract:Personalized outfit generation aims to construct a set of compatible and personalized fashion items as an outfit. Recently, generative AI models have received widespread attention, as they can generate fashion items for users to complete an incomplete outfit or create a complete outfit. However, they have limitations in terms of lacking diversity and relying on the supervised learning paradigm. Recognizing this gap, we propose a novel framework FashionDPO, which fine-tunes the fashion outfit generation model using direct preference optimization. This framework aims to provide a general fine-tuning approach to fashion generative models, refining a pre-trained fashion outfit generation model using automatically generated feedback, without the need to design a task-specific reward function. To make sure that the feedback is comprehensive and objective, we design a multi-expert feedback generation module which covers three evaluation perspectives, i.e., quality, compatibility and personalization. Experiments on two established datasets, i.e., iFashion and Polyvore-U, demonstrate the effectiveness of our framework in enhancing the model’s ability to align with users’ personalized preferences while adhering to fashion compatibility principles. Our code and model checkpoints are available at this https URL.

[IR-3] Validating LLM-Generated Relevance Labels for Educational Resource Search WSDM’25

Link: https://arxiv.org/abs/2504.12732
Authors: Ratan J. Sebastian, Anett Hoppe
Subjects: Information Retrieval (cs.IR)
Comments: Presented in the LLM4Eval Workshop Co-located with WSDM '25 in Hannover, Germany

Abstract:Manual relevance judgements in Information Retrieval are costly and require expertise, driving interest in using Large Language Models (LLMs) for automatic assessment. While LLMs have shown promise in general web search scenarios, their effectiveness for evaluating domain-specific search results, such as educational resources, remains unexplored. To investigate different ways of including domain-specific criteria in LLM prompts for relevance judgement, we collected and released a dataset of 401 human relevance judgements from a user study involving teaching professionals performing search tasks related to lesson planning. We compared three approaches to structuring these prompts: a simple two-aspect evaluation baseline from prior work on using LLMs as relevance judges, a comprehensive 12-dimensional rubric derived from educational literature, and criteria directly informed by the study participants. Using domain-specific frameworks, LLMs achieved strong agreement with human judgements (Cohen’s $\kappa$ up to 0.650), significantly outperforming the baseline approach. The participant-derived framework proved particularly robust, with GPT-3.5 achieving $\kappa$ scores of 0.639 and 0.613 for 10-dimension and 5-dimension versions respectively. System-level evaluation showed that LLM judgements reliably identified top-performing retrieval approaches (RBO scores 0.71-0.76) while maintaining reasonable discrimination between systems (RBO 0.52-0.56). These findings suggest that LLMs can effectively evaluate educational resources when prompted with domain-specific criteria, though performance varies with framework complexity and input structure.
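The agreement statistic reported in this abstract, Cohen's κ, compares observed agreement between two raters with the agreement expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label lists. The label data are made up for illustration and are unrelated to the paper's dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up binary relevance labels for ten items.
human = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
llm = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
kappa = cohens_kappa(human, llm)
```

Here the raters agree on 8 of 10 items and chance agreement is 0.48, so κ = 0.32 / 0.52, roughly 0.62.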
