This post contains the latest paper listing fetched from arXiv.org on 2025-03-12. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-03-12)
629 new papers today, including:
- Natural Language Processing: 72 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 163 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 221 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 187 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents ICLR2025
[Quick Read]: This paper addresses the threat that source bias poses to the sustainable development of the information-retrieval ecosystem. Source bias is the tendency of PLM-based retrieval models to assign higher relevance scores to LLM-generated content even when its semantic quality is comparable to human-written content. Explaining the retrieval process with a causal graph, the authors find that PLM-based retrievers exploit perplexity features for relevance estimation, ranking low-perplexity documents higher and thereby producing source bias. Theoretical analysis shows that the phenomenon stems from the positive correlation between the gradients of the loss functions of the language-modeling and retrieval tasks. To address it, the paper proposes Causal Diagnosis and Correction (CDC), a causal-inspired inference-time debiasing method whose key idea is to first diagnose the bias effect of perplexity and then separate that effect from the overall estimated relevance score. Experiments across three domains confirm CDC's superior debiasing effect and support the validity of the proposed explanatory framework.
Link: https://arxiv.org/abs/2503.08684
Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; CAS Key Laboratory of AI Safety, Institute of Computing Technology; Huawei Noah's Ark Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: ICLR 2025
Abstract:Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at this https URL.
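The diagnose-then-correct recipe can be sketched in a few lines of Python. This is a minimal illustration under a strong simplifying assumption (the perplexity bias enters the relevance score linearly, estimated here by ordinary least squares), not the authors' implementation, which derives the correction from their causal graph.

```python
def diagnose_bias(scores, perplexities):
    """Estimate the (assumed linear) effect of perplexity on the
    relevance score via an ordinary least-squares slope."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_p = sum(perplexities) / n
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(scores, perplexities))
    var = sum((p - mean_p) ** 2 for p in perplexities)
    return cov / var

def cdc_correct(scores, perplexities, bias_coef):
    """Remove the diagnosed perplexity effect from each relevance score."""
    return [s - bias_coef * p for s, p in zip(scores, perplexities)]
```

Under this toy linearity assumption, re-ranking on the corrected scores no longer rewards low-perplexity documents.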
[NLP-1] Self-Taught Self-Correction for Small Language Models
[Quick Read]: This paper tackles the fact that large language models (LLMs), despite strong performance across tasks, remain error-prone, and asks how models can be given the ability to correct themselves. Whereas prior approaches rely on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning on purely self-generated data. The key contribution is the Self-Taught Self-Correction (STaSC) algorithm, which combines several algorithmic design choices; it delivers significant performance gains on a question-answering task while enabling an in-depth analysis of the self-correction mechanism and of how the different design choices affect learning dynamics and overall performance.
Link: https://arxiv.org/abs/2503.08681
Authors: Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Affiliations: Skoltech; HSE University; University of Hamburg
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.
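The core loop (generate, self-correct, filter, fine-tune on the survivors) can be sketched as follows. The selection rule and field names are illustrative assumptions from our reading of the abstract, not the released STaSC code.

```python
def select_correction_pairs(samples, keep_already_correct=False):
    """One data-selection step of a self-taught self-correction loop:
    keep only (question, initial answer, corrected answer) triples where
    the model's self-correction reaches the gold answer. The kept
    triples become the fine-tuning data for the next iteration."""
    kept = []
    for s in samples:
        initial_ok = s["initial"] == s["gold"]
        corrected_ok = s["corrected"] == s["gold"]
        if corrected_ok and (keep_already_correct or not initial_ok):
            kept.append((s["question"], s["initial"], s["corrected"]))
    return kept
```

Whether to also train on corrections that merely preserve an already-correct answer is exactly the kind of design choice the paper's analysis examines.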
[NLP-2] Chain-of-Thought Reasoning In The Wild Is Not Always Faithful ICLR2025
[Quick Read]: This paper studies the unfaithfulness of Chain-of-Thought (CoT) reasoning. Prior work has mostly examined unfaithful CoT in unnatural settings where an explicit bias is introduced; this paper shows that even on realistic prompts with no artificial bias, frontier models (Sonnet 3.7, DeepSeek R1, and ChatGPT-4o) commonly exhibit several forms of unfaithful reasoning, including implicit post-hoc rationalization, silently repaired errors, and clearly illogical shortcuts used to solve problems. The contribution lies in documenting how widespread these behaviors are and probing their underlying mechanisms, which challenges AI-safety work that relies on monitoring CoT.
Link: https://arxiv.org/abs/2503.08679
Authors: Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the ICLR 2025 Workshop, 10 main paper pages, 38 appendix pages
Abstract:Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal concerning rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (30.6%), DeepSeek R1 (15.8%) and ChatGPT-4o (12.6%) all answer a high proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions (“implicit post-hoc rationalization”). For example, when separately presented with the questions “Is X bigger than Y?” and “Is Y bigger than X?”, models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
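The "implicit post-hoc rationalization" probe rests on a simple logical check over paired questions; here is a toy version of that check (our own sketch, not the paper's evaluation harness):

```python
def is_contradictory_pair(answer_xy, answer_yx):
    """For a strict comparison, "Is X bigger than Y?" and "Is Y bigger
    than X?" cannot both be Yes; when X and Y are known to differ, they
    also cannot both be No. This helper flags identical Yes/Yes or
    No/No answer pairs (the tie edge case is ignored in this sketch)."""
    a = answer_xy.strip().lower()
    b = answer_yx.strip().lower()
    return a == b and a in {"yes", "no"}
```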
[NLP-3] AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence
[Quick Read]: This paper targets the reliability of language agents in following operational constraints and safety protocols: although prior work shows these agents are effective at completing downstream tasks, their adherence to operational procedures and constraints remains largely unexplored. The key contribution is AgentOrca, a dual-system framework for evaluating whether language agents comply with operational constraints and routines. AgentOrca encodes action constraints and routines both as natural-language prompts for the agent and as corresponding executable code that serves as ground truth for automated verification. Through an automated pipeline for test-case generation and evaluation across five real-world domains, the paper quantifies how well current language agents adhere to operational constraints, revealing notable performance gaps among state-of-the-art models: large reasoning models such as o1 show higher compliance, while others perform markedly worse, especially under complex constraints or user persuasion attempts.
Link: https://arxiv.org/abs/2503.08669
Authors: Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, William Yang Wang, Xifeng Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As language agents progressively automate critical tasks across domains, their ability to operate within operational constraints and safety protocols becomes essential. While extensive research has demonstrated these agents’ effectiveness in downstream task completion, their reliability in following operational procedures and constraints remains largely unexplored. To this end, we present AgentOrca, a dual-system framework for evaluating language agents’ compliance with operational constraints and routines. Our framework encodes action constraints and routines through both natural language prompts for agents and corresponding executable code serving as ground truth for automated verification. Through an automated pipeline of test case generation and evaluation across five real-world domains, we quantitatively assess current language agents’ adherence to operational constraints. Our findings reveal notable performance gaps among state-of-the-art models, with large reasoning models like o1 demonstrating superior compliance while others show significantly lower performance, particularly when encountering complex constraints or user persuasion attempts.
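The dual-system idea, where the same constraint exists once as a natural-language prompt for the agent and once as executable ground truth, can be sketched like this. The constraint, trace format, and action names are invented for illustration; AgentOrca's actual encoding is richer.

```python
def verify_trace(trace, constraints):
    """Run the executable side of each constraint over an agent's action
    trace and return the names of all violated constraints."""
    return [name for name, holds in constraints.items() if not holds(trace)]

# A hypothetical routine: identity must be verified before any refund.
def verified_before_refund(trace):
    actions = [step["action"] for step in trace]
    if "issue_refund" not in actions:
        return True
    return "verify_identity" in actions[: actions.index("issue_refund")]
```

The natural-language twin of `verified_before_refund` would read, for example, "Always verify the customer's identity before issuing a refund" and be placed in the agent's system prompt.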
[NLP-4] Exploring the Word Sense Disambiguation Capabilities of Large Language Models
[Quick Read]: This paper evaluates the performance of large language models (LLMs) on Word Sense Disambiguation (WSD) by extending the earlier XL-WSD benchmark and redesigning two subtasks suited to LLMs: (1) given a word in a sentence, the LLM must generate the correct definition; (2) given a word in a sentence and a set of predefined senses, the LLM must select the correct one. The key contribution is this extended benchmark, built on XL-WSD and BabelNet; the results show that LLMs perform well in zero-shot learning, and that a fine-tuned model with a moderate number of parameters outperforms all other models, including the current state of the art.
Link: https://arxiv.org/abs/2503.08662
Authors: Pierpaolo Basile, Lucia Siciliani, Elio Musacchio, Giovanni Semeraro
Affiliations: Department of Computer Science, University of Bari Aldo Moro; Department of Computer Science, University of Pisa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Word Sense Disambiguation (WSD) is a historical task in computational linguistics that has received much attention over the years. However, with the advent of Large Language Models (LLMs), interest in this task (in its classical definition) has decreased. In this study, we evaluate the performance of various LLMs on the WSD task. We extend a previous benchmark (XL-WSD) to re-design two subtasks suitable for LLM: 1) given a word in a sentence, the LLM must generate the correct definition; 2) given a word in a sentence and a set of predefined meanings, the LLM must select the correct one. The extended benchmark is built using the XL-WSD and BabelNet. The results indicate that LLMs perform well in zero-shot learning but cannot surpass current state-of-the-art methods. However, a fine-tuned model with a medium number of parameters outperforms all other models, including the state-of-the-art.
[NLP-5] Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
[Quick Read]: This paper examines the safety risks that instruction-following retrievers introduce as their search capabilities grow. An empirical study finds that, given malicious queries, most retrievers can select relevant harmful passages (for 50% of queries); LLM2Vec, for instance, selects relevant harmful passages for 61.35% of malicious queries. The study further uncovers an emerging risk: highly relevant harmful information can be surfaced by exploiting the retrievers' instruction-following abilities. A key finding is that even safety-aligned LLMs such as Llama3 can be led to satisfy malicious requests when harmful retrieved passages are supplied in context, underscoring the misuse risks of increasing retriever capability and the need for stronger safety controls on how retrievers are used.
Link: https://arxiv.org/abs/2503.08644
Authors: Parishad BehnamGhader, Nicholas Meade, Siva Reddy
Affiliations: McGill University; Mila – Quebec AI Institute; Canada CIFAR AI Chair
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for 50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
[NLP-6] Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention
[Quick Read]: This paper addresses the high inference-time cost of many-shot in-context learning (ICL), which shifts the computational burden from training to inference; the cost grows further if a custom demonstration set is retrieved for each inference example. The key contribution is Dynamic Block-Sparse Attention, a training-free framework that combines carefully designed block-sparse attention with retrieval of cached groups of demonstrations. It achieves per-example latency comparable to fine-tuning while maintaining, on average, 95% of the best method's accuracy across strong ICL and fine-tuning baselines, making large-scale deployment of many-shot ICL practical.
Link: https://arxiv.org/abs/2503.08640
Authors: Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, Amanda Bertsch
Affiliations: Language Technologies Institute, Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training-time to inference-time, making deployment of many-shot ICL challenging to justify in-practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, a training-free framework for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average 95% of the best method’s accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.
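A block-level view of the attention pattern: the query attends to a shared prefix plus only the cached blocks of its own retrieved demonstration group, instead of the full demonstration set. This mask construction is a simplified sketch of the idea, not the paper's attention kernels.

```python
def block_sparse_mask(group_of_block, query_group, prefix_blocks=1):
    """Return a per-block visibility mask: True where the query may
    attend. Blocks with index below `prefix_blocks` form a shared
    prefix (e.g. a system prompt) visible to every query; the remaining
    blocks are visible only if they belong to the query's retrieved
    demonstration group, whose KV cache can be reused across queries."""
    return [i < prefix_blocks or g == query_group
            for i, g in enumerate(group_of_block)]
```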
[NLP-7] NSF-SciFy: Mining the NSF Awards Database for Scientific Claims
[Quick Read]: This paper builds NSF-SciFy, a large-scale dataset for scientific claim extraction, addressing the limitation that existing claim datasets are restricted to published literature. The key innovation is mining grant abstracts from the National Science Foundation (NSF) awards database; these abstracts capture scientific claims at an earlier stage of the research lifecycle and thus offer a unique advantage. The paper also introduces NSF-SciFy-MatSci, a materials-science subset used to evaluate three key tasks: technical-to-non-technical abstract generation, scientific claim extraction, and investigation-proposal extraction. Zero-shot prompting with frontier LLMs is used to jointly extract scientific claims and investigation proposals, and novel LLM-based evaluation metrics are introduced to ensure robust assessment of extraction quality. The result is the largest scientific claim dataset to date, opening new opportunities for claim verification and meta-scientific research.
Link: https://arxiv.org/abs/2503.08600
Authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch
Affiliations: University of Pennsylvania
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures, 6 tables
Abstract:We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions. Using zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date – with an estimated 2.8 million claims across all STEM disciplines funded by the NSF – NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.
[NLP-8] BiasEdit: Debiasing Stereotyped Language Models via Model Editing NAACL2025
[Quick Read]: This paper addresses stereotypical bias in language models and the shortcomings of existing debiasing strategies (such as fine-tuning on counterfactual data, representation projection, and prompting), which often fail to remove bias efficiently or to directly alter the model's biased internal representations. The key contribution is BiasEdit, an efficient model-editing method that removes stereotypical bias through lightweight networks acting as editors that generate parameter updates. BiasEdit uses a debiasing loss to guide the editor networks in making local edits to a subset of a language model's parameters, while a retention loss preserves language-modeling ability during editing. Experiments show that, compared with tangential debiasing baselines, BiasEdit eliminates bias more effectively, efficiently, and robustly, with little to no impact on the model's general capabilities; bias tracing is also performed to probe bias across modules and to study how editing affects different components of language models.
Link: https://arxiv.org/abs/2503.08588
Authors: Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
Affiliations: University of California, San Diego; Georgia Institute of Technology; Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted by TrustNLP @ NAACL 2025
Abstract:Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting often fail to efficiently eliminate bias or directly alter the models’ biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method to remove stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a debiasing loss guiding editor networks to conduct local edits on partial parameters of a language model for debiasing while preserving the language modeling abilities during editing through a retention loss. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangental debiasing baselines and little to no impact on the language models’ general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore bias editing impacts on different components of language models.
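The two-part objective described above, a debiasing term plus a retention term, can be written as a toy scalar function. This is our reading of the abstract, not the released code; the actual losses operate on model log-probabilities over stereotype benchmarks such as StereoSet.

```python
def biasedit_objective(logp_stereo, logp_anti, lm_loss_edited, lm_loss_orig,
                       retention_weight=1.0):
    """Debiasing term: push the edited model to score stereotypical and
    anti-stereotypical continuations equally. Retention term: keep the
    edited model's language-modeling loss close to the original's, so
    general capabilities survive the edit."""
    debias = abs(logp_stereo - logp_anti)
    retention = abs(lm_loss_edited - lm_loss_orig)
    return debias + retention_weight * retention
```

The editor networks would be trained to emit parameter updates that drive this combined objective toward zero.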
[NLP-9] DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
[Quick Read]: This paper addresses three challenges facing existing LLM-based scientific paper-review systems: limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. The proposed DeepReview is a multi-stage framework whose core idea is to emulate expert reviewer behavior by incorporating structured analysis, literature retrieval, and evidence-based argumentation. The key to the solution is DeepReviewer-14B, trained on DeepReview-13K, a curated dataset with structured annotations; it outperforms CycleReviewer-70B while using fewer tokens, and in its best mode achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1, respectively.
Link: https://arxiv.org/abs/2503.08569
Authors: Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
Affiliations: Zhejiang University; School of Engineering, Westlake University; Research Center for Industries of the Future; University College London
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have been released at this http URL.
[NLP-10] Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling NAACL
[Quick Read]: This paper addresses how to make large language models adopt extreme subword-level stylistic variation at inference time. The key contribution is an ngram-model-based logit-scaling technique that adjusts the model's output logits during inference, effectively transferring the extreme stylistic variation of a source text onto a pretrained language model. The core of the method is to track the perplexity of generated text under ngram-interpolated and original versions of an evaluation model, which allows selecting a sufficient degree of adaptation to the target author's or character's style while preserving fluency.
Link: https://arxiv.org/abs/2503.08550
Authors: Craig Messner, Tom Lippincott
Affiliations: Center for Digital Humanities, Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication at NLP4DH 2025 @ NAACL
Abstract:We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.
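A minimal sketch of the decoding-time interpolation follows. The function name, the additive form, and the unknown-token floor are our assumptions; the paper's exact scaling rule may differ.

```python
def scale_logits(llm_logits, ngram_logprobs, weight, unk_logprob=-20.0):
    """Shift each next-token logit by a weighted ngram log-probability
    from a style model fit on the target author's or character's text.
    `weight` controls the degree of style adaptation; tokens unseen by
    the ngram model receive a flat penalty `unk_logprob`."""
    return {tok: logit + weight * ngram_logprobs.get(tok, unk_logprob)
            for tok, logit in llm_logits.items()}
```

The abstract's selection criterion would then be applied over generations produced at different `weight` values: minimize perplexity under the ngram-interpolated evaluation model while keeping perplexity under the original model near that of the target author's text.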
[NLP-11] Graph of AI Ideas: Leveraging Knowledge Graphs and LLMs for AI Research Idea Generation
[Quick Read]: This paper addresses the difficulty researchers face in quickly analyzing literature and extracting meaningful research trends as the volume of publications surges and citation relationships grow complex. Existing paper-based idea-generation methods either simply feed papers into LLMs via prompts or build logical chains of creative development from citation links, without fully exploiting the semantic information embedded in those citations.

The key contribution is the Graph of AI Ideas (GoAI) framework, aimed at the AI research field, which is dominated by open-access papers. GoAI organizes relevant literature as entities in a knowledge graph and summarizes the semantic information carried by citations as relations in the graph, so that the graph reflects both the relationship between two academic papers and the advancement of the AI research field. This organization helps LLMs capture the current state of research progress and thereby enhances their creativity. Experimental results demonstrate the method's effectiveness in generating novel, clear, and effective research ideas.
Link: https://arxiv.org/abs/2503.08549
Authors: Xian Gao, Zongyun Zhang, Mingye Xie, Ting Liu, Yuzhuo Fu
Affiliations: Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Reading relevant scientific papers and analyzing research development trends is a critical step in generating new scientific ideas. However, the rapid increase in the volume of research literature and the complex citation relationships make it difficult for researchers to quickly analyze and derive meaningful research trends. The development of large language models (LLMs) has provided a novel approach for automatically summarizing papers and generating innovative research ideas. However, existing paper-based idea generation methods either simply input papers into LLMs via prompts or form logical chains of creative development based on citation relationships, without fully exploiting the semantic information embedded in these citations. Inspired by knowledge graphs and human cognitive processes, we propose a framework called the Graph of AI Ideas (GoAI) for the AI research field, which is dominated by open-access papers. This framework organizes relevant literature into entities within a knowledge graph and summarizes the semantic information contained in citations into relations within the graph. This organization effectively reflects the relationships between two academic papers and the advancement of the AI research field. Such organization aids LLMs in capturing the current progress of research, thereby enhancing their creativity. Experimental results demonstrate the effectiveness of our approach in generating novel, clear, and effective research ideas.
[NLP-12] DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
[Quick Read]: This paper addresses the challenge of evaluating free-form LLM outputs: traditional automatic metrics based on supervised signals fail to capture semantic equivalence or to handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. The key contribution is the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLMs as judges and engages a third arbitrator only when they disagree. This selective arbitration prioritizes evaluation reliability while avoiding the unnecessary computation of conventional majority voting. By combining task-specific reference answers with dynamic arbitration, DAFE significantly improves evaluation metrics such as Macro F1 and Cohen's Kappa, and comprehensive experiments, including human evaluation, establish it as a consistent, scalable, and resource-efficient framework for assessing free-form model outputs.
Link: https://arxiv.org/abs/2503.08542
Authors: Sher Badshah, Hassan Sajjad
Affiliations: Dalhousie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Evaluating Large Language Models (LLMs) free-form generated responses remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE utilizes task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen’s Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE’s ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.
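The selective-arbitration control flow is simple to sketch. Judges are modeled here as callables returning a boolean verdict; in practice each would wrap an LLM prompt over the candidate and the task-specific reference answer.

```python
def dafe_verdict(judge_a, judge_b, arbitrator, candidate, reference):
    """Two primary judges score the candidate against the reference;
    the arbitrator is consulted only when they disagree, spending a
    third model call only where reliability actually needs it."""
    a = judge_a(candidate, reference)
    b = judge_b(candidate, reference)
    return a if a == b else arbitrator(candidate, reference)
```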
[NLP-13] ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems NAACL2025
[Quick Read]: This paper addresses the difficulty of effectively comparing and analyzing end-to-end (E2E) and cascaded spoken dialogue systems, each of which typically ships with its own web interface, making direct comparison hard. The solution is an open-source, user-friendly toolkit for building unified web interfaces across cascaded and E2E spoken dialogue systems. Its key feature is a framework that provides on-the-fly automated evaluation metrics, including latency, ability to understand user input, coherence, diversity, and relevance of responses, and intelligibility and audio quality of output, and uses these metrics to compare systems against a human-human conversation dataset as a proxy, giving researchers direct insight into how the technologies compare.
Link: https://arxiv.org/abs/2503.08533
Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe
Affiliations: Carnegie Mellon University, USA; Sony Group Corporation, Japan; Kyoto University, Japan; Hugging Face, USA
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at NAACL 2025 Demo Track
Abstract:Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system makes it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: this https URL.
[NLP-14] Position-Aware Depth Decay Decoding (D3): Boosting Large Language Model Inference Efficiency
[Quick Read]: This paper addresses the heavy resource consumption of Large Language Model (LLM) inference caused by large parameter counts. Unlike traditional model compression, which requires retraining, recent dynamic-computation methods show that not all components are needed during inference, enabling a retraining-free pipeline. The key innovation is Position-Aware Depth Decay Decoding (D^3), a training-free algorithm that uses a power-law decay function, \left\lfloor L \times \alpha^i \right\rfloor, to dynamically determine how many layers to retain when generating token T_i, yielding an efficient layer-skipping mechanism. The design rests on the observation that tokens predicted later have lower perplexity and therefore need less computation. Experiments on models with 7B to 70B parameters show that D^3 achieves an average 1.5x speedup over the full-inference pipeline across a variety of generation tasks, with almost no loss in performance (<1% drop) on benchmarks such as GSM8K and BBH.
Link: https://arxiv.org/abs/2503.08524
Authors: Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang
Affiliations: Beijing Academy of Artificial Intelligence; University of Electronic Science and Technology of China; Institute of Computing Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding (D^3), which leverages a power-law decay function, \left\lfloor L \times \alpha^i \right\rfloor, to determine the number of layers to retain when generating token T_i. Remarkably, without any retraining, D^3 achieves success across a wide range of generation tasks for the first time. Experiments on large language models (i.e., the Llama family) with 7 to 70 billion parameters show that D^3 can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop (<1%) on the GSM8K and BBH benchmarks.
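The layer schedule from the abstract is a one-liner; the alpha value used in the test below is illustrative, not a tuned setting from the paper.

```python
import math

def layers_to_keep(total_layers, alpha, token_index):
    """Position-Aware Depth Decay Decoding (D^3) keeps
    floor(L * alpha^i) transformer layers when generating token T_i.
    With 0 < alpha < 1, later tokens (which tend to have lower
    perplexity) are routed through fewer layers."""
    return math.floor(total_layers * alpha ** token_index)
```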
[NLP-15] ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
[Quick Read]: This paper addresses the automation of the academic paper-review process, in particular how to generate comprehensive, accurate, reasoning-consistent review comments that align with human reviewers' judgments. The key contribution is ReviewAgents, a framework that uses large language models (LLMs) to generate academic paper reviews. The work introduces Review-CoT, a new dataset of 142k review comments that emulates the structured reasoning process of human reviewers, and uses a relevant-paper-aware training method to train LLM reviewer agents capable of structured reasoning. On top of this, ReviewAgents is built as a multi-role, multi-LLM agent review framework to improve review-comment generation, and ReviewBench is proposed as a benchmark for evaluating LLM-generated review comments. Experiments show that while existing LLMs display some potential for automating review, a gap to human-written reviews remains; the ReviewAgents framework further narrows that gap.
Link: https://arxiv.org/abs/2503.08506
Authors: Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: Work in progress
Abstract:Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers’ judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers-summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.
[NLP-16] Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models AAAI2025
[Quick Read]: This paper addresses a key challenge in multi-hop fact verification: establishing the veracity of a claim requires reasoning over complex entity relations and multiple pieces of evidence with intricate internal logic. Traditional approaches treat fact verification as a single-hop task and rely on semantic features at a single level of understanding, overlooking that real-world claim verification needs several pieces of evidence linked by complicated relations. Existing work that improves understanding and reasoning abilities still neglects the crucial relations between entities, which help a model better understand context and support prediction.

The key contribution is a Structured Knowledge-Augmented LLM-based Network (LLM-SKAN). Unlike methods that use large language models as predictors, this work employs them as relation extractors, since experiments indicate they are better at understanding than at reasoning. Concretely, LLM-SKAN first uses an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complex relations; it then applies a Knowledge-Augmented Relation Graph Fusion module so that nodes interact and the model learns comprehensive claim-evidence representations. This design highlights the importance of relation modeling and markedly improves multi-hop fact verification.

Experiments on four commonly used datasets show that LLM-SKAN outperforms existing methods, confirming its effectiveness and superiority.
Link: https://arxiv.org/abs/2503.08495
Authors: Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2025
Abstract:The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates the research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and relations to verify the given claim in real-world situations. Recent studies attempt to improve both understanding and reasoning abilities to enhance the performance, but they overlook the crucial relations between entities that benefit models to understand better and facilitate the prediction. To emphasize the significance of relations, we resort to Large Language Models (LLMs) considering their excellent understanding ability. Instead of other methods using LLMs as the predictor, we take them as relation extractors, for they do better in understanding rather than reasoning according to the experimental results. Thus, to solve the challenges above, we propose a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) for multi-hop fact verification. Specifically, we utilize an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complicated relations. Besides, we leverage a Knowledge-Augmented Relation Graph Fusion module to interact with each node and learn better claim-evidence representations comprehensively. The experimental results on four common-used datasets demonstrate the effectiveness and superiority of our model.
zh
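上文提到的“知识增强的关系图融合模块”可以用一个极简的消息传递过程来直观理解:声明与证据各为一个节点,LLM 抽取出的实体关系构成边,节点表示经若干轮邻居聚合后彼此融合。下面是一个纯 Python 的概念性草图,其中的向量、边与聚合方式均为演示用假设,并非论文官方实现:

```python
# 概念示意:节点 0 为声明,节点 1、2 为证据;边代表 LLM 抽取的实体关系。
# 每轮将邻居表示取均值后与自身表示平均,模拟最简单的图融合。

def graph_fuse(node_vecs, edges, steps=2):
    n = len(node_vecs)
    vecs = [list(v) for v in node_vecs]
    for _ in range(steps):
        new_vecs = []
        for i in range(n):
            # 视为无向图:边的任一端是 i 时,另一端即邻居
            neigh = [vecs[s] for (s, d) in edges if d == i] + \
                    [vecs[d] for (s, d) in edges if s == i]
            if not neigh:
                new_vecs.append(vecs[i][:])
                continue
            agg = [sum(col) / len(neigh) for col in zip(*neigh)]
            new_vecs.append([(a + b) / 2 for a, b in zip(vecs[i], agg)])
        vecs = new_vecs
    return vecs

fused = graph_fuse([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                   edges=[(0, 1), (0, 2)])
```

真实模型中节点表示由编码器产生、聚合带有可学习权重,此处仅呈现“声明-证据沿关系边交互”的骨架。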
[NLP-17] Stick to Facts: Towards Fidelity-oriented Product Description Generation EMNLP2019
【速读】: 该论文致力于解决产品描述生成任务中描述忠实性不足的问题,即生成的描述未能充分贴合产品属性信息,而传统文本生成方法对此关注较少。为填补这一研究空白,论文提出了一种名为 Fidelity-oriented Product Description Generator (FPDG) 的模型。FPDG 的关键创新在于结合实体标签进行生成:首先,提出基于实体标签引导的长短期记忆单元(Entity-label-guided LSTM, ELSTM)的 RNN 解码器,将每个词的嵌入向量及其实体标签共同作为输入,以增强对产品属性信息的关注;其次,构建以实体标签为键、关键词为值的关键词记忆模块,使 FPDG 能够通过关注实体标签来匹配相关关键词。在大规模真实数据集上的实验表明,FPDG 在传统生成指标与人工评估中均达到当前最优性能,并将描述的忠实性提高了25%。
链接: https://arxiv.org/abs/2503.08454
作者: Zhangming Chan,Xiuying Chen,Yongliang Wang,Juntao Li,Zhiqiang Zhang,Kun Gai,Dongyan Zhao,Rui Yan
机构: Center for Data Science, AAIS, Peking University (北京大学数据科学中心, AAIS); Wangxuan Institute of Computer Technology, Peking University (王选计算机研究所, 北京大学); Alibaba Group, Beijing (阿里巴巴集团, 北京)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2019
点击查看摘要
Abstract:Different from other text generation tasks, in product description generation, it is of vital importance to generate faithful descriptions that stick to the product attribute information. However, little attention has been paid to this problem. To bridge this gap, we propose a model named Fidelity-oriented Product Description Generator (FPDG). FPDG takes the entity label of each word into account, since the product attribute information is always conveyed by entity words. Specifically, we first propose a Recurrent Neural Network (RNN) decoder based on the Entity-label-guided Long Short-Term Memory (ELSTM) cell, taking both the embedding and the entity label of each word as input. Second, we establish a keyword memory that stores the entity labels as keys and keywords as values, allowing FPDG to attend to keywords by attending to their entity labels. Experiments conducted on a large-scale real-world product description dataset show that our model achieves state-of-the-art performance in terms of both traditional generation metrics and human evaluations. Specifically, FPDG increases the fidelity of the generated descriptions by 25%.
zh
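FPDG 的关键词记忆“以实体标签为键、关键词为值”,其取用方式可以用一个极简查表来示意。下面的标签名与商品数据均为演示用假设;真实模型通过注意力对标签加权读取记忆,而非硬性查表:

```python
# 概念示意:关键词记忆——键为实体标签,值为该标签下的候选关键词。
keyword_memory = {
    "BRAND": ["Nike", "Adidas"],
    "MATERIAL": ["棉", "聚酯纤维"],
    "COLOR": ["黑色", "白色"],
}

def attend_keywords(attended_labels):
    """按(注意力选中的)实体标签取回候选关键词,此处简化为直接查表。"""
    hits = []
    for label in attended_labels:
        hits.extend(keyword_memory.get(label, []))
    return hits

cands = attend_keywords(["MATERIAL", "COLOR"])
```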
[NLP-18] Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs’ Capacity to Detect Veracity of Political Information
【速读】: 该论文试图回答如何利用大型语言模型(Large Language Models, LLMs)进行事实核查,并评估其在自动化真实性识别领域的应用潜力与局限性。论文的关键解决方案在于采用系统性的AI审计方法,通过主题建模和回归分析,研究影响LLMs对真假及混合陈述判断准确性的因素,如提示的主题或模型类型。研究发现不同LLMs性能差异显著,且特定主题输出质量不均,这主要归因于训练数据的不足。因此,论文强调了在政治事实核查领域LLMs的潜力与限制,并提出了改进模型性能的潜在方向,包括优化模型护栏设计与针对性微调。
链接: https://arxiv.org/abs/2503.08404
作者: Elizaveta Kuznetsova,Ilaria Vitulano,Mykola Makhortykh,Martha Stolze,Tomas Nagy,Victoria Vziatysheva
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 15 pages, 2 figures
点击查看摘要
Abstract:The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use AI auditing methodology that systematically evaluates performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g. topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges for using LLMs for factchecking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.
zh
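论文按主题统计各模型的判断准确率,这一聚合步骤本身很容易复现。以下为概念性草图,其中的记录数据纯属演示假设,不代表论文结果:

```python
from collections import defaultdict

# (主题, 模型判断, 人工核查标签) —— 演示数据
records = [
    ("COVID-19", "false", "false"),
    ("COVID-19", "false", "false"),
    ("economy", "true", "false"),
    ("economy", "mixed", "mixed"),
]

def accuracy_by_topic(rows):
    """逐主题统计模型判断与人工核查标签一致的比例。"""
    hit, tot = defaultdict(int), defaultdict(int)
    for topic, pred, gold in rows:
        tot[topic] += 1
        hit[topic] += int(pred == gold)
    return {t: hit[t] / tot[t] for t in tot}

acc = accuracy_by_topic(records)
```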
[NLP-19] OpenRAG : Optimizing RAG End-to-End via In-Context Retrieval Learning
【速读】: 该论文试图解决传统信息检索(Information Retrieval, IR)场景中学到的相关性在检索增强生成(Retrieval-Augmented Generation, RAG)场景中可能不一致的问题。为了解决这一问题,论文提出了OpenRAG框架,其关键在于通过端到端调优检索器以捕获上下文相关性,从而适应多样且不断演进的需求。实验结果表明,OpenRAG通过端到端调优检索器,相比原始检索器提升了4.0%,并且比现有的最先进的检索器持续高出2.1%。此外,研究还发现,在某些任务中,经过端到端调优的小规模(0.2B参数)检索器的表现可以超越针对RAG或指令微调的大规模语言模型(8B参数),凸显了该方法在提升RAG系统效能方面的成本效益优势。
链接: https://arxiv.org/abs/2503.08398
作者: Jiawei Zhou,Lei Chen
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.
zh
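OpenRAG 端到端调优检索器以捕获“上下文相关性”,一种可能的实现思路是:把对下游生成帮助最大的文档当作正例,对检索器打分施加 softmax 负对数似然。以下为据此的极简演示(打分与奖励数值均为假设,并非论文公开实现):

```python
import math

def openrag_style_loss(retriever_scores, generation_rewards):
    """正例取生成奖励最高的文档;损失为其在检索打分 softmax 下的 -log 概率。"""
    pos = max(range(len(generation_rewards)),
              key=generation_rewards.__getitem__)
    z = sum(math.exp(s) for s in retriever_scores)
    return -math.log(math.exp(retriever_scores[pos]) / z)

# 检索器偏爱文档 0,但文档 1 才最有助于生成 → 损失较大,梯度会纠正检索器
loss = openrag_style_loss([2.0, 0.5, 0.1], [0.2, 0.9, 0.1])
```

当检索打分与生成奖励的排序一致时,该损失会明显减小,这正是“端到端调优”想达到的状态。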
[NLP-20] JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments
【速读】: 该论文试图解决葡萄牙语法律信息检索(Legal Information Retrieval, LIR)领域中缺乏带有查询相关性标注的数据集的问题。为了解决这一问题,作者构建了一个名为JurisTCU的巴西葡萄牙语数据集,包含来自巴西联邦审计法院的16,045份司法文档以及150条带有相关性判断的查询。解决方案的关键在于采用了一种混合方法来生成高质量的相关性标注,该方法结合了大型语言模型(LLM)打分与领域专家的验证。此外,通过在JurisTCU数据集上进行的14项实验,证明了文档扩展方法显著提升了基于BM25的传统检索性能,并且基于OpenAI嵌入模型的语义搜索在短关键词查询的P@10、R@10和nDCG@10指标上的提升超过了70%,表明这些密集向量能够更好地捕捉领域内的语义关系,而非单纯依赖词法匹配。
链接: https://arxiv.org/abs/2503.08379
作者: Leandro Carísio Fernandes,Leandro dos Santos Ribeiro,Marcos Vinícius Borela de Castro,Leonardo Augusto da Silva Pacheco,Edans Flávius de Oliveira Sandes
机构: tcu.gov.br (巴西联邦审计法院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 21 pages
点击查看摘要
Abstract:This paper introduces JurisTCU, a Brazilian Portuguese dataset for legal information retrieval (LIR). The dataset is freely available and consists of 16,045 jurisprudential documents from the Brazilian Federal Court of Accounts, along with 150 queries annotated with relevance judgments. It addresses the scarcity of Portuguese-language LIR datasets with query relevance annotations. The queries are organized into three groups: real user keyword-based queries, synthetic keyword-based queries, and synthetic question-based queries. Relevance judgments were produced through a hybrid approach combining LLM-based scoring with expert domain validation. We used JurisTCU in 14 experiments using lexical search (document expansion methods) and semantic search (BERT-based and OpenAI embeddings). We show that the document expansion methods significantly improve the performance of standard BM25 search on this dataset, with improvements exceeding 45% in P@10, R@10, and nDCG@10 metrics when evaluating short keyword-based queries. Among the embedding models, the OpenAI models produced the best results, with improvements of approximately 70% in P@10, R@10, and nDCG@10 metrics for short keyword-based queries, suggesting that these dense embeddings capture semantic relationships in this domain, surpassing the reliance on lexical terms. Besides offering a dataset for the Portuguese-language IR research community, suitable for evaluating search systems, the results also contribute to enhancing a search system highly relevant to Brazilian citizens.
zh
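摘要中显著提升 BM25 的“文档扩展”,常见做法是 doc2query 思路:为每篇文档生成若干可能的查询并追加到原文之后再建索引。以下为概念性草图,其中的生成结果用固定映射模拟,映射内容为演示假设:

```python
# 演示用的“伪 doc2query”结果:真实系统由生成模型为每篇文档产生查询
fake_doc2query = {
    "doc1": ["licitação superfaturada", "auditoria de obras"],
}

def expand_document(doc_id, text):
    """将生成的查询追加到文档原文之后,供 BM25 建索引使用。"""
    extra = " ".join(fake_doc2query.get(doc_id, []))
    return (text + " " + extra).strip()

expanded = expand_document("doc1", "Acórdão sobre irregularidades em contrato.")
```

扩展后,短关键词查询即使与原文措辞不同,也更容易与文档产生词法重叠。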
[NLP-21] Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
【速读】: 该论文试图解决在模型开发过程中因“metric interference (Mint)”而导致的误引导问题,即由于使用相同或相关的指标进行模型调优和评估,可能导致系统性能被高估,并且即使未直接优化的指标也可能受到严重干扰。论文分析了机器翻译任务中两种常见的Mint现象:训练数据过滤和基于质量信号的解码。为了解决这一问题,论文提出了一种名为MintAdjust的方法,旨在减轻Mint对评估结果的影响。MintAdjust的关键在于通过调整评分机制,更可靠地反映系统的实际性能,尤其是在高质量系统的表现排名上,其表现优于现有最先进的指标以及比赛组织者使用的集成方法AutoRank。
链接: https://arxiv.org/abs/2503.08327
作者: José Pombal,Nuno M. Guerreiro,Ricardo Rei,André F. T. Martins
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As automatic metrics become increasingly stronger and widely adopted, the risk of unintentionally “gaming the metric” during model development rises. This issue is caused by metric interference (Mint), i.e., the use of the same or related metrics for both model tuning and evaluation. Mint can misguide practitioners into being overoptimistic about the performance of their systems: as system outputs become a function of the interfering metric, their estimated quality loses correlation with human judgments. In this work, we analyze two common cases of Mint in machine translation-related tasks: filtering of training data, and decoding with quality signals. Importantly, we find that Mint strongly distorts instance-level metric scores, even when metrics are not directly optimized for – questioning the common strategy of leveraging a different, yet related metric for evaluation that is not used for tuning. To address this problem, we propose MintAdjust, a method for more reliable evaluation under Mint. On the WMT24 MT shared task test set, MintAdjust ranks translations and systems more accurately than state-of-the-art-metrics across a majority of language pairs, especially for high-quality systems. Furthermore, MintAdjust outperforms AutoRank, the ensembling method used by the organizers.
zh
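Mint 的核心症状是:当系统输出成为干扰指标的函数后,指标分数与人工评分的相关性下降。这一点可以用皮尔逊相关系数直接量化。以下草图中的分数纯属演示假设:

```python
import math

def pearson(xs, ys):
    """皮尔逊相关系数:协方差 / 两侧标准差之积。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# 调优前:指标与人工评分完全一致;受 Mint 干扰后:排序被打乱,相关性下降
r_before = pearson([1, 2, 3, 4], [1, 2, 3, 4])
r_after = pearson([1, 2, 3, 4], [2, 1, 4, 3])
```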
[NLP-22] Towards Scalable and Cross-Lingual Specialist Language Models for Oncology
【速读】: 该论文旨在解决临床肿瘤学领域中大量非结构化数据存在的不一致性、缺失信息和模糊性等问题,这些特性阻碍了基于数据驱动的决策制定。传统通用大型语言模型(LLMs)在处理这些问题时面临挑战,因其缺乏特定领域的推理能力,包括专门的临床术语理解、上下文依赖性解释以及多模态数据整合。为应对上述挑战,论文提出了一种专注于肿瘤学的高效且可适应的自然语言处理(NLP)框架,该框架结合了指令微调、检索增强生成(RAG)以及基于图的知识集成。此方法的关键在于通过这种综合策略提升模型在肿瘤学特定任务上的表现,如命名实体识别(例如识别癌症诊断)、实体链接(例如将实体连接到标准化本体)、TNM分期、文档分类(例如从病理报告中分类癌症亚型)以及治疗反应预测,同时保持模型的轻量化以实现资源效率和跨语言知识迁移能力。
链接: https://arxiv.org/abs/2503.08323
作者: Morteza Rohanian,Tarun Mehra,Nicola Miglino,Farhad Nooralahzadeh,Michael Krauthammer,Andreas Wicki
机构: University of Zurich and University Hospital Zurich, Switzerland (苏黎世大学和苏黎世大学医院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification.
zh
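框架中的“实体链接”任务是把报告里抽取的诊断词规范化到标准本体编码,其最简形式即字典归一化。以下草图中的映射表为演示假设,真实系统会对接 ICD-O、SNOMED CT 等本体:

```python
# 演示用的本体映射:提及(归一化后)→ 标准编码
onto = {
    "adenocarcinoma of lung": "ICD-O 8140/3",
    "肺腺癌": "ICD-O 8140/3",
}

def link_entity(mention):
    """大小写、首尾空白归一后查表;未命中时显式标记,便于人工复核。"""
    return onto.get(mention.strip().lower(), "UNMAPPED")

code = link_entity("  Adenocarcinoma of Lung ")
```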
[NLP-23] Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在智能门诊转诊(Intelligent Outpatient Referral, IOR)系统中的标准化评估标准缺失的问题,特别是在动态交互场景下的有效性评估。论文的关键解决方案是提出了一套专门针对IOR系统的综合评估框架,该框架包含两个核心任务:静态评估,用于评价预定义门诊转诊的能力;动态评估,通过迭代对话评估模型优化门诊转诊推荐的能力。研究发现,LLMs相比BERT-like模型在某些方面仅具有有限的优势,但在交互对话中提出有效问题的能力展现出潜力。
链接: https://arxiv.org/abs/2503.08292
作者: Xiaoxiao Liu,Qingying Xiao,Junying Chen,Xiangyi Feng,Xiangbo Wu,Bairui Zhang,Xiang Wan,Jian Chang,Guangjun Yu,Yan Hu,Benyou Wang
机构: Bournemouth Univerisity (伯恩茅斯大学); National Health Data Institute, Shenzhen (深圳国家健康数据研究所); Chinese University of Hong Kong, Shenzhen (香港中文大学深圳校区); Shenzhen Research Institute of Big Data (深圳大数据研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.
zh
[NLP-24] Beyond Outlining: Heterogeneous Recursive Planning for Adaptive Long-form Writing with Language Models
【速读】: 该论文试图解决长篇写作代理在信息检索(Retrieval)、推理(Reasoning)与组合(Composition)之间缺乏灵活集成与交互的问题,特别是当前方法因预设工作流程和僵化思维模式,在生成大纲后再进行写作,导致写作过程适应性受限。论文的关键解决方案在于提出一种通用代理框架,通过递归任务分解(Recursive Task Decomposition)和三种基本任务类型(Retrieval、Reasoning 和 Composition)的动态集成实现类人自适应写作。其核心创新包括:1)一种交织递归任务分解与执行的规划机制,消除了人为对写作流程的限制;2)促进异构任务分解的任务类型集成。实验表明,该方法在小说写作和科技报告生成任务中均优于现有最先进的方法,验证了所提框架的有效性和广泛适用性。
链接: https://arxiv.org/abs/2503.08275
作者: Ruibin Xiong,Yimeng Chen,Dmitrii Khizbullin,Jürgen Schmidhuber
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 2 figures
点击查看摘要
Abstract:Long-form writing agents require flexible integration and interaction across information retrieval, reasoning, and composition. Current approaches rely on predetermined workflows and rigid thinking patterns to generate outlines before writing, resulting in constrained adaptability during writing. In this paper we propose a general agent framework that achieves human-like adaptive writing through recursive task decomposition and dynamic integration of three fundamental task types, i.e. retrieval, reasoning, and composition. Our methodology features: 1) a planning mechanism that interleaves recursive task decomposition and execution, eliminating artificial restrictions on writing workflow; and 2) integration of task types that facilitates heterogeneous task decomposition. Evaluations on both fiction writing and technical report generation show that our method consistently outperforms state-of-the-art approaches across all automatic evaluation metrics, which demonstrate the effectiveness and broad applicability of our proposed framework.
zh
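“递归任务分解 + 三类基本任务动态集成”的骨架可以用一个很短的递归函数示意:每个任务先递归执行子任务,再按自身类型(retrieval / reasoning / composition)汇总结果。以下为概念性草图,分解结构在此写死,真实系统由模型在写作过程中动态决定:

```python
def execute(task):
    kind, payload = task["type"], task["payload"]
    results = [execute(t) for t in task.get("subtasks", [])]  # 先递归执行子任务
    if kind == "composition":
        return payload + "".join(results)            # 组合:拼接子任务产物
    return f"[{kind}:{payload}]" + "".join(results)  # 检索/推理:此处以占位符表示

plan_tree = {
    "type": "composition", "payload": "报告:",
    "subtasks": [
        {"type": "retrieval", "payload": "检索背景"},
        {"type": "reasoning", "payload": "归纳要点"},
    ],
}
doc = execute(plan_tree)
```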
[NLP-25] Investigating Execution-Aware Language Models for Code Optimization
【速读】: 该论文试图解决的问题是如何通过将代码执行信息整合到语言模型中,提升其在代码优化任务中的性能。论文的关键在于探索三种不同的训练策略,分别将代码行执行、行覆盖率、分支覆盖率以及变量状态这四个代码执行方面的信息引入CodeT5+模型,并评估这些执行感知模型在代码优化任务中的表现。研究结果表明,与标准的CodeT5+模型相比,这些执行感知模型仅提供了有限的改进。
链接: https://arxiv.org/abs/2503.08228
作者: Federico Di Menna,Luca Traini,Gabriele Bavota,Vittorio Cortellessa
机构: University of L’Aquila (拉奎拉大学); Software Institute - Università della Svizzera Italiana (瑞士意大利语区大学软件研究所)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
备注:
点击查看摘要
Abstract:Code optimization is the process of enhancing code efficiency, while preserving its intended functionality. This process often requires a deep understanding of the code execution behavior at run-time to identify and address inefficiencies effectively. Recent studies have shown that language models can play a significant role in automating code optimization. However, these models may have insufficient knowledge of how code execute at run-time. To address this limitation, researchers have developed strategies that integrate code execution information into language models. These strategies have shown promise, enhancing the effectiveness of language models in various software engineering tasks. However, despite the close relationship between code execution behavior and efficiency, the specific impact of these strategies on code optimization remains largely unexplored. This study investigates how incorporating code execution information into language models affects their ability to optimize code. Specifically, we apply three different training strategies to incorporate four code execution aspects – line executions, line coverage, branch coverage, and variable states – into CodeT5+, a well-known language model for code. Our results indicate that execution-aware models provide limited benefits compared to the standard CodeT5+ model in optimizing code.
zh
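论文所用的“行执行 / 行覆盖”信号,在 Python 中可以用标准库 sys.settrace 采集,这也是此类执行感知训练数据的典型来源之一。以下草图把覆盖到的行记为相对于函数 def 行的偏移;示例函数为演示假设:

```python
import sys

def collect_line_offsets(func, *args):
    """运行 func 并收集其被执行的行(相对 def 行的偏移)。"""
    hits = set()
    code = func.__code__
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is code:
            hits.add(frame.f_lineno - code.co_firstlineno)
        return tracer
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)   # 无论如何都要恢复,避免影响后续代码
    return hits

def branchy(x):
    if x > 0:
        return "pos"
    return "neg"

pos_cov = collect_line_offsets(branchy, 5)    # 走 if 分支
neg_cov = collect_line_offsets(branchy, -1)   # 走 fallthrough 分支
```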
[NLP-26] A Grey-box Text Attack Framework using Explainable AI
【速读】: 本文旨在解决复杂黑箱模型预测难以被人类理解的问题,并同时探索如何利用可解释人工智能(Explainable AI)技术定位这些模型可能存在的漏洞。传统对抗性文本攻击方法主要依赖于词替换、数据增强以及基于梯度的方法,但这些方法通常需要白盒访问且容易被人类察觉,因此缺乏实际应用价值。针对这一局限性,本文提出了一种结合灰盒与黑盒特性的简单而有效的攻击方案,无需直接了解目标模型的具体细节,而是通过一组代理Transformer/BERT模型来执行攻击任务。关键在于利用Transformer架构中的注意力机制,该机制使模型能够捕捉序列中的长距离依赖关系,从而实现从一个BERT模型生成的对抗样本向其他BERT模型的有效迁移(Transferability)。通过攻击不同的代理Transformer变体,研究展示了只需修改少量单词即可生成语义合理且不易被人类发现但仍能误导其他BERT模型的对抗句子。
链接: https://arxiv.org/abs/2503.08226
作者: Esther Chiramal,Kelvin Soh Boon Kai
机构: Nanyang Technological University (南洋理工大学); University of Glasgow (格拉斯哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Explainable AI is a strong strategy implemented to understand complex black-box model predictions in a human interpretable language. It provides the evidence required to execute the use of trustworthy and reliable AI systems. On the other hand, however, it also opens the door to locating possible vulnerabilities in an AI model. Traditional adversarial text attack uses word substitution, data augmentation techniques and gradient-based attacks on powerful pre-trained Bidirectional Encoder Representations from Transformers (BERT) variants to generate adversarial sentences. These attacks are generally whitebox in nature and not practical as they can be easily detected by humans E.g. Changing the word from “Poor” to “Rich”. We proposed a simple yet effective Grey-box cum Black-box approach that does not require the knowledge of the model while using a set of surrogate Transformer/BERT models to perform the attack using Explainable AI techniques. As Transformers are the current state-of-the-art models for almost all Natural Language Processing (NLP) tasks, an attack generated from BERT1 is transferable to BERT2. This transferability is made possible due to the attention mechanism in the transformer that allows the model to capture long-range dependencies in a sequence. Using the power of BERT generalisation via attention, we attempt to exploit how transformers learn by attacking a few surrogate transformer variants which are all based on a different architecture. We demonstrate that this approach is highly effective to generate semantically good sentences by changing as little as one word that is not detectable by humans while still fooling other BERT models.
zh
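文中“借助可解释 AI 定位可替换的关键单词”,最简单的替身是遮挡法(occlusion):逐一删去单词、观察代理模型打分的变化。以下草图用一个假设的玩具打分函数代替真实的代理 BERT 模型:

```python
def toy_score(text):
    # 演示用的假设代理模型:含 "poor" 的句子得分低(判负面)
    return 0.1 if "poor" in text else 0.9

def word_importance(sentence, score_fn):
    """逐词遮挡,重要性 = 遮挡前后打分变化的绝对值,按重要性降序返回。"""
    words = sentence.split()
    base = score_fn(sentence)
    imps = []
    for i in range(len(words)):
        masked = " ".join(words[:i] + words[i + 1:])
        imps.append((words[i], abs(base - score_fn(masked))))
    return sorted(imps, key=lambda p: p[1], reverse=True)

ranked = word_importance("the service was poor overall", toy_score)
```

真实流程中,score_fn 换成代理 Transformer 的输出概率;找到最重要的词后做同义替换,即可利用迁移性攻击其他 BERT 模型。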
[NLP-27] DeepRAG : Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch
【速读】: 该论文试图解决低资源语言(如印地语)在检索增强生成(Retrieval-Augmented Generation, RAG)系统中高质量嵌入缺乏的问题。传统多语言模型在处理印地语等低资源语言时表现不佳,而该研究通过从零开始构建专用嵌入模型来解决这一问题。关键在于采用自定义的SentencePiece分词器以理解印地语形态,并设计包含印地语特定注意力机制的Transformer架构,结合对比学习进行优化。最终,与多语言模型相比,检索精度提升了23%,证明了语言特定方法的重要性。
链接: https://arxiv.org/abs/2503.08213
作者: Nandakishor M
机构: Deepmost Innovations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In this paper, I present our work on DeepRAG, a specialized embedding model we built specifically for Hindi language in RAG systems. While LLMs have gotten really good at generating text, their performance in retrieval tasks still depends heavily on having quality embeddings - something that’s been lacking for Hindi despite being one of the world’s most spoken languages. We tackled this by creating embeddings from the ground up rather than just fine-tuning existing models. Our process involved collecting diverse Hindi texts (over 2.7M samples), training a custom SentencePiece tokenizer that actually understands Hindi morphology, designing transformer architecture with Hindi-specific attention mechanisms, and optimizing with contrastive learning. Results were honestly better than I expected - we saw a 23% improvement in retrieval precision compared to the multilingual models everyone’s been using. The paper details our methodology, which I think could help others working with low-resource languages where the one-size-fits-all multilingual models fall short. We’ve also integrated our embeddings with LangChain to build complete Hindi RAG systems, which might be useful for practitioners. While there’s still tons more to explore, I believe this work addresses a critical gap for Hindi NLP and demonstrates why language-specific approaches matter.
zh
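嵌入模型产出后,RAG 的检索一步就是按余弦相似度对文档排序。以下为纯 Python 的概念性草图;向量为演示假设,真实系统中由训练好的印地语嵌入模型编码查询与文档:

```python
import math

def cosine(a, b):
    """两向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query_vec = [1.0, 0.0]
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.0, 1.0]}
best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
```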
[NLP-28] Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中因历史对话影响而加剧的安全漏洞问题,特别是针对现有单轮交互研究忽视多轮对话历史对模型行为影响的不足,提出了一种新颖的越狱攻击范式——对话注入攻击(Dialogue Injection Attack, DIA)。DIA的关键在于利用历史对话记录来提升攻击的成功率,其操作无需访问模型内部机制,仅需通过聊天API或了解LLM的聊天模板即可实现。论文提出了两种构建对抗性历史对话的方法:一种基于灰盒预填充攻击的适应性策略,另一种利用延迟响应机制。实验结果表明,DIA在最新LLMs(如Llama-3.1和GPT-4o)上的攻击成功率达到了当前最优水平,并成功绕过了五种不同的防御机制,验证了其鲁棒性和有效性。
链接: https://arxiv.org/abs/2503.08195
作者: Wenlong Meng,Fan Zhang,Wendao Yao,Zhenyuan Guo,Yuwei Li,Chengkun Wei,Wenzhi Chen
机构: Zhejiang University (浙江大学); National University of Defense Technology (国防科技大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 10 figures
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks. These attacks manipulate LLMs to generate harmful or unethical content by crafting adversarial prompts. While much of the current research on jailbreak attacks has focused on single-turn interactions, it has largely overlooked the impact of historical dialogues on model behavior. In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages the dialogue history to enhance the success rates of such attacks. DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM’s chat template. We propose two methods for constructing adversarial historical dialogues: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o. Additionally, we demonstrate that DIA can bypass 5 different defense mechanisms, highlighting its robustness and effectiveness.
zh
[NLP-29] Automating Violence Detection and Categorization from Ancient Texts
【速读】: 该论文试图解决通过手动方式从古代文本中提取和分析暴力描述数据效率低下且耗时的问题。论文的关键解决方案在于评估大型语言模型(Large Language Models, LLMs)在识别和多维度分类历史文本中暴力描述的有效性,并通过微调(fine-tuning)和数据增强(data augmentation)技术提升模型性能,最终实现高达0.93的暴力检测F1分数和0.86的细粒度暴力分类F1分数。
链接: https://arxiv.org/abs/2503.08192
作者: Alhassan Abdelhalim,Michaela Regneri
机构: Universität Hamburg (汉堡大学), Dept. of Computer Science (计算机科学系), Hamburg, Germany (德国)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Violence descriptions in literature offer valuable insights for a wide range of research in the humanities. For historians, depictions of violence are of special interest for analyzing the societal dynamics surrounding large wars and individual conflicts of influential people. Harvesting data for violence research manually is laborious and time-consuming. This study is the first one to evaluate the effectiveness of large language models (LLMs) in identifying violence in ancient texts and categorizing it across multiple dimensions. Our experiments identify LLMs as a valuable tool to scale up the accurate analysis of historical texts and show the effect of fine-tuning and data augmentation, yielding an F1-score of up to 0.93 for violence detection and 0.86 for fine-grained violence categorization.
zh
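论文以 F1 分数报告暴力检测效果,二分类 F1 的计算如下草图所示;其中的预测与标注为演示数据,并非论文数据:

```python
def f1_score(preds, golds):
    """二分类 F1:精确率与召回率的调和平均。"""
    tp = sum(p == g == 1 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

f1 = f1_score([1, 1, 0, 1], [1, 0, 0, 1])  # 2 个 TP、1 个 FP、0 个 FN
```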
[NLP-30] RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练和推理过程中对计算资源、时间和内存需求过高的问题。解决方案的关键在于利用一个相对较小的预训练LLM作为基础,在少量资源和极短时间内通过特定任务适配,提升其在目标语言任务上的性能,同时保持其整体能力不下降。具体而言,论文通过RigoChat 2的实际案例,展示了如何优化LLM以实现更优的西班牙语任务效果。
链接: https://arxiv.org/abs/2503.08188
作者: Gonzalo Santamaría Gómez,Guillem García Subies,Pablo Gutiérrez Ruiz,Mario González Valero,Natàlia Fuertes,Helena Montoro Zamorano,Carmen Muñoz Sanz,Leire Rosado Plaza,Nuria Aldama García,David Betancur Sánchez,Kateryna Sushkova,Marta Guerrero Nieto,Álvaro Barbero Jiménez
机构: Instituto de Ingeniería del Conocimiento (知识工程研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need of collecting problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing this kind of models to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model, specifically for a given language task, without compromising its overall capabilities using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.
zh
[NLP-31] FASIONAD : Integrating High-Level Instruction and Information Bottleneck in FAt-Slow fusION Systems for Enhanced Safety in Autonomous Driving with Adaptive Feedback
【速读】: 该论文旨在解决自动驾驶系统在复杂低频事件中的规划挑战,尽管端到端模型在标准驾驶场景中表现良好,但面对复杂情况时存在局限性。同时,大型语言模型(LLMs)和视觉语言模型(VLMs)虽提升了推理能力,但计算效率较低。受“快速思维与慢速思维”双过程认知模型启发,论文提出了一种名为FASIONAD的新框架,结合了快速端到端规划器与基于VLM的推理模块。其关键创新在于:(1) 动态切换机制允许根据实时不确定性评估激活慢速系统;(2) 引入信息瓶颈及高级计划反馈优化慢速系统的引导能力;(3) 双向知识交流促进视觉提示增强慢速推理及反馈改进快速决策。此外,通过问答机制与奖励指导训练策略强化VLM推理能力。实验结果显示,FASIONAD在开放环路测试中将平均L2轨迹误差降低了6.7%,碰撞率降低了28.1%。
链接: https://arxiv.org/abs/2503.08162
作者: Kangan Qian,Ziang Luo,Sicong Jiang,Zilin Huang,Jinyu Miao,Zhikun Ma,Tianze Zhu,Jiayin Li,Yangfan He,Zheng Fu,Yining Shi,Boyue Wang,Hezhe Lin,Ziyu Chen,Jiangbo Yu,Xinyu Jiao,Mengmeng Yang,Kun Jiang,Diange Yang
机构: The School of Vehicle and Mobility, Tsinghua University (清华大学车辆与运载学院), McGill University (麦吉尔大学), University of Wisconsin-Madison (威斯康星大学麦迪逊分校), Waseda University (早稻田大学), University of Minnesota (明尼苏达大学), AI2Robotics (北京)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注: 8 pages, 4 figures
点击查看摘要
Abstract:Ensuring safe, comfortable, and efficient planning is crucial for autonomous driving systems. While end-to-end models trained on large datasets perform well in standard driving scenarios, they struggle with complex low-frequency events. Recent Large Language Models (LLMs) and Vision Language Models (VLMs) advancements offer enhanced reasoning but suffer from computational inefficiency. Inspired by the dual-process cognitive model “Thinking, Fast and Slow”, we propose \textbfFASIONAD – a novel dual-system framework that synergizes a fast end-to-end planner with a VLM-based reasoning module. The fast system leverages end-to-end learning to achieve real-time trajectory generation in common scenarios, while the slow system activates through uncertainty estimation to perform contextual analysis and complex scenario resolution. Our architecture introduces three key innovations: (1) A dynamic switching mechanism enabling slow system intervention based on real-time uncertainty assessment; (2) An information bottleneck with high-level plan feedback that optimizes the slow system’s guidance capability; (3) A bidirectional knowledge exchange where visual prompts enhance the slow system’s reasoning while its feedback refines the fast planner’s decision-making. To strengthen VLM reasoning, we develop a question-answering mechanism coupled with reward-instruct training strategy. In open-loop experiments, FASIONAD achieves a 6.7% reduction in average L2 trajectory error and 28.1% lower collision rate.
zh
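创新 (1) 中“基于实时不确定性评估的动态切换”骨架非常直接:快系统返回轨迹与不确定度,超过阈值才调用慢系统。以下草图中的阈值、场景与两个桩规划器均为演示假设:

```python
def plan(scene, fast_planner, slow_reasoner, tau=0.5):
    traj, uncertainty = fast_planner(scene)
    if uncertainty > tau:                 # 高不确定度 → 激活慢系统
        traj = slow_reasoner(scene, traj)
    return traj

def fast(scene):
    # 桩:常规场景低不确定度,复杂场景高不确定度
    return ("keep_lane",), (0.9 if scene == "complex" else 0.1)

def slow(scene, draft_traj):
    # 桩:慢系统基于上下文分析给出修正轨迹
    return ("slow_down", "replan")

simple_traj = plan("simple", fast, slow)
complex_traj = plan("complex", fast, slow)
```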
[NLP-32] OASIS: Order-Augmented Strategy for Improved Code Search
【速读】: 该论文旨在解决现有代码嵌入(code embeddings)训练方法因代码上下文稀疏性导致的无法充分捕获正负样本之间深层语义细微差异的问题。传统的训练方式主要通过优化InfoNCE损失函数,利用批次内的正自然语言-代码对与负样本进行比较,但这种方法可能忽视了负样本之间的微妙语义差别。为了解决这一局限,论文提出了一种名为“顺序增强策略用于改进代码搜索”(Order-Augmented Strategy for Improved Code Search, OASIS)的解决方案。其关键是引入基于顺序的相似性标签(order-based similarity labels),指导模型学习负样本间更精细的相似性变化,从而提升代码嵌入的质量。实验结果表明,OASIS在多个基准测试中显著超越了仅关注主要正负差异的传统方法,强调了利用带有顺序标签的负样本细微差异对于有效训练代码嵌入的重要性。
链接: https://arxiv.org/abs/2503.08161
作者: Zuchen Gao,Zizheng Zhan,Xianming Li,Erxin Yu,Haotian Zhang,Yuqun Zhang,Jing Li
机构: The Hong Kong Polytechnic University (香港理工大学); Southern University of Science and Technology (南方科技大学); Kwai Inc. (快手)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
zh
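OASIS 的核心是让模型在负例之间也保持给定的相似度次序。这一训练目标在推理期对应一个简单的验收条件:模型打分的排序应与顺序标签一致。以下草图中的分数与标签均为演示假设:

```python
def order_consistent(scores, order_labels):
    """检查打分排序是否与顺序化相似度标签(数值越大越相似)一致。"""
    by_score = sorted(range(len(scores)), key=lambda i: -scores[i])
    by_label = sorted(range(len(scores)), key=lambda i: -order_labels[i])
    return by_score == by_label

# 三个负例:标签 3 > 2 > 1 表示它们与查询的相似度次序
ok = order_consistent([0.9, 0.4, 0.1], [3, 2, 1])
bad = order_consistent([0.1, 0.9, 0.4], [3, 2, 1])
```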
[NLP-33] Mimicking How Humans Interpret Out-of-Context Sentences Through Controlled Toxicity Decoding NAACL2025
【速读】: 该论文旨在解决语境缺失时单句解读多样、其潜在毒性含义易被误解的问题。论文提出一种解码策略,显式控制生成解释集合中的毒性水平:(i) 使解释的毒性与输入对齐;(ii) 对毒性更强的输入句子放松约束;(iii) 促进解释集合内毒性水平的多样性。该解码策略是方案的关键,实验表明它能使生成的解释在句法和语义上与人工撰写的解释更一致,同时降低模型预测的不确定性。
链接: https://arxiv.org/abs/2503.08159
作者: Maria Mihaela Trusca,Liesbeth Allein
机构: Department of Computer Science, KU Leuven (鲁汶大学计算机科学系)
类目: Computation and Language (cs.CL)
备注: Short paper; accepted at TrustNLP @ NAACL 2025
点击查看摘要
Abstract:Interpretations of a single sentence can vary, particularly when its context is lost. This paper aims to simulate how readers perceive content with varying toxicity levels by generating diverse interpretations of out-of-context sentences. By modeling toxicity, we can anticipate misunderstandings and reveal hidden toxic meanings. Our proposed decoding strategy explicitly controls toxicity in the set of generated interpretations by (i) aligning interpretation toxicity with the input, (ii) relaxing toxicity constraints for more toxic input sentences, and (iii) promoting diversity in toxicity levels within the set of generated interpretations. Experimental results show that our method improves alignment with human-written interpretations in both syntax and semantics while reducing model prediction uncertainty.
zh
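下面以一个极简的过滤式草图示意"输入-输出毒性对齐、并对更毒性输入放宽约束"的思路(假设性示例;窗口的具体形式为本文虚构,并非论文中的解码策略实现):

```python
def filter_interpretations(candidates, input_toxicity, base_window=0.1, relax=0.2):
    """保留毒性水平落在输入毒性附近窗口内的解释;
    输入越毒, 窗口越宽(即约束被放宽)。candidates: [(文本, 毒性分数), ...]"""
    window = base_window + relax * input_toxicity
    return [text for text, tox in candidates if abs(tox - input_toxicity) <= window]
```
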
[NLP-34] AI-native Memory 2.0: Second Me
【速读】: 该论文旨在解决人类与外部世界交互过程中因重复提供相同信息而导致的冗余问题。现有解决方案如浏览器存储的凭据、自动填充机制及统一认证系统虽有所缓解,但仍有局限性。论文提出的关键解决方案是SECOND ME,这是一种基于大型语言模型(Large Language Models, LLMs)的智能持久记忆卸载系统。它通过参数化记忆管理超越传统静态数据存储,实现结构化组织、上下文推理和自适应知识检索,从而以更系统化和智能化的方式重新定义记忆管理范式。这不仅减轻了用户的认知负担,还减少了人机交互摩擦,标志着向增强人类与世界互动的持续、上下文感知且自我优化的记忆系统迈进了一步。
链接: https://arxiv.org/abs/2503.08102
作者: Jiale Wei,Xiang Ying,Tao Gao,Felix Tao,Jingbo Shang
机构: Mindverse.ai (Mindverse.ai)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Human interaction with the external world fundamentally involves the exchange of personal memory, whether with other individuals, websites, applications, or, in the future, AI agents. A significant portion of this interaction is redundant, requiring users to repeatedly provide the same information across different contexts. Existing solutions, such as browser-stored credentials, autofill mechanisms, and unified authentication systems, have aimed to mitigate this redundancy by serving as intermediaries that store and retrieve commonly used user data. The advent of large language models (LLMs) presents an opportunity to redefine memory management through an AI-native paradigm: SECOND ME. SECOND ME acts as an intelligent, persistent memory offload system that retains, organizes, and dynamically utilizes user-specific knowledge. By serving as an intermediary in user interactions, it can autonomously generate context-aware responses, prefill required information, and facilitate seamless communication with external systems, significantly reducing cognitive load and interaction friction. Unlike traditional memory storage solutions, SECOND ME extends beyond static data retention by leveraging LLM-based memory parameterization. This enables structured organization, contextual reasoning, and adaptive knowledge retrieval, facilitating a more systematic and intelligent approach to memory management. As AI-driven personal agents like SECOND ME become increasingly integrated into digital ecosystems, SECOND ME further represents a critical step toward augmenting human-world interaction with persistent, contextually aware, and self-optimizing memory systems. We have open-sourced the fully localizable deployment system at GitHub: this https URL.
zh
[NLP-35] Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention
【速读】: 该论文致力于解决文本情感分析中的性能提升问题,特别是在标准 LSTM 模型基础上进一步优化情感分类的准确性。解决方案的关键在于结合 Term Frequency-Inverse Document Frequency (TF-IDF) 特征提取与多头注意力机制 (multi-head attention),通过 TF-IDF 优化输入特征,并利用多头注意力机制增强模型对文本中重要信息的关注能力,从而显著提升了准确率(Accuracy)、召回率(Recall)及 F1 值等关键指标。实验结果表明,该方法在测试集上的准确率达到 80.28%,相较标准 LSTM 提升约 12%,且消融实验验证了各模块的必要性,其中多头注意力机制对性能改进贡献最大。
链接: https://arxiv.org/abs/2503.08079
作者: Jingyuan Yi,Peiyang Yu,Tianyi Huang,Xiaochuan Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This work proposes an LSTM-based sentiment classification model with a multi-head attention mechanism and TF-IDF optimization. Through the integration of TF-IDF feature extraction and multi-head attention, the model significantly improves text sentiment analysis performance. Experimental results on public data sets demonstrate that the new method achieves substantial improvements in the most critical metrics like accuracy, recall, and F1-score compared to baseline models. Specifically, the model achieves an accuracy of 80.28% on the test set, which is improved by about 12% in comparison with standard LSTM models. Ablation experiments also support the necessity of all modules, with multi-head attention contributing most to the performance improvement. This research provides a practical approach to sentiment analysis, which can be utilized in public opinion monitoring, product recommendation, etc.
zh
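其中 TF-IDF 特征提取本身可以用几行标准库代码示意(通用教科书式实现,与论文的具体配置无关):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: 分词后的文档列表; 返回每篇文档的 {词: tf-idf 权重}"""
    n = len(docs)
    df = Counter()                       # 文档频率
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()})
    return vecs
```

出现在所有文档中的词(如停用词)权重为 0,低频但有区分度的情感词被放大,随后作为序列模型的输入特征。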
[NLP-36] MuCoS: Efficient Drug Target Discovery via Multi Context Aware Sampling in Knowledge Graphs
【速读】: 该论文旨在解决药物靶点相互作用预测中的两个关键挑战:一是传统知识图谱(Knowledge Graph, KG)嵌入方法如TransE和ComplEx SE因依赖计算密集型的负采样而效率低下;二是其在处理未见的药物-靶点对时泛化能力有限。为应对这些挑战,论文提出了一种名为Multi Context Aware Sampling (MuCoS) 的新框架,其关键是通过优先选择高密度邻居捕获显著的结构模式,并结合从BERT中提取的上下文嵌入,实现结构模态与文本模态的统一以及高度信息量模式的选择性采样。这种方法避免了负采样的需求,大幅降低了计算开销,同时提升了对新型药物-靶点关联预测的准确性。实验结果表明,MuCoS在KEGG50k数据集上的表现优于现有最先进的基线模型,预测任意关系的平均倒数排名(Mean Reciprocal Rank, MRR)提升了高达13%,专门针对药物-靶点关系的预测也提升了6%。
链接: https://arxiv.org/abs/2503.08075
作者: Haji Gul,Abdul Ghani Naim,Ajaz Ahmad Bhat
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Accurate prediction of drug target interactions is critical for accelerating drug discovery and elucidating complex biological mechanisms. In this work, we frame drug target prediction as a link prediction task on heterogeneous biomedical knowledge graphs (KG) that integrate drugs, proteins, diseases, pathways, and other relevant entities. Conventional KG embedding methods such as TransE and ComplEx SE are hindered by their reliance on computationally intensive negative sampling and their limited generalization to unseen drug target pairs. To address these challenges, we propose Multi Context Aware Sampling (MuCoS), a novel framework that prioritizes high-density neighbours to capture salient structural patterns and integrates these with contextual embeddings derived from BERT. By unifying structural and textual modalities and selectively sampling highly informative patterns, MuCoS circumvents the need for negative sampling, significantly reducing computational overhead while enhancing predictive accuracy for novel drug target associations and drug targets. Extensive experiments on the KEGG50k dataset demonstrate that MuCoS outperforms state-of-the-art baselines, achieving up to a 13% improvement in mean reciprocal rank (MRR) in predicting any relation in the dataset and a 6% improvement in dedicated drug target relation prediction.
zh
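"优先选择高密度邻居"的核心可以用按度数排序来示意(假设性草图,实际采样与上下文融合细节以论文为准):

```python
from collections import defaultdict

def high_density_neighbours(edges, node, k=2):
    """在无向知识图谱上, 按邻居自身的度数降序取前 k 个高密度邻居"""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sorted(adj[node], key=lambda x: len(adj[x]), reverse=True)[:k]
```

这样得到的邻居承载了更多结构模式,可与 BERT 上下文嵌入拼接,从而避免负采样。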
[NLP-37] Context-aware Biases for Length Extrapolation
【速读】: 该论文试图解决Transformer在长度外推(length extrapolation)能力随序列长度增加而逐渐退化的问题。传统相对位置编码(Relative Positional Encoding, RPE)方法主要通过添加常量线性偏置或学习通用偏置来缓解这一问题,但缺乏针对不同序列的特异性适应能力。论文的关键解决方案是提出了一种名为Context-aware Biases for Length Extrapolation (Cable) 的方法,它为基于解码器的Transformer中的每个头学习与上下文相关的特定标记偏置。Cable通过引入动态偏置克服了固定模式的限制,这些动态偏置针对序列中的每个标记进行调整。实验结果表明,使用Cable的GPT-3 Medium模型在序列长度为1024的情况下,其困惑度比在序列长度为1024下训练且采用正弦位置编码的相似网络低0.65,同时内存使用减少了48%,训练时间仅增加了3.5%。此外,该方法显著提升了现有RPE方法在外推能力上的表现。
链接: https://arxiv.org/abs/2503.08067
作者: Ali Veisi,Amir Mansourian
机构: Axiom Lab
类目: Computation and Language (cs.CL)
备注: 11 pages, 8 figures, 1 table
点击查看摘要
Abstract:Transformers’ ability to generalize to longer sequences than they have been trained on, known as length extrapolation, degrades as sequence length increases. Most of Relative Positional Encoding (RPE) methods address this problem by either adding constant linear biases or learning general biases, lacking the ability to specialize for different sequences. In this work, inspired by ALiBi, we propose Context-aware Biases for Length Extrapolation (Cable), that learns token-specific biases for each head in decoder-based transformers. Cable learns adaptive, context-aware biases, overcoming the limitations of fixed patterns by adding dynamic biases specific to each token in the sequence. Results show that when tested on a sequence length of 1024, a GPT-3 Medium (334M parameters) with our positional encoding, trained on a sequence length of 512, achieves better perplexity (-0.65) than a similar network with sinusoidal positional encoding trained on a sequence length of 1024. This is achieved with 48% lower memory usage, and only 3.5% higher training time. Furthermore, our method notably improves the extrapolation ability of existing RPE methods on the Edu-FineWeb10B and WikiText-103 datasets. Code is available at: this https URL
zh
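作为对照,Cable 所要推广的 ALiBi 固定线性偏置可以这样示意(假设性草图:Cable 本身将下面的常量斜率替换为随上下文学习得到的逐 token 偏置):

```python
def alibi_bias(seq_len, slope):
    """因果注意力的 ALiBi 偏置矩阵: 对位置距离 (i - j) 施加固定线性惩罚,
    未来位置置为 -inf; Cable 将这一固定惩罚换成可学习的逐 token 偏置"""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]
```
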
[NLP-38] Odysseus Navigates the Sirens Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation
【速读】: 该论文旨在解决大型语言模型(LLMs)在开放性应用场景中生成既事实准确又多样化文本的难题。当前的随机解码方法难以同时满足这两个目标。为了解决这一权衡问题,论文提出了一种名为动态聚焦解码(Dynamic Focus Decoding, DFD)的新颖插件式随机方法。DFD 的关键在于其能够根据各层分布差异自适应调整解码焦点,利用 LLMs 中事实知识的模块化和分层特性。通过在知识密集型解码步骤中提升事实准确性,在较少依赖知识的步骤中促进多样性,DFD 在不增加额外数据、知识或模型的情况下实现了性能提升。此外,DFD 易于与现有解码方法集成,并以较低的计算开销显著改善事实性和多样性。实验结果表明,DFD 在七个数据集上的性能得到了显著提升,提供了一种可扩展且高效的开放式文本生成解决方案。
链接: https://arxiv.org/abs/2503.08057
作者: Wen Luo,Feifan Song,Wei Li,Guangyue Peng,Shaohang Wei,Houfeng Wang
机构: State Key Laboratory of Multimedia Information Processing (多媒体信息处理国家重点实验室), School of Computer Science, Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly required to generate text that is both factually accurate and diverse across various open-ended applications. However, current stochastic decoding methods struggle to balance such objectives. We introduce Dynamic Focus Decoding (DFD), a novel plug-and-play stochastic approach that resolves this trade-off without requiring additional data, knowledge, or models. DFD adaptively adjusts the decoding focus based on distributional differences across layers, leveraging the modular and hierarchical nature of factual knowledge within LLMs. This dynamic adjustment improves factuality in knowledge-intensive decoding steps and promotes diversity in less knowledge-reliant steps. DFD can be easily integrated with existing decoding methods, enhancing both factuality and diversity with minimal computational overhead. Extensive experiments across seven datasets demonstrate that DFD significantly improves performance, providing a scalable and efficient solution for open-ended text generation.
zh
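"依据层间分布差异调整解码焦点"的思路可粗略示意为:用中间层与末层 next-token 分布的 KL 散度判断该步是否知识密集,散度大时降低温度使解码更聚焦(假设性草图,具体的聚焦调整机制以论文为准):

```python
import math

def layer_divergence(p_mid, p_final):
    """末层分布相对中间层分布的 KL 散度; 值越大说明该步越依赖事实知识"""
    return sum(pf * math.log(pf / pm)
               for pf, pm in zip(p_final, p_mid) if pf > 0 and pm > 0)

def focus_temperature(div, t_max=1.2, t_min=0.3, scale=1.0):
    """散度大 -> 温度低(偏事实性); 散度小 -> 温度高(偏多样性)"""
    return max(t_min, t_max - scale * div)
```
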
[NLP-39] Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection PAKDD2025
【速读】: 该论文旨在解决日志异常检测(Log Anomaly Detection, LAD)中利用大型语言模型(Large Language Models, LLMs)进行异常模式识别的研究空白。尽管LLMs在多个领域取得了显著成功,但其在LAD中的应用尚未得到充分探索。论文的关键在于采用参数高效微调技术(Parameter-Efficient Fine-Tuning, PEFT),以降低完全微调LLMs的高昂成本,同时深入研究两种流行的PEFT方法——低秩适应(Low-Rank Adaptation, LoRA)和表征微调(Representation Fine-tuning, ReFT),将其应用于不同规模的三种代表性LLMs(RoBERTa、GPT-2和Llama-3),从而实现高效的LAD。通过在四个公开日志数据集上的综合实验,论文揭示了这些基于PEFT的LLM驱动的LAD方法在有效性、稳定性、样本效率、对不稳定日志的鲁棒性以及跨数据集泛化能力方面的关键洞见。
链接: https://arxiv.org/abs/2503.08045
作者: Ying Fu Lim,Jiawen Zhu,Guansong Pang
机构: Singapore Management University (新加坡管理大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 5 figures, accepted by PAKDD 2025 special session
点击查看摘要
Abstract:Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs – Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) – to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at this https URL.
zh
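其中 LoRA 的核心思想(冻结原权重、仅训练低秩增量)可以用纯 Python 示意(通用 LoRA 前向计算的教科书式草图,与论文具体超参无关):

```python
def matvec(M, x):
    """矩阵-向量乘"""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = W x + alpha * B (A x): W 冻结, 仅训练低秩矩阵 A (r×d_in) 与 B (d_out×r)"""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + alpha * d for b, d in zip(base, delta)]
```

由于 r 远小于 d_in 和 d_out,可训练参数量从 d_out×d_in 降到 r×(d_in+d_out),这正是 PEFT 低成本适配 LAD 的来源。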
[NLP-40] A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM -Generated Synthetic Data
【速读】: 该论文旨在解决Lexical Semantic Change (LSC) 方法有效性评估缺乏历史基准数据集的问题。解决方案的关键在于提出了一种新颖的三阶段评估框架:首先,通过利用In-Context Learning和词典数据库构建可扩展的合成数据集以模拟理论驱动的LSC变化;其次,使用这些合成数据集评估不同方法的效果;最后,评估这些方法在特定维度和领域中的适用性。此框架通过心理学领域的示例,在情感(Sentiment)、强度(Intensity)和广度(Breadth)三个关键维度上模拟变化,并验证所选方法对人工诱导变化的敏感性。研究结果支持合成数据方法的实用性,验证了针对特定维度的定制化方法的有效性,并揭示了最先进的LSC模型在检测情感维度上的挑战。这一框架为LSC方法的维度和领域特定基准测试与评估提供了有价值的工具,尤其对社会科学具有重要意义。
链接: https://arxiv.org/abs/2503.08042
作者: Naomi Baes,Raphaël Merx,Nick Haslam,Ekaterina Vylomova,Haim Dubossarsky
机构: Melbourne School of Psychological Sciences, The University of Melbourne (墨尔本心理科学学院,墨尔本大学); School of Computing and Information Systems, The University of Melbourne (计算与信息系统学院,墨尔本大学); School of Electronic Engineering and Computer Science, Queen Mary University of London (电子工程与计算机科学学院,伦敦玛丽女王大学); The Alan Turing Institute (阿兰·图灵研究所); Language Technology Lab, University of Cambridge (语言技术实验室,剑桥大学)
类目: Computation and Language (cs.CL)
备注: 36 pages, under review
点击查看摘要
Abstract:Lexical Semantic Change (LSC) offers insights into cultural and social dynamics. Yet, the validity of methods for measuring kinds of LSC has yet to be established due to the absence of historical benchmark datasets. To address this gap, we develop a novel three-stage evaluation framework that involves: 1) creating a scalable, domain-general methodology for generating synthetic datasets that simulate theory-driven LSC across time, leveraging In-Context Learning and a lexical database; 2) using these datasets to evaluate the effectiveness of various methods; and 3) assessing their suitability for specific dimensions and domains. We apply this framework to simulate changes across key dimensions of LSC (SIB: Sentiment, Intensity, and Breadth) using examples from psychology, and evaluate the sensitivity of selected methods to detect these artificially induced changes. Our findings support the utility of the synthetic data approach, validate the efficacy of tailored methods for detecting synthetic changes in SIB, and reveal that a state-of-the-art LSC model faces challenges in detecting affective dimensions of LSC. This framework provides a valuable tool for dimension- and domain-specific benchmarking and evaluation of LSC methods, with particular benefits for the social sciences.
zh
[NLP-41] Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations
【速读】: 该论文试图解决大型语言模型(LLMs)因统一训练范式而难以满足不同用户群体特定需求的问题,并填补关于各用户群体期望个性化方面的研究空白。解决方案的关键在于提出了一种名为“群体偏好对齐”(Group Preference Alignment, GPA)的框架,该框架通过两个步骤实现:首先从真实对话日志中提取用户群体间最大差异化的会话偏好并将其提炼为可解释的准则(Group-Aware Preference Extraction),然后利用这些准则生成定制化响应(Tailored Response Generation)。后者通过两种方法实现:一是上下文调整推理(GPA-CT),二是基于准则的微调推理(GPA-FT),后者进一步生成对比合成数据以优化特定群体模型的个性化。实验表明,该框架显著提高了输出与用户偏好的一致性,优于基线方法,同时在标准基准测试中保持了鲁棒性能。
链接: https://arxiv.org/abs/2503.08035
作者: Ishani Mondal,Jack W. Stokes,Sujay Kumar Jauhar,Longqi Yang,Mengting Wan,Xiaofeng Xu,Xia Song,Jennifer Neville
机构: University of Maryland, College Park (马里兰大学帕克分校); Microsoft Research, Redmond (微软研究,雷德蒙德)
类目: Computation and Language (cs.CL)
备注: 23 pages
点击查看摘要
Abstract:LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all training paradigm (Lucy et al., 2024), and there is limited research on what personalization aspects each group expect. To address these limitations, we propose a group-aware personalization framework, Group Preference Alignment (GPA), that identifies context-specific variations in conversational preferences across user groups and then steers LLMs to address those preferences. Our approach consists of two steps: (1) Group-Aware Preference Extraction, where maximally divergent user-group preferences are extracted from real-world conversation logs and distilled into interpretable rubrics, and (2) Tailored Response Generation, which leverages these rubrics through two methods: a) Context-Tuned Inference (GPA-CT), which dynamically adjusts responses via context-dependent prompt instructions, and b) Rubric-Finetuning Inference (GPA-FT), which uses the rubrics to generate contrastive synthetic data for personalization of group-specific models via alignment. Experiments demonstrate that our framework significantly improves alignment of the output with respect to user preferences and outperforms baseline methods, while maintaining robust performance on standard benchmarks.
zh
[NLP-42] Learning to Search Effective Example Sequences for In-Context Learning NAACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在利用少量示例进行学习时,其性能受上下文示例序列影响较大的问题。具体而言,序列的长度、组成、排列方式及其与特定查询的关系是影响性能的关键因素,但现有方法通常孤立地处理这些因素,忽略了它们之间的相互依赖性,且在寻找最优序列时面临巨大的搜索空间挑战。论文提出的解决方案之关键是引入Beam Search-based Example Sequence Constructor (BESC),通过在推理过程中联合考虑所有关键因素,并逐步构建序列,从而显著降低搜索空间的复杂度。这一设计使得BESC能够有效提升在不同数据集和语言模型上的性能表现。
链接: https://arxiv.org/abs/2503.08030
作者: Xiang Gao,Ankita Sinha,Kamalika Das
机构: Intuit AI Research (Intuit AI 研究院)
类目: Computation and Language (cs.CL)
备注: Accepted to appear at NAACL 2025
点击查看摘要
Abstract:Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but their performance varies widely based on the sequence of in-context examples. Key factors influencing this include the sequence’s length, composition, and arrangement, as well as its relation to the specific query. Existing methods often tackle these factors in isolation, overlooking their interdependencies. Moreover, the extensive search space for selecting optimal sequences complicates the development of a holistic approach. In this work, we introduce Beam Search-based Example Sequence Constructor (BESC), a novel method for learning to construct optimal example sequences. BESC addresses all key factors involved in sequence selection by considering them jointly during inference, while incrementally building the sequence. This design enables the use of beam search to significantly reduce the complexity of the search space. Experiments across various datasets and language models show notable improvements in performance.
zh
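束搜索逐步构建示例序列的骨架可示意如下(假设性草图:score_fn 代表论文中对部分序列的打分模型,此处仅为抽象占位):

```python
def beam_search_sequence(examples, score_fn, seq_len, beam_width=2):
    """逐位置扩展示例序列, 每步只保留得分最高的 beam_width 条部分序列"""
    beams = [([], 0.0)]
    for _ in range(seq_len):
        candidates = []
        for seq, _ in beams:
            for ex in examples:
                if ex in seq:          # 序列内示例不重复
                    continue
                new_seq = seq + [ex]
                candidates.append((new_seq, score_fn(new_seq)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]
```

相比枚举全部排列,每步只展开 beam_width 条前缀,搜索空间从阶乘级降为线性级。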
[NLP-43] In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
【速读】: 该论文旨在解决大型语言模型(LLMs)在开放域对话中的局限性,即无法有效保留和检索长期交互中的相关信息,这限制了其在需要持续个性化应用场景中的效果。为应对这一挑战,论文提出了反射记忆管理(Reflective Memory Management, RMM),这是一种创新的记忆管理机制,用于长期对话代理。RMM的关键在于结合前瞻性和回顾性的反思:(1) 前瞻性反思通过动态总结不同粒度(话语、轮次和会话)的交互,构建个性化的记忆库,以实现未来检索的有效性;(2) 回顾性反思则基于LLMs引用的证据,以在线强化学习的方式迭代优化检索结果。实验表明,RMM在多种指标和基准数据集上表现出一致的性能提升,在LongMemEval数据集上比无记忆管理的基线模型提高了超过10%的准确性。
链接: https://arxiv.org/abs/2503.08026
作者: Zhen Tan,Jun Yan,I-Hung Hsu,Rujun Han,Zifeng Wang,Long T. Le,Yiwen Song,Yanfei Chen,Hamid Palangi,George Lee,Anand Iyer,Tianlong Chen,Huan Liu,Chen-Yu Lee,Tomas Pfister
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities (utterances, turns, and sessions) into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
zh
[NLP-44] SQLCritic: Correcting Text-to-SQL Generation via Clause-wise Critic
【速读】: 该论文致力于解决Text-to-SQL系统中自然语言查询转换为SQL语句时存在的准确性与可靠性不足的问题,特别是现有方法在处理语法错误(syntax issues)的同时未能有效解决语义错误(semantic errors),即查询逻辑未能准确反映用户意图的情况。论文的关键创新在于提出了一种结合结构化执行反馈与经过训练的批评者代理(critic agent)的新方法,该代理能够提供详细且可解释的批评建议。这种方法不仅能够识别并修正语法层面的错误,还能有效解决语义层面的问题,从而显著提升系统的准确性和可解释性。实验结果表明,该方法在Spider和BIRD两个主流Text-to-SQL基准数据集上取得了显著性能提升。
链接: https://arxiv.org/abs/2503.07996
作者: Jikai Chen,Leilei Gan
机构: Zhejiang University (浙江大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recent advancements in Text-to-SQL systems have improved the conversion of natural language queries into SQL, but challenges remain in ensuring accuracy and reliability. While self-correction techniques refine outputs, they often introduce new errors. Existing methods focused on execution feedback mainly address syntax issues, leaving semantic errors – where the query’s logic fails to align with the user’s intent – largely unaddressed. We propose a novel approach combining structured execution feedback with a trained critic agent that provides detailed, interpretable critiques. This method effectively identifies and corrects both syntactic and semantic errors, enhancing accuracy and interpretability. Experimental results show significant improvements on two major Text-to-SQL benchmarks, Spider and BIRD, demonstrating the effectiveness of our approach.
zh
[NLP-45] Enhancing Multilingual Language Models for Code-Switched Input Data
【速读】: 该论文试图解决多语言语言模型在处理代码切换(code-switching)数据时性能下降的问题。代码切换指在同一对话中交替使用多种语言的现象,在自然语言处理(NLP)任务中对模型提出了挑战。论文的关键解决方案是通过在包含代码切换的数据集上预训练Multilingual BERT (mBERT),评估其在词性标注、情感分析、命名实体识别和语言识别等关键NLP任务上的表现,并与基线模型对比。研究发现,预训练后的mBERT模型在这些任务中表现出色,尤其是在词性标注任务中提升显著。此外,潜在分析显示,该模型在语言识别任务中生成了更均匀的英语和西班牙语嵌入,为未来模型设计提供了重要见解。这一研究强调了调整多语言LM以适应代码切换输入数据的潜力,为全球化和多语言环境下的高级应用奠定了基础。
链接: https://arxiv.org/abs/2503.07990
作者: Katherine Xie,Nitya Babbar,Vicky Chen,Yoanna Turura
机构: MIT (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Code-switching, or alternating between languages within a single conversation, presents challenges for multilingual language models on NLP tasks. This research investigates if pre-training Multilingual BERT (mBERT) on code-switched datasets improves the model’s performance on critical NLP tasks such as part of speech tagging, sentiment analysis, named entity recognition, and language identification. We use a dataset of Spanglish tweets for pre-training and evaluate the pre-trained model against a baseline model. Our findings show that our pre-trained mBERT model outperforms or matches the baseline model in the given tasks, with the most significant improvements seen for parts of speech tagging. Additionally, our latent analysis uncovers more homogenous English and Spanish embeddings for language identification tasks, providing insights for future modeling work. This research highlights potential for adapting multilingual LMs for code-switched input data in order for advanced utility in globalized and multilingual contexts. Future work includes extending experiments to other language pairs, incorporating multiform data, and exploring methods for better understanding context-dependent code-switches.
zh
[NLP-46] LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking
【速读】: 该论文致力于解决多标签文本分类中的长尾挑战问题,即如何更准确地对低频标签进行分类。当前方法主要关注提升文本语义表示,而忽视了标签关系的重要性。论文的关键解决方案是提出了一种名为LabelCoRank的新方法,它通过利用标签共现关系,在双阶段重排序过程中优化初始标签分类结果。第一阶段基于初步分类形成初始排名,第二阶段则借助标签共现矩阵进一步重排序,从而提高最终分类的准确性和相关性。此外,通过将重排序后的标签表示作为附加文本特征集成到模型中,LabelCoRank有效缓解了多标签文本分类中的长尾问题。实验评估表明,该方法在MAG-CS、PubMed和AAPD等流行数据集上表现出色。
链接: https://arxiv.org/abs/2503.07968
作者: Yan Yan,Junyuan Liu,Bo-Wen Zhang
机构: China University of Mining & Technology (中国矿业大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Motivation: Despite recent advancements in semantic representation driven by pre-trained and large-scale language models, addressing long tail challenges in multi-label text classification remains a significant issue. Long tail challenges have persistently posed difficulties in accurately classifying less frequent labels. Current approaches often focus on improving text semantics while neglecting the crucial role of label relationships. Results: This paper introduces LabelCoRank, a novel approach inspired by ranking principles. LabelCoRank leverages label co-occurrence relationships to refine initial label classifications through a dual-stage reranking process. The first stage uses initial classification results to form a preliminary ranking. In the second stage, a label co-occurrence matrix is utilized to rerank the preliminary results, enhancing the accuracy and relevance of the final classifications. By integrating the reranked label representations as additional text features, LabelCoRank effectively mitigates long tail issues in multi-label text classification. Experimental evaluations on popular datasets including MAG-CS, PubMed, and AAPD demonstrate the effectiveness and robustness of LabelCoRank.
zh
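第二阶段"用共现矩阵重排"的最小示意如下(假设性草图:此处只用初排最高分标签的共现行做线性插值,论文中的重排细节更复杂):

```python
def cooccurrence_rerank(initial_scores, cooc, alpha=0.5):
    """以初排最高分标签为锚, 将各标签分数与其同锚标签的共现强度线性插值"""
    anchor = max(initial_scores, key=initial_scores.get)
    return {label: (1 - alpha) * s + alpha * cooc.get((anchor, label), 0.0)
            for label, s in initial_scores.items()}
```

下例中初排里 cs.CV 略高于长尾标签 cs.IR,重排后 cs.IR 因与锚标签高度共现而反超,体现了对长尾标签的纠偏。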
[NLP-47] EFPC: Towards Efficient and Flexible Prompt Compression
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)如GPT-4在自然语言处理(NLP)任务中因令牌数量庞大导致的高计算成本和高昂费用问题。为应对这一挑战,论文提出了一种名为高效且灵活提示压缩(Efficient and Flexible Prompt Compression, EFPC)的新方法,该方法通过统一任务感知和任务无关的压缩策略,在准确率与效率之间实现良好权衡。EFPC的关键在于利用GPT-4生成压缩后的提示,并将其与原始提示结合用于训练;同时在训练和推理过程中依据预测概率选择性地添加用户指令及压缩提示。这种设计不仅显著提升了数据使用效率,还在多个基准测试中表现出色,例如在LongBench单文档问答任务上,相较于最先进的LLMLingua-2方法,EFPC分别实现了相对F1分数提升4.8%(使用额外1%数据且压缩率为4倍)以及11.4%(使用额外10%数据)。此外,EFPC的统一框架增强了其跨模型、任务及领域的适用性,为NLP领域提供了实用性的技术进步。
链接: https://arxiv.org/abs/2503.07956
作者: Yun-Hao Cao,Yangsong Wang,Shuzheng Hao,Zhenxing Li,Chengjun Zhan,Sichao Liu,Yi-Qi Hu
机构: Huawei Technologies (华为); Huaqiao University (华侨大学); HiSilicon (海思); Huawei (华为)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures
点击查看摘要
Abstract:The emergence of large language models (LLMs) like GPT-4 has revolutionized natural language processing (NLP), enabling diverse, complex tasks. However, extensive token counts lead to high computational and financial burdens. To address this, we propose Efficient and Flexible Prompt Compression (EFPC), a novel method unifying task-aware and task-agnostic compression for a favorable accuracy-efficiency trade-off. EFPC uses GPT-4 to generate compressed prompts and integrates them with original prompts for training. During training and inference, we selectively prepend user instructions and compress prompts based on predicted probabilities. EFPC is highly data-efficient, achieving significant performance with minimal data. Compared to the state-of-the-art method LLMLingua-2, EFPC achieves a 4.8% relative improvement in F1-score with 1% additional data at a 4x compression rate, and an 11.4% gain with 10% additional data on the LongBench single-doc QA benchmark. EFPC’s unified framework supports broad applicability and enhances performance across various models, tasks, and domains, offering a practical advancement in NLP.
zh
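按"预测保留概率"做 token 级压缩的通用形式可以这样示意(LLMLingua 系列方法的常见做法;草图为本文虚构,并非 EFPC 官方实现):

```python
def compress_prompt(tokens, keep_probs, rate=0.5):
    """按保留概率取前 rate 比例的 token 并维持原顺序 (4x 压缩对应 rate=0.25)"""
    k = max(1, int(len(tokens) * rate))
    keep = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]
```
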
[NLP-48] Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach
【速读】: 该论文旨在解决传统情感分析仅依赖文本信息的局限性,通过整合图像、文本和音频等多种模态的信息,提升情感分析的全面性和准确性。论文的关键在于提出了一种新颖的多模态情感分析架构,该架构结合了文本和图像数据以实现更全面的情感理解。具体而言,文本特征提取使用BERT(Bidirectional Encoder Representations from Transformers),而图像特征提取则采用基于视觉Transformer的DINOv2模型。此外,为了有效融合文本和视觉隐含特征,论文提出了三种融合技术:基本融合模型(Basic Fusion Model)、自注意力融合模型(Self Attention Fusion Model)以及双注意力融合模型(Dual Attention Fusion Model)。实验结果表明,所提出的多模态架构在Memotion 7k数据集、MVSA单模态数据集和MVSA多模态数据集上的表现验证了其可行性和实用性。
链接: https://arxiv.org/abs/2503.07943
作者: Taoxu Zhao,Meisi Li,Kehao Chen,Liye Wang,Xucheng Zhou,Kunal Chaturvedi,Mukesh Prasad,Ali Anaissi,Ali Braytee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 12 pages
点击查看摘要
Abstract:Multimodal sentiment analysis enhances conventional sentiment analysis, which traditionally relies solely on text, by incorporating information from different modalities such as images, text, and audio. This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments. For text feature extraction, we utilize BERT, a natural language processing model. For image feature extraction, we employ DINOv2, a vision-transformer-based model. The textual and visual latent features are integrated using proposed fusion techniques, namely the Basic Fusion Model, Self Attention Fusion Model, and Dual Attention Fusion Model. Experiments on three datasets, Memotion 7k dataset, MVSA single dataset, and MVSA multi dataset, demonstrate the viability and practicality of the proposed multimodal architecture.
zh
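文中"基本融合"与"注意力式融合"两种思路可用纯 Python 粗略示意(假设性草图:真实模型中这些操作作用于 BERT / DINOv2 的高维特征并含可学习参数):

```python
import math

def basic_fusion(text_feat, image_feat):
    """基本融合: 直接拼接两种模态的特征向量"""
    return text_feat + image_feat

def attention_fusion(feats):
    """注意力式融合: 按各模态打分的 softmax 权重对同维特征加权求和"""
    scores = [sum(f) for f in feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(feats[0])
    return [sum(weights[i] * feats[i][d] for i in range(len(feats)))
            for d in range(dim)]
```
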
[NLP-49] Crowdsource Crawl or Generate? Creating SEA-VL a Multicultural Vision-Language Dataset for Southeast Asia
【速读】: 该论文试图解决东南亚(Southeast Asia, SEA)在视觉-语言(Vision-Language, VL)研究中因文化与语言代表性不足而导致的人工智能(Artificial Intelligence, AI)模型无法有效捕捉区域文化细微差别的问题。解决方案的关键在于通过SEA-VL这一开源项目开发高质量且与文化相关联的数据集,以提高SEA地区语言和文化的包容性与多样性。具体而言,SEA-VL不仅依赖众包方式,还进一步探索了通过网络爬取(image crawling)和图像生成(image generation)自动收集具有文化关联的图像的方法。研究表明,网络爬取能够达到约85%的文化相关性,并且比众包更具成本效益和时间效率;然而,尽管生成式视觉模型已取得显著进展,合成图像仍难以可靠地准确反映SEA地区的文化特征。最终,该项目收集了超过128万张与SEA文化相关的图像,规模远超现有其他数据集。通过这些努力,SEA-VL旨在缩小SEA地区的表征差距,推动构建更包容、能真实反映SEA多元文化的AI系统。
链接: https://arxiv.org/abs/2503.07920
作者: Samuel Cahyawijaya,Holy Lovenia,Joel Ruben Antony Moniz,Tack Hwa Wong,Mohammad Rifqi Farhansyah,Thant Thiri Maung,Frederikus Hudi,David Anugraha,Muhammad Ravi Shulthan Habibi,Muhammad Reza Qorib,Amit Agarwal,Joseph Marvin Imperial,Hitesh Laxmichand Patel,Vicky Feliren,Bahrul Ilmi Nasution,Manuel Antonio Rufino,Genta Indra Winata,Rian Adam Rajagede,Carlos Rafael Catalan,Mohamed Fazli Imam,Priyaranjan Pattnayak,Salsabila Zahirah Pranida,Kevin Pratama,Yeshil Bangera,Adisai Na-Thalang,Patricia Nicole Monderin,Yueqi Song,Christian Simon,Lynnette Hui Xian Ng,Richardy Lobo’ Sapan,Taki Hasan Rafi,Bin Wang,Supryadi,Kanyakorn Veerakanjana,Piyalitt Ittichaiwong,Matthew Theodore Roque,Karissa Vincentio,Takdanai Kreangphet,Phakphum Artkaew,Kadek Hendrawan Palgunadi,Yanzhi Yu,Rochana Prih Hastuti,William Nixon,Mithil Bangera,Adrian Xuan Wei Lim,Aye Hninn Khine,Hanif Muhammad Zhafran,Teddy Ferdinan,Audra Aurora Izzani,Ayushman Singh,Evan,Jauza Akbar Krito,Michael Anugraha,Fenal Ashokbhai Ilasariya,Haochen Li,John Amadeo Daniswara,Filbert Aurelian Tjiaranata,Eryawan Presma Yulianrifat,Can Udomcharoenchaikit,Fadil Risdian Ansori,Mahardika Krisna Ihsani,Giang Nguyen,Anab Maulana Barik,Dan John Velasco,Rifo Ahmad Genadi,Saptarshi Saha,Chengwei Wei,Isaiah Flores,Kenneth Ko Han Chen,Anjela Gail Santos,Wan Shen Lim,Kaung Si Phyo,Tim Santos,Meisyarah Dwiastuti,Jiayun Luo,Jan Christian Blaise Cruz,Ming Shan Hee,Ikhlasul Akmal Hanif,M.Alif Al Hakim,Muhammad Rizky Sya’ban,Kun Kerdthaisong,Lester James V. Miranda,Fajri Koto,Tirana Noor Fatyanosa,Alham Fikri Aji,Jostin Jerico Rosal,Jun Kevin,Robert Wijaya,Onno P. Kampman,Ruochen Zhang,Börje F. Karlsson,Peerat Limkonchotiwat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SEA-VL Dataset: this https URL
点击查看摘要
Abstract:Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
zh
[NLP-50] BEARCUBS: A benchmark for computer-using web agents
【速读】: 该论文旨在解决如何在真实世界场景中有效评估现代网络代理(web agent)能力的问题。现有评估基准通常依赖于合成或模拟网页,无法充分反映实际网络交互的不可预测性,也无法全面测试代理的多模态交互能力。为应对这一挑战,论文提出了一种名为BEARCUBS的新基准,包含111个信息检索问题。BEARCUBS的关键创新在于:(1) 要求代理访问真实的在线内容而非静态页面,从而捕捉现实交互的复杂性;(2) 强调需要执行广泛的多模态操作(如视频理解与三维导航),这些任务不能通过文本方式绕过。此外,每个问题都配有明确的答案及人类验证的浏览路径,便于透明评估。实验表明,尽管这些问题对人类而言具有挑战性(84.7%的人类准确率),但最先进的网络代理表现远逊于此(最佳系统仅达24.3%)。这些结果揭示了改进方向,包括提升可靠的信息源选择能力和增强多模态处理能力。解决方案的关键在于设计一个既贴近现实又能够全面测试代理能力的动态基准。
链接: https://arxiv.org/abs/2503.07919
作者: Yixiao Song,Katherine Thai,Chau Minh Pham,Yapei Chang,Mazin Nadaf,Mohit Iyyer
机构: UMass Amherst (马萨诸塞大学阿默斯特分校); University of Maryland, College Park (马里兰大学帕克分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 16 pages
点击查看摘要
Abstract:Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “small but mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI’s Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
zh
[NLP-51] Demystifying the Accuracy-Interpretability Trade-Off: A Case Study of Inferring Ratings from Reviews AAAI-2025
【速读】: 本文旨在解决可解释性机器学习模型与高性能黑盒模型之间的权衡问题,特别是在自然语言处理(NLP)领域中从评论推断评分这一特定任务上的应用。为了解决这个问题,研究者们对比分析了多种黑盒模型和可解释模型的表现,并提出了一个名为复合可解释性(Composite Interpretability, CI)的定量评分标准来可视化这种性能与可解释性之间的权衡关系,尤其是在复合模型中的表现情况。关键在于通过引入CI评分系统,研究揭示了虽然通常情况下随着模型可解释性的降低其学习性能会提高,但这一趋势并非始终单调递增,在某些情形下,可解释模型可能更具优势。
链接: https://arxiv.org/abs/2503.07914
作者: Pranjal Atrey,Michael P. Brundage,Min Wu,Sanghamitra Dutta
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at DAI Workshop, AAAI-2025
点击查看摘要
Abstract:Interpretable machine learning models offer understandable reasoning behind their decision-making process, though they may not always match the performance of their black-box counterparts. This trade-off between interpretability and model performance has sparked discussions around the deployment of AI, particularly in critical applications where knowing the rationale of decision-making is essential for trust and accountability. In this study, we conduct a comparative analysis of several black-box and interpretable models, focusing on a specific NLP use case that has received limited attention: inferring ratings from reviews. Through this use case, we explore the intricate relationship between the performance and interpretability of different models. We introduce a quantitative score called Composite Interpretability (CI) to help visualize the trade-off between interpretability and performance, particularly in the case of composite models. Our results indicate that, in general, the learning performance improves as interpretability decreases, but this relationship is not strictly monotonic, and there are instances where interpretable models are more advantageous.
zh
[NLP-52] Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?
【速读】: 该论文试图解决大型语言模型(LLMs)在推理任务中暴露的脆弱性,尤其是在处理长上下文链时执行长推理链条的能力不足的问题。为了解决这一问题,论文提出了一种名为MemReasoner的新架构,其关键是引入了一个增强记忆的机制,使模型能够学习上下文中事实的相对顺序,并通过解码器选择性地关注记忆,从而实现对事实的高效跳跃访问。MemReasoner通过端到端训练,并可选配不同程度的支持事实弱监督。实验表明,即使仅使用极少的支持事实(如1%),MemReasoner在单跳和双跳推理任务上的泛化能力仍显著优于基线模型,强调了显式记忆机制结合弱监督对于提升大语言模型上下文处理能力的重要性。
链接: https://arxiv.org/abs/2503.07903
作者: Payel Das,Ching-Yun Ko,Sihui Dai,Georgios Kollias,Subhajit Chaudhury,Aurelie Lozano
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language model’s context processing ability toward reasoning tasks.
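为直观理解“记忆模块学习上下文中事实的相对顺序、解码器对记忆做选择性注意”这一思路,下面给出一个与论文实现无关的numpy数值示意(位置偏置、向量维度等均为假设性简化):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_memory(query, memory, order_weight=0.1):
    # memory: (n_facts, d), 每行是一条上下文事实的向量表示
    # 线性位置偏置近似"学习事实相对顺序"的效果(假设性简化)
    n = memory.shape[0]
    pos = np.arange(n, dtype=float) / max(n - 1, 1)
    scores = memory @ query + order_weight * pos  # 内容相似度 + 顺序偏置
    weights = softmax(scores)
    readout = weights @ memory  # 解码器选择性读取的记忆摘要
    return readout, weights

memory = np.eye(5, 8)      # 5 条相互正交的"事实"向量
query = memory[3].copy()   # 查询与第 4 条事实内容一致
readout, weights = attend_memory(query, memory)
```

注意力权重应集中在与查询内容一致的那条事实上;真实模型中的记忆与排序编码均为端到端学习所得。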
zh
[NLP-53] Gemini Embedding: Generalizable Embeddings from Gemini
【速读】: 该论文旨在解决多语言和跨模态文本嵌入的通用性和性能提升问题。解决方案的关键在于提出Gemini Embedding,这是一种基于Google能力最强的大型语言模型Gemini构建的最先进(state-of-the-art)嵌入模型。Gemini Embedding充分利用了Gemini的多语言理解和代码理解能力,能够为跨越多种语言和文本模态的文本生成高度泛化的嵌入表示。这些表示可以预先计算,并广泛应用于分类、相似性分析、聚类、排序和检索等下游任务。通过在包含250多种语言超过一百项任务的Massive Multilingual Text Embedding Benchmark (MMTEB)上的评估,Gemini Embedding显著超越了先前的最先进模型,在嵌入质量方面表现出显著改进,实现了在多语言、英语以及代码基准测试中的最佳性能。
链接: https://arxiv.org/abs/2503.07891
作者: Jinhyuk Lee,Feiyang Chen,Sahil Dua,Daniel Cer,Madhuri Shanbhogue,Iftekhar Naim,Gustavo Hernández Ábrego,Zhe Li,Kaifeng Chen,Henrique Schechter Vera,Xiaoqi Ren,Shanfeng Zhang,Daniel Salz,Michael Boratko,Jay Han,Blair Chen,Shuo Huang,Vikram Rao,Paul Suganthan,Feng Han,Andreas Doumanoglou,Nithi Gupta,Fedor Moiseev,Cathy Yip,Aashi Jain,Simon Baumgartner,Shahrokh Shahi,Frank Palma Gomez,Sandeep Mariserla,Min Choi,Parashar Shah,Sonam Goenka,Ke Chen,Ye Xia,Koert Chen,Sai Meher Karthik Duddu,Yichang Chen,Trevor Walker,Wenlei Zhou,Rakesh Ghiya,Zach Gleicher,Karan Gill,Zhe Dong,Mojtaba Seyedhosseini,Yunhsuan Sung,Raphael Hoffmann,Tom Duerig
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages
点击查看摘要
Abstract:In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google’s most capable large language model. Capitalizing on Gemini’s inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB’s multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.
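摘要提到嵌入可以预先计算并复用于检索等下游任务。下面用numpy给出“预计算文档嵌入 + 余弦相似度排序”这一通用用法的最小示意(嵌入向量为随机模拟,与Gemini Embedding模型本身无关):

```python
import numpy as np

def cosine_sim(a, b):
    # 行向量归一化后做点积, 即余弦相似度
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# 假设 corpus_emb 是离线预计算并缓存的文档嵌入 (n_docs, d)
rng = np.random.default_rng(42)
corpus_emb = rng.normal(size=(4, 16))
query_emb = corpus_emb[2] * 0.5   # 构造一个与第 3 篇文档方向一致的查询

scores = cosine_sim(query_emb[None, :], corpus_emb)[0]
ranking = np.argsort(-scores)     # 相似度从高到低的文档下标
```

实际系统中查询与文档嵌入由嵌入模型在线/离线生成,此处仅演示预计算向量的复用方式。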
zh
[NLP-54] Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality
【速读】: 该论文试图解决在大规模语言模型训练中,如何在有限计算预算下平衡数据过滤与模型性能的问题。随着计算资源的增长,过度过滤和去重后的数据集体积有限,可能导致模型训练受限。论文通过研究不同计算预算下的模型表现,并分析经过数据过滤和去重的多种预训练数据集,发现适当调整训练策略后,在小规模重复的过滤数据集上进行多次训练(最多十轮),其效果可优于在十倍大的数据集上单轮训练的表现,且适用于多个数量级的计算预算。此外,论文进一步指出,通过对文档级别的重复次数进行显式调控,可以基于相同的令牌预算构建更优的数据集。关键在于通过调整训练轮次和文档级别的重复策略,优化数据使用效率,从而提升模型性能,而非单纯依赖更大的数据量或更高的计算投入。
链接: https://arxiv.org/abs/2503.07879
作者: Alex Fang,Hadi Pouransari,Matt Jordan,Alexander Toshev,Vaishaal Shankar,Ludwig Schmidt,Tom Gunter
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.
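摘要指出,在固定token预算下显式调控单个文档的重复次数可以构建更优的数据集。下面用纯Python给出一个假设性的贪心示意(文档质量分、重复上限与预算均为虚构,仅演示“按文档分配重复次数并填满token预算”的思路):

```python
def build_mixture(docs, token_budget):
    # docs: [(doc_id, n_tokens, quality, max_repeats), ...]
    # 质量越高的文档允许重复更多次, 按质量从高到低贪心填满 token 预算
    candidates = []
    for doc_id, n_tokens, quality, max_rep in docs:
        for _ in range(max_rep):
            candidates.append((quality, doc_id, n_tokens))
    candidates.sort(key=lambda c: -c[0])

    plan, used = [], 0
    for quality, doc_id, n_tokens in candidates:
        if used + n_tokens <= token_budget:
            plan.append(doc_id)
            used += n_tokens
    return plan, used

docs = [("high", 100, 0.9, 4), ("mid", 100, 0.5, 2), ("low", 100, 0.1, 1)]
plan, used = build_mixture(docs, token_budget=500)
```

预算填满后,高质量文档被重复多次而低质量文档被排除,对应摘要中“同一token预算下不同文档权重不等”的做法。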
zh
[NLP-55] MapQA: Open-domain Geospatial Question Answering on Map Data
【速读】: 该论文旨在解决地理空间问答(Geospatial Question Answering, Geo-QA)数据集在规模和多样性上的局限性,特别是现有数据集通常仅依赖于地理实体的文字描述而忽视其几何结构的问题。论文指出,扩展Geo-QA数据集以支持推理的关键挑战在于地理空间关系的复杂性,这些关系需要整合空间结构、拓扑依赖以及多跳推理能力,而这往往是基于文本的问答数据集所缺乏的。为了解决这些问题,论文引入了MapQA数据集,它不仅包含问答对,还包含了问题中引用的地理实体的几何信息。MapQA通过SQL查询模板从OpenStreetMap (OSM) 中提取问答对,并涵盖九种需要地理空间推理的问题类型。
解决方案的关键在于提出两种方法:一种是基于检索的语言模型,通过嵌入相似度对候选地理实体进行排名;另一种是大型语言模型(Large Language Model, LLM),能够从自然语言问题和地理实体属性生成SQL查询,然后在OSM数据库中执行。研究发现,基于检索的方法在捕捉“接近”和“方向”等概念方面表现良好,但在需要显式计算的问题(如距离计算)上表现不佳。而LLMs(如GPT和Gemini)在处理单跳推理的SQL查询生成方面表现出色,但在多跳推理任务上面临挑战,这揭示了提升Geo-QA系统性能的一个重要瓶颈。
链接: https://arxiv.org/abs/2503.07871
作者: Zekun Li,Malcolm Grossman,Eric (Ehsan) Qasemi,Mihir Kulkarni,Muhao Chen,Yao-Yi Chiang
机构: University of Minnesota, Twin Cities(明尼苏达大学双城分校); University of Minnesota, Twin Cities(明尼苏达大学双城分校); Oracle(甲骨文); Pennsylvania State University(宾夕法尼亚州立大学); University of California, Davis(加州大学戴维斯分校); University of Minnesota, Twin Cities(明尼苏达大学双城分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Geospatial question answering (QA) is a fundamental task in navigation and point of interest (POI) searches. While existing geospatial QA datasets exist, they are limited in both scale and diversity, often relying solely on textual descriptions of geo-entities without considering their geometries. A major challenge in scaling geospatial QA datasets for reasoning lies in the complexity of geospatial relationships, which require integrating spatial structures, topological dependencies, and multi-hop reasoning capabilities that most text-based QA datasets lack. To address these limitations, we introduce MapQA, a novel dataset that not only provides question-answer pairs but also includes the geometries of geo-entities referenced in the questions. MapQA is constructed using SQL query templates to extract question-answer pairs from OpenStreetMap (OSM) for two study regions: Southern California and Illinois. It consists of 3,154 QA pairs spanning nine question types that require geospatial reasoning, such as neighborhood inference and geo-entity type identification. Compared to existing datasets, MapQA expands both the number and diversity of geospatial question types. We explore two approaches to tackle this challenge: (1) a retrieval-based language model that ranks candidate geo-entities by embedding similarity, and (2) a large language model (LLM) that generates SQL queries from natural language questions and geo-entity attributes, which are then executed against an OSM database. Our findings indicate that retrieval-based methods effectively capture concepts like closeness and direction but struggle with questions that require explicit computations (e.g., distance calculations). LLMs (e.g., GPT and Gemini) excel at generating SQL queries for one-hop reasoning but face challenges with multi-hop reasoning, highlighting a key bottleneck in advancing geospatial QA systems.
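摘要提到MapQA借助SQL查询模板从OSM抽取问答对,并让LLM生成SQL在OSM数据库上执行。下面用Python内置的sqlite3给出一个极简示意(表结构与数据均为虚构,距离采用简化的平面近似而非真实球面距离):

```python
import sqlite3
import math

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geo_entity (name TEXT, type TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO geo_entity VALUES (?, ?, ?, ?)",
    [("Lake Cafe", "cafe", 44.97, -93.26),
     ("North Cafe", "cafe", 44.99, -93.23),
     ("City Museum", "museum", 44.98, -93.27)],
)

# 注册一个简化的平面距离函数供 SQL 调用(真实系统应使用球面距离/PostGIS)
conn.create_function("dist", 4, lambda a, b, c, d: math.hypot(a - c, b - d))

# 查询模板: "离 (lat, lon) 最近的 cafe 是哪个?" —— 对应单跳地理空间推理
row = conn.execute(
    "SELECT name FROM geo_entity WHERE type = ? "
    "ORDER BY dist(lat, lon, ?, ?) LIMIT 1",
    ("cafe", 44.975, -93.26),
).fetchone()
```

多跳问题(如“离某博物馆最近的咖啡馆附近有哪些公交站”)则需要将多个这样的子查询嵌套或串联,这正是摘要中LLM生成SQL的难点所在。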
zh
[NLP-56] cantnlp@DravidianLangTech-2025: A Bag-of-Sounds Approach to Multimodal Hate Speech Detection
【速读】: 该论文旨在解决多模态社交媒体数据中使用Dravidian语言进行仇恨言论检测的问题。论文的关键在于提出一种“声音袋”(bag-of-sounds)方法,利用变换后的Mel谱图特征在语音(音频)数据上训练仇恨言论检测模型。尽管候选模型在测试集上的表现不佳,但该方法在Malayalam和Tamil语言的训练和开发阶段展现了有前景的结果。研究结果表明,在拥有充足且平衡的训练数据条件下,结合文本与语音(音频)数据开发多模态仇恨言论检测系统是可行的。
链接: https://arxiv.org/abs/2503.07862
作者: Sidney Wong,Andrew Li
机构: Geospatial Research Institute (空间研究所); University of Canterbury (坎特伯雷大学); Lake Washington School District (华盛顿湖学区)
类目: Computation and Language (cs.CL)
备注: Accepted Fifth Workshop on Speech and Language Technologies for Dravidian Languages
点击查看摘要
Abstract:This paper presents the systems and results for the Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL) shared task at the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2025). We took a 'bag-of-sounds' approach by training our hate speech detection system on the speech (audio) data using transformed Mel spectrogram measures. While our candidate model performed poorly on the test set, our approach offered promising results during training and development for Malayalam and Tamil. With sufficient and well-balanced training data, our results show that it is feasible to use both text and speech (audio) data in the development of multimodal hate speech detection systems.
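作为“声音袋”思路的直观补充,下面用numpy给出一个自包含的简化示意:对合成信号计算短时幅度谱,再按标准Mel刻度公式 mel = 2595·log10(1 + f/700) 构建三角滤波器组,并对时间取平均得到定长特征向量。实际系统通常直接使用librosa等库提取Mel谱图,此处实现仅为说明原理:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # 在 mel 刻度上等间距取点, 映射回 Hz 后构建三角滤波器
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def bag_of_sounds(signal, sr=8000, n_fft=256, hop=128, n_mels=10):
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))   # (t, n_fft//2+1)
    mel = spec @ mel_filterbank(n_mels, n_fft, sr).T       # Mel 谱图
    return np.log1p(mel).mean(axis=0)                      # 时间平均 → 定长向量

t = np.arange(8000) / 8000.0
feat = bag_of_sounds(np.sin(2 * np.pi * 440 * t))          # 440 Hz 合成音
```

得到的定长向量即可送入任意分类器;论文实际使用的特征变换细节以原文为准。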
zh
[NLP-57] HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
【速读】: 该论文试图解决现有大型语言模型(LLMs)在多语言环境中生成非事实内容(即“幻觉”)的检测难题,特别是细粒度幻觉的识别不足问题。解决方案的关键在于构建了一个名为HalluVerse25的多语言LLM幻觉数据集,该数据集涵盖了英语、阿拉伯语和土耳其语三种语言,并细分为实体级、关系级和句子级幻觉。为了确保数据质量,研究团队通过利用LLM注入幻觉后,采用严格的人工标注流程来验证数据。这一方法不仅丰富了多语言幻觉数据的覆盖范围,还为评估不同上下文中LLM检测幻觉的能力提供了有价值的基准。
链接: https://arxiv.org/abs/2503.07833
作者: Samir Abdaljalil,Hasan Kurban,Erchin Serpedin
机构: Texas A&M University (德克萨斯农工大学); Hamad Bin Khalifa University (哈马德本哈利法大学); Texas A&M University (德克萨斯农工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as “hallucinations”. The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.
zh
[NLP-58] RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code ICLR2025
【速读】: 该论文旨在研究语言模型(Language Model, LM)代理在复杂代码重构任务中的独特局限性,并提出相应的改进方法。论文引入了RefactorBench基准测试,其中包含100个在流行开源仓库中精心构造的多文件重构任务,解决这些任务需要充分探索跨文件依赖并严格遵循相关指令。研究发现,当前LM代理仅能解决22%的基础指令任务,而人类开发者在有限时间内可解决87%的任务。关键解决方案在于通过状态感知的方法,将基线代理调整为以状态表示为条件进行决策,从而将任务解决率提高了43.9%。此外,论文进一步将状态感知方法扩展至完整的数字环境,并提出了未来研究方向。RefactorBench的目标是通过提供真实世界中的多跳代码任务,支持对LM代理的研究。
链接: https://arxiv.org/abs/2503.07832
作者: Dhruv Gautam,Spandan Garg,Jinu Jang,Neel Sundaresan,Roshanak Zilouchian Moghaddam
机构: UC Berkeley (加州大学伯克利分校); Microsoft (微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: ICLR 2025 Camera Ready
点击查看摘要
Abstract:Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce RefactorBench, a benchmark consisting of 100 large handcrafted multi-file refactoring tasks in popular open-source repositories. Solving tasks within RefactorBench requires thorough exploration of dependencies across multiple files and strong adherence to relevant instructions. Every task is defined by 3 natural language instructions of varying specificity and is mutually exclusive, allowing for the creation of longer combined tasks on the same repository. Baselines on RefactorBench reveal that current LM agents struggle with simple compositional tasks, solving only 22% of tasks with base instructions, in contrast to a human developer with short time constraints solving 87%. Through trajectory analysis, we identify various unique failure modes of LM agents, and further explore the failure mode of tracking past actions. By adapting a baseline agent to condition on representations of state, we achieve a 43.9% improvement in solving RefactorBench tasks. We further extend our state-aware approach to encompass entire digital environments and outline potential directions for future research. RefactorBench aims to support the study of LM agents by providing a set of real-world, multi-hop tasks within the realm of code.
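摘要指出LM代理的一个典型失败模式是无法跟踪自身的历史操作,而让代理以状态表示为条件可带来明显提升。下面用纯Python给出“显式维护环境状态并序列化进提示词”这一思想的假设性示意(类名与字段均为虚构,与论文实现无关):

```python
class RepoState:
    """极简的仓库状态表示: 记录已修改文件与已执行操作."""

    def __init__(self):
        self.modified_files = set()
        self.actions = []

    def apply(self, action, path):
        # 记录每一步操作; 写类操作额外记入已修改文件集合
        self.actions.append((action, path))
        if action in ("edit", "rename", "delete"):
            self.modified_files.add(path)

    def summary(self):
        # 序列化为文本, 拼接进代理的提示词, 使其以状态为条件决策
        files = ", ".join(sorted(self.modified_files)) or "(none)"
        return f"steps={len(self.actions)}; modified={files}"

state = RepoState()
state.apply("read", "utils.py")
state.apply("edit", "utils.py")
state.apply("edit", "main.py")
prompt_context = state.summary()
```

代理在每一步决策前读取该摘要,即可避免重复或遗漏已完成的修改;论文中状态表示的具体形式以原文为准。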
zh
[NLP-59] Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理历史语言(如古奥克语 Old Occitan)时的能力局限性问题,特别是面对非标准化拼写(orthography)和历时语法变异(diachronic variation)等挑战时的表现。论文通过对比分析圣徒传记文本与医学文本两个不同语料库,评估现有开放源代码LLMs在词性标注(Part-of-Speech Tagging, POS Tagging)任务上的性能。研究的关键在于识别LLMs在处理低资源历史语言时的核心限制,并通过对错误的详细分析提出针对性改进建议,以提升LLMs在复杂历史语言处理中的有效性。
链接: https://arxiv.org/abs/2503.07827
作者: Matthias Schöffel,Marinus Wiedner,Esteban Garces Arias,Paula Ruppert,Christian Heumann,Matthias Aßenmacher
机构: Bavarian Academy of Sciences (巴伐利亚科学院); LMU Munich (慕尼黑大学); Albert-Ludwigs-Universität Freiburg (弗莱堡阿尔贝图斯-路德维希大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora-hagiographical and medical texts-we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.
zh
[NLP-60] Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在复杂、多轮交互场景下与人类及多个工具协作时功能调用能力受限的问题。为了解决这一问题,论文提出了一种名为Magnet的框架,其关键是通过一种自动且迭代的方法,将函数签名路径转化为一系列可执行的查询和函数调用序列,从而合成高质量的训练轨迹以增强LLMs在多轮对话中的功能调用能力。Magnet利用图建模多轮交互中的复杂函数关系,并设计新颖的节点操作构建可靠的签名路径。此外,受上下文蒸馏启发,Magnet在使用教师模型指导正负轨迹生成时,提供正确的函数调用序列作为正向提示,同时引入对比性的错误函数调用作为负向提示,从而有效提升模型的功能调用性能。实验表明,经过监督微调(Supervised Fine-Tuning, SFT)并针对负轨迹进行偏好优化(Direct Preference Optimization, DPO)训练的Magnet-14B-mDPO模型,在BFCL-v3和ToolQuery基准测试中分别达到68.01和73.30,显著超越了教师模型Gemini-1.5-pro-002。
链接: https://arxiv.org/abs/2503.07826
作者: Fan Yin,Zifeng Wang,I-Hung Hsu,Jun Yan,Ke Jiang,Yanfei Chen,Jindong Gu,Long T. Le,Kai-Wei Chang,Chen-Yu Lee,Hamid Palangi,Tomas Pfister
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 4 tables
点击查看摘要
Abstract:Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that training with the positive trajectories with supervised fine-tuning and preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.
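Magnet用图建模函数之间的依赖关系并在其上构建签名路径。下面用纯Python给出一个假设性的小示意:按“上游函数的输出类型是下游函数的输入类型”连边,再用Kahn拓扑排序得到一条可执行的调用顺序(函数名与类型均为虚构,与论文的节点操作无关):

```python
from collections import defaultdict, deque

# 虚构的函数签名: name -> (输入类型列表, 输出类型)
signatures = {
    "search_flights": ([], "flight_list"),
    "pick_cheapest":  (["flight_list"], "flight"),
    "book_flight":    (["flight"], "booking"),
}

def signature_path(signatures):
    # 若 f 的输出类型是 g 的输入类型之一, 则连边 f -> g
    edges, indeg = defaultdict(list), {f: 0 for f in signatures}
    for f, (_, out) in signatures.items():
        for g, (ins, _) in signatures.items():
            if out in ins:
                edges[f].append(g)
                indeg[g] += 1
    # Kahn 拓扑排序, 得到一条满足依赖的调用顺序
    q = deque(f for f, d in indeg.items() if d == 0)
    order = []
    while q:
        f = q.popleft()
        order.append(f)
        for g in edges[f]:
            indeg[g] -= 1
            if indeg[g] == 0:
                q.append(g)
    return order

path = signature_path(signatures)
```

得到的调用顺序再配以自然语言查询,即对应摘要中“签名路径 → 查询与函数调用序列”的转换。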
zh
[NLP-61] Training Domain Draft Models for Speculative Decoding: Best Practices and Insights ICLR2025
【速读】: 该论文旨在解决在将推测解码(Speculative Decoding)应用于领域特定目标模型时,由于领域偏移导致通用草案模型的接受率显著下降的问题。论文的关键在于系统性地研究知识蒸馏技术,用于训练领域特定的草案模型以提高其推测准确性。具体而言,论文对比了白盒(white-box)和黑盒(black-box)蒸馏方法,并探讨了它们在不同数据可及性场景(如历史用户查询、精心策划的领域数据以及合成生成的对齐数据)下的有效性。实验结果表明,离线蒸馏比在线蒸馏性能高出11%到25%,白盒蒸馏优于黑盒蒸馏2%到10%,并且数据扩展趋势在各领域保持一致。此外,合成数据能够有效对齐草案模型,并达到使用历史用户查询训练性能的80%至93%。这些发现为训练领域特定的草案模型以提升推测解码效率提供了实用指南。
链接: https://arxiv.org/abs/2503.07807
作者: Fenglu Hong,Ravi Raju,Jonathan Lingjie Li,Bo Li,Urmish Thakker,Avinash Ravichandran,Swayambhoo Jain,Changran Hu
机构: SambaNova Systems, Inc. (萨姆巴诺瓦系统公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published as a workshop paper at SCOPE - ICLR 2025
点击查看摘要
Abstract:Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
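摘要中的白盒蒸馏通常指草案模型可以利用教师模型的完整logits分布,而黑盒蒸馏只能看到采样输出。下面用numpy给出白盒KL蒸馏损失的最小数值示意(logits为随机模拟,并非任何真实模型的输出):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_distill_loss(teacher_logits, student_logits):
    # 白盒蒸馏目标: KL(p_teacher || p_student), 对 batch 取平均
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(1)
teacher = rng.normal(size=(4, 10))            # (batch, vocab) 教师 logits
loss_same = kl_distill_loss(teacher, teacher)  # 完全对齐时损失为 0
loss_diff = kl_distill_loss(teacher, rng.normal(size=(4, 10)))
```

草案模型与领域教师分布越接近,推测解码的接受率通常越高;黑盒情形则只能对采样出的token序列做交叉熵训练。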
zh
[NLP-62] Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在群体公平性(group fairness)方面的两个主要局限性。首先,现有研究通常假设不同群体接收到的非敏感属性(如提示问题)相同,而在实际应用中,不同群体可能偏好不同的提示问题,这一假设在实践中难以实现。其次,当前方法仅评估LLM最终输出的公平性,而未能识别潜在偏见的来源,即偏见可能源于预训练和微调过程中的多个组件,包括强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)以及学习到的奖励模型(reward models)。论文的关键解决方案是通过基准测试LLM管道中每个组件的群体公平性,特别是评估学习到的奖励模型的公平性。为此,作者利用arXiv上的专家撰写文本,在无需跨群体使用相同提示问题的情况下,对奖励模型的群体公平性进行基准测试。结果表明,所有被评估的奖励模型均表现出显著的群体不公平性,同时发现性能最佳的奖励模型往往在群体公平性方面表现更好。
链接: https://arxiv.org/abs/2503.07806
作者: Kefan Song,Jin Yao,Runnan Jiang,Rohan Chandra,Shangtong Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case, the prompt questions, be the same across different groups. In many practical scenarios, different groups, however, may prefer different prompt questions and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM’s final output without identifying the source of possible bias. Namely, the bias in LLM’s output can result from both the pretraining and the finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing those two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observed that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.
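论文报告奖励模型的群体得分差异具有统计显著性。下面用numpy实现Welch t统计量,演示这类群体差异检验的基本形式(奖励分数为模拟数据,论文实际使用的检验方法以原文为准):

```python
import numpy as np

def welch_t(a, b):
    # Welch t 统计量: 不假设两组方差相等
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(va + vb))

rng = np.random.default_rng(7)
# 模拟: 奖励模型对两组作者文本的打分, 组 B 整体被低估 0.5
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=-0.5, scale=1.0, size=200)
t_stat = welch_t(group_a, group_b)
```

|t| 明显超过常用阈值(约2)即提示两组平均奖励存在显著差异,对应摘要中“statistically significant group unfairness”的结论形式。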
zh
[NLP-63] Fair Text Classification via Transferable Representations
【速读】: 该论文旨在解决文本分类中的群体公平性(group fairness)问题,特别是在敏感组别(如女性与男性)之间实现公平对待的挑战。论文的关键解决方案在于提出一种基于Wasserstein依赖度量(Wasserstein Dependency Measure)扩展的方法,用于学习无偏的神经文本分类器。其核心思想是从对抗训练中汲取灵感,通过在目标标签表示与敏感属性表示之间诱导独立性,区分公平与不公平的信息。此外,论文进一步利用领域自适应(Domain Adaptation)技术,在无需访问数据集中敏感属性的情况下实现公平性优化。理论分析与实证结果均表明该方法具有坚实的基础。
链接: https://arxiv.org/abs/2503.07691
作者: Thibaud Leteno,Michael Perrot,Charlotte Laclau,Antoine Gourru,Christophe Gravier
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2311.12689
点击查看摘要
Abstract:Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we cure. We provide both theoretical and empirical evidence that our approach is well-founded.
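该方法的核心度量是Wasserstein依赖度量。下面用numpy给出一维经验分布间W1距离的最小示意(一维等样本量情形下,W1等于排序后样本的平均绝对差),仅用于说明度量本身,并非论文的完整训练流程:

```python
import numpy as np

def wasserstein_1d(x, y):
    # 等样本量的一维 W1 距离: 排序后逐点求平均绝对差
    x = np.sort(np.asarray(x, float))
    y = np.sort(np.asarray(y, float))
    assert len(x) == len(y)
    return float(np.mean(np.abs(x - y)))

a = np.array([0.0, 1.0, 2.0])
b = np.array([2.0, 0.0, 1.0])   # 同一组样本的打乱, 分布相同
c = np.array([1.0, 2.0, 3.0])   # 整体平移 +1 的分布
d_same = wasserstein_1d(a, b)    # → 0.0
d_shift = wasserstein_1d(a, c)   # → 1.0
```

训练时将敏感组之间表示分布的此类距离作为正则项压低,即可诱导表示与敏感属性的独立性。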
zh
[NLP-64] Early Detection of Mental Health Issues Using Social Media Posts
【速读】: 该论文旨在解决精神健康障碍(如抑郁症、焦虑症和双相情感障碍)早期检测与干预的需求不足问题,通过利用Reddit等社交平台上的用户生成内容(User-Generated Content, UGC),探索基于多模态深度学习框架的解决方案。论文的关键在于提出了一种结合语言特征与时序特征的多模态方法,利用双向长短期记忆网络(Bi-directional Long Short-Term Memory, BiLSTM)分别分析文本和时序数据,捕捉序列依赖性和上下文模式,并引入跨模态注意力机制实现上下文感知的分类预测。模型在经过文本预处理、时序特征归一化及标签编码的Reddit帖子数据集上进行训练与评估,验证了其在精神健康状况检测中的有效性,相较于传统模型实现了更高的验证准确率(74.55%)和F1分数(0.7376)。
链接: https://arxiv.org/abs/2503.07653
作者: Qasim Bin Saeed,Ijaz Ahmed
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:The increasing prevalence of mental health disorders, such as depression, anxiety, and bipolar disorder, calls for immediate need in developing tools for early detection and intervention. Social media platforms, like Reddit, represent a rich source of user-generated content, reflecting emotional and behavioral patterns. In this work, we propose a multi-modal deep learning framework that integrates linguistic and temporal features for early detection of mental health crises. Our approach is based on the method that utilizes a BiLSTM network both for text and temporal feature analysis, modeling sequential dependencies in a different manner, capturing contextual patterns quite well. This work includes a cross-modal attention approach that allows fusion of such outputs into context-aware classification of mental health conditions. The model was then trained and evaluated on a dataset of labeled Reddit posts preprocessed using text preprocessing, scaling of temporal features, and encoding of labels. Experimental results indicate that the proposed architecture performs better compared to traditional models with a validation accuracy of 74.55% and F1-Score of 0.7376. This study presents the importance of multi-modal learning for mental health detection and provides a baseline for further improvements by using more advanced attention mechanisms and other data modalities.
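摘要描述了“BiLSTM分别编码文本与时序特征,再经跨模态注意力融合后分类”的流程。下面用numpy给出跨模态注意力融合这一步的形状级示意(维度为虚构,BiLSTM编码输出用随机矩阵代替,仅说明融合机制):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_attention(text_h, temp_h):
    # text_h: (T_text, d) 文本侧隐状态; temp_h: (T_time, d) 时序侧隐状态
    # 以时序侧为 query 注意文本侧, 得到跨模态上下文
    scores = temp_h @ text_h.T / np.sqrt(text_h.shape[1])
    attn = softmax(scores)                    # (T_time, T_text)
    context = attn @ text_h                   # (T_time, d)
    fused = np.concatenate([temp_h, context], axis=-1)
    return fused.mean(axis=0)                 # 池化为分类用向量

rng = np.random.default_rng(3)
text_h = rng.normal(size=(12, 16))   # 模拟文本 BiLSTM 输出
temp_h = rng.normal(size=(5, 16))    # 模拟时序 BiLSTM 输出
pooled = cross_modal_attention(text_h, temp_h)
```

池化后的融合向量再接一个分类头,即可得到精神健康状况的预测。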
zh
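上述框架中的“跨模态注意力融合”一步,可以用一段极简的 NumPy 草图来示意:以文本侧 BiLSTM 隐状态作 query、时序侧隐状态作 key/value 做一次注意力加权(此处 BiLSTM 输出以随机张量代替,单头、无投影矩阵,维度均为演示用的假设,并非论文实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_h, time_h):
    """text_h: (Lt, d) 文本侧隐状态;time_h: (Ls, d) 时序侧隐状态。
    以文本为 query、时序为 key/value 的缩放点积注意力(简化版)。"""
    d = text_h.shape[-1]
    scores = text_h @ time_h.T / np.sqrt(d)   # (Lt, Ls) 相似度
    weights = softmax(scores, axis=-1)        # 每个文本位置对时序步的注意力分布
    fused = weights @ time_h                  # (Lt, d) 融合后的表征
    return fused, weights

rng = np.random.default_rng(0)
text_h = rng.normal(size=(16, 8))   # 假设的文本 BiLSTM 输出
time_h = rng.normal(size=(6, 8))    # 假设的时序 BiLSTM 输出
fused, w = cross_modal_attention(text_h, time_h)
print(fused.shape)                  # (16, 8),每行注意力权重和为 1
```

融合结果随后可接池化与线性分类头,得到论文所述的情境化分类预测。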
[NLP-65] Mixture of Experts Made Intrinsically Interpretable
【速读】: 该论文旨在解决大型语言模型中神经元的“多语义性”(polysemanticity)问题,即神经元同时编码多个无关概念,导致模型可解释性降低。传统方法依赖于事后解释(post-hoc methods),而该研究提出了一种名为MoE-X的新方法,这是一种基于“专家混合网络”(Mixture-of-Experts, MoE)的模型,旨在实现内在可解释性(intrinsic interpretability)。解决方案的关键在于通过重新设计MoE架构,将其改写为等价的稀疏大MLP(多层感知机),并在每个专家内部强制执行稀疏激活,同时优化路由机制以优先选择激活最稀疏的专家。这种设计不仅在性能上与密集模型相当,还显著提升了模型的可解释性,超越了基于稀疏自动编码器(Sparse Autoencoder, SAE)的方法。
链接: https://arxiv.org/abs/2503.07639
作者: Xingyi Yang,Constantin Venhoff,Ashkan Khakzar,Christian Schroeder de Witt,Puneet K. Dokania,Adel Bibi,Philip Torr
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
zh
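“按激活稀疏度优先路由”的思路可以用如下假设性的 NumPy 草图来示意:对每个专家先算一遍隐层激活,用非零比例衡量稀疏度,再选出最稀疏的 k 个专家(专家数、维度与稀疏度度量均为演示用的假设,并非论文的具体路由实现):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sparsity_routing(x, experts_W, k=2):
    """x: (d,) 单个 token;experts_W: (E, h, d) 各专家第一层权重。
    以隐层激活的非零比例作为(反向的)稀疏度,选最稀疏的 k 个专家。"""
    acts = relu(np.einsum('ehd,d->eh', experts_W, x))   # (E, h) 各专家激活
    nonzero_frac = (acts > 0).mean(axis=1)              # 越小越稀疏
    chosen = np.argsort(nonzero_frac)[:k]               # 稀疏度最高的 k 个专家
    return chosen, acts[chosen]

rng = np.random.default_rng(0)
x = rng.normal(size=4)
experts_W = rng.normal(size=(8, 16, 4))   # 8 个专家,隐层宽 16
chosen, acts = sparsity_routing(x, experts_W, k=2)
print(chosen.shape, acts.shape)           # (2,) (2, 16)
```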
[NLP-66] Cross-modal Causal Relation Alignment for Video Question Grounding CVPR2025
【速读】: 该论文旨在解决视频问题定位(VideoQG)任务中模型因虚假跨模态相关性导致无法准确识别与问题意图对齐的关键视觉场景的问题。此外,现有视觉语言模型在下游任务如VideoQG中表现出不忠实的泛化能力和缺乏鲁棒性。为应对这些挑战,论文提出了一种名为跨模态因果关系对齐(Cross-modal Causal Relation Alignment, CRA)的新框架。CRA的关键在于其三个核心组件:(i) 高斯平滑定位(Gaussian Smoothing Grounding, GSG)模块通过跨模态注意力估计时间间隔,并利用自适应高斯滤波器去噪;(ii) 跨模态对齐(Cross-Modal Alignment, CMA)模块通过双向对比学习增强弱监督VideoQG的性能;(iii) 显式因果干预(Explicit Causal Intervention, ECI)模块通过前门干预视觉模态和后门干预语言模态实现多模态去混杂。实验结果验证了CRA在发现视觉定位内容和实现稳健问题推理方面的优越性。
链接: https://arxiv.org/abs/2503.07635
作者: Weixing Chen,Yang Liu,Binglin Chen,Jiandong Su,Yongsen Zheng,Liang Lin
机构: Sun Yat-sen University (中山大学); Shenzhen Institute of Advanced Technology (深圳先进技术研究院); Nanyang Technological University, Singapore (新加坡南洋理工大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at this https URL.
zh
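GSG 模块“先对逐帧注意力分数做高斯平滑去噪、再取高响应区间”的流程,可用如下草图粗略示意(核大小、阈值和取区间规则均为演示用假设,并非论文的自适应高斯滤波参数):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    t = np.arange(size) - size // 2
    k = np.exp(-t**2 / (2 * sigma**2))
    return k / k.sum()

def ground_interval(attn, sigma=1.0, thresh=0.5):
    """attn: (T,) 逐帧跨模态注意力分数。
    高斯平滑去噪后,取超过峰值一定比例的帧作为时间定位区间。"""
    smoothed = np.convolve(attn, gaussian_kernel(5, sigma), mode='same')
    mask = smoothed >= thresh * smoothed.max()
    idx = np.flatnonzero(mask)
    return int(idx.min()), int(idx.max()), smoothed

attn = np.array([0.1, 0.0, 0.2, 0.9, 1.0, 0.8, 0.1, 0.0])
start, end, sm = ground_interval(attn)
print(start, end)   # 3 5:孤立噪声被平滑掉,只留下连续高响应段
```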
[NLP-67] OWLViz: An Open-World Benchmark for Visual Question Answering
【速读】: 该论文旨在解决开放世界视觉问答(OWLViz)任务中多模态系统在工具选择与复杂推理序列执行能力上的显著局限性问题。论文通过构建一个具有挑战性的基准测试来揭示这一问题,发现即使最先进的视觉-语言模型(VLMs),如Gemini 2.0,也仅能达到26.6%的准确率,而人类在类似任务中的准确率为69.2%。当前基于有限视觉和视觉-语言模型的自主性VLMs表现更差。关键在于突破现有系统在工具选择与复杂推理方面的瓶颈,为实用型AI研究开辟新方向。
链接: https://arxiv.org/abs/2503.07631
作者: Thuy Nguyen,Dang Nguyen,Hoang Nguyen,Thuan Luong,Long Hoang Dang,Viet Dac Lai
机构: Reason Foundation (Reason基金会); University of Maryland (马里兰大学); Posts and Telecommunications Institute of Technology (邮政电信学院); Adobe Research (Adobe研究实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems’ ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.
zh
[NLP-68] FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation
【速读】: 该论文试图解决非自回归Transformer(Non-Autoregressive Transformer, NAT)在捕捉全局依赖关系方面的挑战。传统NAT方法通常难以有效建模长距离依赖,导致生成结果可能缺乏连贯性。为了解决这一问题,论文提出了一种名为FourierNAT的新架构,其关键创新在于利用基于离散傅里叶变换(Discrete Fourier Transform, DFT)的混合机制,在解码器中通过在整个序列维度上混合token嵌入,并结合学习到的频域门控机制(frequency-domain gating)。这种方法使得模型能够在无需显式自回归步骤的情况下高效传播上下文信息,同时允许模型根据需要动态调整对长距离或短距离依赖的关注程度。实验结果显示,FourierNAT在WMT机器翻译和CNN/DailyMail摘要任务等标准基准测试中取得了与领先NAT基线相当的结果,并显著提升了生成速度,展示了将频域操作集成到并行文本生成中的潜力,从而可能为大规模语言模型(LLMs)的推理任务带来显著的计算和时间节省。
链接: https://arxiv.org/abs/2503.07630
作者: Andrew Kiruluta,Eric Lundy,Andreas Lemos
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 11 pages, 1 figure
点击查看摘要
Abstract:We present FourierNAT, a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel. While traditional NAT approaches often face challenges with capturing global dependencies, our method leverages a discrete Fourier transform to mix token embeddings across the entire sequence dimension, coupled with learned frequency-domain gating. This allows the model to efficiently propagate context without explicit autoregressive steps. Empirically, FourierNAT achieves competitive results against leading NAT baselines on standard benchmarks like WMT machine translation and CNN/DailyMail summarization, providing significant speed advantages over autoregressive Transformers. We further demonstrate that learned frequency-domain parameters allow the model to adaptively focus on long-range or short-range dependencies, partially mitigating the well-known coherence gaps in one-pass NAT generation. Overall, FourierNAT highlights the potential of integrating spectral-domain operations to accelerate and improve parallel text generation. This approach can potentially provide substantial computational and time savings in LLM inference tasks.
zh
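“沿序列维做 DFT 混合并施加频域门控”这一核心操作,可用几行 NumPy 验证两个极端情形:全 1 门控等价于恒等映射;只保留直流分量的门控则把每个位置都混合成整段序列的均值,即一次操作就完成了全局信息传播(门控参数在论文中是可学习的,这里手工指定以便观察):

```python
import numpy as np

def fourier_mix(x, gate):
    """x: (L, d) token 嵌入;gate: (L,) 逐频率门控。
    沿序列维 FFT -> 频域逐点乘门控 -> 逆变换取实部。"""
    freq = np.fft.fft(x, axis=0)
    mixed = np.fft.ifft(freq * gate[:, None], axis=0)
    return mixed.real

rng = np.random.default_rng(0)
L, d = 8, 4
x = rng.normal(size=(L, d))

gate_identity = np.ones(L)                         # 全 1 门控:应还原输入
y = fourier_mix(x, gate_identity)
print(np.allclose(y, x))                           # True

lowpass = (np.fft.fftfreq(L) == 0).astype(float)   # 只保留直流分量
y_dc = fourier_mix(x, lowpass)
print(np.allclose(y_dc, x.mean(axis=0)))           # True:每个位置变成序列均值
```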
[NLP-69] Psychological Counseling Ability of Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在心理辅导领域能力评估不足的问题。解决方案的关键在于通过构建一个包含1096道心理辅导技能题的测试集,涵盖知识性、分析性和应用性题目类型,全面评估主流LLMs的心理辅导能力,并进一步结合《心理咨询师指南(三级)》优化模型表现。研究发现不同LLMs在中文和英文问题上的正确率存在显著差异,并通过指导书辅助将ERNIE-3.5的正确率提升了13.8%。这为未来提升LLMs的心理咨询能力提供了实证基础和改进方向。
链接: https://arxiv.org/abs/2503.07627
作者: Fangyu Peng,Jingxin Nie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 25 pages, 1 figure
点击查看摘要
Abstract:With the development of science and the continuous progress of artificial intelligence technology, Large Language Models (LLMs) have begun to be widely utilized across various fields. However, in the field of psychological counseling, the ability of LLMs has not been systematically assessed. In this study, we assessed the psychological counseling ability of mainstream LLMs using 1096 psychological counseling skill questions which were selected from the Chinese National Counselor Level 3 Examination, including Knowledge-based, Analytical-based, and Application-based question types. The analysis showed that the correctness rates of the LLMs for Chinese questions, in descending order, were GLM-3 (46.5%), GPT-4 (46.1%), ERNIE-3.5 (45.7%), Gemini (45.0%) and GPT-3.5 (32.9%). The correctness rates of the LLMs for English questions, in descending order, were ERNIE-3.5 (43.9%), GPT-4 (40.6%), Gemini (36.6%), GLM-3 (29.9%) and GPT-3.5 (29.5%). A chi-square test indicated significant differences in the LLMs’ performance on Chinese and English questions. Furthermore, we subsequently utilized the Counselor’s Guidebook (Level 3) as a reference for ERNIE-3.5, resulting in a new correctness rate of 59.6%, a 13.8% improvement over its initial rate of 45.8%. In conclusion, the study assessed the psychological counseling ability of LLMs for the first time, which may provide insights for future enhancement and improvement of psychological counseling ability of LLMs.
zh
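摘要中的卡方检验可按下式粗略复算:以 GPT-4 为例,把中/英文正确率乘以 1096 道题换算成假设的对/错计数(摘要未给出真实列联表,计数仅为演示,且假设两种语言题量相同):

```python
import numpy as np

def chi2_stat(table):
    """2x2 列联表的皮尔逊卡方统计量(无连续性校正)。"""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()         # 独立性假设下的期望计数
    return float(((table - expected) ** 2 / expected).sum())

n = 1096
zh_correct = round(0.461 * n)                  # 中文正确题数(由正确率换算,假设值)
en_correct = round(0.406 * n)                  # 英文正确题数(假设值)
table = [[zh_correct, n - zh_correct],
         [en_correct, n - en_correct]]
stat = chi2_stat(table)
print(round(stat, 2))                          # 约 6.69,超过 df=1、p=0.05 的临界值 3.84
```

统计量明显超过临界值,与摘要“中英文表现差异显著”的结论方向一致。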
[NLP-70] FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
【速读】: 该论文旨在解决现有基准测试缺乏专门评估大规模语言模型(Large Language Models, LLMs)在代码仓库级别增量开发能力框架的问题。为填补这一空白,论文提出了FEA-Bench,这是一个针对LLMs在代码仓库中增量开发能力进行评估的基准。解决方案的关键在于通过从83个GitHub仓库收集拉取请求,并采用基于规则和意图的过滤方法构建专注于新功能开发的任务实例。每个包含代码更改的任务实例都与相关的单元测试文件配对,以确保解决方案可验证性。此外,新功能实现需要LLMs同时具备为新组件完成代码以及编辑代码仓库中其他相关部分的能力,从而提供了一种更全面评估LLMs自动化软件工程能力的方法。实验结果表明,LLMs在FEA-Bench上的表现显著较差,凸显出此类仓库级别增量代码开发面临的重大挑战。
链接: https://arxiv.org/abs/2503.06680
作者: Wei Li,Xin Zhang,Zhongxin Guo,Shaoguang Mao,Wen Luo,Guangyue Peng,Yangyu Huang,Houfeng Wang,Scarlett Li
机构: State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University (北京大学); Microsoft Research Asia (微软亚洲研究院)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. The feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts in the code repository, providing a more comprehensive evaluation method of LLMs’ automated software engineering capabilities. Experimental results show that LLMs perform significantly worse in the FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
zh
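构建流程中“基于规则的过滤”环节可用如下假设性草图示意:按标题关键词、是否改动代码文件、是否附带测试等条件筛选 PR(字段名与关键词均为虚构,仅说明思路,并非 FEA-Bench 的实际规则):

```python
def looks_like_feature_pr(pr):
    """对单个 PR 记录做规则过滤:标题含新功能关键词、改了代码且带测试。"""
    title = pr["title"].lower()
    has_feature_kw = any(k in title for k in ("feat", "add", "support", "implement"))
    has_code_change = any(f.endswith(".py") for f in pr["changed_files"])
    has_tests = any("test" in f for f in pr["changed_files"])
    return has_feature_kw and has_code_change and has_tests

prs = [
    {"title": "feat: add CSV export", "changed_files": ["export.py", "tests/test_export.py"]},
    {"title": "fix typo in README", "changed_files": ["README.md"]},
]
kept = [p for p in prs if looks_like_feature_pr(p)]
print(len(kept))   # 1:只保留带测试的新功能 PR
```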
[NLP-71] BIPED: Pedagogically Informed Tutoring System for ESL Education ACL2024
【速读】: 该论文旨在解决现有会话式智能辅导系统(Conversational Intelligent Tutoring Systems, CITS)在教学复杂概念时存在的局限性,如仅能教授简单概念或缺乏必要的教学深度以适应多样化的学习策略。为开发更具教学洞察力的CITS,论文的关键解决方案是构建了一个双语教学导向的辅导数据集(BIPED),包含一对一的人类英语辅导互动,并基于此数据集开发了对话行为标注体系(包括34种导师行为和9种学生行为)。通过两步框架——首先预测适当的导师行为,然后生成对应的响应,论文分别使用GPT-4和SOLAR-KO实现了两种CITS模型。实验表明,这些模型不仅再现了人类教师的教学风格,还采用了多样化且情境适宜的教学策略。
链接: https://arxiv.org/abs/2406.03486
作者: Soonwoo Kwon,Sojung Kim,Minju Park,Seunghyun Lee,Kyuseok Kim
机构: Twelve Labs (Twelve Labs); KAIST (韩国科学技术院); Riiid AI Research (Riiid AI研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2024
点击查看摘要
Abstract:Large Language Models (LLMs) have a great potential to serve as readily available and cost-efficient Conversational Intelligent Tutoring Systems (CITS) for teaching L2 learners of English. Existing CITS, however, are designed to teach only simple concepts or lack the pedagogical depth necessary to address diverse learning strategies. To develop a more pedagogically informed CITS capable of teaching complex concepts, we construct a BIlingual PEDagogically-informed Tutoring Dataset (BIPED) of one-on-one, human-to-human English tutoring interactions. Through post-hoc analysis of the tutoring interactions, we come up with a lexicon of dialogue acts (34 tutor acts and 9 student acts), which we use to further annotate the collected dataset. Based on a two-step framework of first predicting the appropriate tutor act then generating the corresponding response, we implemented two CITS models using GPT-4 and SOLAR-KO, respectively. We experimentally demonstrate that the implemented models not only replicate the style of human teachers but also employ diverse and contextually appropriate pedagogical strategies.
zh
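论文两步框架(先预测导师行为、再条件生成回复)的接口形态可示意如下。实际系统中这两步由 GPT-4 / SOLAR-KO 完成,并使用 34 种导师行为的完整词表;下面的行为标签与模板回复均为演示用的假设:

```python
def predict_tutor_act(history):
    """第一步:根据对话历史预测下一个导师行为(此处用简单规则代替模型)。"""
    last = history[-1]["text"].lower()
    if "?" in last:
        return "answer_question"       # 行为名为假设,非论文词表
    return "give_feedback"

def generate_response(history, act):
    """第二步:在给定行为标签条件下生成回复(此处仅为模板占位)。"""
    templates = {
        "answer_question": "Good question! Let me explain...",
        "give_feedback": "Nice try - here is one correction...",
    }
    return templates[act]

history = [{"role": "student", "text": "What does 'went' mean?"}]
act = predict_tutor_act(history)
print(act, "->", generate_response(history, act))
```

将生成步骤显式地以行为标签为条件,正是该两步框架区别于端到端直接生成的地方。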
计算机视觉
[CV-0] QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
【速读】:该论文旨在解决现有长视频理解方法在处理视觉冗余时忽视输入级视觉标记与指令之间语义相关性的问题。解决方案的关键在于提出了一种名为QuoTA的事前(ante-hoc)免训练模块,它通过基于查询导向的帧级别重要性评估来扩展现有的大规模视频语言模型(LVLMs),实现视觉标记的分配。QuoTA的核心创新点包括:(i) 根据查询相关性策略性地分配帧级别的重要性分数,使得在解码器层跨模态交互之前可以一次性完成视觉标记的分配;(ii) 通过链式思维推理解耦查询,以促进更精确的LVLM基础帧重要性评分;(iii) 提供即插即用功能,可轻松集成到现有LVLM中。实验结果表明,在LLaVA-Video-7B上应用QuoTA后,在六个基准测试(包括Video-MME和MLVU)中的平均性能提升了3.2%,同时保持了与基线相同的视觉标记预算。
链接: https://arxiv.org/abs/2503.08689
作者: Yongdong Luo,Wang Chen,Xiawu Zheng,Weizhong Huang,Shukang Yin,Haojia Lin,Chaoyou Fu,Jinfa Huang,Jiayi Ji,Jiebo Luo,Rongrong Ji
机构: Xiamen University (厦门大学); Nanjing University (南京大学); University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, while existing methods employ post-hoc low-response token pruning in decoder layers, they overlook the input-level semantic correlation between visual tokens and instructions (query). In this paper, we propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models (LVLMs) for visual token assignment based on query-oriented frame-level importance assessment. The query-oriented token selection is crucial as it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers, (ii) we decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring, and (iii) QuoTA offers a plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within an identical visual token budget as the baseline. Codes are open-sourced at this https URL.
zh
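“按查询相关性分数一次性把视觉 token 预算分配到各帧”的思路可示意如下:分数归一化为比例后取整,余量补给高分帧(取整与余量分配规则为演示用的假设,并非论文实现):

```python
import numpy as np

def assign_tokens(frame_scores, budget):
    """frame_scores: 各帧与查询的相关性分数;budget: 视觉 token 总预算。
    返回每帧分到的 token 数,总和恰为 budget。"""
    scores = np.asarray(frame_scores, dtype=float)
    p = scores / scores.sum()                   # 归一化为分配比例
    alloc = np.floor(p * budget).astype(int)
    remainder = budget - alloc.sum()            # 取整后的剩余预算
    order = np.argsort(-scores)
    alloc[order[:remainder]] += 1               # 余量优先补给高分帧
    return alloc

scores = [0.9, 0.1, 0.5, 0.5]                   # 假设的帧级重要性分数
alloc = assign_tokens(scores, budget=100)
print(alloc, alloc.sum())                       # [45  5 25 25] 100
```

分配发生在解码器跨模态交互之前,因此低分帧的冗余 token 从一开始就不会进入后续计算。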
[CV-1] OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
【速读】:该论文旨在解决现有统一多模态理解与生成(Unified Multimodal Understanding and Generation)模型因计算复杂度呈二次增长以及对大规模训练数据的高度依赖而面临的挑战。论文提出的解决方案核心在于OmniMamba,这是一种基于线性架构的多模态生成模型,通过统一的下一-token预测范式实现文本和图像的联合生成。关键创新包括:(1) 解耦词汇表以引导特定模态的生成;(2) 针对任务的LoRA(Low-Rank Adaptation)用于参数高效的适应。此外,还引入了一种解耦的两阶段训练策略来缓解两个任务间的数据不平衡问题。这些技术使OmniMamba在仅使用200万图像-文本对的情况下,在多个基准测试中表现出色,同时实现了显著的推理效率提升,相比基于Transformer的模型,其长序列生成速度提升了119.2倍,GPU内存减少了63%。
链接: https://arxiv.org/abs/2503.08686
作者: Jialv Zou,Bencheng Liao,Qian Zhang,Wenyu Liu,Xinggang Wang
机构: School of EIC, Huazhong University of Science & Technology (华中科技大学电子与信息工程学院); Institute of Artificial Intelligence, Huazhong University of Science & Technology (华中科技大学人工智能研究院); Horizon Robotics (地平线)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2’s high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at this https URL
zh
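其中“任务专用 LoRA”的计算形式是通用的:冻结权重 W 之外各挂一组低秩增量 B @ A,理解与生成两个任务各用一组。下面用 NumPy 写出最小示意(秩与缩放系数为假设值;B 零初始化保证训练起点与原模型一致):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=8):
    """x: (batch, d_in);W: (d_out, d_in) 冻结权重;A: (r, d_in);B: (d_out, r)。
    输出等价于 x @ (W + (alpha/r) * B @ A)^T。"""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 2
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in)) * 0.01
B = np.zeros((d_out, r))            # B 零初始化:训练开始时 LoRA 不改变输出
x = rng.normal(size=(1, d_in))
print(np.allclose(lora_forward(x, W, A, B), x @ W.T))   # True
```

只训练 A、B(每组仅 r*(d_in+d_out) 个参数)即可实现参数高效的任务适配。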
[CV-2] “Principal Components” Enable A New Language of Images
【速读】:该论文旨在解决现有视觉分词器(Visual Tokenizer)在优化重建保真度(reconstruction fidelity)的同时忽视潜在空间结构特性的问题。这些结构特性对于模型的可解释性(interpretability)以及下游任务至关重要。论文的关键创新在于引入了一种新颖的视觉分词框架,将一种类似于主成分分析(Principal Component Analysis, PCA)的可证明结构嵌入到潜在令牌空间中。具体而言,该方法生成了一个一维因果令牌序列(1D causal token sequence),其中每个连续的令牌以数学方式保证其解释方差递减且不重叠,从而确保分词器优先提取图像中最显著的视觉特征,并通过后续令牌逐步补充互补但递减的信息。此外,论文还通过扩散解码器(diffusion decoder)解决了语义-光谱耦合效应(semantic-spectrum coupling effect),避免了高层语义内容与底层光谱细节在令牌中的非期望纠缠。实验结果表明,该方法不仅实现了最先进的重建性能,还提升了模型的可解释性,使其更接近人类视觉系统,并且基于该令牌序列训练的自回归模型在较少令牌数量的情况下达到了与当前最先进方法相当的性能。
链接: https://arxiv.org/abs/2503.08685
作者: Xin Wen,Bingchen Zhao,Ismail Elezi,Jiankang Deng,Xiaojuan Qi
机构: University of Hong Kong (香港大学); University of Edinburgh (爱丁堡大学); Noah’s Ark Lab (诺亚方舟实验室); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first two authors contributed equally, project page: this https URL
点击查看摘要
Abstract:We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space – a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
zh
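该分词器要在令牌序列上逼近的 PCA 性质——后续成分的解释方差单调递减且互不重叠——可用几行 NumPy 在普通数据上直接验证:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # 构造相关的 8 维数据
X = X - X.mean(axis=0)                                    # 中心化

# PCA:协方差矩阵特征分解,特征值即各主成分解释的方差
cov = X.T @ X / (len(X) - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]                   # 从大到小排列
ratios = eigvals / eigvals.sum()                          # 解释方差占比
print(np.all(np.diff(ratios) <= 0))                       # True:单调不增
print(ratios.round(3))
```

论文中的一维因果令牌序列正是把这种“先抓最大方差、后续递减互补”的结构搬进了视觉分词的潜空间。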
[CV-3] CoLMDriver: LLM -based Negotiation Benefits Cooperative Autonomous Driving
【速读】:该论文旨在解决传统车车通信(V2V)合作自动驾驶方法因协作协议僵化且对未见过的交互场景泛化能力有限而导致的局限性。同时,虽然基于大型语言模型(LLM)的方法具备通用推理能力,但其在空间规划上的不足及推理延迟的不稳定性限制了其直接应用于合作驾驶。为克服这些限制,论文提出了一种名为CoLMDriver的全管道LLM驱动合作驾驶系统,该系统能够实现有效的基于语言的协商和实时驾驶控制。CoLMDriver的关键在于其并行驾驶管道中的两个主要组件:(i) 基于LLM的协商模块,在演员-评论家范式下工作,通过所有车辆先前决策的反馈不断优化协作策略;以及(ii) 意图引导的路径点生成器,将协商结果转化为可执行的路径点。此外,还引入了InterDrive,这是一个基于CARLA的模拟基准,包含10个具有挑战性的交互驾驶场景,用于评估V2V合作性能。实验结果显示,CoLMDriver在各种高度交互的V2V驾驶场景中成功率达到现有方法的11%更高。
链接: https://arxiv.org/abs/2503.08683
作者: Changxing Liu,Genjia Liu,Zijun Wang,Jinchang Yang,Siheng Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
点击查看摘要
Abstract:Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under an actor-critic paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. Code will be released on this https URL.
zh
[CV-4] GarmentCrafter: Progressive Novel View Synthesis for Single-View 3D Garment Reconstruction and Editing
【速读】:本文旨在解决非专业用户在从单视角图像创建和修改三维服装时面临的挑战,现有单视角三维重建方法依赖预训练生成模型来基于参考图像和相机姿态合成新视角,但存在跨视角一致性不足的问题,无法捕捉不同视角间的内在关系。为应对这一挑战,论文提出通过渐进深度预测和图像变形来近似新视角,并利用多视图扩散模型完成被遮挡或未知的服装区域,同时结合RGB与深度联合推理以确保视角间的一致性并精确重构几何形状及细节。关键在于引入渐进深度预测、图像变形以及多视图扩散模型,从而实现更高的视觉保真度和跨视角一致性。
链接: https://arxiv.org/abs/2503.08678
作者: Yuanhao Wang,Cheng Zhang,Gonçalo Frazão,Jinlong Yang,Alexandru-Eugen Ichim,Thabo Beeler,Fernando De la Torre
机构: Carnegie Mellon University (卡内基梅隆大学); Texas A&M University (德克萨斯农工大学); Google AR (谷歌)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:We introduce GarmentCrafter, a new approach that enables non-professional users to create and modify 3D garments from a single-view image. While recent advances in image generation have facilitated 2D garment design, creating and editing 3D garments remains challenging for non-professional users. Existing methods for single-view 3D reconstruction often rely on pre-trained generative models to synthesize novel views conditioning on the reference image and camera pose, yet they lack cross-view consistency, failing to capture the internal relationships across different views. In this paper, we tackle this challenge through progressive depth prediction and image warping to approximate novel views. Subsequently, we train a multi-view diffusion model to complete occluded and unknown clothing regions, informed by the evolving camera pose. By jointly inferring RGB and depth, GarmentCrafter enforces inter-view coherence and reconstructs precise geometries and fine details. Extensive experiments demonstrate that our method achieves superior visual fidelity and inter-view coherence compared to state-of-the-art single-view 3D garment reconstruction methods.
zh
[CV-5] OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
【速读】:该论文旨在解决现实场景中基于扩散模型的对象移除与插入任务面临的挑战,包括物理效应的复杂交互以及配对训练数据不足等问题。论文的关键创新在于提出OmniPaint,这是一个统一框架,将对象移除和插入重新定义为相互依赖的过程而非孤立任务。其解决方案的核心在于利用预训练的扩散先验,并结合初始配对样本优化及后续通过CycleFlow进行的大规模非配对细化的渐进式训练管道,从而实现精确的前景消除和无缝的对象插入,同时忠实保留场景几何结构和内在属性。此外,文中提出的CFD指标提供了一种鲁棒的无参考上下文一致性评估方法,为高保真图像编辑设立了新的基准。
链接: https://arxiv.org/abs/2503.08677
作者: Yongsheng Yu,Ziyun Zeng,Haitian Zheng,Jiebo Luo
机构: University of Rochester (罗切斯特大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: this https URL
zh
[CV-6] Language-Depth Navigated Thermal and Visible Image Fusion
【速读】:该论文旨在解决现有热红外与可见光图像融合方法主要关注检测任务,而忽视深度等关键信息的问题。在低光照和复杂环境中,单模态图像的局限性明显,通过融合图像中的深度信息,不仅能够生成更精确的点云数据以提升三维重建的完整性和准确性,还能为机器人导航、定位及环境感知提供全面的场景理解,从而支持自动驾驶和救援任务中的精准识别与高效操作。论文的关键解决方案在于提出了一种文本引导且深度驱动的红外与可见光图像融合网络。该网络包含一个用于提取多通道互补信息的图像融合分支(基于扩散模型,并配备文本引导模块),以及两个辅助深度估计分支。融合分支利用CLIP提取语义信息和参数,通过对深度增强的图像描述进行处理来指导扩散模型提取多通道特征并生成融合图像;这些融合图像随后输入到深度估计分支以计算深度驱动损失,进而优化图像融合网络。此框架的核心目标是整合视觉-语言和深度信息,直接从多模态输入生成色彩融合图像。
链接: https://arxiv.org/abs/2503.08676
作者: Jinchang Zhang,Zijun Li,Guoyu Lu
机构: Intelligent Vision and Sensing (IVS) Lab at Binghamton University (宾厄姆顿大学), USA; University of Georgia (乔治亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.
zh
[CV-7] Keypoint Detection and Description for Raw Bayer Images
【速读】:该论文旨在解决传统关键点检测与局部特征描述方法主要依赖于经过图像信号处理器(Image Signal Processor, ISP)处理的RGB图像的问题,提出了一种直接处理原始图像的新网络。这种方法的关键在于引入了两种自定义设计的卷积核,能够在不转换为RGB格式的情况下直接对原始图像进行卷积运算,从而保留跨通道信息。这种创新不仅显著降低了硬件需求和内存消耗,还提升了在大旋转和尺度变化条件下的检测精度与稳定性,为资源受限环境提供了更高效的解决方案。
链接: https://arxiv.org/abs/2503.08673
作者: Jiakai Lin,Jinchang Zhang,Guoyu Lu
机构: University of Georgia; Binghamton University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Keypoint detection and local feature description are fundamental tasks in robotic perception, critical for applications such as SLAM, robot localization, feature matching, pose estimation, and 3D mapping. While existing methods predominantly operate on RGB images, we propose a novel network that directly processes raw images, bypassing the need for the Image Signal Processor (ISP). This approach significantly reduces hardware requirements and memory consumption, which is crucial for robotic vision systems. Our method introduces two custom-designed convolutional kernels capable of performing convolutions directly on raw images, preserving inter-channel information without converting to RGB. Experimental results show that our network outperforms existing algorithms on raw images, achieving higher accuracy and stability under large rotations and scale variations. This work represents the first attempt to develop a keypoint detection and feature description network specifically for raw images, offering a more efficient solution for resource-constrained environments.
zh
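直接在 raw 图上做卷积的一个常见前置步骤,是把 RGGB 拜耳马赛克重排成 4 通道半分辨率张量,使普通卷积核能同时看到四个颜色平面、保留跨通道信息。以下草图仅演示这一通用做法,并非论文中两种自定义卷积核的实现:

```python
import numpy as np

def bayer_to_channels(raw):
    """raw: (H, W) RGGB 拜耳原图(H、W 为偶数)。
    返回 (4, H/2, W/2):R、G1、G2、B 四个半分辨率平面。"""
    return np.stack([raw[0::2, 0::2],    # R
                     raw[0::2, 1::2],    # G1
                     raw[1::2, 0::2],    # G2
                     raw[1::2, 1::2]],   # B
                    axis=0)

raw = np.arange(16, dtype=float).reshape(4, 4)   # 玩具拜耳图
packed = bayer_to_channels(raw)
print(packed.shape)      # (4, 2, 2)
print(packed[0])         # R 平面:[[0., 2.], [8., 10.]]
```

这样后续网络无需 ISP 的去马赛克输出即可直接在原始数据上运算,对应论文强调的低内存、低硬件开销路线。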
[CV-8] SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
【速读】:该论文致力于解决向量量化(Vector Quantization, VQ)在权重微调过程中因压缩格式约束导致的量化误差问题。传统VQ方法在极端压缩场景下虽能显著降低量化误差,但在微调阶段受限于权重向量共享同一码本向量的更新方向约束,这使得许多量化权重被迫朝与局部梯度信息相反的方向调整,从而影响模型性能。为解决此问题,论文提出了一种名为Sign-Splitting VQ (SSVQ) 的新范式,其关键在于将权重的符号位与码本解耦,并通过提取未压缩权重的符号位进行独立处理,同时引入潜在变量联合优化符号位与码本。此外,还设计了渐进冻结策略以确保训练稳定性。实验结果表明,SSVQ在多种现代模型和任务上实现了优于传统VQ的压缩-精度权衡,并在硬件加速器上的测试显示其相比8位压缩模型提升了3倍内存访问速度。
链接: https://arxiv.org/abs/2503.08668
作者: Shuaiting Li,Juncan Deng,Chenxuan Wang,Kedong Xu,Rongtao Deng,Hong Gu,Haibin Shen,Kejie Huang
机构: Zhejiang University (浙江大学); vivo Mobile Communication Co., Ltd (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3× speedup over the 8-bit compressed model by reducing memory access.
zh
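以下用一个极简的 NumPy 玩具示例,示意 SSVQ“符号位与码本解耦”的核心步骤:先提取符号位,再对全正的权重子向量做 k-means 聚类压缩。函数名与参数均为假设,仅为概念演示,并非论文官方实现(论文中还对符号位引入潜变量做联合优化并采用渐进冻结策略,此处省略):

```python
import numpy as np

def ssvq_compress(W, num_codewords=4, dim=2, iters=20, seed=0):
    """Sign-Splitting VQ 的玩具示意(假设性实现):
    1) 解耦并保存符号位;2) 对 |W| 的子向量做朴素 k-means 得到全正码本。"""
    rng = np.random.default_rng(seed)
    signs = np.sign(W)                       # 符号位单独保存,不进码本
    mags = np.abs(W).reshape(-1, dim)        # 全正权重子向量
    codebook = mags[rng.choice(len(mags), num_codewords, replace=False)]
    for _ in range(iters):                   # 朴素 k-means 聚类
        d = ((mags[:, None, :] - codebook[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(num_codewords):
            if (assign == k).any():
                codebook[k] = mags[assign == k].mean(0)
    return signs, codebook, assign

def ssvq_decompress(signs, codebook, assign):
    """重建权重:码字给出幅值,符号位恢复方向。"""
    return signs * codebook[assign].reshape(signs.shape)
```

由于码字全为正值,重建权重的符号与原权重严格一致,这正是 SSVQ 缓解“同码字权重被迫同向更新”问题的出发点。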
[CV-9] REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
【速读】:该论文旨在解决视频嵌入器(video embedder)在生成式建模中的性能与压缩效率之间的权衡问题。传统方法通常要求嵌入器精确复现输入视频,而该研究提出了一种更宽松的标准:嵌入器应专注于生成视觉上合理的重建结果。这种新视角允许显著提升压缩比,同时不损害下游生成模型的质量。
解决方案的关键在于采用一种编码器-生成器框架,将传统的编码器-解码器替换为利用扩散变换器(diffusion transformer, DiT)从紧凑潜空间中合成缺失细节的结构。此外,论文设计了一个专门的潜在条件模块,用于将DiT解码器与编码后的视频潜在嵌入进行条件化处理。实验表明,该方法在高压缩比下实现了优于现有技术的编解码性能,并在文本到视频生成任务中验证了超紧凑潜在空间的鲁棒性,显著提升了潜扩散模型的训练与推理效率。
链接: https://arxiv.org/abs/2503.08665
作者: Yitian Zhang,Long Mai,Aniruddha Mahapatra,David Bourgin,Yicong Hong,Jonah Casebeer,Feng Liu,Yun Fu
机构: Adobe Research (Adobe 研究院); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.
zh
[CV-10] MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention CVPR2025
【速读】:该论文旨在解决现有方法在将多视角扩散模型应用于人体数据时面临的挑战,特别是在高分辨率(如1024x1024)下实现有效的多视角注意力机制的问题。论文的关键创新在于提出了“网格注意力(mesh attention)”这一解决方案,通过使用带服装的人体网格作为粗略几何表示,并结合光栅化和投影技术建立跨视角坐标对应关系,显著降低了多视角注意力的复杂性,同时保持了跨视角一致性。基于此,作者设计了一种网格注意力模块,并结合关键点条件构建了专门用于人体的多视角扩散模型MEAT。此外,论文还探讨了利用多视角人体运动视频进行扩散训练的方法,解决了数据稀缺的长期难题。实验结果表明,MEAT能够在百万像素级别有效生成密集且一致的多视角人体图像,超越现有的多视角扩散方法。
链接: https://arxiv.org/abs/2503.08664
作者: Yuhan Wang,Fangzhou Hong,Shuai Yang,Liming Jiang,Wayne Wu,Chen Change Loy
机构: S-Lab, Nanyang Technological University (南洋理工大学); WICT, Peking University (北京大学); UCLA (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025. Code this https URL Project Page this https URL
点击查看摘要
Abstract:Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.
zh
[CV-11] Generating Robot Constitutions & Benchmarks for Semantic Safety
【速读】:该论文旨在解决大型视觉语言模型(Vision-and-Language Models, VLMs)控制机器人后引发的语义安全风险问题。随着VLMs被赋予操作物理世界的机器人控制权,尽管它们能够实现更高层次的场景理解和自然语言交互,但其固有缺陷(如幻觉或越狱行为)可能导致危险操作。论文提出的关键解决方案包括两方面:首先,构建ASIMOV基准数据集,通过可扩展的数据生成方法从真实世界视觉场景和医院的人类伤害报告中生成潜在的不当情境,用于评估和提升基础模型的语义安全性;其次,开发一种框架利用宪法型人工智能(Constitutional AI)机制,从真实数据自动生成机器人的行为准则,并引入自动修正过程以细化规则表达,从而提高行为与人类偏好在期望性和安全性上的对齐程度。实验表明,使用生成的行为准则在ASIMOV基准上的对齐率达到84.3%,优于无准则基线和人工编写准则。
链接: https://arxiv.org/abs/2503.08663
作者: Pierre Sermanet,Anirudha Majumdar,Alex Irpan,Dmitry Kalashnikov,Vikas Sindhwani
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Until recently, robotics safety research was predominantly about collision avoidance and hazard reduction in the immediate vicinity of a robot. Since the advent of large vision and language models (VLMs), robots are now also capable of higher-level semantic scene understanding and natural language interactions with humans. Despite their known vulnerabilities (e.g. hallucinations or jail-breaking), VLMs are being handed control of robots capable of physical contact with the real world. This can lead to dangerous behaviors, making semantic safety for robots a matter of immediate concern. Our contributions in this paper are two fold: first, to address these emerging risks, we release the ASIMOV Benchmark, a large-scale and comprehensive collection of datasets for evaluating and improving semantic safety of foundation models serving as robot brains. Our data generation recipe is highly scalable: by leveraging text and image generation techniques, we generate undesirable situations from real-world visual scenes and human injury reports from hospitals. Secondly, we develop a framework to automatically generate robot constitutions from real-world data to steer a robot’s behavior using Constitutional AI mechanisms. We propose a novel auto-amending process that is able to introduce nuances in written rules of behavior; this can lead to increased alignment with human preferences on behavior desirability and safety. We explore trade-offs between generality and specificity across a diverse set of constitutions of different lengths, and demonstrate that a robot is able to effectively reject unconstitutional actions. We measure a top alignment rate of 84.3% on the ASIMOV Benchmark using generated constitutions, outperforming no-constitution baselines and human-written constitutions. Data is available at this http URL
zh
[CV-12] Task-Oriented Co-Design of Communication, Computing, and Control for Edge-Enabled Industrial Cyber-Physical Systems
【速读】:本文旨在解决任务关键型工业Cyber-Physical Systems (CPS) 中由于带宽限制、噪声干扰和延迟所面临的挑战。为应对这些问题,论文提出了一种面向任务的通信、计算和控制协同设计框架。解决方案的关键在于设计了一种面向任务的联合信源信道编码 (Joint Source-Channel Coding, JSCC),利用信息瓶颈 (Information Bottleneck, IB) 方法优先传输任务相关的信息,以提高数据传输效率和鲁棒性;同时开发了一种考虑端到端 (End-to-End, E2E) 延迟的轨迹引导控制预测 (Delay-Aware Trajectory-Guided Control Prediction, DTCP) 策略,将轨迹规划与控制预测相结合,并基于端到端延迟预测命令。此外,JSCC 和 DTCP 被协同设计,重点是及时可靠地传输任务相关的信息,从而实现自动驾驶。实验结果表明,在 1 秒(20 时间槽)的端到端延迟下,所提出的框架在 CARLA 仿真器中的驾驶得分为 48.12,比使用 Better Portable Graphics (BPG) 的方法高出 31.59 分,同时减少了 99.19% 的带宽使用。
链接: https://arxiv.org/abs/2503.08661
作者: Yufeng Diao,Yichi Zhang,Daniele De Martini,Philip Guodong Zhao,Emma Liying Li
机构: School of Computing Science, University of Glasgow (格拉斯哥大学计算科学学院); Department of Computer Science, University of Manchester (曼彻斯特大学计算机科学系); James Watt School of Engineering, University of Glasgow (格拉斯哥大学詹姆斯瓦特工程学院); Oxford Robotics Institute, Department of Engineering Science, University of Oxford (牛津大学工程科学系牛津机器人研究所)
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: This paper has been accepted for publication in IEEE Journal on Selected Areas in Communications (JSAC), with publication expected in 2025
点击查看摘要
Abstract:This paper proposes a task-oriented co-design framework that integrates communication, computing, and control to address the key challenges of bandwidth limitations, noise interference, and latency in mission-critical industrial Cyber-Physical Systems (CPS). To improve communication efficiency and robustness, we design a task-oriented Joint Source-Channel Coding (JSCC) using Information Bottleneck (IB) to enhance data transmission efficiency by prioritizing task-specific information. To mitigate the perceived End-to-End (E2E) delays, we develop a Delay-Aware Trajectory-Guided Control Prediction (DTCP) strategy that integrates trajectory planning with control prediction, predicting commands based on E2E delay. Moreover, the DTCP is co-designed with task-oriented JSCC, focusing on transmitting task-specific information for timely and reliable autonomous driving. Experimental results in the CARLA simulator demonstrate that, under an E2E delay of 1 second (20 time slots), the proposed framework achieves a driving score of 48.12, which is 31.59 points higher than using Better Portable Graphics (BPG) while reducing bandwidth usage by 99.19%.
zh
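DTCP“基于端到端时延在规划轨迹上预测控制命令”的思路,可用一个极简插值草图示意(假设性的玩具代码,非论文实现):控制端不取当前时刻的轨迹点,而是取 t + 时延 时刻的目标状态下发:

```python
import numpy as np

def delay_aware_command(traj_t, traj_x, t_now, e2e_delay):
    """在规划轨迹 (traj_t, traj_x) 上线性插值出 t_now + e2e_delay 时刻的目标状态,
    以补偿通信与计算带来的端到端时延(仅示意一维状态)。"""
    return float(np.interp(t_now + e2e_delay, traj_t, traj_x))
```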
[CV-13] MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input
【速读】:该论文旨在解决现有虚拟试穿(Virtual Try-On, VITON)方法依赖用户提供的辅助掩码而导致输入复杂性和性能下降的问题(如图1(a)所示)。论文提出了一种无掩码虚拟试穿(Mask-Free VITON, MF-VITON)框架,通过仅使用单人图像和目标服装实现逼真的虚拟试穿,无需额外的掩码输入。解决方案的关键在于引入了一种新颖的两阶段流程:第一阶段利用现有的基于掩码的VITON模型生成包含多样化真实人物图像与对应服装的数据集,并通过背景增强模拟现实场景;第二阶段在生成的数据集上微调预训练的基于掩码的模型,从而实现在不依赖掩码的情况下进行服装迁移,同时保持服装纹理和形状的保真度。该框架在服装迁移精度和视觉逼真度方面达到了最先进的性能水平。
链接: https://arxiv.org/abs/2503.08650
作者: Zhenchen Wan,Yanwu xu,Dongting Hu,Weilun Cheng,Tianxi Chen,Zhaoqing Wang,Feng Liu,Tongliang Liu,Mingming Gong
机构: University of Melbourne (墨尔本大学); The University of Sydney (悉尼大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is available at: this https URL
点击查看摘要
Abstract:Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches. For more details, visit our project page: this https URL.
zh
[CV-14] GBlobs: Explicit Local Structure via Gaussian Blobs for Improved Cross-Domain LiDAR-based 3D Object Detection CVPR2025
【速读】:该论文旨在解决基于 LiDAR 的 3D 检测器在训练过程中对大规模数据集的依赖以及在面对新颖领域时泛化能力不足的问题。为缓解这一挑战,Domain Generalization (DG) 方法致力于训练对域偏移具有不变性的检测器。然而,当前的 DG 方法仅依赖于全局几何特征(点云笛卡尔坐标)作为输入特征,这种过度依赖可能导致检测器过分关注物体的位置和绝对位置,从而影响跨领域的性能表现。论文的关键解决方案在于提出利用显式的局部点云结构进行 DG,具体通过使用高斯斑块(Gaussian Blobs, GBlobs)来编码点云邻域。GBlobs 的引入不仅高效且无需额外参数,在不牺牲领域内性能的前提下,显著提升了在单源 DG 挑战性基准测试中的表现,并在多源 DG 场景中也展现出卓越的性能。
链接: https://arxiv.org/abs/2503.08639
作者: Dušan Malić,Christian Fruhwirth-Reisinger,Samuel Schulter,Horst Possegger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:LiDAR-based 3D detectors need large datasets for training, yet they struggle to generalize to novel domains. Domain Generalization (DG) aims to mitigate this by training detectors that are invariant to such domain shifts. Current DG approaches exclusively rely on global geometric features (point cloud Cartesian coordinates) as input features. Over-reliance on these global geometric features can, however, cause 3D detectors to prioritize object location and absolute position, resulting in poor cross-domain performance. To mitigate this, we propose to exploit explicit local point cloud structure for DG, in particular by encoding point cloud neighborhoods with Gaussian blobs, GBlobs. Our proposed formulation is highly efficient and requires no additional parameters. Without any bells and whistles, simply by integrating GBlobs in existing detectors, we beat the current state-of-the-art in challenging single-source DG benchmarks by over 21 mAP (Waymo-KITTI), 13 mAP (KITTI-Waymo), and 12 mAP (nuScenes-KITTI), without sacrificing in-domain performance. Additionally, GBlobs demonstrate exceptional performance in multi-source DG, surpassing the current state-of-the-art by 17, 12, and 5 mAP on Waymo, KITTI, and ONCE, respectively.
zh
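GBlobs“用高斯斑块编码点云邻域”的思想可以用如下 NumPy 草图示意:对每个点取 k 近邻,在以该点为原点的局部坐标系中计算均值与协方差作为特征。函数与特征拼接方式均为假设性的玩具实现,并非论文代码;但可以验证该特征对绝对平移不变,这正是其有利于跨域泛化的直觉:

```python
import numpy as np

def gblob_features(points, k=8):
    """把每个点的 k 近邻邻域编码为高斯斑块(局部均值偏移 + 3x3 协方差),
    得到与绝对位置无关的局部几何特征(GBlobs 思想的玩具示意)。"""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]      # 排除自身的 k 个近邻
    feats = np.empty((n, 3 + 9))
    for i in range(n):
        nbr = points[nn[i]] - points[i]          # 以当前点为原点的局部坐标
        mu = nbr.mean(0)
        cov = np.cov(nbr.T, bias=True)           # 描述局部形状的协方差
        feats[i] = np.concatenate([mu, cov.ravel()])
    return feats
```

对整片点云整体平移后,特征应当(在数值误差内)不变,与“过度依赖全局笛卡尔坐标会损害跨域性能”的论点相呼应。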
[CV-15] Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning
【速读】:该论文试图解决的问题是评估所谓“固有可解释”(inherently interpretable)深度学习模型在实际应用中的可靠性和安全性,特别是其对对抗性攻击(如原型操纵和后门攻击)的脆弱性以及由此引发的虚假安全感。论文的关键解决方案在于引入两种针对基于原型网络的对抗性分析策略:原型操纵和后门攻击,并探讨概念瓶颈模型如何防御这些攻击。研究揭示了通过利用潜在原型来误导模型推理的现象,这暴露了深度神经网络本质上不可解释的特性,从而质疑了此类模型的信任度和适用性,进一步推动了对可解释模型鲁棒性和对齐性的研究。
链接: https://arxiv.org/abs/2503.08636
作者: Hubert Baniecki,Przemyslaw Biecek
机构: University of Warsaw (华沙大学); Warsaw University of Technology (华沙工业大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
点击查看摘要
Abstract:A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called “intrinsically (aka inherently) interpretable” models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model’s reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of prototype-based networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.
zh
[CV-16] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories CVPR2025
【速读】:该论文旨在解决多模态大型语言模型(MLLMs)在像素级理解能力上的不足,当前评估任务如视觉问答(VQA)和视觉接地(visual grounding)过于粗略,无法准确评估其细粒度像素理解能力。尽管分割是像素级理解的基础,但现有方法通常需要MLLMs生成隐式标记并通过外部像素解码器解码,这会干扰文本输出空间,可能削弱语言能力并降低灵活性与可扩展性,同时未能反映模型内在的像素级理解能力。为此,论文引入了人类相似掩码标注任务(HLMAT),这是一种新范式,让MLLMs模仿人类注释者使用交互式分割工具。通过将分割建模为一个多步骤马尔可夫决策过程,HLMAT使MLLMs能够迭代生成基于文本的点击点,从而实现高质量掩码而无需架构变化或隐式标记。关键在于通过这一设置开发的SegAgent,它在人类相似注释轨迹上微调,性能达到最先进的水平,并支持额外的任务如掩码细化和注释过滤。HLMAT提供了一种评估MLLMs细粒度像素理解能力的协议,并引入了一个以视觉为中心、多步骤决策制定的任务,促进了对MLLMs视觉推理能力的探索。此外,论文对策略改进方法StaR和PRM引导树搜索的适应进一步增强了复杂分割任务中的模型鲁棒性,为未来细粒度视觉感知和多步骤决策制定的进展奠定了基础。
链接: https://arxiv.org/abs/2503.08625
作者: Muzhi Zhu,Yuzhuo Tian,Hao Chen,Chunluan Zhou,Qingpei Guo,Yang Liu,Ming Yang,Chunhua Shen
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025;Code will be released at \url{ this https URL }
点击查看摘要
Abstract:While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM’s text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model’s intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs’ visual reasoning abilities. Our adaptations of policy improvement method StaR and PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs. 
zh
[CV-17] LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
【速读】:该论文旨在解决文本到图像生成模型对大规模数据集和高参数量架构的依赖问题,这严重限制了缺乏充足计算资源的研究人员和实践者的可及性。论文提出了一种名为LightGen的高效训练范式,其关键是结合知识蒸馏(Knowledge Distillation, KD)和直接偏好优化(Direct Preference Optimization, DPO)。通过从最先进的文本到图像生成模型中提取知识,并将其压缩到仅包含0.7B参数的紧凑掩码自回归(Masked Autoregressive, MAR)架构中,同时利用一个仅包含200万高质量合成图像的小型数据集,证明了数据多样性比数据量更能决定模型性能。此外,为了克服合成数据的固有缺陷,如高频细节缺失和空间不准确性,引入了DPO技术以提升图像保真度和位置精度。实验结果表明,LightGen在保持与顶级模型相当的生成质量的同时,大幅降低了计算需求并提升了资源受限环境下的可访问性。
链接: https://arxiv.org/abs/2503.08619
作者: Xianfeng Wu,Yajing Bai,Haoze Zheng,Harold Haodong Chen,Yexin Liu,Zihao Wang,Xuran Ma,Wen-Jie Shu,Xianzu Wu,Harry Yang,Ser-Nam Lim
机构: The Hong Kong University of Science and Technology (香港科技大学); Everlyn AI; University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code: this https URL
点击查看摘要
Abstract:Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce LightGen, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only 0.7B parameters. Using a compact synthetic dataset of just 2M high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at this https URL
zh
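LightGen 所依赖的知识蒸馏是通用技术,其软标签损失可以用下面的草图示意(温度软化的 KL 散度,为该领域的常见通用写法,并非 LightGen 官方实现;LightGen 实际是在合成数据上做数据层面的蒸馏,此处仅演示损失形式):

```python
import numpy as np

def softmax(x, t=1.0):
    """带温度的数值稳定 softmax。"""
    z = x / t
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """经典蒸馏损失:温度软化后的 KL(teacher || student),乘 T^2 以保持梯度量级。"""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1).mean() * temperature ** 2)
```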
[CV-18] HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
【速读】:该论文旨在解决现有端到端自动驾驶(End-to-End Autonomous Driving, E2E-AD)技术在闭环评估中的性能不足问题,特别是规划模块在查询设计与交互中的潜力尚未被充分挖掘的问题。论文的关键创新在于提出了一种多粒度规划查询表示方法,通过整合空间、时间及驾驶风格等多种采样模式下的异构路标点(heterogeneous waypoints),为轨迹预测提供额外监督,从而提升自车的精确闭环控制能力。此外,论文利用规划轨迹的几何特性,结合可变形注意力机制,实现了基于物理位置的图像特征有效检索。这些策略共同构成了一个统一解码器内的全新端到端自动驾驶框架HiP-AD,实现了感知、预测和规划的同时执行,并通过BEV空间中规划查询与感知查询的迭代交互以及动态提取透视视图中的图像特征,实现了全面的交互能力。实验结果表明,HiP-AD在Bench2Drive闭环基准测试中超越所有现有方法,并在nuScenes真实数据集上取得了竞争性表现。
链接: https://arxiv.org/abs/2503.08612
作者: Yingqi Tang,Zhuoran Xu,Zhaotie Meng,Erkang Cheng
机构: Nullmax
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, there remains an unsatisfactory performance on closed-loop evaluation. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, enhancing precise closed-loop control for the ego vehicle. Additionally, we explicitly utilize the geometric properties of planning trajectories to effectively retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes.
zh
[CV-19] Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
【速读】:该论文旨在解决在单一推理过程中生成高质量真实感长视频的挑战,主要由于有限的数据和高昂的计算成本导致现有文本到视频扩散模型难以直接扩展至长视频生成。尽管已有工作通过无调参方法(如使用多个提示实现动态可控的内容变化)来应对这一问题,但这些方法往往侧重于相邻帧之间的平滑过渡,容易引发内容漂移及长时间序列上的语义一致性丧失。论文的关键解决方案是提出了一种名为同步耦合采样(Synchronized Coupled Sampling, SynCoS)的新颖推理框架,该框架在整个视频范围内同步去噪路径,确保相邻与远距离帧之间的长期一致性。SynCoS结合了两种互补的采样策略——反向采样和基于优化的采样,分别保证局部平滑过渡和全局一致性,同时通过固定的接地时间步长和基准噪声将两者同步,以实现完全耦合并对齐的去噪轨迹,从而避免独立操作带来的不一致性和意外内容更改。实验结果表明,SynCoS显著提升了多事件长视频生成的质量,在平滑过渡和长期一致性方面优于现有方法。
链接: https://arxiv.org/abs/2503.08605
作者: Subin Kim,Seoung Wug Oh,Jui-Hsien Wang,Joon-Young Lee,Jinwoo Shin
机构: KAIST; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page with visuals: this https URL
点击查看摘要
Abstract:While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.
zh
[CV-20] CellStyle: Improved Zero-Shot Cell Segmentation via Style Transfer
【速读】:该论文旨在解决细胞显微镜数据集之间因细胞类型、成像设备和染色技术差异导致的领域差距(domain gap)问题,使得即使经过大规模预训练的分割模型在源数据集上表现良好,但在未标注的目标数据集上泛化能力不足的问题。论文的关键解决方案是提出CellStyle方法,通过将目标数据集(未标注)的纹理、颜色和噪声属性迁移到源数据集(已标注),同时保持源图像中的细胞形状不变,从而实现零样本适应(zero-shot adaptation)。这种方法利用样式迁移后的合成图像及其现有标注来微调通用分割模型,以适用于未标注的目标数据集,显著提升了跨多样数据集的零样本细胞分割性能。
链接: https://arxiv.org/abs/2503.08603
作者: Rüveyda Yilmaz,Zhu Chen,Yuli Wu,Johannes Stegmaier
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cell microscopy data are abundant; however, corresponding segmentation annotations remain scarce. Moreover, variations in cell types, imaging devices, and staining techniques introduce significant domain gaps between datasets. As a result, even large, pretrained segmentation models trained on diverse datasets (source datasets) struggle to generalize to unseen datasets (target datasets). To overcome this generalization problem, we propose CellStyle, which improves the segmentation quality of such models without requiring labels for the target dataset, thereby enabling zero-shot adaptation. CellStyle transfers the attributes of an unannotated target dataset, such as texture, color, and noise, to the annotated source dataset. This transfer is performed while preserving the cell shapes of the source images, ensuring that the existing source annotations can still be used while maintaining the visual characteristics of the target dataset. The styled synthetic images with the existing annotations enable the finetuning of a generalist segmentation model for application to the unannotated target data. We demonstrate that CellStyle significantly improves zero-shot cell segmentation performance across diverse datasets by finetuning multiple segmentation models on the style-transferred data. The code will be made publicly available.
zh
[CV-21] LiSu: A Dataset and Method for LiDAR Surface Normal Estimation CVPR2025
【速读】:该论文旨在解决基于激光雷达点云的表面法线估计问题,主要面临两个挑战:缺乏大规模标注数据集以及现有方法难以高效处理稀疏且常带噪声的激光雷达数据。为应对这些限制,论文提出了一种创新解决方案,其关键是利用自动驾驶数据的空间-时间特性来提升表面法线估计的准确性。通过引入两种正则化项,分别确保邻近点之间的空间一致性及连续帧间的时序平滑性,该方法在自训练场景下尤其有效,能够减轻伪标签噪声的影响,实现鲁棒的实际部署。此外,论文构建了LiSu数据集,这是首个包含真实表面法线标注的大规模合成激光雷达点云数据集,从而避免了繁琐的手动标注过程。
链接: https://arxiv.org/abs/2503.08601
作者: Dušan Malić,Christian Fruhwirth-Reisinger,Samuel Schulter,Horst Possegger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:While surface normals are widely used to analyse 3D scene geometry, surface normal estimation from LiDAR point clouds remains severely underexplored. This is caused by the lack of large-scale annotated datasets on the one hand, and lack of methods that can robustly handle the sparse and often noisy LiDAR data in a reasonable time on the other hand. We address these limitations using a traffic simulation engine and present LiSu, the first large-scale, synthetic LiDAR point cloud dataset with ground truth surface normal annotations, eliminating the need for tedious manual labeling. Additionally, we propose a novel method that exploits the spatiotemporal characteristics of autonomous driving data to enhance surface normal estimation accuracy. By incorporating two regularization terms, we enforce spatial consistency among neighboring points and temporal smoothness across consecutive LiDAR frames. These regularizers are particularly effective in self-training settings, where they mitigate the impact of noisy pseudo-labels, enabling robust real-world deployment. We demonstrate the effectiveness of our method on LiSu, achieving state-of-the-art performance in LiDAR surface normal estimation. Moreover, we showcase its full potential in addressing the challenging task of synthetic-to-real domain adaptation, leading to improved neural surface reconstruction on real-world data.
zh
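论文提出的两个正则项(邻近点空间一致性、连续帧时间平滑性)可以用基于余弦相似度的惩罚写成如下草图。索引约定与具体形式均为假设,仅示意损失结构,非论文实现:

```python
import numpy as np

def normalize(n):
    """把法线归一化为单位向量。"""
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def spatial_consistency_loss(normals, neighbor_idx):
    """空间一致性:惩罚每个点与其邻居法线夹角,取 1 - cos 的均值。
    neighbor_idx 形状为 (N, K),给出每个点的 K 个邻居索引。"""
    n = normalize(normals)
    cos = (n[:, None, :] * n[neighbor_idx]).sum(-1)   # (N, K)
    return float((1.0 - cos).mean())

def temporal_smoothness_loss(normals_t, normals_t1, corr_idx):
    """时间平滑性:相邻两帧中对应点的法线应缓慢变化。
    corr_idx 给出 t 帧每个点在 t+1 帧的对应点索引。"""
    a = normalize(normals_t)
    b = normalize(normals_t1)[corr_idx]
    cos = (a * b).sum(-1)
    return float((1.0 - cos).mean())
```

在自训练场景下,这类平滑约束相当于对带噪伪标签做投票式的正则,与摘要中“缓解伪标签噪声”的作用一致。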
[CV-22] X-Field: A Physically Grounded Representation for 3D X-ray Reconstruction
【速读】:该论文致力于解决X射线成像中由于辐射暴露限制导致的视角生成和CT体积重建难题。传统方法借用三维重建领域的表示方法,但这些方法主要针对强调反射和散射效应的可见光成像,未能充分考虑X射线成像中的穿透和衰减特性。为了解决这一问题,论文引入了X-Field,这是一种专为X射线成像设计的全新三维表示方法,基于不同材料的能量吸收率构建。其关键创新在于使用具有不同衰减系数的三维椭球体来精确建模内部结构中的多种材料,并通过一种高效的路径划分算法来估计每种材料的X射线能量吸收。此外,还提出了混合渐进初始化以提高X-Field的几何精度,并结合基于材料的优化来增强模型在材料边界处的拟合效果。实验结果表明,X-Field在真实人体器官数据集和合成物体数据集上均表现出色,优于现有最先进的X射线新视图合成和CT重建方法。
链接: https://arxiv.org/abs/2503.08596
作者: Feiran Wang,Jiachen Tao,Junyi Wu,Haoxuan Wang,Bin Duan,Kai Wang,Zongxin Yang,Yan Yan
机构: Illinois Institute of Technology (伊利诺伊理工学院); University of Illinois Chicago (芝加哥伊利诺伊大学); University of Michigan (密歇根大学); National University of Singapore (新加坡国立大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: \url{ this https URL }, Github Code: \url{ this https URL }
点击查看摘要
Abstract:X-ray imaging is indispensable in medical diagnostics, yet its use is tightly regulated due to potential health risks. To mitigate radiation exposure, recent research focuses on generating novel views from sparse inputs and reconstructing Computed Tomography (CT) volumes, borrowing representations from the 3D reconstruction area. However, these representations originally target visible light imaging that emphasizes reflection and scattering effects, while neglecting penetration and attenuation properties of X-ray imaging. In this paper, we introduce X-Field, the first 3D representation specifically designed for X-ray imaging, rooted in the energy absorption rates across different materials. To accurately model diverse materials within internal structures, we employ 3D ellipsoids with distinct attenuation coefficients. To estimate each material’s energy absorption of X-rays, we devise an efficient path partitioning algorithm accounting for complex ellipsoid intersections. We further propose hybrid progressive initialization to refine the geometric accuracy of X-Field and incorporate material-based optimization to enhance model fitting along material boundaries. Experiments show that X-Field achieves superior visual fidelity on both real-world human organ and synthetic object datasets, outperforming state-of-the-art methods in X-ray Novel View Synthesis and CT Reconstruction.
zh
[CV-23] 3D Point Cloud Generation via Autoregressive Up-sampling
【速读】:该论文旨在解决3D点云生成问题,特别是针对点云数据无序且不规则的固有特性,提出了一种基于自回归(Autoregressive)机制的新颖模型PointARU。论文的关键解决方案在于引入了一个渐进式的细化过程,从粗尺度到细尺度逐步优化3D点云,并采用两阶段训练范式:首先学习多尺度离散表示,然后训练一个自回归Transformer用于下一尺度预测。在第二阶段,模型通过结合基于解码点云的3D绝对位置编码以及专门设计的基于点的上采样网络模块,有效应对点云的无序性和不规则性。这一创新方法不仅超越了当前最先进的扩散(Diffusion-based)方法,在生成质量和参数效率方面表现优异,还特别擅长处理部分3D形状补全和稀疏点云上采样的任务。
链接: https://arxiv.org/abs/2503.08594
作者: Ziqiao Meng,Qichao Wang,Zhipeng Zhou,Irwin King,Peilin Zhao
机构: Tencent AI Lab (腾讯人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We introduce a pioneering autoregressive generative model for 3D point cloud generation. Inspired by visual autoregressive modeling (VAR), we conceptualize point cloud generation as an autoregressive up-sampling process. This leads to our novel model, PointARU, which progressively refines 3D point clouds from coarse to fine scales. PointARU follows a two-stage training paradigm: first, it learns multi-scale discrete representations of point clouds, and then it trains an autoregressive transformer for next-scale prediction. To address the inherent unordered and irregular structure of point clouds, we incorporate specialized point-based up-sampling network modules in both stages and integrate 3D absolute positional encoding based on the decoded point cloud at each scale during the second stage. Our model surpasses state-of-the-art (SoTA) diffusion-based approaches in both generation quality and parameter efficiency across diverse experimental settings, marking a new milestone for autoregressive methods in 3D point cloud generation. Furthermore, PointARU demonstrates exceptional performance in completing partial 3D shapes and up-sampling sparse point clouds, outperforming existing generative models in these tasks.
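为帮助理解“自回归上采样”的由粗到细流程,下面给出一个极简的 numpy 示意:每一步把当前尺度的每个点复制两份并加上偏移量(真实模型中该偏移由自回归 Transformer 按 next-scale 预测给出,此处用随机数代替)。点数翻倍比例、偏移幅度均为假设,仅用于说明流程,并非论文实现:

```python
import numpy as np

def upsample_scale(points, offsets):
    """一次由粗到细的上采样:把每个点复制两份并加上预测偏移(示意)。"""
    parents = np.repeat(points, 2, axis=0)
    return parents + offsets

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))              # 最粗尺度的点云
for _ in range(2):                          # 两次 next-scale 预测:64 -> 128 -> 256
    # 真实模型中 offsets 由自回归 Transformer 基于已解码尺度预测
    offsets = 0.05 * rng.normal(size=(len(pts) * 2, 3))
    pts = upsample_scale(pts, offsets)
```

每轮上采样后点数翻倍,与摘要中“progressively refines 3D point clouds from coarse to fine scales”的表述对应。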
zh
[CV-24] Integration of nested cross-validation, automated hyperparameter optimization, and high-performance computing to reduce and quantify the variance of test performance estimation of deep learning models
【速读】:该论文旨在解决深度学习模型在医学影像领域实际应用性能基准测试中存在的变异性(variance)和偏差(biases)问题,这些问题影响了模型在真实世界部署中的可信度。传统方法通过固定单一测试集的方法难以量化测试性能指标估计的变异性。论文的关键解决方案是提出了NACHOS框架,它通过嵌套交叉验证(Nested Cross-Validation, NCV)、自动超参数优化(Automated Hyperparameter Optimization, AHPO)以及高性能计算(High-Performance Computing, HPC)的集成,有效减少并量化了深度学习模型测试性能指标的变异性。此外,进一步引入DACHOS框架,在全数据集上利用AHPO和交叉验证构建最终模型,以提升预期部署性能。论文强调了NCV在量化和减少估计方差、AHPO在跨测试折分一致优化超参数、以及HPC在保证计算可行性方面的重要性,从而提供了一个可扩展、可重复且值得信赖的深度学习模型评估与部署框架。
链接: https://arxiv.org/abs/2503.08589
作者: Paul Calle,Averi Bates,Justin C. Reynolds,Yunlong Liu,Haoyang Cui,Sinaro Ly,Chen Wang,Qinghao Zhang,Alberto J. de Armendi,Shashank S. Shettar,Kar Ming Fung,Qinggong Tang,Chongle Pan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The variability and biases in the real-world performance benchmarking of deep learning models for medical imaging compromise their trustworthiness for real-world deployment. The common approach of holding out a single fixed test set fails to quantify the variance in the estimation of test performance metrics. This study introduces NACHOS (Nested and Automated Cross-validation and Hyperparameter Optimization using Supercomputing) to reduce and quantify the variance of test performance metrics of deep learning models. NACHOS integrates Nested Cross-Validation (NCV) and Automated Hyperparameter Optimization (AHPO) within a parallelized high-performance computing (HPC) framework. NACHOS was demonstrated on a chest X-ray repository and an Optical Coherence Tomography (OCT) dataset under multiple data partitioning schemes. Beyond performance estimation, DACHOS (Deployment with Automated Cross-validation and Hyperparameter Optimization using Supercomputing) is introduced to leverage AHPO and cross-validation to build the final model on the full dataset, improving expected deployment performance. The findings underscore the importance of NCV in quantifying and reducing estimation variance, AHPO in optimizing hyperparameters consistently across test folds, and HPC in ensuring computational feasibility. By integrating these methodologies, NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging.
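嵌套交叉验证(NCV)的核心是“外层折估计测试性能及其方差、内层折完成超参数选择”。下面用 numpy 与闭式岭回归给出一个可运行的简化示意(折数、候选超参数均为假设;真实的 NACHOS 还包含 HPC 并行化与深度模型训练,此处省略):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """把 n 个样本随机划分为 k 折,返回每折的索引数组。"""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def ridge_fit(X, y, lam):
    """闭式岭回归,在示意中充当“深度模型”。"""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def nested_cv(X, y, lambdas, outer_k=5, inner_k=3):
    """外层折量化测试性能的均值与方差,内层折完成超参数选择(AHPO 的简化版)。"""
    outer_scores = []
    for i, test_idx in enumerate(kfold_indices(len(y), outer_k)):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        Xtr, ytr = X[train_idx], y[train_idx]
        best_lam, best_err = None, np.inf
        for lam in lambdas:                      # 内层:逐候选超参数做 CV
            errs = []
            for val_idx in kfold_indices(len(ytr), inner_k, seed=i):
                fit_idx = np.setdiff1d(np.arange(len(ytr)), val_idx)
                w = ridge_fit(Xtr[fit_idx], ytr[fit_idx], lam)
                errs.append(mse(w, Xtr[val_idx], ytr[val_idx]))
            if np.mean(errs) < best_err:
                best_lam, best_err = lam, float(np.mean(errs))
        w = ridge_fit(Xtr, ytr, best_lam)        # 用选中的超参数在外层训练集上重训
        outer_scores.append(mse(w, X[test_idx], y[test_idx]))
    return float(np.mean(outer_scores)), float(np.std(outer_scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
mean_err, std_err = nested_cv(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

返回的 `std_err` 即论文强调的“单一固定测试集无法量化”的测试性能估计方差;DACHOS 则相当于在此之后用 AHPO 在全数据集上重训最终模型。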
zh
[CV-25] HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding CVPR2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在中长视频理解中的局限性,特别是由于帧数和上下文长度限制导致的帧采样依赖问题。帧采样的方法可能遗漏关键信息,并缺乏任务相关性。为了解决这些问题,论文提出了一种名为HierarQ的任务感知分层Q-Former框架,通过顺序处理帧来避免帧采样的需求,同时克服传统大语言模型上下文长度的限制。关键创新在于引入了一个轻量级的双流语言引导特征调制器,其中实体流捕获短上下文中帧级别的对象信息,场景流识别长时间范围内的交互关系。每个流由专用的记忆库支持,使得所提出的分层查询变换器(Hierarchical Querying Transformer, HierarQ)能够有效捕捉短时和长时上下文。这一方案的核心在于通过任务感知和分层处理实现高效且全面的视频分析能力。
链接: https://arxiv.org/abs/2503.08585
作者: Shehreen Azad,Vibhav Vineet,Yogesh Singh Rawat
机构: Center for Research in Computer Vision, University of Central Florida (中佛罗里达大学); Microsoft Research (微软研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2025
点击查看摘要
Abstract:Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding LLM’s context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks which enable our proposed Hierarchical Querying Transformer (HierarQ) to effectively capture short and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ’s state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.
zh
[CV-26] MsaMIL-Net: An End-to-End Multi-Scale Aware Multiple Instance Learning Network for Efficient Whole Slide Image Classification ICCV2025
【速读】:该论文旨在解决基于包的多重实例学习(Bag-based Multiple Instance Learning, MIL)方法在全片图像(Whole Slide Image, WSI)分类中的两个主要问题:一是现有方法采用分段训练策略,导致特征提取网络与MIL网络之间协作优化不足,无法实现端到端联合优化,从而限制模型整体性能;二是传统方法通常从固定大小的所有patch中提取特征,忽略了病理学家的多尺度观察特性,在肿瘤区域比例较低的数据集(如Camelyon16)中造成显著的计算资源浪费,并可能导致次优解。为解决这些问题,论文提出了一种端到端的多尺度WSI分类框架,其关键在于将多尺度特征提取与多重实例学习相结合,具体包括:(1)语义特征过滤模块以减少非病灶区域的干扰;(2)多尺度特征提取模块以捕获不同层次的病理信息;(3)多尺度融合MIL模块用于全局建模和特征整合。通过端到端的训练策略,同时优化特征提取器和MIL网络,确保两者之间的最佳兼容性。实验结果表明,该方法在DigestPath2019、BCNB和UBC-OCEAN三个跨中心数据集上均优于现有最先进的方法,无论是准确率(ACC)还是AUC指标都表现出色。
链接: https://arxiv.org/abs/2503.08581
作者: Jiangping Wen,Jinyu Wen,Emei Fang
机构: Guangzhou University (广州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: submitted to ICCV2025
点击查看摘要
Abstract:Bag-based Multiple Instance Learning (MIL) approaches have emerged as the mainstream methodology for Whole Slide Image (WSI) classification. However, most existing methods adopt a segmented training strategy, which first extracts features using a pre-trained feature extractor and then aggregates these features through MIL. This segmented training approach leads to insufficient collaborative optimization between the feature extraction network and the MIL network, preventing end-to-end joint optimization and thereby limiting the overall performance of the model. Additionally, conventional methods typically extract features from all patches of fixed size, ignoring the multi-scale observation characteristics of pathologists. This not only results in significant computational resource waste when tumor regions represent a minimal proportion (as in the Camelyon16 dataset) but may also lead the model to suboptimal solutions. To address these limitations, this paper proposes an end-to-end multi-scale WSI classification framework that integrates multi-scale feature extraction with multiple instance learning. Specifically, our approach includes: (1) a semantic feature filtering module to reduce interference from non-lesion areas; (2) a multi-scale feature extraction module to capture pathological information at different levels; and (3) a multi-scale fusion MIL module for global modeling and feature integration. Through an end-to-end training strategy, we simultaneously optimize both the feature extractor and MIL network, ensuring maximum compatibility between them. Experiments were conducted on three cross-center datasets (DigestPath2019, BCNB, and UBC-OCEAN). Results demonstrate that our proposed method outperforms existing state-of-the-art approaches in terms of both accuracy (ACC) and AUC metrics. 
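MIL 聚合环节的一个常用构件是基于注意力的加权池化(即 attention-MIL 思路):对 bag 内各 patch 特征打注意力分数,再加权求和得到切片级表示。下面是与具体论文无关的极简 numpy 示意,维度均为假设,仅说明这一通用聚合机制:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(patch_feats, w, v):
    """注意力 MIL 聚合:为每个 patch 计算注意力得分,再加权求和(示意)。"""
    scores = np.tanh(patch_feats @ v) @ w   # (n_patch,) 每个 patch 一个得分
    a = softmax(scores)                     # 注意力权重,和为 1
    return a @ patch_feats, a               # 切片级特征与各 patch 权重

rng = np.random.default_rng(1)
patches = rng.normal(size=(12, 32))   # 假想的一个 WSI bag 中 12 个 patch 的特征
v = rng.normal(size=(32, 16))
w = rng.normal(size=16)
bag_feat, attn = attention_mil_pool(patches, w, v)
```

端到端训练时,梯度会经由这一聚合同时回传到特征提取器,这正是论文强调的“特征提取网络与 MIL 网络联合优化”的前提。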
zh
[CV-27] Comparing Satellite Data for Next-Day Wildfire Predictability
【速读】:该论文试图解决使用卫星数据进行次日野火预测的问题,特别是比较MODIS(Moderate Resolution Imaging Spectroradiometer)和VIIRS(Visible Infrared Imaging Radiometer Suite)两种卫星数据在这一任务中的适用性。论文的关键在于评估这两种卫星数据作为输入以及相应的火掩模产品(MOD14和VNP14)作为目标变量时的预测性能,并发现以VIIRS为输入且以VNP14为目标的模型表现最佳。此外,研究揭示了MOD14由于其高度随机性而不适合用于次日火预测,而通过改进MODIS的火检测模型可以显著提升预测能力。因此,论文的关键解决方案在于优选VIIRS数据结合VNP14目标,并提出改进MODIS火检测模型的可能性。
链接: https://arxiv.org/abs/2503.08580
作者: Justus Karlsson,Yonghao Xu,Amanda Berg,Leif Haglund
机构: Linköping University (林雪平大学); Maxar Intelligence (马克斯智能), Linköping (林雪平), Sweden; Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multiple studies have performed next-day fire prediction using satellite imagery. Two main satellites are used to detect wildfires: MODIS and VIIRS. Both satellites provide fire mask products, called MOD14 and VNP14, respectively. Studies have used one or the other, but there has been no comparison between them to determine which might be more suitable for next-day fire prediction. In this paper, we first evaluate how well VIIRS and MODIS data can be used to forecast wildfire spread one day ahead. We find that the model using VIIRS as input and VNP14 as target achieves the best results. Interestingly, the model using MODIS as input and VNP14 as target performs significantly better than using VNP14 as input and MOD14 as target. Next, we discuss why MOD14 might be harder to use for predicting next-day fires. We find that the MOD14 fire mask is highly stochastic and does not correlate with reasonable fire spread patterns. This is detrimental for machine learning tasks, as the model learns irrational patterns. Therefore, we conclude that MOD14 is unsuitable for next-day fire prediction and that VNP14 is a much better option. However, using MODIS input and VNP14 as target, we achieve a significant improvement in predictability. This indicates that an improved fire detection model is possible for MODIS. The full code and dataset is available online: this https URL
zh
[CV-28] RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
【速读】:该论文旨在解决现有长视频理解基准测试中,由于直接采用均匀帧采样而导致的信息丢失问题,这影响了对多模态大语言模型(MLLMs)真实能力评估的准确性。论文的关键解决方案是提出了一种名为RAG-Adapter的即插即用框架,通过采样与给定问题最相关的帧来减少测试过程中的信息损失。此外,引入了分组监督对比学习(Grouped-supervised Contrastive Learning, GCL)方法,通过在构建的MMAT数据集上微调进一步提升RAG-Adapter的采样效果。实验结果表明,RAG-Adapter采样显著优于均匀采样,为长视频基准测试提供了更准确的评估方法。
链接: https://arxiv.org/abs/2503.08576
作者: Xichen Tan,Yunfan Ye,Yuanjing Luo,Qian Wan,Fang Liu,Zhiping Cai
机构: College of Computer Science and Technology, National University of Defense Technology (国防科技大学); School of Design, Hunan University (湖南大学); College of Computer and Mathematics, Central South University of Forestry and Technology (中南林业科技大学); Faculty of Artificial Intelligence in Education, Central China Normal University (华中师范大学); School of Design, Hunan University (湖南大学); College of Computer Science and Technology, National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 36 figures
点击查看摘要
Abstract:Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, are proposed. However, these benchmarks directly use uniform frame sampling for testing, which results in significant information loss and affects the accuracy of the evaluations in reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., Accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.
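RAG-Adapter 的采样思想可以用“问题嵌入与帧嵌入的相似度检索”来粗略示意:按余弦相似度选取与问题最相关的 k 帧,并保持时间顺序送入 MLLM。下面的 numpy 片段中嵌入维度、帧数均为假设;论文实际使用经 GCL 微调的编码器,此处仅演示检索步骤:

```python
import numpy as np

def topk_relevant_frames(frame_embs, question_emb, k):
    """按与问题向量的余弦相似度挑选最相关的 k 帧(RAG 式采样的极简示意)。"""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    sims = f @ q
    picked = np.argsort(-sims)[:k]
    return np.sort(picked)  # 按时间顺序返回帧索引

rng = np.random.default_rng(42)
frames = rng.normal(size=(100, 16))                  # 假想的 100 帧嵌入
question = frames[30] + 0.05 * rng.normal(size=16)   # 构造与第 30 帧语义相近的问题
idx = topk_relevant_frames(frames, question, k=8)
```

与均匀采样相比,这种按问题检索的采样保留了与问题相关的帧,对应摘要中报告的准确率提升。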
zh
[CV-29] Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation
【速读】:该论文致力于解决多概念模块化定制(modular customization)的问题,即如何高效地将分散训练的不同概念合并到一个定制模型中,同时确保每个概念的身份(identity)不被破坏。现有方法主要存在两个问题:一是后训练方法(post-training methods)仅适用于固定的概念集合,新组合需要重新训练;二是即时合并方法(instant merging methods)容易导致身份丢失(identity loss)和个体概念间的干扰(interference)。论文的关键在于提出BlockLoRA,这是一种即时合并方法,通过Randomized Output Erasure技术减少不同定制模型之间的干扰,并采用Blockwise LoRA Parameterization降低即时合并过程中的身份丢失问题,从而实现高保真(high fidelity)地合并多达15个不同类型的概念,包括人物、主题、场景和风格。
链接: https://arxiv.org/abs/2503.08575
作者: Mingkang Zhu,Xi Chen,Zhongdao Wang,Bei Yu,Hengshuang Zhao,Jiaya Jia
机构: CUHK (香港中文大学); HKUST (香港科技大学); HKU (香港大学); Huawei (华为); SmartMore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent diffusion model customization has shown impressive results in incorporating subject or style concepts with a handful of images. However, the modular composition of multiple concepts into a customized model, aimed to efficiently merge decentralized-trained concepts without influencing their identities, remains unresolved. Modular customization is essential for applications like concept stylization and multi-concept customization using concepts trained by different users. Existing post-training methods are only confined to a fixed set of concepts, and any different combinations require a new round of retraining. In contrast, instant merging methods often cause identity loss and interference of individual merged concepts and are usually limited to a small number of concepts. To address these issues, we propose BlockLoRA, an instant merging method designed to efficiently combine multiple concepts while accurately preserving individual concepts’ identity. With a careful analysis of the underlying reason for interference, we develop the Randomized Output Erasure technique to minimize the interference of different customized models. Additionally, Blockwise LoRA Parameterization is proposed to reduce the identity loss during instant model merging. Extensive experiments validate the effectiveness of BlockLoRA, which can instantly merge 15 concepts of people, subjects, scenes, and styles with high fidelity.
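即时合并多个 LoRA 概念的基线操作,是把各概念的低秩增量直接加回基础权重:W' = W + Σ_i B_i A_i。下面是这一基线做法的极简 numpy 示意(BlockLoRA 在此之上叠加了 Randomized Output Erasure 与分块参数化来抑制干扰和身份丢失,此处未建模;矩阵维度与秩均为假设):

```python
import numpy as np

def merge_lora_concepts(base_w, loras):
    """基线式即时合并:把每个概念的低秩增量 B @ A 累加到基础权重上(示意)。"""
    w = base_w.copy()
    for A, B in loras:   # A: (r, d_in), B: (d_out, r)
        w += B @ A
    return w

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                       # 假想的基础权重
loras = [(0.01 * rng.normal(size=(4, 64)),
          0.01 * rng.normal(size=(64, 4)))
         for _ in range(3)]                         # 3 个概念,秩 r = 4
W_merged = merge_lora_concepts(W, loras)
```

论文分析的“干扰”正来自这些增量在同一权重空间中的相互叠加,这也是 BlockLoRA 引入分块参数化的动机。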
zh
[CV-30] ComicsPAP: understanding comic strips by picking the correct panel
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在理解漫画条连贯性和上下文依赖性方面的显著局限性。尽管现有LMMs在图像描述、视觉问答(VQA)以及视频理解方面取得了进展,但在处理漫画中的复杂时空线索时仍表现不佳。为应对这一挑战,论文引入了ComicsPAP,这是一个专为漫画条理解设计的大规模基准数据集。ComicsPAP包含超过100,000个样本,并通过Pick-a-Panel框架划分为5种子任务,要求模型识别序列中的缺失面板。研究发现,当前最先进的LMMs在这类任务上的表现接近随机水平,揭示了其在捕捉顺序和上下文依赖关系上的重大不足。为弥合这一差距,论文的关键解决方案是针对漫画条理解对LMMs进行适配优化,从而在ComicsPAP上取得了优于10倍更大模型的结果,表明ComicsPAP是一个推动多模态漫画理解研究的重要资源。
链接: https://arxiv.org/abs/2503.08561
作者: Emanuele Vivoli,Artemis Llabrés,Mohamed Ali Soubgui,Marco Bertini,Ernest Valveny Llobet,Dimosthenis Karatzas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, ComicsPAP demands models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapted LMMs for comic strip understanding, obtaining better results on ComicsPAP than 10x bigger models, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.
zh
[CV-31] TLA: Tactile-Language-Action Model for Contact-Rich Manipulation
【速读】:该论文致力于解决语言条件下的机器人接触密集型操作任务中触觉感知研究不足的问题,特别是在复杂装配任务中的触觉反馈处理与策略生成。论文的关键解决方案是提出Tactile-Language-Action (TLA) 模型,通过跨模态语言接地有效处理序列化的触觉反馈,从而在高接触场景中实现鲁棒的策略生成。此外,论文构建了一个包含24k对触觉-动作指令数据的综合数据集,专为指尖轴孔装配(peg-in-hole)任务定制,为TLA模型的训练和评估提供了重要资源。
链接: https://arxiv.org/abs/2503.08548
作者: Peng Hao,Chaofan Zhang,Dingzhe Li,Xiaoge Cao,Xiaoshuai Hao,Shaowei Cui,Shuo Wang
机构: Samsung Research China - Beijing (SRC-B) (三星研究中国 - 北京); The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (多模态人工智能系统国家重点实验室,中国科学院自动化研究所); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Significant progress has been made in vision-language models. However, language-conditioned robotic manipulation for contact-rich tasks remains underexplored, particularly in terms of tactile sensing. To address this gap, we introduce the Tactile-Language-Action (TLA) model, which effectively processes sequential tactile feedback via cross-modal language grounding to enable robust policy generation in contact-intensive scenarios. In addition, we construct a comprehensive dataset that contains 24k pairs of tactile action instruction data, customized for fingertip peg-in-hole assembly, providing essential resources for TLA training and evaluation. Our results show that TLA significantly outperforms traditional imitation learning methods (e.g., diffusion policy) in terms of effective action generation and action accuracy, while demonstrating strong generalization capabilities by achieving over 85% success rate on previously unseen assembly clearances and peg shapes. We publicly release all data and code in the hope of advancing research in language-conditioned tactile manipulation skill learning. Project website: this https URL
zh
[CV-32] Deformable Linear Object Surface Placement Using Elastica Planning and Local Shape Control
【速读】:该论文旨在解决在受限环境中操作可变形线状物体(Deformable Linear Objects, DLOs)放置于平面表面的挑战性任务。论文提出了一种两层结构的方法:高层采用基于欧拉弹性杆解的新型DLO表面放置方法,通过机械手夹持DLO的一个端点,并以DLO的可变内部点作为与放置表面对齐部分的起始点;底层构建了一个管道控制器,利用残差神经网络(ResNet)估计DLO当前形状,并通过低层反馈确保任务执行,即使存在建模和放置误差也能维持操作稳定性。该方法的关键在于结合高层规划与底层反馈控制,从而能够在高阶操作失败时恢复状态,满足实际机器人操作系统的实用需求。实验验证了该方法的有效性,使用了针对生鲜食品应用设计的硅胶模拟对象进行仿真和实证研究。
链接: https://arxiv.org/abs/2503.08545
作者: I. Grinberg,A. Levin,E. D. Rimon
机构: Dept. of ME, Technion (Technion - Israel Institute of Technology)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Manipulation of deformable linear objects (DLOs) in constrained environments is a challenging task. This paper describes a two-layered approach for placing DLOs on a flat surface using a single robot hand. The high-level layer is a novel DLO surface placement method based on Euler’s elastica solutions. During this process one DLO endpoint is manipulated by the robot gripper while a variable interior point of the DLO serves as the start point of the portion aligned with the placement surface. The low-level layer forms a pipeline controller. The controller estimates the DLO current shape using a Residual Neural Network (ResNet) and uses low-level feedback to ensure task execution in the presence of modeling and placement errors. The resulting DLO placement approach can recover from states where the high-level manipulation planner has failed as required by practical robot manipulation systems. The DLO placement approach is demonstrated with simulations and experiments that use silicon mock-up objects prepared for fresh food applications.
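作为背景补充:平面 Euler 弹性曲线(elastica)是弯曲能最小化问题的解,其切向角 $\theta(s)$ 满足经典的摆方程形式(此为标准教科书结果,并非论文原文给出的公式):

$$EI\,\frac{d^{2}\theta}{ds^{2}} + F\sin\bigl(\theta(s)-\phi\bigr)=0,$$

其中 $s$ 为弧长,$EI$ 为弯曲刚度,$F$ 与 $\phi$ 为端点处外力的大小与方向。高层规划即在这一族解中选取满足夹持端点与放置表面约束的 DLO 形状。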
zh
[CV-33] ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification
【速读】:该论文旨在解决高分辨率遥感影像在多光谱数据处理上的性能瓶颈问题,特别是在大规模密集标注数据集上的语义分割任务。传统卷积网络如UNet等因受限于仅能有效处理RGB三通道数据,在面对多光谱遥感影像时表现不足。同时,虽然已有部分基于Transformer架构的方法被提出,但它们通常局限于小规模数据集且缺乏全面评估。
论文的关键在于提出了ChromaFormer这一系列多光谱Transformer模型,并设计了一种新颖的多光谱注意力机制来优化跨波段信息融合。通过在比利时弗拉芒地区覆盖超过13,500平方公里、包含15个类别的密集标注数据集上的实验验证,证明了相较于传统架构(如UNet),参数量大一个数量级以上的多光谱Transformer模型能够显著提升分类精度,例如将准确率从UNet++的不到65%提高到超过95%。
链接: https://arxiv.org/abs/2503.08534
作者: Mingshi Li,Dusan Grujicic,Ben Somers,Stien Heremans,Steven De Saeger,Matthew B. Blaschko
机构: ESAT-PSI, KU Leuven (鲁汶大学); Department of Earth and Environmental Sciences, KU Leuven (鲁汶大学); Research Institute for Nature and Forest (INBO), Belgium (比利时自然与森林研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Remote sensing imagery from systems such as Sentinel provides full coverage of the Earth’s surface at around 10-meter resolution. The remote sensing community has transitioned to extensive use of deep learning models due to their high performance on benchmarks such as the UCMerced and ISPRS Vaihingen datasets. Convolutional models such as UNet and ResNet variations are commonly employed for remote sensing but typically only accept three channels, as they were developed for RGB imagery, while satellite systems provide more than ten. Recently, several transformer architectures have been proposed for remote sensing, but they have not been extensively benchmarked and are typically used on small datasets such as Salinas Valley. Meanwhile, it is becoming feasible to obtain dense spatial land-use labels for entire first-level administrative divisions of some countries. Scaling law observations suggest that substantially larger multi-spectral transformer models could provide a significant leap in remote sensing performance in these settings. In this work, we propose ChromaFormer, a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters to assess their performance and scaling effectiveness on a densely labeled imagery dataset of Flanders, Belgium, covering more than 13,500 km^2 and containing 15 classes. We propose a novel multi-spectral attention strategy and demonstrate its effectiveness through ablations. Furthermore, we show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements: a UNet++ model with 23M parameters achieves less than 65% accuracy, while a multi-spectral transformer with 655M parameters achieves over 95% accuracy on the Biological Valuation Map of Flanders. 
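多光谱注意力最直接的一种形式,是把同一像素(或 patch)的各波段嵌入视作 token 序列做自注意力,从而学习跨波段的信息融合。下面的 numpy 片段仅演示这一通用机制,维度均为假设,并非论文中多光谱注意力策略的复现:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spectral_attention(x, Wq, Wk, Wv):
    """把每个像素的各光谱波段当作 token 做自注意力,融合跨波段信息(示意)。"""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1]))
    return a @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 13, 8))    # 4 个像素,13 个波段(如 Sentinel-2),每波段 8 维嵌入
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = spectral_attention(x, Wq, Wk, Wv)
```

这与只接受 RGB 三通道的卷积骨干形成对比:注意力天然支持任意数量的波段 token。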
zh
[CV-34] Visual Attention Graph
【速读】:该论文旨在解决如何有效编码人类大脑心理世界中视觉对象关系的问题,特别是在预测场景观察过程中的人类视觉注意模式时,现有方法因未充分考虑场景语义信息而导致个体间和个体内高变异性的问题。为了解决这一挑战,论文提出了一种新的注意力表示方法——“注意力图(Attention Graph)”。其关键是将视觉显著性和扫描路径同时以图结构形式进行表征,通过基于语义的扫描路径定义图上的路径,以及通过节点上的注视密度计算对象的显著性,从而更好地揭示人类观察者的共同注意行为。这一创新方法不仅在评估视觉注意预测模型方面提供了更优的基准,还展示了在评估人类认知状态(如自闭症筛查和年龄分类)中的潜力。
链接: https://arxiv.org/abs/2503.08531
作者: Kai-Fu Yang,Yong-Jie Li
机构: University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 20 pages, 14 figures
点击查看摘要
Abstract:Visual attention plays a critical role when our visual system executes active visual tasks by interacting with the physical scene. However, how to encode the visual object relationship in the psychological world of our brain deserves to be explored. In the field of computer vision, predicting visual fixations or scanpaths is a usual way to explore the visual attention and behaviors of human observers when viewing a scene. Most existing methods encode visual attention using individual fixations or scanpaths based on the raw gaze shift data collected from human observers. This may not capture the common attention pattern well, because without considering the semantic information of the viewed scene, raw gaze shift data alone contain high inter- and intra-observer variability. To address this issue, we propose a new attention representation, called Attention Graph, to simultaneously code the visual saliency and scanpath in a graph-based representation and better reveal the common attention behavior of human observers. In the attention graph, the semantic-based scanpath is defined by the path on the graph, while saliency of objects can be obtained by computing fixation density on each node. Systematic experiments demonstrate that the proposed attention graph combined with our new evaluation metrics provides a better benchmark for evaluating attention prediction methods. Meanwhile, extra experiments demonstrate the promising potential of the proposed attention graph in assessing human cognitive states, such as autism spectrum disorder screening and age classification.
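“在每个节点上计算注视密度得到物体显著性、以跨节点转移定义语义扫描路径”这两步,可以用一个很小的统计示意来说明(节点数与注视序列均为虚构):

```python
import numpy as np

def node_saliency(node_of_fixation, n_nodes):
    """把注视点按所属物体节点计数并归一化,得到各节点的显著性(示意)。"""
    counts = np.bincount(node_of_fixation, minlength=n_nodes).astype(float)
    return counts / counts.sum()

def scanpath_edges(node_of_fixation):
    """相邻注视跨节点的转移构成注意力图上的有向边(语义扫描路径,示意)。"""
    seq = np.asarray(node_of_fixation)
    moves = seq[1:] != seq[:-1]
    return list(zip(seq[:-1][moves], seq[1:][moves]))

fixations = [0, 0, 1, 1, 1, 1, 2, 0, 2]   # 每个注视点落在的物体节点编号
sal = node_saliency(fixations, n_nodes=3)
edges = scanpath_edges(fixations)
```

注视在同一节点内的停留不产生边,只有跨物体的注视转移才计入扫描路径,这正是“基于语义的扫描路径”与原始注视坐标序列的区别。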
zh
[CV-35] SignRep: Enhancing Self-Supervised Sign Representations
【速读】:该论文旨在解决手语表示学习中因手势复杂时空特性及标注数据稀缺所面临的独特挑战。现有方法通常依赖于在通用视觉任务上预训练但缺乏手语特定特征的模型,或采用复杂的多模态多分支架构。为弥合这一差距,论文提出了一种可扩展的自监督手语表示学习框架。其关键在于,在RGB模型训练过程中利用重要归纳性手语先验知识,通过在掩码自动编码器预训练时结合基于骨架的简单但重要的线索,同时引入手语特定先验、特征正则化以及对抗风格无关损失,构建强大的骨干网络。此外,该模型在推理阶段无需骨骼关键点,避免了关键点相关模型在下游任务中的局限性。最终,经过微调后,该模型在多个手语识别数据集上达到最先进的性能,并且在手语词典检索和翻译等任务中表现出色,同时降低了现有手语翻译模型的训练计算成本。
链接: https://arxiv.org/abs/2503.08529
作者: Ryan Wong,Necati Cihan Camgoz,Richard Bowden
机构: University of Surrey (萨里大学); Meta Reality Labs (Meta现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks, that lack sign-specific features, or use complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We leverage important inductive (sign) priors during the training of our RGB model. To do this, we leverage simple but important cues based on skeletons while pretraining a masked autoencoder. These sign specific priors alongside feature regularization and an adversarial style agnostic loss provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and with only a single-modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.
zh
[CV-36] GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
【速读】:本文旨在解决强化学习在视觉环境中训练目标导向动作推理的视觉语言模型(Vision-Language Models, VLMs)时,因仅基于动作结果奖励导致生成式思维坍塌(thought collapse)的问题。当奖励仅依赖于动作结果时,强化学习无法有效激励VLM中的链式思维(chain-of-thought, CoT)推理,表现为思维多样性迅速丧失、与状态无关且不完整的推理过程,以及由此产生的无效动作,最终导致负向奖励。为应对这一挑战,本文的关键解决方案是引入过程引导(process guidance),提出一种自动校正器,在每个强化学习步骤中评估并优化代理的推理。这种简单且可扩展的引导式思维强化(Guided Thought Reinforcement, GTR)框架能够在无需密集逐步人工标注的情况下同时训练推理与动作能力。实验表明,GTR显著提升了LLaVA-7b模型在多种视觉环境中的性能与泛化能力,相比现有技术(SoTA)模型实现了3至5倍更高的任务成功率,同时保持了更小的模型规模。
链接: https://arxiv.org/abs/2503.08525
作者: Tong Wei,Yijun Yang,Junliang Xing,Yuanchun Shi,Zongqing Lu,Deheng Ye
机构: Tsinghua University (清华大学); Tencent (腾讯); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent’s thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent’s reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.
zh
[CV-37] High-Quality 3D Head Reconstruction from Any Single Portrait Image
【速读】:该论文致力于解决从单张肖像图像(无论视角、表情或配饰如何)重建高保真3D头部模型的问题。现有方法在利用二维生成模型进行新视角合成和三维优化时,往往难以生成高质量的3D头像,主要受限于身份、表情、头发及配饰等关键信息的缺失。为应对这些挑战,论文构建了一个包含227组序列的高质量数字人肖像数据集,涵盖96种不同视角、总计21,792帧,并包含丰富的表情与配饰变化。解决方案的关键在于将身份和表情信息融入多视角扩散过程,通过身份和表情感知引导与监督机制提取精确的面部表示,从而增强跨视角的面部一致性,并确保生成过程中高度的身份和表情一致性。最终,通过生成包含96个多视角帧的轨道视频,用于3D头像模型重建,该方法在包括侧脸角度和复杂配饰在内的挑战性场景中表现出稳健性能。
链接: https://arxiv.org/abs/2503.08516
作者: Jianfu Zhang,Yujie Gao,Jiahui Zhan,Wentao Wang,Yiyi Zhang,Haohua Zhao,Liqing Zhang
机构: Shanghai Jiaotong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories.
zh
[CV-38] Segmentation-Guided CT Synthesis with Pixel-Wise Conformal Uncertainty Bounds MICCAI2025
【速读】:该论文旨在解决质子治疗中锥形束 CT(Cone Beam CT, CBCT)图像因严重伪影和低质量导致无法精确用于剂量计算的问题。现有基于深度学习的 CBCT-to-CT 翻译方法虽有潜力,但普遍存在解剖结构不一致性和缺乏可靠的不确定性估计,限制了其临床应用。论文提出的解决方案——STF-RUE 框架的关键在于两个组成部分:首先,STF 是一种分割引导的 CBCT-to-CT 翻译方法,通过利用来自计划 CT(planning CT, pCT)的分割先验增强解剖一致性;其次,RUE 是一种一致性预测方法,为预测的合成 CT 提供逐像素一致性预测区间,为临床医生提供稳健的可靠性指标。该框架在两个基准数据集上的实验结果表明,其显著提升了翻译准确性,并提供了更好的合成 CT 不确定性量化,从而推动了更安全有效的自适应质子治疗。
链接: https://arxiv.org/abs/2503.08515
作者: David Vallmanya Poch,Yorick Estievenart,Elnura Zhalieva,Sukanya Patra,Mohammad Yaqub,Souhaib Ben Taieb
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: MICCAI 2025 Conference Submission. Follows the required LNCS format. 12 pages including references. Contains 4 figures and 1 table
点击查看摘要
Abstract:Accurate dose calculations in proton therapy rely on high-quality CT images. While planning CTs (pCTs) serve as a reference for dosimetric planning, Cone Beam CT (CBCT) is used throughout Adaptive Radiotherapy (ART) to generate sCTs for improved dose calculations. Despite its lower cost and reduced radiation exposure advantages, CBCT suffers from severe artefacts and poor image quality, making it unsuitable for precise dosimetry. Deep learning-based CBCT-to-CT translation has emerged as a promising approach. Still, existing methods often introduce anatomical inconsistencies and lack reliable uncertainty estimates, limiting their clinical adoption. To bridge this gap, we propose STF-RUE, a novel framework integrating two key components. First, STF, a segmentation-guided CBCT-to-CT translation method that enhances anatomical consistency by leveraging segmentation priors extracted from pCTs. Second, RUE, a conformal prediction method that augments predicted CTs with pixel-wise conformal prediction intervals, providing clinicians with robust reliability indicator. Comprehensive experiments using UNet++ and Fast-DDPM on two benchmark datasets demonstrate that STF-RUE significantly improves translation accuracy, as measured by a novel soft-tissue-focused metric designed for precise dose computation. Additionally, STF-RUE provides better-calibrated uncertainty sets for synthetic CT, reinforcing trust in synthetic CTs. By addressing both anatomical fidelity and uncertainty quantification, STF-RUE marks a crucial step toward safer and more effective adaptive proton therapy. Code is available at this https URL.
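摘要中 RUE 基于一致性预测(conformal prediction)为合成 CT 构造逐像素预测区间。论文摘要未给出其具体构造,下面用 numpy 给出分裂一致性预测(split conformal)的最小示意,数据与函数名均为本文为演示所作的假设,非论文官方实现:

```python
import numpy as np

def conformal_intervals(cal_pred, cal_true, test_pred, alpha=0.1):
    """split conformal 示意:在校准集上取逐像素绝对残差的 (1-alpha) 分位数,
    作为每个像素预测区间的半宽,期望覆盖率约为 1 - alpha。"""
    scores = np.abs(cal_pred - cal_true)                    # (N, H, W) 非一致性得分
    n = scores.shape[0]
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # 有限样本修正
    q = np.quantile(scores, q_level, axis=0)                # (H, W) 逐像素半宽
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
cal_true = rng.normal(size=(200, 8, 8))
cal_pred = cal_true + rng.normal(scale=0.1, size=(200, 8, 8))
test_true = rng.normal(size=(8, 8))
test_pred = test_true + rng.normal(scale=0.1, size=(8, 8))
lo, hi = conformal_intervals(cal_pred, cal_true, test_pred)
coverage = float(np.mean((test_true >= lo) & (test_true <= hi)))
```

真实场景中残差分布可能随解剖区域变化,RUE 所用的非一致性得分以论文原文为准。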
zh
[CV-39] SAS: Segment Any 3D Scene with Integrated 2D Priors
【速读】:本文旨在解决传统3D模型因固定类别训练而无法识别复杂动态场景中未见过物体的问题,提出了一种名为SAS的简单而有效的方案。关键在于通过文本桥接将多种2D模型对齐到同一嵌入空间(Model Alignment via Text),利用扩散模型显式量化2D模型对不同类别的识别能力(Annotation-Free Model Capability Construction),并在此基础上融合来自不同2D模型的点云特征与构建的能力指导,最终通过特征蒸馏将集成的2D开放词汇能力迁移到3D领域。这一方法在ScanNet v2、Matterport3D和nuScenes等多个数据集上大幅超越现有方法,并在高斯分割和实例分割等下游任务中验证了其泛化性。
链接: https://arxiv.org/abs/2503.08512
作者: Zhuoyuan Li,Jiahao Lu,Jiacheng Deng,Hanzhi Chang,Lifan Wu,Yanzhe Liang,Tianzhu Zhang
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained with fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify the 2D model’s capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused with the guide of constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., gaussian segmentation and instance segmentation.
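速读中“按模型能力加权融合不同 2D 模型的点云特征”这一步,可用如下 numpy 草图说明(能力分矩阵、伪类别标签与 softmax 加权方式均为本文为演示所作的假设,非论文原始实现):

```python
import numpy as np

def capability_fusion(feats, caps, labels):
    """能力引导的特征融合示意。
    feats: (M, N, D) M 个 2D 模型对 N 个点的 D 维特征;
    caps:  (M, C)    各模型对 C 个类别的能力分;
    labels:(N,)      每个点的伪类别。按点所属类别的能力分做模型间 softmax 加权。"""
    w = caps[:, labels]                          # (M, N) 取出各点类别对应的能力分
    w = np.exp(w) / np.exp(w).sum(axis=0)        # 模型维度上做 softmax
    return (w[:, :, None] * feats).sum(axis=0)   # (N, D) 加权求和

rng = np.random.default_rng(2)
feats = rng.normal(size=(2, 5, 4))
caps = np.array([[10.0, 0.0], [0.0, 10.0]])      # 模型0擅长类0,模型1擅长类1
labels = np.zeros(5, dtype=int)                  # 全部点的伪标签为类0
fused = capability_fusion(feats, caps, labels)
```

在类0的点上,融合结果几乎完全来自能力分更高的模型0。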
zh
[CV-40] PCGS: Progressive Compression of 3D Gaussian Splatting
【速读】:该论文试图解决3D Gaussian Splatting (3DGS) 在实际应用中因数据量庞大而导致的资源浪费问题,尤其是在按需应用场景下,现有压缩技术由于缺乏渐进性(progressivity)而无法有效利用已有位流。论文的关键解决方案是提出PCGS (Progressive Compression of 3D Gaussian Splatting),通过自适应控制高斯点(或锚点)的数量与质量实现渐进性。具体而言,对于数量控制,引入渐进掩码策略,在新增锚点的同时优化已有锚点以提升保真度;对于质量控制,采用渐进量化方法逐步减小量化步长,以更精细地建模高斯属性。此外,通过利用已有量化结果优化概率预测,提高熵编码在各渐进层级上的效率,从而紧凑增量位流。最终,PCGS实现了渐进性,并保持了与非渐进方法相当的压缩性能。
链接: https://arxiv.org/abs/2503.08511
作者: Yihang Chen,Mengyao Li,Qianyi Wu,Weiyao Lin,Mehrtash Harandi,Jianfei Cai
机构: Shanghai Jiao Tong University (上海交通大学); Monash University (蒙纳士大学); Shanghai University (上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) achieves impressive rendering fidelity and speed for novel view synthesis. However, its substantial data size poses a significant challenge for practical applications. While many compression techniques have been proposed, they fail to efficiently utilize existing bitstreams in on-demand applications due to their lack of progressivity, leading to a waste of resource. To address this issue, we propose PCGS (Progressive Compression of 3D Gaussian Splatting), which adaptively controls both the quantity and quality of Gaussians (or anchors) to enable effective progressivity for on-demand applications. Specifically, for quantity, we introduce a progressive masking strategy that incrementally incorporates new anchors while refining existing ones to enhance fidelity. For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. Furthermore, to compact the incremental bitstreams, we leverage existing quantization results to refine probability prediction, improving entropy coding efficiency across progressive levels. Overall, PCGS achieves progressivity while maintaining compression performance comparable to SoTA non-progressive methods. Code available at: this http URL.
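PCGS 的渐进量化思想——逐层减半量化步长、只传输各层增量——可用如下简化示意说明(对标量属性、步长减半的演示设定为本文假设,非论文完整方案):

```python
import numpy as np

def progressive_quantize(x, base_step, levels):
    """渐进量化示意:每一层用减半的步长量化上一层的残差,
    各层增量即对应可渐进传输的位流片段,叠加后逐步逼近原值。"""
    recon = np.zeros_like(x)
    step = base_step
    increments = []
    for _ in range(levels):
        q = np.round((x - recon) / step) * step   # 当前层的增量
        increments.append(q)
        recon = recon + q
        step /= 2                                  # 步长逐层减半
    return recon, increments

x = np.array([0.37, -1.24, 2.05])
r1, _ = progressive_quantize(x, base_step=0.5, levels=1)   # 仅第一层
r3, _ = progressive_quantize(x, base_step=0.5, levels=3)   # 累积三层
```

每增加一层,重建误差的上界就减半(最后一层步长的一半),对应按需解码更多位流得到更高保真度。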
zh
[CV-41] External Knowledge Injection for CLIP-Based Class-Incremental Learning
【速读】:该论文旨在解决在基于 Class-Incremental Learning (CIL) 的场景下,利用预训练视觉-语言模型(如 CLIP)进行持续学习时存在的两个主要问题:一是 CLIP 仅通过匹配视觉嵌入与类别名称进行决策,未能充分利用语言模态中丰富的上下文信息;二是由于模型持续更新,详细特征容易被覆盖,需要外部知识进行补充。论文的关键解决方案是提出了一种名为 ExterNal knowledGe INjEction (ENGINE) 的方法,通过引入一种双分支注入微调框架,从视觉和文本两个模态中编码有用的知识。具体而言,视觉分支通过数据增强丰富视觉特征,而文本分支则借助 GPT-4 重写判别性描述符。此外,还实现了推理阶段的后注入知识以重新排序预测结果,从而更好地捕捉随数据演化而来的有用特征。
链接: https://arxiv.org/abs/2503.08510
作者: Da-Wei Zhou,Kai-Wen Li,Jingyi Ning,Han-Jia Ye,Lijun Zhang,De-Chuan Zhan
机构: School of Artificial Intelligence, Nanjing University (南京大学); National Key Laboratory for Novel Software Technology, Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Code is available at: this https URL
点击查看摘要
Abstract:Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of ``cat’’ can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE. Code is available at: this https URL
zh
[CV-42] Referring to Any Person
【速读】:该论文旨在解决现有模型在根据自然语言描述检测任意个体任务(referring to any person)上的实际可用性不足问题,以及当前基准数据集因局限于一对一指代关系而阻碍该领域进展的问题。为应对这些挑战,论文从任务定义、数据集设计和模型架构三个关键视角重新审视该任务。解决方案的关键在于提出了一种新的数据集HumanRef,它更好地反映了真实世界的应用场景,并且通过将多模态大语言模型与目标检测框架相结合,构建了一个名为RexSeek的鲁棒指代表征模型。实验结果表明,尽管最先进的模型在RefCOCO/+/g等常用基准上表现良好,但在HumanRef上却表现出对多人检测的不足,而RexSeek不仅在人物指代方面表现出色,还能有效推广到普通物体指代任务,使其在多种感知任务中具有广泛应用潜力。
链接: https://arxiv.org/abs/2503.08507
作者: Qing Jiang,Lin Wu,Zhaoyang Zeng,Tianhe Ren,Yuda Xiong,Yihao Chen,Qin Liu,Lei Zhang
机构: International Digital Economy Academy (IDEA); South China University of Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, that hinder progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at this https URL
zh
[CV-43] CFNet: Optimizing Remote Sensing Change Detection through Content-Aware Enhancement
【速读】:该论文旨在解决双时相遥感图像变化检测任务中因获取条件变化导致的图像样式差异问题,这些差异会不可避免地影响深度神经网络(DNNs)对变化区域的准确检测能力。为解决上述问题,论文提出了一种名为Content Focuser Network (CFNet) 的方法,其关键在于采用内容感知策略(Content-Aware)。具体而言,CFNet 利用 EfficientNet-B5 提取特征,并通过开发一种约束策略优先关注双时相图像的内容特征以减轻样式特征的误导影响;同时设计了一个基于双时相图像特征余弦距离的重加权模块(Focuser),使模型能够根据不同阶段的需求灵活聚焦于变化或未变化区域。
链接: https://arxiv.org/abs/2503.08505
作者: Fan Wu,Sijun Dong,Xiaoliang Meng
机构: School of Remote Sensing and Information Engineering, Wuhan University (武汉大学); Hubei Luojia Laboratory (湖北珞珈实验室), Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 12 figures
点击查看摘要
Abstract:Change detection is a crucial and widely applied task in remote sensing, aimed at identifying and analyzing changes occurring in the same geographical area over time. Due to variability in acquisition conditions, bi-temporal remote sensing images often exhibit significant differences in image style. Even with the powerful generalization capabilities of DNNs, these unpredictable style variations between bi-temporal images inevitably affect model’s ability to accurately detect changed areas. To address issue above, we propose the Content Focuser Network (CFNet), which takes content-aware strategy as a key insight. CFNet employs EfficientNet-B5 as the backbone for feature extraction. To enhance the model’s focus on the content features of images while mitigating the misleading effects of style features, we develop a constraint strategy that prioritizes the content features of bi-temporal images, termed Content-Aware. Furthermore, to enable the model to flexibly focus on changed and unchanged areas according to the requirements of different stages, we design a reweighting module based on the cosine distance between bi-temporal image features, termed Focuser. CFNet achieve outstanding performance across three well-known change detection datasets: CLCD (F1: 81.41%, IoU: 68.65%), LEVIR-CD (F1: 92.18%, IoU: 85.49%), and SYSU-CD (F1: 82.89%, IoU: 70.78%). The code and pretrained models of CFNet are publicly released at this https URL.
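Focuser 模块基于双时相特征的余弦距离进行重加权,可用如下草图示意(归一化与加权的具体形式为本文假设,非官方实现):

```python
import numpy as np

def focuser_weights(feat_a, feat_b, focus="changed"):
    """Focuser 重加权示意:以双时相特征的归一化余弦距离作为变化显著性。
    feat_a/feat_b: (C, H, W)。返回 (H, W) 权重图,取值 [0, 1]。"""
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + 1e-8)
    cos_sim = (a * b).sum(axis=0)            # 余弦相似度, [-1, 1]
    cos_dist = (1.0 - cos_sim) / 2.0         # 归一化余弦距离, [0, 1]
    return cos_dist if focus == "changed" else 1.0 - cos_dist

fa = np.ones((4, 2, 2))
w_same = focuser_weights(fa, fa)             # 特征未变化 -> 变化权重趋近 0
w_flip = focuser_weights(fa, -fa)            # 特征反向   -> 变化权重趋近 1
```

通过切换 `focus` 参数,模型在不同阶段可分别聚焦变化或未变化区域。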
zh
[CV-44] MMRL: Multi-Modal Representation Learning for Vision-Language Models CVPR2025
【速读】:该论文旨在解决大规模预训练视觉-语言模型(Vision-Language Models, VLMs)在仅使用少量样本进行适配时容易过拟合的问题,从而影响其在新任务上的性能。为了解决这一挑战,论文提出了一种新颖的多模态表示学习(Multi-Modal Representation Learning, MMRL)框架。该框架的关键在于引入一个共享、可学习且与模态无关的表示空间,通过将表征令牌投影到文本和图像表示令牌来促进更有效的多模态交互。不同于仅优化类别令牌特征的传统方法,MMRL 在编码器的高层整合表征令牌,以捕捉数据集特定的特征,同时在低层保留通用知识。此外,在训练过程中,同时优化表征和类别特征,并通过可训练的投影层处理表征令牌,而类别令牌投影层保持冻结以保留预训练知识。另外,引入正则化项以对齐类别特征和文本特征与冻结 VLM 的零样本特征,从而保护模型的泛化能力。在推理阶段采用解耦策略,即基类使用表征和类别特征,而新任务仅使用保留更多通用知识的类别特征。大量实验结果表明,MMRL 方法在 15 个数据集上优于现有技术,实现了任务特定适配与泛化之间的平衡。
链接: https://arxiv.org/abs/2503.08497
作者: Yuncheng Guo,Xiaodong Gu
机构: Department of Electronic Engineering, Fudan University (复旦大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders–where dataset-specific features are more prominent–while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class features and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model’s generalization capacity. For inference, a decoupling strategy is employed, wherein both representation and class features are utilized for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at this https URL.
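文中用于保护泛化能力的正则项将可学习特征与冻结 VLM 的零样本特征对齐;一种常见的实现是最小化 1 - 余弦相似度的均值,下面给出该形式的最小示意(具体正则形式以论文为准):

```python
import numpy as np

def alignment_reg(feat, zs_feat):
    """对齐正则示意:最小化与冻结零样本特征之间 (1 - 余弦相似度) 的均值。
    feat/zs_feat: (N, D)。返回标量损失。"""
    f = feat / (np.linalg.norm(feat, axis=1, keepdims=True) + 1e-8)
    z = zs_feat / (np.linalg.norm(zs_feat, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(1.0 - np.sum(f * z, axis=1)))

f = np.eye(3)
loss_same = alignment_reg(f, f)                      # 完全对齐 -> 损失趋近 0
loss_orth = alignment_reg(f, np.roll(f, 1, axis=1))  # 两两正交 -> 损失为 1
```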
zh
[CV-45] SuperCap: Multi-resolution Superpixel-based Image Captioning
【速读】:该论文致力于解决图像描述任务中过度依赖目标检测的问题,旨在通过结合超像素(Superpixels)与视觉语言模型(Vision Language Models, VLMs)来弥合基于检测器的描述架构与仅在大规模数据集上预训练的架构之间的差距。解决方案的关键在于提出了一种新颖的超像素方法,确保模型能够接收类似对象的特征,同时利用VLMs赋予模型开放集对象理解能力。此外,该研究进一步扩展了模型架构,使其能够处理多分辨率输入,并采用注意力机制确定与描述最相关的图像部分。这一系列设计使得模型在COCO Karpathy分割测试集上取得了竞争性的CIDEr分数136.9。
链接: https://arxiv.org/abs/2503.08496
作者: Henry Senior,Luca Rossi,Gregory Slabaugh,Shanxin Yuan
机构: Queen Mary University of London (伦敦玛丽女王大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures
点击查看摘要
Abstract:It has been a longstanding goal within image captioning to move beyond a dependence on object detection. We investigate using superpixels coupled with Vision Language Models (VLMs) to bridge the gap between detector-based captioning architectures and those that solely pretrain on large datasets. Our novel superpixel approach ensures that the model receives object-like features whilst the use of VLMs provides our model with open set object understanding. Furthermore, we extend our architecture to make use of multi-resolution inputs, allowing our model to view images in different levels of detail, and use an attention mechanism to determine which parts are most relevant to the caption. We demonstrate our model’s performance with multiple VLMs and through a range of ablations detailing the impact of different architectural choices. Our full model achieves a competitive CIDEr score of 136.9 on the COCO Karpathy split.
zh
[CV-46] TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting
【速读】:本文旨在解决自监督3D占用预测中密集体素解码器训练成本高、难以适应不同体素分辨率和新类别以及对快速移动物体处理能力不足的问题。论文的关键在于提出了一种名为TT-GaussOcc的实用且灵活的测试时占用预测框架。该方案在运行时从原始传感器数据增量优化时间感知的3D高斯分布,支持任意用户指定的体素分辨率。具体而言,TT-GaussOcc通过“提升-移动-体素化”三步法实现:首先利用2D视觉基础模型提取的周边视图语义信息,在非空3D空间实例化高斯分布;接着将前一帧中的动态高斯分布沿估计的高斯场景流移动以完成外观修复并消除快速移动物体的拖尾伪影,同时累积静态高斯分布以增强时间一致性;最后通过周期性平滑相邻高斯分布来减轻语义预测和场景流向量中的固有噪声,采用联合考虑颜色、语义和空间亲和性的三边RBF核函数。最终,结合历史静态和当前动态高斯分布进行体素化以生成占用预测。大量实验表明,TT-GaussOcc在Occ3D和nuCraft数据集上的mIoU指标比自监督基线高出46%,并且无需任何离线训练即可支持更精细的体素分辨率,同时保持每秒2.6帧的推理速度。
链接: https://arxiv.org/abs/2503.08485
作者: Fengyi Zhang,Huitong Yang,Zheng Zhang,Zi Huang,Yadan Luo
机构: The University of Queensland (澳大利亚昆士兰大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Self-supervised 3D occupancy prediction offers a promising solution for understanding complex driving scenes without requiring costly 3D annotations. However, training dense voxel decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and such models often fail to adapt to varying voxel resolutions or new classes without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-GaussOcc. Our approach incrementally optimizes time-aware 3D Gaussians instantiated from raw sensor streams at runtime, enabling voxelization at arbitrary user-specified resolution. Specifically, TT-GaussOcc operates in a “lift-move-voxel” symphony: we first “lift” surrounding-view semantics obtained from 2D vision foundation models (VLMs) to instantiate Gaussians at non-empty 3D space; Next, we “move” dynamic Gaussians from previous frames along estimated Gaussian scene flow to complete appearance and eliminate trailing artifacts of fast-moving objects, while accumulating static Gaussians to enforce temporal consistency; Finally, we mitigate inherent noises in semantic predictions and scene flow vectors by periodically smoothing neighboring Gaussians during optimization, using proposed trilateral RBF kernels that jointly consider color, semantic, and spatial affinities. The historical static and current dynamic Gaussians are then combined and voxelized to generate occupancy prediction. Extensive experiments on Occ3D and nuCraft with varying voxel resolutions demonstrate that TT-GaussOcc surpasses self-supervised baselines by 46% on mIoU without any offline training, and supports finer voxel resolutions at 2.6 FPS inference speed.
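文中平滑相邻高斯所用的“三边 RBF 核”联合考虑颜色、语义与空间亲和性,可按如下方式草拟(带宽参数与语义距离的具体度量为本文假设):

```python
import numpy as np

def trilateral_weight(xi, xj, ci, cj, si, sj,
                      sigma_x=1.0, sigma_c=0.5, sigma_s=0.5):
    """三边 RBF 核示意:空间、颜色、语义三个亲和度的高斯核相乘。
    语义此处以概率向量间的欧氏距离近似,具体度量为本文假设。"""
    dx2 = float(np.sum((xi - xj) ** 2))
    dc2 = float(np.sum((ci - cj) ** 2))
    ds2 = float(np.sum((si - sj) ** 2))
    return float(np.exp(-dx2 / (2 * sigma_x ** 2))
                 * np.exp(-dc2 / (2 * sigma_c ** 2))
                 * np.exp(-ds2 / (2 * sigma_s ** 2)))

p = np.zeros(3)
c = np.array([0.5, 0.5, 0.5])
s = np.array([1.0, 0.0])
w_self = trilateral_weight(p, p, c, c, s, s)          # 与自身:权重为 1
w_near = trilateral_weight(p, p + 0.5, c, c, s, s)    # 空间邻近
w_far = trilateral_weight(p, p + 2.0, c, c, s, s)     # 空间较远
```

任一维度(空间、颜色或语义)差异增大都会使平滑权重单调下降,从而只在“三重相似”的高斯之间传播信息。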
zh
[CV-47] Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum
【速读】:该论文试图解决生成式 AI (Generative AI) 图像检测在未见过的生成模型上的泛化性能不足的问题。解决方案的关键在于利用不同生成模型产生的图像所共有的频谱分形自相似性特征。论文提出的方法通过分析频谱的分形分支之间的相似性,而非直接分析频谱本身,从而减轻了不同生成器之间频谱特性变化的影响,提高了对未见过生成模型所产生图像的检测性能。
链接: https://arxiv.org/abs/2503.08484
作者: Shengpeng Xiao,Yuanfang Guo,Heqi Peng,Zeming Liu,Liang Yang,Yunhong Wang
机构: Beihang University (北京航空航天大学); Hebei University of Technology (河北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The generalization performance of AI-generated image detection remains a critical challenge. Although most existing methods perform well in detecting images from generative models included in the training set, their accuracy drops significantly when faced with images from unseen generators. To address this limitation, we propose a novel detection method based on the fractal self-similarity of the spectrum, a common feature among images generated by different models. Specifically, we demonstrate that AI-generated images exhibit fractal-like spectral growth through periodic extension and low-pass filtering. This observation motivates us to exploit the similarity among different fractal branches of the spectrum. Instead of directly analyzing the spectrum, our method mitigates the impact of varying spectral characteristics across different generators, improving detection performance for images from unseen models. Experiments on a public benchmark demonstrated the generalized detection performance across both GANs and diffusion models.
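“比较频谱分形分支之间的相似性”这一思路,可用如下草图说明:取幅度谱的中心低频块与整谱 2x 下采样后的近似分支,计算二者的余弦相似度(分支的划分方式为本文演示用的假设,非论文原始特征):

```python
import numpy as np

def branch_similarity(img):
    """频谱分支相似度示意:比较幅度谱的中心低频块与整谱 2x 下采样分支。
    若频谱呈“周期延拓 + 低通”式的分形增长,两个分支应高度相似。"""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spec.shape
    ch, cw = h // 2, w // 2
    base = spec[ch - h // 4: ch + h // 4, cw - w // 4: cw + w // 4]  # 低频分支
    down = spec[::2, ::2]                                            # 下采样分支
    a, b = base.ravel(), down.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

rng = np.random.default_rng(3)
img = rng.normal(size=(32, 32))
score = branch_similarity(img)   # 幅度谱非负,相似度落在 [0, 1]
```

这样的分支间相似度特征只依赖分支之间的关系,而非频谱的绝对形态,因而对不同生成器的频谱差异更不敏感。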
zh
[CV-48] GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields
【速读】:该论文旨在解决现有3D风格化技术主要关注静态场景的问题,而现实世界是动态的,包含移动物体和变化环境。此外,现有的风格迁移方法主要集中在外观(如颜色和纹理变换)上,却往往忽略了风格图像的几何特征,而这对于实现完整且一致的风格化效果至关重要。为了解决这些问题,论文提出了一种名为GAS-NeRF的新方法,用于动态辐射场中的外观与几何联合风格化。该方法的关键在于利用深度图提取并转移几何细节到辐射场中,然后进行外观迁移。实验结果表明,此方法显著提升了风格化质量,并在动态场景中保持了时间一致性。
链接: https://arxiv.org/abs/2503.08483
作者: Nhat Phuong Anh Vu,Abhishek Saroha,Or Litany,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Technion (以色列理工学院); Nvidia (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance – such as color and texture transformation – but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.
zh
[CV-49] A Multimodal Physics-Informed Neural Network Approach for Mean Radiant Temperature Modeling
【速读】:本文旨在解决户外热舒适性评估中传统方法(现场测量和计算模拟)资源消耗大且难以高效推广的问题。论文的关键在于提出了一种基于Physics-Informed Neural Network (PINN) 的新方法,该方法结合短波和长波辐射建模与深度学习技术,利用包含气象数据、建成环境特征及鱼眼图像衍生遮阳信息的多模态数据集,在保持物理一致性的同时提升预测精度。通过这一创新方案,论文展示了如何弥合计算建模与实际应用之间的差距,为城市热舒适性评估提供了一种可扩展且可解释的解决方案。
链接: https://arxiv.org/abs/2503.08482
作者: Pouya Shaeri,Saud AlKhaled,Ariane Middel
机构: School of Computing and Augmented Intelligence, Arizona State University (亚利桑那州立大学); College of Architecture, Kuwait University (科威特大学); School of Arts, Media and Engineering, Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
点击查看摘要
Abstract:Outdoor thermal comfort is a critical determinant of urban livability, particularly in hot desert climates where extreme heat poses challenges to public health, energy consumption, and urban planning. Mean Radiant Temperature ( T_mrt ) is a key parameter for evaluating outdoor thermal comfort, especially in urban environments where radiation dynamics significantly impact human thermal exposure. Traditional methods of estimating T_mrt rely on field measurements and computational simulations, both of which are resource intensive. This study introduces a Physics-Informed Neural Network (PINN) approach that integrates shortwave and longwave radiation modeling with deep learning techniques. By leveraging a multimodal dataset that includes meteorological data, built environment characteristics, and fisheye image-derived shading information, our model enhances predictive accuracy while maintaining physical consistency. Our experimental results demonstrate that the proposed PINN framework outperforms conventional deep learning models, with the best-performing configurations achieving an RMSE of 3.50 and an R^2 of 0.88. This approach highlights the potential of physics-informed machine learning in bridging the gap between computational modeling and real-world applications, offering a scalable and interpretable solution for urban thermal comfort assessments.
zh
[CV-50] PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
【速读】:该论文旨在解决现有状态-of-the-art视觉-语言模型(Vision-Language Models, VLMs)在执行具身视觉推理任务时,因缺乏对机器人物理可达性的理解而产生不准确或不切实际响应的问题。论文的关键解决方案在于提出了一种跨多种机器人的统一物理可达性表示——空间-物理可达性图(Space-Physical Reachability Map, S-P Map),以及一种将可达性信息整合到视觉推理中的视觉-语言模型PhysVLM。具体而言,S-P Map将机器人的物理可达性抽象为一种通用的空间表示,从而摆脱特定机器人配置的束缚;PhysVLM通过引入额外的功能编码器处理S-P Map,扩展了传统VLM架构,在不损害其一般视觉-语言能力的前提下增强了对物理可达性的推理能力。这一方案的核心创新点在于通过S-P Map实现了物理可达性与视觉-语言推理的有效结合。
链接: https://arxiv.org/abs/2503.08481
作者: Weijie Zhou(1,2),Manli Tao(2),Chaoyang Zhao(2,3),Haiyun Guo(2),Honghui Dong(1),Ming Tang(2),Jinqiao Wang(1,2,3,4) ((1) School of Traffic and Transportation, Beijing Jiaotong University, (2) Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences, (3) ObjectEye Inc., (4) Guangdong Provincial Key Laboratory of Intellectual Property & Big Data, Guangdong Polytechnic Normal University)
机构: School of Traffic and Transportation, Beijing Jiaotong University (北京交通大学交通与运输学院); Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所基础模型研究中心); ObjectEye Inc. (ObjectEye Inc.); Guangdong Provincial Key Laboratory of Intellectual Property & Big Data, Guangdong Polytechnic Normal University (广东技术师范大学知识产权与大数据广东省重点实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Understanding the environment and a robot’s physical reachability is crucial for task execution. While state-of-the-art vision-language models (VLMs) excel in environmental perception, they often generate inaccurate or impractical responses in embodied visual reasoning tasks due to a lack of understanding of robotic physical reachability. To address this issue, we propose a unified representation of physical reachability across diverse robots, i.e., Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model that integrates this reachability information into visual reasoning. Specifically, the S-P Map abstracts a robot’s physical reachability into a generalized spatial representation, independent of specific robot configurations, allowing the model to focus on reachability features rather than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM architectures by incorporating an additional feature encoder to process the S-P Map, enabling the model to reason about physical reachability without compromising its general vision-language capabilities. To train and evaluate PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a challenging benchmark, EQA-phys, which includes tasks for six different robots in both simulated and real-world environments. Experimental results demonstrate that PhysVLM outperforms existing models, achieving a 14% improvement over GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map shows strong compatibility with various VLMs, and its integration into GPT-4o-mini yields a 7.1% performance improvement.
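S-P Map 将机器人的物理可达性抽象为空间表示;其构造细节摘要未给出,下面以一个两连杆平面机械臂的二维栅格可达图作最简类比(连杆长度、采样方式、栅格参数均为演示假设,真实 S-P Map 面向 3D 空间与多种机器人构型):

```python
import numpy as np

def sp_map_2link(l1, l2, grid=32, extent=2.2, samples=20000, seed=0):
    """二维简化的可达图:随机采样两连杆关节角,
    用前向运动学求末端位置并栅格化为布尔可达图。"""
    rng = np.random.default_rng(seed)
    t1 = rng.uniform(-np.pi, np.pi, samples)
    t2 = rng.uniform(-np.pi, np.pi, samples)
    x = l1 * np.cos(t1) + l2 * np.cos(t1 + t2)   # 前向运动学
    y = l1 * np.sin(t1) + l2 * np.sin(t1 + t2)
    ix = ((x + extent) / (2 * extent) * grid).astype(int).clip(0, grid - 1)
    iy = ((y + extent) / (2 * extent) * grid).astype(int).clip(0, grid - 1)
    reachable = np.zeros((grid, grid), dtype=bool)
    reachable[iy, ix] = True
    return reachable

m = sp_map_2link(1.0, 1.0)   # 两连杆各长 1,最大伸展半径为 2
```

这样的布尔栅格与具体关节参数解耦,模型只需消费“哪里可达”这一空间信息,与速读中“摆脱特定机器人配置”的思路一致。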
zh
[CV-51] NullFace: Training-Free Localized Face Anonymization
【速读】:该论文旨在解决数字时代日益增长的摄像头隐私问题,尽管现有的去标识化方法能够隐藏身份信息,但通常难以同时保持图像的实用性。论文提出了一种无需训练的面部去标识化方法,能够保留与身份无关的关键属性。解决方案的关键在于利用预训练的文本到图像扩散模型,通过条件化的去噪扩散过程修改身份嵌入,从而实现与原始身份不同的匿名化人脸生成,同时支持局部区域的去标识化操作。这种方法在去标识化效果、属性保留及图像质量方面表现出色,并具备灵活性和鲁棒性,适合实际应用。
链接: https://arxiv.org/abs/2503.08478
作者: Han-Wei Kung,Tuomas Varanka,Terence Sim,Nicu Sebe
机构: University of Trento (特伦托大学); University of Oulu (奥卢大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Privacy concerns around ever increasing number of cameras are increasing in today’s digital age. Although existing anonymization methods are able to obscure identity information, they often struggle to preserve the utility of the images. In this work, we introduce a training-free method for face anonymization that preserves key non-identity-related attributes. Our approach utilizes a pre-trained text-to-image diffusion model without requiring optimization or training. It begins by inverting the input image to recover its initial noise. The noise is then denoised through an identity-conditioned diffusion process, where modified identity embeddings ensure the anonymized face is distinct from the original identity. Our approach also supports localized anonymization, giving users control over which facial regions are anonymized or kept intact. Comprehensive evaluations against state-of-the-art methods show our approach excels in anonymization, attribute preservation, and image quality. Its flexibility, robustness, and practicality make it well-suited for real-world applications. Code and data can be found at this https URL .
zh
[CV-52] rackOcc: Camera-based 4D Panoptic Occupancy Tracking ICRA2025
【速读】:该论文致力于解决从单一相机输入中同时实现全景占用分割和物体跟踪的问题,即提出了 Camera-based 4D Panoptic Occupancy Tracking 这一全新任务。传统基于相机的感知任务(如 3D 物体跟踪和语义占用预测)在空间全面性或时间一致性方面存在不足。为解决此问题,论文提出了一种名为 TrackOcc 的创新方法,通过流式端到端处理方式结合 4D 占用查询来完成任务。关键在于利用定位感知损失函数 (localization-aware loss),在不依赖额外技巧的情况下显著提升 4D 全景占用跟踪的准确性。实验结果表明,TrackOcc 在 Waymo 数据集上达到了当前最优性能。
链接: https://arxiv.org/abs/2503.08471
作者: Zhuoguang Chen,Kenan Li,Xiuyu Yang,Tao Jiang,Yiming Li,Hang Zhao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); IIIS, Tsinghua University (清华大学交叉信息研究院); Shanghai Qi Zhi Institute (上海期智研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICRA 2025
点击查看摘要
Abstract:Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at this https URL.
zh
[CV-53] A Data Aggregation Visualization System supported by Processing-in-Memory
【速读】:该论文试图解决在大规模数据集上高效生成高质量二维可视化的问题,特别是针对聚合查询的数据可视化。传统方法在处理高频数据值时分配的像素资源不足,导致可视化效果受限。论文的关键解决方案在于DIVAN系统,它通过频率归一化(frequency normalization)在一维轴上进行自适应分箱(binning),从而在二维可视化中为高频数据值分配更多像素资源。此外,DIVAN利用中央处理器(CPU)或处理在内存(Processing-in-Memory, PIM)架构加速聚合计算,显著提升了大尺度数据集上的可视化生成效率,在包含1亿行和32列的数据集中实现了每分钟生成4,960个大小为128×128×128的聚合体的能力,并且相比现代CPU,使用PIM可将聚合计算速度提升45%-64%。
链接: https://arxiv.org/abs/2503.08463
作者: Junyoung Kim,Madhulika Balakumar,Kenneth Ross
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 11 figures
点击查看摘要
Abstract:Data visualization of aggregation queries is one of the most common ways of doing data exploration and data science as it can help identify correlations and patterns in the data. We propose DIVAN, a system that automatically normalizes the one-dimensional axes by frequency to generate large numbers of two-dimensional visualizations. DIVAN normalizes the input data via binning to allocate more pixels to data values that appear more frequently in the dataset. DIVAN can utilize either CPUs or Processing-in-Memory (PIM) architectures to quickly calculate aggregates to support the visualizations. On real world datasets, we show that DIVAN generates visualizations that highlight patterns and correlations, some expected and some unexpected. By using PIM, we can calculate aggregates 45%-64% faster than modern CPUs on large datasets. For use cases with 100 million rows and 32 columns, our system is able to compute 4,960 aggregates (each of size 128x128x128) in about a minute.
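DIVAN 的“按频率归一化分箱”本质上是等频分箱:以分位数为箱边界,使高频取值区间获得更多像素。一维轴的最小示意如下(接口与变量名为本文假设):

```python
import numpy as np

def frequency_normalized_bins(values, n_bins):
    """等频分箱示意:以分位数作为箱边界,各箱样本数大致相等,
    从而让出现频率高的取值区间在可视化中分得更多像素。
    返回每个样本所属的箱编号。"""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # 覆盖全部取值范围
    return np.searchsorted(edges, values, side="right") - 1

rng = np.random.default_rng(1)
vals = rng.exponential(size=10_000)              # 偏态数据:大量小值、少数大值
bins = frequency_normalized_bins(vals, 8)
counts = np.bincount(bins, minlength=8)          # 各箱计数近似相等
```

对两个轴分别做等频分箱后,再对 (bin_x, bin_y) 做聚合计数,即得到频率归一化的二维可视化网格。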
zh
[CV-54] Controlling Latent Diffusion Using Latent CLIP
【速读】:该论文试图解决的问题是如何在降低计算成本的同时,有效利用对比学习(CLIP)模型评估真实图像与生成图像的内容。当前的CLIP模型在像素空间中运行,而扩散模型已迁移到潜在空间,这种不匹配导致在处理生成图像时需要昂贵的变分自编码器(VAE)解码步骤。论文的解决方案的关键在于引入Latent-CLIP,这是一种直接在潜在空间中操作的CLIP模型。通过在2.7亿对潜在图像和描述性文本上进行训练,Latent-CLIP不仅实现了与现有CLIP模型相当的零样本分类性能,还显著降低了基于奖励的噪声优化(ReNO)管道的成本,并成功引导生成过程远离有害内容。
链接: https://arxiv.org/abs/2503.08455
作者: Jason Becker,Chris Wendler,Peter Baylies,Robert West,Christian Wressnegger
机构: EPFL (瑞士联邦理工学院); Northeastern University (东北大学); Leonardo AI; EPFL (瑞士联邦理工学院); KASTEL Security Research Labs, Karlsruhe Institute of Technology (KIT) (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational costs. However, while the diffusion process has moved to the latent space, the contrastive language-image pre-training (CLIP) models, as used in many image processing tasks, still operate in pixel space. Doing so requires costly VAE-decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and a LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.
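Latent-CLIP的核心收益可以用一个玩具流程示意:像素空间CLIP必须先经过VAE解码才能打分,而latent空间打分直接作用于潜表示。下面的代码仅为概念性草图,其中的编码/解码函数均为随机占位实现,形状假设遵循SD风格VAE(1024×1024×3图像对应128×128×4潜表示),并非论文的真实模型:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes following an SD-style VAE (8x spatial downsample, 4 channels).
IMG_SHAPE = (1024, 1024, 3)
LATENT_SHAPE = (128, 128, 4)

def vae_decode(z):
    # Placeholder for the costly VAE decoder a pixel-space CLIP would need.
    return rng.standard_normal(IMG_SHAPE)

def pixel_clip_embed(img):
    # Placeholder pixel-space CLIP image encoder.
    return rng.standard_normal(512)

def latent_clip_embed(z):
    # Latent-CLIP consumes the latent directly -- no decode step.
    return rng.standard_normal(512)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

z = rng.standard_normal(LATENT_SHAPE)
text_emb = rng.standard_normal(512)

# Pixel-space path: decode first, then embed.
score_pixel = cosine(pixel_clip_embed(vae_decode(z)), text_emb)
# Latent path: embed the latent directly.
score_latent = cosine(latent_clip_embed(z), text_emb)

# The latent carries 48x fewer values than the decoded image,
# and the VAE decode disappears from the scoring path entirely.
compression = np.prod(IMG_SHAPE) / np.prod(LATENT_SHAPE)
```

真实的Latent-CLIP是在2.7亿对潜图像-文本上训练的编码器,这里只为说明“省去解码步骤”这一结构性差别。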
zh
[CV-55] ICPR 2024 Competition on Rider Intention Prediction
【速读】:该论文旨在解决摩托车骑手在复杂交通环境中因意图预测不足导致的安全隐患问题,通过提前预测骑手的潜在操作(如转弯或变道)来增强骑行安全性。解决方案的关键在于构建了一个名为“骑手动作预判数据集(RAAD)”的新数据集,并设计了单视图与多视图骑手意图预测(RIP)任务以涵盖多样化的交通场景和挑战性操作。此外,论文评估了三种方法(状态空间模型Mamba2、基于SVM的方法及CNN-LSTM模型)在RAAD上的表现:状态空间模型在整体性能和各操作类别间的平衡性方面表现最优;基于SVM的方法在结合随机采样与SMOTE技术缓解类别不平衡后取得次优性能;CNN-LSTM方法则因类别失衡问题(尤其在少数类上)表现欠佳。
链接: https://arxiv.org/abs/2503.08437
作者: Shankar Gangisetty,Abdul Wasi,Shyam Nandan Rai,C. V. Jawahar,Sajay Raj,Manish Prajapati,Ayesha Choudhary,Aaryadev Chandra,Dev Chandan,Shireen Chand,Suvaditya Mukherjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:The recent surge in the vehicle market has led to an alarming increase in road accidents. This underscores the critical importance of enhancing road safety measures, particularly for vulnerable road users like motorcyclists. Hence, we introduce the rider intention prediction (RIP) competition that aims to address challenges in rider safety by proactively predicting maneuvers before they occur, thereby strengthening rider safety. This capability enables the riders to react to the potential incorrect maneuvers flagged by advanced driver assistance systems (ADAS). We collect a new dataset, namely, rider action anticipation dataset (RAAD) for the competition consisting of two tasks: single-view RIP and multi-view RIP. The dataset incorporates a spectrum of traffic conditions and challenging navigational maneuvers on roads with varying lighting conditions. For the competition, we received seventy-five registrations and five team submissions for inference of which we compared the methods of the top three performing teams on both the RIP tasks: one state-space model (Mamba2) and two learning-based approaches (SVM and CNN-LSTM). The results indicate that the state-space model outperformed the other methods across the entire dataset, providing a balanced performance across maneuver classes. The SVM-based RIP method showed the second-best performance when using random sampling and SMOTE. However, the CNN-LSTM method underperformed, primarily due to class imbalance issues, particularly struggling with minority classes. This paper details the proposed RAAD dataset and provides a summary of the submissions for the RIP 2024 competition.
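摘要提到次优的SVM方法借助随机采样与SMOTE缓解类别不平衡。SMOTE的基本思想是在少数类样本与其近邻之间线性插值合成新样本;以下是一个通用的极简实现示意(竞赛论文未给出具体配置,近邻数k与采样方式均为假设):

```python
import numpy as np

def smote_oversample(X_minority, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic sample is a random point on
    the segment between a minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # k nearest neighbours of X_minority[i] within the class (excluding itself)
        d = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(loc=5.0, scale=0.5, size=(20, 2))  # a small minority class
X_new = smote_oversample(X_min, n_new=40, rng=1)      # double it twice over
```

由于合成样本是两个真实少数类样本的凸组合,它们始终落在少数类的取值范围内,这也是SMOTE相对于简单复制过采样更不易过拟合的原因。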
zh
[CV-56] Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models
【速读】:该论文旨在解决现有文本到图像扩散模型在模拟传统摄影中精确控制视觉美学(如景深)时存在的问题。传统方法通常依赖提示工程来近似这些效果,但这种方法往往导致粗略的近似并无意中改变场景内容。论文的关键解决方案是提出Bokeh Diffusion框架,它通过明确地将扩散模型与物理散焦模糊参数相结合,实现了对景深调整的一致性控制,从而保持了场景结构的完整性。为了解决真实世界配对图像数据稀缺的问题,研究引入了一种混合训练管道,将野外图像与合成模糊增强对齐。实验表明,该方法不仅实现了灵活的镜头般模糊控制,还支持真实图像编辑等应用。
链接: https://arxiv.org/abs/2503.08434
作者: Armando Fortes,Tianyi Wei,Shangchen Zhou,Xingang Pan
机构: S-Lab, Nanyang Technological University (南洋理工大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics – such as depth-of-field – current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently altering the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. By grounding depth-of-field adjustments, our method preserves the underlying scene structure as the level of blur is varied. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations. Extensive experiments demonstrate that our approach not only achieves flexible, lens-like blur control but also supports applications such as real image editing via inversion.
zh
[CV-57] Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing CVPR2025
【速读】:该论文旨在解决图像压缩感知(Compressive Sensing, CS)中的高质量重建问题。传统深度展开网络(Deep Unfolding Network, DUN)依赖于从数据中学习的先验知识,而引入更强的先验知识可以进一步提升重建质量。另一方面,预训练扩散模型(diffusion model)虽然具备强大的先验知识且具有坚实的理论基础和良好的可扩展性,但其重建过程需要大量迭代步骤。为了解决这些问题,论文提出了一种新的方法,即在DUN中利用预训练扩散模型的强大先验知识,以较少的迭代步数实现高保真的图像CS重建。
方案的关键在于设计了一种名为扩散消息传递(Diffusion Message Passing, DMP)的迭代优化算法,并将其与深度展开网络相结合形成DMP-DUN。具体而言,DMP将预训练的扩散模型嵌入到每次迭代过程中,然后通过深度展开技术将DMP转化为神经网络。DMP-DUN能够使用轻量级神经网络来映射测量数据到逆扩散过程的中间步骤,并直接逼近扩散模型的散度(divergence),从而显著提高重建效率。实验结果表明,所提出的DMP-DUN达到了最先进的性能,且最少仅需两步即可完成图像重建。
链接: https://arxiv.org/abs/2503.08429
作者: Chen Liao,Yan Shen,Dan Li,Zhongli Wang
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025 accepted
点击查看摘要
Abstract:Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but it requires a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of pre-trained diffusion model in DUNs to achieve high-quality reconstruction with less steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration process of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires at least only 2 steps to reconstruct the image. Codes are available at this https URL.
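DMP-DUN属于“把迭代优化算法展开为网络”的范式:每次迭代交替执行数据一致性(梯度)步与先验(去噪)步,论文中先验步由预训练扩散模型承担。下面用软阈值去噪器代替扩散先验,给出这类plug-and-play迭代的通用草图(步长、阈值与迭代次数均为示意取值,并非论文设置):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 32                                   # signal length, CS measurements
A = rng.standard_normal((m, n)) / np.sqrt(m)    # random sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)  # sparse signal
y = A @ x_true                                  # compressed measurements

def denoise(v, tau=0.01):
    # Stand-in prior: soft-thresholding. In DMP-DUN this slot is filled
    # by the pre-trained diffusion model's denoising step.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(n)
eta = 0.1                                       # gradient step size
for _ in range(100):                            # the iterations to be unfolded
    x = x - eta * A.T @ (A @ x - y)             # data-consistency step
    x = denoise(x)                              # prior / denoising step

residual = np.linalg.norm(A @ x - y)
```

DMP-DUN的要点正是把这类迭代截断到极少步数(论文称最少2步)并使各步参数可学习,同时用轻量网络直接逼近扩散模型的散度,从而避免逐步采样。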
zh
[CV-58] JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data
【速读】:该论文旨在解决基于深度学习的自动驾驶感知在LiDAR数据标注依赖上的局限性问题,具体表现为真实标注数据规模限制以及罕见场景(corner cases)的缺乏。此外,引入仿真数据以增强真实感知性能面临两大挑战:仿真数据的样本效率低下及仿真到真实场景之间的域差距(simulation-to-real gap)。为克服这些挑战,论文提出了一种名为JiSAM的方法,其核心在于结合抖动增强(Jittering augmentation)、领域感知骨干网络(domain-aware backbone)以及基于记忆的扇区对齐(memory-based Sectorized AlignMent)。实验结果表明,JiSAM能够在仅使用少量真实标注数据(2.5%)的情况下,达到与全真实数据训练模型相当的性能,并显著提升未标注目标的平均精度(mAP)。
链接: https://arxiv.org/abs/2503.08422
作者: Runjian Chen,Wenqi Shao,Bo Zhang,Shaoshuai Shi,Li Jiang,Ping Luo
机构: The University of Hong Kong (香港大学); Shanghai AI Laboratory (上海人工智能实验室); Voyager Research, Didi Chuxing (滴滴出行远航研究); The Chinese University of Hong Kong (Shenzhen) (香港中文大学(深圳)); pjlab (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep-learning-based autonomous driving (AD) perception introduces a promising picture for safe and environment-friendly transportation. However, the over-reliance on real labeled data in LiDAR perception limits the scale of on-road attempts. 3D real world data is notoriously time-and-energy-consuming to annotate and lacks corner cases like rare traffic participants. On the contrary, in simulators like CARLA, generating labeled LiDAR point clouds with corner cases is a piece of cake. However, introducing synthetic point clouds to improve real perception is non-trivial. This stems from two challenges: 1) sample efficiency of simulation datasets 2) simulation-to-real gaps. To overcome both challenges, we propose a plug-and-play method called JiSAM , shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. In extensive experiments conducted on the famous AD dataset NuScenes, we demonstrate that, with SOTA 3D object detector, JiSAM is able to utilize the simulation data and only labels on 2.5% available real data to achieve comparable performance to models trained on all real data. Additionally, JiSAM achieves more than 15 mAPs on the objects not labeled in the real training set. We will release models and codes.
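JiSAM名称中的“J”即Jittering增强。对点云的抖动增强通常是向每个点坐标加入截断的高斯噪声;以下为通用示意(sigma与clip为假设取值,摘要未给出论文的具体配置):

```python
import numpy as np

def jitter_point_cloud(points, sigma=0.01, clip=0.05, rng=None):
    """Add small, clipped Gaussian noise to every point's xyz coordinates.
    A generic jittering augmentation; sigma/clip here are illustrative,
    not JiSAM's actual settings."""
    rng = np.random.default_rng(rng)
    noise = np.clip(rng.normal(0.0, sigma, size=points.shape), -clip, clip)
    return points + noise

cloud = np.random.default_rng(0).uniform(-10, 10, size=(2048, 3))  # fake LiDAR scan
augmented = jitter_point_cloud(cloud, rng=1)
```

这类扰动让检测器对仿真点云中过于“干净”的几何不那么敏感,是缓解仿真到真实域差距的常见第一步;JiSAM在此之上还叠加了领域感知骨干与扇区对齐。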
zh
[CV-59] Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
【速读】:该论文旨在解决无监督3D目标检测中因数据稀疏性和有限视角导致的聚类式伪标签质量较低的问题。论文提出的关键解决方案是利用多智能体协作数据集中的互补观测信息,在无外部标注的情况下实现高质量伪标签的生成。具体而言,方法首先通过协作智能体共享的自运动姿态和自车身形状初始化检测器,并利用神经网络的泛化能力初步生成伪标签;随后结合各智能体的互补观测进行多尺度编码,区分高、低质量伪标签,并进一步作为提示引导正确的特征学习过程,从而提升无监督3D目标检测任务的性能。实验验证了所提方法在V2V4Real和OPV2V数据集上的优越性。
链接: https://arxiv.org/abs/2503.08421
作者: Qiming Xia,Wenkai Lin,Haoen Xiang,Xun Huang,Siheng Chen,Zhen Dong,Cheng Wang,Chenglu Wen
机构: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University (厦门大学), China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University (厦门大学), China; Shanghai Jiao Tong University (上海交通大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to the data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative dataset, which involves the sharing of complementary observations among agents, holds the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using labels from external. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently,DOtA uses the complementary observations between agents to perform multi-scale encoding on preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception this http URL code is available at this https URL.
zh
[CV-60] Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview
【速读】:该论文旨在解决深度学习在医学影像领域中模型透明性和可解释性不足的问题。解决方案的关键在于结合ResNet50与五种常见的可解释人工智能(XAI)技术,以提高模型预测的可解释性,并通过定量指标(如置信度增加)评估XAI技术的有效性。实验结果表明,ResNet50在多个数据集上实现了可行的准确率和F1分数,同时某些XAI方法(如XgradCAM)在突出医学图像相关异常区域方面表现优于其他方法(如EigenGradCAM)。这一研究为增强深度学习模型在生物医学成像领域的鲁棒性和通用性提供了未来研究方向。
链接: https://arxiv.org/abs/2503.08420
作者: Ahmad Chaddad,Yan Hu,Yihang Wu,Binbin Wen,Reem Kateb
机构: Artificial Intelligence for Personalized Medicine, School of Artificial Intelligence, Guilin University of Electronic Technology (桂林电子科技大学), Guilin 541004, China; Laboratory for Imagery Vision and Artificial Intelligence, École de Technologie Supérieure (ÉTS)(魁北克省高等技术学院), Montréal, QC H3C 1K3, Canada; College of Computer Science and Engineering, Taibah University (Taibah大学), Madinah, 42353, Saudi Arabia; College of Computer Science and Engineering, Jeddah University (Jeddah大学), Jeddah, 23445, Saudi Arabia
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in Current Opinion in Biomedical Engineering
点击查看摘要
Abstract:Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, aimed at addressing the urgent need for transparency and explainability in clinical applications. Methodology. We propose to use four CNNs in three medical datasets (brain tumor, skin cancer, and chest x-ray) for medical image classification tasks. In addition, we perform paired t-tests to show the significance of the differences observed between different methods. Furthermore, we propose to combine ResNet50 with five common XAI techniques to obtain explainable results for model prediction, aiming at improving model transparency. We also involve a quantitative metric (confidence increase) to evaluate the usefulness of XAI techniques. Key findings. The experimental results indicate that ResNet50 can achieve feasible accuracy and F1 score in all datasets (e.g., 86.31% accuracy in skin cancer). Furthermore, the findings show that while certain XAI methods, such as XgradCAM, effectively highlight relevant abnormal regions in medical images, others, like EigenGradCAM, may perform less effectively in specific scenarios. In addition, XgradCAM indicates higher confidence increase (e.g., 0.12 in glioma tumor) compared to GradCAM++ (0.09) and LayerCAM (0.08). Implications. Based on the experimental results and recent advancements, we outline future research directions to enhance the robustness and generalizability of DL models in the field of biomedical imaging. 
Related DOI: https://doi.org/10.1016/j.cobme.2024.100567. Submission history: From Yihang Wu [v1] Tue, 11 Mar 2025 13:31:09 UTC (2,304 KB)
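文中用于量化XAI有效性的“置信度增加”(confidence increase)指标,常见做法是:用归一化显著图对输入加掩码,统计掩码后目标类得分反而上升的样本比例。以下用一个玩具模型演示该指标的计算方式(模型与显著图均为虚构,仅说明指标本身,并非论文的实验设置):

```python
import numpy as np

def confidence_increase(model, images, saliency_maps, target_class):
    """Fraction of images whose target-class score goes UP when the image
    is masked by its (0..1-normalized) saliency map. A saliency map that
    keeps the true evidence and removes distractors should score high."""
    increases = 0
    for x, m in zip(images, saliency_maps):
        base = model(x)[target_class]
        masked = model(x * m[..., None])   # broadcast the map over channels
        increases += masked[target_class] > base
    return increases / len(images)

def toy_model(x):
    # Toy stand-in classifier: class 1 score depends on how much brighter
    # the centre patch is than the whole image.
    centre = x[8:24, 8:24].mean()
    p1 = 1.0 / (1.0 + np.exp(-(centre - x.mean()) * 10))
    return np.array([1.0 - p1, p1])

rng = np.random.default_rng(0)
images = rng.random((10, 32, 32, 3))
# Saliency maps that keep the centre (the real evidence) and zero the border.
maps = np.zeros((10, 32, 32))
maps[:, 8:24, 8:24] = 1.0
score = confidence_increase(toy_model, images, maps, target_class=1)
```

由于掩码去掉了“干扰性”的边缘区域,玩具模型对类别1的得分在所有样本上都上升,score为1.0;论文中XgradCAM相对GradCAM++、LayerCAM更高的confidence increase即此类统计。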
zh
[CV-61] AnyMoLe: Any Character Motion In-betweening Leveraging Video Diffusion Models CVPR2025
【速读】:该论文旨在解决基于学习的运动插值方法中对角色特定数据集的依赖问题。解决方案的关键在于提出AnyMoLe方法,它利用视频扩散模型生成任意角色的中间运动帧,而无需外部数据。AnyMoLe采用两阶段帧生成过程以增强上下文理解,并通过ICAdapt技术微调视频扩散模型以弥合真实世界与渲染角色动画之间的领域差距。此外,引入“运动-视频模仿”优化技术,利用2D和3D感知特征实现任意关节结构角色的平滑且逼真的运动生成。这些创新显著降低了数据依赖性,使其适用于广泛的运动插值任务。
链接: https://arxiv.org/abs/2503.08417
作者: Kwan Yun,Seokhyeon Hong,Chaelin Kim,Junyong Noh
机构: KAIST (韩国科学技术院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 11 pages, 10 figures, CVPR 2025
点击查看摘要
Abstract:Despite recent advancements in learning-based motion in-betweening, a key limitation has been overlooked: the requirement for character-specific datasets. In this work, we introduce AnyMoLe, a novel method that addresses this limitation by leveraging video diffusion models to generate motion in-between frames for arbitrary characters without external data. Our approach employs a two-stage frame generation process to enhance contextual understanding. Furthermore, to bridge the domain gap between real-world and rendered character animations, we introduce ICAdapt, a fine-tuning technique for video diffusion models. Additionally, we propose a “motion-video mimicking” optimization technique, enabling seamless motion generation for characters with arbitrary joint structures using 2D and 3D-aware features. AnyMoLe significantly reduces data dependency while generating smooth and realistic transitions, making it applicable to a wide range of motion in-betweening tasks.
zh
[CV-62] WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
【速读】:该论文旨在解决现有交互式三维分割模型在实时场景应用中的局限性,即这些模型通常需要针对特定场景进行大量训练才能准确重建和分割物体。论文提出的解决方案核心在于WildSeg3D,这是一种通过前馈机制实现任意三维物体在多样化环境中分割的有效方法。关键挑战在于多视角二维图像中三维配准误差的累积可能导致分割结果不准确。为了解决这一问题,论文提出了动态全局配准(Dynamic Global Aligning, DGA)技术,通过关注跨图像难以匹配的三维点并使用动态调整函数来提高全局多视角配准的准确性。此外,为了实现实时交互式分割,还引入了多视角分组映射(Multi-view Group Mapping, MGM)方法,利用对象掩码缓存整合多视角分割结果并快速响应用户提示。WildSeg3D展示了其在任意场景中的鲁棒泛化能力,并且相比现有的最先进模型,在保持相同精度的同时实现了40倍的速度提升。
链接: https://arxiv.org/abs/2503.08407
作者: Yansong Guo,Jie Hu,Yansong Qu,Liujuan Cao
机构: Xiamen University (厦门大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in interactive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time interactive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a 40× speedup compared to existing SOTA models. Our code will be publicly available.
zh
[CV-63] DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank
【速读】:该论文旨在解决现有艺术风格迁移方法难以生成足够多样且高度逼真的艺术风格化图像的问题。为了解决这一问题,论文提出了一种名为DyArtbank的新颖艺术风格迁移框架,其关键是引入了动态风格提示艺术库(Dynamic Style Prompt ArtBank, DSPA)和关键内容特征提示(Key Content Feature Prompt, KCFP)模块。DSPA是一组可学习的参数,能够从艺术品集合中学习并存储风格信息,动态引导预训练的稳定扩散模型生成多样且逼真的艺术风格化图像,并支持基于学习到的风格信息生成随机的艺术图像样本,为数据增强提供新思路。KCFP模块则为预训练的稳定扩散模型提供了充足的内容提示,以保留输入内容图像的详细结构。
链接: https://arxiv.org/abs/2503.08392
作者: Zhanjie Zhang,Quanwei Zhang,Guangyuan Li,Junsheng Luan,Mengyuan Yang,Yun Wang,Lei Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Knowledge-Based Systems
点击查看摘要
Abstract:Artistic style transfer aims to transfer the learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent artistic stylized images, making it difficult for users to get enough stylized images to enjoy. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It can learn and store the style information from the collection of artworks, dynamically guiding pre-trained stable diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, providing a new idea for data augmentation. Besides, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained stable diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available: this https URL
zh
[CV-64] Recognition-Synergistic Scene Text Editing CVPR2025
【速读】:该论文旨在解决场景文本编辑(Scene Text Editing)的问题,即在保持风格一致性的同时修改场景图像中的文本内容。传统方法通过显式分离源图像的风格与内容,并融合目标内容与风格来实现这一目标,同时借助预训练的识别模型确保内容一致性,但其复杂管道导致在复杂场景下的性能不佳。论文提出了一种名为Recognition-Synergistic Scene Text Editing (RS-STE) 的新方法,关键在于充分利用文本识别与编辑之间的内在协同作用。该方法在一个统一框架中无缝集成文本识别与编辑,利用识别模型隐式分离风格与内容的能力,并通过双循环自监督微调策略,在无配对真实数据的情况下增强风格与内容一致性。此外,其基于Transformer架构的多模态并行解码器能够同步预测文本内容和风格化图像,从而实现卓越的性能表现。
链接: https://arxiv.org/abs/2503.08387
作者: Zhengyao Fang,Pengyuan Lyu,Jingjing Wu,Chengquan Zhang,Jun Yu,Guangming Lu,Wenjie Pei
机构: Harbin Institute of Technology (哈尔滨工业大学); Tencent (腾讯); Department of Computer Vision Technology, Baidu Inc. (百度计算机视觉技术部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by CVPR2025
点击查看摘要
Abstract:Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model’s ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.
zh
[CV-65] Prototype-Based Multiple Instance Learning for Gigapixel Whole Slide Image Classification
【速读】:该论文旨在解决现有多重实例学习(MIL)方法在组织病理学全切片图像(WSI)分析中的两个主要局限性:一是大多数MIL模型仅提供基于注意力的解释,未能忠实反映模型的决策机制;二是缺乏与人类用户的交互能力。为了解决这些问题,论文提出了一种名为ProtoMIL的固有可解释MIL模型。其关键创新在于利用稀疏自动编码器从图像特征空间中发现人类易于理解的概念,并以此训练模型,使预测结果能够以概念的线性组合形式表示,从而实现决策过程的透明化。此外,ProtoMIL允许用户通过修改输入概念进行模型干预,进一步增强人机协作能力。实验表明,ProtoMIL在保持分类性能与当前最先进的MIL模型相当的同时,提供了直观且易理解的解释,并可通过人为干预消除对诊断无关信息的依赖。
链接: https://arxiv.org/abs/2503.08384
作者: Susu Sun,Dominique van Midden,Geert Litjens,Christian F. Baumgartner
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multiple Instance Learning (MIL) methods have succeeded remarkably in histopathology whole slide image (WSI) analysis. However, most MIL models only offer attention-based explanations that do not faithfully capture the model’s decision mechanism and do not allow human-model interaction. To address these limitations, we introduce ProtoMIL, an inherently interpretable MIL model for WSI analysis that offers user-friendly explanations and supports human intervention. Our approach employs a sparse autoencoder to discover human-interpretable concepts from the image feature space, which are then used to train ProtoMIL. The model represents predictions as linear combinations of concepts, making the decision process transparent. Furthermore, ProtoMIL allows users to perform model interventions by altering the input concepts. Experiments on two widely used pathology datasets demonstrate that ProtoMIL achieves a classification performance comparable to state-of-the-art MIL models while offering intuitively understandable explanations. Moreover, we demonstrate that our method can eliminate reliance on diagnostically irrelevant information via human intervention, guiding the model toward being right for the right reason. Code will be publicly available at this https URL.
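ProtoMIL“预测 = 概念的线性组合”这一设计可以用几行代码体会:由于logit对每个概念的贡献是可加的,人工干预(置零某个无关概念)会直接、可解释地改变预测。以下概念名称、权重与激活值均为虚构示意:

```python
import numpy as np

# Hypothetical concepts, weights and activations -- purely illustrative,
# not ProtoMIL's learned parameters.
concept_names = ["tumour_gland", "stroma", "ink_artifact"]
W = np.array([1.8, 0.2, 0.9])            # learned per-concept weights
b = -0.5                                 # bias

def predict(concepts, keep=None):
    """Slide-level score as a linear combination of concept activations.
    `keep` lists the concept indices a human chooses to retain; the rest
    are zeroed out (the intervention)."""
    c = concepts.copy()
    if keep is not None:
        mask = np.zeros_like(c)
        mask[keep] = 1.0
        c = c * mask
    logit = W @ c + b
    return 1.0 / (1.0 + np.exp(-logit))

concepts = np.array([0.7, 0.4, 0.9])     # activations from the sparse autoencoder
p_full = predict(concepts)
# The user removes the diagnostically irrelevant "ink_artifact" concept:
p_intervened = predict(concepts, keep=[0, 1])
```

线性形式使每个概念对最终预测的贡献(W[i] * c[i])可以逐项读出,这正是摘要中“决策过程透明”与“消除对诊断无关信息的依赖”的机制基础。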
zh
[CV-66] Twinner: Shining Light on Digital Twins in a Few Snaps
【速读】:该论文旨在解决从少量视角图像中恢复场景光照、物体几何形状以及材质属性的问题。为实现这一目标,论文提出了Twinner模型,其解决方案的关键创新点包括:1)引入了一种内存高效的体素网格Transformer,其内存需求仅与体素网格大小的平方成比例增长;2)构建了一个大规模全合成数据集,包含通过程序生成的PBR纹理对象,并在不同光照条件下渲染,以应对高质量地面真值PBR阴影模型稀缺的问题;3)通过可微物理基础着色模型对模型进行微调,缩小了合成到真实世界的差距,避免了在现实场景中难以获取的地面真值光照和材质属性的需求。实验结果显示,Twinner在真实世界的StanfordORB基准测试中,凭借少量输入视图实现了显著优于现有前馈重建网络的重建质量,且与显著更慢的场景特定优化方法相当。
链接: https://arxiv.org/abs/2503.08382
作者: Jesus Zarzar,Tom Monnier,Roman Shapovalov,Andrea Vedaldi,David Novotny
机构: VGG (University of Oxford); KAUST; Meta AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present the first large reconstruction model, Twinner, capable of recovering a scene’s illumination as well as an object’s geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with scarcity of high-quality ground-truth PBR-shaded models, we introduce a large fully-synthetic dataset of procedurally-generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real life datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real life StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feedforward reconstruction networks, and comparable to significantly slower per-scene optimization methods.
zh
[CV-67] Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
【速读】:该论文旨在解决现有图像tokenization方法在效率与保真度之间难以平衡的问题,即高分辨率图像重建要么需要过多的tokens,要么通过token降维而牺牲关键细节。为了解决这一挑战,论文提出了Latent Consistency Tokenizer(Layton),其关键在于将离散视觉tokens与预训练的Latent Diffusion Models(LDMs)的紧凑潜空间相结合,从而仅用256个tokens即可高效表示1024x1024的图像,相比VQGAN实现了16倍的压缩。Layton的核心由Transformer编码器、量化码本以及改进的潜一致性解码器组成。直接使用LDM作为解码器会导致颜色和亮度偏差,因此论文将其转化为潜一致性解码器,并通过减少多步采样至1-2步实现像素级直接监督。实验表明,Layton在MSCOCO-2017 5K基准上的1024x1024图像重建达到了10.8的Frechet Inception Distance,展现了其在高保真重建方面的优势。此外,论文还扩展Layton至文本到图像生成模型LaytonGen,在自回归模式下实现了0.73的GenEval评分,超越当前最先进的方法。
链接: https://arxiv.org/abs/2503.08377
作者: Qingsong Xie,Zhao Zhang,Zhe Huang,Yanhao Zhang,Haonan Lu,Zhenyu Yang
机构: OPPO AI Center; ByteDance (字节跳动); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens-a 16 times compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton’s superiority in high-fidelity reconstruction, with 10.8 reconstruction Frechet Inception Distance on MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, working in autoregression. It achieves 0.73 score on GenEval benchmark, surpassing current state-of-the-art methods. The code and model will be released.
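“用256个token表示1024×1024图像、相比VQGAN压缩16倍”这组数字可以快速验证(假设所对比的VQGAN采用常见的f=16下采样率,这一点摘要未明说):

```python
# Back-of-the-envelope check of Layton's token-count claim,
# assuming a VQGAN with the common f=16 downsampling factor.

def vqgan_tokens(resolution, downsample=16):
    side = resolution // downsample
    return side * side

layton_tokens = 256                       # Layton's fixed-length token sequence
baseline = vqgan_tokens(1024)             # (1024 / 16)^2 = 4096 tokens
compression = baseline // layton_tokens   # 16x fewer tokens
```

对自回归生成而言token数直接决定序列长度,因此16倍的token压缩对应生成步数与注意力开销的大幅下降,这也是LaytonGen能以自回归方式高效工作的前提。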
zh
[CV-68] nnInteractive: Redefining 3D Promptable Segmentation KR
【速读】:该论文旨在解决3D医学图像分割领域中现有方法存在的准确性不足、交互限制及可用性较差的问题。具体而言,尽管如SAM等基础模型在交互分割方面取得了革命性进展,但其2D设计和领域迁移限制使其难以胜任3D医学图像任务。当前的适应性方法虽有所改进,但仍存在缺乏体积感知能力、交互受限或仅支持有限结构与模态等问题。此外,工具的可用性也受到限制,因为它们通常未集成到现有的成像平台中,且依赖于功能受限的基于网络的界面。
论文提出的解决方案关键在于nnInteractive,这是一种全面的3D交互开放式分割方法。它支持多样化的提示方式(包括点、涂鸦、框选以及一种新型套索提示),并通过直观的2D交互生成完整的3D分割结果。通过在超过120个多样化三维数据集(如CT、MRI、PET、3D显微镜等)上的训练,nnInteractive在准确性、适应性和可用性方面达到了新的高度。尤为突出的是,它是首个集成到广泛使用的图像查看器(如Napari和MITK)中的方法,确保了其在实际临床和研究应用中的广泛可访问性。广泛的基准测试表明,nnInteractive远远超越现有方法,为AI驱动的交互式3D分割设定了新标准。
链接: https://arxiv.org/abs/2503.08373
作者: Fabian Isensee,Maximilian Rokuss,Lars Krämer,Stefan Dinkelacker,Ashis Ravindran,Florian Stritzke,Benjamin Hamm,Tassilo Wald,Moritz Langenberg,Constantin Ulrich,Jonathan Deissler,Ralf Floca,Klaus Maier-Hein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Fabian Isensee, Maximilian Rokuss and Lars Krämer contributed equally. Each co-first author may list themselves as lead author on their CV
点击查看摘要
Abstract:Accurate and efficient 3D segmentation is essential for both clinical and research applications. While foundation models like SAM have revolutionized interactive segmentation, their 2D design and domain shift limitations make them ill-suited for 3D medical images. Current adaptations address some of these challenges but remain limited, either lacking volumetric awareness, offering restricted interactivity, or supporting only a small set of structures and modalities. Usability also remains a challenge, as current tools are rarely integrated into established imaging platforms and often rely on cumbersome web-based interfaces with restricted functionality. We introduce nnInteractive, the first comprehensive 3D interactive open-set segmentation method. It supports diverse prompts-including points, scribbles, boxes, and a novel lasso prompt-while leveraging intuitive 2D interactions to generate full 3D segmentations. Trained on 120+ diverse volumetric 3D datasets (CT, MRI, PET, 3D Microscopy, etc.), nnInteractive sets a new state-of-the-art in accuracy, adaptability, and usability. Crucially, it is the first method integrated into widely used image viewers (e.g., Napari, MITK), ensuring broad accessibility for real-world clinical and research applications. Extensive benchmarking demonstrates that nnInteractive far surpasses existing methods, setting a new standard for AI-driven interactive 3D segmentation. nnInteractive is publicly available: this https URL (Napari plugin), this https URL (MITK integration), this https URL (Python backend).
zh
[CV-69] Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking
【速读】:该论文旨在解决动态室内布局估计(Dynamic Indoor Layout Estimation)的问题,特别是在事件相机(Event Camera)数据驱动下的精度提升。为应对这一挑战,论文提出了解决方案的关键在于设计了一个新颖的事件驱动布局估计管道(Event-based Layout Estimation Pipeline),其中包含两个核心模块:一是用于有效聚合时空信息的事件-时间分布特征模块(Event-Temporal Distribution Feature Module);二是能够轻松集成到Transformer模块中的空间-时间特征融合模块(Spatio-Temporal Feature Fusion Module)。这些创新模块共同提升了基于事件的动态室内布局估计的准确性,显著优于现有方法。
链接: https://arxiv.org/abs/2503.08370
作者: Xucheng Guo,Yiran Shen,Xiaofang Xiao,Yuanfeng Zhou,Lin Wang
机构: School of Software, Shandong University, China (山东大学软件学院,中国); School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (新加坡南洋理工大学电气与电子工程学院,新加坡)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents Ev-Layout, a novel large-scale event-based multi-modal dataset designed for indoor layout estimation and tracking. Ev-Layout makes key contributions to the community by: Utilizing a hybrid data collection platform (with a head-mounted display and VR interface) that integrates both RGB and bio-inspired event cameras to capture indoor layouts in motion. Incorporating time-series data from inertial measurement units (IMUs) and ambient lighting conditions recorded during data collection to highlight the potential impact of motion speed and lighting on layout estimation accuracy. The dataset consists of 2.5K sequences, including over 771.3K RGB images and 10 billion event data points. Of these, 39K images are annotated with indoor layouts, enabling research in both event-based and video-based indoor layout estimation. Based on the dataset, we propose an event-based layout estimation pipeline with a novel event-temporal distribution feature module to effectively aggregate the spatio-temporal information from events. Additionally, we introduce a spatio-temporal feature fusion module that can be easily integrated into a transformer module for fusion purposes. Finally, we conduct benchmarking and extensive experiments on the Ev-Layout dataset, demonstrating that our approach significantly improves the accuracy of dynamic indoor layout estimation compared to existing event-based methods.
zh
[CV-70] Debiased Prompt Tuning in Vision-Language Model without Annotations
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在提示调优(prompt tuning)后可能因依赖虚假相关性特征(如背景或性别等无关数据特征)而导致的鲁棒性下降问题,尤其是在分布外(out-of-distribution)数据上的表现。传统消除虚假相关性的方法通常需要已知每个样本的虚假属性标签,但在实际应用中难以实现。为此,论文提出了一种无需人工标注虚假特征的方法,利用视觉-语言模型的零样本图像识别能力来自动识别虚假特征,并通过伪标记的虚假属性进一步调整不同组别的训练权重。这种方法的关键在于通过伪标注技术自动优化训练过程,从而显著提升最差组别(worst-group)的准确性,在CelebA、Waterbirds和MetaShift数据集上的实验验证了其有效性,实现了最差组别与整体准确性之间的最佳鲁棒性差距。
链接: https://arxiv.org/abs/2503.08368
作者: Chaoquan Jiang,Yunfan Yang,Rui Hu,Jitao Sang
机构: Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Prompt tuning of Vision-Language Models (VLMs) such as CLIP has demonstrated the ability to rapidly adapt to various downstream tasks. However, recent studies indicate that tuned VLMs may suffer from the problem of spurious correlations, where the model relies on spurious features (e.g. background and gender) in the data. This may lead to worse robustness on out-of-distribution data. Standard methods for eliminating spurious correlations typically require us to know the spurious attribute labels of each sample, which is hard in the real world. In this work, we explore improving the group robustness of prompt tuning in VLMs without relying on manual annotation of spurious features. We notice the zero-shot image recognition ability of VLMs and use this ability to identify spurious features, thus avoiding the cost of manual annotation. By leveraging pseudo-spurious attribute annotations, we further propose a method to automatically adjust the training weights of different groups. Extensive experiments show that our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets, achieving the best robustness gap between the worst-group accuracy and the overall accuracy.
zh
[CV-71] Embodied Crowd Counting
【速读】:该论文旨在解决人群计数中的遮挡问题(Crowd Counting with Occlusion),这是该领域的一个基础性挑战。现有数据驱动方法在应对这一问题时效果有限,主要因为它们所依赖的训练数据集大多基于被动摄像头,限制了其感知环境的能力。论文提出了一种新颖的任务——具身人群计数(Embodied Crowd Counting, ECC),并通过构建一个交互式模拟器(ECC Dataset)来生成大规模场景和大量目标,同时引入先验概率分布以逼近真实的人群分布。解决方案的关键在于提出了零样本导航方法(Zero-Shot Embodied Crowd Counting, ZECC),该方法包含由多语言大模型(MLLM)驱动的粗到细导航机制,支持主动的Z轴探索,并结合基于法线的人群分布分析方法实现精确计数。实验结果表明,该方法在计数准确性和导航成本之间实现了最佳权衡。
链接: https://arxiv.org/abs/2503.08367
作者: Runling Long,Yunlong Wang,Jia Wan,Xiang Deng,Xinting Zhu,Weili Guan,Antoni B. Chan,Liqiang Nie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Occlusion is one of the fundamental challenges in crowd counting. In the community, various data-driven approaches have been developed to address this issue, yet their effectiveness is limited. This is mainly because most existing crowd counting datasets on which the methods are trained are based on passive cameras, restricting their ability to fully sense the environment. Recently, embodied navigation methods have shown significant potential in precise object detection in interactive scenes. These methods incorporate active camera settings, holding promise in addressing the fundamental issues in crowd counting. However, most existing methods are designed for indoor navigation, showing unknown performance in analyzing complex object distribution in large-scale scenes, such as crowds. Besides, most existing embodied navigation datasets are indoor scenes with limited scale and object quantity, preventing them from being introduced into dense crowd analysis. Based on this, a novel task, Embodied Crowd Counting (ECC), is proposed. We first build up an interactive simulator, Embodied Crowd Counting Dataset (ECCD), which enables large-scale scenes and large object quantity. A prior probability distribution that approximates realistic crowd distribution is introduced to generate crowds. Then, a zero-shot navigation method (ZECC) is proposed. This method contains an MLLM-driven coarse-to-fine navigation mechanism, enabling active Z-axis exploration, and a normal-line-based crowd distribution analysis method for fine counting. Experimental results against baselines show that the proposed method achieves the best trade-off between counting accuracy and navigation cost.
zh
[CV-72] Parametric Point Cloud Completion for Polygonal Surface Reconstruction CVPR2025
【速读】:该论文旨在解决现有基于多边形表面重建方法对输入完整性依赖性强且在不完整点云场景下表现不佳的问题。尽管当前点云补全技术能够恢复缺失点,但它们并未针对多边形表面重建进行优化,忽视了底层曲面的参数化表示。为填补这一空白,论文引入了一种名为“参数化补全”(Parametric Completion) 的新范式,其核心在于通过恢复参数化基元(parametric primitives)而非单个点来表达高级几何结构。论文提出的PaCo方法利用平面代理(plane proxies),这些代理同时包含平面参数和内点集,尤其在高度不完整的数据场景中表现出色。关键在于通过参数化基元的恢复,使重建结果不仅完整,还能保持高质量的几何特性。
链接: https://arxiv.org/abs/2503.08363
作者: Zhaiyu Chen,Yuqing Wang,Liangliang Nan,Xiao Xiang Zhu
机构: Technical University of Munich (慕尼黑工业大学); Delft University of Technology (代尔夫特理工大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:Existing polygonal surface reconstruction methods heavily depend on input completeness and struggle with incomplete point clouds. We argue that while current point cloud completion techniques may recover missing points, they are not optimized for polygonal surface reconstruction, where the parametric representation of underlying surfaces remains overlooked. To address this gap, we introduce parametric completion, a novel paradigm for point cloud completion, which recovers parametric primitives instead of individual points to convey high-level geometric structures. Our presented approach, PaCo, enables high-quality polygonal surface reconstruction by leveraging plane proxies that encapsulate both plane parameters and inlier points, proving particularly effective in challenging scenarios with highly incomplete data. Comprehensive evaluations of our approach on the ABC dataset establish its effectiveness with superior performance and set a new standard for polygonal surface reconstruction from incomplete data. Project page: this https URL.
zh
[CV-73] Robust Latent Matters: Boosting Image Generation with Sampling Error
【速读】:本文旨在解决现有图像生成方案中,基于冻结图像tokenizer构建离散潜空间时存在的重建质量与生成质量之间的不一致性问题。当前评估tokenizer的指标(如rFID)无法精确衡量tokenizer性能,并有效关联其表现与生成质量(如gFID)。为解决此问题,论文提出了一种新颖的即插即用tokenizer训练方案以改进潜空间构建。关键在于引入潜在扰动方法模拟生成过程中产生的意外采样噪声(即未预期token),基于此提出了一种新的tokenizer评估指标pFID,成功建立了tokenizer性能与生成质量间的关联,并设计了一种能够显著提升tokenizer鲁棒性的训练方案,从而提高生成质量和收敛速度。通过与11种先进离散图像tokenizer及两种自回归生成模型进行广泛基准测试验证了所提方法的有效性。
链接: https://arxiv.org/abs/2503.08354
作者: Kai Qiu,Xiang Li,Jason Kuen,Hao Chen,Xiaohao Xu,Jiuxiang Gu,Yinyi Luo,Bhiksha Raj,Zhe Lin,Marios Savvides
机构: Carnegie Mellon University (卡内基梅隆大学); Adobe Research (Adobe 研究院); UMich (密歇根大学); MBZUAI (穆罕默德·本·扎耶德人工智能大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 13 figures, 6 tables
点击查看摘要
Abstract:Recent image generation schemes typically capture the image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy between reconstruction and generation qualities in a discrete latent space, and, based on this analysis, propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting the generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a ~400M generator. Code: this https URL.
zh
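上述"潜在扰动"的具体实现细节在摘要中并未给出;下面用纯Python给出一个示意性草图,以一定概率将离散潜码token随机替换为其他码本索引,来模拟生成过程中可能采样到的"未预期token"。函数名、替换规则与概率参数 p 均为本文之外的假设,并非论文的原始实现:

```python
import random

def perturb_latent_tokens(tokens, codebook_size, p, seed=0):
    """Simulate generation-time sampling noise on a tokenizer's latent code.

    Each token is independently replaced, with probability p, by a random
    'unexpected' codebook index, mimicking tokens a generator may sample
    that the tokenizer never saw during plain reconstruction training.
    """
    rng = random.Random(seed)
    perturbed = []
    for t in tokens:
        if rng.random() < p:
            # draw a codebook entry guaranteed to differ from the original
            candidate = rng.randrange(codebook_size - 1)
            perturbed.append(candidate if candidate < t else candidate + 1)
        else:
            perturbed.append(t)
    return perturbed

codes = [3, 7, 1, 4, 4, 0, 9, 2]
noisy = perturb_latent_tokens(codes, codebook_size=16, p=0.5)
print(noisy)
```

其中 p 控制扰动强度:p=0 时退化为普通重建训练,p 越大,tokenizer 需要对越多的采样噪声保持鲁棒。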
[CV-74] Mitigating Ambiguities in 3D Classification with Gaussian Splatting CVPR2025
【速读】:该论文旨在解决基于点云输入的3D分类中存在的两个主要问题:一是由于点云表示的离散性和材料描述的不足,难以区分类似电线的表面和平面,以及透明或反射物体;二是现有方法在处理这些模糊类别时存在歧义。为了解决这些问题,论文提出了一种基于高斯泼溅(Gaussian Splatting, GS)点云的3D分类方法。其关键是利用GS点云中的尺度和旋转系数来表征表面类型,具体而言,类似电线的表面由多个细长的高斯椭球体组成,而平面则由少数扁平的高斯椭球体构成;同时,GS点云中的不透明度反映了物体的透明特性。通过这种方式,可以有效缓解基于点云的3D分类中的歧义问题。为了验证GS点云输入的有效性,研究构建了一个包含20个类别、每类200个对象的真实世界GS点云数据集,并通过实验不仅证明了GS点云输入的优越性,特别是在区分模糊对象方面,还展示了其在不同分类方法中的泛化能力。
链接: https://arxiv.org/abs/2503.08352
作者: Ruiqi Zhang,Hao Zhu,Jingyi Zhao,Qi Zhang,Xun Cao,Zhan Ma
机构: Nanjing University (南京大学); Imperial College London (帝国理工学院); Vivo Company (维沃移动通信公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:3D classification with point cloud input is a fundamental problem in 3D vision. However, due to the discrete nature and the insufficient material description of point cloud representations, there are ambiguities in distinguishing wire-like and flat surfaces, as well as transparent or reflective objects. To address these issues, we propose Gaussian Splatting (GS) point cloud-based 3D classification. We find that the scale and rotation coefficients in the GS point cloud help characterize surface types. Specifically, wire-like surfaces consist of multiple slender Gaussian ellipsoids, while flat surfaces are composed of a few flat Gaussian ellipsoids. Additionally, the opacity in the GS point cloud represents the transparency characteristics of objects. As a result, ambiguities in point cloud-based 3D classification can be mitigated utilizing GS point cloud as input. To verify the effectiveness of GS point cloud input, we construct the first real-world GS point cloud dataset in the community, which includes 20 categories with 200 objects in each category. Experiments not only validate the superiority of GS point cloud input, especially in distinguishing ambiguous objects, but also demonstrate the generalization ability across different classification methods.
zh
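摘要指出:细长的高斯椭球对应线状表面,扁平的高斯椭球对应平面。据此可以对每个高斯的三个尺度系数做一个简单的启发式判别。以下草图仅作示意,其中的阈值 ratio 与函数名均为假设,并非论文中实际使用的分类器:

```python
def ellipsoid_shape(scales, ratio=4.0):
    """Heuristically label a Gaussian ellipsoid by its three scale coefficients.

    'slender' : one axis dominates the other two (wire-like surfaces)
    'flat'    : one axis is much smaller than the other two (planar surfaces)
    'blob'    : otherwise
    The ratio threshold is an illustrative choice, not a value from the paper.
    """
    s = sorted(scales, reverse=True)  # s[0] >= s[1] >= s[2] > 0
    if s[0] > ratio * s[1]:
        return "slender"
    if s[1] > ratio * s[2]:
        return "flat"
    return "blob"

print(ellipsoid_shape((10.0, 1.0, 1.0)))  # one long axis -> slender
print(ellipsoid_shape((5.0, 5.0, 0.5)))   # one thin axis -> flat
print(ellipsoid_shape((2.0, 2.0, 2.0)))   # isotropic -> blob
```

结合不透明度(对应物体透明特性)等属性,这类逐高斯的形状特征正是GS点云输入能缓解线状/平面、透明/反射歧义的直观原因。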
[CV-75] Design and Implementation of FourCropNet: A CNN-Based System for Efficient Multi-Crop Disease Detection and Management
【速读】:该论文旨在解决农业领域中植物疾病检测这一关键任务,以提升作物产量、保障粮食安全并促进可持续农业实践。论文提出了一种名为FourCropNet的新颖深度学习模型,专门用于多作物(包括CottonLeaf、Grape、Soybean和Corn)疾病检测。该解决方案的关键在于其先进的架构设计,包括残差块(residual blocks)以实现高效特征提取、注意力机制(attention mechanisms)以增强对疾病相关区域的关注,以及轻量级层(lightweight layers)以提高计算效率。这些组件共同使FourCropNet在不同数据集和类别复杂度下均表现出卓越性能,从单一作物数据集到包含15个类别的组合数据集皆如此。实验结果表明,FourCropNet在葡萄、玉米及组合数据集上的准确率分别达到99.7%、99.5%和95.3%,且具有良好的可扩展性和泛化能力。与现有技术相比,FourCropNet在多种评估指标上始终优于MobileNet、VGG16和EfficientNet等模型。因此,FourCropNet凭借其创新设计和稳定表现,成为农业实时疾病检测的一种可靠工具,有望帮助农民及时诊断疾病,减少经济损失并推动可持续农业发展。
链接: https://arxiv.org/abs/2503.08348
作者: H. P. Khandagale,Sangram Patil,V. S. Gavali,S. V. Chavan,P. P. Halkarnikar,Prateek A. Meshram
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Plant disease detection is a critical task in agriculture, directly impacting crop yield, food security, and sustainable farming practices. This study proposes FourCropNet, a novel deep learning model designed to detect diseases in multiple crops, including CottonLeaf, Grape, Soybean, and Corn. The model leverages an advanced architecture comprising residual blocks for efficient feature extraction, attention mechanisms to enhance focus on disease-relevant regions, and lightweight layers for computational efficiency. These components collectively enable FourCropNet to achieve superior performance across varying datasets and class complexities, from single-crop datasets to combined datasets with 15 classes. The proposed model was evaluated on diverse datasets, demonstrating high accuracy, specificity, sensitivity, and F1 scores. Notably, FourCropNet achieved the highest accuracy of 99.7% for Grape, 99.5% for Corn, and 95.3% for the combined dataset. Its scalability and ability to generalize across datasets underscore its robustness. Comparative analysis shows that FourCropNet consistently outperforms state-of-the-art models such as MobileNet, VGG16, and EfficientNet across various metrics. FourCropNet’s innovative design and consistent performance make it a reliable solution for real-time disease detection in agriculture. This model has the potential to assist farmers in timely disease diagnosis, reducing economic losses and promoting sustainable agricultural practices.
zh
[CV-76] Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis
【速读】:该论文旨在解决文本条件扩散模型在生成高质量医学图像时可能被用于不道德用途的问题,特别是在保险欺诈或伪造医疗记录方面。为应对这一挑战,论文提出的关键解决方案是MedSign,这是一种基于深度学习的水印框架,专为文本到医学图像的合成设计。MedSign通过自适应调整水印强度来保护病理学上有意义的区域,其关键是利用跨注意力机制生成病理定位图,该图整合了不同层、头和时间步长上的标记级注意力。通过此地图优化Latent Diffusion Model (LDM)解码器,在图像合成过程中嵌入水印,确保水印与图像的一致性集成同时最小化对诊断关键区域的影响。实验结果表明,MedSign在MIMIC-CXR和OIA-ODIR数据集上实现了最先进的性能,既保持了诊断完整性又保证了水印的鲁棒性。
链接: https://arxiv.org/abs/2503.08346
作者: Chanyoung Kim,Dayun Ju,Jinyeong Kim,Woojung Han,Roberto Alcover-Couso,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques have emerged as a promising solution in general image domains, their direct application to medical imaging presents significant challenges. A key challenge is preserving fine-grained disease manifestations, as even minor distortions from a watermark may lead to clinical misinterpretation, which compromises diagnostic integrity. To overcome this gap, we present MedSign, a deep learning-based watermarking framework specifically designed for text-to-medical image synthesis, which preserves pathologically significant regions by adaptively adjusting watermark strength. Specifically, we generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and time steps. Leveraging this map, we optimize the LDM decoder to incorporate watermarking during image synthesis, ensuring cohesive integration while minimizing interference in diagnostically critical regions. Experimental results show that our MedSign preserves diagnostic integrity while ensuring watermark robustness, achieving state-of-the-art performance in image quality and detection accuracy on MIMIC-CXR and OIA-ODIR datasets.
zh
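需要说明的是,MedSign 是通过优化 LDM 解码器来隐式地实现区域自适应水印,并不直接套用显式的强度图;但其"在病理重要区域降低水印强度"的核心思想可以用一个显式映射来示意。下面草图中的线性映射与下限 floor 均为示意性假设:

```python
def adaptive_watermark_strength(pathology_map, base_strength=1.0, floor=0.1):
    """Scale a base watermark strength down inside pathologically important
    regions (pathology_map values in [0, 1], higher = more diagnostic).

    The linear mapping and the 'floor' minimum are illustrative assumptions;
    the paper optimizes the LDM decoder rather than applying an explicit map.
    """
    strengths = []
    for row in pathology_map:
        strengths.append([max(floor, base_strength * (1.0 - p)) for p in row])
    return strengths

heat = [[0.0, 0.5], [0.9, 1.0]]  # cross-attention-derived pathology map
print(adaptive_watermark_strength(heat))  # [[1.0, 0.5], [0.1, 0.1]]
```

这样,病理定位图取值越高的位置,嵌入的水印越弱,从而在保持水印鲁棒性的同时尽量不干扰诊断关键区域。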
[CV-77] DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
【速读】:该论文致力于解决自视角视频(egocentric videos)中环境理解的挑战,这类视频因佩戴者与环境的动态交互而具有复杂性,传统方法在处理这些场景时往往局限于孤立片段或未能有效整合丰富的语义与几何信息,从而限制了场景理解能力。论文的关键解决方案是提出了一种名为Dynamic Image-Video Feature Fields (DIV-FF) 的框架,该框架通过将自视角场景分解为持久性、动态性和角色相关的组成部分,并结合图像与视频语言特征,实现了详细的分割、捕捉功能性(affordances)、理解周围环境以及保持时间一致性等能力。相比现有技术,DIV-FF尤其在动态演变场景中表现出色,展现了其推动长期时空场景理解发展的潜力。
链接: https://arxiv.org/abs/2503.08344
作者: Lorenzo Mur-Labadia,Josechu Guerrero,Ruben Martinez-Cantin
机构: University of Zaragoza (萨拉戈萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer's engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components while integrating both image and video language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long-term, spatio-temporal scene understanding.
zh
[CV-78] Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs
【速读】:该论文试图解决多模态大语言模型(MLLMs)在处理任务时普遍存在幻觉(hallucinations)的问题。解决方案的关键在于提出了一种名为注意力重分配(Attention Reallocation, AttnReal)的方法,通过重新分配模型的注意力机制,将过多关注历史输出令牌(output tokens)的注意力转移至视觉令牌(visual tokens),从而减少模型对语言先验的依赖,使解码过程更多地基于视觉输入。这种方法几乎不增加额外的计算开销,同时通过调节AttnReal的强度,能够在响应准确性与整体性能之间实现广泛的权衡。实验结果验证了AttnReal在六种开源MLLMs和三种解码策略下的有效性。
链接: https://arxiv.org/abs/2503.08342
作者: Chongjun Tu,Peng Ye,Dongzhan Zhou,Lei Bai,Gang Yu,Tao Chen,Wanli Ouyang
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); StepFun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-Modal Large Language Models (MLLMs) stand out in various tasks but still struggle with hallucinations. While recent training-free mitigation methods mostly introduce additional inference overhead via retrospection strategy and contrastive decoding, we propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observations that, MLLM’s unreasonable attention distribution causes features to be dominated by historical output tokens, which further contributes to hallucinated responses because of the distribution gap between different token types. Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM’s reliance on language priors and ensures the decoding process depends more on the visual inputs. More interestingly, we find that, by controlling the intensity of AttnReal, we can achieve a wide-range trade-off between the response faithfulness and overall performance. Comprehensive results from different benchmarks validate the effectiveness of AttnReal across six open-source MLLMs and three decoding strategies.
zh
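AttnReal的核心操作是从历史输出token回收部分注意力,并重新分配给视觉token。下面用纯Python给出该思想的一个最小示意:按视觉token现有权重比例再分配、以及 recycle 强度参数,都是本文为演示所做的假设,并非论文的具体实现:

```python
def reallocate_attention(attn, is_visual, is_output, recycle=0.5):
    """Recycle a fraction of attention mass from historical output tokens and
    redistribute it over visual tokens (proportionally to their current
    weights), keeping the distribution normalized.

    The proportional redistribution rule and the 'recycle' intensity are
    illustrative assumptions; the paper controls intensity to trade response
    faithfulness against overall performance.
    """
    out_mass = sum(a for a, o in zip(attn, is_output) if o)
    vis_mass = sum(a for a, v in zip(attn, is_visual) if v)
    moved = recycle * out_mass
    new_attn = []
    for a, v, o in zip(attn, is_visual, is_output):
        if o:
            new_attn.append(a * (1.0 - recycle))      # shrink output-token attention
        elif v and vis_mass > 0:
            new_attn.append(a + moved * a / vis_mass)  # boost visual tokens
        else:
            new_attn.append(a)                         # other tokens unchanged
    return new_attn

attn      = [0.1, 0.1, 0.2, 0.6]   # a normalized attention row
is_visual = [True, True, False, False]
is_output = [False, False, False, True]
print(reallocate_attention(attn, is_visual, is_output))
```

调节 recycle 即对应论文中"控制AttnReal强度"以在忠实性与整体性能之间权衡,且整个操作无需任何额外的前向计算。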
[CV-79] Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework
【速读】:本文旨在解决正电子发射断层成像(PET)图像重建中的数据缺失问题,传统方法通常利用掩码完成图像修补(inpainting)任务,而将掩码机制引入PET重建框架具有变革潜力。为了解决这一挑战,论文提出了一种名为Diffusion tRansformer mEets rAndom Masks (DREAM) 的先进PET重建框架。关键创新在于首次将掩码机制同时集成到投影域(sinogram domain)和潜在空间(latent space),通过高维堆叠方法扩展解空间以捕获更丰富的空间关系,并设计了掩码驱动的潜在空间以加速扩散过程,同时保持计算效率与数据特征。此外,引入分层掩码策略引导模型从关注局部细节逐步过渡到全局模式,平衡了细节特征保留与整体上下文理解。这种综合方法显著提升了重建图像的整体质量和临床关键信息的保存能力。
链接: https://arxiv.org/abs/2503.08339
作者: Bin Huang,Binzhong He,Yanhan Chen,Zhili Liu,Xinyue Wang,Binxuan Li,Qiegen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning has significantly advanced PET image reconstruction, achieving remarkable improvements in image quality through direct training on sinogram or image data. Traditional methods often utilize masks for inpainting tasks, but their incorporation into PET reconstruction frameworks introduces transformative potential. In this study, we propose an advanced PET reconstruction framework called Diffusion tRansformer mEets rAndom Masks (DREAM). To the best of our knowledge, this is the first work to integrate mask mechanisms into both the sinogram domain and the latent space, pioneering their role in PET reconstruction and demonstrating their ability to enhance reconstruction fidelity and efficiency. The framework employs a high-dimensional stacking approach, transforming masked data from two to three dimensions to expand the solution space and enable the model to capture richer spatial relationships. Additionally, a mask-driven latent space is designed to accelerate the diffusion process by leveraging sinogram-driven and mask-driven compact priors, which reduce computational complexity while preserving essential data characteristics. A hierarchical masking strategy is also introduced, guiding the model from focusing on fine-grained local details in the early stages to capturing broader global patterns over time. This progressive approach ensures a balance between detailed feature preservation and comprehensive context understanding. Experimental results demonstrate that DREAM not only improves the overall quality of reconstructed PET images but also preserves critical clinical details, highlighting its potential to advance PET imaging technology. By integrating compact priors and hierarchical masking, DREAM offers a promising and efficient avenue for future research and application in PET imaging. The open-source code is available at: this https URL.
zh
[CV-80] Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving
【速读】:本文旨在解决基于现有二维视觉语言模型(Vision-Language Models, VLMs)的场景理解方法在处理动态驾驶环境中的局限性。这些方法依赖于有限的场景感知上下文,而忽略了点云传感器(如LiDAR)提供的丰富深度信息和细粒度三维表征,以及新兴的4D毫米波雷达在运动趋势、速度和反射强度检测方面的优势。为克服这些问题,论文提出了一种名为TPCNet的新方法,这是首个基于提示引导点云传感器组合范式的室外三维视觉定位模型,集成了LiDAR和雷达两种模态。关键在于设计了一个两阶段异构模态自适应融合(Two-Stage Heterogeneous Modal Adaptive Fusion)框架,包括双向代理交叉注意力(Bidirectional Agent Cross-Attention, BACA)用于全局感受野特征查询,动态门控图融合(Dynamic Gated Graph Fusion, DGGF)模块用于定位感兴趣区域,以及创新的基于最近物体边缘的三维回归头(C3D-RECHead),从而实现更精准的三维视觉定位。实验结果表明,TPCNet及其各模块在Talk2Radar和Talk2Car数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2503.08336
作者: Runwei Guan,Jianan Liu,Ningwei Ouyang,Daizong Liu,Xiaolou Sun,Lianqing Zheng,Ming Xu,Yutao Yue,Hui Xiong
机构: Thrust of Artificial Intelligence, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China (香港科技大学(广州)人工智能研究所,中国广州); Momoni AI, Gothenburg, Sweden (瑞典哥德堡Momoni AI); School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, China (西交利物浦大学先进技术学院,中国苏州); Wangxuan Institute of Computer Technology, Peking University, Beijing, China (北京大学王选计算机研究所,中国北京); School of Automation, Southeast University, Nanjing, China (东南大学自动化学院,中国南京); School of Automotive Studies, Tongji University, Shanghai 201804, China (同济大学汽车学院,中国上海)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures
点击查看摘要
Abstract:Embodied outdoor scene understanding forms the foundation for autonomous agents to perceive, analyze, and react to dynamic driving environments. However, existing 3D understanding is predominantly based on 2D Vision-Language Models (VLMs), collecting and processing limited scene-aware contexts. Instead, compared to the 2D planar visual information, point cloud sensors like LiDAR offer rich depth information and fine-grained 3D representations of objects. Meanwhile, the emerging 4D millimeter-wave (mmWave) radar is capable of detecting the motion trend, velocity, and reflection intensity of each object. Therefore, the integration of these two modalities provides more flexible querying conditions for natural language, enabling more accurate 3D visual grounding. To this end, in this paper, we exploratively propose a novel method called TPCNet, the first outdoor 3D visual grounding model upon the paradigm of prompt-guided point cloud sensor combination, including both LiDAR and radar contexts. To adaptively balance the features of these two sensors required by the prompt, we have designed a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Specifically, this paradigm initially employs Bidirectional Agent Cross-Attention (BACA), which feeds dual-sensor features, characterized by global receptive fields, to the text features for querying. Additionally, we have designed a Dynamic Gated Graph Fusion (DGGF) module to locate the regions of interest identified by the queries. To further enhance accuracy, we innovatively devise a C3D-RECHead based on the nearest object edge. Our experiments have demonstrated that our TPCNet, along with its individual modules, achieves state-of-the-art performance on both the Talk2Radar and Talk2Car datasets.
zh
[CV-81] Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
【速读】:该论文旨在解决长视频(从几分钟到几小时)在教育和新闻领域中多模态理解的挑战,特别是在需要专业知识的场景下,传统基于人工标注字幕的数据集难以满足需求的问题。为应对这一挑战,论文提出利用大型语言模型(Large Language Models, LLMs),结合自动语音识别(Automatic Speech Recognition, ASR)和光学字符识别(Optical Character Recognition, OCR)技术,从音频和视频帧中提取文本信息以实现对整个视频内容的理解。解决方案的关键在于通过ASR获取音频中的文本内容,通过OCR提取视频帧中的文本内容,并探索提示工程(prompt engineering)技术以优化LLMs对长视频多模态数据集的整体理解能力。
链接: https://arxiv.org/abs/2503.08335
作者: Soumya Shamarao Jahagirdar,Jayasree Saha,C V Jawahar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVIP 2024
点击查看摘要
Abstract:Learning multimodal video understanding typically relies on datasets comprising video clips paired with manually annotated captions. However, this becomes even more challenging when dealing with long-form videos, lasting from minutes to hours, in educational and news domains due to the need for more annotators with subject expertise. Hence, there arises a need for automated solutions. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content that allows the comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies. ASR provides textual content from audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches to understand their limitations on this dataset and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively.
zh
[CV-82] 1LoRA: Summation Compression for Very Low-Rank Adaptation
【速读】:本文旨在解决在参数高效微调(PEFT)方法中,如何进一步降低每层所需调整的参数量,同时保持或提升模型性能的问题。论文聚焦于“极低秩(very low rank)”场景,并提出了一种名为1LoRA(Summation Low-Rank Adaptation)的新方法。其关键在于通过固定压缩采用特征求和的方式,仅使用单个可训练向量进行解压缩,从而显著减少每个线性层所需的参数量。与现有先进方法(如LoRA、VeRA及MoRA)相比,1LoRA不仅大幅降低了内存占用和计算成本,还实现了更均衡的层间微调,避免了过度关注特定层(如注意力层),从而进一步提升了整体性能。
链接: https://arxiv.org/abs/2503.08333
作者: Alessio Quercia,Zhuo Cao,Arya Bangun,Richard D. Paul,Abigail Morrison,Ira Assent,Hanno Scharr
机构: Forschungszentrum Jülich (于利希研究中心); RWTH Aachen University (RWTH亚琛工业大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Aarhus University (奥胡斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods have transformed the approach to fine-tuning large models for downstream tasks by enabling the adjustment of significantly fewer parameters than those in the original model matrices. In this work, we study the “very low rank regime”, where we fine-tune the lowest amount of parameters per linear layer for each considered PEFT method. We propose 1LoRA (Summation Low-Rank Adaptation), a compute, parameter and memory efficient fine-tuning method which uses the feature sum as fixed compression and a single trainable vector as decompression. Differently from state-of-the-art PEFT methods like LoRA, VeRA, and the recent MoRA, 1LoRA uses fewer parameters per layer, reducing the memory footprint and the computational cost. We extensively evaluate our method against state-of-the-art PEFT methods on multiple fine-tuning tasks, and show that our method not only outperforms them, but is also more parameter, memory and computationally efficient. Moreover, thanks to its memory efficiency, 1LoRA allows to fine-tune more evenly across layers, instead of focusing on specific ones (e.g. attention layers), improving performance further.
zh
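按摘要的描述,1LoRA以"特征求和"作为固定压缩、以单个可训练向量作为解压,相当于在冻结层输出上叠加一个秩一修正 v·(1^T x)。以下纯Python草图是对摘要的一种解读,并非作者的具体实现:

```python
def one_lora_forward(W0, v, x):
    """1LoRA-style update sketch: the frozen layer output W0 @ x plus a
    rank-one correction v * sum(x), where the feature sum is the fixed
    compression and the single trainable vector v is the decompression.
    This is a reading of the abstract, not the authors' exact implementation.
    """
    s = sum(x)  # fixed compression: 1^T x (no trainable parameters)
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W0]  # frozen W0 @ x
    return [b + vi * s for b, vi in zip(base, v)]

W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0]]   # frozen 2x3 weight
v  = [0.1, -0.1]          # the only trainable parameters: d_out scalars
x  = [1.0, 2.0, 3.0]
print(one_lora_forward(W0, v, x))  # base [1.0, 2.0] shifted by v * 6.0
```

在这种解读下,每个线性层仅需 d_out 个可训练参数,而秩为 r 的 LoRA 需要约 r·(d_in + d_out) 个,这与摘要所述的内存与计算开销优势相吻合。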
[CV-83] MINT-Demo: Membership Inference Test Demonstrator CVPR24 AAAI25
【速读】:该论文试图解决机器学习模型训练过程缺乏透明性的问题,通过提出Membership Inference Test (MINT) 技术,验证是否特定数据曾被用于训练机器学习模型。解决方案的关键在于设计一种实验技术,能够高精度(高达89%的准确率)推断出训练数据集的成员资格信息,从而揭示模型训练过程中使用的数据来源,促进人工智能训练过程的透明化。
链接: https://arxiv.org/abs/2503.08332
作者: Daniel DeAlcala,Aythami Morales,Julian Fierrez,Gonzalo Mancera,Ruben Tolosana,Ruben Vera-Rodriguez
机构: Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid (马德里自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Demo paper presented at the CVPR'24 Demo Track and at the AAAI'25 AIGOV workshop
点击查看摘要
Abstract:We present the Membership Inference Test Demonstrator, to emphasize the need for more transparent machine learning training processes. MINT is a technique for experimentally determining whether certain data has been used during the training of machine learning models. We conduct experiments with popular face recognition models and 5 public databases containing over 22M images. Promising results of up to 89% accuracy are achieved, suggesting that it is possible to recognize if an AI model has been trained with specific data. Finally, we present the MINT platform as a demonstrator of this technology, aimed at promoting transparency in AI training.
zh
[CV-84] i-WiViG: Interpretable Window Vision GNN
[Quick Read]: This paper addresses the limited adoption of graph neural network (GNN)-based vision models in critical applications caused by their black-box nature. The key of the solution is the proposed Interpretable Window Vision GNN (i-WiViG), which provides explanations by automatically identifying the subgraphs relevant to the model's predictions. This is achieved through window-based image-graph processing, which restricts a node's receptive field to a local image region, combined with a self-interpretable graph bottleneck that ranks the importance of long-range relations between image regions. Experiments on remote-sensing classification and regression tasks show that the method achieves competitive performance while providing inherent and faithful explanations, and that it lowers the infidelity of post-hoc explanations compared with other vision GNN models while preserving explanation sparsity.
Link: https://arxiv.org/abs/2503.08321
Authors: Ivica Obadic, Dmitry Kangin, Dario Oliveira, Plamen P Angelov, Xiao Xiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning; University of Lancaster; University of Manchester; Getulio Vargas Foundation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning models based on graph neural networks have emerged as a popular approach for solving computer vision problems. They encode the image into a graph structure and can be beneficial for efficiently capturing the long-range dependencies typically present in remote sensing imagery. However, an important drawback of these methods is their black-box nature which may hamper their wider usage in critical applications. In this work, we tackle the self-interpretability of the graph-based vision models by proposing our Interpretable Window Vision GNN (i-WiViG) approach, which provides explanations by automatically identifying the relevant subgraphs for the model prediction. This is achieved with window-based image graph processing that constrains the node receptive field to a local image region and by using a self-interpretable graph bottleneck that ranks the importance of the long-range relations between the image regions. We evaluate our approach to remote sensing classification and regression tasks, showing it achieves competitive performance while providing inherent and faithful explanations through the identified relations. Further, the quantitative evaluation reveals that our model reduces the infidelity of post-hoc explanations compared to other Vision GNN models, without sacrificing explanation sparsity.
[CV-85] RFLAV: Rolling Flow matching for infinite Audio Video generation
[Quick Read]: This paper tackles three key challenges in joint audio-video (AV) generation: the quality of generated samples, multimodal synchronization and temporal coherence, and the ability to produce audio that matches the visual data (and vice versa) while supporting unlimited video duration. The paper proposes \arch, a novel transformer-based architecture, and explores three distinct cross-modality interaction modules, among which its lightweight temporal fusion module proves the most effective and computationally efficient for aligning the audio and visual modalities. The key of the solution lies in this efficient temporal fusion module, which significantly improves performance on multimodal AV generation tasks.
Link: https://arxiv.org/abs/2503.08307
Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Affiliations: University of Parma, Department of Engineering and Architecture; University of Siena, Department of Information engineering and mathematics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present \arch, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that \arch outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at this https URL.
[CV-86] Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach
[Quick Read]: This paper investigates how end-to-end-trained robots can exhibit complex navigation behavior in real environments, and seeks to understand the high-level reasoning they learn. The key contribution is a large-scale experimental study demonstrating that end-to-end-trained robots acquire realistic dynamics models usable for open-loop forecasting and use latent memory to store scene structure and information gathered during exploration. The study further finds evidence of fairly precise planning over a limited horizon, and a post-hoc analysis shows that the learned value function relates to long-term planning. Together, these results illustrate the new capabilities that tools from computer vision and sequential decision-making bring to robot control.
Link: https://arxiv.org/abs/2503.08306
Authors: Steeven Janny, Hervé Poirier, Leonid Antsfeld, Guillaume Bono, Gianluca Monaci, Boris Chidlovskii, Francesco Giuliari, Alessio Del Bue, Christian Wolf
Affiliations: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving \numepisodes navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture on how using tools from computer vision and sequential decision making have led to new capabilities in robotics and control. An interactive tool is available at this http URL.
[CV-87] Feature Alignment with Equivariant Convolutions for Burst Image Super-Resolution
[Quick Read]: This paper targets the insufficient accuracy of the alignment stage in Burst Image Super-Resolution (BISR). Existing methods typically rely on techniques such as deformable convolutions or optical flow, which either focus only on local transformations or lack theoretical grounding, limiting performance. To address this, the paper proposes a novel framework based on equivariant convolutions: the alignment transformation is learned via explicit supervision in the image domain, and the equivariance guarantees a theoretically sound correspondence between the image and feature domains, substantially improving alignment accuracy. In addition, a reconstruction module with advanced deep architectures is designed to improve the quality of the super-resolved output. Experiments show the method outperforms existing approaches in both quantitative metrics and visual quality.
Link: https://arxiv.org/abs/2503.08300
Authors: Xinyi Liu, Feiyu Tan, Qi Xie, Qian Zhao, Deyu Meng
Affiliations: Xi’an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Burst image processing (BIP), which captures and integrates multiple frames into a single high-quality image, is widely used in consumer cameras. As a typical BIP task, Burst Image Super-Resolution (BISR) has achieved notable progress through deep learning in recent years. Existing BISR methods typically involve three key stages: alignment, upsampling, and fusion, often in varying orders and implementations. Among these stages, alignment is particularly critical for ensuring accurate feature matching and further reconstruction. However, existing methods often rely on techniques such as deformable convolutions and optical flow to realize alignment, which either focus only on local transformations or lack theoretical grounding, thereby limiting their performance. To alleviate these issues, we propose a novel framework for BISR, featuring an equivariant convolution-based alignment, ensuring consistent transformations between the image and feature domains. This enables the alignment transformation to be learned via explicit supervision in the image domain and easily applied in the feature domain in a theoretically sound way, effectively improving alignment accuracy. Additionally, we design an effective reconstruction module with advanced deep architectures for upsampling and fusion to obtain the final BISR result. Extensive experiments on BISR benchmarks show the superior performance of our approach in both quantitative metrics and visual quality.
[CV-88] SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation WACV2025 WACV
[Quick Read]: This paper addresses the domain gap in semantic segmentation of high-resolution remote sensing images (HRSIs) caused by variations in geographic region, weather, and environment; existing methods are often restricted to specific data domains and require expert annotators and specialized equipment, limiting generalization and applicability. The paper proposes a novel unsupervised domain adaptation technique that uses geographic coordinates, readily available as metadata in remote sensing setups, to build positional encoding features, bridging the domain gap by combining an image's location encoding with the spherical nature of the Earth's surface. The key is the proposed SegDesicNet module, which regresses the GRID positional encoding of the geo-coordinates projected onto the unit sphere to obtain a domain loss, enabling more effective cross-domain semantic segmentation. Experiments show SegDesicNet improves mean intersection over union (MIoU) by approximately 6% over state-of-the-art domain adaptation methods on benchmarked subsets of the publicly available FLAIR #1 dataset, with roughly 27% fewer parameters, and its performance is further benchmarked on a custom split of the ISPRS Potsdam dataset. The algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human-centric and scalable.
Link: https://arxiv.org/abs/2503.08290
Authors: Sachin Verma, Frank Lindseth, Gabriel Kiss
Affiliations: Norwegian University of Science and Technology, Norway
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Semantic segmentation is essential for analyzing high-definition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existing methods are often limited to specific data domains and require expert annotators and specialized equipment for semantic labeling. In this study, we propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation by utilizing geographical coordinates that are readily accessible in remote sensing setups as metadata in a dataset. To bridge the domain gap, we propose a novel approach that considers the combination of an image's location encoding trait and the spherical nature of Earth's surface. Our proposed SegDesicNet module regresses the GRID positional encoding of the geo coordinates projected over the unit sphere to obtain the domain loss. Our experimental results demonstrate that the proposed SegDesicNet outperforms state of the art domain adaptation methods in remote sensing image segmentation, achieving an improvement of approximately 6% in the mean intersection over union (MIoU) with a roughly 27% drop in parameter count on benchmarked subsets of the publicly available FLAIR #1 dataset. We also benchmarked our method performance on the custom split of the ISPRS Potsdam dataset. Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human centric and scalable.
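The core geometric step, projecting geo-coordinates onto the unit sphere before regressing their positional encoding, can be sketched as follows. The function name and the degree convention are our assumptions; the paper's GRID encoding and regression head are not reproduced:

```python
import numpy as np

def geo_to_unit_sphere(lat_deg, lon_deg):
    """Map (latitude, longitude) in degrees onto 3D points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

pts = geo_to_unit_sphere([0.0, 0.0, 90.0], [0.0, 180.0, 0.0])
# Every projected point lies exactly on the unit sphere
assert np.allclose(np.linalg.norm(pts, axis=-1), 1.0)
```

Unlike raw longitude values, this representation is continuous across the antimeridian: longitudes 180 and -180 map to the same 3D point, which matches the spherical geometry the paper exploits.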
[CV-89] OminiControl2: Efficient Conditioning for Diffusion Transformers
[Quick Read]: This paper addresses the challenge of fine-grained control in the practical deployment of text-to-image diffusion transformer (DiT) models. Although recent methods such as OminiControl enable controllable generation from diverse control signals, they suffer significant computational inefficiency when handling long conditional inputs. The proposed OminiControl2 is an efficient framework for image-conditional image generation.
The key innovations of OminiControl2 are: (1) a dynamic compression strategy that streamlines conditional inputs by keeping only the most semantically relevant tokens during generation; and (2) a conditional feature reuse mechanism that computes condition-token features only once and reuses them across all denoising steps. These architectural improvements greatly reduce computational cost while preserving the original framework's parameter efficiency and multimodal versatility. Experiments show that OminiControl2 reduces conditional processing overhead by over 90% compared with its predecessor and achieves an overall 5.9x speedup in multi-conditional generation scenarios, making complex multimodal control practical for high-quality image synthesis.
Link: https://arxiv.org/abs/2503.08280
Authors: Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
Affiliations: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Fine-grained control of text-to-image diffusion transformer models (DiT) remains a critical challenge for practical deployment. While recent advances such as OminiControl and others have enabled a controllable generation of diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, an efficient framework that achieves efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework’s parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9 \times speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.
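The two mechanisms, top-k token compression and once-per-generation condition encoding, can be sketched independently of any diffusion backbone. The names, the relevance scores, and the keep ratio below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def compress_tokens(tokens, scores, keep_ratio=0.1):
    """Dynamic compression: keep only the highest-relevance condition tokens."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]   # indices of the top-k scores
    return tokens[keep]

class ConditionFeatureCache:
    """Feature reuse: encode condition tokens once, reuse at every denoising step."""
    def __init__(self, encoder):
        self.encoder = encoder
        self._features = None

    def get(self, tokens):
        if self._features is None:          # only the first call pays for encoding
            self._features = self.encoder(tokens)
        return self._features

encode_calls = []
def encoder(t):
    encode_calls.append(1)   # count how often the (expensive) encoder runs
    return t * 2.0

tokens = np.arange(100, dtype=float)
scores = np.linspace(0.0, 1.0, len(tokens))
kept = compress_tokens(tokens, scores, keep_ratio=0.1)   # 10 of 100 tokens survive

cache = ConditionFeatureCache(encoder)
for _ in range(50):                  # 50 denoising steps, but one encoder call
    feats = cache.get(kept)
```

The compounding of the two ideas is what the paper's >90% overhead reduction rests on: fewer tokens per step, and the per-token cost paid only once.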
[CV-90] PromptLNet: Region-Adaptive Aesthetic Enhancement via Prompt Guidance in Low-Light Enhancement Net
[Quick Read]: This paper addresses two gaps in low-light image enhancement: evaluation based on objective metrics (e.g., FID, PSNR) yields models that lack aesthetic quality, and existing methods mainly focus on global brightening while neglecting detail refinement. The key innovations are: 1) collecting human aesthetic evaluation text pairs and scores from multiple low-light datasets (e.g., LOL, LOL2, LOM, DCIM, MEF) to train a low-light image aesthetic evaluation model, combined with an optimization algorithm to fine-tune a diffusion model; and 2) a prompt-driven brightness adjustment module capable of fine-grained brightness and aesthetic adjustments for specific instances or regions. These improve the visual quality of generated images as well as flexibility and controllability in practical applications.
Link: https://arxiv.org/abs/2503.08276
Authors: Jun Yin, Yangfan He, Miao Zhang, Pengyu Zeng, Tianyi Wang, Shuai Lu, Xueqian Wang
Affiliations: Shenzhen International Graduate School, Tsinghua University, China; University of Minnesota - Twin Cities, USA; Department of Mechanical Engineering & Materials Science, Yale University, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Learning and improving large language models through human preference feedback has become a mainstream approach, but it has rarely been applied to the field of low-light image enhancement. Existing low-light enhancement evaluations typically rely on objective metrics (such as FID, PSNR, etc.), which often result in models that perform well objectively but lack aesthetic quality. Moreover, most low-light enhancement models are primarily designed for global brightening, lacking detailed refinement. Therefore, the generated images often require additional local adjustments, leading to research gaps in practical applications. To bridge this gap, we propose the following innovations: 1) We collect human aesthetic evaluation text pairs and aesthetic scores from multiple low-light image datasets (e.g., LOL, LOL2, LOM, DCIM, MEF, etc.) to train a low-light image aesthetic evaluation model, supplemented by an optimization algorithm designed to fine-tune the diffusion model. 2) We propose a prompt-driven brightness adjustment module capable of performing fine-grained brightness and aesthetic adjustments for specific instances or regions. 3) We evaluate our method alongside existing state-of-the-art algorithms on mainstream benchmarks. Experimental results show that our method not only outperforms traditional methods in terms of visual quality but also provides greater flexibility and controllability, paving the way for improved aesthetic quality.
[CV-91] HERO: Human Reaction Generation from Videos
[Quick Read]: This paper aims to overcome the limitations of previous human reaction generation methods, which only synthesize reactive motion for a given human motion sequence, restrict interactions to human-human cases, and ignore the influence of emotion. The paper instead proposes generating 3D human reactions from RGB videos, which covers a wider range of interaction categories and naturally contains expression information that may reflect the subject's emotions. The key is the proposed HERO framework, which combines global and frame-level local representations of the video to extract the interaction intention and then uses it to guide reaction synthesis. In addition, local visual representations are continuously injected into the model to maximally exploit the dynamic properties inherent in videos. The authors also collect the ViMo dataset of paired Video-Motion data to support the task; beyond human-human interactions, it also covers animal-human and scene-human interactions. Extensive experiments verify the effectiveness of the method.
Link: https://arxiv.org/abs/2503.08270
Authors: Chengjun Yu, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human reaction generation represents a significant research domain for interactive AI, as humans constantly interact with their surroundings. Previous works focus mainly on synthesizing the reactive motion given a human motion sequence. This paradigm limits interaction categories to human-human interactions and ignores emotions that may influence reaction generation. In this work, we propose to generate 3D human reactions from RGB videos, which involves a wider range of interaction categories and naturally provides information about expressions that may reflect the subject’s emotions. To cope with this task, we present HERO, a simple yet powerful framework for Human rEaction geneRation from videOs. HERO considers both global and frame-level local representations of the video to extract the interaction intention, and then uses the extracted interaction intention to guide the synthesis of the reaction. Besides, local visual representations are continuously injected into the model to maximize the exploitation of the dynamic properties inherent in videos. Furthermore, the ViMo dataset containing paired Video-Motion data is collected to support the task. In addition to human-human interactions, these video-motion pairs also cover animal-human interactions and scene-human interactions. Extensive experiments demonstrate the superiority of our methodology. The code and dataset will be publicly available at this https URL.
[CV-92] Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks CVPR-25
[Quick Read]: This paper addresses the problem that existing Customized Portrait Generation (CPG) methods produce high-fidelity facial images but cannot prevent malicious face recognition systems from tracking and misusing them. The proposed solution is a customized portrait generation framework with facial adversarial attacks (Adv-CPG). Its key components are a lightweight local ID encryptor and an encryption enhancer, which achieve progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, a multi-modal image customizer is developed to generate controllable fine-grained facial features for fine-grained, personalized portrait generation. To the authors' knowledge, Adv-CPG is the first study to introduce facial adversarial attacks into CPG. Experiments show its average attack success rate is 28.1% and 2.86% higher than state-of-the-art noise-based attack methods and unconstrained attack methods, respectively.
Link: https://arxiv.org/abs/2503.08269
Authors: Junying Wang, Hongyuan Zhang, Yuan Yuan
Affiliations: School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University; The University of Hong Kong; Institute of Artificial Intelligence (TeleAI), China Telecom
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by CVPR-25
Abstract:Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.
[CV-93] DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness CVPR2025
[Quick Read]: This paper addresses the robustness challenge of generating high-quality, usable grasping poses for general-purpose dexterous hands handling arbitrary objects. The key of the proposed DexGrasp Anything method is effectively integrating physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance on nearly all open datasets. The paper also contributes a new dexterous grasping dataset with over 3.4 million diverse grasping poses across more than 15,000 objects, demonstrating its potential to advance universal dexterous grasping.
Link: https://arxiv.org/abs/2503.08257
Authors: Yiming Zhong, Qi Jiang, Jingyi Yu, Yuexin Ma
Affiliations: ShanghaiTech University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted by CVPR 2025
Abstract:A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.
[CV-94] SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models
[Quick Read]: This paper addresses the fundamental trade-off between training efficiency and generation quality in modern diffusion models. Existing representation alignment methods such as REPA accelerate convergence through patch-wise alignment, but they often fail to capture structural relations within visual representations or to ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, the paper proposes SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic detail; (2) autocorrelation-matrix alignment to maintain structural consistency within representations; and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike prior approaches, SARA explicitly models intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
Link: https://arxiv.org/abs/2503.08253
Authors: Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li
Affiliations: Fudan University; Shanghai Academy of Artificial Intelligence for Science; Australian Institute for Machine Learning, The University of Adelaide
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report
Abstract:Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
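The second constraint above, aligning autocorrelation (self-similarity) matrices so that pairwise patch relations match rather than individual patch values, can be sketched as follows. The function names and the random toy features are ours; the real model compares deep features from the denoiser and a pretrained encoder:

```python
import numpy as np

def self_similarity(feats):
    """Autocorrelation matrix of L2-normalized patch features (N patches x D dims)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def structural_alignment_loss(denoiser_feats, encoder_feats):
    # Penalize differences in pairwise patch relations, not raw feature values
    d = self_similarity(denoiser_feats) - self_similarity(encoder_feats)
    return float(np.mean(d ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 32))   # 16 patch features from a pretrained encoder
# Uniform rescaling changes every feature but leaves patch-to-patch structure intact
assert structural_alignment_loss(3.0 * teacher, teacher) < 1e-12
```

This illustrates why the constraint is "structural": the loss is invariant to transformations that preserve the relations between patches, which plain patch-wise alignment would penalize.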
[CV-95] Aligning Text to Image in Diffusion Models is Easier Than You Think
[Quick Read]: This paper addresses the residual misalignment that remains between text and image representations in text-to-image (T2I) generative models. While prior work mitigates the issue by fine-tuning models with various reward models, the paper revisits it from the perspective of representation alignment (REPA) and argues there is still room for improvement. The key solution is SoftREPA, a lightweight contrastive fine-tuning strategy whose innovation lies in using soft text tokens to exploit both positive and negative pairs for more effective text-image representation alignment. The method adds fewer than 1M trainable parameters to the pretrained model, notably improving semantic consistency with minimal computational overhead. A theoretical analysis shows that the method explicitly increases the mutual information between text and image representations, and experiments on text-to-image generation and text-guided image editing validate its effectiveness.
Link: https://arxiv.org/abs/2503.08250
Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
Affiliations: Kim Jaechul Graduate School of AI, KAIST; Department of Bio and Brain Engineering, KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Although many approaches have attempted to address this issue by fine-tuning models using various reward models, etc., we revisit the challenge from the perspective of representation alignment-an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, is suboptimal from the standpoint of representation alignment. Instead, a better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
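The contrastive idea, matched text-image pairs as positives and all other pairings in the batch as negatives, follows the standard InfoNCE shape. A sketch under that assumption; the temperature, the toy features, and the function name are ours, and SoftREPA's soft text tokens are not reproduced:

```python
import numpy as np

def contrastive_alignment_loss(img, txt, temperature=0.1):
    """InfoNCE over a batch: diagonal entries are the positive (matched) pairs."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(np.diag(probs))))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
aligned = contrastive_alignment_loss(img, img)          # perfectly matched pairs
mismatched = contrastive_alignment_loss(img, img[::-1]) # shuffled (wrong) pairs
assert aligned < mismatched
```

Minimizing such a loss pushes matched pairs together and mismatched pairs apart, which is the mechanism behind the mutual-information increase the paper analyzes.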
[CV-96] HASARD: A Benchmark for Vision-Based Safe Reinforcement Learning in Embodied Agents ICLR2025
[Quick Read]: This paper addresses the limitation that existing vision-based 3D benchmarks cover only simple navigation tasks, and proposes HASARD, a new benchmark for evaluating the safety and competence of reinforcement learning (RL) agents in complex environments. HASARD contains diverse and complex tasks requiring strategic decision-making, comprehension of spatial relationships, and short-term future prediction, with three difficulty levels and two action spaces. Its key innovation is an exclusive focus on egocentric vision-based learning, offering an open-source, cost-effective, and insightful way to explore the potential and boundaries of current and future safe RL methods. In addition, visualizing agent navigation during training with top-down heatmaps provides insight into the learning process, and incrementally training across difficulty levels offers an implicit learning curriculum.
Link: https://arxiv.org/abs/2503.08241
Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
Affiliations: Eindhoven University of Technology; University of Liverpool
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to ICLR 2025
Abstract:Advancing safe autonomous systems through reinforcement learning (RL) requires robust benchmarks to evaluate performance, analyze methods, and assess agent competencies. Humans primarily rely on embodied visual perception to safely navigate and interact with their surroundings, making it a valuable capability for RL agents. However, existing vision-based 3D benchmarks only consider simple navigation tasks. To address this shortcoming, we introduce \textbfHASARD, a suite of diverse and complex tasks to \textbfHA rness \textbfSA fe \textbfR L with \textbfD oom, requiring strategic decision-making, comprehending spatial relationships, and predicting the short-term future. HASARD features three difficulty levels and two action spaces. An empirical evaluation of popular baseline methods demonstrates the benchmark’s complexity, unique challenges, and reward-cost trade-offs. Visualizing agent navigation during training with top-down heatmaps provides insight into a method’s learning process. Incrementally training across difficulty levels offers an implicit learning curriculum. HASARD is the first safe RL benchmark to exclusively target egocentric vision-based learning, offering a cost-effective and insightful way to explore the potential and boundaries of current and future safe RL methods. The environments and baseline implementations are open-sourced at this https URL.
[CV-97] EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification
[Quick Read]: This paper addresses the difficulty of feature extraction and classification in hyperspectral imaging (HSI) caused by the data's high dimensionality and spectral variability. The key of the proposed EnergyFormer framework lies in three innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features and improve feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) an Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures to strengthen representation learning. Experiments show EnergyFormer achieves overall accuracies of 99.28%, 98.63%, and 98.72% on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets, respectively, clearly outperforming state-of-the-art CNN, transformer, and Mamba-based models.
Link: https://arxiv.org/abs/2503.08239
Authors: Saad Sohail, Muhammad Usama, Usman Ghous, Manuel Mazzara, Salvatore Distefano, Muhammad Ahmad
Affiliations: Department of Computer Science, National University of Computer and Emerging Sciences (NUCES); Institute of Software Development and Engineering, Innopolis University; Dipartimento di Matematica e Informatica—MIFT, University of Messina
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Hyperspectral imaging (HSI) provides rich spectral-spatial information across hundreds of contiguous bands, enabling precise material discrimination in applications such as environmental monitoring, agriculture, and urban analysis. However, the high dimensionality and spectral variability of HSI data pose significant challenges for feature extraction and classification. This paper presents EnergyFormer, a transformer-based framework designed to address these challenges through three key innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features, improving feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures, enhancing representation learning. Extensive experiments on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets demonstrate that EnergyFormer achieves exceptional overall accuracies of 99.28%, 98.63%, and 98.72%, respectively, outperforming state-of-the-art CNN, transformer, and Mamba-based models. The source code will be made available at this https URL.
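FoPE builds on the classic fixed sinusoidal (Fourier) position embedding; the adaptive part described in the abstract is not reproduced here, so take this as a hedged sketch of the underlying construction only:

```python
import numpy as np

def fourier_position_embedding(positions, dim):
    """Classic sinusoidal embedding: (sin, cos) pairs at geometric frequencies."""
    positions = np.asarray(positions, dtype=float)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))  # dim/2 frequencies
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

emb = fourier_position_embedding(np.arange(64), dim=32)
assert emb.shape == (64, 32)
```

Each position (spectral band or spatial index) gets a distinct, bounded vector whose inner products vary smoothly with distance, which is what lets attention layers pick up long-range dependencies.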
[CV-98] Modeling Variants of Prompts for Vision-Language Models
[Quick Read]: This paper addresses the high sensitivity of large pre-trained vision-language models (VLMs) to prompt-template design. Although prompt-learning methods alleviate this sensitivity by replacing natural-language prompts with learnable ones, the learned prompts are incomprehensible to humans. To ensure consistent performance across prompt templates and strengthen adaptation to diverse phrasings, the paper introduces the RobustPrompt Benchmark for systematically evaluating VLM robustness to prompt-template variation, and proposes Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. MVP's innovation lies in decoupling prompts into templates and class names and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments on 11 datasets show MVP greatly enhances robustness to variations in input prompts without a drop in performance.
Link: https://arxiv.org/abs/2503.08229
Authors: Ao Li, Zongfang Liu, Xinhua Li, Jinghui Zhang, Pengwei Wang, Hu Wang
Affiliations: Shandong University; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:Large pre-trained vision-language models (VLMs) offer a promising approach to leveraging human language for enhancing downstream tasks. However, VLMs such as CLIP face significant limitation: its performance is highly sensitive to prompt template design. Although prompt learning methods can address the sensitivity issue by replacing natural language prompts with learnable ones, they are incomprehensible to humans. Ensuring consistent performance across various prompt templates enables models to adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without requiring extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark to evaluate robustness to different prompt templates for VLMs. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types, covering a wide variety of commonly used templates. Beside the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. The code is available at this https URL.
[CV-99] HRAvatar: High-Quality and Relightable Gaussian Head Avatar
[Quick Read]: This paper tackles the difficulty of reconstructing animatable, high-quality 3D head avatars from monocular video, with particular attention to realistic relighting. Existing methods combining 3D Gaussian Splatting with parametric head models achieve real-time performance but suffer from inaccurate face tracking and the limited expressiveness of the deformation model, and they fail to produce realistic effects under novel lighting conditions. The key of the proposed 3DGS-based HRAvatar is to reduce tracking errors through end-to-end optimization and to better capture individual facial deformations with learnable blendshapes and linear blend skinning; in addition, HRAvatar decomposes head appearance into several physical properties and incorporates physically based shading to account for environmental lighting, achieving high-fidelity, relightable 3D head avatar reconstruction.
Link: https://arxiv.org/abs/2503.08224
Authors: Dongbin Zhang, Yunfei Liu, Lijian Lin, Ye Zhu, Kangjie Chen, Minghan Qin, Yu Li, Haoqian Wang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; International Digital Economy Academy (IDEA)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Reconstructing animatable and high-quality 3D head avatars from monocular videos, especially with realistic relighting, is a valuable task. However, the limited information from single-view input, combined with the complex head poses and facial movements, makes this challenging. Previous methods achieve real-time performance by combining 3D Gaussian Splatting with a parametric head model, but the resulting head quality suffers from inaccurate face tracking and limited expressiveness of the deformation model. These methods also fail to produce realistic effects under novel lighting conditions. To address these issues, we propose HRAvatar, a 3DGS-based method that reconstructs high-fidelity, relightable 3D head avatars. HRAvatar reduces tracking errors through end-to-end optimization and better captures individual facial deformations using learnable blendshapes and learnable linear blend skinning. Additionally, it decomposes head appearance into several physical properties and incorporates physically-based shading to account for environmental lighting. Extensive experiments demonstrate that HRAvatar not only reconstructs superior-quality heads but also achieves realistic visual effects under varying lighting conditions.
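Linear blend skinning, the deformation component named above, moves each point by a per-point weighted blend of per-joint rigid transforms. A standard sketch; the toy joints and weights are our own, not HRAvatar's learned ones:

```python
import numpy as np

def linear_blend_skinning(points, weights, joint_transforms):
    """Deform points by a convex blend of per-joint 4x4 rigid transforms.

    points: (N, 3); weights: (N, J), rows summing to 1; joint_transforms: (J, 4, 4)
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)   # (N, 4)
    blended = np.einsum('nj,jab->nab', weights, joint_transforms)        # (N, 4, 4)
    deformed = np.einsum('nab,nb->na', blended, homo)
    return deformed[:, :3]

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5]])      # second point split between 2 joints
T = np.stack([np.eye(4), np.eye(4)])        # both joints at rest
# Identity transforms leave every point where it was
assert np.allclose(linear_blend_skinning(pts, w, T), pts)
```

Making the weights (and blendshapes) learnable, as HRAvatar does, lets the model fit individual facial deformations instead of relying on a fixed parametric rig.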
zh
[CV-100] EgoBlind: Towards Egocentric Visual Assistance for the Blind People
【速读】:该论文试图解决盲人视觉辅助领域中当代多模态大型语言模型(Multimodal Large Language Models, MLLMs)在第一视角(Egocentric)视觉问答任务上的性能不足问题。论文的关键在于构建了一个名为EgoBlind的新数据集,该数据集由盲人亲自参与录制视频(1,210段)并提出或验证问题(4,927个),以真实反映他们在日常生活中的视觉需求。通过这一数据集,论文全面评估了15种领先的MLLMs,并揭示其在该任务上的准确率仅为约56%,远低于人类水平的87.4%。基于此,论文进一步识别并总结了现有MLLMs的主要局限性,并提出了针对性的改进建议,旨在为开发更有效的AI辅助工具以提升盲人的生活独立性提供坚实基础。
链接: https://arxiv.org/abs/2503.08221
作者: Junbin Xiao,Nanxin Huang,Hao Qiu,Zhulin Tao,Xun Yang,Richang Hong,Meng Wang,Angela Yao
机构: National University of Singapore (新加坡国立大学); Communication University of China (中国传媒大学); University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Preprint. Under Review
点击查看摘要
Abstract:We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,210 videos that record the daily lives of real blind users from a first-person perspective. It also features 4,927 questions directly posed or generated and verified by blind individuals to reflect their needs for visual assistance under various scenarios. We provide each question with an average of 3 reference answers to alleviate subjective evaluation. Using EgoBlind, we comprehensively evaluate 15 leading MLLMs and find that all models struggle, with the best performers achieving accuracy around 56%, far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and provide heuristic suggestions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants to enhance the independence of the blind individuals’ lives.
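EgoBlind 为每个问题提供约 3 个参考答案以缓解评测的主观性。下面是按“命中任一参考答案即判对”统计准确率的最小示意(精确字符串匹配仅为示意假设,实际 VideoQA 评测通常采用更宽松的语义匹配):

```python
def multi_reference_accuracy(predictions, references):
    """Fraction of predictions that match any of their reference answers.

    predictions: list of answer strings; references: list of lists of
    acceptable answers for each question (exact match is an assumption).
    """
    correct = sum(
        pred.strip().lower() in {r.strip().lower() for r in refs}
        for pred, refs in zip(predictions, references)
    )
    return correct / len(predictions)

preds = ["The door is open", "red"]
refs = [["the door is open", "door open"], ["blue", "navy"]]
print(multi_reference_accuracy(preds, refs))  # → 0.5
```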
zh
[CV-101] CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning ICCV2023
【速读】:该论文旨在解决无监督多视图立体视觉(Unsupervised Multi-View Stereo, MVS)方法中存在的两个主要问题:一是由于光度一致性假设导致在不可区分区域(如低纹理区域或反射区域)的深度估计不完整;二是对视角相关的效应(view-dependent effects)鲁棒性不足。为了解决这些问题,论文提出了一种新的双层级对比学习方法CL-MVSNet。其关键是将图像级对比分支与场景级对比分支集成到无监督MVS框架中,以构建额外的监督信号。图像级对比分支增强上下文感知能力,改善不可区分区域的深度估计完整性;场景级对比分支提升表征能力,提高对视角相关效应的鲁棒性。此外,引入L0.5光度一致性损失进一步优化模型对精确点的关注,同时缓解对不良点的梯度惩罚。
链接: https://arxiv.org/abs/2503.08219
作者: Kaiqiang Xiong,Rui Peng,Zhe Zhang,Tianxing Feng,Jianbo Jiao,Feng Gao,Ronggang Wang
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); Peng Cheng Laboratory (鹏城实验室); Migu Culture Technology Co., Ltd (咪咕文化科技有限公司); School of Computer Science, University of Birmingham (伯明翰大学计算机科学学院); School of Arts, Peking University (北京大学艺术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICCV2023
点击查看摘要
Abstract:Unsupervised Multi-View Stereo (MVS) methods have achieved promising progress recently. However, previous methods primarily depend on the photometric consistency assumption, which may suffer from two limitations: indistinguishable regions and view-dependent effects, e.g., low-textured areas and reflections. To address these issues, in this paper, we propose a new dual-level contrastive learning approach, named CL-MVSNet. Specifically, our model integrates two contrastive branches into an unsupervised MVS framework to construct additional supervisory signals. On the one hand, we present an image-level contrastive branch to guide the model to acquire more context awareness, thus leading to more complete depth estimation in indistinguishable regions. On the other hand, we exploit a scene-level contrastive branch to boost the representation ability, improving robustness to view-dependent effects. Moreover, to recover more accurate 3D geometry, we introduce an L0.5 photometric consistency loss, which encourages the model to focus more on accurate points while mitigating the gradient penalty of undesirable ones. Extensive experiments on DTU and Tanks & Temples benchmarks demonstrate that our approach achieves state-of-the-art performance among all end-to-end unsupervised MVS frameworks and outperforms its supervised counterpart by a considerable margin without fine-tuning.
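文中的 L0.5 光度一致性损失把绝对误差取 0.5 次幂,相比 L1 能削弱离群点的梯度惩罚、让模型更关注已较准确的点。一个最小数值示意(eps 平滑项是我们为避免零点梯度发散而添加的假设):

```python
import numpy as np

def l05_photometric_loss(pred, target, eps=1e-6):
    """Mean |pred - target|^0.5: relative to L1, large outliers contribute
    a smaller gradient, matching the paper's motivation. eps keeps the
    gradient finite at zero error (our addition)."""
    return np.mean(np.sqrt(np.abs(pred - target) + eps))

pred = np.array([0.0, 0.25])
target = np.array([0.0, 0.0])
print(l05_photometric_loss(pred, target))  # ≈ (0 + sqrt(0.25)) / 2 = 0.25
```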
zh
[CV-102] MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior
【速读】:本文旨在解决基于单目图像进行高质量自由视点3D人体重建的问题。现有方法通常依赖扩散模型进行引导,通过Score Distillation Sampling (SDS)优化3D表示或生成背面图像以辅助重建,但这些方法容易产生不理想的结果(如人体结构扁平化或因多视角不一致先验导致的过度平滑)且在真实场景中的泛化能力有限。为此,本文提出了一种名为MVD-HuGaS的方法,通过多视角人体扩散模型实现从单目图像到自由视点3D人体渲染的能力。关键在于首先利用经过高质量3D人体数据集微调的增强型多视角扩散模型生成多视角图像,并结合3D几何先验和人体结构先验;其次引入对齐模块以联合优化3D高斯体素和相机姿态,从而提高重建精度;此外,通过基于深度的面部畸变缓解模块进一步优化人脸区域,提升整体重建质量。最终,利用优化后的多视角图像及其精确的相机姿态,MVD-HuGaS实现了目标人体3D高斯体素的优化,从而获得高保真的自由视点渲染结果。实验表明,该方法在Thuman2.0和2K2K数据集上的单目3D人体渲染任务中达到了当前最优性能。
链接: https://arxiv.org/abs/2503.08218
作者: Kaiqiang Xiong,Ying Feng,Qi Zhang,Jianbo Jiao,Yang Zhao,Zhihao Liang,Huachen Gao,Ronggang Wang
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); Peng Cheng Laboratory (鹏城实验室); Migu Culture Technology Co., Ltd (咪咕文化科技有限公司); vivo Mobile Communication (Hangzhou) Co., Ltd. (维沃移动通信(杭州)有限公司); vivo Mobile Communication Co. Ltd (维沃移动通信有限公司); School of Computer Science, University of Birmingham (伯明翰大学计算机科学学院); School of Computer and Information, Hefei University of Technology (合肥工业大学计算机与信息学院); South China University of Technology (华南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D human reconstruction from a single image is a challenging problem and has been extensively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling (SDS) or generating one back-view image to facilitate reconstruction. However, these methods tend to produce unsatisfactory artifacts (e.g., flattened human structure or over-smoothed results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present MVD-HuGaS, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the results. Finally, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.
zh
[CV-103] S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
【速读】:该论文致力于解决大规模街景场景中基于3D Gaussian Splatting (3DGS) 方法的高重建成本问题,特别是在场景规模增大时,现有方法会导致每视点重建成本迅速增加,带来显著的计算开销。论文通过重新审视传统流程,发现三个关键因素导致此问题:不必要的局部到全局变换、过多的3D到2D投影以及低效的远距离内容渲染。为应对这些挑战,论文提出了一种名为S3R-GS的3DGS框架,该框架通过Streamlining(流线型)优化大规模街景重建流程,有效缓解了上述限制。此外,大多数现有的街景3DGS方法依赖于精确的三维边界框来分离动态与静态组件,但三维边界框难以获取,限制了其实际应用。为此,论文提出了一种替代方案,使用二维边界框,这些边界框更容易标注或可通过现成的视觉基础模型预测。这些设计使S3R-GS能够更好地适应大规模真实世界场景。大量实验表明,S3R-GS不仅提升了渲染质量,还显著加速了重建过程,在Argoverse2数据集的视频测试中,达到了最先进的PSNR和SSIM性能,将重建时间减少至竞争方法的50%甚至20%。
链接: https://arxiv.org/abs/2503.08217
作者: Guangting Zheng,Jiajun Deng,Xiaomeng Chu,Yu Yuan,Houqiang Li,Yanyong Zhang
机构: University of Science and Technology of China (中国科学技术大学); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM, reducing reconstruction time to below 50% (and even 20%) of that required by competing methods.
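文中以 PSNR/SSIM 衡量渲染质量。作为参照,PSNR 的标准计算如下(data_range 取 1.0 对应归一化图像;这是通用指标实现,与论文方法本身无关):

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(psnr(a, b))  # MSE = 0.01 → 20.0 dB
```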
zh
[CV-104] Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
【速读】:该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在处理视觉任务时易产生幻觉(hallucinations)的问题。现有研究表明,幻觉的主要原因是模型在解码过程中对图像标记的注意力不足,而本文的研究发现,幻觉还来源于指令标记(instruction tokens)的干扰。为此,论文提出了一个名为“注意力劫持”(Attention Hijacking)的新现象,指出某些干扰性指令标记会误导模型的视觉感知,使其注意力偏离图像的关键区域,从而限制了模型整合更广泛的上下文信息的能力,最终导致幻觉的产生。
为了解决这一问题,论文提出了一种无需训练的策略——注意力劫持者检测与解耦(Attention Hijackers Detection and Disentanglement, AID)。AID 的关键在于通过三个组件来消除干扰性指令标记的影响:首先,注意力劫持者检测模块通过计算由指令驱动的视觉显著性来识别注意力劫持者;其次,注意力解耦机制通过屏蔽这些被识别出的劫持者的视觉注意力,减轻其对后续标记的干扰;最后,重新解耦模块重新平衡指令驱动和图像驱动的视觉显著性,避免过度屏蔽的影响。实验结果表明,AID 在多个基准数据集上显著减少了各种LVLMs的幻觉现象。
链接: https://arxiv.org/abs/2503.08216
作者: Beitao Chen,Xinyu Lyu,Lianli Gao,Jingkuan Song,Heng Tao Shen
机构: Center for Future Media, University of Electronic Science and Technology of China (电子科技大学未来媒体中心); Southwestern University of Finance and Economics (西南财经大学, 中国); Tongji University (同济大学); Engineering Research Center of Intelligent Finance, Ministry of Education (智能金融教育部工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite their success, Large Vision-Language Models (LVLMs) remain vulnerable to hallucinations. While existing studies attribute the cause of hallucinations to insufficient visual attention to image tokens, our findings indicate that hallucinations also arise from interference from instruction tokens during decoding. Intuitively, certain instruction tokens continuously distort LVLMs’ visual perception during decoding, hijacking their visual attention toward less discriminative visual regions. This distortion prevents them from integrating broader contextual information from images, ultimately leading to hallucinations. We term this phenomenon ‘Attention Hijacking’, where disruptive instruction tokens act as ‘Attention Hijackers’. To address this, we propose a novel, training-free strategy, namely Attention Hijackers Detection and Disentanglement (AID), designed to isolate the influence of Hijackers, enabling LVLMs to rely on their context-aware intrinsic attention map. Specifically, AID consists of three components: First, Attention Hijackers Detection identifies Attention Hijackers by calculating instruction-driven visual salience. Next, an Attention Disentanglement mechanism is proposed to mask the visual attention of these identified Hijackers, and thereby mitigate their disruptive influence on subsequent tokens. Finally, Re-Disentanglement recalculates the balance between instruction-driven and image-driven visual salience to avoid over-masking effects. Extensive experiments demonstrate that AID significantly reduces hallucination across various LVLMs on several benchmarks.
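AID 的“检测—屏蔽”思路可以用一个玩具版示意:先按指令 token 投向图像 token 的注意力质量计算视觉显著性,超过阈值者视为劫持者并屏蔽其视觉注意力(阈值规则与具体判据均为我们的简化假设,并非论文原始判据):

```python
import numpy as np

def detect_and_mask_hijackers(attn, num_image_tokens, thresh=0.6):
    """Toy detect-and-mask: flag instruction tokens whose attention mass on
    image tokens exceeds `thresh`, then zero out their visual attention.

    attn: (T, T) row-stochastic attention map; the first `num_image_tokens`
    columns are image tokens, the remaining columns are instruction tokens.
    """
    T = attn.shape[0]
    salience = attn[:, :num_image_tokens].sum(axis=1)        # visual mass per row
    hijackers = [t for t in range(num_image_tokens, T) if salience[t] > thresh]
    masked = attn.copy()
    for t in hijackers:
        masked[t, :num_image_tokens] = 0.0                   # drop visual attention
        masked[t] /= masked[t].sum()                         # re-normalize the row
    return hijackers, masked

attn = np.array([
    [0.5, 0.3, 0.1, 0.1],    # image token 0
    [0.2, 0.6, 0.1, 0.1],    # image token 1
    [0.8, 0.1, 0.05, 0.05],  # instruction token, visual mass 0.9 → hijacker
    [0.1, 0.2, 0.3, 0.4],    # instruction token, visual mass 0.3 → kept
])
hij, masked = detect_and_mask_hijackers(attn, num_image_tokens=2)
print(hij)  # → [2]
```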
zh
[CV-105] Explaining Human Preferences via Metrics for Structured 3D Reconstruction
【速读】:该论文试图解决的问题是如何有效评估自动化度量方法在结构化三维重建中的性能,并为不同应用场景推荐合适的度量指标。论文的关键解决方案在于通过专家偏好分析深入探讨现有度量方法的局限性,提出一套系统化的“单元测试”来验证理想的度量属性,并基于人类专家判断提炼出一个学习型度量方法以提供更精准的评估能力。
链接: https://arxiv.org/abs/2503.08208
作者: Jack Langerman,Denys Rozumnyi,Yuzhong Huang,Dmytro Mishkin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
点击查看摘要
Abstract:“What cannot be measured cannot be improved”, while likely never uttered by Lord Kelvin, effectively summarizes the purpose of this work. This paper presents a detailed evaluation of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and a thorough analysis through the lens of expert 3D modelers’ preferences is presented. A set of systematic “unit tests” is proposed to empirically verify desirable properties, and context-aware recommendations as to which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed.
zh
[CV-106] OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition
【速读】:该论文旨在解决连续手语识别(Continuous Sign Language Recognition, CSLR)中的两大核心挑战:多方向性和长时间动作的存在对识别精度的影响。当前研究未能充分考虑这些关键方面,导致性能受限。为应对这些问题,论文提出了一种名为“Orientation-aware Long-term Motion Decoupling (OLMD)”的新框架。其关键是通过创新的Long-term Motion Aggregation (LMA)模块高效聚合长时间动作特征,并将多方向信号解耦为易于解释的成分;同时,通过分解复杂运动为水平和垂直分量增强方向感知能力,实现双方向的动作净化。此外,文中还引入了阶段耦合和跨阶段耦合两种机制,以丰富多尺度特征并提升模型的泛化能力。实验表明,OLMD在PHOENIX14、PHOENIX14-T和CSL-Daily三个大规模数据集上达到SOTA性能,尤其在PHOENIX14上实现了绝对1.6%的词错误率(WER)改进。
链接: https://arxiv.org/abs/2503.08205
作者: Yiheng Yu,Sheng Liu,Yuan Feng,Min Xu,Zhelun Jin,Xuhua Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:The primary challenge in continuous sign language recognition (CSLR) mainly stems from the presence of multi-orientational and long-term motions. However, current research overlooks these crucial aspects, significantly impacting accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our innovative Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed: stage and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD shows SOTA performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improved the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA.
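OLMD 把复杂运动解耦为水平与垂直分量。下面用关键点轨迹演示这一解耦思想的最小版本(真实模块作用在学习到的特征上,这里的逐帧位移分解仅为直观类比):

```python
import numpy as np

def decouple_motion(trajectory):
    """Split a 2D keypoint trajectory into horizontal and vertical motion signals.

    trajectory: (T, 2) positions over time. Returns the per-step displacements
    projected onto the horizontal (x) and vertical (y) axes.
    """
    disp = np.diff(trajectory, axis=0)  # (T-1, 2) frame-to-frame motion
    horizontal = disp[:, 0]
    vertical = disp[:, 1]
    return horizontal, vertical

traj = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 1.0]])
h, v = decouple_motion(traj)
print(h, v)  # → [1. 2.] [1. 0.]
```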
zh
[CV-107] A Theoretical Framework for Preventing Class Collapse in Supervised Contrastive Learning
【速读】:该论文旨在解决监督对比学习(Supervised Contrastive Learning, SupCL)中如何在监督损失与自监督损失之间实现最优平衡的问题,以防止类别坍塌(class collapse),即避免同一类别的个体嵌入(embeddings)之间的区分度下降。论文的关键解决方案是提出Simplex-to-Simplex Embedding Model (SSEM),这是一个基于理论的框架,用于建模SupCL中的多种嵌入结构,包括所有能够最小化监督对比损失的嵌入。通过SSEM,作者分析了超参数对学习表示的影响,并提供了实用的超参数选择指南,以降低类别坍塌的风险。这些理论发现得到了合成数据集和真实数据集上的实证结果的支持。
链接: https://arxiv.org/abs/2503.08203
作者: Chungpa Lee,Jeongheon Oh,Kibok Lee,Jy-yong Sohn
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Supervised contrastive learning (SupCL) has emerged as a prominent approach in representation learning, leveraging both supervised and self-supervised losses. However, achieving an optimal balance between these losses is challenging; failing to do so can lead to class collapse, reducing discrimination among individual embeddings in the same class. In this paper, we present theoretically grounded guidelines for SupCL to prevent class collapse in learned representations. Specifically, we introduce the Simplex-to-Simplex Embedding Model (SSEM), a theoretical framework that models various embedding structures, including all embeddings that minimize the supervised contrastive loss. Through SSEM, we analyze how hyperparameters affect learned representations, offering practical guidelines for hyperparameter selection to mitigate the risk of class collapse. Our theoretical findings are supported by empirical results across synthetic and real-world datasets.
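论文分析的对象是监督对比学习的损失结构。作为参照,下面给出标准监督对比损失(Khosla et al. 形式)的朴素数值实现——这只是 SupCL 中的监督项,并非 SSEM 框架本身:

```python
import numpy as np

def supcon_loss(z, labels, tau=0.5):
    """Supervised contrastive loss on L2-normalized embeddings.

    For each anchor i, positives are all other samples with the same label;
    the denominator runs over every other sample (naive O(n^2) version).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue
        denom = sum(np.exp(sim[i, a]) for a in range(n) if a != i)
        loss += -sum(np.log(np.exp(sim[i, p]) / denom) for p in pos) / len(pos)
        count += 1
    return loss / count

# Two classes; same-class pairs aligned, cross-class orthogonal → low loss
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y = [0, 0, 1, 1]
print(supcon_loss(z, y))
```

类别坍塌对应同类嵌入完全重合的解;论文的 SSEM 正是为了刻画并避免这类退化嵌入结构。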
zh
[CV-108] Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
【速读】:该论文旨在解决现有以人为中心的视觉感知(HVP)模型在适应实际应用时面临的两大挑战:一是预训练目标仅关注特定视觉模式,限制了所学模式在多样化下游任务中的泛化能力;二是模型通常规模过大,难以满足边缘设备计算可持续性和兼容性的需求。为了解决这些问题,论文提出了一种名为Scale-Aware Image Pretraining (SAIP) 的新框架。SAIP的关键在于通过跨尺度一致性原则引入三种学习目标:Cross-scale Matching (CSM)、Cross-scale Reconstruction (CSR) 和 Cross-scale Search (CSS),使轻量级视觉模型能够学习适用于HVP下游任务的多尺度通用模式。实验结果表明,SAIP在9个人为中心的视觉任务中展现出显著的泛化能力,并在单人区分、密集预测和多人视觉理解等任务中实现了3%-13%、1%-11%和1%-6%的性能提升。
链接: https://arxiv.org/abs/2503.08201
作者: Xuanhan Wang,Huimin Deng,Lianli Gao,Jingkuan Song
机构: University of Electronic Science and Technology of China (电子科技大学); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Human-centric visual perception (HVP) has recently achieved remarkable progress due to advancements in large-scale self-supervised pretraining (SSP). However, existing HVP models face limitations in adapting to real-world applications, which require general visual patterns for downstream tasks while maintaining computationally sustainable costs to ensure compatibility with edge devices. These limitations primarily arise from two issues: 1) the pretraining objectives focus solely on specific visual patterns, limiting the generalizability of the learned patterns for diverse downstream tasks; and 2) HVP models often exhibit excessively large model sizes, making them incompatible with real-world applications. To address these limitations, we introduce Scale-Aware Image Pretraining (SAIP), a novel SSP framework enabling lightweight vision models to acquire general patterns for HVP. Specifically, SAIP incorporates three learning objectives based on the principle of cross-scale consistency: 1) Cross-scale Matching (CSM) which contrastively learns image-level invariant patterns from multi-scale single-person images; 2) Cross-scale Reconstruction (CSR) which learns pixel-level consistent visual structures from multi-scale masked single-person images; and 3) Cross-scale Search (CSS) which learns to capture diverse patterns from multi-scale multi-person images. Three objectives complement one another, enabling lightweight models to learn multi-scale generalizable patterns essential for HVP downstream tasks. Extensive experiments conducted across 12 HVP datasets demonstrate that SAIP exhibits remarkable generalization capabilities across 9 human-centric vision tasks. Moreover, it achieves significant performance improvements over existing methods, with gains of 3%-13% in single-person discrimination tasks, 1%-11% in dense prediction tasks, and 1%-6% in multi-person visual understanding tasks.
zh
[CV-109] A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models
【速读】:本文旨在解决传统强化学习(Reinforcement Learning, RL)在模拟人类行为、多智能体场景中的有效泛化以及克服其固有的可解释性等方面的局限性。特别是在需要深度环境理解、智能体协作及动态优化的任务中,这些问题更加突出。虽然基于大语言模型(Large Language Model, LLM)增强的方法在泛化性和互操作性方面展现出潜力,但往往忽视了必要的多智能体协作需求。为此,论文提出了级联协同多智能体(Cascading Cooperative Multi-agent, CCMA)框架,其关键是将传统的RL用于个体交互,经过微调的LLM用于区域合作,设计全局优化的奖励函数,并结合检索增强生成机制以实现复杂驾驶场景下的动态决策优化。实验结果表明,CCMA框架在微观和宏观层面均显著优于现有RL方法,在复杂驾驶环境中表现出色。
链接: https://arxiv.org/abs/2503.08199
作者: Miao Zhang,Zhenlong Fang,Tianyi Wang,Qian Zhang,Shuai Lu,Junfeng Jiao,Tianyu Shi
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院, China); Department of Computer Science & Engineering, University of Minnesota, Twin Cities (明尼苏达大学双城分校计算机科学与工程系, USA); Department of Mechanical Engineering & Materials Science, Yale University (耶鲁大学机械工程与材料科学系, USA); Computer Vision, BYD COMPANY LIMITED (比亚迪公司计算机视觉部门, China); School of Architecture, University of Texas at Austin (德克萨斯大学奥斯汀分校建筑学院, USA); Faculty of Applied Science & Engineering, University of Toronto (多伦多大学应用科学与工程学院, Canada)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Traditional Reinforcement Learning (RL) struggles with replicating human-like behaviors, generalizing effectively in multi-agent scenarios, and overcoming inherent interpretability limitations. These challenges are compounded when deep environment understanding, agent coordination, and dynamic optimization are required. While Large Language Model (LLM) enhanced methods have shown promise in generalization and interoperability, they often neglect necessary multi-agent coordination. Therefore, we introduce the Cascading Cooperative Multi-agent (CCMA) framework, integrating RL for individual interactions, a fine-tuned LLM for regional cooperation, a reward function for global optimization, and the Retrieval-augmented Generation mechanism to dynamically optimize decision-making across complex driving scenarios. Our experiments demonstrate that the CCMA outperforms existing RL methods, achieving significant improvements in both micro and macro-level performance in complex driving environments.
zh
[CV-110] Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation
【速读】:该论文致力于解决现有运动中间插值(motion in-betweening)方法在编码运动风格时通常关注全身动作而忽视个体身体部位表示的问题,这限制了填充运动的灵活性,尤其是在独立调整特定肢体的运动风格方面。为克服这一挑战,论文提出了一种新颖的框架,通过身体部位层面建模运动风格,从而增强填充运动的多样性和可控性。解决方案的关键在于利用与相位相关的见解,采用周期自动编码器自动提取每个身体部位的相位,捕捉独特的局部风格特征;同时,通过整合图像和运动域中的运动流形学习与条件生成技术,有效分离运动源与合成控制,使运动源能够跨多种风格生成高质量的运动,并为后续任务提供可直接用于受控合成的提取运动和风格特征。
链接: https://arxiv.org/abs/2503.08180
作者: Minyue Dai,Jingbo Wang,Ke Fan,Bin Ji,Haoyu Zhao,Junting Dong,Bo Dai
机构: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shanghai Jiaotong University (上海交通大学); Wuhan University (武汉大学); University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.
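论文用周期自动编码器为每个身体部位学习相位表示。作为非学习式的直观类比,下面用 FFT 提取一维周期信号主频的相位(真实模块的相位是学习得到的,并非 FFT 计算;此处仅演示“从周期运动中提取相位”这一概念):

```python
import numpy as np

def dominant_phase(signal):
    """Phase (radians) of the dominant non-DC frequency of a 1D signal."""
    spectrum = np.fft.rfft(signal)
    k = 1 + np.argmax(np.abs(spectrum[1:]))  # skip the DC component
    return np.angle(spectrum[k])

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
sig = np.cos(t + 0.5)  # exactly one cycle with phase offset 0.5
print(dominant_phase(sig))  # ≈ 0.5
```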
zh
[CV-111] Towards All-in-One Medical Image Re-Identification CVPR2025
【速读】:该论文致力于解决医疗图像重识别(Medical Image Re-identification, MedReID)这一未充分探索的问题,其关键应用在于个性化医疗和隐私保护。论文提出了一种全面的基准测试和统一模型来应对这一挑战。解决方案的关键在于引入了一种新颖的连续模态参数适配器(Continuous Modality-based Parameter Adapter, ComPA),它能够将医学内容压缩为连续的模态表示,并在运行时动态调整模态无关模型的模态特定参数,从而允许单一模型自适应地学习和处理多样化的模态数据。此外,通过与预训练的医学基础模型在差异特征上的对齐,进一步融入了医学先验知识。论文还展示了所提方法在11个图像数据集上优于25个基础模型和8个大型多模态语言模型的一致优越性能,并将其应用于实际的个性化诊断增强和医疗隐私保护中。
链接: https://arxiv.org/abs/2503.08173
作者: Yuan Tian,Kaiyuan Ji,Rongzhao Zhang,Yankai Jiang,Chunyi Li,Xiaosong Wang,Guangtao Zhai
机构: Shanghai AI Laboratory (上海人工智能实验室); School of Communication and Electronic Engineering, East China Normal University (华东师范大学通信与电子工程学院); Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University (上海交通大学图像通信与网络工程研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR2025
点击查看摘要
Abstract:Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image features, modeling the inter-image difference better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Code and models are available at this https URL.
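ComPA 的核心是“在运行时用模态相关参数调节模态无关模型”。下面是对这一思想的极简玩具化演示:用一个超网络把连续模态嵌入映射为线性层的权重增量(形状设定、超网络形式与命名均为我们的假设,并非论文实现):

```python
import numpy as np

def modality_adapted_linear(x, m, W_base, hyper):
    """Apply a linear layer whose weights are shifted by a delta generated
    from a continuous modality embedding m.

    x: (d_in,) input; m: (d_mod,) modality embedding;
    W_base: (d_out, d_in) modality-agnostic weights;
    hyper: (d_out * d_in, d_mod) toy hypernetwork producing the delta.
    """
    delta = (hyper @ m).reshape(W_base.shape)  # modality-conditioned weight delta
    return (W_base + delta) @ x

d_in, d_out, d_mod = 3, 2, 4
W_base = np.zeros((d_out, d_in))
hyper = np.zeros((d_out * d_in, d_mod))
hyper[0, 0] = 1.0                # delta[0, 0] picks up m[0]
x = np.array([1.0, 0.0, 0.0])
m = np.array([2.0, 0.0, 0.0, 0.0])
print(modality_adapted_linear(x, m, W_base, hyper))  # → [2. 0.]
```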
zh
[CV-112] CQVPR: Landmark-aware Contextual Queries for Visual Place Recognition
【速读】:本文旨在解决视觉位置识别(Visual Place Recognition, VPR)中精确识别查询图像所处位置的问题,特别是在地标众多且视觉相似性高的复杂场景(如城市环境)下。论文的关键在于不仅利用地标信息,还结合其周围的上下文特征(如树木、道路等),提出了一种名为上下文查询视觉位置识别(Contextual Query VPR, CQVPR)的方法。该方法通过引入可学习的上下文查询,自动提取与地标及其周边区域相关的高级上下文信息,并以热图形式表示每个查询关注的区域,作为上下文感知特征,从而增强对场景的理解。此外,文中设计了一种查询匹配损失函数来监督上下文查询的提取过程。实验结果表明,该方法在多个数据集上的表现优于现有最先进的方法,尤其在具有挑战性的场景中表现出色。
链接: https://arxiv.org/abs/2503.08170
作者: Dongyue Li,Daisuke Deguchi,Hiroshi Murase
机构: Nagoya University (名古屋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual Place Recognition (VPR) aims to estimate the location of the given query image within a database of geo-tagged images. To identify the exact location in an image, detecting landmarks is crucial. However, in some scenarios, such as urban environments, there are numerous landmarks, such as various modern buildings, and the landmarks in different cities often exhibit high visual similarity. Therefore, it is essential not only to leverage the landmarks but also to consider the contextual information surrounding them, such as whether there are trees, roads, or other features around the landmarks. We propose the Contextual Query VPR (CQVPR), which integrates contextual information with detailed pixel-level visual features. By leveraging a set of learnable contextual queries, our method automatically learns the high-level contexts with respect to landmarks and their surrounding areas. Heatmaps depicting regions that each query attends to serve as context-aware features, offering cues that could enhance the understanding of each scene. We further propose a query matching loss to supervise the extraction process of contextual queries. Extensive experiments on several datasets demonstrate that the proposed method outperforms other state-of-the-art methods, especially in challenging scenarios.
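CQVPR 用一组可学习查询对像素特征做交叉注意力,所得 (Q, N) 注意力热图即作为上下文感知特征。下面是单头、无投影矩阵的玩具版示意(真实模型含可学习投影与多头机制,此处为简化假设):

```python
import numpy as np

def contextual_query_attention(queries, features):
    """Cross-attention of learnable contextual queries over pixel features.

    queries: (Q, D) query vectors; features: (N, D) pixel features.
    Returns (Q, D) context vectors and the (Q, N) attention heatmaps.
    """
    logits = queries @ features.T / np.sqrt(queries.shape[1])
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over pixels → heatmaps
    return attn @ features, attn

q = np.array([[1.0, 0.0]])                    # one query
feats = np.array([[4.0, 0.0], [0.0, 4.0]])    # two pixel features
ctx, heat = contextual_query_attention(q, feats)
print(heat.round(3))  # query attends mostly to the first (aligned) pixel
```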
zh
[CV-113] TSCnet: A Text-driven Semantic-level Controllable Framework for Customized Low-Light Image Enhancement
【速读】:该论文旨在解决现有基于深度学习的图像增强方法在低光照条件下虽能有效降噪和提升可见性,但缺乏个性化调整灵活性的问题。传统方法通常基于一对一映射,无法满足个体对光线偏好的个性化需求。为克服这一局限,论文提出了一种新的光照增强任务及框架,通过提示驱动、语义级别和定量亮度调整实现定制化光照控制。方案的关键在于引入大型语言模型(LLM)解析自然语言提示以识别目标对象,利用基于Retinex的推理分割(RRS)模块生成精确的目标定位掩码,借助基于文本的亮度可控(TBC)模块依据生成的光照图调节亮度,并通过自适应上下文补偿(ACC)模块整合多模态输入调控条件扩散模型,从而确保增强效果的无缝与精准。实验结果验证了该框架在提高可见性、保持自然色彩平衡及细节放大方面的优越性能,同时展示了其在复杂开放世界环境中通过自然语言交互进行高级语义级光照调整的强大泛化能力。
链接: https://arxiv.org/abs/2503.08168
作者: Miao Zhang,Jun Yin,Pengyu Zeng,Yiqing Shen,Shuai Lu,Xueqian Wang
机构: Tsinghua University Shenzhen Graduate School (清华大学深圳国际研究生院); Johns Hopkins University (约翰斯·霍普金斯大学); Tsinghua University Shenzhen Graduate School (清华大学深圳国际研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep learning-based image enhancement methods show significant advantages in reducing noise and improving visibility in low-light conditions. These methods are typically based on one-to-one mapping, where the model learns a direct transformation from low light to specific enhanced images. Therefore, these methods are inflexible as they do not allow highly personalized mapping, even though an individual’s lighting preferences are inherently personalized. To overcome these limitations, we propose a new light enhancement task and a new framework that provides customized lighting control through prompt-driven, semantic-level, and quantitative brightness adjustments. The framework begins by leveraging a Large Language Model (LLM) to understand natural language prompts, enabling it to identify target objects for brightness adjustments. To localize these target objects, the Retinex-based Reasoning Segment (RRS) module generates precise target localization masks using reflection images. Subsequently, the Text-based Brightness Controllable (TBC) module adjusts brightness levels based on the generated illumination map. Finally, an Adaptive Contextual Compensation (ACC) module integrates multi-modal inputs and controls a conditional diffusion model to adjust the lighting, ensuring seamless and precise enhancements. Experimental results on benchmark datasets demonstrate our framework’s superior performance in increasing visibility, maintaining natural color balance, and amplifying fine details without creating artifacts. Furthermore, its robust generalization capabilities enable complex semantic-level lighting adjustments in diverse open-world environments through natural language interactions.
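“只对目标对象掩码区域做定量亮度调整”这一语义级控制可以用最简单的乘性增益示意(真实框架由条件扩散模型完成调整,这里的逐像素缩放仅为概念演示):

```python
import numpy as np

def adjust_brightness(image, mask, gain):
    """Scale brightness only inside a target-object mask.

    image: (H, W) luminance in [0, 1]; mask: (H, W) binary target mask;
    gain: multiplicative brightness factor applied inside the mask.
    """
    out = image * np.where(mask > 0, gain, 1.0)
    return np.clip(out, 0.0, 1.0)

img = np.full((2, 2), 0.4)
mask = np.array([[1, 0], [0, 0]])
out = adjust_brightness(img, mask, gain=1.5)
print(out)  # masked pixel → 0.6, the rest stay at 0.4
```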
zh
[CV-114] Dynamic Scene Reconstruction: Recent Advance in Real-time Rendering and Streaming
【速读】:该论文旨在解决从2D图像表示和渲染动态场景这一基础 yet 挑战性的问题。论文的关键在于系统性地回顾和总结基于神经辐射场(Neural Radiance Fields)和3D高斯点 splatting 的重建方法的最新进展,通过分类现有方法、整理相关数据集、对比不同方法在基准测试中的性能,同时探讨该快速发展的领域的挑战与未来研究方向。其解决方案的关键在于结合这些先进的神经网络架构与几何建模技术,以实现对复杂动态场景的高效表示与逼真渲染。
链接: https://arxiv.org/abs/2503.08166
作者: Jiaxuan Zhu,Hao Tang
机构: School of Computer Science and Engineering, Southeast University (东南大学计算机科学与工程学院), China; School of Computer Science, Peking University (北京大学计算机学院), China
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 6 figures
点击查看摘要
Abstract:Representing and rendering dynamic scenes from 2D images is a fundamental yet challenging problem in computer vision and graphics. This survey provides a comprehensive review of the evolution and advancements in dynamic scene representation and rendering, with a particular emphasis on recent progress in Neural Radiance Fields based and 3D Gaussian Splatting based reconstruction methods. We systematically summarize existing approaches, categorize them according to their core principles, compile relevant datasets, compare the performance of various methods on these benchmarks, and explore the challenges and future research directions in this rapidly evolving field. In total, we review over 170 relevant papers, offering a broad perspective on the state of the art in this domain.
zh
[CV-115] Multimodal Generation of Animatable 3D Human Models with AvatarForge
【速读】:该论文旨在解决高质量、可定制化生成三维(3D)人体 avatar 的难题,同时克服现有方法在 avatar 动画生成方面的局限性。传统扩散模型(Diffusion-based Models)虽然在通用 3D 物体生成方面取得了进展,但由于人体形状和姿态的复杂多样性以及高质量数据的稀缺性,在生成高保真且可定制的人体 avatar 方面表现不佳。此外,现有的动画方法也难以有效处理此类 avatar 的动画生成任务。
论文提出的解决方案的关键在于结合大型语言模型(LLM)驱动的常识推理与现成的 3D 人体生成器,从而实现对身体和面部细节的精细控制。AvatarForge 不依赖于预训练的数据集,避免了对个体人体特征缺乏精确控制的问题,而是通过引入迭代设计和建模循环,辅以自动验证系统(auto-verification system),实现了生成 avatar 的持续优化与高精度、高度定制化的效果。这一组合方法不仅提升了生成质量和灵活性,还显著改善了文本到 avatar 和图像到 avatar 的转换性能,使其成为艺术创作和动画领域的强大工具。
链接: https://arxiv.org/abs/2503.08165
作者: Xinhang Liu,Yu-Wing Tai,Chi-Keung Tang
机构: HKUST; Dartmouth College (达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle with high-quality, customizable human avatars due to the complexity and diversity of human body shapes, poses, exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach, bringing humans into the iterative design and modeling loop, with its auto-verification system allowing for continuous refinement of the generated avatars, and thus promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.
zh
[CV-116] U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers
【速读】:该论文试图解决现有艺术风格迁移方法在生成超高质量艺术风格化图像时存在的明显伪影和不和谐图案的问题。为了解决这些问题,论文提出了一种新颖的艺术图像风格迁移方法U-StyDiT,其关键在于结合基于Transformer的扩散模型(Diffusion Transformer, DiT),实现内容-风格解耦,并通过多视图风格调节器(Multi-view Style Modulator, MSM)从局部和全局视角学习风格信息,同时引入StyDiT块以同步学习内容和风格条件。此外,还构建了一个包含10个类别的超高质量艺术图像数据集Aes4M,解决了现有方法因数据集规模和图像质量不足导致无法生成高质量艺术风格化图像的问题。最终的实验验证了U-StyDiT在生成高质量风格化图像方面的优越性,且该方法首次利用基于Transformer的扩散模型实现了超高质量风格化图像的生成。
链接: https://arxiv.org/abs/2503.08157
作者: Zhanjie Zhang,Ao Ma,Ke Cao,Jing Wang,Shanyuan Liu,Yuhang Ma,Bo Cheng,Dawei Leng,Yuhui Yin
机构: College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); 360 AI Research (360 人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.
zh
[CV-117] Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model
【速读】:该论文旨在解决化学反应图像向机器可读形式转换的难题,当前这一过程主要依赖手动整理,缺乏高效可靠的自动化方法。论文的关键解决方案是引入了Reaction Image Multimodal大型语言模型(RxnIM),这是首个专为解析化学反应图像而设计的多模态大模型。RxnIM不仅能够从反应图像中提取关键化学成分,还能解读描述反应条件的文本内容。结合专门设计的大规模数据集生成方法以支持模型训练,该方法在多种基准测试中实现了平均F1分数88%,比现有方法提高了5%,从而显著推动了从化学文献图像中自动构建机器可读反应数据库的技术发展。
链接: https://arxiv.org/abs/2503.08156
作者: Yufan Chen,Ching Ting Leung,Jianwei Sun,Yong Huang,Linyan Li,Hao Chen,Hanyu Gao
机构: Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology (香港科技大学); Department of Computer Science and Engineering, Hong Kong University of Science and Technology (香港科技大学); Department of Chemistry, Hong Kong University of Science and Technology (香港科技大学); Department of Data Science, City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at this https URL.
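摘要以平均 F1 分数衡量解析性能;作为参考,下面给出按集合匹配计算单样本 F1 的最小草图(组件如何判定“匹配”属于评测细节,这里的精确字符串匹配只是笔者的假设,并非论文官方评测脚本):

```python
def extraction_f1(predicted, ground_truth):
    """以集合匹配方式计算抽取结果的 F1(示意):命中即 predicted 与 ground_truth 的交集。"""
    pred, gt = set(predicted), set(ground_truth)
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```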
zh
[CV-118] Unifying Structure and Activation: A Comprehensive Approach of Parameter and Memory Efficient Transfer Learning
【速读】:该论文旨在解决现有参数高效微调(PETL)方法在模型规模增大时,其内存占用未能与可学习参数的减少保持同等比例下降的问题,这一局限性阻碍了PETL方法在内存受限设备上的实际部署。为了解决这一问题,论文提出了一种新的PETL框架——Structure to Activation (S2A),其关键在于通过设计参数化模型结构中的激活模块(如偏置、提示符和侧模块),显著减少了可调整参数和激活内存的规模;同时,针对非参数化结构(如非线性函数),采用基于导数的4位量化技术,在保持精度的同时大幅降低了内存使用。这些创新点使得S2A方法在参数和内存占用方面均提供了轻量化的解决方案。
链接: https://arxiv.org/abs/2503.08154
作者: Tian Jin,Enjun Du,Changwei Wang,Wenhao Xu,Ding Luo
机构: Sichuan University (四川大学); TongJi University (同济大学); Shandong Computer Science Center (山东计算机科学中心); Beijing University of Posts and Telecommunications (北京邮电大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州));
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Parameter-efficient transfer learning (PETL) aims to reduce the scales of pre-trained models for multiple downstream tasks. However, as the models keep scaling up, the memory footprint of existing PETL methods is not significantly reduced compared to the reduction of learnable parameters. This limitation hinders the practical deployment of PETL methods on memory-constrained devices. To this end, we proposed a new PETL framework, called Structure to Activation (S2A), to reduce the memory footprint of activation during fine-tuning. Specifically, our framework consists of: 1) Activation modules design (i.e., bias, prompt and side modules) in the parametric model structure, which results in a significant reduction of adjustable parameters and activation memory; 2) 4-bit quantisation of activations based on their derivatives for non-parametric structures (e.g., nonlinear functions), which maintains accuracy while significantly reducing memory usage. Our S2A method consequently offers a lightweight solution in terms of both parameter and memory footprint. We evaluate S2A with different backbones and conduct extensive experiments on various datasets to evaluate the effectiveness. The results show that our method not only outperforms existing PETL techniques, achieving a fourfold reduction in GPU memory footprint on average, but also shows competitive performance in accuracy with lower tunable parameters. These also demonstrate that our method is highly suitable for practical transfer learning on hardware-constrained devices.
zh
[CV-119] WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation
【速读】:该论文旨在解决现有文本到视频(Text-to-Video, T2V)生成模型难以理解和遵循抽象物理定律的问题。这一挑战源于抽象物理原理与生成模型之间存在显著差距,导致缺乏明确的物理信息指导。为了解决此问题,论文提出了World Simulator Assistant (WISA),这是一种有效的框架,用于分解并整合物理原理至T2V模型中。WISA的关键创新在于将物理原理分解为文本描述、定性分类以及定量属性,并通过引入Mixture-of-Physical-Experts Attention (MoPA)和物理分类器等设计,有效嵌入这些物理特性到生成过程中,从而提升模型的物理感知能力。此外,论文还指出现有数据集因物理现象表现较弱或与其他过程混杂而限制了其作为学习明确物理原理资源的适用性,为此构建了一个包含32,000个视频的新数据集WISA-32K,涵盖三大物理领域中的17条物理定律,进一步支持模型训练与评估。
链接: https://arxiv.org/abs/2503.08153
作者: Jing Wang,Ao Ma,Ke Cao,Jun Zheng,Zhanjie Zhang,Jiasong Feng,Shanyuan Liu,Yuhang Ma,Bo Cheng,Dawei Leng,Yuhui Yin,Xiaodan Liang
机构: Shenzhen Campus of Sun Yat-Sen University (中山大学深圳校区); 2360 AI Research (360 AI 研究院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent rapid advancements in text-to-video (T2V) generation, such as SoRA and Kling, have shown great potential for building world simulators. However, current T2V models struggle to grasp abstract physical principles and generate videos that adhere to physical laws. This challenge arises primarily from a lack of clear guidance on physical information due to a significant gap between abstract physical principles and generation models. To this end, we introduce the World Simulator Assistant (WISA), an effective framework for decomposing and incorporating physical principles into T2V models. Specifically, WISA decomposes physical principles into textual physical descriptions, qualitative physical categories, and quantitative physical properties. To effectively embed these physical attributes into the generation process, WISA incorporates several key designs, including Mixture-of-Physical-Experts Attention (MoPA) and a Physical Classifier, enhancing the model’s physics awareness. Furthermore, most existing datasets feature videos where physical phenomena are either weakly represented or entangled with multiple co-occurring processes, limiting their suitability as dedicated resources for learning explicit physical principles. We propose a novel video dataset, WISA-32K, collected based on qualitative physical categories. It consists of 32,000 videos, representing 17 physical laws across three domains of physics: dynamics, thermodynamics, and optics. Experimental results demonstrate that WISA can effectively enhance the compatibility of T2V models with real-world physical laws, achieving a considerable improvement on the VideoPhy benchmark. The visual exhibitions of WISA and WISA-32K are available in the this https URL.
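作为理解 MoPA(Mixture-of-Physical-Experts Attention)中“按物理类别路由专家”思想的玩具草图,下面用 softmax 门控对若干线性“专家”加权(真实的 MoPA 是注意力层级的设计,此处的线性专家与门控形式均为笔者假设):

```python
import numpy as np

def mixture_of_experts(x, expert_weights, gate_logits):
    """softmax 门控的专家混合最小示意。
    x: (d,) 输入特征;expert_weights: (E, d, d) 每个专家一个线性变换;gate_logits: (E,)"""
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()                                 # softmax 门控权重
    outputs = np.einsum("edk,k->ed", expert_weights, x)  # 每个专家的输出,形状 (E, d)
    return (gates[:, None] * outputs).sum(axis=0)        # 门控加权求和
```

当门控接近 one-hot(例如由视频的定性物理类别决定)时,输出退化为单一专家的结果。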
zh
[CV-120] Depth-Assisted Network for Indiscernible Marine Object Counting with Adaptive Motion-Differentiated Feature Encoding
【速读】:该论文旨在解决水下不可分辨海洋目标计数问题,面临的挑战包括水下场景的有限可见性、目标之间的相互遮挡与重叠,以及背景与前景在外观、颜色和纹理上的动态相似性。为应对基于视频的不可分辨物体计数数据集稀缺的问题,研究者开发了一个包含50段视频的新数据集,并从中提取约800帧进行标注,总计约40,800个点级目标标签,以真实反映复杂水下环境中的计数难题。针对这些挑战,论文提出了一种深度辅助网络,其关键在于自适应运动区分特征编码。该网络由骨干编码模块及三个分支组成:深度辅助分支、密度估计分支和运动权重生成分支。其中,深度辅助分支提取的深度感知特征通过增强型编码器优化目标表示;运动权重生成分支产生的权重则用于改进自适应流估计模块中的多尺度感知特征。实验结果表明,所提方法不仅在新构建的数据集上达到当前最优性能,还在其他三个基于视频的人群计数数据集上表现出竞争力。相关资源已公开发布。
链接: https://arxiv.org/abs/2503.08152
作者: Chengzhi Ma,Kunqian Li,Shuaixin Liu,Han Mei
机构: College of Engineering, Ocean University of China (海洋大学工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Indiscernible marine object counting encounters numerous challenges, including limited visibility in underwater scenes, mutual occlusion and overlap among objects, and the dynamic similarity in appearance, color, and texture between the background and foreground. These factors significantly complicate the counting process. To address the scarcity of video-based indiscernible object counting datasets, we have developed a novel dataset comprising 50 videos, from which approximately 800 frames have been extracted and annotated with around 40,800 point-wise object labels. This dataset accurately represents real underwater environments where indiscernible marine objects are intricately integrated with their surroundings, thereby comprehensively illustrating the aforementioned challenges in object counting. To address these challenges, we propose a depth-assisted network with adaptive motion-differentiated feature encoding. The network consists of a backbone encoding module and three branches: a depth-assisting branch, a density estimation branch, and a motion weight generation branch. Depth-aware features extracted by the depth-assisting branch are enhanced via a depth-enhanced encoder to improve object representation. Meanwhile, weights from the motion weight generation branch refine multi-scale perception features in the adaptive flow estimation module. Experimental results demonstrate that our method not only achieves state-of-the-art performance on the proposed dataset but also yields competitive results on three additional video-based crowd counting datasets. The pre-trained model, code, and dataset are publicly available at this https URL.
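此类点标注计数任务的常见做法是把点标注转成密度图,以密度图的积分作为计数监督;下面给出一个自包含的 numpy 草图(高斯核尺寸与归一化方式为笔者假设,仅用于说明“约 40,800 个点级标签”如何转化为可回归的监督信号):

```python
import numpy as np

def density_map(points, shape, sigma=2.0, radius=6):
    """把点标注转成密度图:每个点放置一个归一化高斯核,全图求和即目标计数。"""
    H, W = shape
    dmap = np.zeros((H, W), dtype=np.float64)
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    for (r, c) in points:
        # 处理靠近边界的点:裁剪核并重新归一化,保证每个点贡献恰为 1
        r0, r1 = max(0, r - radius), min(H, r + radius + 1)
        c0, c1 = max(0, c - radius), min(W, c + radius + 1)
        k = kernel[r0 - (r - radius): r1 - (r - radius),
                   c0 - (c - radius): c1 - (c - radius)]
        dmap[r0:r1, c0:c1] += k / k.sum()
    return dmap
```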
zh
[CV-121] Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features
【速读】:该论文试图解决在生成式 AI (Generative AI) 模型持续演进的背景下,如何通过模型归因 (Model Attribution, MA) 方法有效识别新出现的生成模型的问题。当前基于深度学习的 MA 方法需要针对新数据从头训练以识别未见过的模型,这不仅耗时而且对数据量要求高。论文的关键解决方案是将小样本类增量学习 (Few-shot Class-Incremental Learning, FSCIL) 机制引入 MA 问题,并提出自适应集成模块 (Adaptive Integration Module, AIM)。与现有 FSCIL 方法专注于高层对象分类不同,本文方法强调分析合成图像中的低层细节(如颜色和纹理)。AIM 通过计算 CLIP-ViT 特征块的加权和,自适应地整合多层级表征,从而提升识别生成模型的能力。实验表明,所提方法能够有效扩展至先前以及最新的生成模型。
链接: https://arxiv.org/abs/2503.08148
作者: Hanbyul Lee,Juneho Yi
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
点击查看摘要
Abstract:Recently, images that distort or fabricate facts using generative models have become a social concern. To cope with continuous evolution of generative artificial intelligence (AI) models, model attribution (MA) is necessary beyond just detection of synthetic images. However, current deep learning-based MA methods must be trained from scratch with new data to recognize unseen models, which is time-consuming and data-intensive. This work proposes a new strategy to deal with persistently emerging generative models. We adapt few-shot class-incremental learning (FSCIL) mechanisms for MA problem to uncover novel generative AI models. Unlike existing FSCIL approaches that focus on object classification using high-level information, MA requires analyzing low-level details like color and texture in synthetic images. Thus, we utilize a learnable representation from different levels of CLIP-ViT features. To learn an effective representation, we propose Adaptive Integration Module (AIM) to calculate a weighted sum of CLIP-ViT block features for each image, enhancing the ability to identify generative models. Extensive experiments show our method effectively extends from prior generative models to recent ones.
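AIM 的核心是对 CLIP-ViT 各 block 特征做可学习加权和;下面的最小草图展示这一计算形式(softmax 参数化与“每层一个特征向量”均为笔者的简化假设):

```python
import numpy as np

def adaptive_integration(block_feats, logits):
    """对各层 block 特征做可学习权重的加权和。
    block_feats: (L, D) 每层一个特征向量;logits: (L,) 可学习参数。"""
    w = np.exp(logits - logits.max())
    w /= w.sum()                 # softmax,保证权重和为 1
    return w @ block_feats, w    # 加权后的 (D,) 表征与权重本身
```

直觉上,训练会把权重推向那些保留低层纹理/颜色线索的浅层 block,这正是模型归因区别于普通分类的地方。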
zh
[CV-122] FilmComposer: LLM -Driven Music Production for Silent Film Clips
【速读】:该论文旨在解决无声电影片段配乐生产中的挑战,通过提出一种结合大型生成模型与多智能体方法的FilmComposer系统来实现这一目标。解决方案的关键在于FilmComposer首次整合了波形音乐与符号音乐生成的优势,并专注于提升电影音乐生产的三个核心要素:音质(audio quality)、音乐性(musicality)以及音乐发展(musical development)。此外,它引入了节奏、语义及视觉等多种控制机制以增强这些关键方面。具体而言,FilmComposer包含视觉处理模块、可调节节奏的MusicGen模块以及多智能体评估、编排与混音组件。同时,该框架能够无缝集成到实际的音乐制作流程中,并支持用户在每一步进行干预,提供高度互动性和创造性自由。为了应对现有专业高质量电影音乐数据集的缺乏,研究还提出了包含7,418个电影片段及其相关音乐、描述、节奏点和主旋律的MusicPro-7k数据集。最终,无论是标准指标还是新提出的专门化指标均表明,本模型生成的音乐在质量、与视频的一致性、多样性、音乐性和音乐发展等方面达到了最先进的性能水平。
链接: https://arxiv.org/abs/2503.08147
作者: Zhifeng Xie,Qile He,Youjia Zhu,Qiwei He,Mengtian Li
机构: Shanghai University (上海大学); Shanghai Engineering Research Center of Motion Picture Special Effects (上海电影特效工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Project page: this https URL
点击查看摘要
Abstract:In this work, we implement music production for silent film clips using LLM-driven method. Given the strong professional demands of film music production, we propose the FilmComposer, simulating the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film-audio quality, musicality, and musical development-and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, we propose MusicPro-7k which includes 7,418 film clips, music, description, rhythm spots and main melody, considering the lack of a professional and high-quality film music dataset. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: this https URL
zh
[CV-123] Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
【速读】:该论文旨在解决开放词汇多目标跟踪(OV-MOT)中因受限于预定义类别集而导致的跟踪能力不足问题。当前OV-MOT方法通常侧重实例级别的检测与关联,而忽视了轨迹信息这一独特且关键的要素,尤其是在遮挡和类别模糊场景下。论文的关键解决方案是引入一种利用轨迹信息的方法,以提升对象关联性和分类准确性。具体而言,论文提出了TRACT框架,其核心包括两个关键部分:一是通过改进目标身份和类别一致性来增强跟踪性能的Trajectory Consistency Reinforcement (TCR)策略;二是可插拔的TraCLIP模块,结合Trajectory Feature Aggregation (TFA)和Trajectory Semantic Enrichment (TSE)策略,从视觉和语言角度充分挖掘轨迹信息以优化分类结果。实验表明,这些方法显著提升了OV-MOT的性能,强调了轨迹信息在OV-MOT中的重要价值。
链接: https://arxiv.org/abs/2503.08145
作者: Yunhao Li,Yifan Jiao,Dan Meng,Heng Fan,Libo Zhang
机构: Institute of Software Chinese Academy of Science (中国科学院软件研究所); University of Chinese Academy of Science (中国科学院大学); University of North Texas (北德克萨斯大学); OPPO Research Institute (OPPO 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Open-Vocabulary Multi-Object Tracking (OV-MOT) aims to enable approaches to track objects without being limited to a predefined set of categories. Current OV-MOT methods typically rely primarily on instance-level detection and association, often overlooking trajectory information that is unique and essential for object tracking tasks. Utilizing trajectory information can enhance association stability and classification accuracy, especially in cases of occlusion and category ambiguity, thereby improving adaptability to novel classes. Thus motivated, in this paper we propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. Specifically, we introduce a Trajectory Consistency Reinforcement (TCR) strategy, that benefits tracking performance by improving target identity and category consistency. In addition, we present TraCLIP, a plug-and-play trajectory classification module. It integrates Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE) strategies to fully leverage trajectory information from visual and language perspectives for enhancing the classification results. Extensive experiments on OV-TAO show that our TRACT significantly improves tracking performance, highlighting trajectory information as a valuable asset for OV-MOT. Code will be released.
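下面用一个玩具草图示意 TFA/TSE 的基本想法:沿轨迹聚合帧级视觉特征,再与类别文本嵌入做余弦相似度完成轨迹级分类(平均池化与余弦分类均为笔者的简化假设,并非论文的具体结构):

```python
import numpy as np

def classify_trajectory(frame_feats, text_embs):
    """轨迹级开放词汇分类的最小示意。
    frame_feats: (T, D) 轨迹内每帧的视觉特征;text_embs: (C, D) 类别文本嵌入。"""
    traj = frame_feats.mean(axis=0)                  # 轨迹特征聚合(此处用平均池化)
    traj = traj / np.linalg.norm(traj)
    text = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text @ traj                               # 与每个类别的余弦相似度
    return int(np.argmax(sims)), sims
```

相比逐帧分类,沿轨迹聚合能平滑单帧遮挡或模糊造成的误判,这正是引入轨迹信息的动机。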
zh
[CV-124] Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)和视觉语言模型(Vision-Language Models, VLMs)在遥感图像检测任务中的应用挑战。传统VLMs难以直接理解遥感图像,尤其是在检测任务中面临显著困难。论文的关键解决方案是通过将遥感目标检测数据集(如SSDD、HRSID、NWPU-VHR-10)的传统标注信息转化为自然语言,构建指令微调(Supervised Fine-Tuning, SFT)数据集,以适配VLM的训练需求。通过评估不同微调策略的检测性能,优化模型权重,并验证模型在遥感图像检测和视觉问答(Vision Question Answering, VQA)任务中的能力,证明仅使用自然语言即可实现高效的遥感目标检测,而无需修改模型架构。
链接: https://arxiv.org/abs/2503.08144
作者: Fei Wang,Chengcheng Chen,Hongyu Chen,Yugang Chang,Weiming Zeng
机构: Digital Imaging and Intelligent Computing Laboratory, Shanghai Maritime University (上海海事大学数字图像与智能计算实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, large language models (LLMs) and vision-language models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often fails to yield satisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we utilize publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10, to convert traditional annotation information into natural language, thereby constructing an instruction-tuning (SFT) dataset for VLM training. We then evaluate the detection performance of different fine-tuning strategies for VLMs and obtain optimized model weights for object detection in remote sensing images. Finally, we assess the model’s prior knowledge capabilities through natural language. The results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our dataset and relevant code will be released soon.
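“把检测标注转成自然语言 SFT 样本”的过程可以用如下草图说明(instruction 的措辞与 JSON 字段名均为笔者假设,论文使用的实际模板未知):

```python
import json

def to_sft_sample(image_id, annotations):
    """把 (类别, 边界框) 标注列表转成一条指令微调样本(字段格式为笔者假设)。"""
    lines = [f"{cls}: [{x1}, {y1}, {x2}, {y2}]"
             for cls, (x1, y1, x2, y2) in annotations]
    return {
        "image": image_id,
        "instruction": "Detect all objects in this remote sensing image "
                       "and report each as 'class: [x1, y1, x2, y2]'.",
        "response": "\n".join(lines) if lines else "No objects found.",
    }
```

这样得到的样本可直接序列化为 JSON,与通用 VLM 的对话式 SFT 流水线兼容,而无需改动模型架构。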
zh
[CV-125] A Framework for Reducing the Complexity of Geometric Vision Problems and its Application to Two-View Triangulation with Approximation Bounds
【速读】:本文旨在通过针对重投影误差最小化所使用的代价函数进行目标性重新加权,来降低几何视觉问题的计算复杂度。论文的关键解决方案在于将双视图三角化(triangulation)问题中的代价函数重新加权,从而将原本需要求解六阶一元多项式的最优三角化问题简化为二阶多项式求解,进而获得解析解,同时保持了较高的几何精度。为此,作者推导了最优加权策略,建立了近似误差的理论边界,并通过真实数据实验验证了所提出方法相较于标准方法的有效性。尽管研究聚焦于双视图三角化问题,但其框架可推广至其他几何视觉任务。
链接: https://arxiv.org/abs/2503.08142
作者: Felix Rydell,Georg Bökman,Fredrik Kahl,Kathlén Kohn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
备注:
点击查看摘要
Abstract:In this paper, we present a new framework for reducing the computational complexity of geometric vision problems through targeted reweighting of the cost functions used to minimize reprojection errors. Triangulation - the task of estimating a 3D point from noisy 2D projections across multiple images - is a fundamental problem in multiview geometry and Structure-from-Motion (SfM) pipelines. We apply our framework to the two-view case and demonstrate that optimal triangulation, which requires solving a univariate polynomial of degree six, can be simplified through cost function reweighting reducing the polynomial degree to two. This reweighting yields a closed-form solution while preserving strong geometric accuracy. We derive optimal weighting strategies, establish theoretical bounds on the approximation error, and provide experimental results on real data demonstrating the effectiveness of the proposed approach compared to standard methods. Although this work focuses on two-view triangulation, the framework generalizes to other geometric vision problems.
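作为背景,下面给出经典线性(DLT)三角化的自包含实现:它最小化代数误差并拥有 SVD 闭式解;论文的贡献则是证明对“重投影误差”代价做最优重加权后,原本需解六阶一元多项式的最优三角化同样退化为二阶、从而获得闭式解(以下代码是标准 DLT,并非论文的重加权方法):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """由两个 3x4 投影矩阵与对应图像点 (x, y) 线性三角化 3D 点。"""
    # 每个视图贡献两条线性约束 x * P[2] - P[0] = 0, y * P[2] - P[1] = 0
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # 最小奇异值对应的右奇异向量即齐次解
    X = Vt[-1]
    return X[:3] / X[3]
```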
zh
[CV-126] HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views CVPR2025
【速读】:该论文旨在解决大规模3D场景中跨地面与空中视角(ground-to-ground 和 ground-to-aerial)的定位与环境识别问题,特别是在城市和森林复杂环境中点云数据分布密度不均及噪声干扰下的精准位置识别挑战。为应对这些难题,论文提出的关键解决方案包括:(1) 基于八叉树结构的多尺度注意力机制(octree-based multi-scale attention mechanism),以捕获空间与语义特征;(2) 柱状八叉树注意力窗口(cylindrical octree attention windows),用于动态反映旋转激光雷达扫描点云的分布特性;(3) 中继标记(relay tokens)实现高效全局-局部交互与多尺度表示学习;(4) 金字塔注意力池化(pyramid attentional pooling),用于生成鲁棒的全局描述符。这些创新方法显著提升了在CS-Wild-Places基准数据集以及传统城市和森林数据集上的性能表现。
链接: https://arxiv.org/abs/2503.08140
作者: Ethan Griffiths,Maryam Haghighat,Simon Denman,Clinton Fookes,Milad Ramezani
机构: Queensland University of Technology (QUT); CSIRO Robotics, Data61, CSIRO (澳大利亚联邦科学与工业研究组织机器人部门, 数据61, 澳大利亚联邦科学与工业研究组织)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 16 pages, 13 figures, 10 tables, Accepted to CVPR 2025
点击查看摘要
Abstract:We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based Transformer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% - 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 5.8% on well-established urban and forest datasets. The code and CS-Wild-Places benchmark is available at https://csiro-robotics.github.io/HOTFormerLoc .
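摘要中的 top-1 average recall 指标,其计算方式可用如下最小草图说明(基于全局描述符的 L2 最近邻检索;真实评测通常还带距离阈值等判定细节,这里从简,属笔者假设):

```python
import numpy as np

def recall_at_1(query_desc, db_desc, query_labels, db_labels):
    """位置识别 top-1 召回:对每个查询取数据库最近邻,检查地点标签是否一致。"""
    d = ((query_desc[:, None, :] - db_desc[None, :, :]) ** 2).sum(-1)  # 两两平方距离
    nn = d.argmin(axis=1)                                              # 每个查询的最近邻索引
    return float((db_labels[nn] == query_labels).mean())
```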
zh
[CV-127] FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems
【速读】:该论文旨在解决基于流模型(flow models)的逆问题求解(inverse problem solving)中后验采样(posterior sampling)尚未被严格研究的问题。尽管扩散模型(diffusion models)在逆问题求解方面已得到广泛探索,但扩散逆求解器(DIS)等方法尚未在更广泛的流模型框架内进行系统性扩展。为此,论文提出将扩散逆求解器的思想推广到流模型框架,并通过推导Tweedie公式的流版本,将流常微分方程(ODE)分解为两个组件:一个用于干净图像估计,另一个用于噪声估计。关键在于分别将似然梯度(likelihood gradient)和随机噪声整合到这两个组件中,从而实现高效的后验采样。论文提出的方案名为Flow-Driven Posterior Sampling (FlowDPS),其创新点在于无缝集成至具有Transformer架构的潜在流模型中,同时在四个线性逆问题上验证了其超越现有最先进方法的能力,且无需额外训练。
链接: https://arxiv.org/abs/2503.08136
作者: Jeongsol Kim,Bryan Sangwoo Kim,Jong Chul Ye
机构: Department of Bio and Brain Engineering, KAIST (KAIST 生物与脑工程系); Kim Jaechul Graduate School of AI, KAIST (KAIST 杰出教授人工智能研究生院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend the diffusion inverse solvers (DIS) - which perform posterior sampling by combining a denoising diffusion prior with an likelihood gradient - into the flow framework. Specifically, by driving the flow-version of Tweedie’s formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.
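FlowDPS 的“拆分 + 似然引导”思想可以在一个玩具设定下草绘如下:假设流匹配插值为 x_t = (1-t)·ε + t·x₁、速度 v = x₁ - ε(该约定是笔者假设的一种常见流匹配形式,与论文记号未必一致),则可由 (x_t, v) 恢复干净估计与噪声估计,并在干净估计上注入似然梯度后重新插值:

```python
import numpy as np

def split_flow(x_t, v, t):
    """把流状态拆成干净样本估计与噪声估计(基于 x_t = (1-t)*eps + t*x1, v = x1 - eps)。"""
    x1_hat = x_t + (1 - t) * v   # 干净样本估计
    eps_hat = x_t - t * v        # 噪声估计
    return x1_hat, eps_hat

def guided_step(x_t, v, t, dt, A, y, eta):
    """在干净估计上做一步似然梯度下降,再与噪声估计按新时刻重新插值。"""
    x1_hat, eps_hat = split_flow(x_t, v, t)
    grad = A.T @ (A @ x1_hat - y)          # ∇_x ||A x - y||^2 / 2,线性观测 y = A x
    x1_hat = x1_hat - eta * grad
    s = t + dt
    return (1 - s) * eps_hat + s * x1_hat
```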
zh
[CV-128] ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting
【速读】:该论文致力于解决在零件级同时重建RGB外观与估计运动参数的问题,以构建铰接物体(articulated objects)的数字孪生模型。解决方案的关键在于引入一种名为ArticulatedGS的自监督综合框架,通过多步优化过程解耦多个高度相互依赖的参数,从而实现稳定的优化流程和高质量结果,且无需依赖3D监督、运动线索或语义标签。
链接: https://arxiv.org/abs/2503.08135
作者: Junfu Guo,Yu Xin,Gaoyi Liu,Kai Xu,Ligang Liu,Ruizhen Hu
机构: University of Science and Technology of China (中国科学技术大学); National University of Defense Technology (国防科技大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We tackle the challenge of concurrent reconstruction at the part level with the RGB appearance and estimation of motion parameters for building digital twins of articulated objects using the 3D Gaussian Splatting (3D-GS) method. With two distinct sets of multi-view imagery, each depicting an object in separate static articulation configurations, we reconstruct the articulated object in 3D Gaussian representations with both appearance and geometry information at the same time. Our approach decoupled multiple highly interdependent parameters through a multi-step optimization process, thereby achieving a stable optimization procedure and high-quality outcomes. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without reliance on 3D supervision, motion cues, or semantic labels. Our experimental results demonstrate that, among comparable methodologies, our approach has achieved optimal outcomes in terms of part segmentation accuracy, motion estimation accuracy, and visual quality.
zh
[CV-129] MGHanD: Multi-modal Guidance for authentic Hand Diffusion
【速读】:该论文旨在解决扩散模型(Diffusion-based Models)在文本到图像(T2I)生成任务中生成逼真人手时面临的挑战,这些问题包括不正确的手指数量以及结构变形等。论文的关键解决方案在于引入多模态引导(multi-modal guidance)机制:通过视觉引导使用一个在包含配对手绘与真实图像及其描述的特定数据集上训练的判别器,同时利用LoRA适配器实现基于文本的引导,以学习从“手”到更具体提示如“自然的手”或“解剖学正确手指”的潜在方向。此外,还采用了逐步扩大的累积手部掩码(cumulative hand mask),它在指定的时间步长内逐渐增大,从而在保持预训练模型丰富生成能力的同时优化手部细节。这些方法共同确保了无需特定条件或先验知识即可获得高质量的人手生成结果。
链接: https://arxiv.org/abs/2503.08133
作者: Taehyeon Eum,Jieun Choi,Tae-Kyun Kim
机构: KCVL Lab, KAIST (KAIST 韩国科学技术院); KT Corporation (KT Corporation)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures
点击查看摘要
Abstract:Diffusion-based methods have achieved significant successes in T2I generation, providing realistic images from text prompts. Despite their capabilities, these models face persistent challenges in generating realistic human hands, often producing images with incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during the inference process. For visual guidance, we employ a discriminator trained on a dataset comprising paired real and generated images with captions, derived from various hand-in-the-wild datasets. We also employ textual guidance with a LoRA adapter, which learns the direction from 'hands' towards more detailed prompts such as 'natural hands' and 'anatomically correct fingers' at the latent level. A cumulative hand mask, which is gradually enlarged over the assigned time steps, is applied to the added guidance, allowing the hand to be refined while maintaining the rich generative capabilities of the pre-trained model. In the experiments, our method achieves superior hand generation quality without any specific conditions or priors. We carry out both quantitative and qualitative evaluations, along with user studies, to showcase the benefits of our approach in producing high-quality hand images.
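摘要中“随时间步逐渐放大的累积手部掩码”可以用一个极简的 NumPy 示意来理解(非论文实现;膨胀次数上限 max_grow、总步数等均为假设值,真实方法中掩码作用于潜变量空间的引导项):

```python
import numpy as np

def cumulative_hand_mask(base_mask, step, total_steps, max_grow=3):
    """随时间步逐渐放大的二值手部掩码(简单的 4 邻域膨胀示意)。"""
    grow = int(max_grow * step / total_steps)  # 当前时间步对应的膨胀次数
    m = base_mask.copy()
    for _ in range(grow):
        p = np.pad(m, 1)
        # 4 邻域膨胀:中心或任一上下左右邻居为 True 即置 True
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:])
    return m

base = np.zeros((7, 7), dtype=bool)
base[3, 3] = True                      # 初始手部区域(单像素玩具示例)
early = cumulative_hand_mask(base, step=0, total_steps=10)
late = cumulative_hand_mask(base, step=10, total_steps=10)
```

早期步掩码与初始掩码相同,末期步掩码已膨胀为半径 3 的菱形区域,体现“逐步扩大”的设计。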
zh
[CV-130] Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments
【速读】:该论文旨在解决世界模型在保持场景内容一致性(World Stability)方面存在的不足。尽管基于扩散的生成模型在生成沉浸式和逼真的环境方面表现出色,但它们往往无法长期保留先前生成的场景,这种缺陷可能引入噪声并影响强化学习代理的学习效果,尤其在安全性至关重要的应用场景中表现不佳。论文的关键解决方案是提出了一种评估框架,通过让世界模型执行一系列动作及其逆操作以返回初始视角,从而量化起始和最终观测之间的一致性来衡量世界稳定性。此外,研究还探索了多种改进策略以提升世界稳定性。研究结果强调了世界稳定性在世界建模中的重要性,并为未来相关领域的研究提供了实际可行的见解。
链接: https://arxiv.org/abs/2503.08122
作者: Soonwoo Kwon,Jin-Young Kim,Hyojun Go,Kyungjune Baek
机构: Sejong University (世宗大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
点击查看摘要
Abstract:We present a novel study on enhancing the capability of preserving the content in world models, focusing on a property we term World Stability. Recent diffusion-based generative models have advanced the synthesis of immersive and realistic environments that are pivotal for applications such as reinforcement learning and interactive game engines. However, while these models excel in quality and diversity, they often neglect the preservation of previously generated scenes over time–a shortfall that can introduce noise into agent learning and compromise performance in safety-critical settings. In this work, we introduce an evaluation framework that measures world stability by having world models perform a sequence of actions followed by their inverses to return to their initial viewpoint, thereby quantifying the consistency between the starting and ending observations. Our comprehensive assessment of state-of-the-art diffusion-based world models reveals significant challenges in achieving high world stability. Moreover, we investigate several improvement strategies to enhance world stability. Our results underscore the importance of world stability in world modeling and provide actionable insights for future research in this domain.
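摘要所述的评估协议(执行一串动作、再执行其逆动作返回初始视角,比较起止观测)可以用一个玩具示例说明;这里用循环移位精确模拟一个“完全稳定”的世界模型(动作为相机平移,均为假设设置),真实的扩散世界模型则会产生非零的不稳定度:

```python
import numpy as np

def apply_action(obs, action):
    """玩具世界模型:动作为 (dy, dx) 平移,用循环移位精确模拟,无生成噪声。"""
    return np.roll(obs, shift=action, axis=(0, 1))

rng = np.random.default_rng(0)
obs0 = rng.random((8, 8))              # 初始观测
actions = [(1, 0), (0, 2), (-1, 1)]    # 一串动作

obs = obs0
for a in actions:                      # 先执行动作序列
    obs = apply_action(obs, a)
for a in reversed(actions):            # 再按逆序执行逆动作返回初始视角
    obs = apply_action(obs, (-a[0], -a[1]))

instability = float(np.abs(obs - obs0).mean())  # 起止观测差异,0 表示完全稳定
```

对生成式世界模型而言,该指标通常大于 0;指标越小,世界稳定性越高。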
zh
[CV-131] AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification CVPR
【速读】:本文旨在解决跨平台(无人机与地面监控及可穿戴设备)视角下视频人物重识别 (Video-based Person Re-Identification, ReID) 的挑战性问题,构建了一个包含6,632个身份、32,321条轨迹片段和960万帧图像的大规模数据集AG-VPReID。针对这一任务,论文提出了AG-VPReID-Net,一种端到端框架,其关键在于融合三个互补的流:(1) 适应时空流 (Adapted Temporal-Spatial Stream),用于处理运动模式不一致性和时间特征学习;(2) 归一化表观流 (Normalized Appearance Stream),利用物理启发技术应对分辨率和外观变化;(3) 多尺度注意力流 (Multi-Scale Attention Stream),以应对不同无人机高度带来的尺度变化。通过整合这些流中的视觉语义信息,该方法能够生成鲁棒且视点不变的人物表示。实验表明,AG-VPReID-Net在新数据集及其他现有基准上均优于现有方法,同时数据集中方法表现普遍较低的现象进一步凸显了其难度。
链接: https://arxiv.org/abs/2503.08121
作者: Huy Nguyen,Kien Nguyen,Akila Pemasiri,Feng Liu,Sridha Sridharan,Clinton Fookes
机构: School of Electrical Engineering and Robotics, Queensland University of Technology (昆士兰科技大学); Department of Electrical and Computer Engineering, Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Computer Vision and Pattern Recognition Conference (CVPR) 2025
点击查看摘要
Abstract:We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models are available at this https URL.
zh
[CV-132] UniF2ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
【速读】:该论文致力于解决现有统一多模态模型(Unified Multimodal Models, UMMs)在人脸领域粗粒度面部属性理解能力有限且缺乏生成能力的问题。为克服这些局限性,论文提出了一种专门针对细粒度人脸理解和生成的统一模型——UniF2ace。其关键解决方案包括:首先构建了一个包含130K图像-文本对及100万问答对的大规模人脸数据集UniF2ace-130K;其次通过理论连接离散扩散分数匹配与掩码生成模型,同时优化证据下界,显著提升了模型合成面部细节的能力;最后引入了令牌级和序列级混合专家架构,以实现高效细粒度表示学习,从而在理解和生成任务中均表现出色。实验结果表明,相比现有UMMs和生成模型,UniF2ace在理解与生成任务上均达到了更优性能。
链接: https://arxiv.org/abs/2503.08120
作者: Junzhe Li,Xuerui Qiu,Linrui Xu,Liya Guo,Delin Qu,Tingting Long,Chun Fan,Ming Li
机构: School of Computer Science, Peking University (北京大学计算机科学学院); Computer Center, Peking University (北京大学计算中心); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Central South University (中南大学); Yau Mathematical Sciences Center and Department of Mathematical Sciences, Tsinghua University (清华大学丘成桐数学科学中心和数学科学系); Fudan University (复旦大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东人工智能与数字经济实验室 (深圳)); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on coarse facial attribute understanding, with limited capacity to handle fine-grained facial attributes and without addressing generation capabilities. To overcome these limitations, we propose UniF2ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF2ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF2ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF2ace-130K demonstrate that UniF2ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
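摘要提到的令牌级混合专家(MoE)路由可以用如下 NumPy 草图示意(非论文实现;门控矩阵 w_gate、专家数 E、top-k 取值等均为假设的占位设置):

```python
import numpy as np

def token_topk_moe(tokens, w_gate, experts, k=2):
    """令牌级 top-k MoE 路由:每个令牌只激活 k 个专家并按门控权重加权求和。"""
    logits = tokens @ w_gate                   # [T, E] 门控得分
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        top = np.argsort(logits[t])[::-1][:k]  # 得分最高的 k 个专家
        w = np.exp(logits[t, top])
        w /= w.sum()                           # 在选中的专家上做 softmax
        for wi, e in zip(w, top):
            out[t] += wi * (tokens[t] @ experts[e])
    return out

rng = np.random.default_rng(1)
T, D, E = 4, 8, 4
tokens = rng.normal(size=(T, D))
w_gate = rng.normal(size=(D, E))
experts = [rng.normal(size=(D, D)) for _ in range(E)]  # 每个专家为一个线性变换
out = token_topk_moe(tokens, w_gate, experts, k=2)
```

这种稀疏激活是 MoE 在理解与生成任务间实现高效细粒度表示学习的常见思路。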
zh
[CV-133] ACE: Concept Editing in Diffusion Models without Performance Degradation
【速读】:该论文试图解决扩散模型在文本到图像生成过程中产生的不安全内容问题,同时确保模型保持其通用生成能力。传统概念编辑方法在此类任务中难以兼顾去除不安全概念与保留高质量图像生成性能之间的平衡。论文提出的ACE方法的关键在于引入了一种新颖的跨零空间投影技术,能够精准擦除不安全概念,同时维持模型生成高保真、语义一致图像的能力。实验结果表明,ACE相比现有先进基线方法,在语义一致性平均提升24.56%、图像生成质量平均提升34.82%的同时,仅需1%的时间成本,显著提升了概念编辑的实际可用性,为扩散模型的广泛应用奠定了基础。
链接: https://arxiv.org/abs/2503.08116
作者: Ruipeng Wang,Junfeng Fang,Jiaqi Li,Hao Chen,Jie Shi,Kun Wang,Xiang Wang
机构: University of Science and Technology of China (中国科学技术大学); Southeast University (东南大学); Beijing University of Posts and Telecommunications (北京邮电大学); Huawei (华为)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing has been proposed to address these issues, existing methods often struggle to balance the removal of unsafe concepts with maintaining the model's general generative capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concepts while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms the advanced baselines, improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is available at this https URL
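摘要中“跨零空间投影”的具体构造需参阅原文;下面仅给出通用的零空间投影草图,演示如何把擦除方向投影到保留概念的零空间中,使编辑不影响这些概念(preserve、edit 均为随机占位的假设数据):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
preserve = rng.normal(size=(5, d))   # 需保留概念的嵌入,每行一个概念(假设值)
edit = rng.normal(size=d)            # 原始的不安全概念擦除方向(假设值)

# N 将向量投影到 preserve 各行张成子空间的零空间:
# 投影后的编辑方向与所有保留概念正交,因而不会改变它们的响应
N = np.eye(d) - np.linalg.pinv(preserve) @ preserve
safe_edit = N @ edit
```

验证方式是检查 `preserve @ safe_edit` 是否近似为零向量,即编辑对保留概念“不可见”。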
zh
[CV-134] MaRI: Material Retrieval Integration across Domains
【速读】:该论文旨在解决材料检索中因现有数据集多样性有限且真实世界泛化能力不足导致的性能瓶颈问题。目前多数方法依赖传统图像搜索技术,难以有效捕捉材料空间的独特属性,从而在检索任务中表现欠佳。为应对这些挑战,论文提出了一种名为MaRI的框架,其关键是通过对比学习策略构建一个共享嵌入空间,该空间能够协调视觉与材料属性,通过联合训练图像编码器和材料编码器,将相似的材料和图像拉近,同时在特征空间中分离不相似的样本对。此外,论文还构建了一个包含高质量合成材料(受控形状变化和多样化光照条件)以及经处理和标准化的真实材料的综合数据集,以支持该框架。实验结果验证了MaRI在多样性和复杂性检索任务中的优越性能、准确性和泛化能力,显著超越现有方法。
链接: https://arxiv.org/abs/2503.08111
作者: Jianhui Wang,Zhifei Yang,Yangfan He,Huixiong Zhang,Yuxuan Chen,Jingwei Huang
机构: University of Electronic Science and Technology of China (电子科技大学); Peking University (北京大学); University of Minnesota (明尼苏达大学); Fudan University (复旦大学); Tencent Hunyuan3D (腾讯混元3D)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate material retrieval is critical for creating realistic 3D assets. Existing methods rely on datasets that capture shape-invariant and lighting-varied representations of materials, which are scarce and face challenges due to limited diversity and inadequate real-world generalization. Most current approaches adopt traditional image search techniques. They fall short in capturing the unique properties of material spaces, leading to suboptimal performance in retrieval tasks. Addressing these challenges, we introduce MaRI, a framework designed to bridge the feature space gap between synthetic and real-world materials. MaRI constructs a shared embedding space that harmonizes visual and material attributes through a contrastive learning strategy by jointly training an image and a material encoder, bringing similar materials and images closer while separating dissimilar pairs within the feature space. To support this, we construct a comprehensive dataset comprising high-quality synthetic materials rendered with controlled shape variations and diverse lighting conditions, along with real-world materials processed and standardized using material transfer techniques. Extensive experiments demonstrate the superior performance, accuracy, and generalization capabilities of MaRI across diverse and complex material retrieval tasks, outperforming existing methods.
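MaRI 的对比学习策略可以用一个单向 InfoNCE 损失草图来说明(非官方实现;温度 tau 取常见值 0.07,嵌入为玩具数据):配对的图像/材质嵌入被拉近,错配的被推远:

```python
import numpy as np

def info_nce(img_emb, mat_emb, tau=0.07):
    """单向 InfoNCE:第 i 张图像与第 i 个材质互为正样本,其余为负样本。"""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mat = mat_emb / np.linalg.norm(mat_emb, axis=1, keepdims=True)
    logits = img @ mat.T / tau
    logits -= logits.max(axis=1, keepdims=True)   # 数值稳定
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))        # 对角线即正样本对

emb = np.eye(4)                                   # 玩具嵌入:已完美对齐的配对
loss_aligned = info_nce(emb, emb)
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))  # 打乱配对后损失应更大
```

论文的实际训练为双向(图像到材质、材质到图像)对比目标,此处仅保留一个方向以突出结构。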
zh
[CV-135] Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
【速读】:本文针对基于查询的方法在3D目标检测任务中的高效运行挑战展开研究,特别是密集特征下模型计算需求随着大图像尺寸和多Transformer层增加而显著提升,这在边缘设备上的运行带来了困难。现有剪枝和蒸馏方法要么需要重新训练,要么专为ViT模型设计,难以迁移到3D检测器。为了解决这些问题,论文提出了一种针对3D目标检测模型Transformer解码器的零样本运行时剪枝方法,称为tgGBC(渐进式裁剪键并由分类分数引导)。该方法通过扩展分类分数并将其与注意力图相乘以获得每个键的重要性得分,并根据重要性得分逐层裁剪特定的键。关键在于其系统性地依据键的重要性进行裁剪,从而在最新ToC3D模型的Transformer解码器中实现了1.99倍的速度提升,同时性能损失小于1%。有趣的是,对于某些模型,该方法甚至提升了性能。此外,论文在边缘设备上部署了带有tgGBC的3D检测器,进一步验证了方法的有效性。
链接: https://arxiv.org/abs/2503.08101
作者: Lizhen Xu,Xiuxiu Bai,Xiaojun Jia,Jianwu Fang,Shanmin Pang
机构: Xi’an Jiaotong University (西安交通大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification score to multiply it with the attention map to get the importance score of each key and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at this https URL.
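tgGBC 的核心计算——将分类得分扩展后与注意力图相乘得到每个键的重要性、再按重要性裁剪——可以用几行 NumPy 还原(示意草图,注意力图与分类得分为随机占位,keep_ratio 为假设值):

```python
import numpy as np

rng = np.random.default_rng(3)
num_queries, num_keys = 4, 8

attn = rng.random((num_queries, num_keys))
attn /= attn.sum(axis=1, keepdims=True)       # 每行近似 softmax 后的注意力图
cls_scores = rng.random(num_queries)          # 每个查询的最大分类得分

# 分类得分沿键维扩展后与注意力图相乘,再按查询求和得到每个键的重要性
importance = (cls_scores[:, None] * attn).sum(axis=0)

keep_ratio = 0.5
k = int(num_keys * keep_ratio)
kept = np.sort(np.argsort(importance)[::-1][:k])  # 保留重要性最高的键并维持原顺序
```

后续 Transformer 层只在保留的键上计算注意力,从而在零样本(无需重训练)的前提下降低计算量。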
zh
[CV-136] MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution
【速读】:该论文旨在解决现有文本到图像(Text-to-Image, T2I)扩散模型在真实场景图像超分辨率(Real-World Image Super-Resolution, Real-ISR)任务中面临的两个主要问题:一是统一的抽象文本语义在不同深度的模块中未能充分适应图像本身固有的细粒度具体语义;二是单一类型的引导导致重建一致性不足。为了解决这些问题,论文提出了MegaSR框架,其关键是通过动态整合每一块的图像属性来挖掘定制化的块级语义与表达性引导,并引入HED边缘图、深度图及分割图作为最具有表现力的引导,结合多阶段聚合策略将其融入T2I模型中,从而实现更丰富的语义表达和结构一致性。
链接: https://arxiv.org/abs/2503.08096
作者: Xinrui Li,Jianlong Wu,Xinchuan Huang,Chong Chen,Weili Guan,Xian-Sheng Hua,Liqiang Nie
机构: Harbin Institute of Technology (哈尔滨工业大学); Terminus Group (未翻译)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at different depths and the fine-grained, concrete semantics inherently present in the images themselves. Moreover, relying solely on a single type of guidance further disrupts the consistency of reconstruction. To address these issues, we propose MegaSR, a novel framework that mines customized block-wise semantics and expressive guidance for diffusion-based ISR. Compared to uniform textual semantics, MegaSR enables flexible adaptation to multi-granularity semantic awareness by dynamically incorporating image attributes at each block. Furthermore, we experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance, and propose a multi-stage aggregation strategy to modulate them into the T2I models. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
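摘要中的“多阶段聚合策略”在论文中是将 HED 边缘图、深度图与分割图调制进 T2I 模型;下面仅以按阶段加权求和的方式做一个概念性示意(引导图与各阶段权重均为假设的占位值,非论文的实际调制方式):

```python
import numpy as np

rng = np.random.default_rng(7)
H = W = 8
guidance = {                                   # 三类引导图(随机数占位)
    "hed": rng.random((H, W)),
    "depth": rng.random((H, W)),
    "seg": rng.random((H, W)),
}
stage_weights = {                              # 各阶段的引导权重(假设值)
    "early": np.array([0.6, 0.3, 0.1]),
    "late": np.array([0.2, 0.3, 0.5]),
}

def aggregate(stage):
    """按阶段权重对三类引导图做凸组合,得到该阶段的条件图。"""
    w = stage_weights[stage] / stage_weights[stage].sum()
    maps = np.stack([guidance["hed"], guidance["depth"], guidance["seg"]])
    return np.tensordot(w, maps, axes=1)       # 权重与引导图逐类加权求和

cond_early = aggregate("early")
cond_late = aggregate("late")
```

不同深度的模块使用不同权重,对应“不同深度有不同语义需求”的观察。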
zh
[CV-137] MVGSR: Multi-View Consistency Gaussian Splatting for Robust Surface Reconstruction
【速读】:该论文旨在解决在非静态环境中(如包含动态物体和干扰因素的场景)使用3D Gaussian Splatting (3DGS) 进行表面重建时面临的浮点伪影(floating artifacts)和颜色误差(color errors)问题,这些问题源于不同视角之间的一致性不匹配。论文的关键解决方案包括提出了一种名为Multi-View Consistency Gaussian Splatting for Robust Surface Reconstruction (MVGSR) 的方法:通过利用轻量级的高斯模型以及启发式引导的干扰物掩膜策略来实现鲁棒的表面重建;采用多视角特征一致性比较分离干扰物与静态场景元素,从而早期获得精确的干扰物掩膜;引入基于多视角贡献的剪枝措施以重置透射率,有效减少浮点伪影;最后应用多视角一致性损失函数以提升表面重建任务的质量。实验结果表明,MVGSR 在几何精度和渲染保真度方面达到了最先进的水平。
链接: https://arxiv.org/abs/2503.08093
作者: Chenfeng Hou,Qi Xun Yeo,Mengqi Guo,Yongxin Su,Yanyan Li,Gim Hee Lee
机构: Beihang University (北京航空航天大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has gained significant attention for its high-quality rendering capabilities, ultra-fast training, and inference speeds. However, when we apply 3DGS to surface reconstruction tasks, especially in environments with dynamic objects and distractors, the method suffers from floating artifacts and color errors due to inconsistency from different viewpoints. To address this challenge, we propose Multi-View Consistency Gaussian Splatting for the domain of Robust Surface Reconstruction (MVGSR), which takes advantage of lightweight Gaussian models and a heuristics-guided distractor masking strategy for robust surface reconstruction in non-static environments. Compared to existing methods that rely on MLPs for distractor segmentation strategies, our approach separates distractors from static scene elements by comparing multi-view feature consistency, allowing us to obtain precise distractor masks early in training. Furthermore, we introduce a pruning measure based on multi-view contributions to reset transmittance, effectively reducing floating artifacts. Finally, a multi-view consistency loss is applied to achieve high-quality performance in surface reconstruction tasks. Experimental results demonstrate that MVGSR achieves competitive geometric accuracy and rendering fidelity compared to the state-of-the-art surface reconstruction algorithms. More information is available on our project page (this https URL).
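MVGSR 通过比较多视角特征一致性来分离干扰物;这一思路可以用逐像素余弦相似度的草图示意(非论文实现;特征为随机占位,并假设各视角特征已对齐到参考视角,阈值 0.5 为假设值):

```python
import numpy as np

rng = np.random.default_rng(4)
V, H, W, C = 3, 4, 4, 8
feats = rng.normal(size=(V, H, W, C))   # 假设已对齐到参考视角的多视角特征
feats /= np.linalg.norm(feats, axis=-1, keepdims=True)  # 逐像素 L2 归一化

ref = feats[0]                          # 参考视角
# 与其余视角的逐像素余弦相似度,再取均值得到跨视角一致性
sims = np.einsum('hwc,vhwc->vhw', ref, feats[1:])
consistency = sims.mean(axis=0)
distractor_mask = consistency < 0.5     # 一致性低的像素视为干扰物
```

静态场景元素在各视角间特征一致性高,动态干扰物则一致性低,因而可在训练早期得到干扰物掩码。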
zh
[CV-138] SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection
【速读】:该论文旨在解决基于多模态LiDAR和相机的3D目标检测方法中,利用鸟瞰图(BEV)空间进行中间特征表示导致分辨率降低的问题。BEV空间通过牺牲z轴信息以减少整体特征分辨率,可能导致检测精度下降。为应对这一低分辨率特征的问题,论文聚焦于LiDAR点云数据的稀疏特性,并观察到即使体素的分辨率显著高于BEV地图,由LiDAR数据构建的3D体素中占用单元的数量可能仍少于BEV地图中的总单元数。基于此,论文引入了一种新颖的基于稀疏体素的Transformer网络,称为SparseVoxFormer。其关键创新在于直接利用稀疏体素特征作为基于Transformer检测器的输入,而非执行BEV特征提取。此外,针对相机模态,提出了显式的多模态融合方法,即投影3D体素坐标至2D图像并收集对应的图像特征。这些组件使得方法能够利用几何上更丰富的多模态特征,同时降低计算成本。最终的实验结果表明,使用显著更少的稀疏特征可以大幅减少3D目标检测器的计算开销,同时提升整体性能和远距离检测能力。
链接: https://arxiv.org/abs/2503.08092
作者: Hyeongseok Son,Jia He,Seung-In Park,Ying Min,Yunhao Zhang,ByungIn Yoo
机构: Samsung Electronics, AI Center (三星电子, 人工智能中心); Samsung R&D Institute China Xi’an (SRCX) (三星中国西安研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Most previous 3D object detection methods that leverage the multi-modality of LiDAR and cameras utilize the Bird’s Eye View (BEV) space for intermediate feature representation. However, this space uses a low x, y-resolution and sacrifices z-axis information to reduce the overall feature resolution, which may result in declined accuracy. To tackle the problem of using low-resolution features, this paper focuses on the sparse nature of LiDAR point cloud data. From our observation, the number of occupied cells in the 3D voxels constructed from a LiDAR data can be even fewer than the number of total cells in the BEV map, despite the voxels’ significantly higher resolution. Based on this, we introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed as SparseVoxFormer. Instead of performing BEV feature extraction, we directly leverage sparse voxel features as the input for a transformer-based detector. Moreover, with regard to the camera modality, we introduce an explicit modality fusion approach that involves projecting 3D voxel coordinates onto 2D images and collecting the corresponding image features. Thanks to these components, our approach can leverage geometrically richer multi-modal features while even reducing the computational cost. Beyond the proof-of-concept level, we further focus on facilitating better multi-modal fusion and flexible control over the number of sparse features. Finally, thorough experimental results demonstrate that utilizing a significantly smaller number of sparse features drastically reduces computational costs in a 3D object detector while enhancing both overall and long-range performance.
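摘要中的显式多模态融合——将 3D 体素坐标投影到 2D 图像并采集对应图像特征——可以用针孔相机模型写成几行代码(示意草图;内参、体素坐标与特征图均为假设的占位值):

```python
import numpy as np

fx = fy = 100.0
cx = cy = 64.0                          # 针孔相机内参(假设值)
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1.0]])

voxels = np.array([[1.0, 0.5, 5.0],     # 相机坐标系下的占用体素中心(假设值)
                   [-0.5, 0.2, 4.0]])
img_feats = np.random.default_rng(5).random((128, 128, 16))  # H×W×C 图像特征图

uvz = voxels @ K.T                      # 投影到像素平面(齐次坐标)
u = np.clip(np.round(uvz[:, 0] / uvz[:, 2]).astype(int), 0, 127)
v = np.clip(np.round(uvz[:, 1] / uvz[:, 2]).astype(int), 0, 127)
sampled = img_feats[v, u]               # 为每个占用体素采集对应的图像特征
```

由于只对占用体素(而非整张 BEV 网格)采样,特征既保留了 z 轴信息又保持稀疏。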
zh
[CV-139] PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中生成式模型应用受限的问题,主要挑战包括在异构数据环境中通信成本高以及训练不稳定。论文提出的解决方案PRISM是一种专为生成式模型设计的联邦学习框架,其关键在于通过以随机方式搜索最优的随机二值掩码(stochastic binary mask),而非直接更新模型权重,从而识别出具有高生成性能的稀疏子网络,即所谓的“强彩票票”(strong lottery ticket)。此方法结合服务器端的最大均值差异(Maximum Mean Discrepancy, MMD)损失函数与掩码感知动态移动平均聚合(Mask-Aware Dynamic Moving Average Aggregation, MADA),有效缓解了FL场景中的局部发散问题,实现了稳定且强大的生成能力。此外,PRISM因其稀疏化特性,无需额外剪枝或量化即可获得轻量级模型,适合边缘设备等资源受限环境。实验结果表明,PRISM在MNIST、FMNIST、CelebA和CIFAR10等复杂数据集上优于现有方法,并在非独立同分布(non-IID)及隐私保护的FL环境下成功生成图像。
链接: https://arxiv.org/abs/2503.08085
作者: Kyeongkook Seo,Dong-Jun Han,Jaejun Yoo
机构: Ulsan National Institute of Science and Technology (UNIST); Department of Computer Science and Engineering, Yonsei University
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite recent advancements in federated learning (FL), the integration of generative models into FL has been limited due to challenges such as high communication costs and unstable training in heterogeneous data environments. To address these issues, we propose PRISM, a FL framework tailored for generative models that ensures (i) stable performance in heterogeneous data distributions and (ii) resource efficiency in terms of communication cost and final model size. The key of our method is to search for an optimal stochastic binary mask for a random network rather than updating the model weights, identifying a sparse subnetwork with high generative performance; i.e., a "strong lottery ticket". By communicating binary masks in a stochastic manner, PRISM minimizes communication overhead. This approach, combined with the utilization of maximum mean discrepancy (MMD) loss and a mask-aware dynamic moving average aggregation method (MADA) on the server side, facilitates stable and strong generative capabilities by mitigating local divergence in FL scenarios. Moreover, thanks to its sparsifying characteristic, PRISM yields a lightweight model without extra pruning or quantization, making it ideal for environments such as edge devices. Experiments on MNIST, FMNIST, CelebA, and CIFAR10 demonstrate that PRISM outperforms existing methods, while maintaining privacy with minimal communication costs. PRISM is the first to successfully generate images under challenging non-IID and privacy-preserving FL environments on complex datasets, where previous methods have struggled.
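PRISM 在服务器端使用的最大均值差异(MMD)损失可以用 RBF 核写出一个最小实现(示意草图,非官方实现;sigma 与样本均为假设值)。两组样本分布越接近,MMD 越小:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """RBF 核下的平方 MMD(有偏估计),衡量两组样本的分布差异。"""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # 成对平方距离
        return np.exp(-d2 / (2 * sigma ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

rng = np.random.default_rng(6)
real = rng.normal(0.0, 1.0, size=(64, 2))    # “真实”样本
close = rng.normal(0.0, 1.0, size=(64, 2))   # 同分布的生成样本
far = rng.normal(3.0, 1.0, size=(64, 2))     # 分布偏移的生成样本
```

同分布样本的 MMD 接近 0,分布偏移的样本 MMD 明显更大,训练中即以最小化该差异为目标。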
zh
[CV-140] rend-Aware Supervision: On Learning Invariance for Semi-Supervised Facial Action Unit Intensity Estimation
【速读】:该论文旨在解决半监督面部动作单元(Action Unit, AU)强度估计中因缺乏标注导致的虚假相关性问题,特别是由AU共现和主体差异引起的非鲁棒强度估计。论文的关键在于提出了一种名为趋势感知监督(Trend-Aware Supervision, TAS)的方法,通过挖掘关键帧标注中的趋势信息,提升模型对特定AU面部外观变化趋势的认知,从而学习到与AU相关的不变特征,缓解虚假相关性问题并实现强度估计的鲁棒性。实验结果表明,TAS在BP4D和DISFA两个基准数据集上的有效性,并且在推理过程中无需额外的计算或存储成本。
链接: https://arxiv.org/abs/2503.08078
作者: Yingjie Chen,Jiarui Zhang,Tao Wang,Yun Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision, and that raising the awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose Trend-Aware Supervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, achieving intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. Under trend-aware supervision, performance can be improved without extra computational or storage costs during inference.
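三种趋势感知中的“趋势内排序感知”可以理解为一种铰链式排序约束:预测强度的相对大小须与关键帧标注给出的趋势一致。下面是一个概念性草图(非论文的实际损失形式,margin 为假设值):

```python
import numpy as np

def intra_trend_ranking_loss(pred, key_labels, margin=0.1):
    """铰链式排序损失:要求预测强度遵循关键帧标注的变化趋势。"""
    loss = 0.0
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            sign = np.sign(key_labels[j] - key_labels[i])  # 标注给出的趋势方向
            loss += max(0.0, margin - sign * (pred[j] - pred[i]))
    return loss

key_labels = [0.0, 1.0, 2.0]          # 关键帧标注的 AU 强度(递增趋势)
good = intra_trend_ranking_loss([0.0, 0.5, 1.0], key_labels)  # 顺应趋势
bad = intra_trend_ranking_loss([1.0, 0.5, 0.0], key_labels)   # 违背趋势
```

顺应标注趋势的预测损失为 0,违背趋势的预测则被惩罚,这正是把趋势信息当作额外监督的含义。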
zh
[CV-141] Seeing Beyond Haze: Generative Nighttime Image Dehazing
【速读】:该论文旨在解决夜间浓雾和强光导致背景信息严重退化或完全丢失的去雾难题。现有方法因缺乏有效的背景先验和生成能力而面临挑战,这是处理此类条件所必需的。论文的关键解决方案包括:通过调整图像扩散模型获取强大的背景先验,并利用引导训练增强对受雾和光影响区域的生成能力。具体而言,任务特定的夜间去雾知识被蒸馏到图像扩散模型中,同时模型通过成对图像训练进一步提升生成缺失背景细节的能力。为应对生成模型易产生幻觉的问题,框架设计允许用户控制生成程度,以平衡视觉真实感与事实准确性。实验表明,BeyondHaze 方法能够有效恢复浓重夜间雾天环境中的可见性。
链接: https://arxiv.org/abs/2503.08073
作者: Beibei Lin,Stephen Lin,Robby Tan
机构: National University of Singapore (新加坡国立大学); Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only significantly reduces haze and glow effects but also infers background information in regions where it may be absent. Our approach is developed on two main ideas: gaining strong background priors by adapting image diffusion models to the nighttime dehazing problem, and enhancing generative ability for haze- and glow-obscured scene areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model in a manner that preserves its capacity to generate clean images. The diffusion model is additionally trained on image pairs designed to improve its ability to generate background details and content that are missing in the input image due to haze effects. Since generative models are susceptible to hallucinations, we develop our framework to allow user control over the generative level, balancing visual realism and factual accuracy. Experiments on real-world images demonstrate that BeyondHaze effectively restores visibility in dense nighttime haze.
zh
[CV-142] GigaSLAM: Large-Scale Monocular SLAM with Hierarchical Gaussian Splats
【速读】:本文旨在解决仅利用单目RGB输入在大规模无界室外环境中进行跟踪与建图(SLAM)的挑战。传统基于Neural Radiance Fields (NeRF) 和 3D Gaussian Splatting (3DGS) 的SLAM方法通常局限于小型室内场景。为克服这些限制,论文提出GigaSLAM,这是首个基于NeRF/3DGS的大规模(千米级)室外SLAM框架,并在KITTI及其360度数据集上进行了验证。其关键解决方案在于采用分层稀疏体素地图表示法,通过神经网络在不同细节层次解码高斯分布,从而实现高效可扩展的地图构建与高质量视点渲染。此外,GigaSLAM结合度量深度模型、对极几何及PnP算法实现了前端精确位姿估计,并引入基于词袋的回环检测机制以保持长时间轨迹的鲁棒对齐,最终提供高精度跟踪与视觉忠实渲染能力,显著拓展了Gaussian Splatting SLAM系统在无界室外环境中的适用性。
链接: https://arxiv.org/abs/2503.08071
作者: Kai Deng,Jian Yang,Shenlong Wang,Jin Xie
机构: Nankai University (南开大学); University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校); Nanjing University Suzhou Campus (南京大学苏州校区)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Tracking and mapping in large-scale, unbounded outdoor environments using only monocular RGB input presents substantial challenges for existing SLAM systems. Traditional Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) SLAM methods are typically limited to small, bounded indoor settings. To overcome these challenges, we introduce GigaSLAM, the first NeRF/3DGS-based SLAM framework for kilometer-scale outdoor environments, as demonstrated on the KITTI and KITTI 360 datasets. Our approach employs a hierarchical sparse voxel map representation, where Gaussians are decoded by neural networks at multiple levels of detail. This design enables efficient, scalable mapping and high-fidelity viewpoint rendering across expansive, unbounded scenes. For front-end tracking, GigaSLAM utilizes a metric depth model combined with epipolar geometry and PnP algorithms to accurately estimate poses, while incorporating a Bag-of-Words-based loop closure mechanism to maintain robust alignment over long trajectories. Consequently, GigaSLAM delivers high-precision tracking and visually faithful rendering on urban outdoor benchmarks, establishing a robust SLAM solution for large-scale, long-term scenarios, and significantly extending the applicability of Gaussian Splatting SLAM systems to unbounded outdoor environments.
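GigaSLAM 的回环检测基于词袋(Bag-of-Words)相似度;其打分方式可以用归一化词袋直方图的点积来示意(草图;视觉词 ID 与词表大小均为玩具设置,实际系统中词由局部描述子量化得到):

```python
import numpy as np
from collections import Counter

def bow_vector(word_ids, vocab_size):
    """将一帧的视觉词 ID 列表转为 L2 归一化的词袋直方图。"""
    v = np.zeros(vocab_size)
    for w, c in Counter(word_ids).items():
        v[w] = c
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

vocab = 8
frame_a = bow_vector([0, 1, 1, 3, 5], vocab)
frame_b = bow_vector([0, 1, 1, 3, 6], vocab)   # 与 a 大量共享视觉词:疑似回环
frame_c = bow_vector([2, 4, 6, 7], vocab)      # 不同场景

sim_ab = float(frame_a @ frame_b)              # 归一化直方图的余弦相似度
sim_ac = float(frame_a @ frame_c)
```

相似度超过阈值的历史帧被视为回环候选,随后再做几何校验并用于长轨迹的全局对齐。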
zh
[CV-143] Simulating Automotive Radar with Lidar and Camera Inputs IROS2025
【速读】:该论文旨在解决低成本毫米波汽车雷达在自动驾驶中因高质量数据集缺乏而阻碍研究与开发的问题。论文提出了一种新方法,能够利用相机图像、LiDAR 点云和自车速度信息模拟包含俯仰角(pitch)、偏航角(yaw)、距离(range)以及多普勒速度(Doppler velocity)的四维毫米波雷达信号,并同时生成雷达信号强度(RSS)。该方案的关键在于两个新型神经网络:1)DIS-Net,用于估计雷达信号的空间分布和数量;2)RSS-Net,基于外观和几何信息预测信号的 RSS。实验结果表明,该方法可成功生成高保真雷达信号,且通过使用合成雷达数据增强训练的流行目标检测神经网络,在性能上优于仅使用原始雷达数据训练的模型,为未来基于雷达的研究与发展提供了有力支持。
链接: https://arxiv.org/abs/2503.08068
作者: Peili Song,Dezhen Song,Yifan Yang,Enfan Lan,Jingtai Liu
机构: Institute of Robotics and Automatic Information System, Nankai University, Tianjin 300350, China (南开大学机器人与自动信息系统研究所); Tianjin Key Laboratory of Intelligent Robotics, Tianjin 300350, China (天津市智能机器人重点实验室); TBI center, Nankai University, Tianjin 300350, China (南开大学TBI中心); Department of Robotics, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, United Arab Emirates (阿联酋阿布扎比穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to IROS 2025
点击查看摘要
Abstract:Low-cost millimeter automotive radar has received more and more attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter wave radar signals including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS) using camera image, light detection and ranging (lidar) point cloud, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.
zh
[CV-144] Continual Learning for Multiple Modalities
【速读】: This paper tackles knowledge retention in multimodal continual learning: when handling image, video, audio, depth, and text data, how to keep learning new tasks while avoiding forgetting previously acquired knowledge. The key is a novel multimodal continual learning framework that aligns the different modalities with text to exploit its rich semantic information, together with a cross-modal and intra-modal knowledge aggregation method that mitigates forgetting across modalities. In addition, a strategy for re-aligning modality embeddings is proposed to resolve biased alignment between modalities. Together, these components allow the model to work effectively across a wide range of continual learning scenarios, whether or not the modality identity is known.
链接: https://arxiv.org/abs/2503.08064
作者: Hyundong Jin,Eunwoo Kim
机构: Chung-Ang University (中央大学), South Korea
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures
点击查看摘要
Abstract:Continual learning aims to learn knowledge of tasks observed in sequential time steps while mitigating the forgetting of previously learned knowledge. Existing methods were proposed under the assumption of learning a single modality (e.g., image) over time, which limits their applicability in scenarios involving multiple modalities. In this work, we propose a novel continual learning framework that accommodates multiple modalities (image, video, audio, depth, and text). We train a model to align various modalities with text, leveraging its rich semantic information. However, this increases the risk of forgetting previously learned knowledge, exacerbated by the differing input traits of each task. To alleviate the overwriting of the previous knowledge of modalities, we propose a method for aggregating knowledge within and across modalities. The aggregated knowledge is obtained by assimilating new information through self-regularization within each modality and associating knowledge between modalities by prioritizing contributions from relevant modalities. Furthermore, we propose a strategy that re-aligns the embeddings of modalities to resolve biased alignment between modalities. We evaluate the proposed method in a wide range of continual learning scenarios using multiple datasets with different modalities. Extensive experiments demonstrate that ours outperforms existing methods in the scenarios, regardless of whether the identity of the modality is given.
zh
[CV-145] DDO-IN: Dual Domains Optimization for Implicit Neural Network to Eliminate Motion Artifact in Magnetic Resonance Imaging
【速读】: This paper targets the problem that magnetic resonance imaging (MRI) motion artifacts seriously affect clinical diagnosis, and that existing methods struggle to remove artifacts while preserving fine structural details and keeping images vivid and sharp. It proposes a novel Dual-Domain Optimization (DDO) method that integrates pixel-domain and frequency-domain information through Implicit Neural Representations (INRs) to recover clean MRI images. The key is to use the low-frequency components of k-space as a reference to capture accurate tissue textures, while high-frequency and pixel information recover details; complementary masks and dynamic loss weighting that transitions from global to local attention effectively suppress artifacts while retaining the details needed for reconstruction.
链接: https://arxiv.org/abs/2503.08056
作者: Zhongyu Mai,Zewei Zhan,Hanyu Guo,Yulang Huang,Weifeng Su
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures
点击查看摘要
Abstract:Magnetic resonance imaging (MRI) motion artifacts can seriously affect clinical diagnostics, making it challenging to interpret images accurately. Existing methods for eliminating motion artifacts struggle to retain fine structural details and simultaneously lack the necessary vividness and sharpness. In this study, we present a novel dual-domain optimization (DDO) approach that integrates information from the pixel and frequency domains, guiding the recovery of clean magnetic resonance images through implicit neural representations (INRs). Specifically, our approach leverages the low-frequency components in the k-space as a reference to capture accurate tissue textures, while high-frequency and pixel information contribute to recovering details. Furthermore, we design complementary masks and dynamic loss weighting, transitioning from global to local attention, that effectively suppress artifacts while retaining useful details for reconstruction. Experimental results on the NYU fastMRI dataset demonstrate that our method outperforms existing approaches in multiple evaluation metrics. Our code is available at this https URL.
zh
[CV-146] Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm
【速读】: This paper addresses the limitation that existing deepfake detection methods are confined to the closed-set paradigm: they can only detect deepfakes produced by forgery techniques present in the training set and cannot cope with unknown or emerging techniques, making them unreliable against unseen forgeries. The solution is an open-set deepfake classification algorithm based on supervised contrastive learning. The key is the open-set paradigm, which lets the model not only recognize images generated by known forgery methods but also classify images from unknown methods as 'unknown' rather than misjudging them as real. This makes the model more robust to novel deepfake techniques, improves reliability and confidence, and complements digital forensic analysis. Experimental results show state-of-the-art performance on the first two tasks and competitive results on the third.
链接: https://arxiv.org/abs/2503.08055
作者: Nadarasar Bahavan,Sanjay Saha,Ken Chen,Sachith Seneviratne,Sanka Rasnayaka,Saman Halgamuge
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Facial forgery methods such as deepfakes can be misused for identity manipulation and spreading misinformation. They have evolved alongside advancements in generative AI, leading to new and more sophisticated forgery techniques that diverge from existing 'known' methods. Conventional deepfake detection methods use the closed-set paradigm, thus limiting their applicability to detecting forgeries created using methods that are not part of the training dataset. In this paper, we propose a shift from the closed-set paradigm for deepfake detection. In the open-set paradigm, models are designed not only to identify images created by known facial forgery methods but also to identify and flag those produced by previously unknown methods as 'unknown' and not as unforged/real/unmanipulated. In this paper, we propose an open-set deepfake classification algorithm based on supervised contrastive learning. The open-set paradigm used in our model allows it to function as a more robust tool capable of handling emerging and unseen deepfake techniques, enhancing reliability and confidence, and complementing forensic analysis. In the open-set paradigm, we identify three groups, including the 'unknown' group that is considered neither a known deepfake nor real. We investigate deepfake open-set classification across three scenarios: classifying deepfakes from unknown methods not as real, distinguishing real images from deepfakes, and classifying deepfakes from known methods, using the FaceForensics++ dataset as a benchmark. Our method achieves state-of-the-art results in the first two tasks and competitive results in the third task.
zh
[CV-147] SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models
【速读】: This paper addresses the challenge of open-set recognition (OSR) for deep learning classifiers, i.e., effectively identifying data from unknown classes unseen during training. Existing methods are often computationally expensive due to their reliance on complex generative models, or suffer from high training costs. Approaching OSR from a representation-learning perspective, the paper proposes an efficient solution based on spherical embeddings. The key is SphOR, which models the feature space as a mixture of von Mises-Fisher distributions, enabling efficient representation learning and exploiting semantically ambiguous samples during training to improve the detection of unknown classes. The paper further analyzes the relationship between OSR performance and representation-learning properties, and validates the method on multiple OSR benchmarks, achieving state-of-the-art performance with accuracy improvements of up to 6%.
链接: https://arxiv.org/abs/2503.08049
作者: Nadarasar Bahavan,Sachith Seneviratne,Saman Halgamuge
机构: The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The widespread use of deep learning classifiers necessitates open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation-learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results with improvements of up to 6%, validating its performance.
zh
[CV-148] LongProLIP: A Probabilistic Vision-Language Model with Long Context Text ICLR2025
【速读】: This paper addresses the limitation of Probabilistic Language-Image Pre-Training (ProLIP) models in handling long-context text (beyond a context length of 64), so as to better capture the rich contextual information in longer text sequences. The key is a fine-tuning strategy for ProLIP (LongProLIP) that accepts longer text inputs (e.g., 256 text tokens). Experiments verify that the method improves long-context understanding while minimizing the negative effects of fine-tuning, and the paper also examines the trade-off between long-context understanding and zero-shot generalization capability.
链接: https://arxiv.org/abs/2503.08048
作者: Sanghyuk Chun,Sangdoo Yun
机构: NAVER AI Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a tiny paper at the 1st workshop of “Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI” at ICLR 2025; code: this https URL models: this https URL
点击查看摘要
Abstract:Recently, Probabilistic Language-Image Pre-Training (ProLIP) has been proposed to tackle the multiplicity issue of vision-language (VL) tasks. Despite their success in probabilistic representation learning at a scale, the ProLIP models cannot handle long context texts longer than 64 context length, which limits their ability to capture rich contextual information from longer text sequences. To address this issue, this paper proposes a fine-tuning strategy for ProLIP to accept longer texts, e.g., 256 text tokens. Experimental results on Urban-1k and the DataComp evaluation suite show that the proposed LongProLIP recipe can improve understanding of long contexts while minimizing the negative effect of fine-tuning. We also observe a trade-off between the long context understanding (measured by Urban-1k) and general zero-shot capability (measured by ImageNet or the average of 38 zero-shot evaluation datasets by DataComp).
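The paper fine-tunes ProLIP to accept 256 tokens, but the abstract does not say how the positional table trained at length 64 is extended. A common trick for this is linear interpolation of the learned positional embeddings; the sketch below is an assumption for illustration, not the paper's recipe, and the toy 1-dimensional embeddings are invented for the example.

```python
def interpolate_positions(pos_emb, new_len):
    """Linearly interpolate a learned positional-embedding table
    (a list of d-dim vectors) from its trained length to new_len."""
    old_len, d = len(pos_emb), len(pos_emb[0])
    out = []
    for i in range(new_len):
        # map the new index into the old table's coordinate range
        t = i * (old_len - 1) / (new_len - 1)
        lo = int(t)
        hi = min(lo + 1, old_len - 1)
        w = t - lo
        out.append([(1 - w) * pos_emb[lo][k] + w * pos_emb[hi][k]
                    for k in range(d)])
    return out

# Toy 1-dim embeddings for 64 positions, stretched to 256.
emb64 = [[float(i)] for i in range(64)]
emb256 = interpolate_positions(emb64, 256)
assert len(emb256) == 256
assert emb256[0] == emb64[0] and emb256[-1] == emb64[-1]
```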
zh
[CV-149] Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation
【速读】: This paper re-emphasizes the importance of low-level texture information in deep networks, particularly for semantic segmentation and related knowledge distillation tasks. Conventional high-level deep features may fail to adequately describe low-level texture characteristics such as local structural patterns and global statistical properties (e.g., boundaries, smoothness, regularity, and color contrast). To address this, the paper proposes a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework that combines structural and statistical texture knowledge to strengthen the use of low-level texture information. The key is a Contourlet Decomposition Module (CDM) to mine structural texture knowledge, and a Texture Intensity Equalization Module (TIEM) with the corresponding Quantization Congruence Loss (QDL) to extract and enhance statistical texture knowledge. In addition, Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks (STLNet++ and U-SSNet) are proposed so that existing segmentation networks can harvest structural and statistical texture information more effectively. Experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets.
链接: https://arxiv.org/abs/2503.08043
作者: Deyi Ji,Feng Zhao,Hongtao Lu,Feng Wu,Jieping Ye
机构: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China (USTC); Department of Computer Science and Engineering, MOE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Alibaba Group (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TPAMI 2025
点击查看摘要
Abstract:Low-level texture feature/knowledge is also of vital importance for characterizing the local structural pattern and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high-level deep features. In this paper, we aim to re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features with iterative Laplacian pyramid and directional filter bank to mine the structural texture knowledge, and Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks, namely STLNet++ and U-SSNet, to enable existing segmentation networks to harvest the structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets, respectively.
zh
[CV-150] Generalized Kullback-Leibler Divergence Loss NEURIPS
【速读】: This paper addresses convergence challenges and sample-induced bias in optimization with the Kullback-Leibler (KL) divergence loss, especially in knowledge distillation and adversarial training scenarios. The core of the solution is an improved Generalized Kullback-Leibler (GKL) divergence loss. The key points are: first, decomposing the asymmetric optimization property of the KL loss and introducing a smoother weight function, which alleviates convergence difficulties during optimization, particularly for classes with high predicted scores in soft labels; second, incorporating class-wise global information into the KL/DKL loss to reduce bias arising from individual samples. These improvements yield markedly better adversarial robustness on the RobustBench leaderboard and stronger knowledge distillation performance on CIFAR, ImageNet, and CLIP models.
链接: https://arxiv.org/abs/2503.08038
作者: Jiequan Cui,Beier Zhu,Qingshan Xu,Zhuotao Tian,Xiaojuan Qi,Bei Yu,Hanwang Zhang,Richang Hong
机构: College of Computing & Data Science, Nanyang Technological University (南洋理工大学); Department of Computer Science & Engineering, The Chinese University of Hong Kong, ShaTin, Hong Kong (香港中文大学); The University of Hong Kong (香港大学); Harbin Institution of Technology, Shenzhen (哈尔滨工业大学深圳校区); Hefei University of Technology (合肥工业大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: extension of our NeurIPS paper “Decoupled Kullback-Leibler Divergence Loss”. arXiv admin note: substantial text overlap with arXiv:2305.13948
点击查看摘要
Abstract:In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property along with a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Secondly, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training, and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard – RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating the substantial practical merits. Our code is available at this https URL.
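The decomposition quoted above builds on a standard identity: for soft labels p, KL(p‖q) equals the cross-entropy H(p, q) minus the entropy H(p), which is constant in q. The pure-Python check below verifies only that identity; the paper's full DKL decomposition into a weighted MSE term plus a soft-label cross-entropy term is more involved and is not reproduced here.

```python
import math

def kl_div(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) with soft labels p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Soft teacher labels p and student predictions q (toy values).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# KL(p||q) = H(p, q) - H(p): minimizing KL w.r.t. q is the same as
# minimizing soft-label cross-entropy, since H(p) does not depend on q.
assert abs(kl_div(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
assert kl_div(p, p) == 0.0
```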
zh
[CV-151] ObjectMover: Generative Object Movement with Video Prior CVPR2025
【速读】: This paper tackles the seemingly simple but actually highly challenging image-editing task of moving an object to another location within an image, which requires re-harmonizing lighting, adjusting pose according to perspective, accurately filling occluded regions, and keeping shadows and reflections coherently synchronized while preserving object identity. The key is to model the task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. This allows the model to adapt to complex real-world scenes, handling extreme lighting harmonization and object-effect movement. Because large-scale data for object movement is unavailable, the authors build a data-generation pipeline on a modern game engine to synthesize high-quality data pairs, and further propose a multi-task learning strategy to improve generalization on real-world video data. Experimental results show that ObjectMover achieves outstanding results on these tasks and adapts well to real-world scenarios.
链接: https://arxiv.org/abs/2503.08037
作者: Xin Yu,Tianyu Wang,Soo Ye Kim,Paul Guerrero,Xi Chen,Qing Liu,Zhe Lin,Xiaojuan Qi
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, Project Page: this https URL
点击查看摘要
Abstract:Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
zh
[CV-152] HOFAR: High-Order Augmentation of Flow Autoregressive Transformers
【速读】: This paper aims to overcome the limitation of existing FlowAR implementations, which are constrained to first-order trajectory modeling during generation, in order to improve the synthesis fidelity of flow-based autoregressive generative models. The key innovation is a new framework, High-Order FlowAR (HOFAR), that systematically enhances flow autoregressive transformers through high-order supervision. Through theoretical analysis and empirical evaluation of the high-order expansion, the method achieves measurable improvements in generation quality while deepening the understanding of trajectory dynamics in flow-based autoregressive modeling.
链接: https://arxiv.org/abs/2503.08032
作者: Yingyu Liang,Zhizhou Sha,Zhenmei Shi,Zhao Song,Mingda Wan
机构: The University of Hong Kong (香港大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Tsinghua University (清华大学); The Simons Institute for the Theory of Computing at UC Berkeley (伯克利加州大学西蒙斯计算理论研究所); Anhui University (安徽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.
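The gap between first-order and higher-order trajectory modeling can be illustrated on a scalar flow ODE: Euler integration is first-order, while Heun's method adds a second-order correction per step and tracks the trajectory more accurately. This is a generic numerical analogy, not the paper's training objective or architecture.

```python
import math

def flow_first_order(x0, v, t1=1.0, steps=8):
    """Euler integration of dx/dt = v(x): first-order trajectory model."""
    x, h = x0, t1 / steps
    for _ in range(steps):
        x = x + h * v(x)
    return x

def flow_second_order(x0, v, t1=1.0, steps=8):
    """Heun's method: averages the slope at both ends of each step,
    adding a second-order correction to the Euler update."""
    x, h = x0, t1 / steps
    for _ in range(steps):
        k1 = v(x)
        k2 = v(x + h * k1)
        x = x + h * (k1 + k2) / 2
    return x

v = lambda x: x                  # dx/dt = x, exact solution x0 * e^t
exact = 1.0 * math.e
e1 = abs(flow_first_order(1.0, v) - exact)
e2 = abs(flow_second_order(1.0, v) - exact)
assert e2 < e1                   # higher-order modeling tracks the flow better
```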
zh
[CV-153] Dynamic PET Image Reconstruction via Non-negative INR Factorization
【速读】: This paper addresses the significant but challenging problem of reconstructing dynamic positron emission tomography (PET) images from noisy projection data. It proposes an unsupervised learning approach, Non-negative Implicit Neural Representation Factorization (NINRF), whose core is a low-rank matrix factorization of the unknown images, with neural networks representing both the coefficients and the bases. The key is extending non-negative matrix factorization (NMF) to the domain of continuous functions, using implicit neural representations (INRs) to connect matrices with continuous functions. The network parameters are obtained by minimizing the Kullback-Leibler (KL) divergence, with additional sparsity regularization on coefficients and bases. Experimental results show that NINRF outperforms other methods on dynamic PET reconstruction with Poisson noise, while providing continuous representations of the object's detailed geometric features and regional concentration variation.
链接: https://arxiv.org/abs/2503.08025
作者: Chaozhi Zhang,Wenxiang Ding,Roy Y. He,Xiaoqun Zhang,Qiaoqiao Ding
机构: School of Mathematical Sciences, Shanghai Jiao Tong University (上海交通大学数学科学学院); Institute of Natural Sciences, Shanghai Jiao Tong University (上海交通大学自然科学研究院); Department of Mathematics, City University of Hong Kong (香港城市大学数学系); Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University (上海交通大学国家应用数学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The reconstruction of dynamic positron emission tomography (PET) images from noisy projection data is a significant but challenging problem. In this paper, we introduce an unsupervised learning approach, Non-negative Implicit Neural Representation Factorization (NINRF), based on low-rank matrix factorization of unknown images and employing neural networks to represent both coefficients and bases. Mathematically, we demonstrate that if a sequence of dynamic PET images satisfies a generalized non-negative low-rank property, it can be decomposed into a set of non-negative continuous functions varying in the temporal-spatial domain. This bridges the well-established non-negative matrix factorization (NMF) with continuous functions, and we propose using implicit neural representations (INRs) to connect matrices with continuous functions. The neural network parameters are obtained by minimizing the KL divergence, with additional sparsity regularization on coefficients and bases. Extensive experiments on dynamic PET reconstruction with Poisson noise demonstrate the effectiveness of the proposed method compared to other methods, while giving continuous representations for the object's detailed geometric features and regional concentration variation.
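NINRF minimizes a KL objective over INR-parameterized factors. As a concrete reference point, the classical matrix-only analogue is NMF with Lee-Seung multiplicative updates under the generalized KL divergence, sketched below in pure Python. The INR parameterization, the Poisson projection model, and the sparsity terms of the paper are omitted; this shows only the non-negative KL factorization that the method builds on.

```python
import math
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def gen_kl(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH) for non-negative matrices."""
    return sum(V[i][j] * math.log((V[i][j] + eps) / (WH[i][j] + eps))
               - V[i][j] + WH[i][j]
               for i in range(len(V)) for j in range(len(V[0])))

def kl_nmf(V, r, iters=300, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for NMF under generalized KL;
    every update keeps W and H non-negative and never increases the cost."""
    m, n = len(V), len(V[0])
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    d_init = gen_kl(V, matmul(W, H))
    for _ in range(iters):
        WH = matmul(W, H)
        for a in range(r):           # H_aj <- H_aj * (sum_i W_ia V_ij/(WH)_ij) / sum_i W_ia
            col = sum(W[i][a] for i in range(m)) + eps
            for j in range(n):
                num = sum(W[i][a] * V[i][j] / (WH[i][j] + eps) for i in range(m))
                H[a][j] *= num / col
        WH = matmul(W, H)
        for a in range(r):           # W_ia <- W_ia * (sum_j H_aj V_ij/(WH)_ij) / sum_j H_aj
            row = sum(H[a][j] for j in range(n)) + eps
            for i in range(m):
                num = sum(H[a][j] * V[i][j] / (WH[i][j] + eps) for j in range(n))
                W[i][a] *= num / row
    return W, H, d_init, gen_kl(V, matmul(W, H))

V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 1.0, 1.0]]               # a rank-2 non-negative matrix
W, H, d_init, d_final = kl_nmf(V, r=2)
assert d_final < d_init                        # cost strictly decreased
assert all(w >= 0 for row in W for w in row)   # non-negativity preserved
```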
zh
[CV-154] AdaSCALE: Adaptive Scaling for OOD Detection
【速读】: This paper addresses the problem that deep learning models struggle to reliably distinguish out-of-distribution (OOD) samples from in-distribution (ID) samples seen during training. Existing state-of-the-art OOD detection methods mainly rely on activation shaping to enhance the separation between ID and OOD inputs, but they typically apply a static percentile threshold that ignores the characteristics of each sample, leading to suboptimal separability.
The key solution is an adaptive scaling method called AdaSCALE, whose core is to adjust the percentile threshold dynamically according to a sample's estimated OOD likelihood. The estimation exploits a key observation: compared with ID samples, OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbations. Based on this, AdaSCALE applies stronger scaling to likely-ID samples and weaker scaling to likely-OOD samples, yielding highly separable energy scores. With this mechanism, AdaSCALE achieves state-of-the-art OOD detection performance across multiple architectures, surpassing the recent competitor OptFS by 14.94 (near-OOD) and 21.67 (far-OOD) in the average FPR@95 metric on the ImageNet-1k benchmark.
链接: https://arxiv.org/abs/2503.08023
作者: Sudarshan Regmi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
点击查看摘要
Abstract:The ability of a deep learning model to recognize when a sample falls outside its learned distribution is critical for safe and reliable deployment. Recent state-of-the-art out-of-distribution (OOD) detection methods leverage activation shaping to improve the separation between in-distribution (ID) and OOD inputs. These approaches resort to sample-specific scaling but apply a static percentile threshold across all samples regardless of their nature, resulting in suboptimal ID-OOD separability. In this work, we propose AdaSCALE, an adaptive scaling procedure that dynamically adjusts the percentile threshold based on a sample's estimated OOD likelihood. This estimation leverages our key observation: OOD samples exhibit significantly more pronounced activation shifts at high-magnitude activations under minor perturbation compared to ID samples. AdaSCALE enables stronger scaling for likely ID samples and weaker scaling for likely OOD samples, yielding highly separable energy scores. Our approach achieves state-of-the-art OOD detection performance, outperforming the latest rival OptFS by 14.94 on near-OOD and 21.67 on far-OOD datasets in the average FPR@95 metric on the ImageNet-1k benchmark across eight diverse architectures. The code is available at: this https URL
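A hedged sketch of the idea in pure Python: the linear percentile mapping (`p_lo`, `p_hi`) and the SCALE-style `exp(total / kept)` factor below are illustrative stand-ins I chose, not the paper's exact formulas; only the qualitative behavior from the abstract is reproduced, namely that a likely-ID sample receives stronger scaling than a likely-OOD sample.

```python
import math

def percentile(xs, p):
    """Nearest-rank percentile of a list, p in [0, 100]."""
    s = sorted(xs)
    k = min(len(s) - 1, max(0, round(p / 100.0 * (len(s) - 1))))
    return s[k]

def adaptive_scale(acts, ood_likelihood, p_lo=65.0, p_hi=95.0):
    """Illustrative adaptive scaling: a likely-ID sample (low ood_likelihood)
    gets a higher percentile threshold, so fewer activations sit above it and
    the exp(total / kept) factor grows, i.e. stronger scaling."""
    p = p_lo + (1.0 - ood_likelihood) * (p_hi - p_lo)
    thr = percentile(acts, p)
    kept = sum(a for a in acts if a > thr)
    total = sum(acts)
    factor = math.exp(total / kept) if kept > 0 else 1.0
    return [a * factor for a in acts]

acts = [0.1, 0.2, 0.5, 1.0, 3.0, 0.05, 0.3, 2.0]
likely_id = adaptive_scale(acts, ood_likelihood=0.1)
likely_ood = adaptive_scale(acts, ood_likelihood=0.9)
assert max(likely_id) > max(likely_ood)   # stronger scaling for likely-ID
```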
zh
[CV-155] Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
【速读】: This paper addresses the problem that conventional attention-score-based pruning methods for vision-language models (VLMs) ignore key factors such as spatial position and token similarity. To this end, it proposes AdaptPrune, a novel training-free, plug-and-play pruning method. The key is to extend conventional attention-based pruning by incorporating spatial distance and token similarity together with an adaptive non-maximum suppression (NMS) mechanism. By jointly considering attention, spatial information, and similarity, the method provides a comprehensive assessment of token importance and substantially refines pruning decisions. Experimental results show that AdaptPrune outperforms existing methods across various VLMs and benchmark datasets.
链接: https://arxiv.org/abs/2503.08019
作者: Bozhi Luan,Wengang Zhou,Hao Feng,Zhe Wang,Xiaosong Li,Houqiang Li
机构: University of Science and Technology of China (中国科学技术大学); Huawei Technologies (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model’s image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at this https URL.
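A toy sketch of the combination described above, greedy NMS-style selection that keeps high-attention tokens while suppressing spatially close ones, so the kept set is both salient and spread out. The token-similarity term and the adaptive radius of the actual method are omitted; the grid example and the fixed `radius` are illustrative assumptions.

```python
def prune_tokens(scores, positions, keep, radius=1.0):
    """Greedy NMS-style pruning: repeatedly keep the highest-scoring
    remaining token and suppress tokens within `radius` of it."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    kept, suppressed = [], set()
    for i in order:
        if i in suppressed:
            continue
        kept.append(i)
        if len(kept) == keep:
            break
        xi, yi = positions[i]
        for j in order:
            if j != i and j not in suppressed:
                xj, yj = positions[j]
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                    suppressed.add(j)
    return sorted(kept)

# 2x2 patch grid: token 1 scores almost as high as token 0 but is adjacent
# to it, so it is suppressed and the distant token 3 is kept instead.
scores = [0.9, 0.8, 0.1, 0.7]
positions = [(0, 0), (1, 0), (0, 1), (1, 1)]
assert prune_tokens(scores, positions, keep=2) == [0, 3]
```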
zh
[CV-156] Partial differential equation system for binarization of degraded document images
【速读】: This paper addresses the binarization of degraded text images. The key to the solution is a new weakly coupled partial differential equation (PDE) system built on two equations: the first estimates the background component, combining a diffusion term with a fidelity term; the second estimates the foreground component and includes diffusion, fidelity, and binarization source terms. The final binarization result is obtained by applying a hard projection to the estimated foreground component. Experimental results show that the model offers clear advantages in handling degraded text images.
链接: https://arxiv.org/abs/2503.08017
作者: Youjin Liu,Yu Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Dynamical Systems (math.DS)
备注: 30 pages, 13 figures, 2 tables
点击查看摘要
Abstract:In recent years, partial differential equation (PDE) systems have been successfully applied to the binarization of text images, achieving promising results. Inspired by the DH model and incorporating a novel image modeling approach, this study proposes a new weakly coupled PDE system for degraded text image binarization. In this system, the first equation is designed to estimate the background component, incorporating both diffusion and fidelity terms. The second equation estimates the foreground component and includes diffusion, fidelity, and binarization source terms. The final binarization result is obtained by applying a hard projection to the estimated foreground component. Experimental results on 86 degraded text images demonstrate that the proposed model exhibits significant advantages in handling degraded text images.
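A 1D toy analogue of the background equation (diffusion plus fidelity), with the foreground then obtained by a hard projection of the background-minus-input residual. The paper's actual system additionally evolves a second, coupled foreground PDE with its own binarization source term, which is not reproduced here; the step size, fidelity weight, and signal values below are illustrative assumptions.

```python
def estimate_background(f, steps=300, dt=0.2, lam=0.1):
    """Explicit relaxation of b_t = b_xx + lam * (f - b): the diffusion
    term smooths the estimate across the ink stroke while the fidelity
    term keeps it close to the observed image f (1D toy analogue)."""
    b = list(f)
    for _ in range(steps):
        nb = b[:]
        for i in range(1, len(f) - 1):
            lap = b[i - 1] - 2 * b[i] + b[i + 1]
            nb[i] = b[i] + dt * (lap + lam * (f[i] - b[i]))
        b = nb
    return b

# Bright page (0.8) with a dark ink stroke: diffusion fills in the dip, so
# a hard projection of the background-minus-input residual recovers the ink.
f = [0.8, 0.8, 0.8, 0.3, 0.2, 0.3, 0.8, 0.8, 0.8]
b = estimate_background(f)
fg = [1 if b[i] - f[i] > 0.2 else 0 for i in range(len(f))]
assert fg[4] == 1 and fg[0] == 0 and fg[8] == 0
```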
zh
[CV-157] SGNetPose: Stepwise Goal-Driven Networks with Pose Information for Trajectory Prediction in Autonomous Driving
【速读】: This paper aims to improve pedestrian trajectory prediction in autonomous driving systems, enhancing safety and supporting decision-making: accurate trajectory prediction helps prevent collisions, anticipate crossing intent, and improve overall system efficiency. The key solution is the SGNetPose+ model, an enhancement of the SGNet architecture that integrates skeleton information or body-segment angles with bounding boxes to predict pedestrian trajectories from video data and avoid hazards. Specifically, skeleton information is extracted with a pose estimation model, and joint angles are computed from the extracted keypoints. The study also applies temporal data augmentation by horizontally flipping video frames to enlarge the dataset and improve performance. Experiments show that SGNetPose+ achieves state-of-the-art results on the JAAD and PIE datasets, outperforming the original SGNet model.
链接: https://arxiv.org/abs/2503.08016
作者: Akshat Ghiya,Ali K. AlShami,Jugal Kalita
机构: University of Colorado Colorado Springs (科罗拉多大学科罗拉多泉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting pedestrian trajectories is essential for autonomous driving systems, as it significantly enhances safety and supports informed decision-making. Accurate predictions enable the prevention of collisions, anticipation of crossing intent, and improved overall system efficiency. In this study, we present SGNetPose+, an enhancement of the SGNet architecture designed to integrate skeleton information or body segment angles with bounding boxes to predict pedestrian trajectories from video data to avoid hazards in autonomous driving. Skeleton information was extracted using a pose estimation model, and joint angles were computed based on the extracted joint data. We also apply temporal data augmentation by horizontally flipping video frames to increase the dataset size and improve performance. Our approach achieves state-of-the-art results on the JAAD and PIE datasets using pose data with the bounding boxes, outperforming the SGNet model. Code is available on Github: SGNetPose+.
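The joint angles fed to the model can be computed from pose keypoints with elementary geometry, and the horizontal-flip augmentation carries over to keypoints directly. A minimal sketch; the keypoint layout and the flip helper are illustrative assumptions, not the paper's exact preprocessing.

```python
import math

def joint_angle(a, b, c):
    """Interior angle at joint b (degrees) formed by segments b->a and b->c,
    e.g. the knee angle from hip, knee, and ankle keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    cosang = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cosang))

def hflip_keypoints(kps, width):
    """Augmentation helper: mirror keypoints when a frame of the given
    pixel width is horizontally flipped (x -> width - 1 - x)."""
    return [(width - 1 - x, y) for (x, y) in kps]

hip, knee, ankle = (0.0, 0.0), (0.0, 1.0), (1.0, 1.0)
assert abs(joint_angle(hip, knee, ankle) - 90.0) < 1e-9
# Joint angles are invariant under the horizontal flip used for augmentation.
flipped = hflip_keypoints([hip, knee, ankle], width=100)
assert abs(joint_angle(*flipped) - 90.0) < 1e-9
```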
zh
[CV-158] Exploring Bias in over 100 Text-to-Image Generative Models ICLR2025
【速读】: This paper studies how bias in text-to-image generative models has evolved over time, particularly as models have become increasingly available through open platforms such as Hugging Face. While these platforms democratize AI, they also accelerate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment therefore requires robust evaluation frameworks and quantifiable bias metrics. The key to the solution is a systematic evaluation methodology that analyzes more than 100 models along three core dimensions, distribution bias, generative hallucination, and generative miss-rate, revealing how bias patterns evolve over time and across generative tasks. The findings show that artistic and style-transfer models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, the paper contributes a large-scale evaluation corpus to inform bias research and mitigation strategies and to foster more responsible AI development.
链接: https://arxiv.org/abs/2503.08012
作者: Jordan Vice,Naveed Akhtar,Richard Hartley,Ajmal Mian
机构: University of Western Australia (西澳大利亚大学); University of Melbourne (墨尔本大学); Australian National University (澳大利亚国立大学); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICLR 2025 Workshop on Open Science for Foundation Models (SCI-FM)
点击查看摘要
Abstract:We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models
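The abstract names distribution bias as one of the three metrics without giving its formula; one common way to quantify it is the KL divergence of the observed attribute distribution from a uniform reference, sketched below as an assumption (the category names and counts are invented for illustration).

```python
import math

def distribution_bias(counts):
    """KL divergence from the observed attribute distribution to the
    uniform reference: 0 means perfectly balanced generations, larger
    values mean the model concentrates on a few attribute values."""
    total = sum(counts)
    k = len(counts)
    p = [c / total for c in counts]
    return sum(pi * math.log(pi * k) for pi in p if pi > 0)

balanced = [25, 25, 25, 25]   # e.g. equal counts over 4 demographic groups
skewed = [85, 5, 5, 5]        # the model almost always generates one group
assert distribution_bias(balanced) < 1e-12
assert distribution_bias(skewed) > distribution_bias(balanced)
```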
zh
[CV-159] SKALD: Learning-Based Shot Assembly for Coherent Multi-Shot Video Creation
【速读】: This paper addresses multi-shot video assembly: constructing coherent video sequences from candidate shots with minimal reliance on text. The key to the solution is the Learned Clip Assembly (LCA) score, a learning-based metric that quantifies the temporal and semantic relationships between shots to measure narrative coherence. To cope with the exponential complexity of combining multiple shots, an efficient beam-search algorithm guided by the LCA score is used. To train the model effectively with limited human annotations, two tasks are designed for the LCA encoder, Shot Coherence Learning and Feature Regression, using contrastive learning and feature regression respectively. Finally, two variants are proposed: a base SKALD model that relies solely on visual coherence, and SKALD-text, which integrates auxiliary text information when available.
链接: https://arxiv.org/abs/2503.08010
作者: Chen Yi Lu,Md Mehrab Tanjim,Ishita Dasgupta,Somdeb Sarkhel,Gang Wu,Saayan Mitra,Somali Chaterji
机构: Purdue University; Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present SKALD, a multi-shot video assembly method that constructs coherent video sequences from candidate shots with minimal reliance on text. Central to our approach is the Learned Clip Assembly (LCA) score, a learning-based metric that measures temporal and semantic relationships between shots to quantify narrative coherence. We tackle the exponential complexity of combining multiple shots with an efficient beam-search algorithm guided by the LCA score. To train our model effectively with limited human annotations, we propose two tasks for the LCA encoder: Shot Coherence Learning, which uses contrastive learning to distinguish coherent and incoherent sequences, and Feature Regression, which converts these learned representations into a real-valued coherence score. We develop two variants: a base SKALD model that relies solely on visual coherence and SKALD-text, which integrates auxiliary text information when available. Experiments on the VSPD and our curated MSV3C datasets show that SKALD achieves an improvement of up to 48.6% in IoU and a 43% speedup over the state-of-the-art methods. A user study further validates our approach, with 45% of participants favoring SKALD-assembled videos, compared to 22% preferring text-based assembly methods.
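The beam search over shot orderings can be sketched directly. The pairwise score matrix below is a stand-in for the learned LCA score (which in the paper scores whole partial assemblies, not just consecutive pairs), and the exhaustive check is only there to validate the toy example.

```python
from itertools import permutations

def coherence(seq, score):
    """Total pairwise coherence of consecutive shots: a stand-in for
    the learned LCA score of an assembled sequence."""
    return sum(score[a][b] for a, b in zip(seq, seq[1:]))

def beam_assemble(n_shots, score, beam=3):
    """Assemble all shots by beam search: extend each partial sequence
    with every unused shot and keep the `beam` best partials per step."""
    beams = [((i,), 0.0) for i in range(n_shots)]
    for _ in range(n_shots - 1):
        cand = []
        for seq, s in beams:
            for j in range(n_shots):
                if j not in seq:
                    cand.append((seq + (j,), s + score[seq[-1]][j]))
        cand.sort(key=lambda x: -x[1])
        beams = cand[:beam]
    return beams[0]

# Toy pairwise coherence matrix for 4 candidate shots (cyclic structure).
score = [[0, 9, 1, 1],
         [1, 0, 9, 1],
         [1, 1, 0, 9],
         [9, 1, 1, 0]]
seq, s = beam_assemble(4, score, beam=3)
best = max(permutations(range(4)), key=lambda p: coherence(p, score))
assert coherence(seq, score) == coherence(best, score)  # beam finds the optimum here
```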
zh
[CV-160] A Survey on Wi-Fi Sensing Generalizability: Taxonomy Techniques Datasets and Future Research Prospects
【速读】:该论文旨在解决 Wi-Fi 感知(Wi-Fi Sensing)技术在实际应用中因环境变化导致信号特征不一致而影响动作识别鲁棒性的问题。论文的关键在于系统性地回顾和分析了超过 200 篇相关研究,沿着整个感知管道(设备部署、信号处理、特征学习和模型部署)探讨了缓解环境变异性不利影响的技术方法。此外,论文还综述了开源数据集(如 Widar3.0、XRF55 和 XRFv2),强调其多模态融合与跨模态任务的独特特性及适用性,并展望了多模态方法和大型语言模型集成等新兴研究方向,以推动该领域的进一步发展。
链接: https://arxiv.org/abs/2503.08008
作者: Fei Wang,Tingting Zhang,Bintong Zhao,Libao Xing,Tiantian Wang,Han Ding,Tony Xiao Han
机构: School of Software Engineering, Xi’an Jiaotong University(XJTU 西安交通大学软件学院); School of Computer Science and Technology, Xi’an Jiaotong University(XJTU 西安交通大学计算机科学与技术学院); Wireless Technology Laboratory, Huawei Technologies Co. Ltd.(华为无线技术实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 318 references
点击查看摘要
Abstract:Wi-Fi sensing has emerged as a transformative technology that leverages ubiquitous wireless signals to enable a variety of applications ranging from activity and gesture recognition to indoor localization and health monitoring. However, the inherent dependency of Wi-Fi signals on environmental conditions introduces significant generalization challenges: variations in surroundings, human positions, and orientations often lead to inconsistent signal features, impeding robust action recognition. In this survey, we review over 200 studies on Wi-Fi sensing generalization, categorizing them along the entire sensing pipeline: device deployment, signal processing, feature learning, and model deployment. We systematically analyze state-of-the-art techniques, which are employed to mitigate the adverse effects of environmental variability. Moreover, we provide a comprehensive overview of open-source datasets such as Widar3.0, XRF55, and XRFv2, highlighting their unique characteristics and applicability for multimodal fusion and cross-modal tasks. Finally, we discuss emerging research directions, such as multimodal approaches and the integration of large language models, to inspire future advancements in this rapidly evolving field. Our survey aims to serve as a valuable resource for researchers, offering insights into current methodologies, available datasets, and promising avenues for further investigation.
zh
[CV-161] CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction
【速读】:该论文旨在解决单视图图像三维物体重建中的挑战,特别是在利用基于扩散模型的二维图像生成多视图图像以供大型重建模型(LRMs)提取三维内容时存在的问题。主要挑战包括:二维扩散模型难以生成具有强多视图一致性的密集图像,而LRMs在三维重建过程中往往会放大这些不一致性。为实现高质量且高效的三维重建,论文提出的关键解决方案是引入CDI3D框架,该框架通过前馈方式实现高效高质量的图像到三维生成,并包含视图插值功能。具体而言,论文设计了一个密集视图插值(DVI)模块,用于合成二维扩散模型生成的主要视图之间的插值图像,从而增强输入视图的一致性。此外,采用倾斜相机姿态轨迹捕获不同高度和视角的视图,并结合三平面网格重建策略从插值和原始视图中提取鲁棒特征,最终生成高质量三维网格。实验结果表明,该方法显著优于现有技术,在多个基准测试中生成了纹理保真度和几何精度更高的三维内容。
链接: https://arxiv.org/abs/2503.08005
作者: Zhiyuan Wu,Xibin Song,Senbo Wang,Weizhe Liu,Jiayu Yang,Ziang Cheng,Shenzhou Chen,Taizhang Shang,Weixuan Sun,Shan Luo,Pan Ji
机构: Department of Engineering, King’s College London (国王学院伦敦校区); Tencent XR Vision Labs (腾讯XR视觉实验室), Shanghai (上海), China (中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D object reconstruction from single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.
zh
[CV-162] Efficient Dataset Distillation through Low-Rank Space Sampling
【速读】:该论文旨在解决深度学习中因冗余信息导致模型泛化能力下降及计算负担增加的问题。现有基于数据蒸馏(Dataset Distillation, DD)的方法通常将图像视为独立实体生成合成图像,忽略了数据间的共同特征。为应对这一挑战,论文提出了一种基于匹配训练轨迹与低秩空间采样的数据蒸馏方法(MTT-LSS)。其关键是利用低秩近似捕获原始数据的多个低维流形子空间,并通过这些子空间中的基向量和共享维度映射来表示合成数据,从而在降低单个数据点生成成本的同时有效减少信息冗余。实验结果显示,该方法在CIFAR-10、CIFAR-100和SVHN数据集上平均优于基准方法9.9%。
链接: https://arxiv.org/abs/2503.07998
作者: Hangyang Kong,Wenbo Zhou,Xuxiang He,Xiaotong Tu,Xinghao Ding
机构: Key Laboratory of Multimedia Trusted Perception and Efficient Computing (多媒体可信感知与高效计算重点实验室), Ministry of Education of China (中华人民共和国教育部), Xiamen University (厦门大学); Institute of Artificial Intelligence (人工智能研究所), Xiamen University (厦门大学); School of Informatics (信息学院), Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Huge amount of data is the key of the success of deep learning, however, redundant information impairs the generalization ability of the model and increases the burden of calculation. Dataset Distillation (DD) compresses the original dataset into a smaller but representative subset for high-quality data and efficient training strategies. Existing works for DD generate synthetic images by treating each image as an independent entity, thereby overlooking the common features among data. This paper proposes a dataset distillation method based on Matching Training Trajectories with Low-rank Space Sampling(MTT-LSS), which uses low-rank approximations to capture multiple low-dimensional manifold subspaces of the original data. The synthetic data is represented by basis vectors and shared dimension mappers from these subspaces, reducing the cost of generating individual data points while effectively minimizing information redundancy. The proposed method is tested on CIFAR-10, CIFAR-100, and SVHN datasets, and outperforms the baseline methods by an average of 9.9%.
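摘要中"用基向量与共享映射表示合成数据"的低秩思想, 可用截断SVD做一个最小数值示意(随机构造近似低秩的数据, 并非论文 MTT-LSS 的实现):

```python
import numpy as np

# 示意: 截断SVD得到低秩子空间, 用"共享基向量 + 每样本系数"表示数据
rng = np.random.default_rng(0)
n, d, r = 100, 1024, 8            # 100个样本, 每个1024维, 取秩为8

# 构造近似低秩的数据: 真实秩为8, 叠加少量噪声
data = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
data += 0.01 * rng.standard_normal((n, d))

U, S, Vt = np.linalg.svd(data, full_matrices=False)
basis = Vt[:r]                    # 所有样本共享的基向量(子空间)
coeffs = data @ basis.T           # 每个样本只需存 r 维系数

recon = coeffs @ basis
rel_err = np.linalg.norm(data - recon) / np.linalg.norm(data)
print(f"相对重构误差: {rel_err:.4f}")

# 存储量对比: 原始需 n*d 个数, 低秩表示只需 r*d + n*r 个数
full_cost, lowrank_cost = n * d, r * d + n * r
print(full_cost, lowrank_cost)    # 102400 8992
```

低秩表示在大幅压缩存储的同时保留了主要信息, 这正是摘要中"降低单个数据点生成成本、减少信息冗余"的数学来源。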
zh
[CV-163] DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation
【速读】:该论文旨在解决精确全景分割中实例标注成本高、人工标注引入偏差以及现有方法在边界处理上的局限性等问题。针对无监督实例分割(UIS)中相邻实例合并和单实例碎片化的问题,以及弱监督全景分割(WPS)中稀疏标注昂贵且易引入人为偏差的挑战,论文提出了一种完全无需标注的方法——DiffEGG(基于扩散驱动的边缘生成)。其关键在于利用预训练扩散模型提取实例感知特征,并生成精确的实例边缘图。与基于DINO的UIS方法不同,扩散模型天然具备捕捉细粒度实例感知特征的能力,从而实现更精准的边界划分。此外,通过引入RIP(一种任务无关的后处理技术),DiffEGG能够无缝集成到多种分割框架中,进一步提升性能。实验结果表明,DiffEGG不仅在UIS任务上提升了平均精度(+4.4 AP),还实现了无需实例标注的WPS(+1.7 PQ),显著降低了对人工干预的依赖。
链接: https://arxiv.org/abs/2503.07982
作者: Sanghyun Jo,Ziseok Lee,Wooyeol Lee,Kyungsu Kim
机构: OGQ (OGQ); Department of Biomedical Sciences, Seoul National University (首尔国立大学); Department of Computer Science and Engineering, Seoul National University (首尔国立大学); School of Transdisciplinary Innovations and Interdisciplinary Program in Artificial Intelligence, Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG’s edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average +4.4 AP improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by +1.7 PQ. These results demonstrate that DiffEGG’s edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at this https URL.
zh
[CV-164] Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
【速读】:该论文致力于解决现有基于提示(prompt)的类增量学习(Class-Incremental Learning, CIL)方法因提示池查询和输入序列长度增加而导致的显著计算开销问题。论文提出了一种新颖的基于提示的方法,其关键在于训练一组在整个任务中共享的提示,并通过将提示直接添加到分类 token(CLS token)的注意力计算中,而非将其拼接到输入序列中。这种设计不仅大幅降低了计算复杂度,包括推理成本和可训练参数的数量,还避免了针对不同下游任务优化提示长度的需求,从而提供了一种高效且强大的无回放类增量学习解决方案。实验结果验证了该方法的有效性,并展示了其在多种CIL基准数据集以及通用识别基准上的优越性能。
链接: https://arxiv.org/abs/2503.07979
作者: Haoran Chen,Ping Wang,Zihan Zhou,Xu Zhang,Zuxuan Wu,Yu-Gang Jiang
机构: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University (上海关键智能信息处理实验室, 复旦大学计算机学院); Shanghai Collaborative Innovation Center of Intelligent Visual Computing (上海智能视觉计算协同创新中心); APUS AI Lab (APUS人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token’s attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity-both in terms of inference costs and the number of trainable parameters-but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising candidate for a general parameter-efficient fine-tuning approach.
zh
[CV-165] 7ABAW-Compound Expression Recognition via Curriculum Learning ECCV
【速读】:该论文旨在解决复合表情识别(Compound Emotion Recognition, CE)任务中因标注数据有限及复合表情微妙变化所带来的挑战。论文提出了一种基于课程学习(Curriculum Learning)的框架,通过先在单表情数据集上预训练模型,再逐步引入多表情数据,确保模型首先掌握基础表情特征,随后适应复合表情的复杂性。关键在于将单表情预训练、动态复合表情生成(利用CutMix和Mixup技术)以及渐进式多表情集成相结合,从而有效提升复合表情识别性能。最终,该方法在ABAW竞赛的复合表情赛道中以F-score 0.6063的成绩获得最佳性能。
链接: https://arxiv.org/abs/2503.07969
作者: Chen Liu,Feng Qiu,Wei Zhang,Lincheng Li,Dadong Wang,Xin Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCVWorkshop as the report of the first place in 7th ABAW Track2 Competition
点击查看摘要
Abstract:With the advent of deep learning, expression recognition has made significant advancements. However, due to the limited availability of annotated compound expression datasets and the subtle variations of compound expressions, Compound Emotion Recognition (CE) still holds considerable potential for exploration. To advance this task, the 7th Affective Behavior Analysis in-the-wild (ABAW) competition introduces the Compound Expression Challenge based on C-EXPR-DB, a limited dataset without labels. In this paper, we present a curriculum learning-based framework that initially trains the model on single-expression tasks and subsequently incorporates multi-expression data. This design ensures that our model first masters the fundamental features of basic expressions before being exposed to the complexities of compound emotions. Specifically, our designs can be summarized as follows: 1) Single-Expression Pre-training: The model is first trained on datasets containing single expressions to learn the foundational facial features associated with basic emotions. 2) Dynamic Compound Expression Generation: Given the scarcity of annotated compound expression datasets, we employ CutMix and Mixup techniques on the original single-expression images to create hybrid images exhibiting characteristics of multiple basic emotions. 3) Incremental Multi-Expression Integration: After performing well on single-expression tasks, the model is progressively exposed to multi-expression data, allowing the model to adapt to the complexity and variability of compound expressions. The official results indicate that our method achieves the best performance in this competition track with an F-score of 0.6063. Our code is released at this https URL.
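其中第2)步"用CutMix与Mixup从单表情图像合成复合表情样本"可用如下草图理解(随机数组代替真实人脸图像, 裁剪尺寸与标签设定均为演示用假设, 非论文实现):

```python
import numpy as np

rng = np.random.default_rng(42)

def mixup(img_a, label_a, img_b, label_b, alpha=0.4):
    """Mixup: 按 Beta 分布采样的系数线性混合两张单表情图像及其 one-hot 标签,
    得到同时含两种基本表情成分的"复合表情"样本(alpha 为假设超参)。"""
    lam = rng.beta(alpha, alpha)
    img = lam * img_a + (1 - lam) * img_b
    label = lam * label_a + (1 - lam) * label_b
    return img, label

def cutmix(img_a, label_a, img_b, label_b):
    """CutMix: 把图像B的一个随机矩形区域粘贴到图像A上, 标签按面积比例混合。"""
    h, w = img_a.shape[:2]
    cut_h, cut_w = h // 2, w // 2                # 简化: 固定裁剪一半边长
    y = int(rng.integers(0, h - cut_h + 1))
    x = int(rng.integers(0, w - cut_w + 1))
    img = img_a.copy()
    img[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    lam = 1 - (cut_h * cut_w) / (h * w)          # 图像A保留的面积比例
    label = lam * label_a + (1 - lam) * label_b
    return img, label

# 两张"单表情"图像(随机数据代替)与 one-hot 标签: 0=高兴, 1=惊讶
a = rng.random((64, 64, 3)); la = np.array([1.0, 0.0])
b = rng.random((64, 64, 3)); lb = np.array([0.0, 1.0])
mix_img, mix_label = mixup(a, la, b, lb)
cut_img, cut_label = cutmix(a, la, b, lb)
print(mix_label, cut_label)   # 两个软标签均同时含有两种表情的成分
```

软标签使模型在只有单表情标注的前提下, 学到复合表情的"成分"结构。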
zh
[CV-166] Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation
【速读】:本文旨在解决跨域迁移学习中表征相似性演化这一关键问题,特别关注预训练模型在适应医学影像任务时尽管存在显著领域差距仍保持有效性的原因。论文通过严格定义问题,量化并分析微调过程中表征相似性轨迹,研究范围涵盖医学图像分析及更广泛的跨域适应场景。关键解决方案在于提出一种表征相似空间框架,该框架不仅揭示了知识迁移的动力学机制,还发现了三个重要发现:存在既能保持任务准确性又能保留与预训练模型相似性的高性能模型;层间相似性度量与表征质量指标之间存在稳健的线性相关关系;以及有监督与自监督预训练范式的适应模式差异。这些成果深化了对神经网络适应过程的理解,并为优化预训练模型利用提供了实用启示。
链接: https://arxiv.org/abs/2503.07958
作者: Wenqiang Zu,Shenghao Xie,Hao Chen,Lei Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures
点击查看摘要
Abstract:This paper investigates the critical problem of representation similarity evolution during cross-domain transfer learning, with particular focus on understanding why pre-trained models maintain effectiveness when adapted to medical imaging tasks despite significant domain gaps. The study establishes a rigorous problem definition centered on quantifying and analyzing representation similarity trajectories throughout the fine-tuning process, while carefully delineating the scope to encompass both medical image analysis and broader cross-domain adaptation scenarios. Our empirical findings reveal three critical discoveries: the potential existence of high-performance models that preserve both task accuracy and representation similarity to their pre-trained origins; a robust linear correlation between layer-wise similarity metrics and representation quality indicators; and distinct adaptation patterns that differentiate supervised versus self-supervised pre-training paradigms. The proposed similarity space framework not only provides mechanistic insights into knowledge transfer dynamics but also raises fundamental questions about optimal utilization of pre-trained models. These results advance our understanding of neural network adaptation processes while offering practical implications for transfer learning strategies that extend beyond medical imaging applications. The code will be available once accepted.
zh
[CV-167] NeRF-VIO: Map-Based Visual-Inertial Odometry with Initialization Leveraging Neural Radiance Fields
【速读】:本文旨在解决在增强现实(AR)等情境感知应用中,基于先验地图的定位问题,特别是如何通过结合视觉与惯性测量单元(Visual-Inertial Odometry, VIO)算法以及神经辐射场(Neural Radiance Fields, NeRF)技术来缓解漂移现象并提高定位精度。论文的关键在于提出了一种名为NeRF-VIO的新算法,该算法通过使用多层感知机模型,并将损失函数重新定义为SE(3)上的测地距离,确保初始化模型在se(3)帧变换下的不变性。此外,通过在多状态约束卡尔曼滤波器(Multi-State Constraint Kalman Filter, MSCKF)框架内集成两阶段更新机制,该算法不仅利用车载摄像头捕捉到的图像信息,还结合了预训练NeRF模型渲染出的图像,从而实现了对状态的有效约束。实验结果表明,在真实世界AR数据集上,所提出的两阶段更新流程优于传统的MSCKF方法。
链接: https://arxiv.org/abs/2503.07952
作者: Yanyu Zhang,Dongming Wang,Jie Xu,Mengyuan Liu,Pengxiang Zhu,Wei Ren
机构: Department of Electrical and Computer Engineering, University of California, Riverside (加州大学河滨分校电气与计算机工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:A prior map serves as a foundational reference for localization in context-aware applications such as augmented reality (AR). Providing valuable contextual information about the environment, the prior map is a vital tool for mitigating drift. In this paper, we propose a map-based visual-inertial localization algorithm (NeRF-VIO) with initialization using neural radiance fields (NeRF). Our algorithm utilizes a multilayer perceptron model and redefines the loss function as the geodesic distance on SE(3), ensuring the invariance of the initialization model under a frame change within se(3). The evaluation demonstrates that our model outperforms existing NeRF-based initialization solutions in both accuracy and efficiency. By integrating a two-stage update mechanism within a multi-state constraint Kalman filter (MSCKF) framework, the state of NeRF-VIO is constrained by both captured images from an onboard camera and rendered images from a pre-trained NeRF model. The proposed algorithm is validated using a real-world AR dataset; the results indicate that our two-stage update pipeline outperforms MSCKF across all data sequences.
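摘要中"SE(3) 上的测地距离"可以用相对旋转角与相对平移范数的组合直观理解。下面取 sqrt(theta^2 + ||t||^2) 作为简化度量(仅为示意, 并非论文损失函数的精确定义):

```python
import numpy as np

def rot_z(theta):
    """绕 z 轴旋转 theta 弧度的旋转矩阵。"""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def se3_geodesic(Ra, ta, Rb, tb):
    """两个位姿 (R, t) 之间的简化测地距离:
    相对旋转的旋转角与相对平移范数按 sqrt(theta^2 + ||t||^2) 组合。"""
    R_rel = Ra.T @ Rb
    t_rel = Ra.T @ (tb - ta)
    # SO(3) 测地距离: 由迹恢复相对旋转角
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return np.sqrt(theta ** 2 + t_rel @ t_rel)

Ra, ta = rot_z(0.0), np.zeros(3)
Rb, tb = rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0])
d = se3_geodesic(Ra, ta, Rb, tb)
print(d)  # sqrt((pi/2)^2 + 1) ≈ 1.862
```

与逐元素比较两个变换矩阵相比, 这样的度量对坐标系选取更稳健, 这正是论文强调帧变换不变性的动机。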
zh
[CV-168] Text-RGBT Person Retrieval: Multilevel Global-Local Cross-Modal Alignment and A High-quality Benchmark
【速读】:该论文旨在解决传统文本-图像人物检索任务在复杂环境(如光照变化)下性能易受影响的问题。为应对这一挑战,论文提出了一种名为文本-RGBT人物检索的新任务,通过整合可见光和热成像模态的优势实现鲁棒的人物检索。方案的关键在于对齐文本与多模态视觉表征,但由于可见光和热成像模态之间的异构性可能干扰视觉与文本模态的对齐,论文提出了一个多层次全局-局部跨模态对齐网络(MGANet),充分挖掘模态特定和模态协作视觉特征与文本之间的关系,以解决上述问题。此外,为推动该领域的发展,作者创建了一个高质量的数据集RGBT-PEDES,包含来自不同年龄组和性别的1,822个身份及多种具有挑战性的场景数据,并提供了详细的文本标注,实验结果表明所提方法优于现有文本-图像人物检索方法。
链接: https://arxiv.org/abs/2503.07950
作者: Yifei Deng,Zhengyu Chen,Ziheng Xu,Chenglong Li,Jin Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The performance of the traditional text-image person retrieval task is easily affected by lighting variations due to imaging limitations of visible spectrum sensors. In this work, we design a novel task called text-RGBT person retrieval that integrates complementary benefits from thermal and visible modalities for robust person retrieval in challenging environments. Aligning text and multi-modal visual representations is the key issue in text-RGBT person retrieval, but the heterogeneity between visible and thermal modalities may interfere with the alignment of visual and text modalities. To handle this problem, we propose a Multi-level Global-local cross-modal Alignment Network (MGANet), which sufficiently mines the relationships between modality-specific and modality-collaborative visual features and the text, for text-RGBT person retrieval. To promote the research and development of this field, we create a high-quality text-RGBT person retrieval dataset, RGBT-PEDES. RGBT-PEDES contains 1,822 identities from different age groups and genders with 4,723 pairs of calibrated RGB and thermal images, and covers highly diverse scenes from both daytime and nighttime with various challenges such as occlusion, weak alignment and adverse lighting conditions. Additionally, we carefully annotate 7,987 fine-grained textual descriptions for all RGBT person image pairs. Extensive experiments on RGBT-PEDES demonstrate that our method outperforms existing text-image person retrieval methods. The code and dataset will be released upon acceptance.
zh
[CV-169] 7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
【速读】:该论文旨在解决动态场景中视点相关效果的实时渲染这一计算机图形学中的基础挑战。此前的方法虽在处理动态场景(4DGS)或单独处理视点相关效果(6DGS)方面取得进展,但尚未有方法能够在保持实时性能的同时统一这两种能力。论文提出的解决方案——7D Gaussian Splatting (7DGS),通过将场景元素表示为包含位置(3D)、时间(1D)和观察方向(3D)的七维高斯分布来实现这一点。其关键贡献是一种高效的条件切片机制,能够将7D高斯分布转换为视点和时间条件的3D高斯分布,同时与现有的3D Gaussian Splatting 流程兼容,并支持联合优化。实验表明,7DGS 在具有复杂视点相关效果的挑战性动态场景中实现了高达401 FPS的实时渲染速度,且在PSNR上比先前方法高出最多7.36 dB。
链接: https://arxiv.org/abs/2503.07946
作者: Zhongpai Gao,Benjamin Planche,Meng Zheng,Anwesa Choudhuri,Terrence Chen,Ziyan Wu
机构: United Imaging Intelligence (联影智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Real-time rendering of dynamic scenes with view-dependent effects remains a fundamental challenge in computer graphics. While recent advances in Gaussian Splatting have shown promising results separately handling dynamic scenes (4DGS) and view-dependent effects (6DGS), no existing method unifies these capabilities while maintaining real-time performance. We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Our key contribution is an efficient conditional slicing mechanism that transforms 7D Gaussians into view- and time-conditioned 3D Gaussians, maintaining compatibility with existing 3D Gaussian Splatting pipelines while enabling joint optimization. Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering (401 FPS) on challenging dynamic scenes with complex view-dependent effects. The project page is: this https URL.
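摘要中的"条件切片"(将7维高斯在时间与观察方向维度上取条件, 得到位置上的3维高斯)在数学上对应多元高斯的标准条件分布公式。下面用随机协方差做一个数值示意(仅说明原理, 非论文实现):

```python
import numpy as np

def condition_gaussian(mu, cov, idx_keep, idx_cond, values):
    """对多元高斯在 idx_cond 维度上取条件, 返回 idx_keep 维度的条件高斯。
    标准公式: mu1 + S12 S22^{-1} (x2 - mu2), S11 - S12 S22^{-1} S21。"""
    mu1, mu2 = mu[idx_keep], mu[idx_cond]
    S11 = cov[np.ix_(idx_keep, idx_keep)]
    S12 = cov[np.ix_(idx_keep, idx_cond)]
    S22 = cov[np.ix_(idx_cond, idx_cond)]
    gain = S12 @ np.linalg.inv(S22)
    mu_c = mu1 + gain @ (values - mu2)
    cov_c = S11 - gain @ S12.T
    return mu_c, cov_c

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 7))
cov7 = A @ A.T + 7 * np.eye(7)          # 随机正定的 7x7 协方差
mu7 = rng.standard_normal(7)

pos_idx = [0, 1, 2]                      # 位置维度 (3D)
cond_idx = [3, 4, 5, 6]                  # 时间 (1D) + 观察方向 (3D)
t_and_dir = np.array([0.5, 0.0, 0.0, 1.0])
mu3, cov3 = condition_gaussian(mu7, cov7, pos_idx, cond_idx, t_and_dir)
print(mu3.shape, cov3.shape)             # (3,) (3, 3)
```

条件化后的 3D 高斯(均值与协方差均随时间和视向变化)即可送入常规的 3D Gaussian Splatting 渲染流程, 这正是"兼容现有管线"的含义。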
zh
[CV-170] STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications
【速读】:该论文旨在解决自动化系统中时间与计算敏感场景下的异常检测问题,特别是在需要实时推理的环境中(如自动驾驶),现有方法在时序上下文方面的性能仍有显著提升空间。论文的关键在于提出了一种名为STEAD (Spatio-Temporal Efficient Anomaly Detection) 的新方法,其核心在于采用(2+1)D卷积((2+1)D Convolutions) 和 Performer线性注意力(Performer Linear Attention) 的设计,确保模型在保持高效计算的同时不牺牲检测性能。这一方案通过结合时空特征的有效建模,在保证实时性需求的同时提升了异常检测的精度与效率。
链接: https://arxiv.org/abs/2503.07942
作者: Andrew Gao,Jun Liu
机构: Department of Computer Engineering, San Jose State University (圣荷西州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. Therefore, this paper focuses on how to quickly and effectively detect various anomalies in the aforementioned systems, with the goal of making them safer and more effective. Many detection systems have been developed with great success under spatial contexts; however, there is still significant room for improvement when it comes to temporal context. While there is substantial work regarding this task, there is minimal work done regarding the efficiency of models and their ability to be applied to scenarios that require real-time inference, i.e., autonomous driving where anomalies need to be detected the moment they are within view. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is developed using (2+1)D Convolutions and Performer Linear Attention, which ensures computational efficiency without sacrificing performance. When tested on the UCF-Crime benchmark, our base model achieves an AUC of 91.34%, outperforming the previous state-of-the-art, and our fast version achieves an AUC of 88.87%, while having 99.70% less parameters and outperforming the previous state-of-the-art as well. The code and pretrained models are made publicly available at this https URL
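(2+1)D卷积把 t×d×d 的3D卷积核分解为空间(1×d×d)与时间(t×1×1)两步。下面用参数量对比的小例子说明其效率来源(中间通道数的取法为演示用简化假设, 实际实现常另行选择):

```python
# 示意: 对比同一组输入/输出通道下, 3D卷积与 (2+1)D 分解的参数量
def conv3d_params(c_in, c_out, t, d):
    """标准3D卷积: 每个输出通道有 c_in 个 t×d×d 的核。"""
    return c_in * c_out * t * d * d

def conv2plus1d_params(c_in, c_out, t, d, c_mid=None):
    """(2+1)D 分解: 先做 1×d×d 空间卷积到 c_mid 通道, 再做 t×1×1 时间卷积。
    此处简单取 c_mid = c_in(演示用假设)。"""
    c_mid = c_mid if c_mid is not None else c_in
    spatial = c_in * c_mid * 1 * d * d     # 空间分支: 1×d×d
    temporal = c_mid * c_out * t * 1 * 1   # 时间分支: t×1×1
    return spatial + temporal

p3d = conv3d_params(64, 64, t=3, d=3)
p21d = conv2plus1d_params(64, 64, t=3, d=3)
print(p3d, p21d)   # 110592 49152: (2+1)D 分解的参数量明显更少
```

参数量与乘加次数同步下降, 再配合线性注意力, 即是摘要中"计算高效且不牺牲性能"设计的直观依据。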
zh
[CV-171] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
【速读】:该论文旨在解决基于深度学习的点云配准方法在泛化能力上的局限性,具体表现为大多数现有方法在新环境中仍需重新训练或手动调整参数。论文识别出限制泛化能力的三个关键因素:(a) 对特定环境体素大小和搜索半径的依赖;(b) 基于学习的关键点检测器在域外场景中的鲁棒性不足;(c) 直接使用原始坐标导致的尺度不一致问题。为解决这些问题,论文提出了一种名为BUFFER-X的零样本点云配准管道,其关键方案包括:(a) 自适应确定体素大小/搜索半径;(b) 使用最远点采样绕过学习型检测器;(c) 利用分块尺度归一化实现一致的坐标边界。此外,通过多尺度分块描述子生成和跨尺度的分层内点搜索,进一步提升了不同场景下的鲁棒性。论文还构建了一个包含11个数据集的新基准,覆盖多种室内/室外场景及传感器模态,验证了BUFFER-X在无需先验信息或手动参数调整的情况下实现了显著的泛化性能。
链接: https://arxiv.org/abs/2503.07940
作者: Minkyun Seo,Hyungtae Lim,Kanghee Lee,Luca Carlone,Jaesik Park
机构: Seoul National University (首尔国立大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: 20 pages, 14 figures
点击查看摘要
Abstract:Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at this https URL.
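其中"用最远点采样(FPS)绕过学习型关键点检测器"是点云处理中的经典算法: 迭代选取与已选点集距离最远的点, 得到空间上分布均匀的采样。几行NumPy即可实现(示意版本, 非 BUFFER-X 源码):

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """最远点采样: 每轮选取与当前已选集合最小距离最大的点。"""
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]          # 随机选第一个点
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))             # 距离已选集合最远的点
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)

# 玩具点云: 两个相互靠近的点 + 两个远离的点
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [10, 0, 0], [0, 10, 0]])
idx = farthest_point_sampling(pts, k=3)
print(idx)   # 远离彼此的点(索引2、3)必定入选
```

与学习型检测器不同, FPS 不依赖训练分布, 因此天然具备跨域鲁棒性, 这正是零样本设定下选择它的原因。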
zh
[CV-172] STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision
【速读】:该论文致力于解决基于视觉的定位(Vision-based Localization)问题,特别是在动态环境中实现高精度地理坐标映射的挑战。传统方法通常依赖于密集的卫星图像数据库进行检索,而该研究提出了一种生物启发式的生成模型解决方案,将定位任务转化为生成任务,从而避免了对大规模卫星图像数据库的依赖。关键在于引入了两种顺序生成模型:VAE-RNN 和 VAE-Transformer,它们能够将第一人称视角(FPP)观测转换为全局地图视角(GMP)表示,并直接学习不同视角间的映射关系。其中,VAE-Transformer 在两个真实世界环境中的表现尤为突出,其定位精度显著优于其他方法,同时具备更低的计算资源需求和更快的推理速度,展示了基于生物空间导航机制的模型在复杂动态环境中的高效性和精确性。
链接: https://arxiv.org/abs/2503.07939
作者: Hin Wai Lui,Jeffrey L. Krichmar
机构: University of California (加州大学); Department of Computer Science (计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures
点击查看摘要
Abstract:This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.
zh
[CV-173] CAD-VAE: Leverag ing Correlation-Aware Latents for Comprehensive Fair Disentanglement
【速读】:该论文旨在解决深度生成模型在表示学习中的偏见与公平性问题,这些模型可能继承或放大敏感属性与预测特征之间的偏差。由于目标因素与敏感因素通常存在自然关联,严格独立解耦(strict independence in disentanglement)往往不现实。为此,论文提出了一种名为CAD-VAE(Correlation-Aware Disentangled Variational Autoencoder)的解决方案。其关键是引入一个相关潜在码(correlated latent code),用于捕获目标属性与敏感属性之间的共享信息,并通过直接最小化目标代码与敏感代码之间的条件互信息(conditional mutual information),有效分离重叠的因素,而无需额外领域知识。此外,通过一种基于相关性驱动的优化策略,进一步精炼相关潜在码,高效提取关键的相关特征并消除冗余,从而实现更公平的表示学习、更真实的反事实生成(counterfactual generation)以及改进的公平感知图像编辑能力。
链接: https://arxiv.org/abs/2503.07938
作者: Chenrui Ma,Rongchang Zhao,Xi Xiao,Hongyang Xie,Tianyang Wang,Xiao Wang,Hao Zhang,Yanning Shen
机构: University of California, Irvine (加州大学欧文分校); Central South University (中南大学); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Oak Ridge National Laboratory (橡树岭国家实验室); University of the Chinese Academy of Sciences (中国科学院大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
备注:
点击查看摘要
Abstract:While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.
zh
[CV-174] From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans
【速读】:该论文旨在解决三维CT扫描中散在分布且低对比度的淋巴结(Lymph Node, LN)检测难题,特别是在癌症分期和治疗规划等临床任务中的重要性。传统方法依赖于二维或基于切片的半三维检测器,但这些方法难以显式建模淋巴结作为三维物体的切片间一致性,并需要复杂的后处理步骤来生成最终的三维实例,且参数调优过程繁琐。
论文的关键创新在于将三维淋巴结检测重新定义为跟踪任务,并提出了名为LN-Tracker的新型淋巴结跟踪Transformer模型。该模型通过解耦Transformer解码器的查询为跟踪组和检测组,实现了端到端的联合检测与三维实例关联。其中,跟踪查询沿CT扫描的z轴自回归地追踪先前检测到的淋巴结实例,而新设计的Transformer解码器结合掩码注意力模块,既保证了跟踪查询与当前切片上下文的一致性,又保留了检测查询在当前切片上的高精度。此外,引入了切片间相似性损失以促进淋巴结在相邻切片间的连贯关联。实验结果表明,LN-Tracker在多个数据集上的平均灵敏度至少提升了2.7%,并在公开的肺结节和前列腺肿瘤检测任务中验证了其泛化能力。
链接: https://arxiv.org/abs/2503.07933
作者: Qinji Yu,Yirui Wang,Ke Yan,Dandan Zheng,Dashan Ai,Dazhou Guo,Zhanghexuan Ji,Yanzhou Su,Yun Bian,Na Shen,Xiaowei Ding,Le Lu,Xianghua Ye,Dakai Jin
机构: DAMO Academy, Alibaba Group (阿里巴巴集团 DAMO 学院); Shanghai Jiao Tong University (上海交通大学); The First Affiliated Hospital of Zhejiang University (浙江大学第一附属医院); Hupan Lab (湖畔实验室), 310023, Hangzhou, China; Fudan University Shanghai Cancer Center (复旦大学附属肿瘤医院); Changhai Hospital (长海医院); Zhongshan Hospital, Fudan University (复旦大学附属中山医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report (11 pages plus supplementary)
点击查看摘要
Abstract:Lymph node (LN) assessment is an essential task in the routine radiology workflow, providing valuable insights for cancer staging, treatment planning and beyond. Identifying scatteredly-distributed and low-contrast LNs in 3D CT scans is highly challenging, even for experienced clinicians. Previous lesion and LN detection methods demonstrate effectiveness of 2.5D approaches (i.e, using 2D network with multi-slice inputs), leveraging pretrained 2D model weights and showing improved accuracy as compared to separate 2D or 3D detectors. However, slice-based 2.5D detectors do not explicitly model inter-slice consistency for LN as a 3D object, requiring heuristic post-merging steps to generate final 3D LN instances, which can involve tuning a set of parameters for each dataset. In this work, we formulate 3D LN detection as a tracking task and propose LN-Tracker, a novel LN tracking transformer, for joint end-to-end detection and 3D instance association. Built upon DETR-based detector, LN-Tracker decouples transformer decoder’s query into the track and detection groups, where the track query autoregressively follows previously tracked LN instances along the z-axis of a CT scan. We design a new transformer decoder with masked attention module to align track query’s content to the context of current slice, meanwhile preserving detection query’s high accuracy in current slice. An inter-slice similarity loss is introduced to encourage cohesive LN association between slices. Extensive evaluation on four lymph node datasets shows LN-Tracker’s superior performance, with at least 2.7% gain in average sensitivity when compared to other top 3D/2.5D detectors. Further validation on public lung nodule and prostate tumor detection tasks confirms the generalizability of LN-Tracker as it achieves top performance on both tasks. Datasets will be released upon acceptance.
zh
[CV-175] Learning Gentle Grasping Using Vision Sound and Touch
【速读】:该论文致力于解决在抓取易损物体(如水果)时如何实现稳定且轻柔抓取的问题。传统方法通常依赖最大抓取力,而该研究强调以最小必要力进行抓取的重要性。解决方案的关键在于提出了一种端到端的多模态学习框架,通过视觉、触觉及听觉信号联合预测未来抓取动作的稳定性和轻柔性。具体而言,论文利用音频信号作为抓取轻柔性的指示器,并从原始的视觉-触觉输入中训练模型,从而选择并执行最优抓取动作。实验结果表明,与仅基于视觉的方法相比,该多模态模型不仅提高了抓取性能(准确率提升3.27%),而且在真实环境中实现了更高的稳定且轻柔抓取成功率(高出17%)。此外,该方法无需触觉传感器标定或分析式力建模,大幅降低了工程复杂度。
链接: https://arxiv.org/abs/2503.07926
作者: Ken Nakahara,Roberto Calandra
机构: Learning, Adaptive Systems, and Robotics (LASR) Lab, TU Dresden (德累斯顿工业大学); Center for Tactile Internet with Human-in-the-Loop (CeTI)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages
点击查看摘要
Abstract:In our daily life, we often encounter objects that are fragile and can be damaged by excessive grasping force, such as fruits. For these objects, it is paramount to grasp gently – not using the maximum amount of force possible, but rather the minimum amount of force necessary. This paper proposes using visual, tactile, and auditory signals to learn to grasp and regrasp objects stably and gently. Specifically, we use audio signals as an indicator of gentleness during the grasping, and then train end-to-end an action-conditional model from raw visuo-tactile inputs that predicts both the stability and the gentleness of future grasping candidates, thus allowing the selection and execution of the most promising action. Experimental results on a multi-fingered hand over 1,500 grasping trials demonstrated that our model is useful for gentle grasping by validating the predictive performance (3.27% higher accuracy than the vision-only variant) and providing interpretations of their behavior. Finally, real-world experiments confirmed that the grasping performance with the trained multi-modal model outperformed other baselines (17% higher rate for stable and gentle grasps than vision-only). Our approach requires neither tactile sensor calibration nor analytical force modeling, drastically reducing the engineering effort to grasp fragile objects. Dataset and videos are available at this https URL.
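作为补充说明,下面用一个极简的 Python 片段示意"在多个抓取候选中选择预测稳定性与轻柔度综合得分最高的动作"这一选择逻辑;其中候选名称与打分数值均为假设示例,论文实际使用多模态网络从原始视觉-触觉输入预测这两项指标,此处仅以查表代替,并非论文的实现。

```python
# 假设的抓取候选及其(虚构的)稳定性 / 轻柔度预测值
candidates = {
    "force_low":  {"stability": 0.55, "gentleness": 0.95},
    "force_mid":  {"stability": 0.90, "gentleness": 0.80},
    "force_high": {"stability": 0.98, "gentleness": 0.30},
}

def best_action(preds, w_gentle=1.0):
    """选出预测得分 stability * gentleness**w_gentle 最高的候选动作。

    w_gentle 控制对轻柔度的重视程度:取 0 时退化为只看稳定性。
    """
    return max(preds, key=lambda a: preds[a]["stability"] * preds[a]["gentleness"] ** w_gentle)

print(best_action(candidates))                 # force_mid:兼顾稳定与轻柔
print(best_action(candidates, w_gentle=0.0))   # force_high:只看稳定性时偏向大力抓取
```

这也直观解释了摘要中"不用最大力、而用必要的最小力"的动机:若不把轻柔度纳入打分,选择会倒向最大抓取力。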
zh
[CV-176] Visual and Text Prompt Segmentation: A Novel Multi-Model Framework for Remote Sensing
【速读】:该论文旨在解决遥感图像像素级分割中的三个主要挑战:首先,Segment Anything Model (SAM) 在无明确提示约束的情况下易生成冗余掩膜,增加后处理复杂性;其次,CLIP 模型侧重于全局特征对齐,往往忽略遥感图像中重要的局部目标,导致多目标识别不准确或焦点偏移;第三,现有模型未在多尺度航拍视图上进行预训练,增加了检测失败的风险。为应对这些挑战,论文提出了一种创新的 VTPSeg 流程,其关键在于结合 Grounding DINO+ (GD+) 模块生成初始候选边界框,并通过 CLIP Filter++ (CLIP++) 模块利用视觉和文本提示优化和过滤无关目标边界框,从而确保仅关注相关对象。最终,经过精炼的边界框作为特定提示传递给 FastSAM 模型以实现精确分割。
链接: https://arxiv.org/abs/2503.07911
作者: Xing Zi,Kairui Jin,Xian Tao,Jun Li,Ali Braytee,Rajiv Ratn Shah,Mukesh Prasad
机构: University of Technology Sydney, Australia (悉尼科技大学, 澳大利亚); Chinese Academy of Sciences (中国科学院); Indraprastha Institute of Information Technology, Delhi (德里英德拉普拉斯塔信息技术学院)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Under Review - IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
点击查看摘要
Abstract:Pixel-level segmentation is essential in remote sensing, where foundational vision models like CLIP and Segment Anything Model (SAM) have demonstrated significant capabilities in zero-shot segmentation tasks. Despite their advances, challenges specific to remote sensing remain substantial. Firstly, SAM, without clear prompt constraints, often generates redundant masks, making post-processing more complex. Secondly, the CLIP model, mainly designed for global feature alignment in foundational models, often overlooks local objects crucial to remote sensing. This oversight leads to inaccurate recognition or misplaced focus in multi-target remote sensing imagery. Thirdly, neither model has been pre-trained on multi-scale aerial views, increasing the likelihood of detection failures. To tackle these challenges, we introduce the innovative VTPSeg pipeline, utilizing the strengths of Grounding DINO, CLIP, and SAM for enhanced open-vocabulary image segmentation. The Grounding DINO+ (GD+) module generates initial candidate bounding boxes, while the CLIP Filter++ (CLIP++) module uses a combination of visual and textual prompts to refine and filter out irrelevant object bounding boxes, ensuring that only pertinent objects are considered. Subsequently, these refined bounding boxes serve as specific prompts for the FastSAM model, which executes precise segmentation. Our VTPSeg is validated by experimental and ablation study results on five popular remote sensing image segmentation datasets.
zh
[CV-177] FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction
【速读】:该论文旨在解决如何使机器人能够直接与环境交互的问题,具体目标是识别功能互动元素的位置及其使用方式。关键在于以更细粒度检测和存储物体,并聚焦于与功能相关的关键部分。由于缺乏扩展到实例级检测之外的数据以及捕捉详细物体特征的难度,论文通过利用现有3D资源生成2D数据并训练检测器,将其整合到标准3D场景图生成管道中,从而实现比现有方案更精确的任务驱动功能接地。
链接: https://arxiv.org/abs/2503.07909
作者: Dennis Rotondi,Fabio Scaparro,Hermann Blum,Kai O. Arras
机构: Socially Intelligent Robotics Lab, Institute for Artificial Intelligence, University of Stuttgart (斯图加特大学); Robot Perception and Learning Lab, LAMARR Institute for Machine Learning and Artificial Intelligence, University of Bonn (波恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions.
zh
[CV-178] Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning ICLR2025
【速读】:该论文旨在解决图像描述任务中由于过时评估指标和粗粒度标注导致的详细图像描述评估不足的问题。论文的关键解决方案是提出了DeCapBench基准数据集以及专为此类任务设计的新颖评估指标DCScore。DCScore通过将响应分解为最小的自包含单元——即原始信息单元,并逐一评估这些单元来评价幻觉现象和细粒度全面性。此外,论文还介绍了基于此高级指标的自动细粒度反馈收集方法FeedQuill,以实现偏好优化,展示了在自动生成的偏好数据上的强大泛化能力。多项实验表明,所提出的方法不仅能显著减少幻觉现象,还能提升多种基准下的性能,从而实现更优的细节描述能力,甚至超越GPT-4o的表现。
链接: https://arxiv.org/abs/2503.07906
作者: Qinghao Ye,Xianhan Zeng,Fu Li,Chunyuan Li,Haoqi Fan
机构: ByteDance Research (字节跳动研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025
点击查看摘要
Abstract:Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.
zh
[CV-179] Intelligent Framework for Human-Robot Collaboration: Safety Dynamic Ergonomics and Adaptive Decision-Making
【速读】:该论文旨在解决工业环境中协作机器人(Collaborative Robots)引入后所面临的操作员安全性和人体工程学挑战。解决方案的关键在于提出了一种创新框架,该框架集成了先进的视觉感知技术(包括深度学习模型 YOLO11 和 SlowOnly、Unscented Kalman Filter 跟踪算法)、实时人体工程学监测(基于 OWAS 方法)以及基于行为树(Behaviour Tree, BT)的自适应决策机制。与传统孤立或静态方法不同,该框架通过模块化、可扩展且自适应的设计,显著提升了姿势与动作检测的准确性、人机交互管理的灵活性以及通过及时机器人干预降低人体工程学风险的能力。其中,视觉感知模块优于 YOLOv9 和 YOLOv8,而行为树驱动的自适应角色管理则提供了比基于规则系统更高的响应能力,使其适用于复杂的工业场景。
链接: https://arxiv.org/abs/2503.07901
作者: Francesco Iodice,Elena De Momi,Arash Ajoudani
机构: Department of Electronics, Information, and Bioengineering, Politecnico di Milano (米兰理工大学); Istituto Italiano di Tecnologia (IIT) (意大利技术研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 14 pages, 10 figures, 3 tables, IEEE conference format
点击查看摘要
Abstract:The integration of collaborative robots into industrial environments has improved productivity, but has also highlighted significant challenges related to operator safety and ergonomics. This paper proposes an innovative framework that integrates advanced visual perception technologies, real-time ergonomic monitoring, and Behaviour Tree (BT)-based adaptive decision-making. Unlike traditional methods, which often operate in isolation or statically, our approach combines deep learning models (YOLO11 and SlowOnly), advanced tracking (Unscented Kalman Filter) and dynamic ergonomic assessments (OWAS), offering a modular, scalable and adaptive system. Experimental results show that the framework outperforms previous methods in several aspects: accuracy in detecting postures and actions, adaptivity in managing human-robot interactions, and ability to reduce ergonomic risk through timely robotic interventions. In particular, the visual perception module showed superiority over YOLOv9 and YOLOv8, while real-time ergonomic monitoring eliminated the limitations of static analysis. Adaptive role management, made possible by the Behaviour Tree, provided greater responsiveness than rule-based systems, making the framework suitable for complex industrial scenarios. Our system demonstrated a 92.5% accuracy in grasping intention recognition and successfully classified ergonomic risks with real-time responsiveness (average latency of 0.57 seconds), enabling timely robotic interventions.
zh
[CV-180] Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
【速读】:该论文试图解决的问题是如何将生成式扩散模型(Generative Diffusion Models)应用于判别性任务(Discriminative Tasks),以实现其在地理空间基础模型(Geospatial Foundation Models, GFMs)中的潜力。目前,GFMs主要侧重于对比学习或掩码图像建模等判别性目标,而生成式扩散模型因其在图像生成过程中捕获多粒度语义信息的能力尚未被充分探索用于判别性应用。论文的关键在于提出SatDiFuser框架,通过系统分析基于扩散过程的多层次、噪声依赖特征,开发了三种融合策略来有效利用这些多样化的表示,从而将生成式扩散模型转化为强大的预训练工具,使其具备足够的判别能力。实验结果表明,SatDiFuser在语义分割和分类任务中分别提升了+5.7% mIoU和+7.9% F1分数,证明了基于扩散的生成式基础模型在判别性任务上的竞争力与优越性。
链接: https://arxiv.org/abs/2503.07890
作者: Yuru Jia,Valerio Marsocci,Ziyang Gong,Xue Yang,Maarten Vergauwen,Andrea Nascetti
机构: KU Leuven (鲁汶大学); KTH (瑞典皇家理工学院); ESA Φ-lab (欧洲航天局 Φ 实验室); Shanghai AI Lab (上海人工智能实验室); SJTU (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models–which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation–remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.
zh
[CV-181] Measuring directional bias amplification in image captions using predictability
【速读】:该论文试图解决在复杂任务(如图像描述生成)中衡量模型预测偏见放大(bias amplification)的问题。现有基于共现性(co-occurrence-based)的度量方法在简单问题(如图像分类)中有效,但在处理语义丰富的任务(如图像描述生成)时失效,因为它们无法捕捉描述中的语义信息。为了解决这一问题,已有工作引入了一种基于可预测性(predictability-based)的度量方法——Leakage in Captioning (LIC),但LIC存在方向性不可识别、数据集偏见估计不准确以及对攻击者模型(attacker model)敏感等局限性。论文的关键解决方案是提出了一种新的度量方法——Directional Predictability Amplification in Captioning (DPAC),它能够衡量描述中偏见放大的方向性,通过改进的替换策略提供更可靠的偏见估计,并且对攻击者模型的敏感性更低。实验表明,DPAC在COCO数据集上的表现证明其是最可靠的偏见放大量化方法。
链接: https://arxiv.org/abs/2503.07878
作者: Rahul Nair,Bhanu Tokas,Hannah Kerner
机构: Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:When we train models on biased ML datasets, they not only learn these biases but can inflate them at test time - a phenomenon called bias amplification. To measure bias amplification in ML datasets, many co-occurrence-based metrics have been proposed. Co-occurrence-based metrics are effective in measuring bias amplification in simple problems like image classification. However, these metrics are ineffective for complex problems like image captioning as they cannot capture the semantics of a caption. To measure bias amplification in captions, prior work introduced a predictability-based metric called Leakage in Captioning (LIC). While LIC captures the semantics and context of captions, it has limitations. LIC cannot identify the direction in which bias is amplified, poorly estimates dataset bias due to a weak vocabulary substitution strategy, and is highly sensitive to attacker models (a hyperparameter in predictability-based metrics). To overcome these issues, we propose Directional Predictability Amplification in Captioning (DPAC). DPAC measures directional bias amplification in captions, provides a better estimate of dataset bias using an improved substitution strategy, and is less sensitive to attacker models. Our experiments on the COCO captioning dataset show how DPAC is the most reliable metric to measure bias amplification in captions.
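为帮助理解"基于可预测性的偏见度量"(如 LIC 用攻击者模型从描述文本预测受保护属性)的基本思路,下面给出一个纯 Python 玩具示意:攻击者在人工描述与模型描述上分别训练并测试,二者"可预测性"(准确率)之差即为偏见放大的粗略估计。其中词频打分的"攻击者"、示例数据与差值计算方式均为假设性简化,并非 LIC 或 DPAC 的实现(DPAC 另含方向性度量与改进的词汇替换策略)。

```python
from collections import Counter

def train_attacker(captions, labels):
    """玩具"攻击者":按属性类别累计词频(代替论文中训练的分类器)。"""
    counts = {}
    for cap, y in zip(captions, labels):
        counts.setdefault(y, Counter()).update(cap.lower().split())
    return counts

def predict(counts, caption):
    """用各类别的词频之和打分,取得分最高的属性类别。"""
    words = caption.lower().split()
    return max(counts, key=lambda y: sum(counts[y][w] for w in words))

def leakage(counts, captions, labels):
    """攻击者准确率 = 受保护属性从描述中被"泄露"的可预测程度。"""
    hits = sum(predict(counts, c) == y for c, y in zip(captions, labels))
    return hits / len(captions)

# 假设数据:同一批图像的人工描述(不含性别词)与模型描述(引入了性别词)
human = [("a person cooking", "f"), ("a person cooking", "m")]
model = [("a woman cooking", "f"), ("a man cooking", "m")]

atk_h = train_attacker(*zip(*human))
atk_m = train_attacker(*zip(*model))
amp = leakage(atk_m, *zip(*model)) - leakage(atk_h, *zip(*human))
print(amp)  # 0.5:模型描述比人工描述更可预测,提示存在偏见放大
```

可以看到,人工描述对两类完全同词,攻击者只能达到随机水平(0.5);而模型描述因引入性别词被完美预测(1.0),差值即反映了放大量。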
zh
[CV-182] Topology-Preserving Loss for Accurate and Anatomically Consistent Cardiac Mesh Reconstruction
【速读】:该论文旨在解决从体积数据重建精确心脏网格时因现有基于变形的方法容易产生拓扑不一致(尤其是膜穿透)而导致解剖学合理性下降的问题。为了解决这一问题,论文引入了一种名为拓扑保持网格损失(Topology-Preserving Mesh Loss, TPM Loss)的新颖损失函数,其关键在于在网格变形过程中显式施加拓扑约束,通过识别违反拓扑规则的点来确保空间上的一致性重建。实验结果表明,TPM Loss 可将拓扑违规减少高达 93.1%,同时保持高分割精度(Dice 相似系数 DSC:89.1%-92.9%),并提高网格保真度(Chamfer 距离最多降低 0.26 毫米),从而有效防止膜穿透并显著提升心脏网格质量。
链接: https://arxiv.org/abs/2503.07874
作者: Chenyu Zhang,Yihao Luo,Yinzhe Wu,Choon Hwai Yap,Guang Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accurate cardiac mesh reconstruction from volumetric data is essential for personalized cardiac modeling and clinical analysis. However, existing deformation-based approaches are prone to topological inconsistencies, particularly membrane penetration, which undermines the anatomical plausibility of the reconstructed mesh. To address this issue, we introduce Topology-Preserving Mesh Loss (TPM Loss), a novel loss function that explicitly enforces topological constraints during mesh deformation. By identifying topology-violating points, TPM Loss ensures spatially consistent reconstructions. Extensive experiments on CT and MRI datasets show that TPM Loss reduces topology violations by up to 93.1% while maintaining high segmentation accuracy (DSC: 89.1%-92.9%) and improving mesh fidelity (Chamfer Distance reduction up to 0.26 mm). These results demonstrate that TPM Loss effectively prevents membrane penetration and significantly improves cardiac mesh quality, enabling more accurate and anatomically consistent cardiac reconstructions.
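摘要以 Chamfer 距离作为网格保真度指标;下面给出这一经典度量的纯 Python 示意(O(|P||Q|) 的朴素实现,点集为假设示例;实际流程中通常用 KD 树或 GPU 批处理加速,且与论文的 TPM Loss 是两回事):

```python
def chamfer_distance(P, Q):
    """两个三维点集之间的对称 Chamfer 距离(平方欧氏距离版本)。

    对 P 中每点取其到 Q 的最近距离并取均值,反向同理,二者相加。
    """
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d_pq = sum(min(sq(p, q) for q in Q) for p in P) / len(P)
    d_qp = sum(min(sq(q, p) for p in P) for q in Q) / len(Q)
    return d_pq + d_qp

# 假设的两组采样点:Q 的第二个点沿 z 轴偏移了 0.5
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Q = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
print(chamfer_distance(P, Q))  # 0.25
```

重合点集的 Chamfer 距离为 0,偏移越大数值越大,因此摘要中"Chamfer 距离最多降低 0.26 mm"对应重建网格与真值表面更贴合。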
zh
[CV-183] Video Action Differencing ICLR2025
【速读】:该论文试图解决的问题是如何识别同一动作视频之间的细微差异(Video Action Differencing, VidDiff),这一任务在教练指导和技能学习等领域具有广泛应用。为推动该任务的发展,作者创建了一个包含549对视频及其详细标注的基准数据集VidDiffBench,并指出当前最先进的大型多模态模型(如GPT-4o和Qwen2-VL)在此任务上面临显著挑战。通过对这些模型失败案例的分析,研究强调了两个关键挑战:跨视频子动作的相关性定位以及细粒度帧级比较。为克服这些问题,论文提出了一种名为VidDiff的方法,它采用一种主动的工作流,将任务分解为三个阶段:动作差异提议、关键帧定位和帧级差异比较,每个阶段都使用专门的基础模型。为了促进该领域未来的研究,作者发布了相应的基准数据集和代码。
链接: https://arxiv.org/abs/2503.07860
作者: James Burgess,Xiaohan Wang,Yuhui Zhang,Anita Rau,Alejandro Lozano,Lisa Dunlap,Trevor Darrell,Serena Yeung-Levy
机构: Stanford (斯坦福大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICLR 2025 (International Conference on Learning Representations) Project page: this http URL Benchmark: this https URL
点击查看摘要
Abstract:How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at this https URL and code at this http URL.
zh
[CV-184] Blind Video Super-Resolution based on Implicit Kernels
【速读】:该论文旨在解决盲视频超分辨率(Blind Video Super-Resolution, BVSR)任务中因未知退化场景导致的性能瓶颈问题,特别是现有方法通常假设空间不变的模糊核(spatially invariant blur kernels),而忽略了视频中存在的潜在时空变化退化(spatio-temporal varying degradations)。为了解决这一问题,论文提出了一种基于隐式核(Implicit Kernels, BVSR-IK)的新模型,其关键在于构建了一个由隐式神经表示参数化的多尺度核字典,并结合新设计的循环Transformer网络来精确预测系数权重,用于帧校正和特征对齐过程中的滤波操作。实验结果表明,BVSR-IK在三个常用数据集上的表现优于四种最先进的BVSR方法,峰值信噪比(PSNR)最高提升达0.59 dB。
链接: https://arxiv.org/abs/2503.07856
作者: Qiang Zhu,Yuxuan Jiang,Shuyuan Zhu,Fan Zhang,David Bull,Bing Zeng
机构: University of Electronic Science and Technology of China (电子科技大学); University of Bristol (布里斯托尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at this https URL.
zh
[CV-185] Learning and Evaluating Hierarchical Feature Representations
【速读】:本文旨在解决类别层次结构在特征空间映射中一致性不足的问题,确保语义上更接近的类别在特征空间中距离更近,从而减轻分类错误的严重程度并实现一致的粗粒度类别预测。为此,作者提出了一种新颖的框架——层级正交子空间组合(Hierarchical Composition of Orthogonal Subspaces, Hier-COS),其关键在于通过一个简单的变换模块将神经网络主干学到的判别性特征映射到基于固定正交框架定义的子空间中,从而设计出与给定 taxonomy 树结构一致的向量空间。此外,本文还引入了一种基于偏好的评价指标 HOPS(Hierarchically Ordered Preference Score),以克服现有层级评估方法的局限性。通过在多个具有深层标签层次结构的数据集上的实验验证,证明了 Hier-COS 在保持顶级分类准确率的同时实现了最先进的层级性能。
链接: https://arxiv.org/abs/2503.07853
作者: Depanshu Sani,Saket Anand
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Hierarchy-aware representations ensure that the semantically closer classes are mapped closer in the feature space, thereby reducing the severity of mistakes while enabling consistent coarse-level class predictions. Towards this end, we propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS), which learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree. Our approach augments neural network backbones with a simple transformation module that maps learned discriminative features to subspaces defined using a fixed orthogonal frame. This construction naturally improves the severity of mistakes and promotes hierarchical consistency. Furthermore, we highlight the fundamental limitations of existing hierarchical evaluation metrics popularly used by the vision community and introduce a preference-based metric, Hierarchically Ordered Preference Score (HOPS), to overcome these limitations. We benchmark our method on multiple large and challenging datasets having deep label hierarchies (ranging from 3 - 12 levels) and compare with several baselines and SOTA. Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case. We also demonstrate the performance of a Vision Transformer (ViT) backbone and show that learning a transformation module alone can map the learned features from a pre-trained ViT to Hier-COS and yield substantial performance benefits.
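下面用一个玩具 taxonomy 勾勒该论文"为层级结构中的节点分配固定正交方向、沿根到叶路径组合得到类嵌入,从而使语义相近的类在特征空间中更接近"的核心直觉;坐标分配与组合方式为示意性假设,并非 Hier-COS 的实际构造(论文中由可学习的变换模块把主干特征映射到这些子空间)。

```python
# child -> parent;None 表示根节点之上
taxonomy = {
    "animal": None, "vehicle": None,
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "bike": "vehicle",
}

nodes = sorted(taxonomy)                 # 确定性的 节点 -> 正交轴 分配
axis = {n: i for i, n in enumerate(nodes)}
dim = len(nodes)

def path_to_root(node):
    while node is not None:
        yield node
        node = taxonomy[node]

def class_embedding(leaf):
    """叶类嵌入 = 根到叶路径上各节点对应的单位正交方向之和。"""
    v = [0.0] * dim
    for n in path_to_root(leaf):
        v[axis[n]] = 1.0
    return v

def hier_distance(a, b):
    ea, eb = class_embedding(a), class_embedding(b)
    return sum((x - y) ** 2 for x, y in zip(ea, eb)) ** 0.5

print(hier_distance("dog", "cat"))   # 同父兄弟类:共享 'animal' 方向,距离较近
print(hier_distance("dog", "car"))   # 跨分支类:四个方向全不同,距离更远
```

这样,即使叶级预测出错,错到兄弟类的"代价"也小于错到跨分支类,与摘要中"降低错误严重程度、保证粗粒度预测一致"的目标一致。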
zh
[CV-186] TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces
【速读】:该论文旨在解决有限标注数据条件下基础模型微调的挑战,通过利用互信息分解方法提高在少量标注数据场景下的分类任务性能。论文的关键解决方案在于提出了一种半监督微调框架,该框架基于两个不同的下界优化:一是针对下游任务空间(如分类)的优化,结合条件与边缘交叉熵以及Kullback-Leibler散度;二是针对潜在表示空间的正则化与对齐,采用类似对比学习的分解方式。此外,该策略仅修改预训练模型中的专用投影模块,包括一个小规模Transformer网络和一种令牌聚合技术,从而保留了基础模型的原始结构。实验结果表明,在极低标注量条件下,该方法能够显著提升分类任务的表现,同时有效利用未标注数据。
链接: https://arxiv.org/abs/2503.07851
作者: Guillaume Quétant,Pavlo Molchanov,Slava Voloshynovskiy
机构: University of Geneva (日内瓦大学); NVIDIA
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:We present a semi-supervised fine-tuning framework for foundation models that utilises mutual information decomposition to address the challenges of training for a limited amount of labelled data. Our approach derives two distinct lower bounds: i) for the downstream task space, such as classification, optimised using conditional and marginal cross-entropy alongside Kullback-Leibler divergence, and ii) for the latent space representation, regularised and aligned using a contrastive-like decomposition. This fine-tuning strategy retains the pre-trained structure of the foundation model, modifying only a specialised projector module comprising a small transformer and a token aggregation technique. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions by effectively leveraging unlabelled data.
zh
[CV-187] Fixing the RANSAC Stopping Criterion
【速读】:该论文旨在解决自1981年Fischler和Bolles提出RANSAC算法以来长期存在的一项错误,即在基于随机采样的假设生成过程中,其采样概率的近似计算导致的严重欠采样问题。这种近似方法虽然被广泛应用于现代RANSAC的各种变体和实现中,但并未受到后续研究的充分质疑或深入分析。论文通过理论推导和实验验证表明,该近似会导致难以发现高质量模型,尤其是在少量内点(inliers)和高模型复杂度的情况下。解决方案的关键在于提供一种精确计算采样概率的方法,该方法实现简单且效果显著,有望对计算机视觉系统的多个领域产生深远影响。
链接: https://arxiv.org/abs/2503.07829
作者: Johannes Schönberger,Viktor Larsson,Marc Pollefeys
机构: ETH Zurich (苏黎世联邦理工学院); Microsoft (微软); Lund University (隆德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:For several decades, RANSAC has been one of the most commonly used robust estimation algorithms for many problems in computer vision and related fields. The main contribution of this paper lies in addressing a long-standing error baked into virtually any system building upon the RANSAC algorithm. Since its inception in 1981 by Fischler and Bolles, many variants of RANSAC have been proposed on top of the same original idea relying on the fact that random sampling has a high likelihood of generating a good hypothesis from minimal subsets of measurements. An approximation to the sampling probability was originally derived by the paper in 1981 in support of adaptively stopping RANSAC and is, as such, used in the vast majority of today’s RANSAC variants and implementations. The impact of this approximation has since not been questioned or thoroughly studied by any of the later works. As we theoretically derive and practically demonstrate in this paper, the approximation leads to severe undersampling and thus failure to find good models. The discrepancy is especially pronounced in challenging scenarios with few inliers and high model complexity. An implementation of computing the exact probability is surprisingly simple yet highly effective and has potentially drastic impact across a large range of computer vision systems.
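作为背景,下面用 Python 给出 Fischler 与 Bolles (1981) 中广为沿用的近似自适应停止准则 N = ⌈log(1-p) / log(1-w^m)⌉——也正是本文指出会在低内点率、高模型复杂度场景下导致严重欠采样的那一近似;示例参数(内点率、最小采样数、置信度)为假设取值,精确概率的计算方法请参考论文本身。

```python
import math

def ransac_max_iterations(inlier_ratio: float, sample_size: int,
                          confidence: float = 0.99) -> int:
    """经典(近似)RANSAC 停止准则。

    假设每次抽取 sample_size 个点的最小子集全为内点的概率是
    inlier_ratio**sample_size,返回使"至少抽到一次全内点子集"的
    概率不低于 confidence 所需的迭代数 N。
    """
    w_m = inlier_ratio ** sample_size      # 单次最小子集全为内点的概率
    if w_m <= 0.0:
        return 10 ** 9                     # 退化情形:实际上永不停止
    if w_m >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - w_m))

# 示例:基础矩阵估计(7 点最小子集)、内点率 50%、置信度 0.99
print(ransac_max_iterations(0.5, 7))  # 588
```

注意该公式只保证"抽到一次全内点子集"的概率,并不保证由此得到好的模型——这正是论文指出的近似与精确采样概率之间的差距所在。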
zh
[CV-188] Neural Radiance and Gaze Fields for Visual Attention Modeling in 3D Environments
【速读】:本文旨在解决如何在复杂三维场景中表示和重建视觉注意模式的问题。解决方案的关键在于引入了一种名为神经辐射与注视场(Neural Radiance and Gaze Fields, NeRGs)的新方法,通过将标准的神经辐射场(Neural Radiance Field, NeRF)扩展,加入一个额外的神经网络来建模注视概率分布,从而实现从任意视角渲染场景图像以及对应的像素级显著性图,该显著性图表示观察者在渲染图像中注视给定表面的条件概率。为了确保注视模式的一致性重建,NeRGs 还考虑了场景的三维结构约束,并处理了由于视线与渲染相机视角分离导致的遮挡情况。训练过程中利用了来自骨骼追踪数据的真实头部姿态或基于二维显著性模型的预测数据。
链接: https://arxiv.org/abs/2503.07828
作者: Andrei Chubarau,Yinan Wang,James J. Clark
机构: McGill University (麦吉尔大学); Centre for Intelligent Machines (智能机器中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 8 figures
点击查看摘要
Abstract:We introduce Neural Radiance and Gaze Fields (NeRGs) as a novel approach for representing visual attention patterns in 3D scenes. Our system renders a 2D view of a 3D scene with a pre-trained Neural Radiance Field (NeRF) and visualizes the gaze field for arbitrary observer positions, which may be decoupled from the render camera perspective. We achieve this by augmenting a standard NeRF with an additional neural network that models the gaze probability distribution. The output of a NeRG is a rendered image of the scene viewed from the camera perspective and a pixel-wise salience map representing conditional probability that an observer fixates on a given surface within the 3D scene as visible in the rendered image. Much like how NeRFs perform novel view synthesis, NeRGs enable the reconstruction of gaze patterns from arbitrary perspectives within complex 3D scenes. To ensure consistent gaze reconstructions, we constrain gaze prediction on the 3D structure of the scene and model gaze occlusion due to intervening surfaces when the observer’s viewpoint is decoupled from the rendering camera. For training, we leverage ground truth head pose data from skeleton tracking data or predictions from 2D salience models. We demonstrate the effectiveness of NeRGs in a real-world convenience store setting, where head pose tracking data is available.
zh
[CV-189] Helios 2.0: A Robust Ultra-Low Power Gesture Recognition System Optimised for Event-Sensor based Wearables
【速读】:该论文旨在解决在智能眼镜等可穿戴设备中实现自然且高效的手部手势控制的问题。现有手部手势识别技术虽已取得显著进展,但仍面临三大挑战:系统需直观易用、适应不同用户与环境,并具备足够的功耗效率以支持实际可穿戴应用。论文的关键解决方案在于精心选择微手势(microgestures),包括拇指在食指上的双向横向滑动以及拇指与食指指尖的双捏动作。这些手势利用人类自然的手部运动模式,确保了直观性,无需用户学习复杂的命令序列。此外,为应对用户和环境的多样性,研究提出了新颖的仿真方法,实现了全面的领域采样,而无需大量真实世界数据收集。最终,通过优化的架构设计,在仅使用合成训练数据的情况下,实现了卓越的性能,F1分数超过80%,并在低功耗下保持高效运行(6-8 mW)。这一突破显著提升了超低功耗视觉系统的部署可能性,并为无缝人机交互开辟了新路径。
链接: https://arxiv.org/abs/2503.07825
作者: Prarthana Bhattacharyya,Joshua Mitton,Ryan Page,Owen Morgan,Oliver Powell,Benjamin Menzies,Gabriel Homewood,Kemi Jacobs,Paolo Baesso,Taru Muhonen,Richard Vigars,Louis Berridge
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 15 pages, 17 figures. Prarthana Bhattacharyya, Joshua Mitton, Ryan Page, Owen Morgan, and Oliver Powell contributed equally to this paper
点击查看摘要
Abstract:We present an advance in wearable technology: a mobile-optimized, real-time, ultra-low-power event camera system that enables natural hand gesture control for smart glasses, dramatically improving user experience. While hand gesture recognition in computer vision has advanced significantly, critical challenges remain in creating systems that are intuitive, adaptable across diverse users and environments, and energy-efficient enough for practical wearable applications. Our approach tackles these challenges through carefully selected microgestures: lateral thumb swipes across the index finger (in both directions) and a double pinch between thumb and index fingertips. These human-centered interactions leverage natural hand movements, ensuring intuitive usability without requiring users to learn complex command sequences. To overcome variability in users and environments, we developed a novel simulation methodology that enables comprehensive domain sampling without extensive real-world data collection. Our power-optimised architecture maintains exceptional performance, achieving F1 scores above 80% on benchmark datasets featuring diverse users and environments. The resulting models operate at just 6-8 mW when exploiting the Qualcomm Snapdragon Hexagon DSP, with our 2-channel implementation exceeding 70% F1 accuracy and our 6-channel model surpassing 80% F1 accuracy across all gesture classes in user studies. These results were achieved using only synthetic training data. This improves on the state-of-the-art for F1 accuracy by 20% with a power reduction 25x when using DSP. This advancement brings deploying ultra-low-power vision systems in wearable devices closer and opens new possibilities for seamless human-computer interaction.
zh
[CV-190] Elderly Activity Recognition in the Wild: Results from the EAR Challenge WACV2025
【速读】:该论文致力于解决老年人日常活动识别(Elderly Action Recognition, EAR)的问题,具体目标是识别老年人的日常生活活动(Activities of Daily Living, ADLs),涵盖六个动作类别,并使用多样化数据集进行评估。论文的关键解决方案在于基于一种最先进的动作识别模型,通过针对老年人特定数据集的迁移学习(transfer learning)进行微调,以增强模型的适应性。此外,为了提高泛化能力并减轻数据集偏差,研究团队精心筛选了来自多个公开来源的训练数据,并应用了针对性的数据预处理技术。这一方法使模型在公开排行榜上达到了0.81455的准确率,验证了其在分类老年人活动方面的有效性。
链接: https://arxiv.org/abs/2503.07821
作者: Anh-Kiet Duong
机构: L3i Laboratory, La Rochelle University (拉罗谢尔大学L3i实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2 pages, EAR-CV4Smalls@WACV2025
点击查看摘要
Abstract:This paper presents our solution for the Elderly Action Recognition (EAR) Challenge, part of the Computer Vision for Smalls Workshop at WACV 2025. The competition focuses on recognizing Activities of Daily Living (ADLs) performed by the elderly, covering six action categories with a diverse dataset. Our approach builds upon a state-of-the-art action recognition model, fine-tuned through transfer learning on elderly-specific datasets to enhance adaptability. To improve generalization and mitigate dataset bias, we carefully curated training data from multiple publicly available sources and applied targeted pre-processing techniques. Our solution currently achieves 0.81455 accuracy on the public leaderboard, highlighting its effectiveness in classifying elderly activities. Source codes are publicly available at this https URL.
zh
[CV-191] POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality
【速读】:本文旨在解决三维高斯点 splatting (3D-GS) 中不确定性量化的问题,这一问题限制了其在主动感知任务中的应用,例如通过新图像获取信息或识别可从内存中移除的冗余图像以应对在线3D-GS SLAM中的资源约束。论文的关键解决方案是通过最优实验设计(Optimal Experimental Design)的视角重新构建3D-GS的信息量化问题,其中T-最优性(T-Optimality)和D-最优性(D-Optimality)在两个常用数据集上的定量和定性评估中表现出色。此外,作者提出了一种3D-GS不确定性的块对角近似方法,以更准确地计算信息增益,但需付出更高的计算代价。
链接: https://arxiv.org/abs/2503.07819
作者: Joey Wilson,Marcelino Almeida,Sachit Mahajan,Martin Labrie,Maani Ghaffari,Omid Ghasemalizadeh,Min Sun,Cheng-Hao Kuo,Arnab Sen
机构: University of Michigan; Amazon Lab 126 (亚马逊实验室126)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:In this paper, we present a novel algorithm for quantifying uncertainty and information gained within 3D Gaussian Splatting (3D-GS) through P-Optimality. While 3D-GS has proven to be a useful world model with high-quality rasterizations, it does not natively quantify uncertainty. Quantifying uncertainty in parameters of 3D-GS is necessary to understand the information gained from acquiring new images as in active perception, or identify redundant images which can be removed from memory due to resource constraints in online 3D-GS SLAM. We propose to quantify uncertainty and information gain in 3D-GS by reformulating the problem through the lens of optimal experimental design, which is a classical solution to measuring information gain. By restructuring information quantification of 3D-GS through optimal experimental design, we arrive at multiple solutions, of which T-Optimality and D-Optimality perform the best quantitatively and qualitatively as measured on two popular datasets. Additionally, we propose a block diagonal approximation of the 3D-GS uncertainty, which provides a measure of correlation for computing more accurate information gain, at the expense of a greater computation cost.
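摘要中的 T-最优性与 D-最优性分别对应信息矩阵的迹与对数行列式。下面给出一个基于高斯-牛顿近似 Fisher 信息矩阵的最小 numpy 示意(噪声尺度 σ 与函数命名均为本文假设,并非论文官方实现):

```python
import numpy as np

def fisher_information(J, sigma=1.0):
    # 高斯-牛顿近似:J 为渲染残差对 3D-GS 参数的雅可比矩阵
    return (J.T @ J) / sigma ** 2

def t_optimality(I):
    # T-最优性:信息矩阵的迹
    return np.trace(I)

def d_optimality(I, eps=1e-9):
    # D-最优性:对数行列式(用 slogdet 保证数值稳定)
    _, logdet = np.linalg.slogdet(I + eps * np.eye(I.shape[0]))
    return logdet

rng = np.random.default_rng(0)
J = rng.standard_normal((100, 6))        # 假设 100 个残差、6 个参数
I = fisher_information(J)
print(t_optimality(I), d_optimality(I))  # 两种候选度量下的信息量
```

在主动感知场景中,可对每个候选视角计算上述度量,选取信息增益最大的视角作为下一最佳视角(Next Best View)。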
zh
[CV-192] AgriField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel
【速读】:该论文旨在解决三维农业研究中因缺乏大规模、多样化数据集而导致的人工智能(AI)应用局限性,特别是在玉米研究中的挑战。传统二维图像数据集虽丰富,但无法捕捉三维数据所提供的关键结构细节,如叶片架构、植株体积及空间排列。为克服这一限制,论文提出AgriField3D数据集,其核心解决方案的关键在于结合高精度三维点云数据与高性能的参数化过程模型。具体而言,通过使用非均匀有理B样条(NURBS)生成叶片表面和植株架构的过程模型,并采用粒子群优化(PSO)与可微编程相结合的两步优化方法,实现了精确且可扩展的重建能力。此外,论文通过基于图的分割技术实现叶片和茎秆的独立分离,并辅以严格的手动质量控制流程,确保数据集的标注一致性和准确性。最终,AgriField3D数据集不仅提供了包含多分辨率子采样版本的全面资源,还通过整合高保真过程模型和人工验证,为基于AI的表型分析、植株结构分析以及农业研究中的三维应用奠定了坚实基础。
链接: https://arxiv.org/abs/2503.07813
作者: Elvis Kimara,Mozhgan Hadadi,Jackson Godbersen,Aditya Balu,Talukder Jubery,Yawei Li,Adarsh Krishnamurthy,Patrick S. Schnable,Baskar Ganapathysubramanian
机构: Department of Mechanical Engineering, Iowa State University (爱荷华州立大学机械工程系), Ames, USA; Translational AI Research and Education Center (转化人工智能研究与教育中心), Ames, USA; Plant Science Institute, Iowa State University (爱荷华州立大学植物科学研究所), Ames, USA; Interdepartmental Genetics and Genomics Graduate Program, Iowa State University (爱荷华州立大学跨学科遗传与基因组学研究生项目), Ames, USA; Department of Agronomy, Iowa State University (爱荷华州立大学农学系), Ames, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Elvis Kimara and Mozhgan Hadadi contributed equally to this work
点击查看摘要
Abstract:The application of artificial intelligence (AI) in three-dimensional (3D) agricultural research, particularly for maize, has been limited by the scarcity of large-scale, diverse datasets. While 2D image datasets are abundant, they fail to capture essential structural details such as leaf architecture, plant volume, and spatial arrangements that 3D data provide. To address this limitation, we present AgriField3D (this https URL), a curated dataset of 3D point clouds of field-grown maize plants from a diverse genetic panel, designed to be AI-ready for advancing agricultural research. Our dataset comprises over 1,000 high-quality point clouds collected using a Terrestrial Laser Scanner, complemented by procedural models that provide structured, parametric representations of maize plants. These procedural models, generated using Non-Uniform Rational B-Splines (NURBS) and optimized via a two-step process combining Particle Swarm Optimization (PSO) and differentiable programming, enable precise, scalable reconstructions of leaf surfaces and plant architectures. To enhance usability, we performed graph-based segmentation to isolate individual leaves and stalks, ensuring consistent labeling across all samples. We also conducted rigorous manual quality control on all datasets, correcting errors in segmentation, ensuring accurate leaf ordering, and validating metadata annotations. The dataset further includes metadata detailing plant morphology and quality, alongside multi-resolution subsampled versions (100k, 50k, 10k points) optimized for various computational needs. By integrating point cloud data of field grown plants with high-fidelity procedural models and ensuring meticulous manual validation, AgriField3D provides a comprehensive foundation for AI-driven phenotyping, plant structural analysis, and 3D applications in agricultural research.
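论文用 PSO 与可微编程两步优化 NURBS 过程模型参数。下面是一个标准粒子群优化的最小示意(惯性权重与学习因子取常见默认值,目标函数仅为演示;论文中的实际目标是 NURBS 重建误差,此处均属本文假设):

```python
import numpy as np

def pso_minimize(f, dim, n=30, iters=200, seed=0):
    # 标准粒子群优化:速度 = 惯性项 + 认知项 + 社会项
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n, dim))          # 粒子位置
    v = np.zeros((n, dim))                    # 粒子速度
    pbest = x.copy()                          # 各粒子历史最优位置
    pval = np.array([f(p) for p in x])        # 对应目标值
    g = pbest[pval.argmin()].copy()           # 全局最优位置
    for _ in range(iters):
        r1, r2 = rng.random((2, n, dim))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        val = np.array([f(p) for p in x])
        better = val < pval
        pbest[better], pval[better] = x[better], val[better]
        g = pbest[pval.argmin()].copy()
    return g, pval.min()

# 演示:最小化一个简单二次函数(论文中 f 为 NURBS 曲面与点云的拟合误差)
g, best = pso_minimize(lambda p: ((p - 1.0) ** 2).sum(), dim=3)
print(best)
```

PSO 先做无梯度的全局搜索,论文再用可微编程在其结果附近做梯度精调,两步结合兼顾全局性与精度。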
zh
[CV-193] Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos
【速读】:该论文旨在解决先天性心脏病(Congenital Heart Disease, CHD)检测中因标注数据稀缺及隐私保护限制导致的深度学习模型开发难题。具体而言,集中收集大规模真实世界罕见病数据集(如CHD)面临协调难度大和资源消耗高的挑战,同时日益严格的隐私政策限制了跨机构的数据共享。为应对这些挑战,论文首次提出了一种基于隐私保护的零样本CHD检测框架,将CHD检测转化为正常性建模问题,并结合模型融合技术。该框架的关键在于引入了一种名为Sparse Tube Ultrasound Distillation (STUD) 的方法,其中每个医院站点首先利用自监督视频异常检测(Video Anomaly Detection, VAD)模型和自蒸馏损失函数,在正常的胎儿心脏超声视频片段上训练稀疏视频管模型,从而独立学习健康病例的分布。为在不交换数据的情况下聚合分布式模型的知识,论文进一步提出了Divergence Vector-Guided Model Merging(DivMerge)方法,通过该方法将特定于站点的模型合并为单一的VAD模型,同时保持隐私性和领域无关的时空表示能力。实验结果显示,该方法在外部测试集上的准确率和F1分数分别比单个站点模型提升了23.77%和30.13%,验证了其有效性和泛化能力。
链接: https://arxiv.org/abs/2503.07799
作者: Pramit Saha,Divyanshu Mishra,Netzahualcoyotl Hernandez-Cruz,Olga Patey,Aris Papageorghiou,Yuki M. Asano,J. Alison Noble
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Congenital Heart Disease (CHD) is one of the leading causes of fetal mortality, yet the scarcity of labeled CHD data and strict privacy regulations surrounding fetal ultrasound (US) imaging present significant challenges for the development of deep learning-based models for CHD detection. Centralised collection of large real-world datasets for rare conditions, such as CHD, from large populations requires significant co-ordination and resource. In addition, data governance rules increasingly prevent data sharing between sites. To address these challenges, we introduce, for the first time, a novel privacy-preserving, zero-shot CHD detection framework that formulates CHD detection as a normality modeling problem integrated with model merging. In our framework dubbed Sparse Tube Ultrasound Distillation (STUD), each hospital site first trains a sparse video tube-based self-supervised video anomaly detection (VAD) model on normal fetal heart US clips with self-distillation loss. This enables site-specific models to independently learn the distribution of healthy cases. To aggregate knowledge across the decentralized models while maintaining privacy, we propose a Divergence Vector-Guided Model Merging approach, DivMerge, that combines site-specific models into a single VAD model without data exchange. Our approach preserves domain-agnostic rich spatio-temporal representations, ensuring generalization to unseen CHD cases. We evaluated our approach on real-world fetal US data collected from 5 hospital sites. Our merged model outperformed site-specific models by 23.77% and 30.13% in accuracy and F1-score respectively on external test sets.
zh
[CV-194] EAZY: Eliminating Hallucinations in LVLMs by Zeroing out Hallucinatory Image Tokens
【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)中存在的对象幻觉(object hallucination)问题,即模型生成的输出中错误地包含实际上不存在的对象。论文将研究焦点从语言模型的主干转移到图像输入源,探索特定图像标记如何导致幻觉现象。研究的关键发现是:一小部分具有高注意力分数的图像标记是对象幻觉的主要驱动因素。通过移除这些引起幻觉的图像标记(仅占所有图像标记的1.5%),可以有效缓解这一问题,并且这一发现适用于不同的模型和数据集。基于此洞察,作者提出了名为EAZY的新方法,这是一种无需训练的方法,能够自动识别并消除幻觉性图像标记。EAZY在无监督对象幻觉检测任务中实现了比先前方法提高15%的性能,并且在减轻幻觉的同时保持了模型的实用价值,还能无缝适应各种LVLM架构。
链接: https://arxiv.org/abs/2503.07772
作者: Liwei Che,Tony Qingze Liu,Jing Jia,Weiyi Qin,Ruixiang Tang,Vladimir Pavlovic
机构: Rutgers University (罗格斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist. Although most works focus on addressing this issue within the language-model backbone, our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. Our analysis reveals a striking finding: a small subset of image tokens with high attention scores are the primary drivers of object hallucination. By removing these hallucinatory image tokens (only 1.5% of all image tokens), the issue can be effectively mitigated. This finding holds consistently across different models and datasets. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. We utilize EAZY for unsupervised object hallucination detection, achieving 15% improvement compared to previous methods. Additionally, EAZY demonstrates remarkable effectiveness in mitigating hallucinations while preserving model utility and seamlessly adapting to various LVLM architectures.
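EAZY 的核心操作可以概括为:按注意力分数找出少量(约 1.5%)图像 token 并将其置零。下面是一个与这一思路对应的最小示意(输入形状、比例与函数命名均为演示假设,并非官方实现):

```python
import numpy as np

def eazy_zero_tokens(image_tokens, attn_scores, ratio=0.015):
    # image_tokens: (N, D) 图像 token 嵌入
    # attn_scores:  (N,)  文本侧对各图像 token 的注意力分数
    # ratio: 置零比例(论文报告约 1.5% 即可显著缓解幻觉)
    n_zero = max(1, int(round(len(attn_scores) * ratio)))
    top_idx = np.argsort(attn_scores)[-n_zero:]  # 注意力最高的 token
    out = image_tokens.copy()
    out[top_idx] = 0.0                           # 将疑似致幻 token 置零
    return out, top_idx

tokens = np.ones((200, 8))
scores = np.linspace(0.0, 1.0, 200)
cleaned, idx = eazy_zero_tokens(tokens, scores)
print(len(idx))   # 200 个 token 中置零 3 个
```

该方法无需训练,可在推理时直接作用于任意 LVLM 的视觉输入序列。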
zh
[CV-195] NimbleReg: A light-weight deep-learning framework for diffeomorphic image registration MICCAI2025
【速读】:本文旨在解决现有基于深度学习(Deep Learning, DL)的图像配准方法中存在的硬件资源消耗大以及多区域分割融合机制不足的问题。传统方法通常依赖于繁琐的网格表示,而轻量级方法虽利用边界表面表示分割,但缺乏有效融合多区域映射为整体微分同胚变换的机制。为应对这些挑战,本文提出NimbleReg,一种基于PointNet主干网络的轻量级DL框架,通过利用微分同胚的静止速度场参数化方法,实现从多个分割区域表面到整个环境空间的整体微分同胚变换。关键在于结合点云处理能力和微分同胚约束,从而在保证配准精度的同时显著降低计算复杂度。
链接: https://arxiv.org/abs/2503.07768
作者: Antoine Legouhy,Ross Callaghan,Nolah Mazet,Vivien Julienne,Hojjat Azadbakht,Hui Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted in MICCAI 2025 conference
点击查看摘要
Abstract:This paper presents NimbleReg, a light-weight deep-learning (DL) framework for diffeomorphic image registration leveraging surface representation of multiple segmented anatomical regions. Deep learning has revolutionized image registration but most methods typically rely on cumbersome gridded representations, leading to hardware-intensive models. Reliable fine-grained segmentations, which are now accessible at low cost, are often used to guide the alignment. Light-weight methods representing segmentations in terms of boundary surfaces have been proposed, but they lack a mechanism to support the fusion of multiple regional mappings into an overall diffeomorphic transformation. Building on these advances, we propose a DL registration method capable of aligning surfaces from multiple segmented regions to generate an overall diffeomorphic transformation for the whole ambient space. The proposed model is light-weight thanks to a PointNet backbone. Diffeomorphic properties are guaranteed by taking advantage of the stationary velocity field parametrization of diffeomorphisms. We demonstrate that this approach achieves alignment comparable to state-of-the-art DL-based registration techniques that consume images.
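摘要提到利用微分同胚的静止速度场(stationary velocity field, SVF)参数化来保证变换性质,其标准积分方法是"缩放-平方"(scaling and squaring):先把速度缩小为小步长,再反复与自身复合。下面用矩阵速度(线性形变)演示这一思想(真实配准中作用对象是稠密速度场,此处仅为原理示意):

```python
import numpy as np

def exp_by_scaling_squaring(V, n=8):
    # 以缩放-平方法近似矩阵指数 exp(V):
    # 先取小步长 I + V/2^n,再自乘 n 次(即复合 2^n 次)
    A = np.eye(V.shape[0]) + V / (2 ** n)
    for _ in range(n):
        A = A @ A
    return A

theta = 0.5
V = np.array([[0.0, -theta],
              [theta, 0.0]])            # 旋转型速度场
R = exp_by_scaling_squaring(V, n=20)
print(R)                                # 近似 0.5 弧度的二维旋转矩阵
```

SVF 参数化的好处是:只要速度场光滑,如此积分得到的变换必为微分同胚,且逆变换就是对 -V 做同样的积分。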
zh
[CV-196] Better Pose Initialization for Fast and Robust 2D/3D Pelvis Registration
【速读】:该论文旨在解决基于优化的姿势估计算法中2D/3D骨盆配准易受初始化影响而导致收敛失败的问题。论文的关键在于提出了一种利用学习得到的初始化函数来改善配准精度的方法,即使是一个粗略的初始值也能显著提升配准准确性并提高整体计算效率。这种方法在面对极端姿态变化的挑战性情况下同样有效。实验验证表明,所提出的方法能够始终实现鲁棒且精确的配准,从而增强临床应用中2D/3D配准的可靠性。
链接: https://arxiv.org/abs/2503.07767
作者: Yehyun Suh,J. Ryan Martin,Daniel Moyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents an approach for improving 2D/3D pelvis registration in optimization-based pose estimators using a learned initialization function. Current methods often fail to converge to the optimal solution when initialized naively. We find that even a coarse initializer greatly improves pose estimator accuracy, and improves overall computational efficiency. This approach proves to be effective also in challenging cases under more extreme pose variation. Experimental validation demonstrates that our method consistently achieves robust and accurate registration, enhancing the reliability of 2D/3D registration for clinical applications.
zh
[CV-197] SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation
【速读】:该论文旨在解决Transformer架构在应用于3D医学图像数据集时面临的高训练时间和内存需求问题,这些问题限制了模型的可扩展性并增加了碳足迹。论文的关键解决方案是提出了一种名为SegResMamba的高效3D分割模型,该模型通过借鉴结构化状态空间模型(SSMs)的最新进展,显著降低了计算复杂度、内存使用、训练时间和环境影响,同时保持了高性能。与现有的最先进的架构相比,SegResMamba在训练过程中使用的内存减少了一半以上,实现了资源需求的大幅降低而性能相当。
链接: https://arxiv.org/abs/2503.07766
作者: Badhan Kumar Das,Ajay Singh,Saahil Islam,Gengyan Zhao,Andreas Maier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The Transformer architecture has opened a new paradigm in the domain of deep learning with its ability to model long-range dependencies and capture global context and has outpaced the traditional Convolution Neural Networks (CNNs) in many aspects. However, applying Transformer models to 3D medical image datasets presents significant challenges due to their high training time, and memory requirements, which not only hinder scalability but also contribute to elevated CO _2 footprint. This has led to an exploration of alternative models that can maintain or even improve performance while being more efficient and environmentally sustainable. Recent advancements in Structured State Space Models (SSMs) effectively address some of the inherent limitations of Transformers, particularly their high memory and computational demands. Inspired by these advancements, we propose an efficient 3D segmentation model for medical imaging called SegResMamba, designed to reduce computation complexity, memory usage, training time, and environmental impact while maintaining high performance. Our model uses less than half the memory during training compared to other state-of-the-art (SOTA) architectures, achieving comparable performance with significantly reduced resource demands.
zh
[CV-198] 2D/3D Registration of Acetabular Hip Implants Under Perspective Projection and Fully Differentiable Ellipse Fitting
【速读】:该论文旨在解决全髋关节置换术中通过前后位髋关节荧光透视图像精确估计髋臼假体的方位(前倾角、倾斜角)和位置的问题。解决方案的关键在于构建了一种考虑荧光镜几何畸变的前向投影模型,并实现了可微椭圆拟合,以提高估计结果与真实值之间的相似性。这种方法能够在临床和手术环境中提供高精度且计算需求低的解决方案。
链接: https://arxiv.org/abs/2503.07763
作者: Yehyun Suh,J. Ryan Martin,Daniel Moyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents a novel method for estimating the orientation and the position of acetabular hip implants in total hip arthroplasty using full anterior-posterior hip fluoroscopy images. Our method accounts for distortions induced in the fluoroscope geometry, estimating acetabular component pose by creating a forward model of the perspective projection and implementing differentiable ellipse fitting to measure the similarity of our estimate to the ground truth. This approach enables precise estimation of the implant’s rotation (anteversion, inclination) and the translation under fluoroscope-induced deformation. Experimental results from both numerically simulated and digitally reconstructed radiograph environments demonstrate high accuracy with minimal computational demands, offering enhanced precision and applicability in clinical and surgical settings.
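髋臼假体的圆形开口在透视投影下成像为椭圆,可微椭圆拟合因此是该方法的核心一步。下面给出一个最小的最小二乘圆锥曲线拟合示意(整个求解过程可微,便于接入梯度优化;仅为原理演示,并非论文官方实现):

```python
import numpy as np

def fit_conic(x, y):
    # 最小二乘拟合圆锥曲线 a*x^2 + b*x*y + c*y^2 + d*x + e*y = 1
    # 线性方程组求解本身可微,可直接嵌入梯度优化流程
    A = np.column_stack([x ** 2, x * y, y ** 2, x, y])
    coef, *_ = np.linalg.lstsq(A, np.ones_like(x), rcond=None)
    return coef   # (a, b, c, d, e)

# 位于椭圆 (x/2)^2 + (y/3)^2 = 1 上的点
t = np.linspace(0.0, 2 * np.pi, 50, endpoint=False)
x, y = 2.0 * np.cos(t), 3.0 * np.sin(t)
a, b, c, d, e = fit_conic(x, y)
print(a, c)   # 应接近 1/4 与 1/9
```

论文在此类拟合之上进一步叠加了透视投影前向模型与荧光镜畸变建模,从椭圆参数反推前倾角与倾斜角。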
zh
[CV-199] SANDRO: a Robust Solver with a Splitting Strategy for Point Cloud Registration ICRA
【速读】:该论文旨在解决点云配准在高离群点率或长时间收敛至合适解的挑战。现有方法在面对高离群点率时往往失效或耗时过长。为解决这些问题,论文提出了一种名为SANDRO(Splitting strategy for point cloud Alignment using Non-convex anD Robust Optimization)的新算法,其核心在于结合迭代重加权最小二乘法(Iteratively Reweighted Least Squares, IRLS)框架与具有渐进非凸性的鲁棒损失函数,并辅以一种专门设计用于处理高离群点率和离群点偏态分布的分裂策略。这一方案的关键创新在于通过鲁棒优化和分裂策略显著提升了在高离群点率及点云对称性等困难场景下的收敛性能,从而实现了比现有最先进的方法高出20%的成功率(针对真实数据集)和60%的成功率(针对合成数据集)。
链接: https://arxiv.org/abs/2503.07743
作者: Michael Adlerstein,João Carlos Virgolino Soares,Angelo Bratta,Claudio Semini
机构: Dynamic Legged Systems (DLS) lab, Istituto Italiano di Tecnologia (IIT)(意大利技术研究院), Genova, Italy
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2025
点击查看摘要
Abstract:Point cloud registration is a critical problem in computer vision and robotics, especially in the field of navigation. Current methods often fail when faced with high outlier rates or take a long time to converge to a suitable solution. In this work, we introduce a novel algorithm for point cloud registration called SANDRO (Splitting strategy for point cloud Alignment using Non-convex anD Robust Optimization), which combines an Iteratively Reweighted Least Squares (IRLS) framework with a robust loss function with graduated non-convexity. This approach is further enhanced by a splitting strategy designed to handle high outlier rates and skewed distributions of outliers. SANDRO is capable of addressing important limitations of existing methods, as in challenging scenarios where the presence of high outlier rates and point cloud symmetries significantly hinder convergence. SANDRO achieves superior performance in terms of success rate when compared to the state-of-the-art methods, demonstrating a 20% improvement from the current state of the art when tested on the Redwood real dataset and 60% improvement when tested on synthetic data.
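SANDRO 的主干是 IRLS 配合渐进非凸(graduated non-convexity, GNC)的鲁棒损失。下面给出一个基于 Geman-McClure 权重与加权 Kabsch 求解的最小示意(权重形式、退火因子与各参数均为本文假设的一种常见取法,并未包含论文的分裂策略,也非官方实现):

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    # 加权最小二乘意义下的最优刚体变换 (R, t):
    # min sum_i w_i * ||R p_i + t - q_i||^2
    w = w / w.sum()
    mp, mq = (w[:, None] * P).sum(0), (w[:, None] * Q).sum(0)
    H = (P - mp).T @ (w[:, None] * (Q - mq))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # 防止反射解
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mq - R @ mp

def irls_register(P, Q, iters=30, mu0=64.0, c=0.1):
    # IRLS + Geman-McClure 鲁棒权重;mu 由大到小退火即"渐进非凸"
    R, t, mu = np.eye(3), np.zeros(3), mu0
    for _ in range(iters):
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)    # 当前残差
        w = (mu * c ** 2 / (r ** 2 + mu * c ** 2)) ** 2  # GM 权重(GNC 代理)
        R, t = weighted_kabsch(P, Q, w)
        mu = max(1.0, mu / 1.4)                          # 退火调度
    return R, t

# 演示:45 个无噪内点 + 15 个离群点(25% 离群率)
rng = np.random.default_rng(1)
P = rng.standard_normal((60, 3))
ang = 0.4
R_true = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang),  np.cos(ang), 0.0],
                   [0.0,          0.0,         1.0]])
t_true = np.array([0.5, -0.2, 0.1])
Q = P @ R_true.T + t_true
Q[:15] += 5.0 * rng.standard_normal((15, 3))
R_est, t_est = irls_register(P, Q)
print(np.allclose(R_est, R_true, atol=1e-3), np.allclose(t_est, t_true, atol=1e-3))
```

退火初期损失近似二次,避免陷入坏的局部极小;后期逐渐恢复强非凸的鲁棒形状,把离群点的权重压到接近零。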
zh
[CV-200] SIRE: SE(3) Intrinsic Rigidity Embeddings
【速读】:该论文试图解决从随手拍摄的日常(casual)场景视频中发现物体运动和重建动态场景的问题。解决方案的关键在于引入了一种名为SIRE(Self-Supervised Intrinsic Rigidity Embeddings)的自监督方法,通过从视频中学习内在刚性嵌入来实现这一目标。SIRE采用端到端可微的方式训练图像编码器以估计场景的刚性和几何结构,并通过简单的四维重建损失函数进行监督:利用估计的几何和刚性,将二维点轨迹提升到SE(3)轨迹,再重新投影回二维并与原始轨迹对比以获得监督信号。这种方法不仅能够在视频数据集上学习通用的图像先验,还能在单个视频上捕捉特定场景的结构,展示了极高的数据效率。
链接: https://arxiv.org/abs/2503.07739
作者: Cameron Smith,Basile Van Hoorick,Vitor Guizilini,Yue Wang
机构: University of Southern California (南加州大学); Toyota Research Institute (丰田研究所); NVIDIA Research (英伟达研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure - highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.
zh
[CV-201] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
【速读】:该论文试图解决现有图像生成模型在处理多语言文本提示(尤其是中文)时存在的模型偏差、有限的文字渲染能力以及对中国文化细微差别的理解不足等问题。解决方案的关键在于提出Seedream 2.0,这是一个以双语(中文-英文)为基础的图像生成基础模型,其核心优势包括:1)开发了一个强大的数据系统以促进知识集成,并结合一个平衡描述准确性与丰富性的字幕系统;2)引入自研的双语大语言模型作为文本编码器,直接从海量数据中学习本地化知识,从而生成具有高保真度且符合文化背景和美学表达的图像;3)采用Glyph-Aligned ByT5实现灵活的字符级文字渲染,同时利用Scaled ROPE增强分辨率泛化能力;4)通过多阶段的后训练优化(如SFT和RLHF迭代),进一步提升整体性能。此外,Seedream 2.0经过多次强化学习人类反馈(RLHF)迭代调整,使其输出高度符合人类偏好,并具备良好的指令驱动图像编辑能力。
链接: https://arxiv.org/abs/2503.07703
作者: Lixue Gong,Xiaoxia Hou,Fanshi Li,Liang Li,Xiaochen Lian,Fei Liu,Liyang Liu,Wei Liu,Wei Lu,Yichun Shi,Shiqi Sun,Yu Tian,Zhi Tian,Peng Wang,Xun Wang,Ye Wang,Guofeng Wu,Jie Wu,Xin Xia,Xuefeng Xiao,Linjie Yang,Zhonghua Zhai,Xinyu Zhang,Qi Zhang,Yuwei Zhang,Shijia Zhao,Jianchao Yang,Weilin Huang
机构: Seed Vision Team (Seed Vision 团队); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Official Page: this https URL
点击查看摘要
Abstract:Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
zh
[CV-202] RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories CVPR2025
【速读】:该论文旨在解决扩散模型(Diffusion Models)在实际应用中生成速度慢的关键挑战,同时避免现有加速方法因减少步骤而导致的样本质量下降、可控性减弱或训练复杂度增加的问题。论文提出的解决方案核心在于引入RayFlow这一新颖的扩散框架,通过引导每个样本沿着一条独特的路径向特定实例的目标分布(instance-specific target distribution)收敛,从而在大幅减少采样步数的同时保持生成多样性与稳定性。此外,论文还提出了Time Sampler,这是一种重要性采样技术,用于通过聚焦关键时间步(crucial timesteps)来提升训练效率。这些创新点共同构成了RayFlow的核心优势。
链接: https://arxiv.org/abs/2503.07699
作者: Huiyang Shao,Xin Xia,Yuhong Yang,Yuxi Ren,Xing Wang,Xuefeng Xiao
机构: ByteDance Inc. (字节跳动)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 5 figures, CVPR 2025
点击查看摘要
Abstract:Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow’s superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.
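Time Sampler 的思想是对扩散时间步做重要性采样,把训练算力集中到关键时间步上。下面是一个按各时间步损失估计采样、并用重要性权重保持目标无偏的最小示意(以损失作为重要性度量属于本文假设,并非论文官方定义):

```python
import numpy as np

def time_sampler(step_losses, batch, rng):
    # step_losses: (T,) 各时间步损失的运行估计,越大越"关键"
    # 返回采样到的时间步及对应的无偏校正权重 1/(T * p_t)
    p = step_losses / step_losses.sum()
    t = rng.choice(len(step_losses), size=batch, p=p)
    weights = 1.0 / (len(step_losses) * p[t])
    return t, weights

rng = np.random.default_rng(0)
losses = np.array([0.1, 0.1, 0.1, 0.7])    # 假设时间步 3 最关键
t, w = time_sampler(losses, batch=1000, rng=rng)
print(np.bincount(t, minlength=4))          # 时间步 3 约被抽中 700 次
```

训练时以权重 w 重新缩放每个样本的损失,可在偏向关键时间步的同时保持期望损失与均匀采样一致。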
zh
[CV-203] Data Foundations for Large Scale Multimodal Clinical Foundation Models
【速读】:该论文试图解决现有临床人工智能(Clinical AI)基准和模型主要局限于少数模态和任务的问题,这限制了大规模多模态方法的发展,而这些方法能够全面评估患者的健康和福祉。论文的关键解决方案在于引入了Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB),这是一个统一多种临床数据模态(包括成像、语言、时间序列和图结构数据)的综合基准。CLIMB包含来自2D图像、3D视频、时间序列、图数据以及多模态数据的总计451万患者样本,容量达19.01太字节。通过广泛的实证评估,研究发现多任务预训练显著提升了在未充分研究领域的性能,并在超声分析和心电图(ECG)分析中分别实现了高达29%和23%的改进。此外,基于CLIMB的预训练有效提高了模型对新任务的泛化能力,且当与适当的融合策略结合时,单模态编码器的优秀表现可以很好地转化为多模态性能。这些发现为设计新的架构和预训练策略以推动临床AI研究奠定了基础。
链接: https://arxiv.org/abs/2503.07667
作者: Wei Dai,Peilin Chen,Malinda Lu,Daniel Li,Haowen Wei,Hejie Cui,Paul Pu Liang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Recent advances in clinical AI have enabled remarkable progress across many clinical domains. However, existing benchmarks and models are primarily limited to a small set of modalities and tasks, which hinders the development of large-scale multimodal methods that can make holistic assessments of patient health and well-being. To bridge this gap, we introduce Clinical Large-Scale Integrative Multimodal Benchmark (CLIMB), a comprehensive clinical benchmark unifying diverse clinical data across imaging, language, temporal, and graph modalities. CLIMB comprises 4.51 million patient samples totaling 19.01 terabytes distributed across 2D imaging, 3D video, time series, graphs, and multimodal data. Through extensive empirical evaluation, we demonstrate that multitask pretraining significantly improves performance on understudied domains, achieving up to 29% improvement in ultrasound and 23% in ECG analysis over single-task learning. Pretraining on CLIMB also effectively improves models’ generalization capability to new tasks, and strong unimodal encoder performance translates well to multimodal performance when paired with task-appropriate fusion strategies. Our findings provide a foundation for new architecture designs and pretraining strategies to advance clinical AI research. Code is released at this https URL.
zh
[CV-204] DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving ICLR2025
【速读】:该论文旨在解决现有端到端自动驾驶(End-to-end Autonomous Driving, E2E-AD)方法中存在的累积误差、训练不稳定以及任务间协同利用不足的问题。同时,针对现有方法采用密集鸟瞰图(BEV)表示带来的长距离感知和长时间时序融合计算挑战,提出了解决方案。论文的关键创新在于设计了一个名为DriveTransformer的简化E2E-AD框架,其核心在于三个关键特性:任务并行性(所有代理、地图和规划查询在每个模块中直接相互交互)、稀疏表示(任务查询直接与原始传感器特征交互)、流式处理(任务查询作为历史信息存储和传递)。通过引入任务自注意力、传感器交叉注意力和时间交叉注意力这三个统一操作,大幅降低了系统复杂度,并显著提升了训练稳定性及性能,在Bench2Drive和nuScenes等基准测试中实现了最先进的表现。
链接: https://arxiv.org/abs/2503.07656
作者: Xiaosong Jia,Junqi You,Zhiyuan Zhang,Junchi Yan
机构: Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学计算机科学学院与人工智能学院); Shanghai AI Lab (上海人工智能实验室)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICLR2025
点击查看摘要
Abstract:End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of perception-prediction-planning, which leads to cumulative errors and training instability. The manual ordering of tasks also limits the system's ability to leverage synergies between tasks (for example, planning-aware perception and game-theoretic interactive prediction and planning). Moreover, the dense BEV representation adopted by existing methods brings computational challenges for long-range perception and long-term temporal fusion. To address these challenges, we present DriveTransformer, a simplified E2E-AD framework for the ease of scaling up, characterized by three key features: Task Parallelism (all agent, map, and planning queries directly interact with each other at each block), Sparse Representation (task queries directly interact with raw sensor features), and Streaming Processing (task queries are stored and passed as history information). As a result, the new framework is composed of three unified operations: task self-attention, sensor cross-attention, and temporal cross-attention, which significantly reduces the complexity of the system and leads to better training stability. DriveTransformer achieves state-of-the-art performance in both the simulated closed-loop benchmark Bench2Drive and the real-world open-loop benchmark nuScenes with high FPS.
zh
[CV-205] A Case Study of Counting the Number of Unique Users in Linear and Non-Linear Trails – A Multi-Agent System Approach
【速读】:该论文旨在解决传统公园使用分析方法因依赖单点传感器而无法区分唯一用户,导致在人力和成本上的局限性问题。为应对这一挑战,论文提出了一种基于低成本摄像头分布式网络的多智能体系统,通过视频数据的自动处理与分析来跟踪和识别唯一用户。关键在于利用现有算法从视频数据中提取包括速度、方向、活动类型、服装颜色及性别等用户属性,并通过跨摄像头协作构建运动轨迹以准确统计唯一访客数量。该方案通过与人工计数对比以及模拟场景验证,实现了72%的独特用户识别成功率,证明了其在自动化公园活动监测中的潜力。尽管存在摄像机布局及环境因素等挑战,但研究结果表明,该系统提供了一个可扩展且经济高效的实时公园使用分析与访客行为追踪解决方案。
链接: https://arxiv.org/abs/2503.07651
作者: Tanvir Rahman
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Parks play a crucial role in enhancing the quality of life by providing recreational spaces and environmental benefits. Understanding the patterns of park usage, including the number of visitors and their activities, is essential for effective security measures, infrastructure maintenance, and resource allocation. Traditional methods rely on single-entry sensors that count total visits but fail to distinguish unique users, limiting their effectiveness due to manpower and cost constraints. With advancements in affordable video surveillance and networked processing, more comprehensive park usage analysis is now feasible. This study proposes a multi-agent system leveraging low-cost cameras in a distributed network to track and analyze unique users. As a case study, we deployed this system at the Jack A. Markell (JAM) Trail in Wilmington, Delaware, and Hall Trail in Newark, Delaware. The system captures video data, autonomously processes it using existing algorithms, and extracts user attributes such as speed, direction, activity type, clothing color, and gender. These attributes are shared across cameras to construct movement trails and accurately count unique visitors. Our approach was validated through comparison with manual human counts and simulated scenarios under various conditions. The results demonstrate a 72% success rate in identifying unique users, setting a benchmark in automated park activity monitoring. Despite challenges such as camera placement and environmental factors, our findings suggest that this system offers a scalable, cost-effective solution for real-time park usage analysis and visitor behavior tracking.
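跨摄像头统计唯一访客的关键一步,是在属性向量(速度、方向、衣着颜色等的编码)之间做相似度匹配与去重。下面是一个贪心余弦相似度去重的最小示意(阈值与属性编码方式均为演示假设,并非论文系统的实际实现):

```python
import numpy as np

def count_unique(detections, threshold=0.9):
    # detections: (camera_id, 属性向量) 列表;
    # 不同摄像头的两次检测若余弦相似度超过阈值,则视为同一用户
    users = []   # 每个唯一用户保留一个代表向量
    for cam, vec in detections:
        v = vec / np.linalg.norm(v0 := np.asarray(vec, dtype=float)) if False else np.asarray(vec, dtype=float)
        v = v / np.linalg.norm(v)
        if not any(float(v @ u) > threshold for u in users):
            users.append(v)
    return len(users)

dets = [
    (0, np.array([1.0, 0.0, 0.2])),     # 摄像头 0 看到用户 A
    (1, np.array([0.98, 0.02, 0.21])),  # 摄像头 1 看到同一用户 A
    (1, np.array([0.0, 1.0, 0.5])),     # 另一位用户 B
]
print(count_unique(dets))   # 2 位唯一用户
```

真实系统还需结合时间戳与摄像头拓扑构建运动轨迹;这里仅展示属性匹配去重这一核心思想。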
zh
[CV-206] COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
【速读】:该论文旨在解决基于自监督方式将视频模态中的丰富语义知识迁移到惯性测量单元(IMU)模态的问题,以克服IMU在大规模标注数据不足及泛化能力较弱的局限性。论文的关键解决方案是提出了一种名为COMODO的跨模态自监督蒸馏框架,通过利用预训练且冻结的视频编码器构建动态实例队列,并对视频和IMU嵌入特征分布进行对齐,从而实现无需标注的情况下从视频模态向IMU模态的知识迁移。这种设计使IMU编码器能够继承视频模态中的丰富语义信息,同时保持其在实际应用中的高效性。
链接: https://arxiv.org/abs/2503.07259
作者: Baiyu Chen,Wilson Wongso,Zechen Li,Yonchanok Khaokaew,Hao Xue,Flora Salim
机构: The University of New South Wales (新南威尔士大学), Sydney, NSW, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at this https URL .
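COMODO 的核心是用冻结视频教师在实例队列上产生的相似度分布来监督 IMU 学生编码器。以下 numpy 草图以队列上软分布之间的交叉熵近似这一蒸馏目标;温度参数与队列均为假设值,仅作示意:

```python
import numpy as np

def softmax(x, tau=0.07):
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def distill_loss(video_emb, imu_emb, queue, tau=0.07):
    """冻结视频教师在队列上的相似度分布 t 监督 IMU 学生的分布 s,
    以交叉熵 H(t, s) 作为蒸馏损失(当 s == t 时取得最小值)。"""
    t = softmax(queue @ video_emb, tau)   # 教师分布(无梯度)
    s = softmax(queue @ imu_emb, tau)     # 学生分布
    return float(-(t * np.log(s + 1e-12)).sum())
```

由 Gibbs 不等式可知,学生嵌入与教师嵌入一致时损失最小,这正是"对齐两种模态特征分布"的直观体现。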
zh
[CV-207] Generative Video Bi-flow
【速读】:该论文旨在解决无条件生成高质量视频的问题,同时减少生成过程中的累积误差。传统方法通过将噪声映射到新帧,计算成本较高且容易产生漂移误差。为应对这些问题,论文提出了一种基于神经常微分方程(ODE)流的生成式视频模型,其关键在于引入双线性目标:首先直接从过去映射到未来的视频帧,避免了噪声映射带来的高计算开销;其次,在训练过程中通过添加噪声联合学习去除累积误差。实验表明,该方法在多种视频数据集上的生成质量与条件扩散模型的基线相当,但速度更快,即所需的ODE求解器步数更少。
链接: https://arxiv.org/abs/2503.06364
作者: Chen Liu,Tobias Ritschel
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We propose a novel generative video model by robustly learning temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective of combining two aspects: The first is to map from the past into future video frames directly. Previous work has mapped the noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame, instead of noise, is more prone to drifting errors. Hence, second, we additionally learn how to remove the accumulated errors as the joint objective by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a baseline conditional diffusion but with higher speed, i.e., fewer ODE solver steps.
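该方法从上一帧(而非噪声)出发,用少量 ODE 求解步推进到下一帧,并在训练时对输入注入噪声以学习消除累积漂移。以下为示意性的 Euler 推理循环(velocity_fn 为假设的已学习速度场,非论文实现):

```python
import numpy as np

def rollout(velocity_fn, frame, n_frames, solver_steps=4, train_noise=0.0,
            rng=None):
    """从上一帧(而非噪声)出发,用少量 Euler 步积分学到的 ODE 得到下一帧;
    train_noise > 0 时对输入加噪,使模型学会消除累积漂移(训练时使用)。"""
    if rng is None:
        rng = np.random.default_rng(0)
    frames = [frame]
    for _ in range(n_frames):
        x = frame + train_noise * rng.standard_normal(frame.shape)
        for _ in range(solver_steps):
            x = x + velocity_fn(x) / solver_steps  # Euler 步
        frame = x
        frames.append(frame)
    return frames
```

推理时 train_noise 取 0,即流式地逐帧积分;相比从噪声出发的条件扩散,所需求解步数更少。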
zh
[CV-208] Adversarial Dependence Minimization
【速读】:该论文致力于解决现有机器学习方法在从数据中提取最小冗余表示时未能完全消除特征维度之间依赖性的问题,特别是线性不相关变量仍可能存在的非线性关系。论文的关键解决方案是一种可微且可扩展的依赖性最小化算法,该算法超越了传统的两两线性去相关方法。具体而言,该方法通过一个对抗博弈实现:小型网络用于识别特征维度之间的依赖性,而编码器则利用这些信息来减少依赖性。这一机制确保了算法不仅能够处理线性关系,还能有效应对非线性依赖,从而提供更鲁棒的表示学习能力。
链接: https://arxiv.org/abs/2502.03227
作者: Pierre-François De Plaen,Tinne Tuytelaars,Marc Proesmans,Luc Van Gool
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Many machine learning techniques rely on minimizing the covariance between output feature dimensions to extract minimally redundant representations from data. However, these methods do not eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. This work provides a differentiable and scalable algorithm for dependence minimization that goes beyond linear pairwise decorrelation. Our method employs an adversarial game where small networks identify dependencies among feature dimensions, while the encoder exploits this information to reduce dependencies. We provide empirical evidence of the algorithm’s convergence and demonstrate its utility in three applications: extending PCA to nonlinear decorrelation, improving the generalization of image classification methods, and preventing dimensional collapse in self-supervised representation learning.
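论文用小网络识别特征维度间的依赖并让编码器消除它。以下草图用线性最小二乘探针近似这一"对抗方",度量每个维度可被其余维度预测的程度;论文中的对抗网络还能捕获非线性依赖,此处仅为线性示意:

```python
import numpy as np

def dependence_score(Z):
    """线性探针近似的依赖程度:对每个维度,用其余维度做最小二乘回归,
    取平均解释方差(R^2)。独立特征得分接近 0,冗余特征接近 1。"""
    n, d = Z.shape
    Z = Z - Z.mean(axis=0)
    score = 0.0
    for j in range(d):
        rest = np.delete(Z, j, axis=1)
        w, *_ = np.linalg.lstsq(rest, Z[:, j], rcond=None)
        resid = Z[:, j] - rest @ w
        score += 1.0 - np.var(resid) / (np.var(Z[:, j]) + 1e-12)
    return score / d
```

在对抗训练中,该得分由小网络最大化(识别依赖),编码器则将其最小化(消除依赖)。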
zh
[CV-209] Rethinking Diffusion Model in High Dimension
【速读】:该论文试图解决的问题是:尽管扩散模型(Diffusion Models)在高维数据生成任务中表现出色,但其是否真正克服了维度灾难(Curse of Dimensionality),以及其工作原理是否如假设所述——通过学习潜在概率分布的统计特性以实现样本采样。论文的关键解决方案在于通过对扩散模型的目标函数和推理方法进行详细分析,提出了一个简单统一的框架,揭示了目标函数拟合在高维稀疏场景下的本质变化,并基于此框架发现了更高效的推理方法。这一框架不仅简化了对扩散模型工作方式的理解,还避免了传统方法中依赖马尔可夫链(Markov Chains)和随机微分方程(SDEs)等统计概念的需求。
链接: https://arxiv.org/abs/2503.08643
作者: Zhenxin Zheng,Zhenjie Zheng
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Curse of Dimensionality is an unavoidable challenge in statistical probability models, yet diffusion models seem to overcome this limitation, achieving impressive results in high-dimensional data generation. Diffusion models assume that they can learn the statistical properties of the underlying probability distribution, enabling sampling from this distribution to generate realistic samples. But is this really how they work? To address this question, this paper conducts a detailed analysis of the objective function and inference methods of diffusion models, leading to several important conclusions that help answer the above question: 1) In high-dimensional sparse scenarios, the target of the objective function fitting degrades from a weighted sum of multiple samples to a single sample. 2) The mainstream inference methods can all be represented within a simple unified framework, without requiring statistical concepts such as Markov chains and SDEs. 3) Guided by this simple framework, more efficient inference methods can be discovered.
zh
[CV-210] Vision Transformer for Intracranial Hemorrhage Classification in CT Scans Using an Entropy-Aware Fuzzy Integral Strategy for Adaptive Scan-Level Decision Fusion
【速读】:该论文旨在解决颅内出血(Intracranial Hemorrhage, ICH)亚型分类这一临床挑战,以实现更准确和及时的诊断,从而支持有效的临床决策。论文的关键创新在于提出了一种基于先进金字塔视觉Transformer(Pyramid Vision Transformer, PVT)的模型,通过其分层注意力机制捕获脑部CT扫描中的局部和全局空间依赖性。此外,采用SHAP驱动的特征选择方法识别最具判别力的特征成分,并构建潜在特征空间以训练增强神经网络,从而降低计算复杂度。同时,引入基于熵感知的信息聚合策略及模糊积分算子,融合多层CT切片信息,考虑切片间依赖关系,确保更全面和可靠的扫描级诊断。实验结果表明,该PVT基框架在分类准确性、精确性和鲁棒性方面显著优于现有深度学习架构。关键解决方案在于结合SHAP驱动的特征选择、基于Transformer的建模以及熵感知模糊积分算子用于决策融合。
链接: https://arxiv.org/abs/2503.08609
作者: Mehdi Hosseini Chagahi,Niloufar Delfan,Behzad Moshiri,Md. Jalil Piran,Jaber Hatam Parikhan
机构: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran (伊朗德黑兰大学工程学院电气与计算机工程学院); Department of Electrical and Computer Engineering University of Waterloo, Waterloo, Canada (加拿大滑铁卢大学电气与计算机工程系); Department of Computer Science and Engineering, Sejong University, Seoul 05006, South Korea (韩国首尔世宗大学计算机科学与工程系)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Intracranial hemorrhage (ICH) is a critical medical emergency caused by the rupture of cerebral blood vessels, leading to internal bleeding within the skull. Accurate and timely classification of hemorrhage subtypes is essential for effective clinical decision-making. To address this challenge, we propose an advanced pyramid vision transformer (PVT)-based model, leveraging its hierarchical attention mechanisms to capture both local and global spatial dependencies in brain CT scans. Instead of processing all extracted features indiscriminately, A SHAP-based feature selection method is employed to identify the most discriminative components, which are then used as a latent feature space to train a boosting neural network, reducing computational complexity. We introduce an entropy-aware aggregation strategy along with a fuzzy integral operator to fuse information across multiple CT slices, ensuring a more comprehensive and reliable scan-level diagnosis by accounting for inter-slice dependencies. Experimental results show that our PVT-based framework significantly outperforms state-of-the-art deep learning architectures in terms of classification accuracy, precision, and robustness. By combining SHAP-driven feature selection, transformer-based modeling, and an entropy-aware fuzzy integral operator for decision fusion, our method offers a scalable and computationally efficient AI-driven solution for automated ICH subtype classification.
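论文用熵感知聚合与模糊积分算子融合多层 CT 切片的预测。以下草图以熵的倒数作为切片权重来近似这一思想(论文实际使用模糊积分算子并考虑切片间依赖,此处为简化示意):

```python
import numpy as np

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def entropy_aware_fuse(slice_probs):
    """熵感知切片融合:高熵(不确定)切片权重更低,加权平均得到
    扫描级类别概率。论文实际使用模糊积分算子,此处为简化近似。"""
    slice_probs = np.asarray(slice_probs, dtype=float)
    w = np.array([1.0 / (1.0 + entropy(p)) for p in slice_probs])
    w = w / w.sum()
    return (w[:, None] * slice_probs).sum(axis=0)
```

例如一张高置信切片 [0.9, 0.1] 与一张不确定切片 [0.5, 0.5] 融合后,结果会比简单平均更偏向高置信切片。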
zh
[CV-211] Posterior-Mean Denoising Diffusion Model for Realistic PET Image Reconstruction
【速读】:该论文旨在解决基于深度学习的正电子发射断层扫描(PET)图像重建中感知-失真权衡的问题。传统的回归模型通常生成过度平滑且细节不足的图像(低失真但低感知质量),而基于生成对抗网络(GAN)或似然后验采样的方法则容易引入不理想的伪影(高失真但高感知质量),限制了其临床应用价值。为实现更优的感知-失真平衡,论文提出了一种名为后验均值去噪扩散模型(PMDM-PET)的新方法。其关键是基于最近建立的数学理论,在扩散模型空间中探索感知-失真函数的闭式表达,通过首先在最小均方误差(MSE)下获得后验均值的PET预测,然后将这些预测的分布最优传输到真实PET图像的分布,从而生成具有最小失真和最佳感知质量的现实PET图像,并在定性和定量评估中超越了五种最新的先进深度学习基准模型。
链接: https://arxiv.org/abs/2503.08546
作者: Yiran Sun,Osama Mawlawi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures
点击查看摘要
Abstract:Positron Emission Tomography (PET) is a functional imaging modality that enables the visualization of biochemical and physiological processes across various tissues. Recently, deep learning (DL)-based methods have demonstrated significant progress in directly mapping sinograms to PET images. However, regression-based DL models often yield overly smoothed reconstructions lacking details (i.e., low distortion, low perceptual quality), whereas GAN-based and likelihood-based posterior sampling models tend to introduce undesirable artifacts in predictions (i.e., high distortion, high perceptual quality), limiting their clinical applicability. To achieve a robust perception-distortion tradeoff, we propose Posterior-Mean Denoising Diffusion Model (PMDM-PET), a novel approach that builds upon a recently established mathematical theory to explore the closed-form expression of the perception-distortion function in diffusion model space for PET image reconstruction from sinograms. Specifically, PMDM-PET first obtains posterior-mean PET predictions under minimum mean square error (MSE), then optimally transports their distribution to the ground-truth PET image distribution. Experimental results demonstrate that PMDM-PET not only generates realistic PET images with minimal distortion and optimal perceptual quality but also outperforms five recent state-of-the-art (SOTA) DL baselines in both qualitative visual inspection and quantitative pixel-wise metrics PSNR (dB)/SSIM/NRMSE.
zh
[CV-212] 3D Medical Imaging Segmentation on Non-Contrast CT
【速读】:该论文旨在解决非对比增强 CT 图像分割(Non-Contrast CT Image Segmentation)问题,强调其在计算机视觉中的重要性及现有方法的局限性。论文的关键在于探索和验证一种先进的分割方法,其中 nnUNet 被认为是当前最优秀的方法,尤其在多种分割任务中表现出色。解决方案的关键在于引入全局上下文建模(Global Context Modeling),以提升语义标注(Semantic Labeling)和掩膜生成(Mask Generation)的准确性。此外,论文还提出未来研究方向,包括处理类别不平衡问题(Long-Tail Problem)、利用预训练模型以及探索自监督或对比学习技术。
链接: https://arxiv.org/abs/2503.08361
作者: Canxuan Gang,Yuhan Peng
机构: AI Geeks (AI极客社)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: tech report
点击查看摘要
Abstract:This technical report analyzes non-contrast CT image segmentation in computer vision. It revisits a proposed method, examines the background of non-contrast CT imaging, and highlights the significance of segmentation. The study reviews representative methods, including convolutional-based and CNN-Transformer hybrid approaches, discussing their contributions, advantages, and limitations. The nnUNet stands out as the state-of-the-art method across various segmentation tasks. The report explores the relationship between the proposed method and existing approaches, emphasizing the role of global context modeling in semantic labeling and mask generation. Future directions include addressing the long-tail problem, utilizing pre-trained models for medical imaging, and exploring self-supervised or contrastive pre-training techniques. This report offers insights into non-contrast CT image segmentation and potential advancements in the field.
zh
[CV-213] Physics-based AI methodology for Material Parameter Extraction from Optical Data
【速读】:该论文旨在解决从光谱光学数据中提取材料参数的问题。解决方案的关键在于提出了一种基于物理学的神经网络方法,通过将经典优化框架与多尺度目标检测框架相结合,探索将物理知识融入神经网络的效果。研究在太赫兹和红外频率下的模拟透射光谱上验证了该方法的性能,其设计旨在实现自主性、鲁棒性和时间效率,使其特别适用于工业和社会应用。
链接: https://arxiv.org/abs/2503.08183
作者: M. Koumans,J.L.M. van Mechelen(Eindhoven University of Technology)
机构: Electrical Engineering, Eindhoven University of Technology (埃因霍温理工大学)
类目: Computational Physics (physics.comp-ph); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: Submitted for IRMMW-THz 2025 conference proceedings
点击查看摘要
Abstract:We report on a novel methodology for extracting material parameters from spectroscopic optical data using a physics-based neural network. The proposed model integrates classical optimization frameworks with a multi-scale object detection framework, specifically exploring the effect of incorporating physics into the neural network. We validate and analyze its performance on simulated transmission spectra at terahertz and infrared frequencies. Compared to traditional model-based approaches, our method is designed to be autonomous, robust, and time-efficient, making it particularly relevant for industrial and societal applications.
zh
[CV-214] Denoising via Repainting: an image denoising method using layer wise medical image repainting
【速读】:该论文致力于解决医学图像去噪问题,以提高临床诊断的可靠性并支持后续基于图像的任务。论文的关键解决方案在于提出了一种多尺度方法,将各向异性高斯滤波与渐进式贝塞尔路径重绘相结合。该方法通过构建尺度空间金字塔,在减少噪声的同时保留重要的结构细节。其关键是利用粗到细的迭代优化框架,首先在最粗糙尺度下分割并重绘部分去噪后的图像组件,随后在更精细尺度上逐步重建小而复杂的结构,同时保持大范围同质区域的平滑性。此外,通过引入均方误差和自交约束来确保路径优化过程中的形状一致性。实验结果表明,该方法在多个MRI数据集上的峰值信噪比(PSNR)和结构相似性指数(SSIM)方面优于对比方法,展现出强大的跨域去噪能力和临床应用潜力。
链接: https://arxiv.org/abs/2503.08094
作者: Arghya Pal,Sailaja Rajanala,CheeMing Ting,Raphael Phan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Medical image denoising is essential for improving the reliability of clinical diagnosis and guiding subsequent image-based tasks. In this paper, we propose a multi-scale approach that integrates anisotropic Gaussian filtering with progressive Bezier-path redrawing. Our method constructs a scale-space pyramid to mitigate noise while preserving critical structural details. Starting at the coarsest scale, we segment partially denoised images into coherent components and redraw each using a parametric Bezier path with representative color. Through iterative refinements at finer scales, small and intricate structures are accurately reconstructed, while large homogeneous regions remain robustly smoothed. We employ both mean square error and self-intersection constraints to maintain shape coherence during path optimization. Empirical results on multiple MRI datasets demonstrate consistent improvements in PSNR and SSIM over competing methods. This coarse-to-fine framework offers a robust, data-efficient solution for cross-domain denoising, reinforcing its potential clinical utility and versatility. Future work extends this technique to three-dimensional data.
zh
[CV-215] Deep Perceptual Enhancement for Medical Image Analysis ALT
【速读】:该论文旨在解决因硬件局限性导致的医学图像低质量(如对比度低、亮度不适、噪声等)问题,这些问题直接影响诊断过程并增加医生决策的复杂性。论文提出了一种基于端到端学习策略的方法,通过完全卷积深度网络全面解决感知增强问题,包括对比度校正、亮度校正和去噪等。解决方案的关键在于利用残差块和残差门控机制减少视觉伪影,并通过多目标函数引导以实现感知上合理的增强效果。实验结果表明,该方法在峰值信噪比(PSNR)和DeltaE指标上分别优于现有方法5.00-7.00 dB和4.00-6.00,显著提升了医学图像分析任务的性能并在实际应用中展现出潜力。
链接: https://arxiv.org/abs/2503.08027
作者: S M A Sharif,Rizwan Ali Naqvi,Mithun Biswas,Woong-Kee Loh
机构: rigel-IT (rigel-IT); Department of Unmanned Vehicle Engineering, Sejong University (无人飞行器工程系, 圣洁大学); School of Computing, Gachon University (计算科学学院, 果川大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Journal of Biomedical and Health Informatics, 2022
点击查看摘要
Abstract:Due to numerous hardware shortcomings, medical image acquisition devices are susceptible to producing low-quality (i.e., low contrast, inappropriate brightness, noisy, etc.) images. Regrettably, perceptually degraded images directly impact the diagnosis process and make the decision-making manoeuvre of medical practitioners notably complicated. This study proposes to enhance such low-quality images by incorporating end-to-end learning strategies for accelerating medical image analysis tasks. To the best concern, this is the first work in medical imaging which comprehensively tackles perceptual enhancement, including contrast correction, luminance correction, denoising, etc., with a fully convolutional deep network. The proposed network leverages residual blocks and a residual gating mechanism for diminishing visual artefacts and is guided by a multi-term objective function to perceive the perceptually plausible enhanced images. The practicability of the deep medical image enhancement method has been extensively investigated with sophisticated experiments. The experimental outcomes illustrate that the proposed method could outperform the existing enhancement methods for different medical image modalities by 5.00 to 7.00 dB in peak signal-to-noise ratio (PSNR) metrics and 4.00 to 6.00 in DeltaE metrics. Additionally, the proposed method can drastically improve the medical image analysis tasks’ performance and reveal the potentiality of such an enhancement method in real-world applications. Code Available: this https URL
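论文利用残差块与残差门控机制来抑制增强过程中的视觉伪影。以下为门控残差块的 numpy 示意(权重为任意假设值,非论文网络结构):

```python
import numpy as np

def gated_residual_block(x, W, Wg):
    """门控残差块:sigmoid 门决定有多少变换后的特征被加回输入,
    有助于抑制增强过程中的视觉伪影。权重仅为示意。"""
    h = np.tanh(W @ x)                      # 特征变换
    g = 1.0 / (1.0 + np.exp(-(Wg @ x)))    # 门控,取值 (0, 1)
    return x + g * h
```

门控接近 0 时该块近似恒等映射,即网络可以学会"在不需要修正的位置不做修改",这正是抑制伪影的直觉来源。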
zh
[CV-216] Whiteness-based bilevel estimation of weighted TV parameter maps for image denoising
【速读】:该论文致力于解决通过加性白高斯噪声(Additive White Gaussian Noise, AWGN)污染的图像去噪问题,并提出了一种基于归一化残差白化损失(normalised residual whiteness loss)的双层优化策略,用于估计加权总变分(Weighted Total Variation, WTV)参数图。与依赖于参考数据(supervised)或部分参考数据(semi-supervised)以及噪声强度信息的监督或半监督方法不同,该方法完全无监督(fully unsupervised)。其关键在于采用早停策略(early stopping)以避免过拟合噪声,该策略基于自然图像集合上的最优性能统计量。数值结果展示了所提方法在标量和像素相关参数图下的性能,并与监督及无监督方法进行了对比。
链接: https://arxiv.org/abs/2503.07814
作者: Monica Pragliola,Luca Calatroni,Alessandro Lanza
机构: 未知
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We consider a bilevel optimisation strategy based on normalised residual whiteness loss for estimating the weighted total variation parameter maps for denoising images corrupted by additive white Gaussian noise. Compared to supervised and semi-supervised approaches relying on prior knowledge of (approximate) reference data and/or information on the noise magnitude, the proposal is fully unsupervised. To avoid noise overfitting, an early stopping strategy is used, relying on simple statistics of optimal performances on a set of natural images. Numerical results comparing the supervised/unsupervised procedures for scalar/pixel-dependent parameter maps are shown.
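归一化残差白化损失的直觉是:理想的去噪残差应为白噪声,其非零滞后自相关接近零。以下草图用 FFT 计算归一化循环自相关并取非零滞后的均方(一维示意,论文处理的是二维图像):

```python
import numpy as np

def whiteness_loss(residual):
    """归一化残差白化:经 FFT 计算循环自相关,归一化到零滞后为 1,
    取所有非零滞后的均方。白噪声接近 0,含结构的残差明显更大。"""
    r = residual - residual.mean()
    ac = np.fft.ifft(np.abs(np.fft.fft(r)) ** 2).real
    ac = ac / ac[0]
    return float((ac[1:] ** 2).mean())
```

双层优化的外层即最小化该损失来选择参数图:残差越"白",说明去噪越接近只去掉了噪声而没有抹去图像结构。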
zh
[CV-217] AdaptSR: Low-Rank Adaptation for Efficient and Scalable Real-World Super-Resolution
【速读】:该论文旨在解决从低分辨率图像恢复高频细节和纹理的问题,特别是在复杂且未知的真实世界退化场景下,传统超分辨率(Super-Resolution, SR)方法面临挑战。尽管基于生成对抗网络(GAN)的方法能够提升图像的真实性,但其训练过程不稳定且容易引入不自然的伪影;而扩散模型虽然有潜力,却需要极高的计算资源。为克服这些问题,论文提出了一种名为AdaptSR的关键解决方案:一种针对双三次插值(bicubic interpolation)预训练模型的低秩适应(Low-Rank Adaptation, LoRA)框架。AdaptSR通过利用特定架构的见解和选择性层更新,仅需微调轻量级LoRA层即可实现高效的现实世界任务适配,同时保持预训练主干模型不变。这种方法不仅显著降低了内存和计算需求,还使得在轻量化硬件上进行实际应用成为可能。实验结果表明,AdaptSR相比基于GAN和扩散模型的SR方法,在真实SR基准测试中的峰值信噪比(PSNR)提高了多达4 dB,感知评分提升了2%,并且在参数量减少92%的情况下实现了与完整模型微调相当甚至更好的性能,从而能够在几分钟内快速适应实际SR任务。
链接: https://arxiv.org/abs/2503.07748
作者: Cansu Korkmaz,Nancy Mehta,Radu Timofte
机构: Koc University (科奇大学); University of Würzburg (维尔茨堡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages including 3 pages of references, 7 figures and 7 tables
点击查看摘要
Abstract:Recovering high-frequency details and textures from low-resolution images remains a fundamental challenge in super-resolution (SR), especially when real-world degradations are complex and unknown. While GAN-based methods enhance realism, they suffer from training instability and introduce unnatural artifacts. Diffusion models, though promising, demand excessive computational resources, often requiring multiple GPU days, even for single-step variants. Rather than naively fine-tuning entire models or adopting unstable generative approaches, we introduce AdaptSR, a low-rank adaptation (LoRA) framework that efficiently repurposes bicubic-trained SR models for real-world tasks. AdaptSR leverages architecture-specific insights and selective layer updates to optimize real SR adaptation. By updating only lightweight LoRA layers while keeping the pretrained backbone intact, it captures domain-specific adjustments without adding inference cost, as the adapted layers merge seamlessly post-training. This efficient adaptation not only reduces memory and compute requirements but also makes real-world SR feasible on lightweight hardware. Our experiments demonstrate that AdaptSR outperforms GAN and diffusion-based SR methods by up to 4 dB in PSNR and 2% in perceptual scores on real SR benchmarks. More impressively, it matches or exceeds full model fine-tuning while training 92% fewer parameters, enabling rapid adaptation to real SR tasks within minutes.
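AdaptSR 只训练轻量 LoRA 层,训练后将低秩更新合并回冻结主干,因此推理开销不变。以下为 LoRA 线性层的 numpy 示意(秩与缩放系数为常见默认值,非论文具体配置):

```python
import numpy as np

class LoRALinear:
    """冻结权重 W 加可训练低秩更新 (alpha/r)·BA;训练后可将更新
    合并回 W,推理开销与原模型相同。B 零初始化使训练从恒等修正开始。"""
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                     # 冻结的主干权重
        self.A = rng.standard_normal((rank, W.shape[1])) * 0.01
        self.B = np.zeros((W.shape[0], rank))          # 零初始化
        self.scale = alpha / rank

    def forward(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        return self.W + self.scale * self.B @ self.A  # 合并后无额外开销
```

merge() 返回的单个权重矩阵与训练时的 forward 输出完全一致,这就是"适配层训练后无缝合并、不增加推理成本"的含义。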
zh
人工智能
[AI-0] A Grid Cell-Inspired Structured Vector Algebra for Cognitive Maps
链接: https://arxiv.org/abs/2503.08608
作者: Sven Krausse,Emre Neftci,Friedrich T. Sommer,Alpha Renner
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: 10 pages, 5 figures, accepted at the 2025 Neuro Inspired Computational Elements (NICE) conference
点击查看摘要
Abstract:The entorhinal-hippocampal formation is the mammalian brain’s navigation system, encoding both physical and abstract spaces via grid cells. This system is well-studied in neuroscience, and its efficiency and versatility make it attractive for applications in robotics and machine learning. While continuous attractor networks (CANs) successfully model entorhinal grid cells for encoding physical space, integrating both continuous spatial and abstract spatial computations into a unified framework remains challenging. Here, we attempt to bridge this gap by proposing a mechanistic model for versatile information processing in the entorhinal-hippocampal formation inspired by CANs and Vector Symbolic Architectures (VSAs), a neuro-symbolic computing framework. The novel grid-cell VSA (GC-VSA) model employs a spatially structured encoding scheme with 3D neuronal modules mimicking the discrete scales and orientations of grid cell modules, reproducing their characteristic hexagonal receptive fields. In experiments, the model demonstrates versatility in spatial and abstract tasks: (1) accurate path integration for tracking locations, (2) spatio-temporal representation for querying object locations and temporal relations, and (3) symbolic reasoning using family trees as a structured test case for hierarchical relationships.
[AI-1] EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
链接: https://arxiv.org/abs/2503.08604
作者: Dongping Li,Tielong Cai,Tianci Tang,Wenhao Chai,Katherine Rose Driggs-Campbell,Gaoang Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Developing autonomous home robots controlled by natural language has long been a human pursuit. While advancements in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: the lack of a unified benchmark for more complex robot tasks, limited evaluation methods and metrics, and data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we introduce Embodied Mobile Manipulation in Open Environments (EMMOE), which requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. Additionally, we collect EMMOE-100, which features diverse task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms. Finally, we demonstrate HomieBot’s performance and the evaluation of different models and policies.
[AI-2] When Discourse Stalls: Moving Past Five Semantic Stopsigns about Generative AI in Design Research
链接: https://arxiv.org/abs/2503.08565
作者: Willem van der Maden,Vera van der Burg,Brett A. Halperin,Petra Jääskeläinen,Joseph Lindley,Derek Lomas,Timothy Merritt
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This essay examines how Generative AI (GenAI) is rapidly transforming design practices and how discourse often falls into over-simplified narratives that impede meaningful research and practical progress. We identify and deconstruct five prevalent “semantic stopsigns” – reductive framings about GenAI in design that halt deeper inquiry and limit productive engagement. Reflecting upon two expert workshops at ACM conferences and semi-structured interviews with design practitioners, we analyze how these stopsigns manifest in research and practice. Our analysis develops mid-level knowledge that bridges theoretical discourse and practical implementation, helping designers and researchers interrogate common assumptions about GenAI in their own contexts. By recasting these stopsigns into more nuanced frameworks, we provide the design research community with practical approaches for thinking about and working with these emerging technologies.
[AI-3] MoE-Loco: Mixture of Experts for Multitask Locomotion
链接: https://arxiv.org/abs/2503.08564
作者: Runhan Huang,Shaoting Zhu,Yilun Du,Hang Zhao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 10 figures
点击查看摘要
Abstract:We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.
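MoE 框架由门控网络对各专家的输出加权,不同专家在训练中自然分化为不同的步态/地形行为。以下为软门控混合专家前向传播的极简示意(专家与门控权重均为假设,仅演示机制):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, experts, gate_W):
    """软混合专家:门控网络对各专家输出加权求和,并返回门控权重,
    便于观察哪个专家主导当前输入(即专家特化)。"""
    gates = softmax(gate_W @ x)
    outputs = np.stack([f(x) for f in experts])
    return gates @ outputs, gates

# 两个玩具专家;门控依据观测值决定谁主导输出
x = np.array([1.0, 0.0])
experts = [lambda v: v * 0.0, lambda v: v + 1.0]
gate_W = np.array([[10.0, 0.0], [-10.0, 0.0]])
out, gates = moe_forward(x, experts, gate_W)
```

返回的门控权重可用于分析专家特化,这也是论文所说"可用于任务迁移与技能组合"的基础。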
[AI-4] Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
链接: https://arxiv.org/abs/2503.08558
作者: Chen Xu,Tony Khuong Nguyen,Emma Dixon,Christopher Rodriguez,Patrick Miller,Robert Lee,Paarth Shah,Rares Ambrus,Haruki Nishimura,Masha Itkina
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent years have witnessed impressive robotic manipulation systems driven by advances in imitation learning and generative modeling, such as diffusion- and flow-based approaches. As robot policy performance increases, so does the complexity and time horizon of achievable tasks, inducing unexpected and diverse failure modes that are difficult to predict a priori. To enable trustworthy policy deployment in safety-critical human environments, reliable runtime failure detection becomes important during policy inference. However, most existing failure detection approaches rely on prior knowledge of failure modes and require failure data during training, which imposes a significant challenge in practicality and scalability. In response to these limitations, we present FAIL-Detect, a modular two-stage approach for failure detection in imitation learning-based robotic manipulation. To accurately identify failures from successful training data alone, we frame the problem as sequential out-of-distribution (OOD) detection. We first distill policy inputs and outputs into scalar signals that correlate with policy failures and capture epistemic uncertainty. FAIL-Detect then employs conformal prediction (CP) as a versatile framework for uncertainty quantification with statistical guarantees. Empirically, we thoroughly investigate both learned and post-hoc scalar signal candidates on diverse robotic manipulation tasks. Our experiments show learned signals to be mostly consistently effective, particularly when using our novel flow-based density estimator. Furthermore, our method detects failures more accurately and faster than state-of-the-art (SOTA) failure detection baselines. These results highlight the potential of FAIL-Detect to enhance the safety and reliability of imitation learning-based robotic systems as they progress toward real-world deployment.
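FAIL-Detect 用共形预测(CP)从仅含成功轨迹的校准分数中得到带统计保证的失败判定阈值。以下草图实现分裂共形的有限样本修正分位数(校准分数为虚构数据,信号本身在论文中由学习得到):

```python
import numpy as np

def conformal_threshold(calib_scores, alpha=0.05):
    """分裂共形预测阈值:对仅含成功轨迹的校准分数取有限样本修正分位数;
    新的成功轨迹分数以 >= 1 - alpha 的概率低于该阈值。"""
    n = len(calib_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(calib_scores, min(q, 1.0), method="higher"))

def is_failure(score, threshold):
    return score > threshold

calib = np.array([0.11, 0.09, 0.12, 0.10, 0.08, 0.13, 0.10, 0.09,
                  0.11, 0.12, 0.10, 0.09, 0.14, 0.11, 0.10, 0.12,
                  0.09, 0.13, 0.10, 0.11])
thr = conformal_threshold(calib, alpha=0.1)
print(is_failure(0.45, thr))  # True:异常分数被标记为失败
```

这正是"仅用成功数据检测失败"的关键:无需任何失败样本,也能对误报率给出有限样本的统计保证。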
[AI-5] Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLM s
链接: https://arxiv.org/abs/2503.08551
作者: Wanyong Feng,Peter Tran,Stephen Sireci,Andrew Lan
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The difficulty of multiple-choice questions (MCQs) is a crucial factor for educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel, two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict the difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enables us to estimate the likelihood of students selecting each (both correct and incorrect) option. We align these predictions with their ground truth values, using a Kullback-Leibler (KL) divergence-based regularization objective, and use estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world math MCQ and response datasets with ground truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, up to a 28.3% reduction in mean squared error and a 34.6% improvement in the coefficient of determination. We also qualitatively discuss how our novel method results in higher accuracy in predicting MCQ difficulty.
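论文受项目反应理论(IRT)启发,从能力分布采样来估计各选项被选中的可能性。以下为经典 2PL IRT 模型的极简示意,展示"从能力采样得到经验难度"的思路(参数均为假设值):

```python
import math

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL IRT:能力为 theta 的学生答对难度为 difficulty 题目的概率,
    discrimination 控制该概率随能力变化的陡峭程度。"""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def estimated_difficulty(thetas, item_b):
    """从能力分布采样(此处为给定列表),以平均错误率作为经验难度估计。"""
    return 1.0 - sum(p_correct(t, item_b) for t in thetas) / len(thetas)

print(p_correct(0.0, 0.0))  # 0.5:能力恰好等于难度时答对概率为一半
```

论文在此基础上进一步对每个(正确与干扰)选项建模选择概率,并用 KL 正则将预测与真实分布对齐。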
[AI-6] Mellow: a small audio language model for reasoning
链接: https://arxiv.org/abs/2503.08540
作者: Soham Deshmukh,Satvik Dixit,Rita Singh,Bhiksha Raj
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Checkpoint and dataset available at: this https URL
点击查看摘要
Abstract:Multimodal Audio-Language Models (ALMs) can understand and reason over both audio and text. Typically, reasoning performance correlates with model size, with the best results achieved by models exceeding 8 billion parameters. However, no prior work has explored enabling small audio-language models to perform reasoning tasks, despite the potential applications for edge devices. To address this gap, we introduce Mellow, a small Audio-Language Model specifically designed for reasoning. Mellow achieves state-of-the-art performance among existing small audio-language models and surpasses several larger models in reasoning capabilities. For instance, Mellow scores 52.11 on MMAU, comparable to SoTA Qwen2 Audio (which scores 52.5) while using 50 times fewer parameters and being trained on 60 times less data (audio hrs). To train Mellow, we introduce ReasonAQA, a dataset designed to enhance audio-grounded reasoning in models. It consists of a mixture of existing datasets (30% of the data) and synthetically generated data (70%). The synthetic dataset is derived from audio captioning datasets, where Large Language Models (LLMs) generate detailed and multiple-choice questions focusing on audio events, objects, acoustic scenes, signal properties, semantics, and listener emotions. To evaluate Mellow’s reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. Our training dataset, findings, and baseline pave the way for developing small ALMs capable of reasoning.
[AI-7] Chemical reasoning in LLMs unlocks steerable synthesis planning and reaction mechanism elucidation
链接: https://arxiv.org/abs/2503.08537
作者: Andres M Bran,Theo A Neukomm,Daniel P Armstrong,Zlatko Jončev,Philippe Schwaller
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
*备注:
点击查看摘要
Abstract:While machine learning algorithms have been shown to excel at specific chemical tasks, they have struggled to capture the strategic thinking that characterizes expert chemical reasoning, limiting their widespread adoption. Here we demonstrate that large language models (LLMs) can serve as powerful chemical reasoning engines when integrated with traditional search algorithms, enabling a new approach to computer-aided chemistry that mirrors human expert thinking. Rather than using LLMs to directly manipulate chemical structures, we leverage their ability to evaluate chemical strategies and guide search algorithms toward chemically meaningful solutions. We demonstrate this paradigm through two fundamental challenges: strategy-aware retrosynthetic planning and mechanism elucidation. In retrosynthetic planning, our method allows chemists to specify desired synthetic strategies in natural language to find routes that satisfy these constraints in vast searches. In mechanism elucidation, LLMs guide the search for plausible reaction mechanisms by combining chemical principles with systematic exploration. Our approach shows strong performance across diverse chemical tasks, with larger models demonstrating increasingly sophisticated chemical reasoning. Our approach establishes a new paradigm for computer-aided chemistry that combines the strategic understanding of LLMs with the precision of traditional chemical tools, opening possibilities for more intuitive and powerful chemical reasoning systems.
[AI-8] A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training
链接: https://arxiv.org/abs/2503.08489
作者: Chengcheng Yan,Jiawei Xu,Qingsong Wang,Zheng Peng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models. However, it has several limitations, including susceptibility to vanishing gradients, sensitivity to input data, and a lack of robust theoretical guarantees. In recent years, alternating minimization (AM) methods have emerged as a promising alternative for model training by employing gradient-free approaches to iteratively update model parameters. Despite their potential, these methods often exhibit slow convergence rates. To address this challenge, we propose a novel Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for neural network training. The TIAM approach incorporates a triple-inertial acceleration strategy with a specialized approximation method, facilitating targeted acceleration of different terms in each sub-problem optimization. This integration improves the efficiency of convergence, achieving superior performance with fewer iterations. Additionally, we provide a convergence analysis of the TIAM algorithm, including its global convergence properties and convergence rate. Extensive experiments validate the effectiveness of the TIAM method, showing significant improvements in generalization capability and computational efficiency compared to existing approaches, particularly when applied to the rectified linear unit (ReLU) and its variants.
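摘要中的"惯性加速交替最小化"可用一个极简的双变量最小二乘问题示意(并非论文的 TIAM 实现:这里只含一重惯性外推,惯性权重 β 与步长均为假设取值):

```python
import numpy as np

# 玩具问题:min_{x,y} 0.5*||A x + B y - c||^2,交替更新两个块
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
B = rng.normal(size=(30, 5))
c = rng.normal(size=30)

def f(x, y):
    r = A @ x + B @ y - c
    return 0.5 * r @ r

# 每个块的 Lipschitz 常数给出安全步长
Lx = np.linalg.norm(A.T @ A, 2)
Ly = np.linalg.norm(B.T @ B, 2)

x = np.zeros(5); y = np.zeros(5)
x_prev = x.copy(); y_prev = y.copy()
beta = 0.3  # 惯性权重(假设值)

for _ in range(2000):
    # 每次块更新前先做惯性外推(extrapolation)
    x_ex = x + beta * (x - x_prev)
    x_prev = x
    x = x_ex - (A.T @ (A @ x_ex + B @ y - c)) / Lx
    y_ex = y + beta * (y - y_prev)
    y_prev = y
    y = y_ex - (B.T @ (A @ x + B @ y_ex - c)) / Ly

print(f(x, y))
```

把 β 设为 0 即退化为普通的交替梯度下降,可据此对比惯性项对收敛速度的影响。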
[AI-9] Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility
链接: https://arxiv.org/abs/2503.08472
作者: Hao Jiang,Yixing Xu,Pradeep Varakantham
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Ride-Pool Matching Problem (RMP) is central to on-demand ride-pooling services, where vehicles must be matched with multiple requests while adhering to service constraints such as pickup delays, detour limits, and vehicle capacity. Most existing RMP solutions assume passengers are picked up and dropped off at their original locations, neglecting the potential for passengers to walk to nearby spots to meet vehicles. This assumption restricts the optimization potential in ride-pooling operations. In this paper, we propose a novel matching method that incorporates extended pickup and drop-off areas for passengers. We first design a tree-based approach to efficiently generate feasible matches between passengers and vehicles. Next, we optimize vehicle routes to cover all designated pickup and drop-off locations while minimizing total travel distance. Finally, we employ dynamic assignment strategies to achieve optimal matching outcomes. Experiments on city-scale taxi datasets demonstrate that our method improves the number of served requests by up to 13% and average travel distance by up to 21% compared to leading existing solutions, underscoring the potential of leveraging passenger mobility to significantly enhance ride-pooling service efficiency.
[AI-10] Accelerating MoE Model Inference with Expert Sharding
链接: https://arxiv.org/abs/2503.08467
作者: Oana Balmau,Anne-Marie Kermarrec,Rafael Pires,André Loureiro Espírito Santo,Martijn de Vos,Milos Vujasinovic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: To appear in the proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys 25)
点击查看摘要
Abstract:Mixture of experts (MoE) models achieve state-of-the-art results in language modeling but suffer from inefficient hardware utilization due to imbalanced token routing and communication overhead. While prior work has focused on optimizing MoE training and decoder architectures, inference for encoder-based MoE models in a multi-GPU setting with expert parallelism remains underexplored. We introduce MoEShard, an inference system that achieves perfect load balancing through tensor sharding of MoE experts. Unlike existing approaches that rely on heuristic capacity factors or drop tokens, MoEShard evenly distributes computation across GPUs and ensures full token retention, maximizing utilization regardless of routing skewness. We achieve this through a strategic row- and column-wise decomposition of expert matrices. This reduces idle time and avoids bottlenecks caused by imbalanced expert assignments. Furthermore, MoEShard minimizes kernel launches by fusing decomposed expert computations, significantly improving throughput. We evaluate MoEShard against DeepSpeed on encoder-based architectures, demonstrating speedups of up to 6.4× in time to first token (TTFT). Our results show that tensor sharding, when properly applied to experts, is a viable and effective strategy for efficient MoE inference.
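摘要中"对专家矩阵做行/列分解"的分片思路,可以用 NumPy 做一个数值等价性的极简示意(并非 MoEShard 实现,维度与分片数均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_shards = 8, 16, 4   # 维度与分片数均为假设

# 单个专家的 FFN:relu(x @ W1) @ W2
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(5, d_model))    # 路由到该专家的 5 个 token

full = np.maximum(x @ W1, 0) @ W2    # 未分片的参考结果

# W1 按列切分、W2 按行切分;每个分片对应一块 GPU 上的局部计算
partials = [np.maximum(x @ W1_s, 0) @ W2_s
            for W1_s, W2_s in zip(np.split(W1, n_shards, axis=1),
                                  np.split(W2, n_shards, axis=0))]
sharded = np.sum(partials, axis=0)   # 对应各 GPU 局部输出的 all-reduce

print(np.allclose(full, sharded))    # True:结果严格一致,不丢 token
```

由于 ReLU 逐元素作用于被切分的隐藏维度,各分片局部结果之和与未分片计算严格相等,这正是"不依赖容量因子、不丢 token"的由来。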
[AI-11] Status and Future Prospects of the Standardization Framework Industry 4.0: A European Perspective
链接: https://arxiv.org/abs/2503.08460
作者: Olga Meyer,Marvin Boell,Christoph Legat
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:The rapid development of Industry 4.0 technologies requires robust and comprehensive standardization to ensure interoperability, safety and efficiency in the Industry of the Future. This paper examines the fundamental role and functionality of standardization, with a particular focus on its importance in Europe’s regulatory framework. Based on this, selected topics concerning standardization activities in the context of intelligent manufacturing and digital twins are highlighted, thereby providing an overview of the Industry 4.0 standards framework. This paper serves both as an informative guide to the existing standards in Industry 4.0 with respect to Artificial Intelligence and Digital Twins, and as a call to action for increased cooperation between standardization bodies and the research community. By fostering such collaboration, we aim to facilitate the continued development and implementation of standards that will drive innovation and progress in the manufacturing sector.
[AI-12] V-Max: Making RL practical for Autonomous Driving
链接: https://arxiv.org/abs/2503.08388
作者: Valentin Charraut,Thomas Tournaire,Waël Doulazmi,Thibault Buhet
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Learning-based decision-making has the potential to enable generalizable Autonomous Driving (AD) policies, reducing the engineering overhead of rule-based approaches. Imitation Learning (IL) remains the dominant paradigm, benefiting from large-scale human demonstration datasets, but it suffers from inherent limitations such as distribution shift and imitation gaps. Reinforcement Learning (RL) presents a promising alternative, yet its adoption in AD remains limited due to the lack of standardized and efficient research frameworks. To this end, we introduce V-Max, an open research framework providing all the necessary tools to make RL practical for AD. V-Max is built on Waymax, a hardware-accelerated AD simulator designed for large-scale experimentation. We extend it using ScenarioNet’s approach, enabling the fast simulation of diverse AD datasets. V-Max integrates a set of observation and reward functions, transformer-based encoders, and training pipelines. Additionally, it includes adversarial evaluation settings and an extensive set of evaluation metrics. Through a large-scale benchmark, we analyze how network architectures, observation functions, training data, and reward shaping impact RL performance.
[AI-13] InfluenceNet: AI Models for Banzhaf and Shapley Value Prediction
链接: https://arxiv.org/abs/2503.08381
作者: Benjamin Kempinski,Tal Kachman
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: 20 pages main text + 6 pages appendix, 11 figures. Accepted to IntelliSys 2025
点击查看摘要
Abstract:Power indices are essential in assessing the contribution and influence of individual agents in multi-agent systems, providing crucial insights into collaborative dynamics and decision-making processes. While invaluable, traditional computational methods for exact or estimated power indices values require significant time and computational resources, especially for large (n ≥ 10) coalitions. These constraints have historically limited researchers’ ability to analyse complex multi-agent interactions comprehensively. To address this limitation, we introduce a novel Neural Networks-based approach that efficiently estimates power indices for voting games, demonstrating comparable and often superior performance to existing tools in terms of both speed and accuracy. This method not only addresses existing computational bottlenecks, but also enables rapid analysis of large coalitions, opening new avenues for multi-agent system research by overcoming previous computational limitations and providing researchers with a more accessible, scalable analytical tool. This increased efficiency will allow for the analysis of more complex and realistic multi-agent scenarios.
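作为背景,摘要所对比的"传统估计方法"之一是置换采样:用随机排列下的平均边际贡献近似 Shapley 值。下面是一个玩具加权投票博弈上的示意(权重、配额与采样数均为假设):

```python
import itertools
import math
import random

# 玩具加权投票博弈(权重与配额均为假设)
weights = [4, 3, 2, 1]
quota = 6
n = len(weights)

def wins(S):
    # 联盟 S 的总权重达到配额即获胜(特征函数取值 1)
    return sum(weights[i] for i in S) >= quota

def shapley_exact(i):
    # 遍历全部排列,计算玩家 i 的平均边际贡献(仅小 n 可行)
    tot = 0
    for perm in itertools.permutations(range(n)):
        before = set(perm[:perm.index(i)])
        tot += wins(before | {i}) - wins(before)
    return tot / math.factorial(n)

def shapley_mc(i, samples=20000, seed=0):
    # 置换采样:大 n 下的传统蒙特卡洛估计
    rng = random.Random(seed)
    players = list(range(n))
    tot = 0
    for _ in range(samples):
        rng.shuffle(players)
        before = set(players[:players.index(i)])
        tot += wins(before | {i}) - wins(before)
    return tot / samples

exact = [shapley_exact(i) for i in range(n)]
approx = [shapley_mc(i) for i in range(n)]
print(exact)
print(approx)
```

精确计算需遍历 n! 个排列,这正是 n ≥ 10 时传统方法代价高昂的原因;采样版把代价降为与采样数成线性。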
[AI-14] Prototype-based Heterogeneous Federated Learning for Blade Icing Detection in Wind Turbines with Class Imbalanced Data
链接: https://arxiv.org/abs/2503.08325
作者: Lele Qi,Mengna Liu,Xu Cheng,Fan Shi,Xiufeng Liu,Shengyong Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Wind farms, typically in high-latitude regions, face a high risk of blade icing. Traditional centralized training methods raise serious privacy concerns. To enhance data privacy in detecting wind turbine blade icing, traditional federated learning (FL) is employed. However, data heterogeneity, resulting from collections across wind farms in varying environmental conditions, impacts the model’s optimization capabilities. Moreover, imbalances in wind turbine data lead to models that tend to favor recognizing majority classes, thus neglecting critical icing anomalies. To tackle these challenges, we propose a federated prototype learning model for class-imbalanced data in heterogeneous environments to detect wind turbine blade icing. We also propose a contrastive supervised loss function to address the class imbalance problem. Experiments on real data from 20 turbines across two wind farms show our method outperforms five FL models and five class imbalance methods, with an average improvement of 19.64% in mF_β and 5.73% in mBA compared to the second-best method, BiFL.
[AI-15] Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs
链接: https://arxiv.org/abs/2503.08322
作者: Hector Kohler,Quentin Delfosse,Waris Radji,Riad Akrour,Philippe Preux
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages of main text, under review
点击查看摘要
Abstract:There exist applications of reinforcement learning, such as medicine, where policies need to be "interpretable" by humans. User studies have shown that some policy classes might be more interpretable than others. However, it is costly to conduct human studies of policy interpretability. Furthermore, there is no clear definition of policy interpretability, i.e., no clear metrics for interpretability, and thus claims depend on the chosen definition. We tackle the problem of empirically evaluating policy interpretability without humans. Despite this lack of clear definition, researchers agree on the notion of "simulatability": policy interpretability should relate to how humans understand policy actions given states. To advance research in interpretable reinforcement learning, we contribute a new methodology to evaluate policy interpretability. This new methodology relies on proxies for simulatability that we use to conduct a large-scale empirical evaluation of policy interpretability. We use imitation learning to compute baseline policies by distilling expert neural networks into small programs. We then show that using our methodology to evaluate the baselines’ interpretability leads to similar conclusions as user studies. We show that increasing interpretability does not necessarily reduce performance and can sometimes increase it. We also show that there is no policy class that better trades off interpretability and performance across tasks, making it necessary for researchers to have methodologies for comparing policies’ interpretability.
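摘要中"把专家网络蒸馏成小程序"的流程,可以用行为克隆到单阈值决策树桩的玩具示意说明(专家策略、特征维度等均为假设,并非论文所用的 RL 专家):

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设的"专家策略":一个线性阈值分类器(代替真实的专家神经网络)
def expert(state):
    return int(0.8 * state[0] + 0.2 * state[1] > 0.1)

# 用专家在随机状态上的动作构造行为克隆数据
states = rng.normal(size=(2000, 2))
actions = np.array([expert(s) for s in states])

# 蒸馏成最小的"程序":单特征、单阈值的决策树桩
best = (0.0, 0, 0.0)
for f in range(states.shape[1]):
    for t in states[:, f]:
        acc = float(np.mean((states[:, f] > t).astype(int) == actions))
        if acc > best[0]:
            best = (acc, f, t)

acc, feat, thr = best
program = f"action = 1 if state[{feat}] > {thr:.3f} else 0"
print(program, "fidelity =", round(acc, 3))
```

得到的单行"程序"是一个人可以直接在脑中模拟的策略,它对专家动作的拟合精度可以充当可模拟性(simulatability)的一个代理指标。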
[AI-16] Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework
链接: https://arxiv.org/abs/2503.08308
作者: Zhuo Zhi,Chen Feng,Adam Daneshmend,Mine Orlu,Andreas Demosthenous,Lu Yin,Da Li,Ziquan Liu,Miguel R. D. Rodrigues
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA) but still face challenges in multimodal reasoning. Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. However, CoT-based multimodal reasoning often demands costly data annotation and fine-tuning, while agentic approaches relying on external tools risk introducing unreliable output from these tools. In this paper, we propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework that integrates external vision models with uncertainty quantification (UQ) into an MLLM to address these challenges. Specifically, SRICE guides the inference process by allowing MLLM to autonomously select regions of interest through multi-stage interactions with the help of external tools. We propose to use a conformal prediction-based approach to calibrate the output of external tools and select the optimal tool by estimating the uncertainty of an MLLM’s output. Our experiment shows that the average improvement of SRICE over the base MLLM is 4.6% on five datasets and the performance on some datasets even outperforms fine-tuning-based methods, revealing the significance of ensuring reliable tool use in an MLLM agent.
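摘要中用共形预测(conformal prediction)校准外部工具输出的做法,可用 split conformal 的最小示意说明(不合规分数为合成数据,α 为假设值):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1   # 目标失覆盖率(假设值)

# 假想工具在校准集上的不合规分数(nonconformity score),此处用合成数据
cal_scores = rng.uniform(size=500)
n = cal_scores.size

# split conformal 的阈值:在 ceil((n+1)(1-alpha))/n 处取经验分位数
level = np.ceil((n + 1) * (1 - alpha)) / n
q = np.quantile(cal_scores, level, method="higher")

# 测试时仅采纳不合规分数不超过 q 的工具输出;
# 若测试分数与校准分数可交换,则覆盖率 >= 1 - alpha
test_scores = rng.uniform(size=2000)
coverage = float(np.mean(test_scores <= q))
print(q, coverage)
```

只要测试分数与校准分数可交换,阈值 q 便保证约 1−α 的覆盖率,而不依赖工具自身置信度是否校准,这也是把它用于"选择可信工具输出"的动机。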
[AI-17] General-Purpose Aerial Intelligent Agents Empowered by Large Language Models
链接: https://arxiv.org/abs/2503.08302
作者: Ji Zhao,Xiao Lin
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The emergence of large language models (LLMs) opens new frontiers for unmanned aerial vehicles (UAVs), yet existing systems remain confined to predefined tasks due to hardware-software co-design challenges. This paper presents the first aerial intelligent agent capable of open-world task execution through tight integration of LLM-based reasoning and robotic autonomy. Our hardware-software co-designed system addresses two fundamental limitations: (1) Onboard LLM operation via an edge-optimized computing platform, achieving 5-6 tokens/sec inference for 14B-parameter models at 220W peak power; (2) A bidirectional cognitive architecture that synergizes slow deliberative planning (LLM task planning) with fast reactive control (state estimation, mapping, obstacle avoidance, and motion planning). Validated through preliminary results using our prototype, the system demonstrates reliable task planning and scene understanding in communication-constrained environments, such as sugarcane monitoring, power grid inspection, mine tunnel exploration, and biological observation applications. This work establishes a novel framework for embodied aerial artificial intelligence, bridging the gap between task planning and robotic autonomy in open environments.
[AI-18] Large Language Model as Meta-Surrogate for Data-Driven Many-Task Optimization: A Proof-of-Principle Study
链接: https://arxiv.org/abs/2503.08301
作者: Xian-Rong Zhang,Yue-Jiao Gong,Jun Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 13 pages
点击查看摘要
Abstract:In many-task optimization scenarios, surrogate models are valuable for mitigating the computational burden of repeated fitness evaluations across tasks. This study proposes a novel meta-surrogate framework to assist many-task optimization, by leveraging the knowledge transfer strengths and emergent capabilities of large language models (LLMs). We formulate a unified framework for many-task fitness prediction, by defining a universal model with metadata to fit a group of problems. Fitness prediction is performed on metadata and decision variables, enabling efficient knowledge sharing across tasks and adaptability to new tasks. The LLM-based meta-surrogate treats fitness prediction as conditional probability estimation, employing a unified token sequence representation for task metadata, inputs, and outputs. This approach facilitates efficient inter-task knowledge sharing through shared token embeddings and captures complex task dependencies via multi-task model training. Experimental results demonstrate the model’s emergent generalization ability, including zero-shot performance on problems with unseen dimensions. When integrated into evolutionary transfer optimization (ETO), our framework supports dual-level knowledge transfer – at both the surrogate and individual levels – enhancing optimization efficiency and robustness. This work establishes a novel foundation for applying LLMs in surrogate modeling, offering a versatile solution for many-task optimization.
[AI-19] D3PO: Preference-Based Alignment of Discrete Diffusion Models
链接: https://arxiv.org/abs/2503.08295
作者: Umberto Borso,Davide Paglieri,Jude Wells,Tim Rocktäschel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.
[AI-20] Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents
链接: https://arxiv.org/abs/2503.08193
作者: Rui Xu,MingYu Wang,XinTao Wang,Dakuan Lu,Xiaoyu Tan,Wei Chu,Yinghui Xu
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advances in LLM-based role-playing language agents (RPLAs) have attracted broad attention in various applications. While chain-of-thought reasoning has shown importance in many tasks for LLMs, the internal thinking processes of RPLAs remain unexplored. Understanding characters’ inner thoughts is crucial for developing advanced RPLAs. In this paper, we introduce ROLETHINK, a novel benchmark constructed from literature for evaluating character thought generation. We propose the task of inner thought reasoning, which includes two sets: the gold set that compares generated thoughts with original character monologues, and the silver set that uses expert synthesized character analyses as references. To address this challenge, we propose MIRROR, a chain-of-thought approach that generates character thoughts by retrieving memories, predicting character reactions, and synthesizing motivations. Through extensive experiments, we demonstrate the importance of inner thought reasoning for RPLAs, and MIRROR consistently outperforms existing methods. Resources are available at this https URL.
[AI-21] Privacy-Enhancing Paradigms within Federated Multi-Agent Systems
链接: https://arxiv.org/abs/2503.08175
作者: Zitong Shi,Guancheng Wan,Wenke Huang,Guibin Zhang,Jiawei Shao,Mang Ye,Carl Yang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:LLM-based Multi-Agent Systems (MAS) have proven highly effective in solving complex problems by integrating multiple agents, each performing different roles. However, in sensitive domains, they face emerging privacy protection challenges. In this paper, we introduce the concept of Federated MAS, highlighting the fundamental differences between Federated MAS and traditional FL. We then identify key challenges in developing Federated MAS, including: 1) heterogeneous privacy protocols among agents, 2) structural differences in multi-party conversations, and 3) dynamic conversational network structures. To address these challenges, we propose Embedded Privacy-Enhancing Agents (EPEAgent), an innovative solution that integrates seamlessly into the Retrieval-Augmented Generation (RAG) phase and the context retrieval stage. This solution minimizes data flows, ensuring that only task-relevant, agent-specific information is shared. Additionally, we design and generate a comprehensive dataset to evaluate the proposed paradigm. Extensive experiments demonstrate that EPEAgent effectively enhances privacy protection while maintaining strong system performance. The code will be available at this https URL
[AI-22] Investigating the Effectiveness of a Socratic Chain-of-Thoughts Reasoning Method for Task Planning in Robotics A Case Study
链接: https://arxiv.org/abs/2503.08174
作者: Veronica Bot,Zheyuan Xu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated unprecedented capability in reasoning with natural language. Coupled with this development is the emergence of embodied AI in robotics. Despite showing promise for verbal and written reasoning tasks, it remains unknown whether LLMs are capable of navigating complex spatial tasks with physical actions in the real world. To this end, it is of interest to investigate applying LLMs to robotics in zero-shot learning scenarios, and in the absence of fine-tuning - a feat which could significantly improve human-robot interaction, alleviate compute cost, and eliminate low-level programming tasks associated with robot tasks. To explore this question, we apply GPT-4(Omni) with a simulated Tiago robot in Webots engine for an object search task. We evaluate the effectiveness of three reasoning strategies based on Chain-of-Thought (CoT) sub-task list generation with the Socratic method (SocraCoT) (in order of increasing rigor): (1) Non-CoT/Non-SocraCoT, (2) CoT only, and (3) SocraCoT. Performance was measured in terms of the proportion of tasks successfully completed and execution time (N = 20). Our preliminary results show that when combined with chain-of-thought reasoning, the Socratic method can be used for code generation for robotic tasks that require spatial awareness. In extension of this finding, we propose EVINCE-LoC; a modified EVINCE method that could further enhance performance in highly complex and or dynamic testing scenarios.
[AI-23] XAI4Extremes: An interpretable machine learning framework for understanding extreme-weather precursors under climate change
链接: https://arxiv.org/abs/2503.08163
作者: Jiawen Wei,Aniruddha Bora,Vivek Oommen,Chenyu Dong,Juntao Yang,Jeff Adie,Chen Chen,Simon See,George Karniadakis,Gianmarco Mengaldo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:Extreme weather events are increasing in frequency and intensity due to climate change. This, in turn, is exacting a significant toll on communities worldwide. While prediction skills are increasing with advances in numerical weather prediction and artificial intelligence tools, extreme weather still presents challenges. More specifically, identifying the precursors of such extreme weather events and how these precursors may evolve under climate change remain unclear. In this paper, we propose to use post-hoc interpretability methods to construct relevance weather maps that show the key extreme-weather precursors identified by deep learning models. We then compare this machine view with existing domain knowledge to understand whether deep learning models identified patterns in data that may enrich our understanding of extreme-weather precursors. We finally bin these relevance maps into different multi-year time periods to understand the role that climate change is having on these precursors. The experiments are carried out on Indochina heatwaves, but the methodology can be readily extended to other extreme weather events worldwide.
[AI-24] Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models
链接: https://arxiv.org/abs/2503.08117
作者: Weiguo Gao,Ming Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 37 pages, 11 figures
点击查看摘要
Abstract:The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other’s training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.
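摘要中"冻结一方会导致另一方多样性指数级收缩"的结论,可以用一维高斯模型在自身输出上反复重训练的极简模拟来直观感受(样本量与代数均为假设,并非论文的双向耦合设定):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10            # 每代用于重训练的样本数(假设值)
mu, sigma2 = 0.0, 1.0
history = [sigma2]

for _ in range(200):
    # "图像模型"在自己上一代的输出上重新做 MLE 拟合
    data = rng.normal(mu, np.sqrt(sigma2), size=m)
    mu, sigma2 = data.mean(), data.var()
    history.append(sigma2)

print(history[0], history[-1])  # 方差从 1.0 收缩到接近 0
```

每代 MLE 方差的期望缩减因子约为 1−1/m,多代相乘即得到指数级收缩;论文分析的双向耦合系统会进一步相互放大这种收缩。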
[AI-25] Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation
链接: https://arxiv.org/abs/2503.08084
作者: Fangyuan Wang,Shipeng Lyu,Peng Zhou,Anqing Duan,Guodong Guo,David Navarro-Alarcon
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 17 pages, 11 figures
点击查看摘要
Abstract:Enabling humanoid robots to perform long-horizon mobile manipulation planning in real-world environments based on embodied perception and comprehension abilities has been a longstanding challenge. With the recent rise of large language models (LLMs), there has been a notable increase in the development of LLM-based planners. These approaches either utilize human-provided textual representations of the real world or heavily depend on prompt engineering to extract such representations, lacking the capability to quantitatively understand the environment, such as determining the feasibility of manipulating objects. To address these limitations, we present the Instruction-Augmented Long-Horizon Planning (IALP) system, a novel framework that employs LLMs to generate feasible and optimal actions based on real-time sensor feedback, including grounded knowledge of the environment, in a closed-loop interaction. Distinct from prior works, our approach augments user instructions into PDDL problems by leveraging both the abstract reasoning capabilities of LLMs and grounding mechanisms. By conducting various real-world long-horizon tasks, each consisting of seven distinct manipulatory skills, our results demonstrate that the IALP system can efficiently solve these tasks with an average success rate exceeding 80%. Our proposed method can operate as a high-level planner, equipping robots with substantial autonomy in unstructured environments through the utilization of multi-modal sensor inputs.
[AI-26] Degradation Self-Supervised Learning for Lithium-ion Battery Health Diagnostics
链接: https://arxiv.org/abs/2503.08083
作者: J. C. Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Health evaluation for lithium-ion batteries (LIBs) typically relies on constant charging/discharging protocols, often neglecting scenarios involving dynamic current profiles prevalent in electric vehicles. Conventional health indicators for LIBs also depend on the uniformity of measured data, restricting their adaptability to non-uniform conditions. In this study, a novel training strategy for estimating LIB health based on the paradigm of self-supervised learning is proposed. A multiresolution analysis technique, empirical wavelet transform, is utilized to decompose non-stationary voltage signals in the frequency domain. This allows the removal of ineffective components for the health evaluation model. The transformer neural network serves as the model backbone, and a loss function is designed to describe the capacity degradation behavior with the assumption that the degradation in LIBs across most operating conditions is inevitable and irreversible. The results show that the model can learn the aging characteristics by analyzing sequences of voltage and current profiles obtained at various time intervals from the same LIB cell. The proposed method is successfully applied to the Stanford University LIB aging dataset, derived from electric vehicle real driving profiles. Notably, this approach achieves an average correlation coefficient of 0.9 between the evaluated health index and the degradation of actual capacity, demonstrating its efficacy in capturing LIB health degradation. This research highlights the feasibility of training deep neural networks using unlabeled LIB data, offering cost-efficient means and unleashing the potential of the measured information.
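经验小波变换(EWT)的核心思想是按频谱自适应地划分频带再分别重构;下面用固定边界的 FFT 频带切分给出一个简化示意(并非 EWT 本身,信号与边界频率均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, T = 100.0, 10.0
t = np.arange(0, T, 1 / fs)
# 合成"电压"信号:慢变趋势 + 高频扰动(均为假设)
signal = 0.5 * np.sin(2 * np.pi * 0.2 * t) + 0.05 * rng.normal(size=t.size)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, 1 / fs)

# 在一个边界频率处切分频谱,分别重构低频带与高频带
boundary = 1.0  # Hz,示意性取值
low_band = np.fft.irfft(spectrum * (freqs <= boundary), n=t.size)
high_band = np.fft.irfft(spectrum * (freqs > boundary), n=t.size)

# 两个频带相加可精确还原原信号
print(np.allclose(low_band + high_band, signal))  # True
```

EWT 与此的区别在于频带边界由信号频谱自适应确定,且使用平滑的小波滤波器而非硬截断;但"分解后剔除无效频带分量"的流程与此一致。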
[AI-27] STGDPM: Vessel Trajectory Prediction with Spatio-Temporal Graph Diffusion Probabilistic Model DASFAA2025
链接: https://arxiv.org/abs/2503.08065
作者: Jin Wenzhe,Tang Haina,Zhang Xudong
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been ACCEPTED as a FULL PAPER at DASFAA 2025
点击查看摘要
Abstract:Vessel trajectory prediction is a critical component for ensuring maritime traffic safety and avoiding collisions. Due to the inherent uncertainty in vessel behavior, trajectory prediction systems must adopt a multimodal approach to accurately model potential future motion states. However, existing vessel trajectory prediction methods lack the ability to comprehensively model behavioral multi-modality. To better capture multimodal behavior in interactive scenarios, we propose modeling interactions as dynamic graphs, replacing traditional aggregation-based techniques that rely on vessel states. By leveraging the natural multimodal capabilities of diffusion models, we frame the trajectory prediction task as an inverse process of motion uncertainty diffusion, wherein uncertainties across potential navigational areas are progressively eliminated until the desired trajectory is produced. In summary, we pioneer the integration of Spatio-Temporal Graph (STG) with diffusion models in ship trajectory prediction. Extensive experiments on real Automatic Identification System (AIS) data validate the superiority of our approach.
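Framing prediction as the inverse of a motion-uncertainty diffusion implies a standard forward noising process over trajectories. A minimal sketch of the closed-form forward step; the variable names and the linear beta schedule are illustrative choices, not taken from the paper:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Closed-form forward noising: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    alphas = 1.0 - betas
    abar_t = np.cumprod(alphas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)          # illustrative linear schedule
traj = np.stack([np.linspace(0.0, 1.0, 12),   # a 12-step, 2-D vessel track
                 np.linspace(0.0, 0.5, 12)], axis=1)
x_noisy, eps = forward_diffuse(traj, 99, betas, rng)  # near-pure noise at t = 99
```

The learned reverse process then removes this noise step by step, which is what "progressively eliminating uncertainty" amounts to in a diffusion formulation.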
[AI-28] Counterfactual Language Reasoning for Explainable Recommendation Systems
链接: https://arxiv.org/abs/2503.08051
作者: Guanrong Li,Haolin Yang,Xinyu Liu,Zhen Wu,Xinyu Dai
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Explainable recommendation systems leverage transparent reasoning to foster user trust and improve decision-making processes. Current approaches typically decouple recommendation generation from explanation creation, violating causal precedence principles where explanatory factors should logically precede outcomes. This paper introduces a novel framework integrating structural causal models with large language models to establish causal consistency in recommendation pipelines. Our methodology enforces explanation factors as causal antecedents to recommendation predictions through causal graph construction and counterfactual adjustment. We particularly address the confounding effect of item popularity that distorts personalization signals in explanations, developing a debiasing mechanism that disentangles genuine user preferences from conformity bias. Through comprehensive experiments across multiple recommendation scenarios, we demonstrate that CausalX achieves superior performance in recommendation accuracy, explanation plausibility, and bias mitigation compared to baselines.
[AI-29] MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models ICRA2025
链接: https://arxiv.org/abs/2503.08007
作者: Han Zhao,Wenxuan Song,Donglin Wang,Xinyang Tong,Pengxiang Ding,Xuelian Cheng,Zongyuan Ge
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Accepted by ICRA 2025
点击查看摘要
Abstract:Developing versatile quadruped robots that can smoothly perform various actions and tasks in real-world environments remains a significant challenge. This paper introduces a novel vision-language-action (VLA) model, mixture of robotic experts (MoRE), for quadruped robots that aim to introduce reinforcement learning (RL) for fine-tuning large-scale VLA models with a large amount of mixed-quality data. MoRE integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model. This design enables the model to effectively adapt to a wide array of downstream tasks. Moreover, we employ a reinforcement learning-based training objective to train our model as a Q-function after deeply exploring the structural properties of our tasks. Effective learning from automatically collected mixed-quality data enhances data efficiency and model performance. Extensive experiments demonstrate that MoRE outperforms all baselines across six different skills and exhibits superior generalization capabilities in out-of-distribution scenarios. We further validate our method in real-world scenarios, confirming the practicality of our approach and laying a solid foundation for future research on multi-task learning in quadruped robots.
[AI-30] Injecting Imbalance Sensitivity for Multi-Task Learning
链接: https://arxiv.org/abs/2503.08006
作者: Zhipeng Zhou,Liu Liu,Peilin Zhao,Wei Gong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Multi-task learning (MTL) has emerged as a promising approach for deploying deep learning models in real-life applications. Recent studies have proposed optimization-based learning paradigms to establish task-shared representations in MTL. However, our paper empirically argues that these studies, specifically gradient-based ones, primarily emphasize the conflict issue while neglecting the potentially more significant impact of imbalance/dominance in MTL. In line with this perspective, we enhance the existing baseline method by injecting imbalance-sensitivity through the imposition of constraints on the projected norms. To demonstrate the effectiveness of our proposed IMbalance-sensitive Gradient (IMGrad) descent method, we evaluate it on multiple mainstream MTL benchmarks, encompassing supervised learning tasks as well as reinforcement learning. The experimental results consistently demonstrate competitive performance.
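The idea of injecting imbalance sensitivity through constraints on gradient norms can be pictured with a simple norm-balancing rule. This is a hypothetical stand-in for the paper's IMGrad update, not its actual rule:

```python
import numpy as np

def balance_task_gradients(grads):
    """Rescale each task's flattened gradient to the mean norm before averaging.

    This prevents any single task from dominating the shared update, one
    simple way to make a multi-task update imbalance-sensitive.
    """
    norms = np.array([np.linalg.norm(g) for g in grads])
    target = norms.mean()
    scaled = [g * (target / max(n, 1e-12)) for g, n in zip(grads, norms)]
    return np.mean(scaled, axis=0)

update = balance_task_gradients([np.array([10.0, 0.0]),   # dominant task
                                 np.array([0.0, 0.1])])   # weak task
```

Without the rescaling, the first task's gradient would dwarf the second's contribution to the shared representation.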
[AI-31] LLM-Powered Knowledge Graphs for Enterprise Intelligence and Analytics
链接: https://arxiv.org/abs/2503.07993
作者: Rajeev Kumar,Kumar Ishan,Harishankar Kumar,Abhinandan Singla
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Disconnected data silos within enterprises obstruct the extraction of actionable insights, diminishing efficiency in areas such as product development, client engagement, meeting preparation, and analytics-driven decision-making. This paper introduces a framework that uses large language models (LLMs) to unify various data sources into a comprehensive, activity-centric knowledge graph. The framework automates tasks such as entity extraction, relationship inference, and semantic enrichment, enabling advanced querying, reasoning, and analytics across data types like emails, calendars, chats, documents, and logs. Designed for enterprise flexibility, it supports applications such as contextual search, task prioritization, expertise discovery, personalized recommendations, and advanced analytics to identify trends and actionable insights. Experimental results demonstrate its success in the discovery of expertise, task management, and data-driven decision making. By integrating LLMs with knowledge graphs, this solution bridges disconnected systems and delivers intelligent analytics-powered enterprise tools.
[AI-32] Boundary Prompting: Elastic Urban Region Representation via Graph-based Spatial Tokenization
链接: https://arxiv.org/abs/2503.07991
作者: Haojia Zhu,Jiahui Jin,Dong Kan,Rouxi Shen,Ruize Wang,Xiangguo Sun,Jinghui Zhang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Urban region representation is essential for various applications such as urban planning, resource allocation, and policy development. Traditional methods rely on fixed, predefined region boundaries, which fail to capture the dynamic and complex nature of real-world urban areas. In this paper, we propose the Boundary Prompting Urban Region Representation Framework (BPURF), a novel approach that allows for elastic urban region definitions. BPURF comprises two key components: (1) a spatial token dictionary, where urban entities are treated as tokens and integrated into a unified token graph, and (2) a region token set representation model which utilizes token aggregation and a multi-channel model to embed token sets corresponding to region boundaries. Additionally, we propose a fast token set extraction strategy to enable online token set extraction during training and prompting. This framework enables the definition of urban regions through boundary prompting, supporting varying region boundaries and adapting to different tasks. Extensive experiments demonstrate the effectiveness of BPURF in capturing the complex characteristics of urban regions.
[AI-33] Provable Zero-Shot Generalization in Offline Reinforcement Learning
链接: https://arxiv.org/abs/2503.07988
作者: Zhiyong Wang,Chen Yang,John C.S. Lui,Dongruo Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 30 pages, 1 figure, 1 table
点击查看摘要
Abstract:In this work, we study offline reinforcement learning (RL) with zero-shot generalization property (ZSG), where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments which performs well on test environments without further interaction. Existing work showed that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
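Pessimistic policy evaluation, the ingredient shared by PERM and PPPO, subtracts an uncertainty penalty from value estimates so that rarely observed state-action pairs are not over-valued. The count-based penalty below is an assumed illustration, not the paper's construction:

```python
import numpy as np

def pessimistic_value(q_hat, visit_counts, beta=1.0):
    """Lower-confidence-bound value estimate for offline RL.

    Subtracts a count-based uncertainty penalty so that state-action pairs
    with little offline data are valued conservatively.
    """
    bonus = beta / np.sqrt(np.maximum(visit_counts, 1))
    return q_hat - bonus

q_hat = np.array([1.0, 1.0])   # identical point estimates...
counts = np.array([100, 1])    # ...but very different data support
v_pess = pessimistic_value(q_hat, counts)
```

The poorly supported action ends up with a much lower pessimistic value, steering policy learning away from it.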
[AI-34] Hierarchical Contact-Rich Trajectory Optimization for Multi-Modal Manipulation using Tight Convex Relaxations ICRA
链接: https://arxiv.org/abs/2503.07963
作者: Yuki Shirai,Arvind Raghunathan,Devesh K. Jha
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 2025 IEEE International Conference on Robotics and Automation (2025 ICRA)
点击查看摘要
Abstract:Designing trajectories for manipulation through contact is challenging as it requires reasoning about object and robot trajectories as well as complex contact sequences simultaneously. In this paper, we present a novel framework for simultaneously designing trajectories of robots, objects, and contacts efficiently for contact-rich manipulation. We propose a hierarchical optimization framework where a Mixed-Integer Linear Program (MILP) selects optimal contacts between robot and object using approximate dynamical constraints, and then a NonLinear Program (NLP) optimizes the trajectory of the robot(s) and object considering full nonlinear constraints. We present a convex relaxation of bilinear constraints using a binary encoding technique such that the MILP can provide tighter solutions with better computational complexity. The proposed framework is evaluated on various manipulation tasks where it can reason about complex multi-contact interactions while providing computational advantages. We also demonstrate our framework in hardware experiments using a bimanual robot system.
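The classical convex relaxation of a bilinear term w = x·y is the McCormick envelope, which the paper tightens further via a binary encoding. A sketch of the basic envelope bounds (illustrative only; the binary-encoding refinement is not shown):

```python
def mccormick_bounds(x, y, xl, xu, yl, yu):
    """McCormick envelope for w = x*y with x in [xl, xu], y in [yl, yu].

    Returns (lower, upper): the linear under- and over-estimators evaluated
    at (x, y); the true product always lies between them.
    """
    lower = max(xl * y + x * yl - xl * yl,
                xu * y + x * yu - xu * yu)
    upper = min(xu * y + x * yl - xu * yl,
                xl * y + x * yu - xl * yu)
    return lower, upper

lo, hi = mccormick_bounds(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)  # true product is 0.25
```

Partitioning the variable ranges (e.g. with binary indicator variables) shrinks these boxes, which is the general idea behind tightening such relaxations inside a MILP.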
[AI-35] LLM-based Corroborating and Refuting Evidence Retrieval for Scientific Claim Verification
链接: https://arxiv.org/abs/2503.07937
作者: Siyuan Wang,James R. Foulds,Md Osman Gani,Shimei Pan
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we introduce CIBER (Claim Investigation Based on Evidence Retrieval), an extension of the Retrieval-Augmented Generation (RAG) framework designed to identify corroborating and refuting documents as evidence for scientific claim verification. CIBER addresses the inherent uncertainty in Large Language Models (LLMs) by evaluating response consistency across diverse interrogation probes. By focusing on the behavioral analysis of LLMs without requiring access to their internal information, CIBER is applicable to both white-box and black-box models. Furthermore, CIBER operates in an unsupervised manner, enabling easy generalization across various scientific domains. Comprehensive evaluations conducted using LLMs with varying levels of linguistic proficiency reveal CIBER’s superior performance compared to conventional RAG approaches. These findings not only highlight the effectiveness of CIBER but also provide valuable insights for future advancements in LLM-based scientific claim verification.
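Evaluating response consistency across interrogation probes can be reduced to an agreement score over per-probe verdicts. A minimal sketch; the verdict labels and aggregation rule are my own simplification of CIBER's behavioral analysis:

```python
from collections import Counter

def consistency_verdict(responses):
    """Aggregate per-probe verdicts into a majority verdict and a
    consistency score in [0, 1] (fraction of probes agreeing with it)."""
    counts = Counter(responses)
    verdict, n = counts.most_common(1)[0]
    return verdict, n / len(responses)

verdict, score = consistency_verdict(["support", "support", "refute", "support"])
```

A low score flags claims where the LLM's judgment is unstable, which is where uncertainty-aware aggregation matters most.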
[AI-36] The StudyChat Dataset: Student Dialogues With ChatGPT in an Artificial Intelligence Course
链接: https://arxiv.org/abs/2503.07928
作者: Hunter McNichols,Andrew Lan
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Pre-print
点击查看摘要
Abstract:The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can frequently interact with LLM-powered, interactive learning tools, but their usage patterns need to be analyzed to ensure ethical usage of these tools. To better understand how students interact with LLMs in an academic setting, we introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPT’s core functionalities, and use it to log student interactions with the LLM while working on programming assignments. We collect 1,197 conversations, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. Additionally, we analyze these interactions, highlight behavioral trends, and analyze how specific usage patterns relate to course outcomes. StudyChat provides a rich resource for the learning sciences and AI in education communities, enabling further research into the evolving role of LLMs in education.
[AI-37] Safety Guardrails for LLM-Enabled Robots
链接: https://arxiv.org/abs/2503.07885
作者: Zachary Ravichandran,Alexander Robey,Vijay Kumar,George J. Pappas,Hamed Hassani
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Although the integration of large language models (LLMs) into robotics has unlocked transformative capabilities, it has also introduced significant safety concerns, ranging from average-case LLM errors (e.g., hallucinations) to adversarial jailbreaking attacks, which can produce harmful robot behavior in real-world settings. Traditional robot safety approaches do not address the novel vulnerabilities of LLMs, and current LLM safety guardrails overlook the physical risks posed by robots operating in dynamic real-world environments. In this paper, we propose RoboGuard, a two-stage guardrail architecture to ensure the safety of LLM-enabled robots. RoboGuard first contextualizes pre-defined safety rules by grounding them in the robot’s environment using a root-of-trust LLM, which employs chain-of-thought (CoT) reasoning to generate rigorous safety specifications, such as temporal logic constraints. RoboGuard then resolves potential conflicts between these contextual safety specifications and a possibly unsafe plan using temporal logic control synthesis, which ensures safety compliance while minimally violating user preferences. Through extensive simulation and real-world experiments that consider worst-case jailbreaking attacks, we demonstrate that RoboGuard reduces the execution of unsafe plans from 92% to below 2.5% without compromising performance on safe plans. We also demonstrate that RoboGuard is resource-efficient, robust against adaptive attacks, and significantly enhanced by enabling its root-of-trust LLM to perform CoT reasoning. These results underscore the potential of RoboGuard to mitigate the safety risks and enhance the reliability of LLM-enabled robots.
[AI-38] LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model
链接: https://arxiv.org/abs/2503.07884
作者: Xinxin Zhao,Haoyang Li,Jing Zhang,Xinmei Huang,Tieying Zhang,Jianjun Chen,Rui Shi,Cuiping Li,Hong Chen
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Index recommendation is essential for improving query performance in database management systems (DBMSs) through creating an optimal set of indexes under specific constraints. Traditional methods, such as heuristic and learning-based approaches, are effective but face challenges like lengthy recommendation time, resource-intensive training, and poor generalization across different workloads and database schemas. To address these issues, we propose LLMIdxAdvis, a resource-efficient index advisor that uses large language models (LLMs) without extensive fine-tuning. LLMIdxAdvis frames index recommendation as a sequence-to-sequence task, taking target workload, storage constraint, and corresponding database environment as input, and directly outputting recommended indexes. It constructs a high-quality demonstration pool offline, using GPT-4-Turbo to synthesize diverse SQL queries and applying integrated heuristic methods to collect both default and refined labels. During recommendation, these demonstrations are ranked to inject database expertise via in-context learning. Additionally, LLMIdxAdvis extracts workload features involving specific column statistical information to strengthen the LLM's understanding, and introduces a novel inference scaling strategy combining vertical scaling (via "Index-Guided Major Voting" and Best-of-N) and horizontal scaling (through iterative "self-optimization" with database feedback) to enhance reliability. Experiments on 3 OLAP and 2 real-world benchmarks reveal that LLMIdxAdvis delivers competitive index recommendation with reduced runtime, and generalizes effectively across different workloads and database schemas.
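The vertical-scaling step built on majority voting can be pictured as sampling N index recommendations and keeping indexes proposed by most samples. The threshold rule and index names below are a hypothetical simplification, not the paper's exact "Index-Guided Major Voting":

```python
from collections import Counter

def majority_vote_indexes(candidate_sets, threshold=0.5):
    """Keep an index if it appears in more than `threshold` of the N sampled
    recommendation sets."""
    n = len(candidate_sets)
    counts = Counter(ix for s in candidate_sets for ix in set(s))
    return {ix for ix, c in counts.items() if c / n > threshold}

samples = [
    {"idx_orders_date", "idx_users_id"},
    {"idx_orders_date"},
    {"idx_orders_date", "idx_items_sku"},
]
chosen = majority_vote_indexes(samples)   # only the consistently proposed index
```

Voting over multiple LLM samples filters out one-off suggestions, trading extra inference calls for more reliable recommendations.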
[AI-39] Right Reward Right Time for Federated Learning
链接: https://arxiv.org/abs/2503.07869
作者: Thanh Linh Nguyen,Dinh Thai Hoang,Diep N. Nguyen,Quoc-Viet Pham
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
*备注: IEEE Journal Submission
点击查看摘要
Abstract:Critical learning periods (CLPs) in federated learning (FL) refer to early stages during which low-quality contributions (e.g., sparse training data availability) can permanently impair the learning performance of the global model owned by the model owner (i.e., the cloud server). However, strategies to motivate clients with high-quality contributions to join the FL training process and share trained model updates during CLPs remain underexplored. Additionally, existing incentive mechanisms in FL treat all training periods equally, which consequently fails to motivate clients to participate early. Compounding this challenge is the cloud’s limited knowledge of client training capabilities due to privacy regulations, leading to information asymmetry. Therefore, in this article, we propose a time-aware incentive mechanism, called Right Reward Right Time (R3T), to encourage client involvement, especially during CLPs, to maximize the utility of the cloud in FL. Specifically, the cloud utility function captures the trade-off between the achieved model performance and payments allocated for clients’ contributions, while accounting for clients’ time and system capabilities, efforts, joining time, and rewards. Then, we analytically derive the optimal contract for the cloud and devise a CLP-aware mechanism to incentivize early participation and efforts while maximizing cloud utility, even under information asymmetry. By providing the right reward at the right time, our approach can attract the highest-quality contributions during CLPs. Simulation and proof-of-concept studies show that R3T increases cloud utility and is more economically effective than benchmarks. Notably, our proof-of-concept results show up to a 47.6% reduction in the total number of clients and up to a 300% improvement in convergence time while reaching competitive test accuracies compared with incentive mechanism benchmarks.
[AI-40] CIMAGE: Exploiting the Conditional Independence in Masked Graph Auto-encoders WSDM2025
链接: https://arxiv.org/abs/2503.07852
作者: Jongwon Park,Heesoo Jung,Hogun Park
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the WSDM 2025 Oral. This is an extended version of the original submission. Typos are also corrected
点击查看摘要
Abstract:Recent Self-Supervised Learning (SSL) methods encapsulating relational information via masking in Graph Neural Networks (GNNs) have shown promising performance. However, most existing approaches rely on random masking strategies in either feature or graph space, which may fail to capture task-relevant information fully. We posit that this limitation stems from an inability to achieve minimum redundancy between masked and unmasked components while ensuring maximum relevance of both to potential downstream tasks. Conditional Independence (CI) inherently satisfies the minimum redundancy and maximum relevance criteria, but its application typically requires access to downstream labels. To address this challenge, we introduce CIMAGE, a novel approach that leverages Conditional Independence to guide an effective masking strategy within the latent space. CIMAGE utilizes CI-aware latent factor decomposition to generate two distinct contexts, leveraging high-confidence pseudo-labels derived from unsupervised graph clustering. In this framework, the pretext task involves reconstructing the masked second context solely from the information provided by the first context. Our theoretical analysis further supports the superiority of CIMAGE’s novel CI-aware masking method by demonstrating that the learned embedding exhibits approximate linear separability, which enables accurate predictions for the downstream task. Comprehensive evaluations across diverse graph benchmarks illustrate the advantage of CIMAGE, with notably higher average rankings on node classification and link prediction tasks. Notably, our proposed model highlights the under-explored potential of CI in enhancing graph SSL methodologies and offers enriched insights for effective graph representation learning.
[AI-41] Actual Causation and Nondeterministic Causal Models
链接: https://arxiv.org/abs/2503.07849
作者: Sander Beckers
类目: Artificial Intelligence (cs.AI)
*备注: Accepted at CLeaR 2025
点击查看摘要
Abstract:In (Beckers, 2025) I introduced nondeterministic causal models as a generalization of Pearl’s standard deterministic causal models. I here take advantage of the increased expressivity offered by these models to offer a novel definition of actual causation (that also applies to deterministic models). Instead of motivating the definition by way of (often subjective) intuitions about examples, I proceed by developing it based entirely on the unique function that it can fulfil in communicating and learning a causal model. First I generalize the more basic notion of counterfactual dependence, second I show how this notion has a vital role to play in the logic of causal discovery, third I introduce the notion of a structural simplification of a causal model, and lastly I bring both notions together in my definition of actual causation. Although novel, the resulting definition arrives at verdicts that are almost identical to those of my previous definition (Beckers, 2021, 2022).
[AI-42] Safe Explicable Policy Search
链接: https://arxiv.org/abs/2503.07848
作者: Akkamahadevi Hanni,Jonathan Montaño,Yu Zhang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:When users work with AI agents, they form conscious or subconscious expectations of them. Meeting user expectations is crucial for such agents to engage in successful interactions and teaming. However, users may form expectations of an agent that differ from the agent’s planned behaviors. These differences lead to the consideration of two separate decision models in the planning process to generate explicable behaviors. However, little has been done to incorporate safety considerations, especially in a learning setting. We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk, both during and after learning. We formulate SEPS as a constrained optimization problem where the agent aims to maximize an explicability score subject to constraints on safety and a suboptimality criterion based on the agent’s model. SEPS innovatively combines the capabilities of Constrained Policy Optimization and Explicable Policy Search. We evaluate SEPS in safety-gym environments and with a physical robot experiment to show that it can learn explicable behaviors that adhere to the agent’s safety requirements and are efficient. Results show that SEPS can generate safe and explicable behaviors while ensuring a desired level of performance w.r.t. the agent’s objective, and has real-world relevance in human-AI teaming.
[AI-43] Group Fairness in Multi-Task Reinforcement Learning
链接: https://arxiv.org/abs/2503.07817
作者: Kefan Song,Runnan Jiang,Rohan Chandra,Shangtong Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper addresses a critical societal consideration in the application of Reinforcement Learning (RL): ensuring equitable outcomes across different demographic groups in multi-task settings. While previous work has explored fairness in single-task RL, many real-world applications are multi-task in nature and require policies to maintain fairness across all tasks. We introduce a novel formulation of multi-task group fairness in RL and propose a constrained optimization algorithm that explicitly enforces fairness constraints across multiple tasks simultaneously. We have shown that our proposed algorithm does not violate fairness constraints with high probability and with sublinear regret in the finite-horizon episodic setting. Through experiments in RiverSwim and MuJoCo environments, we demonstrate that our approach better ensures group fairness across multiple tasks compared to previous methods that lack explicit multi-task fairness constraints in both the finite-horizon setting and the infinite-horizon setting. Our results show that the proposed algorithm achieves smaller fairness gaps while maintaining comparable returns across different demographic groups and tasks, suggesting its potential for addressing fairness concerns in real-world multi-task RL applications.
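A multi-task group fairness gap can be quantified as the worst across-group return difference over all tasks. A small sketch of such a metric; the exact fairness measure used in the paper may differ:

```python
import numpy as np

def max_fairness_gap(returns):
    """Worst across-group return difference over tasks.

    returns: array of shape (n_tasks, n_groups) holding each group's average
    return on each task; the gap per task is max minus min group return.
    """
    r = np.asarray(returns, dtype=float)
    per_task_gap = r.max(axis=1) - r.min(axis=1)
    return float(per_task_gap.max())

gap = max_fairness_gap([[1.0, 0.8],   # task 1: groups differ
                        [0.5, 0.5]])  # task 2: parity
```

A constrained optimizer would then require this gap to stay below a tolerance on every task simultaneously, rather than only on average.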
[AI-44] Efficient Neural Clause-Selection Reinforcement
链接: https://arxiv.org/abs/2503.07792
作者: Martin Suda
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: 15 pages main text, 3 page bibliography, 6 page appendix
点击查看摘要
Abstract:Clause selection is arguably the most important choice point in saturation-based theorem proving. Framing it as a reinforcement learning (RL) task is a way to challenge the human-designed heuristics of state-of-the-art provers and to instead automatically evolve, just from prover experiences, their potentially optimal replacement. In this work, we present a neural network architecture for scoring clauses for clause selection that is powerful yet efficient to evaluate. Following RL principles to make design decisions, we integrate the network into the Vampire theorem prover and train it from successful proof attempts. An experiment on the diverse TPTP benchmark finds that the neurally guided prover improves by 20% over the baseline strategy from which it initially learned, in terms of the number of in-training-unseen problems solved under a practically relevant, short CPU instruction limit.
[AI-45] Joint Explainability-Performance Optimization With Surrogate Models for AI-Driven Edge Services
链接: https://arxiv.org/abs/2503.07784
作者: Foivos Charalampakos,Thomas Tsouparopoulos,Iordanis Koutsopoulos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Explainable AI is a crucial component for edge services, as it ensures reliable decision making based on complex AI models. Surrogate models are a prominent approach of XAI where human-interpretable models, such as a linear regression model, are trained to approximate a complex (black-box) model’s predictions. This paper delves into the balance between the predictive accuracy of complex AI models and their approximation by surrogate ones, advocating that both these models benefit from being learned simultaneously. We derive a joint (bi-level) training scheme for both models and we introduce a new algorithm based on multi-objective optimization (MOO) to simultaneously minimize both the complex model’s prediction error and the error between its outputs and those of the surrogate. Our approach leads to improvements that exceed 99% in the approximation of the black-box model through the surrogate one, as measured by the metric of Fidelity, for a compromise of less than 3% absolute reduction in the black-box model’s predictive accuracy, compared to single-task and multi-task learning baselines. By improving Fidelity, we can derive more trustworthy explanations of the complex model’s outcomes from the surrogate, enabling reliable AI applications for intelligent services at the network edge.
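Fidelity, the metric cited above, measures how closely the surrogate reproduces the black-box model's predictions. For classification outputs it reduces to an agreement rate; a minimal sketch:

```python
import numpy as np

def fidelity(black_box_preds, surrogate_preds):
    """Fraction of inputs on which the surrogate's prediction matches the
    black-box model's prediction (for classification outputs)."""
    b = np.asarray(black_box_preds)
    s = np.asarray(surrogate_preds)
    return float(np.mean(b == s))

score = fidelity([1, 0, 1, 1], [1, 0, 0, 1])  # surrogate disagrees on one input
```

Joint training pushes this agreement toward 1 while a second objective keeps the black-box model's own prediction error low, which is the bi-level trade-off the paper optimizes.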
[AI-46] Sensemaking in Novel Environments: How Human Cognition Can Inform Artificial Agents
链接: https://arxiv.org/abs/2503.07783
作者: Robert E. Patterson(1),Regina Buccello-Stout(1),Mary E. Frame(2),Anna M. Maresca(2),Justin Nelson(1),Barbara Acker-Mills(2),Erica Curtis(2),Jared Culbertson(3),Kevin Schmidt(3),Scott Clouse(3),Steve Rogers(3) ((1) Air Force Research Laboratory, (2) Parallax Advanced Research, (3) Autonomy Capability Team (ACT3) Wright-Patterson AFB)
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:One of the most vital cognitive skills to possess is the ability to make sense of objects, events, and situations in the world. In the current paper, we offer an approach for creating artificially intelligent agents with the capacity for sensemaking in novel environments. Objectives: to present several key ideas: (1) a novel unified conceptual framework for sensemaking (which includes the existence of sign relations embedded within and across frames); (2) interaction among various content-addressable, distributed-knowledge structures via shared attributes (whose net response would represent a synthesized object, event, or situation serving as a sign for sensemaking in a novel environment). Findings: we suggest that attributes across memories can be shared and recombined in novel ways to create synthesized signs, which can denote certain outcomes in novel environments (i.e., sensemaking).
[AI-47] Evaluating LLaMA 3.2 for Software Vulnerability Detection
链接: https://arxiv.org/abs/2503.07770
作者: José Gonçalves,Miguel Silva,Bernardo Cabral,Tiago Dias,Eva Maia,Isabel Praça,Ricardo Severino,Luís Lino Ferreira
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
*备注: 14 pages, 4 tables, EICC 2025: European Interdisciplinary Cybersecurity Conference 2025
点击查看摘要
Abstract:Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain in sufficient quantities. To address this challenge, the DiverseVul dataset has been curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, during our study several inconsistencies were identified in the raw dataset while applying pre-processing techniques, highlighting the need for a refined version. In this work, we present a refined version of the DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the use of pre-processing techniques led to an improvement in performance, with the model achieving an F1-Score of 66%, a competitive result when compared to our baseline, which achieved a 47% F1-Score in software vulnerability detection.
[AI-48] A Simple Approach to Constraint-Aware Imitation Learning with Application to Autonomous Racing IROS2025
链接: https://arxiv.org/abs/2503.07737
作者: Shengfan Cao,Eunhyek Joa,Francesco Borrelli
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Submitted to IEEE IROS 2025
点击查看摘要
Abstract:Guaranteeing constraint satisfaction is challenging in imitation learning (IL), particularly in tasks that require operating near a system’s handling limits. Traditional IL methods often struggle to enforce constraints, leading to suboptimal performance in high-precision tasks. In this paper, we present a simple approach to incorporating safety into the IL objective. Through simulations, we empirically validate our approach on an autonomous racing task with both full-state and image feedback, demonstrating improved constraint satisfaction and greater consistency in task performance compared to a baseline method.
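One simple way to "incorporate safety into the IL objective", as the abstract describes at a high level, is to add a penalty on constraint violation to the behavior-cloning loss. The sketch below is a hypothetical formulation, not the paper's actual objective; the hinge penalty and weight `lam` are assumptions:

```python
def constrained_il_loss(pred, expert, violation, lam=10.0):
    """Behavior-cloning MSE plus a hinge penalty on constraint violation.
    `violation` <= 0 means the constraint is satisfied (illustrative sketch)."""
    bc = sum((p - e) ** 2 for p, e in zip(pred, expert)) / len(pred)
    return bc + lam * max(0.0, violation)
```

When the demonstrated action is matched and the constraint holds, the loss is zero; violating the constraint adds a cost proportional to the violation magnitude.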
[AI-49] Automated Benchmark Generation for Repository-Level Coding Tasks ICLR’25
链接: https://arxiv.org/abs/2503.07701
作者: Konstantinos Vergopoulos,Mark Niklas Müller,Martin Vechev
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted at DL4C@ICLR’25 and FMWild@ICLR’25
点击查看摘要
Abstract:Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench. This benchmark challenges code agents to generate patches addressing GitHub issues given the full repository as context. The correctness of generated patches is then evaluated by executing a human-written test suite extracted from the repository after the issue’s resolution. However, constructing benchmarks like SWE-Bench requires substantial manual effort to set up historically accurate execution environments for testing. Crucially, this severely limits the number of considered repositories, e.g., just 12 for SWE-Bench. Considering so few repositories, selected for their popularity, runs the risk of a distributional mismatch, i.e., the measured performance may not be representative of real-world scenarios, potentially misguiding development efforts. In this work, we address this challenge and introduce SetUpAgent, a fully automated system capable of historically accurate dependency setup, test execution, and result parsing. Using SetUpAgent, we generate two new datasets: (i) SWEE-Bench an extended version of SWE-Bench encompassing hundreds of repositories, and (ii) SWA-Bench a benchmark focusing on applications rather than libraries. Comparing these datasets to SWE-Bench with respect to their characteristics and code agent performance, we find significant distributional differences, including lower issue description quality and detail level, higher fix complexity, and most importantly up to 40% lower agent success rates.
[AI-50] A Task and Motion Planning Framework Using Iteratively Deepened AND/OR Graph Networks
链接: https://arxiv.org/abs/2503.07700
作者: Hossein Karami,Antony Thomas,Fulvio Mastrogiovanni
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we present an approach for integrated task and motion planning based on an AND/OR graph network, which is used to represent task-level states and actions, and we leverage it to implement different classes of task and motion planning problems (TAMP). Several problems that fall under task and motion planning do not have a predetermined number of sub-tasks to achieve a goal. For example, while retrieving a target object from a cluttered workspace, in principle the number of object re-arrangements required to finally grasp it cannot be known ahead of time. To address this challenge, and in contrast to traditional planners, also those based on AND/OR graphs, we grow the AND/OR graph at run-time by progressively adding sub-graphs until grasping the target object becomes feasible, which yields a network of AND/OR graphs. The approach is extended to enable multi-robot task and motion planning, and (i) it allows us to perform task allocation while coordinating the activity of a given number of robots, and (ii) can handle multi-robot tasks involving an a priori unknown number of sub-tasks. The approach is evaluated and validated both in simulation and with a real dual-arm robot manipulator, that is, Baxter from Rethink Robotics. In particular, for the single-robot task and motion planning, we validated our approach in three different TAMP domains. Furthermore, we also use three different robots for simulation, namely, Baxter, Franka Emika Panda manipulators, and a PR2 robot. Experiments show that our approach can be readily scaled to scenarios with many objects and robots, and is capable of handling different classes of TAMP problems.
[AI-51] Fully Autonomous Programming using Iterative Multi-Agent Debugging with Large Language Models
链接: https://arxiv.org/abs/2503.07693
作者: Anastasiia Grishina,Vadim Liventsev,Aki Härmä,Leon Moonen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)
*备注: Accepted for publication in ACM Trans. Evol. Learn. Optim., February 2025. arXiv admin note: text overlap with arXiv:2304.10423
点击查看摘要
Abstract:Program synthesis with Large Language Models (LLMs) suffers from a “near-miss syndrome”: the generated code closely resembles a correct solution but fails unit tests due to minor errors. We address this with a multi-agent framework called Synthesize, Execute, Instruct, Debug, and Repair (SEIDR). Effectively applying SEIDR to instruction-tuned LLMs requires determining (a) optimal prompts for LLMs, (b) what ranking algorithm selects the best programs in debugging rounds, and (c) balancing the repair of unsuccessful programs with the generation of new ones. We empirically explore these trade-offs by comparing replace-focused, repair-focused, and hybrid debug strategies. We also evaluate lexicase and tournament selection to rank candidates in each generation. On Program Synthesis Benchmark 2 (PSB2), our framework outperforms both conventional use of OpenAI Codex without a repair phase and traditional genetic programming approaches. SEIDR outperforms the use of an LLM alone, solving 18 problems in C++ and 20 in Python on PSB2 at least once across experiments. To assess generalizability, we employ GPT-3.5 and Llama 3 on the PSB2 and HumanEval-X benchmarks. Although SEIDR with these models does not surpass current state-of-the-art methods on the Python benchmarks, the results on HumanEval-C++ are promising. SEIDR with Llama 3-8B achieves an average pass@100 of 84.2%. Across all SEIDR runs, 163 of 164 problems are solved at least once with GPT-3.5 in HumanEval-C++, and 162 of 164 with the smaller Llama 3-8B. We conclude that SEIDR effectively overcomes the near-miss syndrome in program synthesis with LLMs.
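The two selection mechanisms evaluated above, tournament and lexicase selection, can be sketched in a few lines. This is a generic illustration of the standard algorithms, not SEIDR's implementation; `case_results[i][c]` is an assumed representation marking whether candidate program `i` passes unit test `c`:

```python
import random

def tournament_select(candidates, scores, k=3, rng=None):
    """Sample k candidates at random and keep the highest-scoring one."""
    rng = rng or random.Random(0)
    pool = rng.sample(range(len(candidates)), k)
    best = max(pool, key=lambda i: scores[i])
    return candidates[best]

def lexicase_select(candidates, case_results, rng=None):
    """Lexicase selection: shuffle the test cases, then repeatedly discard
    candidates that fail the next case, keeping only survivors."""
    rng = rng or random.Random(0)
    cases = list(range(len(case_results[0])))
    rng.shuffle(cases)
    survivors = list(range(len(candidates)))
    for c in cases:
        passed = [i for i in survivors if case_results[i][c]]
        if passed:
            survivors = passed
        if len(survivors) == 1:
            break
    return candidates[rng.choice(survivors)]
```

Lexicase tends to favor specialists that ace individual test cases, while tournament selection favors high aggregate scores.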
[AI-52] Artificial Intelligence in Deliberation: The AI Penalty and the Emergence of a New Deliberative Divide
链接: https://arxiv.org/abs/2503.07690
作者: Andreas Jungherr,Adrian Rauchfleisch
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:Digital deliberation has expanded democratic participation, yet challenges remain. This includes processing information at scale, moderating discussions, fact-checking, or attracting people to participate. Recent advances in artificial intelligence (AI) offer potential solutions, but public perceptions of AI’s role in deliberation remain underexplored. Beyond efficiency, democratic deliberation is about voice and recognition. If AI is integrated into deliberation, public trust, acceptance, and willingness to participate may be affected. We conducted a preregistered survey experiment with a representative sample in Germany (n=1850) to examine how information about AI-enabled deliberation influences willingness to participate and perceptions of deliberative quality. Respondents were randomly assigned to treatments that provided them information about deliberative tasks facilitated by either AI or humans. Our findings reveal a significant AI-penalty. Participants were less willing to engage in AI-facilitated deliberation and rated its quality lower than human-led formats. These effects were moderated by individual predispositions. Perceptions of AI’s societal benefits and anthropomorphization of AI showed positive interaction effects on people’s interest to participate in AI-enabled deliberative formats and positive quality assessments, while AI risk assessments showed negative interactions with information about AI-enabled deliberation. These results suggest AI-enabled deliberation faces substantial public skepticism, potentially even introducing a new deliberative divide. Unlike traditional participation gaps based on education or demographics, this divide is shaped by attitudes toward AI. As democratic engagement increasingly moves online, ensuring AI’s role in deliberation does not discourage participation or deepen inequalities will be a key challenge for future research and policy.
[AI-53] Adaptive routing protocols for determining optimal paths in AI multi-agent systems: a priority- and learning-enhanced approach
链接: https://arxiv.org/abs/2503.07686
作者: Theodor Panayotov,Ivo Emanuilov
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As distributed artificial intelligence (AI) and multi-agent architectures grow increasingly complex, the need for adaptive, context-aware routing becomes paramount. This paper introduces an enhanced, adaptive routing algorithm tailored for AI multi-agent networks, integrating priority-based cost functions and dynamic learning mechanisms. Building on an extended Dijkstra-based framework, we incorporate multi-faceted parameters such as task complexity, user request priority, agent capabilities, bandwidth, latency, load, model sophistication, and reliability. We further propose dynamically adaptive weighting factors, tuned via reinforcement learning (RL), to continuously evolve routing policies based on observed network performance. Additionally, heuristic filtering and hierarchical routing structures improve scalability and responsiveness. Our approach yields context-sensitive, load-aware, and priority-focused routing decisions that not only reduce latency for critical tasks but also optimize overall resource utilization, ultimately enhancing the robustness, flexibility, and efficiency of multi-agent systems.
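The extended Dijkstra idea above can be sketched as a shortest-path search with a composite, priority-scaled edge cost. The particular cost function, weights, and edge attributes below are illustrative assumptions (in the paper these weights are tuned via RL):

```python
import heapq

def edge_cost(edge, priority, w_latency=1.0, w_load=0.5):
    # Hypothetical composite cost: higher-priority requests see cheaper edges.
    return (edge["latency"] * w_latency + edge["load"] * w_load) / priority

def best_path(graph, src, dst, priority=1.0):
    """Dijkstra over a dict graph {node: {neighbor: edge_attrs}}."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, e in graph.get(u, {}).items():
            nd = d + edge_cost(e, priority)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return list(reversed(path)), dist[dst]
```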
[AI-54] Ways of Seeing and Selling AI Art
链接: https://arxiv.org/abs/2503.07685
作者: Imke van Heerden
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures
点击查看摘要
Abstract:In early 2025, Augmented Intelligence - Christie’s first AI art auction - drew criticism for showcasing a controversial genre. Amid wider legal uncertainty, artists voiced concerns over data mining practices, notably with respect to copyright. The backlash could be viewed as a microcosm of AI’s contested position in the creative economy. Touching on the auction’s presentation, reception, and results, this paper explores how, among social dissonance, machine learning finds its place in the artworld. Foregrounding responsible innovation, the paper provides a balanced perspective that champions creators’ rights and brings nuance to this polarised debate. With a focus on exhibition design, it centres framing, which refers to the way a piece is presented to influence consumer perception. Context plays a central role in shaping our understanding of how good, valuable, and even ethical an artwork is. In this regard, Augmented Intelligence situates AI art within a surprisingly traditional framework, leveraging hallmarks of “high art” to establish the genre’s cultural credibility. Generative AI has a clear economic dimension, converging questions of artistic merit with those of monetary worth. Scholarship on ways of seeing, or framing, could substantively inform the interpretation and evaluation of creative outputs, including assessments of their aesthetic and commercial value.
[AI-55] A Time Series Multitask Framework Integrating a Large Language Model Pre-Trained Time Series Model and Knowledge Graph
链接: https://arxiv.org/abs/2503.07682
作者: Shule Hao,Junpeng Bao,Chuncheng Lu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LTM combines pre-trained time series model, large language model (LLM), and knowledge graph to tackle time series tasks, including forecasting, imputation, and anomaly detection. LTM achieves improved performance with a few trainable parameters. It is very efficient and practical. LTM encodes time series data into patches and enriches user-provided prompts using knowledge graphs to generate enhanced prompts. A novel feature fusion method embeds prompts into each patch encoding, which is processed by a frozen LLM, followed by a feature enhancement module and a time decoder module. During fine-tuning stage, cosine similarity between prompts and temporal patches is integrated into the loss function to boost performance. Experiments on benchmark datasets show that LTM significantly outperforms existing methods. It provides a robust and versatile solution for time series tasks.
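The fine-tuning trick mentioned above, folding prompt/patch cosine similarity into the loss, can be sketched as follows. The sign convention and weighting are assumptions for illustration; the paper's exact formulation may differ:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def loss_with_alignment(task_loss, prompt_emb, patch_embs, alpha=0.1):
    """Hypothetical combined loss: subtract a bonus proportional to the mean
    cosine similarity between the prompt embedding and temporal patch embeddings,
    so training favors prompt/patch alignment."""
    sim = sum(cosine_similarity(prompt_emb, p) for p in patch_embs) / len(patch_embs)
    return task_loss - alpha * sim
```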
[AI-56] Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM
链接: https://arxiv.org/abs/2503.07680
作者: Yongqiang Yao,Jingru Tan,Kaihuan Liang,Feizhao Zhang,Yazhe Niu,Jiahao Hu,Ruihao Gong,Dahua Lin,Ningyi Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, the HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MOE model, our method speeds up the training by 2.4× with competitive performance.
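The "multi-level data packing groups" idea can be sketched as a two-step procedure: assign each sample to the smallest packing length that fits it, then greedily pack samples within each group. This is only a first-fit illustration of the batch-construction step; HBP additionally tunes parallelism and checkpointing per group:

```python
def assign_packing_groups(sample_lengths, group_caps):
    """Route each sample to the smallest capacity that fits it, then
    first-fit-decreasing pack within each group (illustrative sketch)."""
    caps = sorted(group_caps)
    groups = {cap: [] for cap in caps}
    for length in sample_lengths:
        for cap in caps:
            if length <= cap:
                groups[cap].append(length)
                break
    packs = {}
    for cap, lens in groups.items():
        bins = []
        for length in sorted(lens, reverse=True):
            for b in bins:
                if sum(b) + length <= cap:
                    b.append(length)
                    break
            else:
                bins.append([length])
        packs[cap] = bins
    return packs
```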
[AI-57] Using a single actor to output personalized policy for different intersections
链接: https://arxiv.org/abs/2503.07678
作者: Kailing Zhou,Chengwei Zhang,Furui Zhan,Wanting Liu,Yihong Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Recently, with the development of Multi-agent reinforcement learning (MARL), adaptive traffic signal control (ATSC) has achieved satisfactory results. In traffic scenarios with multiple intersections, MARL treats each intersection as an agent and optimizes traffic signal control strategies through learning and real-time decision-making. Considering that observation distributions of intersections might be different in real-world scenarios, shared parameter methods might lack diversity and thus lead to high generalization requirements in the shared-policy network. A typical solution is to increase the size of network parameters. However, simply increasing the scale of the network does not necessarily improve policy generalization, which is validated in our experiments. Accordingly, an approach that considers both the personalization of intersections and the efficiency of parameter sharing is required. To this end, we propose Hyper-Action Multi-Head Proximal Policy Optimization (HAMH-PPO), a Centralized Training with Decentralized Execution (CTDE) MARL method that utilizes a shared PPO policy network to deliver personalized policies for intersections with non-iid observation distributions. The centralized critic in HAMH-PPO uses graph attention units to calculate the graph representations of all intersections and outputs a set of value estimates with multiple output heads for each intersection. The decentralized execution actor takes the local observation history as input and output distributions of action as well as a so-called hyper-action to balance the multiple values estimated from the centralized critic to further guide the updating of TSC policies. The combination of hyper-action and multi-head values enables multiple agents to share a single actor-critic while achieving personalized policies.
[AI-58] PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity
链接: https://arxiv.org/abs/2503.07677
作者: Kwanyoung Kim,Byeongsu Sim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 29 pages, 19 figures
点击查看摘要
Abstract:Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.
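The core contrast above, dense softmax versus a sparse counterpart of it, can be sketched with sparsemax, which projects scores onto the simplex and zeroes out low-scoring entries. The extrapolation step and its scale `lam` below are assumptions for illustration, not PLADIS's exact formula:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def sparsemax(z):
    """Euclidean projection onto the simplex: unlike softmax, it assigns
    exact zeros to low-scoring entries."""
    zs = sorted(z, reverse=True)
    cum, tau = 0.0, 0.0
    for i, zi in enumerate(zs, start=1):
        cum += zi
        if 1 + i * zi > cum:
            tau = (cum - 1) / i
        else:
            break
    return [max(x - tau, 0.0) for x in z]

def extrapolated_weights(scores, lam=1.0):
    # Move the dense weights toward (lam=1: onto) their sparse counterpart,
    # then clip and renormalize. Hypothetical sketch of the extrapolation idea.
    dense, sparse = softmax(scores), sparsemax(scores)
    out = [max(d + lam * (s - d), 0.0) for d, s in zip(dense, sparse)]
    t = sum(out)
    return [x / t for x in out]
```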
[AI-59] The Janus Face of Innovation: Global Disparities and Divergent Options
链接: https://arxiv.org/abs/2503.07676
作者: Nihat Mugurtay
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 10 pages
点击查看摘要
Abstract:This article examines how unequal access to AI innovation creates systemic challenges for developing countries. Differential access to AI innovation results from the acute competition between domestic and global actors. While developing nations contribute significantly to AI development through data annotation labor, they face limited access to advanced AI technologies and are increasingly caught between divergent regulatory approaches from democratic and authoritarian tendencies. This brief paper analyzes how more affordable AI engagement and Western countries’ development cooperation present developing nations with a complex choice between accessibility and governance standards. I argue this challenge entails new institutional mechanisms for technology transfer and regulatory cooperation, while carefully balancing universal standards with local needs. In turn, good practices could help developing countries close the deepening gap of global technological divides, while ensuring responsible AI development in developing countries.
[AI-60] DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems
链接: https://arxiv.org/abs/2503.07675
作者: Junwei Yu,Yepeng Ding,Hiroyuki Sato
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:The emergence of Large Language Models (LLMs) in Multi-Agent Systems (MAS) has opened new possibilities for artificial intelligence, yet current implementations face significant challenges in resource management, task coordination, and system efficiency. While existing frameworks demonstrate the potential of LLM-based agents in collaborative problem-solving, they often lack sophisticated mechanisms for parallel execution and dynamic task management. This paper introduces DynTaskMAS, a novel framework that orchestrates asynchronous and parallel operations in LLM-based MAS through dynamic task graphs. The framework features four key innovations: (1) a Dynamic Task Graph Generator that intelligently decomposes complex tasks while maintaining logical dependencies, (2) an Asynchronous Parallel Execution Engine that optimizes resource utilization through efficient task scheduling, (3) a Semantic-Aware Context Management System that enables efficient information sharing among agents, and (4) an Adaptive Workflow Manager that dynamically optimizes system performance. Experimental evaluations demonstrate that DynTaskMAS achieves significant improvements over traditional approaches: a 21-33% reduction in execution time across task complexities (with higher gains for more complex tasks), a 35.4% improvement in resource utilization (from 65% to 88%), and near-linear throughput scaling up to 16 concurrent agents (3.47X improvement for 4X agents). Our framework establishes a foundation for building scalable, high-performance LLM-based multi-agent systems capable of handling complex, dynamic tasks efficiently.
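The asynchronous execution engine described above boils down to running each task as soon as its dependencies in the task graph have finished. A minimal sketch with `asyncio` (a generic DAG executor, not the DynTaskMAS implementation):

```python
import asyncio

async def run_dag(tasks, deps):
    """Run coroutine functions concurrently, each waiting for its prerequisites.
    tasks: {name: async function}; deps: {name: set of prerequisite names}."""
    done_events = {name: asyncio.Event() for name in tasks}
    order = []

    async def runner(name):
        for d in deps.get(name, set()):
            await done_events[d].wait()  # block until each prerequisite finishes
        await tasks[name]()
        order.append(name)
        done_events[name].set()

    await asyncio.gather(*(runner(n) for n in tasks))
    return order
```

Independent tasks overlap in time; a task with dependencies starts only after all of them complete, which is what enables the near-linear throughput scaling the abstract reports.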
[AI-61] TVNet: A Novel Time Series Analysis Method Based on Dynamic Convolution and 3D-Variation ICLR2025
链接: https://arxiv.org/abs/2503.07674
作者: Chenghan Li,Mingchen Li,Ruisheng Diao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025
点击查看摘要
Abstract:With the recent development and advancement of Transformer and MLP architectures, significant strides have been made in time series analysis. Conversely, the performance of Convolutional Neural Networks (CNNs) in time series analysis has fallen short of expectations, diminishing their potential for future applications. Our research aims to enhance the representational capacity of Convolutional Neural Networks (CNNs) in time series analysis by introducing novel perspectives and design innovations. To be specific, we introduce a novel time series reshaping technique that considers the inter-patch, intra-patch, and cross-variable dimensions. Consequently, we propose TVNet, a dynamic convolutional network leveraging a 3D perspective to employ time series analysis. TVNet retains the computational efficiency of CNNs and achieves state-of-the-art results in five key time series analysis tasks, offering a superior balance of efficiency and performance over the state-of-the-art Transformer-based and MLP-based models. Additionally, our findings suggest that TVNet exhibits enhanced transferability and robustness. Therefore, it provides a new perspective for applying CNN in advanced time series analysis tasks.
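The reshaping idea, viewing a multivariate series along intra-patch, inter-patch, and cross-variable dimensions, amounts to slicing each variable's sequence into fixed-length patches, giving a 3D tensor of shape [variables, patches, patch length]. A minimal sketch (the dynamic convolution itself is not reproduced):

```python
def reshape_3d(series, patch_len):
    """Turn [variables][time] data into [variables][patches][patch_len],
    dropping any trailing remainder (illustrative sketch)."""
    out = []
    for var in series:
        n = len(var) // patch_len
        out.append([var[i * patch_len:(i + 1) * patch_len] for i in range(n)])
    return out
```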
[AI-62] The potential role of AI agents in transforming nuclear medicine research and cancer management in India
链接: https://arxiv.org/abs/2503.07673
作者: Rajat Vashistha,Arif Gulzar,Parveen Kundu,Punit Sharma,Mark Brunstein,Viktor Vegh
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:India faces a significant cancer burden, with an incidence-to-mortality ratio indicating that nearly three out of five individuals diagnosed with cancer succumb to the disease. While the limitations of physical healthcare infrastructure are widely acknowledged as a primary challenge, concerted efforts by government and healthcare agencies are underway to mitigate these constraints. However, given the country’s vast geography and high population density, it is imperative to explore alternative soft infrastructure solutions to complement existing frameworks. Artificial Intelligence agents are increasingly transforming problem-solving approaches across various domains, with their application in medicine proving particularly transformative. In this perspective, we examine the potential role of AI agents in advancing nuclear medicine for cancer research, diagnosis, and management in India. We begin with a brief overview of AI agents and their capabilities, followed by a proposed agent-based ecosystem that can address prevailing sustainability challenges in India’s nuclear medicine.
[AI-63] Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
链接: https://arxiv.org/abs/2503.07663
作者: Dingkun Zhang,Shuhan Qi,Xinyu Xiao,Kehai Chen,Xuan Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Considering the heavy cost of training MLLMs, it is necessary to reuse the existing ones and further extend them to more modalities through Modality-incremental Continual Learning (MCL). However, this often comes with a performance degradation in the previously learned modalities. In this work, we revisit the MCL and investigate a more severe issue it faces in contrast to traditional continual learning, that its degradation comes not only from catastrophic forgetting but also from the misalignment between the modality-agnostic and modality-specific components. To address this problem, we propose an elegantly simple MCL paradigm called “MErge then ReAlign” (MERA). Our method avoids introducing heavy training overhead or modifying the model architecture, hence is easy to deploy and highly reusable in the MLLM community. Extensive experiments demonstrate that, despite the simplicity of MERA, it shows impressive performance, holding up to a 99.84% Backward Relative Gain when extending to four modalities, achieving a nearly lossless MCL performance.
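The "merge" half of the paradigm typically means weighted parameter averaging across models that share an architecture. A minimal sketch with parameters represented as flat lists (the realignment step of MERA is not reproduced here):

```python
def merge_models(state_dicts, weights=None):
    """Weighted average of parameters across models with identical keys
    and shapes; uniform weights by default (illustrative sketch)."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = [
            sum(w * sd[key][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][key]))
        ]
    return merged
```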
[AI-64] Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
链接: https://arxiv.org/abs/2503.07661
作者: Wei Junhao,Yu Zhe,Sakuma Jun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 12 pages, 7 figures
点击查看摘要
Abstract:Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose a first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if the model is merged with any other model, while its functionality is kept unchanged if not merged with others. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the post-protect model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning to improve our proposal’s robustness.
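The "rearranging MLP parameters" module rests on a classical fact: permuting a hidden layer's units (rows of the first weight matrix together with the matching columns of the next) leaves the network's function unchanged, while moving its parameters away from an unpermuted copy's, so naive averaging with another model degrades. A pure-Python sketch of that invariance (not the paper's actual procedure):

```python
def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def mlp(W1, W2, x):
    """Two-layer MLP: y = W2 · relu(W1 · x)."""
    return matvec(W2, relu(matvec(W1, x)))

def permute_hidden(W1, W2, perm):
    """Rearrange hidden units: permute rows of W1 and columns of W2 identically.
    The permuted network computes exactly the same function."""
    W1p = [W1[p] for p in perm]
    W2p = [[row[p] for p in perm] for row in W2]
    return W1p, W2p
```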
[AI-65] Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
链接: https://arxiv.org/abs/2503.07660
作者: HyunJin Kim,Xiaoyuan Yi,Jing Yao,Muhua Huang,JinYeong Bak,James Evans,Xing Xie
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The recent leap in AI capabilities, driven by big generative models, has sparked the possibility of achieving Artificial General Intelligence (AGI) and further triggered discussions on Artificial Superintelligence (ASI), a system surpassing all humans across all domains. This gives rise to the critical research question of: If we realize ASI, how do we align it with human values, ensuring it benefits rather than harms human society, a.k.a., the Superalignment problem. Despite ASI being regarded by many as solely a hypothetical concept, in this paper, we argue that superalignment is achievable and research on it should advance immediately, through simultaneous and alternating optimization of task competence and value conformity. We posit that superalignment is not merely a safeguard for ASI but also necessary for its realization. To support this position, we first provide a formal definition of superalignment rooted in the gap between capability and capacity and elaborate on our argument. Then we review existing paradigms, explore their interconnections and limitations, and illustrate a potential path to superalignment centered on two fundamental principles. We hope this work sheds light on a practical approach for developing the value-aligned next-generation AI, garnering greater benefits and reducing potential harms for humanity.
[AI-66] SplitQuantV2: Enhancing Low-Bit Quantization of LLM s Without GPUs
链接: https://arxiv.org/abs/2503.07657
作者: Jaewoo Song,Fangzhen Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. This limitation poses challenges for using such algorithms on various neural processing units (NPUs) and edge AI devices, which have diverse model formats and frameworks. In this paper, we show SplitQuantV2, an innovative algorithm designed to enhance low-bit linear quantization of LLMs, can achieve results comparable to those of advanced algorithms. SplitQuantV2 preprocesses models by splitting linear and convolution layers into functionally equivalent, quantization-friendly structures. The algorithm’s platform-agnostic, concise, and efficient nature allows for implementation without the need for GPUs. Our evaluation on the Llama 3.2 1B Instruct model using the AI2’s Reasoning Challenge (ARC) dataset demonstrates that SplitQuantV2 improves the accuracy of the INT4 quantization model by 11.76%p, matching the performance of the original floating-point model. Remarkably, SplitQuantV2 took only 2 minutes 6 seconds to preprocess the 1B model and perform linear INT4 quantization using only an Apple M4 CPU. SplitQuantV2 provides a practical solution for low-bit quantization on LLMs, especially when complex, computation-intensive algorithms are inaccessible due to hardware limitations or framework incompatibilities.
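For reference, the "basic linear quantization" the paper builds on maps floats to a small integer range with a single scale. The sketch below is symmetric per-tensor INT4 quantization, not SplitQuantV2's layer-splitting preprocessing:

```python
def quantize_int4(weights):
    """Symmetric linear INT4 quantization: integers in [-8, 7] with one scale."""
    m = max(abs(w) for w in weights)
    scale = m / 7.0 if m > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```

Splitting a layer into functionally equivalent pieces, as SplitQuantV2 does, lets each piece use a scale better matched to its weight range, reducing the rounding error of this step.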
[AI-67] GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention
链接: https://arxiv.org/abs/2503.07655
作者: Sangyeup Kim,Nayeon Kim,Yinhua Piao,Sun Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Molecular language modeling tasks such as molecule captioning have been recognized for their potential to further understand molecular properties that can aid drug discovery or material synthesis based on chemical reactions. Unlike the common use of molecule graphs in predicting molecular properties, most methods in molecular language modeling rely heavily on SMILES sequences. This preference is because the task involves generating a sequence of multiple tokens using transformer-based models. Therefore, a main challenge is determining how to integrate graph data, which contains structural and spatial information about molecules, with text data. In addition, simply using both 1D SMILES text and 2D graph as inputs without addressing how they align and represent the molecule structure in different modalities makes it challenging to fully utilize structural knowledge about molecules. To this end, we propose GraphT5, a multi-modal framework that integrates 1D SMILES text and 2D graph representations of molecules for molecular language modeling. Specifically, we introduce a novel cross-token attention module in GraphT5 to bridge the gap arising from the fundamental differences between the two modalities of molecule representations. Cross-token attention exploits implicit information between SMILES and graphs of molecules, resulting from their interactions at a fine-grained token level that benefits molecular language modeling. Extensive experiments including molecule captioning, IUPAC name prediction tasks, and case studies show that our GraphT5 outperforms the latest baseline approaches, which validates the effectiveness of our GraphT5 in sufficiently utilizing 1D SMILES text and 2D graph representations.
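摘要中的 cross-token attention 本质上可以理解为标准的缩放点积注意力:SMILES token 嵌入作查询(query),图节点嵌入作键/值(key/value)。下面是一个不依赖任何框架的最小示意(使用编造的二维嵌入,并非 GraphT5 模块本身):

```python
import math

def cross_token_attention(text_tokens, graph_nodes):
    """Each text-token embedding attends over graph-node embeddings and
    returns a softmax-weighted mixture of the node vectors."""
    d = len(graph_nodes[0])
    out = []
    for q in text_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in graph_nodes]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(wi * k[j] for wi, k in zip(w, graph_nodes))
                    for j in range(d)])
    return out

# toy: 2 "SMILES token" embeddings attend over 3 "graph node" embeddings
tokens = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[5.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
mixed = cross_token_attention(tokens, nodes)
print(mixed)  # each token pulls mostly from the node it aligns with
```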
[AI-68] Insights into Schizophrenia: Leveraging Machine Learning for Early Identification via EEG, ERP, and Demographic Attributes
链接: https://arxiv.org/abs/2503.07650
作者: Sara Alkhalifa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 6 figures and 2 tables
点击查看摘要
Abstract:The research presents a machine learning (ML) classifier designed to differentiate between schizophrenia patients and healthy controls by utilising features extracted from electroencephalogram (EEG) data, specifically focusing on event-related potentials (ERPs) and certain demographic variables. The dataset comprises data from 81 participants, encompassing 32 healthy controls and 49 schizophrenia patients, all sourced from an online dataset. After preprocessing the dataset, our ML model achieved an accuracy of 99.980%. This performance outperforms earlier research, including those that used deep learning methods. Additionally, an analysis was conducted to assess individual features’ contribution to improving classification accuracy. This involved systematically excluding specific features from the original dataset one at a time, and another technique involved an iterative process of removing features based on their entropy scores incrementally. The impact of these removals on model performance was evaluated to identify the most informative features.
[AI-69] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster
链接: https://arxiv.org/abs/2503.07649
作者: Kanghui Ning,Zijie Pan,Yu Liu,Yushan Jiang,James Y. Zhang,Kashif Rasul,Anderson Schneider,Lintao Ma,Yuriy Nevmyvaka,Dongjin Song
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recently, Large Language Models (LLMs) and Foundation Models (FMs) have become prevalent for time series forecasting tasks. However, fine-tuning large language models (LLMs) for forecasting enables the adaptation to specific domains but may not generalize well across diverse, unseen datasets. Meanwhile, existing time series foundation models (TSFMs) lack inherent mechanisms for domain adaptation and suffer from limited interpretability, making them suboptimal for zero-shot forecasting. To this end, we present TS-RAG, a retrieval-augmented generation based time series forecasting framework that enhances the generalization capability and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant time series segments from a dedicated knowledge database, incorporating contextual patterns for the given time series query. Next, we develop a learnable Mixture-of-Experts (MoE)-based augmentation module, which dynamically fuses retrieved time series patterns with the TSFM’s representation of the input query, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming TSFMs by up to 6.51% across diverse domains and showcasing desired interpretability.
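检索增强预测的骨架(检索语义相近的历史片段,再把它们的后续走势融合进预测)可以完全不依赖基础模型来演示。下面用正弦序列构造的玩具知识库和最近邻平均,同时充当"预训练编码器"与"MoE 融合模块"的替身,一切均为示意性假设:

```python
import math

# knowledge base: sliding windows of a series paired with their continuations
series = [math.sin(0.2 * t) for t in range(500)]
W, H = 16, 4                                  # window length, forecast horizon
db = [(series[i:i + W], series[i + W:i + W + H]) for i in range(400 - W - H)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rag_forecast(query, k=3):
    """Retrieve the k most similar stored windows; average their continuations."""
    top = sorted(db, key=lambda e: sqdist(e[0], query))[:k]
    return [sum(cont[h] for _, cont in top) / k for h in range(H)]

# query from a later, unseen stretch of the series
q0 = 450
query, truth = series[q0:q0 + W], series[q0 + W:q0 + W + H]
pred = rag_forecast(query)
naive = [query[-1]] * H                       # persistence baseline
err_rag = sum(abs(p - t) for p, t in zip(pred, truth))
err_naive = sum(abs(p - t) for p, t in zip(naive, truth))
print(f"retrieval-augmented error {err_rag:.3f} vs naive {err_naive:.3f}")
```

对周期性模式,检索到的相似片段的延续几乎就是正确答案,因此零样本场景下明显优于持值基线。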
[AI-70] ConstellationNet: Reinventing Spatial Clustering through GNNs
链接: https://arxiv.org/abs/2503.07643
作者: Aidan Gao,Junhong Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Spatial clustering is a crucial field, finding universal use across criminology, pathology, and urban planning. However, most spatial clustering algorithms cannot pull information from nearby nodes and suffer performance drops when dealing with higher dimensionality and large datasets, making them suboptimal for large-scale and high-dimensional clustering. Due to modern data growing in size and dimension, clustering algorithms become weaker when addressing multifaceted issues. To improve upon this, we develop ConstellationNet, a convolutional neural network (CNN)-graph neural network (GNN) framework that leverages the embedding power of a CNN, the neighbor aggregation of a GNN, and a neural network’s ability to deal with batched data to improve spatial clustering and classification with graph augmented predictions. ConstellationNet achieves state-of-the-art performance on both supervised classification and unsupervised clustering across several datasets, outperforming state-of-the-art classification and clustering while reducing model size and training time by up to tenfold and improving baselines by 10 times. Because of its fast training and powerful nature, ConstellationNet holds promise in fields like epidemiology and medical imaging, able to quickly train on new data to develop robust responses.
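ConstellationNet 借自 GNN 的"邻居聚合"这一要素可以单独演示:对图做一轮均值聚合,每个节点向其邻居靠拢,簇内离散度在聚类之前就被压缩。以下是用编造的一维特征做的示意:

```python
# one round of mean neighbor aggregation: each node's feature is smoothed
# toward its neighbors', tightening clusters before any clustering step
def aggregate(features, adj):
    out = []
    for i, f in enumerate(features):
        nbrs = [features[j] for j in adj[i]] + [f]
        out.append([sum(v[d] for v in nbrs) / len(nbrs) for d in range(len(f))])
    return out

# two noisy clusters, each fully connected internally
feats = [[0.0], [0.4], [0.2], [5.0], [5.4], [5.2]]
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
smoothed = aggregate(feats, adj)
print(smoothed)  # within each cluster the features collapse to the cluster mean
```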
[AI-71] Deep ARTMAP: Generalized Hierarchical Learning with Adaptive Resonance Theory
链接: https://arxiv.org/abs/2503.07641
作者: Niklas M. Melton,Leonardo Enzo Brito da Silva,Sasha Petrenko,Donald. C. Wunsch II
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:This paper presents Deep ARTMAP, a novel extension of the ARTMAP architecture that generalizes the self-consistent modular ART (SMART) architecture to enable hierarchical learning (supervised and unsupervised) across arbitrary transformations of data. The Deep ARTMAP framework operates as a divisive clustering mechanism, supporting an arbitrary number of modules with customizable granularity within each module. Inter-ART modules regulate the clustering at each layer, permitting unsupervised learning while enforcing a one-to-many mapping from clusters in one layer to the next. While Deep ARTMAP reduces to both ARTMAP and SMART in particular configurations, it offers significantly enhanced flexibility, accommodating a broader range of data transformations and learning modalities.
[AI-72] BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification
链接: https://arxiv.org/abs/2503.07640
作者: Jing Zhang,Xiaowei Yu,Tong Chen,Chao Cao,Mingheng Chen,Yan Zhuang,Yanjun Lyu,Lu Zhang,Li Su,Tianming Liu,Dajiang Zhu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注:
点击查看摘要
Abstract:The Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer’s disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing “neural-level networks”. Our work represents a pioneering effort in modeling a system-level artificial neural network, called BrainNet-MoE, for brain modeling and diagnosis. Inspired by the brain’s hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-networks under different conditions. A disease gate mechanism guides the specialization of expert groups, while a transformer layer enables communication between all sub-networks, generating a comprehensive whole-brain representation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.
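摘要中的 disease gate(由门控为给定的脑子网络表征打分、决定各专家组的贡献,输出为门控加权的专家混合)就是软性混合专家。下面是一个极简示意,使用两个手工设定的线性专家和假设的二维特征;这并非 BrainNet-MoE 架构本身:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def moe(x, experts, gate_weights):
    """Gate scores decide how much each expert contributes; the output is the
    gate-weighted sum of expert outputs."""
    scores = softmax([sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights])
    return sum(s * f(x) for s, f in zip(scores, experts)), scores

# two hand-wired "disease experts" over an assumed 2-d sub-network feature
experts = [lambda x: x[0] - x[1],   # expert specialised to pattern A
           lambda x: x[1] - x[0]]   # expert specialised to pattern B
gate = [[4.0, 0.0], [0.0, 4.0]]     # gate routes by which feature dominates

y, scores = moe([1.0, 0.0], experts, gate)
print(y, scores)  # gate sends almost all weight to expert A
```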
[AI-73] Leveraging Taxonomy Similarity for Next Activity Prediction in Patient Treatment
链接: https://arxiv.org/abs/2503.07638
作者: Martin Kuhn,Joscha Grüger,Tobias Geyer,Ralph Bergmann
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid progress in modern medicine presents physicians with complex challenges when planning patient treatment. Techniques from the field of Predictive Business Process Monitoring, like Next-activity-prediction (NAP) can be used as a promising technique to support physicians in treatment planning, by proposing a possible next treatment step. Existing patient data, often in the form of electronic health records, can be analyzed to recommend the next suitable step in the treatment process. However, the use of patient data poses many challenges due to its knowledge-intensive character, high variability and scarcity of medical data. To overcome these challenges, this article examines the use of the knowledge encoded in taxonomies to improve and explain the prediction of the next activity in the treatment process. This study proposes the TS4NAP approach, which uses medical taxonomies (ICD-10-CM and ICD-10-PCS) in combination with graph matching to assess the similarities of medical codes to predict the next treatment step. The effectiveness of the proposed approach will be evaluated using event logs that are derived from the MIMIC-IV dataset. The results highlight the potential of using domain-specific knowledge held in taxonomies to improve the prediction of the next activity, and thus can improve treatment planning and decision-making by making the predictions more explainable.
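"利用分类体系相似度"的直觉可以用层级编码字符串示意:两个 ICD 风格的编码共享的前缀越深,相似度越高。下面的编码采用 ICD-10-CM 的形态,但对应的"下一步活动"是编造的占位符,且共享前缀长度只是论文中图匹配度量的简化替身:

```python
def taxonomy_similarity(code_a, code_b):
    """Fraction of the hierarchical code shared as a prefix: deeper shared
    ancestry in the taxonomy means higher similarity."""
    shared = 0
    for x, y in zip(code_a, code_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(code_a), len(code_b))

# toy lookup: recommend the next activity of the most similar historical case
# (the code -> activity pairs are invented placeholders)
history = {"J18.9": "antibiotics", "I21.4": "angioplasty", "J20.9": "bronchodilator"}
query = "J18.1"
best = max(history, key=lambda c: taxonomy_similarity(query, c))
print(best, history[best])
```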
[AI-74] An Optimization Algorithm for Multimodal Data Alignment ACL
链接: https://arxiv.org/abs/2503.07636
作者: Wei Zhang,Xinyue Wang,Lan Yu,Shi Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ACL SRW submission
点击查看摘要
Abstract:In the data era, the integration of multiple data types, known as multimodality, has become a key area of interest in the research community. This interest is driven by the goal to develop cutting-edge multimodal models capable of serving as adaptable reasoning engines across a wide range of modalities and domains. Despite the fervent development efforts, the challenge of optimally representing different forms of data within a single unified latent space (a crucial step for enabling effective multimodal reasoning) has not been fully addressed. To bridge this gap, we introduce AlignXpert, an optimization algorithm inspired by Kernel CCA and crafted to maximize the similarities between N modalities while imposing some other constraints. This work demonstrates the impact on improving data representation for a variety of reasoning tasks, such as retrieval and classification, underlining the pivotal importance of data representation.
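AlignXpert 背后的 CCA 目标(为每个模态各找一个投影方向,使投影后的两个视图相关性最大)可以用对单位方向的暴力网格搜索来演示。这是一个含共享隐变量的双视图玩具设定,并非论文算法本身:

```python
import math, random

random.seed(1)
n = 200
z = [random.gauss(0, 1) for _ in range(n)]           # shared latent signal
X = [(z[i] + 0.1 * random.gauss(0, 1), random.gauss(0, 1)) for i in range(n)]
Y = [(z[i] + 0.1 * random.gauss(0, 1), random.gauss(0, 1)) for i in range(n)]

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca, cb = [x - ma for x in a], [y - mb for y in b]
    num = sum(x * y for x, y in zip(ca, cb))
    return num / math.sqrt(sum(x * x for x in ca) * sum(y * y for y in cb))

# grid-search a unit projection direction for each view (the CCA objective)
angles = [k * math.pi / 36 for k in range(36)]
projX = {a: [math.cos(a) * u + math.sin(a) * v for u, v in X] for a in angles}
projY = {a: [math.cos(a) * u + math.sin(a) * v for u, v in Y] for a in angles}
best = max(abs(corr(projX[ax], projY[ay])) for ax in angles for ay in angles)
print(f"max canonical correlation = {best:.3f}")  # close to 1: shared latent found
```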
[AI-75] Impact of Level 2/3 Automated Driving Technology on Road Work Zone Safety
链接: https://arxiv.org/abs/2503.07634
作者: Zhepu Xu,Ziyi Song,Yupu Dong,Peiyan Chen
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:As China’s road network enters the maintenance era, work zones will become a common sight on the roads. With the development of automated driving, vehicles equipped with Level 2/3 automated driving capabilities will also become a common presence on the roads. When these vehicles pass through work zones, automated driving may disengage, which can have complex effects on traffic safety. This paper explores the impact of Level 2/3 automated driving technology on road safety in high-speed highway work zone environments. Through microscopic traffic simulation method and using full-type traffic conflict technique, factors such as market penetration rate (MPR), traffic volume level, disengagement threshold, and driver takeover style are studied to understand their impact on work zone safety. The study found that the impact of automated driving technology on work zone safety is complex. Disengagement of automated vehicles in work zones reduces the proportion of vehicles that can maintain automated driving status. If takeover is not timely or adequate, it can easily lead to new traffic conflicts. Different factors have varying degrees of impact on work zone safety. Increasing MPR helps reduce the occurrence of single-vehicle conflicts, but it also increases the possibility of multi-vehicle conflicts. Therefore, future research and improvement directions should focus on optimizing the disengagement detection and takeover mechanisms of automated driving systems.
[AI-76] Junior Software Developers' Perspectives on Adopting LLMs for Software Engineering: a Systematic Literature Review
链接: https://arxiv.org/abs/2503.07556
作者: Samuel Ferino,Rashina Hoda,John Grundy,Christoph Treude
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Many studies exploring the adoption of Large Language Model-based tools for software development by junior developers have emerged in recent years. These studies have sought to understand developers’ perspectives about using those tools, a fundamental pillar for successfully adopting LLM-based tools in Software Engineering. The aim of this paper is to provide an overview of junior software developers’ perspectives and use of LLM-based tools for software engineering (LLM4SE). We conducted a systematic literature review (SLR) following guidelines by Kitchenham et al. on 56 primary studies, applying the definition for junior software developers as software developers with equal or less than five years of experience, including Computer Science/Software Engineering students. We found that the majority of the studies focused on comprehending the different aspects of integrating AI tools in SE. Only 8.9% of the studies provide a clear definition for junior software developers, and there is no uniformity. Searching for relevant information is the most common task using LLM tools. ChatGPT was the most common LLM tool present in the studies (and experiments). A majority of the studies (83.9%) report both positive and negative perceptions about the impact of adopting LLM tools. We also found and categorised advantages, challenges, and recommendations regarding LLM adoption. Our results indicate that developers are using LLMs not just for code generation, but also to improve their development skills. Critically, they are not just experiencing the benefits of adopting LLM tools, but they are also aware of at least a few LLM limitations, such as the generation of wrong suggestions, potential data leaking, and AI hallucination. Our findings offer implications for software engineering researchers, educators, and developers.
[AI-77] Addressing Selection Bias in Computerized Adaptive Testing: A User-Wise Aggregate Influence Function Approach CIKM2023
链接: https://arxiv.org/abs/2308.11912
作者: Soonwoo Kwon,Sojung Kim,Seunghyun Lee,Jin-Young Kim,Suyeong An,Kyuseok Kim
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: CIKM 2023
点击查看摘要
Abstract:Computerized Adaptive Testing (CAT) is a widely used, efficient test mode that adapts to the examinee’s proficiency level in the test domain. CAT requires pre-trained item profiles, since CAT iteratively assesses the student in real time based on the registered items’ profiles, and selects the next item to administer using candidate items’ profiles. However, obtaining such item profiles is a costly process that involves gathering a large, dense item-response dataset, then training a diagnostic model on the collected data. In this paper, we explore the possibility of leveraging response data collected in the CAT service. We first show that this poses a unique challenge due to the inherent selection bias introduced by CAT, i.e., more proficient students will receive harder questions. Indeed, when naively training the diagnostic model using CAT response data, we observe that item profiles deviate significantly from the ground-truth. To tackle the selection bias issue, we propose the user-wise aggregate influence function method. Our intuition is to filter out users whose response data is heavily biased in an aggregate manner, as judged by how much perturbation the added data will introduce during parameter estimation. This way, we may enhance the performance of CAT while introducing minimal bias to the item profiles. We provide extensive experiments to demonstrate the superiority of our proposed method based on the three public datasets and one dataset that contains real-world CAT response data.
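"按聚合影响力剔除用户"的直觉可以在最简单的估计量(均值)上示意,因为其留一影响函数有闭式解。下面的"偏置子群"扮演 CAT 选择偏差的角色,所有数值均为示意:

```python
import random

random.seed(5)
# responses used to estimate an item parameter; a selected subgroup is biased high
clean = [random.gauss(0.0, 1.0) for _ in range(200)]
biased = [random.gauss(3.0, 0.2) for _ in range(20)]   # stand-in for selection bias
data = clean + biased

def loo_influence(xs):
    """Leave-one-out influence of each point on the mean estimate."""
    m, n = sum(xs) / len(xs), len(xs)
    return [(x - m) / (n - 1) for x in xs]

infl = loo_influence(data)
# drop the k points with the largest influence magnitude
k = 20
keep = [x for x, w in sorted(zip(data, infl), key=lambda t: abs(t[1]))[:len(data) - k]]
est_all = sum(data) / len(data)
est_filtered = sum(keep) / len(keep)
print(f"estimate with all users {est_all:.3f}, after filtering {est_filtered:.3f}")
```

被剔除的恰好主要是偏置子群,过滤后的参数估计被拉回接近无偏的位置。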
[AI-78] YuE: Scaling Open Foundation Models for Long-Form Music Generation
链接: https://arxiv.org/abs/2503.08638
作者: Ruibin Yuan,Hanfeng Lin,Shuyue Guo,Ge Zhang,Jiahao Pan,Yongyi Zang,Haohe Liu,Yiming Liang,Wenye Ma,Xingjian Du,Xinrun Du,Zhen Ye,Tianyu Zheng,Yinghao Ma,Minghao Liu,Zeyue Tian,Ziya Zhou,Liumeng Xue,Xingwei Qu,Yizhi Li,Shangda Wu,Tianhao Shen,Ziyang Ma,Jun Zhan,Chunhui Wang,Yatian Wang,Xiaowei Chi,Xinyue Zhang,Zhenzhu Yang,Xiangzhou Wang,Shansong Liu,Lingrui Mei,Peng Li,Junjie Wang,Jianwei Yu,Guojian Pang,Xu Li,Zihao Wang,Xiaohuan Zhou,Lijun Yu,Emmanouil Benetos,Yong Chen,Chenghua Lin,Xie Chen,Gus Xia,Zhaoxiang Zhang,Chao Zhang,Wenhu Chen,Xinyu Zhou,Xipeng Qiu,Roger Dannenberg,Jiaheng Liu,Jian Yang,Wenhao Huang,Wei Xue,Xu Tan,Yike Guo
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
*备注: this https URL
点击查看摘要
Abstract:We tackle the task of long-form music generation–particularly the challenging lyrics-to-song problem–by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE’s learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
[AI-79] MT-NAM: An Efficient and Adaptive Model for Epileptic Seizure Detection
链接: https://arxiv.org/abs/2503.08251
作者: Arshia Afzal,Volkan Cevher,Mahsa Shoaran
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to IEEE-TBME
点击查看摘要
Abstract:Enhancing the accuracy and efficiency of machine learning algorithms employed in neural interface systems is crucial for advancing next-generation intelligent therapeutic devices. However, current systems often utilize basic machine learning models that do not fully exploit the natural structure of brain signals. Additionally, existing learning models used for neural signal processing often demonstrate low speed and efficiency during inference. To address these challenges, this study introduces Micro Tree-based NAM (MT-NAM), a distilled model based on the recently proposed Neural Additive Models (NAM). The MT-NAM achieves a remarkable 100× improvement in inference speed compared to standard NAM, without compromising accuracy. We evaluate our approach on the CHB-MIT scalp EEG dataset, which includes recordings from 24 patients with varying numbers of sessions and seizures. NAM achieves an 85.3% window-based sensitivity and 95% specificity. Interestingly, our proposed MT-NAM shows only a 2% reduction in sensitivity compared to the original NAM. To regain this sensitivity, we utilize a test-time template adjuster (T3A) as an update mechanism, enabling our model to achieve higher sensitivity during test time by accommodating transient shifts in neural signals. With this online update approach, MT-NAM achieves the same sensitivity as the standard NAM while achieving approximately 50× acceleration in inference speed.
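Neural Additive Model 的预测是各特征独立的形函数之和;MT-NAM 则把每个子网络蒸馏为微型树。下面的示意用分箱的分段常值函数同时替代两者(只是替身,不是任一模型),以说明可加分解本身就足以拟合可分目标:

```python
import math, random

random.seed(2)
n = 1000
X = [(random.uniform(-2, 2), random.uniform(-1, 1)) for _ in range(n)]
y = [math.sin(3 * a) + b * b for a, b in X]       # separable ground truth

BINS = 20
def fit_shape(xs, resid, lo, hi):
    """Binned-average 'shape function' for one feature."""
    sums, cnts = [0.0] * BINS, [0] * BINS
    for x, r in zip(xs, resid):
        b = min(BINS - 1, int((x - lo) / (hi - lo) * BINS))
        sums[b] += r; cnts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, cnts)]

def eval_shape(tab, x, lo, hi):
    return tab[min(BINS - 1, int((x - lo) / (hi - lo) * BINS))]

# one backfitting pass: fit feature 1 on the residual, then feature 2
mean_y = sum(y) / n
resid = [t - mean_y for t in y]
g1 = fit_shape([a for a, _ in X], resid, -2, 2)
resid = [r - eval_shape(g1, a, -2, 2) for r, (a, _) in zip(resid, X)]
g2 = fit_shape([b for _, b in X], resid, -1, 1)

pred = [mean_y + eval_shape(g1, a, -2, 2) + eval_shape(g2, b, -1, 1) for a, b in X]
mse_add = sum((p - t) ** 2 for p, t in zip(pred, y)) / n
mse_base = sum((t - mean_y) ** 2 for t in y) / n
print(f"additive-model MSE {mse_add:.4f} vs mean-only MSE {mse_base:.4f}")
```

每个形函数只查一次表,这也直观解释了为何用树/查表替代子网络能大幅加速推理。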
[AI-80] ProTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models
链接: https://arxiv.org/abs/2503.08179
作者: Zicheng Ma,Chuanliu Fan,Zhicong Wang,Zhenyu Chen,Xiaohan Lin,Yanheng Li,Shihao Feng,Jun Zhang,Ziqiang Cao,Yi Qin Gao
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
*备注: 40 pages, 9 figures
点击查看摘要
Abstract:Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProTeX empowers decoder-only LLMs to effectively address diverse spectrum of protein-related tasks.
[AI-81] Revolution of Wireless Signal Recognition for 6G: Recent Advances Challenges and Future Directions
链接: https://arxiv.org/abs/2503.08091
作者: Hao Zhang,Fuhui Zhou,Hongyang Du,Qihui Wu,Chau Yuen
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: submitted to IEEE Communications Surveys & Tutorials
点击查看摘要
Abstract:Wireless signal recognition (WSR) is a crucial technique for intelligent communications and spectrum sharing in the next sixth-generation (6G) wireless communication networks. It can be utilized to enhance network performance and efficiency, improve quality of service (QoS), and improve network security and reliability. Additionally, WSR can be applied for military applications such as signal interception, signal race, and signal abduction. In the past decades, great efforts have been made for the research of WSR. Earlier works mainly focus on model-based methods, including likelihood-based (LB) and feature-based (FB) methods, which have taken the leading position for many years. With the emergence of artificial intelligence (AI), intelligent methods including machine learning-based (ML-based) and deep learning-based (DL-based) methods have been developed to extract the features of the received signals and perform the classification. In this work, we provide a comprehensive review of WSR from the view of applications, main tasks, recent advances, datasets and evaluation metrics, challenges, and future directions. Specifically, intelligent WSR methods are introduced from the perspective of model, data, learning and implementation. Moreover, we analyze the challenges for WSR from the view of complex, dynamic, and open 6G wireless environments and discuss the future directions for WSR. This survey is expected to provide a comprehensive overview of the state-of-the-art WSR techniques and inspire new research directions for WSR in 6G networks.
[AI-82] A Neural Symbolic Model for Space Physics
链接: https://arxiv.org/abs/2503.07994
作者: Jie Ying,Haowei Lin,Chao Yue,Yajie Chen,Chao Xiao,Quanqi Shi,Yitao Liang,Shing-Tung Yau,Yuan Zhou,Jianzhu Ma
类目: Solar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Space Physics (physics.space-ph)
*备注:
点击查看摘要
Abstract:In this study, we unveil a new AI model, termed PhyE2E, to discover physical formulas through symbolic regression. PhyE2E simplifies symbolic regression by decomposing it into sub-problems using the second-order derivatives of an oracle neural network, and employs a transformer model to translate data into symbolic formulas in an end-to-end manner. The resulting formulas are refined through Monte-Carlo Tree Search and Genetic Programming. We leverage a large language model to synthesize extensive symbolic expressions resembling real physics, and train the model to recover these formulas directly from data. A comprehensive evaluation reveals that PhyE2E outperforms existing state-of-the-art approaches, delivering superior symbolic accuracy, precision in data fitting, and consistency in physical units. We deployed PhyE2E to five applications in space physics, including the prediction of sunspot numbers, solar rotational angular velocity, emission line contribution functions, near-Earth plasma pressure, and lunar-tide plasma signals. The physical formulas generated by AI demonstrate a high degree of accuracy in fitting the experimental data from satellites and astronomical telescopes. We have successfully upgraded the formula proposed by NASA in 1993 regarding solar activity, and for the first time, provided the explanations for the long cycle of solar activity in an explicit form. We also found that the decay of near-Earth plasma pressure is proportional to r^2 to Earth, where subsequent mathematical derivations are consistent with satellite data from another independent study. Moreover, we found physical formulas that can describe the relationships between emission lines in the extreme ultraviolet spectrum of the Sun, temperatures, electron densities, and magnetic fields. The formula obtained is consistent with the properties that physicists had previously hypothesized it should possess.
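摘要提到 PhyE2E 借助 oracle 网络的二阶导数把符号回归分解为子问题。其中一种可示意的机制:若混合偏导 ∂²f/∂x∂y 恒为零,则 f 可加可分,各变量可独立拟合。以下是一个有限差分示意(仅为说明,并非论文流程):

```python
def f_sep(x, y):
    return x * x + 3.0 * y        # additively separable
def f_mix(x, y):
    return x * y                  # not separable

def mixed_partial(fn, x, y, h=1e-4):
    """Central finite-difference estimate of d2(fn)/dxdy."""
    return (fn(x + h, y + h) - fn(x + h, y - h)
            - fn(x - h, y + h) + fn(x - h, y - h)) / (4 * h * h)

mp_f = mixed_partial(f_sep, 0.5, 0.7)   # vanishes: split into g(x) + h(y)
mp_g = mixed_partial(f_mix, 0.5, 0.7)   # nonzero: the variables interact
print(mp_f, mp_g)
```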
[AI-83] Efficient and Accurate Estimation of Lipschitz Constants for Hybrid Quantum-Classical Decision Models
链接: https://arxiv.org/abs/2503.07992
作者: Sajjad Hashemian,Mohammad Saeed Arvenaghi
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: 14 pages, 5 figures, Submitted to TASE 2025
点击查看摘要
Abstract:In this paper, we propose a novel framework for efficiently and accurately estimating Lipschitz constants in hybrid quantum-classical decision models. Our approach integrates classical neural networks with quantum variational circuits to address critical issues in learning theory such as fairness verification, robust training, and generalization. Through a unified convex optimization formulation, we extend existing classical methods to capture the interplay between classical and quantum layers. This integrated strategy not only provides a tight bound on the Lipschitz constant but also improves computational efficiency with respect to previous methods.
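对经典层而言,Lipschitz 上界的教科书做法是各层谱范数之积(ReLU 为 1-Lipschitz),而任何经验差商都不会超过它。下面是对一个小型随机 ReLU 网络的自包含示意;论文的量子-经典混合凸优化形式并未在此复现:

```python
import math, random

random.seed(3)
# tiny 2-layer ReLU net f(x) = W2 . relu(W1 . x)
W1 = [[random.gauss(0, 0.5) for _ in range(3)] for _ in range(4)]
W2 = [[random.gauss(0, 0.5) for _ in range(4)] for _ in range(2)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def spectral_norm(M, iters=100):
    """Largest singular value via power iteration on M^T M."""
    v = [1.0] * len(M[0])
    for _ in range(iters):
        u = matvec(M, v)
        w = [sum(M[i][j] * u[i] for i in range(len(M))) for j in range(len(M[0]))]
        nw = norm(w)
        v = [x / nw for x in w]
    return norm(matvec(M, v))

# classical upper bound: product of layer spectral norms (ReLU is 1-Lipschitz)
upper = spectral_norm(W1) * spectral_norm(W2)

def f(x):
    return matvec(W2, [max(0.0, t) for t in matvec(W1, x)])

# empirical lower bound: largest observed difference quotient
lower = 0.0
for _ in range(2000):
    a = [random.gauss(0, 1) for _ in range(3)]
    b = [ai + 1e-3 * random.gauss(0, 1) for ai in a]
    dx = norm([p - q for p, q in zip(a, b)])
    dy = norm([p - q for p, q in zip(f(a), f(b))])
    if dx > 0:
        lower = max(lower, dy / dx)
print(f"empirical slope {lower:.3f} <= spectral bound {upper:.3f}")
```

两者之间的差距正是论文这类方法希望收紧的部分。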
[AI-84] A Theory of Learning with Autoregressive Chain of Thought
链接: https://arxiv.org/abs/2503.07932
作者: Nirmit Joshi,Gal Vardi,Adam Block,Surbhi Goel,Zhiyuan Li,Theodor Misiakiewicz,Nathan Srebro
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: Comments are welcome
点击查看摘要
Abstract:For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
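论文的设定(迭代一个固定的、时不变的下一 token 生成器,以最后一个 token 为答案)可以用一个玩具生成器来具体化:每步把序列最前面的两个数折叠为它们的和。该示意想说明的是:同一个单步映射可处理任意长度的提示,这正呼应了"时不变性使样本复杂度与链长无关"的论点。

```python
def step(nums):
    """One time-invariant generator step: emit the sum of the two leading numbers."""
    return nums[0] + nums[1]

def chain_of_thought(prompt):
    nums = list(prompt)
    trace = []                      # the generated "thoughts"
    while len(nums) > 1:
        t = step(nums)              # the same fixed generator at every step
        trace.append(t)
        nums = [t] + nums[2:]       # re-apply it to the updated sequence
    return nums[0], trace           # final token is the answer

ans, trace = chain_of_thought([3, 1, 4, 1, 5])
print(ans, trace)  # 14 [4, 8, 9, 14]
```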
[AI-85] A primer on optimal transport for causal inference with observational data
链接: https://arxiv.org/abs/2503.07811
作者: Florian F Gunsilius
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
*备注: 24 pages, 5 figures
点击查看摘要
Abstract:The theory of optimal transportation has developed into a powerful and elegant framework for comparing probability distributions, with wide-ranging applications in all areas of science. The fundamental idea of analyzing probabilities by comparing their underlying state space naturally aligns with the core idea of causal inference, where understanding and quantifying counterfactual states is paramount. Despite this intuitive connection, explicit research at the intersection of optimal transport and causal inference is only beginning to develop. Yet, many foundational models in causal inference have implicitly relied on optimal transport principles for decades, without recognizing the underlying connection. Therefore, the goal of this review is to offer an introduction to the surprisingly deep existing connections between optimal transport and the identification of causal effects with observational data – where optimal transport is not just a set of potential tools, but actually builds the foundation of model assumptions. As a result, this review is intended to unify the language and notation between different areas of statistics, mathematics, and econometrics, by pointing out these existing connections, and to explore novel problems and directions for future work in both areas derived from this realization.
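这种"隐式依赖最优传输"的一个具体实例:一维情形下,两个结果分布之间的最优传输映射就是单调的分位数匹配,而这正是 changes-in-changes 一类反事实构造背后的耦合。以下在合成的对照/处理样本上示意(数据为编造,非该综述内容):

```python
import bisect, random

random.seed(4)
# synthetic observed outcomes for a control and a treated group
control = sorted(random.gauss(0.0, 1.0) for _ in range(4000))
treated = sorted(2.0 * random.gauss(0.0, 1.0) + 1.0 for _ in range(4000))

def ot_map(x):
    """Empirical 1-D OT map: send x to the treated value at the same quantile
    (monotone rearrangement, i.e. F_treated^{-1}(F_control(x)))."""
    r = min(bisect.bisect_left(control, x), len(treated) - 1)
    return treated[r]

# counterfactual treated outcome for the median control unit
print(ot_map(control[2000]))
```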
[AI-86] Probabilistic Shielding for Safe Reinforcement Learning AAAI2025
链接: https://arxiv.org/abs/2503.07671
作者: Edwin Hamel-De le Court,Francesco Belardinelli,Alex W. Goodall
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 3 figures, Conference: AAAI 2025
点击查看摘要
Abstract:In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise its reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an optimal policy among all policies that satisfy a given safety constraint. However, strict safety guarantees are often provided through approaches based on linear programming, and thus have limited scaling. In this paper we present a new, scalable method, which enjoys strict formal guarantees for Safe RL, in the case where the safety dynamics of the Markov Decision Process (MDP) are known, and safety is defined as an undiscounted probabilistic avoidance property. Our approach is based on state-augmentation of the MDP, and on the design of a shield that restricts the actions available to the agent. We show that our approach provides a strict formal safety guarantee that the agent stays safe at training and test time. Furthermore, we demonstrate that our approach is viable in practice through experimental evaluation.
[AI-87] Principal deuterium Hugoniot via Quantum Monte Carlo and Δ-learning DATE
链接: https://arxiv.org/abs/2301.03570
作者: Giacomo Tenti,Kousuke Nakano,Andrea Tirelli,Sandro Sorella,Michele Casula
类目: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
*备注: 7 + 10 pages; new version with improved QMC dataset. Hugoniot curve and discussion updated accordingly, main physical outcomes unchanged
点击查看摘要
Abstract:We present a study of the principal deuterium Hugoniot for pressures up to 150 GPa, using Machine Learning potentials (MLPs) trained with Quantum Monte Carlo (QMC) energies, forces and pressures. In particular, we adopted a recently proposed workflow based on the combination of Gaussian kernel regression and Δ-learning. By fully taking advantage of this method, we explicitly considered finite-temperature electrons in the dynamics, whose effects are highly relevant for temperatures above 10 kK. The Hugoniot curve obtained by our MLPs shows a good agreement with the most recent experiments, particularly in the region below 60 GPa. At larger pressures, our Hugoniot curve is slightly more compressible than the one yielded by experiments, whose uncertainties generally increase, however, with pressure. Our work demonstrates that QMC can be successfully combined with Δ-learning to deploy reliable MLPs for complex extended systems across different thermodynamic conditions, by keeping the QMC precision at the computational cost of a mean-field calculation.
机器学习
[LG-0] Understanding and Mitigating Distribution Shifts For Machine Learning Force Fields
链接: https://arxiv.org/abs/2503.08674
作者: Tobias Kreiman,Aditi S. Krishnapriyan
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:
点击查看摘要
Abstract:Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. In order to characterize and better understand distribution shifts in MLFFs, we conduct diagnostic experiments on chemical datasets, revealing common shifts that pose significant challenges, even for large foundation models trained on extensive data. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. We then propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost and do not use expensive ab initio reference labels. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective, such as a cheap physical prior. Our test-time refinement strategies significantly reduce errors on out-of-distribution systems, suggesting that MLFFs are capable of and can move towards modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs. Our code is available at this https URL.
[LG-1] Extra Clients at No Extra Cost: Overcome Data Heterogeneity in Federated Learning with Filter Decomposition
链接: https://arxiv.org/abs/2503.08652
作者: Wei Chen,Qiang Qiu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Data heterogeneity is one of the major challenges in federated learning (FL), which results in substantial client variance and slow convergence. In this study, we propose a novel solution: decomposing a convolutional filter in FL into a linear combination of filter subspace elements, i.e., filter atoms. This simple technique transforms global filter aggregation in FL into aggregating filter atoms and their atom coefficients. The key advantage here involves mathematically generating numerous cross-terms by expanding the product of two weighted sums from filter atom and atom coefficient. These cross-terms effectively emulate many additional latent clients, significantly reducing model variance, which is validated by our theoretical analysis and empirical observation. Furthermore, our method permits different training schemes for filter atoms and atom coefficients for highly adaptive model personalization and communication efficiency. Empirical results on benchmark datasets demonstrate that our filter decomposition technique substantially improves the accuracy of FL methods, confirming its efficacy in addressing data heterogeneity.
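The cross-term expansion described above can be checked numerically. The shapes and the simple two-client average below are illustrative assumptions, not the paper's exact aggregation rule:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4            # number of shared filter atoms
k, s = 8, 3      # filters per client, atom spatial size

# Each client represents filter_i = sum_j coef[i, j] * atom[j].
atoms_a, coef_a = rng.normal(size=(m, s, s)), rng.normal(size=(k, m))
atoms_b, coef_b = rng.normal(size=(m, s, s)), rng.normal(size=(k, m))

# Aggregate atoms and coefficients separately, then recombine.
atoms_avg = 0.5 * (atoms_a + atoms_b)
coef_avg = 0.5 * (coef_a + coef_b)
filters_agg = np.einsum('km,mxy->kxy', coef_avg, atoms_avg)

# Expanding the product of the two averages yields four cross-terms
# (e.g. client A's coefficients with client B's atoms), which act like
# extra latent clients.
cross = 0.25 * (np.einsum('km,mxy->kxy', coef_a, atoms_a)
                + np.einsum('km,mxy->kxy', coef_a, atoms_b)
                + np.einsum('km,mxy->kxy', coef_b, atoms_a)
                + np.einsum('km,mxy->kxy', coef_b, atoms_b))
print(np.allclose(filters_agg, cross))  # True
```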
[LG-2] Coefficient-to-Basis Network: A Fine-Tunable Operator Learning Framework for Inverse Problems with Adaptive Discretizations and Theoretical Guarantees
链接: https://arxiv.org/abs/2503.08642
作者: Zecheng Zhang,Hao Liu,Wenjing Liao,Guang Lin
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a Coefficient-to-Basis Network (C2BNet), a novel framework for solving inverse problems within the operator learning paradigm. C2BNet efficiently adapts to different discretizations through fine-tuning, using a pre-trained model to significantly reduce computational cost while maintaining high accuracy. Unlike traditional approaches that require retraining from scratch for new discretizations, our method enables seamless adaptation without sacrificing predictive performance. Furthermore, we establish theoretical approximation and generalization error bounds for C2BNet by exploiting low-dimensional structures in the underlying datasets. Our analysis demonstrates that C2BNet adapts to low-dimensional structures without relying on explicit encoding mechanisms, highlighting its robustness and efficiency. To validate our theoretical findings, we conducted extensive numerical experiments that showcase the superior performance of C2BNet on several inverse problems. The results confirm that C2BNet effectively balances computational efficiency and accuracy, making it a promising tool to solve inverse problems in scientific computing and engineering applications.
[LG-3] How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?
链接: https://arxiv.org/abs/2503.08633
作者: Gal Alon,Yehuda Dar
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples. We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.
[LG-4] An Analysis of Safety Guarantees in Multi-Task Bayesian Optimization
链接: https://arxiv.org/abs/2503.08555
作者: Jannis O. Luebsen,Annika Eichler
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:In many practical scenarios of black box optimization, the objective function is subject to constraints that must be satisfied to avoid undesirable outcomes. Such constraints are typically unknown and must be learned during optimization. Safe Bayesian optimization aims to find the global optimum while ensuring that the constraints are satisfied with high probability. However, it is often sample-inefficient due to the small initial feasible set, which requires expansion by evaluating the objective or constraint functions, limiting its applicability to low-dimensional or inexpensive problems. To enhance sample efficiency, additional information from cheap simulations can be leveraged, albeit at the cost of safeness guarantees. This paper introduces a novel safe multi-task Bayesian optimization algorithm that integrates multiple tasks while maintaining high-probability safety. We derive robust uniform error bounds for the multi-task case and demonstrate the effectiveness of the approach on benchmark functions and a control problem. Our results show a significant improvement in sample efficiency, making the proposed method well-suited for expensive-to-evaluate functions.
[LG-5] DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning
链接: https://arxiv.org/abs/2503.08509
作者: Sergey Alyaev,Kristian Fossum,Hibat Errahmen Djecta,Jan Tveranger,Ahmed H. Elsheikh
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Geophysics (physics.geo-ph); Applications (stat.AP)
*备注: The conference version of this paper is published in EAGE ECMOR 2024 proceedings: this https URL
点击查看摘要
Abstract:The real-time process of directional changes while drilling, known as geosteering, is crucial for hydrocarbon extraction and emerging directional drilling applications such as geothermal energy, civil infrastructure, and CO2 storage. The geo-energy industry seeks an automatic geosteering workflow that continually updates the subsurface uncertainties and captures the latest geological understanding given the most recent observations in real-time. We propose “DISTINGUISH”: a real-time, AI-driven workflow designed to transform geosteering by integrating Generative Adversarial Networks (GANs) for geological parameterization, ensemble methods for model updating, and global discrete dynamic programming (DDP) optimization for complex decision-making during directional drilling operations. The DISTINGUISH framework relies on offline training of a GAN model to reproduce relevant geology realizations and a Forward Neural Network (FNN) to model Logging-While-Drilling (LWD) tools’ response for a given geomodel. This paper introduces a first-of-its-kind workflow that progressively reduces GAN-geomodel uncertainty around and ahead of the drilling bit and adjusts the well plan accordingly. The workflow automatically integrates real-time LWD data with a DDP-based decision support system, enhancing predictive models of geology ahead of drilling and leading to better steering decisions. We present a simple yet representative benchmark case and document the performance target achieved by the DISTINGUISH workflow prototype. This benchmark will be a foundation for future methodological advancements and workflow refinements. 
[LG-6] The Space Between: On Folding Symmetries and Sampling ICLR
链接: https://arxiv.org/abs/2503.08502
作者: Michal Lewandowski,Bernhard Heinzl,Raphael Pisoni,Bernhard A. Moser
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality, 2025
点击查看摘要
Abstract:Recent findings suggest that consecutive layers of neural networks with the ReLU activation function fold the input space during the learning process. While many works hint at this phenomenon, an approach to quantify the folding was only recently proposed by means of a space folding measure based on Hamming distance in the ReLU activation space. We generalize this measure to a wider class of activation functions through introduction of equivalence classes of input data, analyse its mathematical and computational properties and come up with an efficient sampling strategy for its implementation. Moreover, it has been observed that space folding values increase with network depth when the generalization error is low, but decrease when the error increases. This underpins that learned symmetries in the data manifold (e.g., invariance under reflection) become visible in terms of space folds, contributing to the network’s generalization capacity. Inspired by these findings, we outline a novel regularization scheme that encourages the network to seek solutions characterized by higher folding values.
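The Hamming-distance folding measure referenced above can be illustrated on a single ReLU layer (a toy sketch, not the generalized measure proposed in the paper):

```python
import numpy as np

def activation_pattern(W, b, x):
    """Binary ReLU activation pattern of one layer for input x."""
    return (W @ x + b > 0).astype(int)

def hamming(p, q):
    """Hamming distance between two activation patterns."""
    return int(np.sum(p != q))

W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.zeros(2)
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Folding is probed by walking between inputs and watching how the
# activation-space Hamming distance evolves relative to input distance.
print(hamming(activation_pattern(W, b, x1),
              activation_pattern(W, b, x2)))  # 1
```

Non-monotone behaviour of this distance along a straight input path is the signature of a space fold.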
[LG-7] Learning to Match Unpaired Data with Minimum Entropy Coupling
链接: https://arxiv.org/abs/2503.08501
作者: Mustapha Bounoua,Giulio Franzese,Pietro Michiardi
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge to learning a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint Entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their applicability in cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint Entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.
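For intuition on the discrete case that prior MEC work addresses (a toy example, not the paper's continuous diffusion-based method): with two identical binary marginals, the independent coupling carries 2 bits of joint entropy, while the diagonal coupling achieves the 1-bit minimum.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Two identical binary marginals P = Q = (0.5, 0.5).
# Independent coupling: product distribution over the 4 joint outcomes.
indep = np.outer([0.5, 0.5], [0.5, 0.5]).ravel()
# Minimum-entropy coupling: all mass on the diagonal, same marginals.
mec = np.array([0.5, 0.0, 0.0, 0.5])
print(entropy(indep), entropy(mec))  # 2.0 1.0
```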
[LG-8] Soft Actor-Critic-based Control Barrier Adaptation for Robust Autonomous Navigation in Unknown Environments ICRA
链接: https://arxiv.org/abs/2503.08479
作者: Nicholas Mohammad,Nicola Bezzo
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: To Appear in 2025 IEEE/RSJ International Conference on Robotics and Automation (ICRA), 2025
点击查看摘要
Abstract:Motion planning failures during autonomous navigation often occur when safety constraints are either too conservative, leading to deadlocks, or too liberal, resulting in collisions. To improve robustness, a robot must dynamically adapt its safety constraints to ensure it reaches its goal while balancing safety and performance measures. To this end, we propose a Soft Actor-Critic (SAC)-based policy for adapting Control Barrier Function (CBF) constraint parameters at runtime, ensuring safe yet non-conservative motion. The proposed approach is designed for a general high-level motion planner, low-level controller, and target system model, and is trained in simulation only. Through extensive simulations and physical experiments, we demonstrate that our framework effectively adapts CBF constraints, enabling the robot to reach its final goal without compromising safety.
[LG-9] Data Driven Decision Making with Time Series and Spatio-temporal Data ICDE2025
链接: https://arxiv.org/abs/2503.08473
作者: Bin Yang,Yuxuan Liang,Chenjuan Guo,Christian S. Jensen
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: This paper is accepted by ICDE 2025
点击查看摘要
Abstract:Time series data captures properties that change over time. Such data occurs widely, ranging from the scientific and medical domains to the industrial and environmental domains. When the properties in time series exhibit spatial variations, we often call the data spatio-temporal. As part of the continued digitalization of processes throughout society, increasingly large volumes of time series and spatio-temporal data are available. In this tutorial, we focus on data-driven decision making with such data, e.g., enabling greener and more efficient transportation based on traffic time series forecasting. The tutorial adopts the holistic paradigm of “data-governance-analytics-decision.” We first introduce the data foundation of time series and spatio-temporal data, which is often heterogeneous. Next, we discuss data governance methods that aim to improve data quality. We then cover data analytics, focusing on five desired characteristics: automation, robustness, generality, explainability, and resource efficiency. We finally cover data-driven decision making strategies and briefly discuss promising research directions. We hope that the tutorial will serve as a primary resource for researchers and practitioners who are interested in value creation from time series and spatio-temporal data.
[LG-10] MinGRU-Based Encoder for Turbo Autoencoder Frameworks ICML
链接: https://arxiv.org/abs/2503.08451
作者: Rick Fritschek,Rafael F. Schaefer
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 6 pages, accepted at ICMLCN25
点击查看摘要
Abstract:Early neural channel coding approaches leveraged dense neural networks with one-hot encodings to design adaptive encoder-decoder pairs, improving block error rate (BLER) and automating the design process. However, these methods struggled with scalability as the size of message sets and block lengths increased. TurboAE addressed this challenge by focusing on bit-sequence inputs rather than symbol-level representations, transforming the scalability issue associated with large message sets into a sequence modeling problem. While recurrent neural networks (RNNs) were a natural fit for sequence processing, their reliance on sequential computations made them computationally expensive and inefficient for long sequences. As a result, TurboAE adopted convolutional network blocks, which were faster to train and more scalable, but lacked the sequential modeling advantages of RNNs. Recent advances in efficient RNN architectures, such as minGRU and minLSTM, and structured state space models (SSMs) like S4 and S6, overcome these limitations by significantly reducing memory and computational overhead. These models enable scalable sequence processing, making RNNs competitive for long-sequence tasks. In this work, we revisit RNNs for Turbo autoencoders by integrating the lightweight minGRU model with a Mamba block from SSMs into a parallel Turbo autoencoder framework. Our results demonstrate that this hybrid design matches the performance of convolutional network-based Turbo autoencoder approaches for short sequences while significantly improving scalability and training efficiency for long block lengths. This highlights the potential of efficient RNNs in advancing neural channel coding for long-sequence scenarios.
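For reference, the minGRU cell used here drops the hidden-state dependence from both the gate and the candidate, which is what makes the recurrence a linear scan. A minimal sequential sketch (the parallel-scan form and the Mamba block are omitted; weights and dimensions are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def min_gru(xs, Wz, Wh, h0):
    """Sequential minGRU: gate and candidate depend on the input only,
    so the hidden-state update is a linear recurrence in h."""
    h, hs = h0, []
    for x in xs:
        z = sigmoid(Wz @ x)        # update gate, no h_{t-1} dependence
        h_tilde = Wh @ x           # candidate state, no h_{t-1} dependence
        h = (1.0 - z) * h + z * h_tilde
        hs.append(h)
    return np.stack(hs)

# 1-D sanity run: Wz = 0 gives z = 0.5; Wh = I passes the input through.
hs = min_gru([np.array([1.0]), np.array([1.0])],
             np.zeros((1, 1)), np.eye(1), np.zeros(1))
print(hs[:, 0])  # [0.5, 0.75]
```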
[LG-11] A Deep-Learning Iterative Stacked Approach for Prediction of Reactive Dissolution in Porous Media
链接: https://arxiv.org/abs/2503.08410
作者: Marcos Cirne,Hannah Menke,Alhasan Abdellatif,Julien Maes,Florian Doster,Ahmed H. Elsheikh
类目: Machine Learning (cs.LG)
*备注: 24 pages, 16 figures
点击查看摘要
Abstract:Simulating reactive dissolution of solid minerals in porous media has many subsurface applications, including carbon capture and storage (CCS), geothermal systems and oil and gas recovery. As traditional direct numerical simulators are computationally expensive, it is of paramount importance to develop faster and more efficient alternatives. Deep-learning-based solutions, most of them built upon convolutional neural networks (CNNs), have been recently designed to tackle this problem. However, these solutions were limited to approximating one field over the domain (e.g. velocity field). In this manuscript, we present a novel deep learning approach that incorporates both temporal and spatial information to predict the future states of the dissolution process at a fixed time-step horizon, given a sequence of input states. The overall performance, in terms of speed and prediction accuracy, is demonstrated on a numerical simulation dataset, comparing its prediction results against state-of-the-art approaches, also achieving a speedup around 10^4 over traditional numerical simulators.
[LG-12] Uncertainty Quantification for Multi-fidelity Simulations
链接: https://arxiv.org/abs/2503.08408
作者: Swapnil Kumar
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: Imperial College London, Master Thesis
点击查看摘要
Abstract:The work focuses on gathering high-fidelity and low-fidelity numerical simulation data using Nektar++ (a solver based on applied mathematics) and XFOIL respectively. The utilization of the higher polynomial distribution in calculating the coefficients of lift and drag has demonstrated superior accuracy and precision. Further, Co-kriging data fusion and an adaptive sampling technique have been used to obtain precise predictions for the lift and drag within the confined domain without conducting costly simulations on HPC clusters. This creates a methodology for quantifying uncertainty in computational fluid dynamics by minimizing the required number of samples. To minimize the reliance on high-fidelity numerical simulations in uncertainty quantification, a multi-fidelity strategy has been adopted. The effectiveness of the multi-fidelity deep neural network model has been validated through the approximation of benchmark functions across 1-, 32-, and 100-dimensional settings, encompassing both linear and nonlinear correlations. The surrogate modelling results showed that the multi-fidelity deep neural network model has excellent approximation capabilities for the test functions and outperforms Co-kriging in effectiveness. In addition, the multi-fidelity deep neural network model is utilized for the simulation of aleatory uncertainty propagation in 1-, 32-, and 100-dimensional function tests, considering both uniform and Gaussian distributions for input uncertainties. The results show that the multi-fidelity deep neural network model efficiently predicts the probability density distributions of quantities of interest as well as the statistical moments with precision and accuracy. The Co-kriging model exhibited limitations when addressing 32-dimensional problems due to the limitation of memory capacity for storage and manipulation.
[LG-13] (θ_l, θ_u)-Parametric Multi-Task Optimization: Joint Search in Solution and Infinite Task Spaces
链接: https://arxiv.org/abs/2503.08394
作者: Tingyang Wei,Jiao Liu,Abhishek Gupta,Puay Siew Tan,Yew-Soon Ong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Multi-task optimization is typically characterized by a fixed and finite set of optimization tasks. The present paper relaxes this condition by considering a non-fixed and potentially infinite set of optimization tasks defined in a parameterized, continuous and bounded task space. We refer to this unique problem setting as parametric multi-task optimization (PMTO). Assuming the bounds of the task parameters to be (θ_l, θ_u), a novel (θ_l, θ_u)-PMTO algorithm is crafted to enable joint search over tasks and their solutions. This joint search is supported by two approximation models: (1) for mapping solutions to the objective spaces of all tasks, which provably accelerates convergence by acting as a conduit for inter-task knowledge transfers, and (2) for probabilistically mapping tasks to the solution space, which facilitates evolutionary exploration of under-explored regions of the task space. At the end of a full (θ_l, θ_u)-PMTO run, the acquired models enable rapid identification of optimized solutions for any task lying within the specified bounds. This outcome is validated on both synthetic test problems and practical case studies, with the significant real-world applicability of PMTO shown towards fast reconfiguration of robot controllers under changing task conditions. The potential of PMTO to vastly speedup the search for solutions to minimax optimization problems is also demonstrated through an example in robust engineering design.
[LG-14] Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion
链接: https://arxiv.org/abs/2503.08375
作者: Nico Bohlinger,Jonathan Kinzel,Daniel Palenicek,Lukasz Antczak,Jan Peters
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:On-robot Reinforcement Learning is a promising approach to train embodiment-aware policies for legged robots. However, the computational constraints of real-time learning on robots pose a significant challenge. We present a framework for efficiently learning quadruped locomotion in just 8 minutes of raw real-time training utilizing the sample efficiency and minimal computational overhead of the new off-policy algorithm CrossQ. We investigate two control architectures: Predicting joint target positions for agile, high-speed locomotion and Central Pattern Generators for stable, natural gaits. While prior work focused on learning simple forward gaits, our framework extends on-robot learning to omnidirectional locomotion. We demonstrate the robustness of our approach in different indoor and outdoor environments.
[LG-15] Density Ratio-based Proxy Causal Learning Without Density Ratios AISTATS2025
链接: https://arxiv.org/abs/2503.08371
作者: Bariscan Bozkurt,Ben Deaner,Dimitri Meunier,Liyuan Xu,Arthur Gretton
类目: Machine Learning (cs.LG)
*备注: AISTATS 2025 accepted, 81 pages
点击查看摘要
Abstract:We address the setting of Proxy Causal Learning (PCL), which has the goal of estimating causal effects from observed data in the presence of hidden confounding. Proxy methods accomplish this task using two proxy variables related to the latent confounder: a treatment proxy (related to the treatment) and an outcome proxy (related to the outcome). Two approaches have been proposed to perform causal effect estimation given proxy variables; however only one of these has found mainstream acceptance, since the other was understood to require density ratio estimation - a challenging task in high dimensions. In the present work, we propose a practical and effective implementation of the second approach, which bypasses explicit density ratio estimation and is suitable for continuous and high-dimensional treatments. We employ kernel ridge regression to derive estimators, resulting in simple closed-form solutions for dose-response and conditional dose-response curves, along with consistency guarantees. Our methods empirically demonstrate superior or comparable performance to existing frameworks on synthetic and real-world datasets.
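The closed-form kernel ridge regression machinery underlying such estimators can be sketched in a few lines. This is generic KRR with an RBF kernel and made-up hyperparameters, not the paper's dose-response estimator itself:

```python
import numpy as np

def rbf(A, B, gamma):
    """RBF (Gaussian) kernel matrix between row-stacked inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr(X, y, X_new, lam=1e-4, gamma=50.0):
    """Closed-form kernel ridge regression:
    alpha = (K + n*lam*I)^{-1} y,  f(x) = k(x, X) @ alpha."""
    n = len(X)
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return rbf(X_new, X, gamma) @ alpha

X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
fit = krr(X, y, X)
```

The same closed form, applied in two stages over treatment and outcome proxies, is what yields the simple estimators for dose-response curves.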
[LG-16] Flexible and Efficient Probabilistic PDE Solvers through Gaussian Markov Random Fields
链接: https://arxiv.org/abs/2503.08343
作者: Tim Weiland,Marvin Pförtner,Philipp Hennig
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Mechanistic knowledge about the physical world is virtually always expressed via partial differential equations (PDEs). Recently, there has been a surge of interest in probabilistic PDE solvers – Bayesian statistical models mostly based on Gaussian process (GP) priors which seamlessly combine empirical measurements and mechanistic knowledge. As such, they quantify uncertainties arising from e.g. noisy or missing data, unknown PDE parameters or discretization error by design. Prior work has established connections to classical PDE solvers and provided solid theoretical guarantees. However, scaling such methods to large-scale problems remains a fundamental challenge primarily due to dense covariance matrices. Our approach addresses the scalability issues by leveraging the Markov property of many commonly used GP priors. It has been shown that such priors are solutions to stochastic PDEs (SPDEs) which when discretized allow for highly efficient GP regression through sparse linear algebra. In this work, we show how to leverage this prior class to make probabilistic PDE solvers practical, even for large-scale nonlinear PDEs, through greatly accelerated inference mechanisms. Additionally, our approach also allows for flexible and physically meaningful priors beyond what can be modeled with covariance functions. Experiments confirm substantial speedups and accelerated convergence of our physics-informed priors in nonlinear settings.
[LG-17] MFRS: A Multi-Frequency Reference Series Approach to Scalable and Accurate Time-Series Forecasting
链接: https://arxiv.org/abs/2503.08328
作者: Liang Yu,Lai Tu,Xiang Bai
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Multivariate time-series forecasting holds immense value across diverse applications, requiring methods to effectively capture complex temporal and inter-variable dynamics. A key challenge lies in uncovering the intrinsic patterns that govern predictability, beyond conventional designs, focusing on network architectures to explore latent relationships or temporal dependencies. Inspired by signal decomposition, this paper posits that time series predictability is derived from periodic characteristics at different frequencies. Consequently, we propose a novel time series forecasting method based on multi-frequency reference series correlation analysis. Through spectral analysis on long-term training data, we identify dominant spectral components and their harmonics to design base-pattern reference series. Unlike signal decomposition, which represents the original series as a linear combination of basis signals, our method uses a transformer model to compute cross-attention between the original series and reference series, capturing essential features for forecasting. Experiments on major open and synthetic datasets show state-of-the-art performance. Furthermore, by focusing on attention with a small number of reference series rather than pairwise variable attention, our method ensures scalability and broad applicability. The source code is available at: this https URL
[LG-18] Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Link: https://arxiv.org/abs/2503.08311
Authors: Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Pol G. Recasens, Ferran Agullo: equal contribution
Abstract:Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models.
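The memory-vs-compute argument can be made concrete with a naive roofline estimate (illustrative numbers only; per-layer effects such as KV-cache traffic are ignored):

```python
def decode_step_intensity(batch_size, bytes_per_param=2):
    """Rough arithmetic intensity (FLOPs per byte of weight traffic) of one
    auto-regressive decode step: every parameter is read once and takes part
    in one multiply-add (2 FLOPs) per sequence in the batch."""
    return 2.0 * batch_size / bytes_per_param

def regime(batch_size, peak_flops, dram_bw, bytes_per_param=2):
    """'memory' if DRAM bandwidth is the binding limit at this batch size,
    'compute' otherwise, using the roofline ridge point."""
    ridge = peak_flops / dram_bw
    ai = decode_step_intensity(batch_size, bytes_per_param)
    return "memory" if ai < ridge else "compute"

# Illustrative A100-like numbers: 312 TFLOP/s FP16 and 2 TB/s HBM bandwidth
# give a ridge point of 156 FLOPs/byte, so even batch 64 stays memory-bound.
verdict = regime(batch_size=64, peak_flops=312e12, dram_bw=2e12)
```

Under this crude model the crossover batch size is in the hundreds, which is consistent with the paper's observation that practical large-batch decoding remains DRAM-bandwidth-bound.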
[LG-19] ELECTRA: A Symmetry-breaking Cartesian Network for Charge Density Prediction with Floating Orbitals
Link: https://arxiv.org/abs/2503.08305
Authors: Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, Arghya Bhowmik
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph)
Comments: 8 pages, 3 figures, 1 table
Abstract:We present the Electronic Tensor Reconstruction Algorithm (ELECTRA) - an equivariant model for predicting electronic charge densities using “floating” orbitals. Floating orbitals are a long-standing idea in the quantum chemistry community that promises more compact and accurate representations by placing orbitals freely in space, as opposed to centering all orbitals at the positions of atoms. However, finding ideal placements of these orbitals requires extensive domain knowledge, which has thus far prevented widespread adoption. We solve this in a data-driven manner by training a Cartesian tensor network to predict orbital positions along with orbital coefficients. This is made possible through a symmetry-breaking mechanism that is used to learn position displacements with lower symmetry than the input molecule while preserving the rotation equivariance of the charge density itself. Inspired by recent successes of Gaussian Splatting in representing densities in space, we use Gaussians as our orbitals and predict their weights and covariance matrices. Our method achieves a state-of-the-art balance between computational efficiency and predictive accuracy on established benchmarks.
[LG-20] A systematic literature review of unsupervised learning algorithms for anomalous traffic detection based on flows
Link: https://arxiv.org/abs/2503.08293
Authors: Alberto Miguel-Diez, Adrián Campazas-Vega, Claudia Álvarez-Aparicio, Gonzalo Esteban-Costales, Ángel Manuel Guerrero-Higueras
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: This article has been accepted for publication in Logic Journal of the IGPL, published by Oxford University Press
Abstract:The constant increase of devices connected to the Internet, and therefore of cyber-attacks, makes it necessary to analyze network traffic in order to recognize malicious activity. Traditional packet-based analysis methods are insufficient because in large networks the amount of traffic is so high that it is unfeasible to review all communications. For this reason, flow-based analysis is a suitable approach for this situation, and it will have to be used in future 5G networks, where the number of packets will increase dramatically. If it is also combined with unsupervised learning models, new threats for which the models have not been trained can be detected. This paper presents a systematic review of the literature on unsupervised learning algorithms for detecting anomalies in network flows, following the PRISMA guideline. A total of 63 scientific articles have been reviewed, analyzing 13 of them in depth. The results obtained show that the autoencoder is the most used option, followed by SVM, ALAD, and SOM. In addition, all the datasets used for anomaly detection in the reviewed works have been catalogued, including some specialised in IoT and some containing real data collected from honeypots.
[LG-21] LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization
Link: https://arxiv.org/abs/2503.08271
Authors: Wenzhe Niu, Zongxia Xie, Yanru Sun, Wei He, Man Xu, Chao Hao
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) cross-domain generalization, (2) cross-modality alignment, and (3) error accumulation in autoregressive frameworks. To address these challenges, we propose LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to better understand and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional reward function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.
[LG-22] Dynamic DBSCAN with Euler Tour Sequences AISTATS2025
Link: https://arxiv.org/abs/2503.08246
Authors: Seiyun Shin, Ilan Shomorony, Peter Macgregor
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments: AISTATS 2025
Abstract:We propose a fast and dynamic algorithm for Density-Based Spatial Clustering of Applications with Noise (DBSCAN) that efficiently supports online updates. Traditional DBSCAN algorithms, designed for batch processing, become computationally expensive when applied to dynamic datasets, particularly in large-scale applications where data continuously evolves. To address this challenge, our algorithm leverages the Euler Tour Trees data structure, enabling dynamic clustering updates without the need to reprocess the entire dataset. This approach preserves the near-optimal accuracy in density estimation achieved by the state-of-the-art static DBSCAN method (Esfandiari et al., 2021). Our method achieves an improved time complexity of O(d \log^3(n) + \log^4(n)) for each data point insertion and deletion, where n and d denote the total number of updates and the data dimension, respectively. Empirical studies also demonstrate significant speedups over conventional DBSCAN implementations in real-time clustering of dynamic datasets, while maintaining comparable or superior clustering quality.
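For intuition only, here is a toy incremental clustering sketch (our own simplification: it maintains connected components of the eps-neighborhood graph with union-find, supports insertions but not the deletions that the paper's Euler Tour Trees enable, and omits DBSCAN's core-point/noise distinction):

```python
import numpy as np

class IncrementalEpsGraph:
    """Connected components of the eps-neighborhood graph under insertion,
    maintained with union-find (path halving). Union-find cannot undo unions,
    which is why deletions need a stronger structure such as Euler Tour Trees."""

    def __init__(self, eps):
        self.eps2 = eps * eps
        self.pts = []
        self.parent = []

    def _find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def insert(self, p):
        i = len(self.pts)
        self.pts.append(np.asarray(p, dtype=float))
        self.parent.append(i)
        for j in range(i):                       # merge with all eps-neighbors
            if ((self.pts[i] - self.pts[j]) ** 2).sum() <= self.eps2:
                self.parent[self._find(j)] = self._find(i)
        return self._find(i)

    def label(self, i):
        return self._find(i)

g = IncrementalEpsGraph(eps=1.0)
for p in [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (1.0, 0.0)]:
    g.insert(p)
```

After the four insertions, the three points near the origin share one component label while the isolated point at (5, 5) keeps its own.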
[LG-23] ExMAG: Learning of Maximally Ancestral Graphs
Link: https://arxiv.org/abs/2503.08245
Authors: Petr Ryšavý, Pavel Rytíř, Xiaoyu He, Jakub Mareček, Georgios Korpas
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:As one transitions from statistical to causal learning, one is seeking the most appropriate causal model. Dynamic Bayesian networks are a popular model, in which a weighted directed acyclic graph represents the causal relationships: stochastic processes are represented by its vertices, and the weights of oriented edges suggest the strength of the causal relationships. When there are confounders, one would like to utilize both oriented edges (when the direction of causality is clear) and edges that are not oriented (when there is a confounder), yielding mixed graphs. A little-studied extension of acyclicity to this mixed-graph setting is known as maximally ancestral graphs. We propose a score-based learning algorithm for learning maximally ancestral graphs. A mixed-integer quadratic program is formulated, and an algorithmic approach is proposed in which the pre-generation of exponentially many constraints is avoided by generating only violated constraints in the so-called branch-and-cut ("lazy constraint") method. Comparing the novel approach to the state-of-the-art, we show that it produces more accurate results when applied to small and medium-sized synthetic instances containing up to 25 variables.
[LG-24] Tangentially Aligned Integrated Gradients for User-Friendly Explanations
Link: https://arxiv.org/abs/2503.08240
Authors: Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew
Subjects: Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments: To appear in the proceedings of the 32nd Irish Conference on Artificial Intelligence and Cognitive Science
Abstract:Integrated gradients is a prevalent method in machine learning for addressing the black-box problem of neural networks. The explanations given by integrated gradients depend on a choice of base-point. The choice of base-point is not a priori obvious and can lead to drastically different explanations. There is a longstanding hypothesis that data lies on a low-dimensional Riemannian manifold. The quality of explanations on a manifold can be measured by the extent to which an explanation for a point lies in its tangent space. In this work, we propose that the base-point should be chosen such that it maximises the tangential alignment of the explanation. We formalise the notion of tangential alignment and provide theoretical conditions under which a base-point choice will provide explanations lying in the tangent space. We demonstrate how to approximate the optimal base-point on several well-known image classification datasets. Furthermore, we compare the optimal base-point choice with common base-points and three gradient explainability models.
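The base-point dependence is easy to see numerically. Below is a minimal integrated-gradients sketch (our own toy model and names, not the paper's method): two different base-points yield different attributions, while each satisfies the completeness axiom.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Midpoint Riemann-sum approximation of integrated gradients along the
    straight-line path from `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(x) = x0^2 + 3*x1 with its analytic gradient.
grad = lambda z: np.array([2.0 * z[0], 3.0])
x = np.array([2.0, 1.0])
attr_zero = integrated_gradients(grad, x, baseline=np.zeros(2))
attr_one = integrated_gradients(grad, x, baseline=np.array([1.0, 0.0]))
# The attributions differ with the base-point, but each sums to
# f(x) - f(baseline) (completeness).
```

With the zero baseline the attributions are [4, 3] (summing to f(x) = 7); shifting the baseline changes the first coordinate's attribution to 3, which is exactly the ambiguity the paper's tangential-alignment criterion is meant to resolve.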
[LG-25] Route Sparse Autoencoder to Interpret Large Language Models
Link: https://arxiv.org/abs/2503.08200
Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at this https URL.
[LG-26] Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation
Link: https://arxiv.org/abs/2503.08160
Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang (Shawn) Cheng, Zhixiang Ren
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook comprehensive factors that influence protein-molecule interactions. To address these challenges, we propose a novel fragment-based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept-based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug-likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept-based model, our framework enhances interpretability, offering valuable insights into the molecular design process.
[LG-27] Domain Adaptation and Entanglement: an Optimal Transport Perspective AISTATS’25
Link: https://arxiv.org/abs/2503.08155
Authors: Okan Koç, Alexander Soen, Chao-Kai Chiang, Masashi Sugiyama
Subjects: Machine Learning (cs.LG)
Comments: Accepted for publication in AISTATS'25
Abstract:Current machine learning systems are brittle in the face of distribution shifts (DS), where the target distribution that the system is tested on differs from the source distribution used to train the system. This problem of robustness to DS has been studied extensively in the field of domain adaptation. For deep neural networks, a popular framework for unsupervised domain adaptation (UDA) is domain matching, in which algorithms try to align the marginal distributions in the feature or output space. The current theoretical understanding of these methods, however, is limited and existing theoretical results are not precise enough to characterize their performance in practice. In this paper, we derive new bounds based on optimal transport that analyze the UDA problem. Our new bounds include a term which we dub "entanglement", consisting of an expectation of Wasserstein distance between conditionals with respect to changing data distributions. Analysis of the entanglement term provides a novel perspective on the unoptimizable aspects of UDA. In various experiments with multiple models across several DS scenarios, we show that this term can be used to explain the varying performance of UDA algorithms.
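The entanglement term is built from Wasserstein distances between class conditionals. A minimal 1-D sketch (our own toy setup, not the paper's bounds) shows how W1 behaves before and after marginal alignment:

```python
import numpy as np

def w1(u, v):
    """Empirical 1-D Wasserstein-1 distance between two equal-size samples:
    the mean absolute difference of the sorted values."""
    return np.abs(np.sort(u) - np.sort(v)).mean()

rng = np.random.default_rng(0)
# Class-conditional features from a source domain and a shifted target domain.
src = rng.normal(0.0, 1.0, size=5000)
tgt = rng.normal(2.0, 1.0, size=5000)

w_raw = w1(src, tgt)                                 # large: conditionals move with the shift
w_aligned = w1(src, tgt - tgt.mean() + src.mean())   # small once the marginals are aligned
```

Here aligning the marginals removes nearly all of the conditional discrepancy; in less benign shifts the conditionals stay far apart even after marginal matching, which is exactly the "entangled" regime the bounds isolate.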
[LG-28] Scaling Probabilistic Circuits via Data Partitioning
Link: https://arxiv.org/abs/2503.08141
Authors: Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Probabilistic circuits (PCs) enable us to learn joint distributions over a set of random variables and to perform various probabilistic queries in a tractable fashion. Though the tractability property allows PCs to scale beyond non-tractable models such as Bayesian networks, scaling training and inference of PCs to larger, real-world datasets remains challenging. To remedy the situation, we show how PCs can be learned across multiple machines by recursively partitioning a distributed dataset, thereby unveiling a deep connection between PCs and federated learning (FL). This leads to federated circuits (FCs), a novel and flexible FL framework that (1) allows one to scale PCs on distributed learning environments, (2) trains PCs faster, and (3) unifies for the first time horizontal, vertical, and hybrid FL in one framework by re-framing FL as a density estimation problem over distributed datasets. We demonstrate FC’s capability to scale PCs on various large-scale datasets. Also, we show FC’s versatility in handling horizontal, vertical, and hybrid FL within a unified framework on multiple classification tasks.
[LG-29] Large Scale Multi-Task Bayesian Optimization with Large Language Models
Link: https://arxiv.org/abs/2503.08131
Authors: Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Ryan Marcus, Osbert Bastani, Jacob R. Gardner
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling to more than a moderate number of tasks. We introduce a novel approach leveraging large language models (LLMs) to learn from, and improve upon, previous optimization trajectories, scaling to approximately 2000 distinct tasks. Specifically, we propose an iterative framework in which an LLM is fine-tuned using the high-quality solutions produced by BayesOpt to generate improved initializations that accelerate convergence for future optimization tasks based on previous search trajectories. We evaluate our method on two distinct domains: database query optimization and antimicrobial peptide design. Results demonstrate that our approach creates a positive feedback loop, where the LLM’s generated initializations gradually improve, leading to better optimization performance. As this feedback loop continues, we find that the LLM is eventually able to generate solutions to new tasks in just a few shots that are better than the solutions produced “from scratch” by Bayesian optimization, while simultaneously requiring significantly fewer oracle calls.
[LG-30] Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors
Link: https://arxiv.org/abs/2503.08099
Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 15 figures, 9 tables
Abstract:Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose WUDI-Merging (Whoever started the interference shoUld enD It), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method’s superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3%, and only very few computing resources are required. The code will be publicly available soon.
[LG-31] Evidential Uncertainty Probes for Graph Neural Networks AISTATS2025
Link: https://arxiv.org/abs/2503.08097
Authors: Linlin Yu, Kangshuo Li, Pritom Kumar Saha, Yifei Lou, Feng Chen
Subjects: Machine Learning (cs.LG)
Comments: AISTATS 2025
Abstract:Accurate quantification of both aleatoric and epistemic uncertainties is essential when deploying Graph Neural Networks (GNNs) in high-stakes applications such as drug discovery and financial fraud detection, where reliable predictions are critical. Although Evidential Deep Learning (EDL) efficiently quantifies uncertainty using a Dirichlet distribution over predictive probabilities, existing EDL-based GNN (EGNN) models require modifications to the network architecture and retraining, failing to take advantage of pre-trained models. We propose a plug-and-play framework for uncertainty quantification in GNNs that works with pre-trained models without the need for retraining. Our Evidential Probing Network (EPN) uses a lightweight Multi-Layer-Perceptron (MLP) head to extract evidence from learned representations, allowing efficient integration with various GNN architectures. We further introduce evidence-based regularization techniques, referred to as EPN-reg, to enhance the estimation of epistemic uncertainty with theoretical justifications. Extensive experiments demonstrate that the proposed EPN-reg achieves state-of-the-art performance in accurate and efficient uncertainty quantification, making it suitable for real-world deployment.
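The evidence-to-uncertainty mapping used in EDL-style heads can be sketched as follows (a common convention, alpha = evidence + 1, with vacuity as the epistemic proxy; the function name and the toy numbers are ours):

```python
import numpy as np

def dirichlet_uncertainties(evidence):
    """Split uncertainty from non-negative per-class evidence.

    Uses the EDL convention alpha = evidence + 1: the predictive entropy
    serves as an aleatoric measure and the vacuity K/S (high when total
    evidence is low) as an epistemic measure.
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0
    s = alpha.sum()
    probs = alpha / s
    aleatoric = -(probs * np.log(probs)).sum()   # predictive entropy
    epistemic = len(alpha) / s                   # vacuity
    return probs, aleatoric, epistemic

_, al_conf, ep_conf = dirichlet_uncertainties([100.0, 0.0, 0.0])  # well-supported node
_, al_vac, ep_vac = dirichlet_uncertainties([0.0, 0.0, 0.0])      # no evidence at all
```

With no evidence the vacuity is 1 and the predictive entropy is log K, while strong one-class evidence drives both measures down; the EPN head's job is to produce such evidence vectors from frozen GNN representations.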
[LG-32] ForceGrip: Data-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation
Link: https://arxiv.org/abs/2503.08061
Authors: DongHeun Han, Byungmin Kim, RoUn Lee, KyeongMin Kim, Hyoseok Hwang, HyeongYeop Kang
Subjects: Robotics (cs.RO); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures (with appendix)
Abstract:Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on kinematic approaches or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users’ intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user’s grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios (randomizing object shapes, wrist movements, and trigger input flows) to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip’s superior force controllability and plausibility compared to state-of-the-art methods.
[LG-33] Symbolic Neural Ordinary Differential Equations AAAI2025
Link: https://arxiv.org/abs/2503.08059
Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments: Accepted at AAAI 2025
Abstract:Differential equations are widely used to describe complex dynamical systems with evolving parameters in nature and engineering. Effectively learning a family of maps from the parameter function to the system dynamics is of great significance. In this study, we propose a novel learning framework of symbolic continuous-depth neural networks, termed Symbolic Neural Ordinary Differential Equations (SNODEs), to effectively and accurately learn the underlying dynamics of complex systems. Specifically, our learning framework comprises three stages: initially, pre-training a predefined symbolic neural network via a gradient flow matching strategy; subsequently, fine-tuning this network using Neural ODEs; and finally, constructing a general neural network to capture residuals. In this process, we apply the SNODEs framework to partial differential equation systems through Fourier analysis, achieving resolution-invariant modeling. Moreover, this framework integrates the strengths of symbolism and connectionism, boasting a universal approximation theorem while significantly enhancing interpretability and extrapolation capabilities relative to state-of-the-art baseline methods. We demonstrate this through experiments on several representative complex systems. Therefore, our framework can be further applied to a wide range of scientific problems, such as system bifurcation and control, reconstruction and forecasting, as well as the discovery of new equations.
[LG-34] Accurate INT8 Training Through Dynamic Block-Level Fallback
Link: https://arxiv.org/abs/2503.08040
Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.
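A minimal sketch of the fallback idea (our own threshold rule and block layout; the paper's implementation operates inside mixed-precision GEMM kernels, not on dequantized tensors):

```python
import numpy as np

def quantize_blocks(x, block=32, outlier_thresh=6.0):
    """Per-block INT8 quantize-dequantize with FP16 fallback for outlier blocks.

    A block falls back to 16-bit when its absolute maximum exceeds
    `outlier_thresh` times the median block scale; this threshold rule is an
    illustrative choice, not the paper's criterion.
    """
    blocks = x.reshape(-1, block).astype(np.float32)
    absmax = np.abs(blocks).max(axis=1)
    fallback = absmax > outlier_thresh * np.median(absmax)
    out = np.empty_like(blocks)
    for i, b in enumerate(blocks):
        if fallback[i]:
            out[i] = b.astype(np.float16)           # keep the outlier block in 16-bit
        else:
            scale = max(absmax[i] / 127.0, 1e-12)   # per-block INT8 scale
            out[i] = np.round(b / scale) * scale    # simulate INT8 round trip
    return out.reshape(x.shape), fallback

rng = np.random.default_rng(1)
x = rng.normal(size=256).astype(np.float32)
x[5] = 40.0                                         # inject an activation outlier
deq, fb = quantize_blocks(x)
```

Only the block containing the injected outlier falls back; everywhere else the INT8 round trip keeps the error within half a quantization step, which is the trade-off that preserves accuracy while most of the GEMM stays in 8-bit.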
[LG-35] Empirical Error Estimates for Graph Sparsification
Link: https://arxiv.org/abs/2503.08031
Authors: Siyao Wang, Miles E. Lopes
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:Graph sparsification is a well-established technique for accelerating graph-based learning algorithms, which uses edge sampling to approximate dense graphs with sparse ones. Because the sparsification error is random and unknown, users must contend with uncertainty about the reliability of downstream computations. Although it is possible for users to obtain conceptual guidance from theoretical error bounds in the literature, such results are typically impractical at a numerical level. Taking an alternative approach, we propose to address these issues from a data-driven perspective by computing empirical error estimates. The proposed error estimates are highly versatile, and we demonstrate this in four use cases: Laplacian matrix approximation, graph cut queries, graph-structured regression, and spectral clustering. Moreover, we provide two theoretical guarantees for the error estimates, and explain why the cost of computing them is manageable in comparison to the overall cost of a typical graph sparsification workflow.
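A small numerical sketch of the workflow (our own naming; the paper's estimators are more refined than this plain Monte Carlo repetition): sparsify by weighted edge sampling with Horvitz-Thompson reweighting, then estimate the Laplacian approximation error empirically.

```python
import numpy as np

def laplacian(edges, w, n):
    """Dense graph Laplacian from an edge list with weights."""
    L = np.zeros((n, n))
    for (u, v), wt in zip(edges, w):
        L[u, u] += wt
        L[v, v] += wt
        L[u, v] -= wt
        L[v, u] -= wt
    return L

def sparsify(edges, w, keep, rng):
    """Sample `keep` edges with probability proportional to weight and
    reweight them (Horvitz-Thompson), so the Laplacian is preserved in
    expectation."""
    p = w / w.sum()
    idx = rng.choice(len(edges), size=keep, p=p)
    w_s = np.zeros(len(w))
    for i in idx:
        w_s[i] += w[i] / (keep * p[i])
    return w_s

rng = np.random.default_rng(2)
n = 30
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
w = rng.uniform(0.5, 1.5, size=len(edges))
L = laplacian(edges, w, n)

# Empirical error estimate: repeat the random sparsification and record the
# spectral-norm error of the sparsified Laplacian.
errs = [np.linalg.norm(L - laplacian(edges, sparsify(edges, w, 2000, rng), n), 2)
        for _ in range(20)]
est = float(np.mean(errs))
```

The repeated-sampling average plays the role of a data-driven error estimate: it tells the user how far the sparsified Laplacian typically is from the dense one, without appealing to worst-case theoretical bounds.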
[LG-36] GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals
Link: https://arxiv.org/abs/2503.08015
Authors: Zhaoliang Chen, Cheng Ding, Saurabh Kataria, Runze Yan, Minxiao Wang, Randall Lee, Xiao Hu
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:This study introduces a novel application of a Generative Pre-trained Transformer (GPT) model tailored for photoplethysmography (PPG) signals, serving as a foundation model for various downstream tasks. Adapting the standard GPT architecture to suit the continuous characteristics of PPG signals, our approach demonstrates promising results. Our models are pre-trained on an extensive dataset containing more than 200 million 30-second PPG samples. We explored different supervised fine-tuning techniques to adapt our model to downstream tasks, resulting in performance comparable to or surpassing current state-of-the-art (SOTA) methods in tasks such as atrial fibrillation detection. A standout feature of our GPT model is its inherent capability to perform generative tasks such as signal denoising effectively, without the need for further fine-tuning. This success is attributed to the generative nature of the GPT framework.
[LG-37] Multiplayer Information Asymmetric Bandits in Metric Spaces
Link: https://arxiv.org/abs/2503.08004
Authors: William Chang, Aditi Kartik
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:In this paper we study the Lipschitz bandit problem applied to the multiplayer information asymmetric setting studied in (Chang et al., 2022; Chang et al., 2023). More specifically, we consider information asymmetry in rewards, in actions, or in both. We adopt the CAB algorithm of (Kleinberg, 2004), which uses a fixed discretization to give regret bounds of the same order (in the dimension of the action space) in all three problem settings. We also adopt the zooming algorithm of (Kleinberg et al., 2008), which uses an adaptive discretization, and apply it to information asymmetry in rewards and to information asymmetry in actions.
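The fixed-discretization idea behind CAB can be sketched in one function (a single-player toy with our own constants, not the multiplayer algorithm): discretize [0, 1] into about (T / log T)^(1/3) arms and run UCB1 over them.

```python
import numpy as np

def cab(reward_fn, T, rng=None):
    """Fixed-discretization bandit on [0, 1] in the spirit of Kleinberg's CAB:
    split the interval into K ~ (T / log T)^(1/3) cells and run UCB1 on the
    cell midpoints, observing rewards with additive Gaussian noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = max(2, int(np.ceil((T / max(np.log(T), 1.0)) ** (1.0 / 3.0))))
    arms = (np.arange(K) + 0.5) / K                # midpoints of K equal cells
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(T):
        if t < K:
            a = t                                  # initialize: pull each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += reward_fn(arms[a]) + 0.1 * rng.standard_normal()
    return arms[int(np.argmax(sums / counts))]     # best empirical arm

# A 1-Lipschitz reward peaked at x* = 0.7, observed with Gaussian noise.
x_hat = cab(lambda x: 1.0 - abs(x - 0.7), T=20000)
```

The Lipschitz assumption bounds the discretization bias by the cell width, so the recovered arm lands within one cell of the true optimum; the zooming algorithm instead refines cells adaptively around promising regions.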
[LG-38] Predicting and Understanding College Student Mental Health with Interpretable Machine Learning ALT
Link: https://arxiv.org/abs/2503.08002
Authors: Meghna Roy Chowdhury, Wei Xuan, Shreyas Sen, Yixue Zhao, Yi Ding
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 12 pages, 10 figures, ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE '25), June 24–26, 2025, New York, NY, USA
Abstract:Mental health issues among college students have reached critical levels, significantly impacting academic performance and overall wellbeing. Predicting and understanding mental health status among college students is challenging due to three main factors: the necessity for large-scale longitudinal datasets, the prevalence of black-box machine learning models lacking transparency, and the tendency of existing approaches to provide aggregated insights at the population level rather than individualized understanding. To tackle these challenges, this paper presents I-HOPE, the first Interpretable Hierarchical mOdel for Personalized mEntal health prediction. I-HOPE is a two-stage hierarchical model, validated on the College Experience Study, the longest longitudinal mobile sensing dataset. This dataset spans five years and captures data from both pre-pandemic periods and the COVID-19 pandemic. I-HOPE connects raw behavioral features to mental health status through five defined behavioral categories as interaction labels. This approach achieves a prediction accuracy of 91%, significantly surpassing the 60-70% accuracy of baseline methods. In addition, our model distills complex patterns into interpretable and individualized insights, enabling the future development of tailored interventions and improving mental health support. The code is available at this https URL.
[LG-39] Almost Linear Time Consistent Mode Estimation and Quick Shift Clustering
链接: https://arxiv.org/abs/2503.07995
作者: Sajjad Hashemian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 3 figures, Accepted to OLA 2025
点击查看摘要
Abstract:In this paper, we propose a method for density-based clustering in high-dimensional spaces that combines Locality-Sensitive Hashing (LSH) with the Quick Shift algorithm. The Quick Shift algorithm, known for its hierarchical clustering capabilities, is extended by integrating approximate Kernel Density Estimation (KDE) using LSH to provide efficient density estimates. The proposed approach achieves almost linear time complexity while preserving the consistency of density-based clustering.
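The core procedure can be sketched without the paper's LSH machinery: estimate a density at every point with a (here exact, not approximate) Gaussian KDE, then link each point to its nearest higher-density neighbor within a radius; chains of links end at modes, which label the clusters. All names and parameter values below are illustrative, not the authors' implementation.

```python
import numpy as np

def quick_shift(X, bandwidth=1.0, tau=3.0):
    """Toy Quick Shift: exact KDE density (the paper replaces this with an
    LSH-based approximation), then each point links to its nearest
    strictly-denser neighbor within radius tau."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    density = np.exp(-d2 / (2 * bandwidth ** 2)).sum(1)   # unnormalized Gaussian KDE
    parent = np.arange(n)
    for i in range(n):
        # candidates: strictly denser points within radius tau
        cand = np.where((density > density[i]) & (d2[i] <= tau ** 2))[0]
        if len(cand):
            parent[i] = cand[np.argmin(d2[i][cand])]
    def root(i):  # follow parent pointers up to the mode
        while parent[i] != i:
            i = parent[i]
        return i
    return np.array([root(i) for i in range(n)])

X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 5], [5.1, 5]])
labels = quick_shift(X)  # two groups: the three points near the origin, and the pair near (5, 5)
```

The exact KDE here costs O(n^2); the paper's point is that an LSH-backed approximate KDE brings the whole pipeline to almost linear time while keeping consistency.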
[LG-40] Regulatory DNA sequence Design with Reinforcement Learning
链接: https://arxiv.org/abs/2503.07981
作者: Zhao Yang,Bing Su,Chuan Cao,Ji-Rong Wen
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
点击查看摘要
Abstract:Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at this https URL.
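The reward idea (rewarding activator TFBS occurrences, penalizing repressor ones) can be illustrated with a toy motif-count reward; the motifs and weights below are invented for illustration and are not the paper's computational-inference-based rewards.

```python
# Hypothetical motif-based reward for an RL fine-tuning loop:
# reward rises with activator TFBS counts and falls with repressor counts.
ACTIVATOR_MOTIFS = ["TGACTC", "CACGTG"]   # made-up activator binding sites
REPRESSOR_MOTIFS = ["GGGGGG"]             # made-up repressor binding site

def count_motif(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    return sum(seq[i:i + len(motif)] == motif
               for i in range(len(seq) - len(motif) + 1))

def reward(seq):
    act = sum(count_motif(seq, m) for m in ACTIVATOR_MOTIFS)
    rep = sum(count_motif(seq, m) for m in REPRESSOR_MOTIFS)
    return act - 2.0 * rep   # weighting is illustrative
```

In an RL fine-tuning loop such a reward would score sequences sampled from the pre-trained autoregressive model, steering generation toward activator-rich, repressor-poor CREs.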
[LG-41] Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
链接: https://arxiv.org/abs/2503.07978
作者: Jiahao Xu,Zikai Zhang,Rui Hu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model’s performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principle sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and thus be filtered out. We provide the theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to the state-of-the-art defense methods. The code is available at this https URL.
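A simplified sketch of the two direction signals such an inspection can combine: cosine alignment of each client update with the mean update, plus sign agreement of each update's largest-magnitude parameters with the element-wise majority sign. The scoring rule and threshold here are illustrative, not AlignIns itself.

```python
import numpy as np

def flag_suspicious(updates, align_thresh=2.0, topk=3):
    """Flag client updates whose direction alignment is anomalous.
    updates: (n_clients, n_params). Returns a boolean mask of flagged clients."""
    U = np.asarray(updates, dtype=float)
    mean_dir = U.mean(0)
    # cosine similarity of each update with the overall update direction
    cos = (U @ mean_dir) / (np.linalg.norm(U, axis=1) * np.linalg.norm(mean_dir) + 1e-12)
    # sign agreement of each update's top-k magnitude params with the majority sign
    majority = np.sign(np.sign(U).sum(0))
    agree = []
    for u in U:
        idx = np.argsort(-np.abs(u))[:topk]
        agree.append((np.sign(u[idx]) == majority[idx]).mean())
    score = cos + np.array(agree)
    # flag clients deviating far below the median combined score
    z = (score - np.median(score)) / (score.std() + 1e-12)
    return z < -align_thresh
```

With three roughly aligned benign updates and one sign-flipped malicious update, only the malicious one falls far below the median score and gets filtered.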
[LG-42] Boundary Regression for Leitmotif Detection in Music Audio
链接: https://arxiv.org/abs/2503.07977
作者: Sihun Lee,Dasaem Jeong
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 2 pages, 1 figure; presented at the 2024 ISMIR conference Late-Breaking Demo
点击查看摘要
Abstract:Leitmotifs are musical phrases that are reprised in various forms throughout a piece. Due to diverse variations and instrumentation, detecting the occurrence of leitmotifs from audio recordings is a highly challenging task. Leitmotif detection may be handled as a subcategory of audio event detection, where leitmotif activity is predicted at the frame level. However, as leitmotifs embody distinct, coherent musical structures, a more holistic approach akin to bounding box regression in visual object detection can be helpful. This method captures the entirety of a motif rather than fragmenting it into individual frames, thereby preserving its musical integrity and producing more useful predictions. We present our experimental results on tackling leitmotif detection as a boundary regression task.
[LG-43] Overlap-aware meta-learning attention to enhance hypergraph neural networks for node classification
链接: https://arxiv.org/abs/2503.07961
作者: Murong Yang,Shihui Ying,Xin-Jian Xu
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: latex, 45 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Although hypergraph neural networks (HGNNs) have emerged as a powerful framework for analyzing complex datasets, their practical performance often remains limited. On one hand, existing networks typically employ a single type of attention mechanism, focusing on either structural or feature similarities during message passing. On the other hand, assuming that all nodes in current hypergraph models have the same level of overlap may lead to suboptimal generalization. To overcome these limitations, we propose a novel framework, overlap-aware meta-learning attention for hypergraph neural networks (OMA-HGNN). First, we introduce a hypergraph attention mechanism that integrates both structural and feature similarities. Specifically, we linearly combine their respective losses with weighted factors for the HGNN model. Second, we partition nodes into different tasks based on their diverse overlap levels and develop a multi-task Meta-Weight-Net (MWN) to determine the corresponding weighted factors. Third, we jointly train the internal MWN model with the losses from the external HGNN model and train the external model with the weighted factors from the internal model. To evaluate the effectiveness of OMA-HGNN, we conducted experiments on six real-world datasets and benchmarked its performance against nine state-of-the-art methods for node classification. The results demonstrate that OMA-HGNN excels in learning superior node representations and outperforms these baselines.
[LG-44] Recent Advances in Hypergraph Neural Networks
链接: https://arxiv.org/abs/2503.07959
作者: Murong Yang,Xin-Jian Xu
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: Latex, 35 pages, 1 figures
点击查看摘要
Abstract:The growing interest in hypergraph neural networks (HGNNs) is driven by their capacity to capture the complex relationships and patterns within hypergraph structured data across various domains, including computer vision, complex networks, and natural language processing. This paper comprehensively reviews recent advances in HGNNs and presents a taxonomy of mainstream models based on their architectures: hypergraph convolutional networks (HGCNs), hypergraph attention networks (HGATs), hypergraph autoencoders (HGAEs), hypergraph recurrent networks (HGRNs), and deep hypergraph generative models (DHGGMs). For each category, we delve into its practical applications, mathematical mechanisms, literature contributions, and open problems. Finally, we discuss some common challenges and promising research directions. This paper aspires to be a helpful resource that provides guidance for future research and applications of HGNNs.
[LG-45] Counterfactual Explanations for Model Ensembles Using Entropic Risk Measures
链接: https://arxiv.org/abs/2503.07934
作者: Erfaun Noorani,Pasan Dissanayake,Faisal Hamman,Sanghamitra Dutta
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Systems and Control (eess.SY); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Counterfactual explanations indicate the smallest change in input that can translate to a different outcome for a machine learning model. Counterfactuals have generated immense interest in high-stakes applications such as finance, education, hiring, etc. In several use-cases, the decision-making process often relies on an ensemble of models rather than just one. Despite significant research on counterfactuals for one model, the problem of generating a single counterfactual explanation for an ensemble of models has received limited interest. Each individual model might lead to a different counterfactual, whereas trying to find a counterfactual accepted by all models might significantly increase cost (effort). We propose a novel strategy to find the counterfactual for an ensemble of models using the perspective of entropic risk measure. Entropic risk is a convex risk measure that satisfies several desirable properties. We incorporate our proposed risk measure into a novel constrained optimization to generate counterfactuals for ensembles that stay valid for several models. The main significance of our measure is that it provides a knob that allows for the generation of counterfactuals that stay valid under an adjustable fraction of the models. We also show that a limiting case of our entropic-risk-based strategy yields a counterfactual valid for all models in the ensemble (worst-case min-max approach). We study the trade-off between the cost (effort) for the counterfactual and its validity for an ensemble by varying degrees of risk aversion, as determined by our risk parameter knob. We validate our performance on real-world datasets.
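The entropic risk measure at the heart of the method is rho_theta(X) = (1/theta) * log E[exp(theta * X)]: as theta -> 0 it approaches the mean cost over the ensemble, and as theta -> infinity it approaches the maximum, which is exactly the adjustable knob between average-case and worst-case (all-models) validity described above. A numerically stable sketch, with names of our own:

```python
import math

def entropic_risk(costs, theta):
    """rho_theta(X) = (1/theta) * log(mean(exp(theta * X))).
    costs: per-model counterfactual costs; theta: risk-aversion knob."""
    n = len(costs)
    m = max(costs)  # log-sum-exp shift for numerical stability
    return m + math.log(sum(math.exp(theta * (c - m)) for c in costs) / n) / theta

costs = [0.2, 0.4, 1.0]
low = entropic_risk(costs, 0.01)    # close to the mean cost
high = entropic_risk(costs, 100.0)  # close to the worst-case (max) cost
```

Minimizing `entropic_risk` over candidate counterfactuals with small theta tolerates invalidity on some models; large theta recovers the min-max counterfactual valid for every ensemble member.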
[LG-46] Hyperoctant Search Clustering: A Method for Clustering Data in High-Dimensional Hyperspheres
链接: https://arxiv.org/abs/2503.07917
作者: Mauricio Toledo-Acosta,Luis Ángel Ramos-García,Jorge Hermosillo-Valadez
类目: Machine Learning (cs.LG)
*备注: 22 pages, 9 figures
点击查看摘要
Abstract:Clustering of high-dimensional data sets is a growing need in artificial intelligence, machine learning and pattern recognition. In this paper, we propose a new clustering method based on a combinatorial-topological approach applied to regions of space defined by signs of coordinates (hyperoctants). In high-dimensional spaces, this approach often reduces the size of the dataset while preserving sufficient topological features. According to a density criterion, the method builds clusters of data points based on the partitioning of a graph, whose vertices represent hyperoctants, and whose edges connect neighboring hyperoctants under the Levenshtein distance. We call this method HyperOctant Search Clustering. We prove some mathematical properties of the method. To assess its performance, we choose the application of topic detection, which is an important task in text mining. Our results suggest that our method is more stable under variations of the main hyperparameter, and remarkably, it is not only a clustering method, but also a tool to explore the dataset from a topological perspective, as it directly provides information about the number of hyperoctants where there are data points. We also discuss the possible connections between our clustering method and other research fields.
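The hyperoctant reduction can be sketched directly: map each point to the sign pattern of its coordinates (its hyperoctant), then merge octants that differ in a single sign. This is a toy stand-in for the paper's Levenshtein-neighborhood graph and density-based partitioning, not the actual method.

```python
import numpy as np
from collections import defaultdict

def hyperoctant_clusters(X):
    """Group points by coordinate-sign pattern, then union octants
    that are exactly one sign-flip apart."""
    octants = defaultdict(list)
    for i, x in enumerate(X):
        octants[tuple(np.sign(x).astype(int))].append(i)
    keys = list(octants)
    parent = {k: k for k in keys}   # union-find over octants
    def find(k):
        while parent[k] != k:
            k = parent[k]
        return k
    for a in keys:
        for b in keys:
            if sum(x != y for x, y in zip(a, b)) == 1:
                parent[find(a)] = find(b)
    labels = {}
    for k in keys:
        for i in octants[k]:
            labels[i] = find(k)
    return [labels[i] for i in range(len(X))]
```

Note how the dataset shrinks: however many points there are, the graph only has as many vertices as occupied hyperoctants, which is the source of the method's efficiency in high dimensions.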
[LG-47] Cross-platform Prediction of Depression Treatment Outcome Using Location Sensory Data on Smartphones
链接: https://arxiv.org/abs/2503.07883
作者: Soumyashree Sahoo,Chinmaey Shende,Md. Zakir Hossain,Parit Patel,Yushuo Niu,Xinyu Wang,Shweta Ware,Jinbo Bi,Jayesh Kamath,Alexander Russel,Dongjin Song,Qian Yang,Bing Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Currently, depression treatment relies on closely monitoring patients' response to treatment and adjusting the treatment as needed. Using self-reported or physician-administered questionnaires to monitor treatment response is, however, burdensome, costly and subject to recall bias. In this paper, we explore using location sensory data collected passively on smartphones to predict treatment outcome. To address heterogeneous data collection on Android and iOS phones, the two predominant smartphone platforms, we explore using domain adaptation techniques to map their data to a common feature space, and then use the data jointly to train machine learning models. Our results show that this domain adaptation approach can lead to significantly better prediction than that with no domain adaptation. In addition, our results show that using location features and baseline self-reported questionnaire score can lead to an F1 score of up to 0.67, comparable to that obtained using periodic self-reported questionnaires, indicating that using location data is a promising direction for predicting depression treatment outcome.
[LG-48] ReLATE: Resilient Learner Selection for Multivariate Time-Series Classification Against Adversarial Attacks AAAI-25
链接: https://arxiv.org/abs/2503.07882
作者: Cagla Ipek Kocal,Onat Gungor,Aaron Tartz,Tajana Rosing,Baris Aksanli
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by the AAAI-25 Workshop on Artificial Intelligence for Time Series Analysis (AI4TS)
点击查看摘要
Abstract:Minimizing computational overhead in time-series classification, particularly in deep learning models, presents a significant challenge. This challenge is further compounded by adversarial attacks, emphasizing the need for resilient methods that ensure robust performance and efficient model selection. We introduce ReLATE, a framework that identifies robust learners based on dataset similarity, reduces computational overhead, and enhances resilience. ReLATE maintains multiple deep learning models in well-known adversarial attack scenarios, capturing model performance. ReLATE identifies the most analogous dataset to a given target using a similarity metric, then applies the optimal model from the most similar dataset. ReLATE reduces computational overhead by an average of 81.2%, enhancing adversarial resilience and streamlining robust model selection, all without sacrificing performance, within 4.2% of Oracle.
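The model-selection step reads as a nearest-dataset lookup; a minimal sketch, assuming each dataset is summarized by a simple statistics vector (the paper's similarity metric and model library are more elaborate):

```python
import numpy as np

def select_robust_model(target_stats, library):
    """library: {dataset_name: (stats_vector, trained_model)}.
    Pick the model trained on the most similar dataset. The similarity
    metric here is plain Euclidean distance over summary statistics,
    a placeholder for the paper's metric."""
    best = min(library, key=lambda k: np.linalg.norm(
        np.asarray(library[k][0]) - np.asarray(target_stats)))
    return best, library[best][1]
```

The saving comes from skipping adversarial retraining on the target dataset entirely: the pre-evaluated model from the most similar dataset is reused as-is.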
[LG-49] Reproducibility and Artifact Consistency of the SIGIR 2022 Recommender Systems Papers Based on Message Passing
链接: https://arxiv.org/abs/2503.07823
作者: Maurizio Ferrari Dacrema,Michael Benigni,Nicola Ferro
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Graph-based techniques relying on neural networks and embeddings have gained attention as a way to develop Recommender Systems (RS) with several papers on the topic presented at SIGIR 2022 and 2023. Given the importance of ensuring that published research is methodologically sound and reproducible, in this paper we analyze 10 graph-based RS papers, most of which were published at SIGIR 2022, and assess their impact on subsequent work published in SIGIR 2023. Our analysis reveals several critical points that require attention: (i) the prevalence of bad practices, such as erroneous data splits or information leakage between training and testing data, which call into question the validity of the results; (ii) frequent inconsistencies between the provided artifacts (source code and data) and their descriptions in the paper, causing uncertainty about what is actually being evaluated; and (iii) the preference for new or complex baselines that are weaker compared to simpler ones, creating the impression of continuous improvement even when, particularly for the Amazon-Book dataset, the state-of-the-art has significantly worsened. Due to these issues, we are unable to confirm the claims made in most of the papers we examined and attempted to reproduce.
[LG-50] Strengthening the Internal Adversarial Robustness in Lifted Neural Networks
链接: https://arxiv.org/abs/2503.07818
作者: Christopher Zach
类目: Machine Learning (cs.LG)
*备注: 13 pages
点击查看摘要
Abstract:Lifted neural networks (i.e. neural architectures explicitly optimizing over respective network potentials to determine the neural activities) can be combined with a type of adversarial training to gain robustness for internal as well as input layers, in addition to improved generalization performance. In this work we first investigate how adversarial robustness in this framework can be further strengthened by solely modifying the training loss. In a second step we fix some remaining limitations and arrive at a novel training loss for lifted neural networks, that combines targeted and untargeted adversarial perturbations.
[LG-51] Sublinear Algorithms for Wasserstein and Total Variation Distances: Applications to Fairness and Privacy Auditing
链接: https://arxiv.org/abs/2503.07775
作者: Debabrota Basu,Debarshi Chanda
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Data Structures and Algorithms (cs.DS); Computation (stat.CO)
*备注:
点击查看摘要
Abstract:Resource-efficiently computing representations of probability distributions and the distances between them while only having access to the samples is a fundamental and useful problem across mathematical sciences. In this paper, we propose a generic algorithmic framework to estimate the PDF and CDF of any sub-Gaussian distribution while the samples from them arrive in a stream. We compute mergeable summaries of distributions from the stream of samples that require sublinear space w.r.t. the number of observed samples. This allows us to estimate Wasserstein and Total Variation (TV) distances between any two sub-Gaussian distributions while samples arrive in streams and from multiple sources (e.g. federated learning). Our algorithms significantly improve on existing distance-estimation methods, which incur super-linear time and linear space complexities. In addition, we use the proposed estimators of Wasserstein and TV distances to audit the fairness and privacy of the ML algorithms. We empirically demonstrate the efficiency of the algorithms for estimating these distances and auditing using both synthetic and real-world datasets.
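The two distances being estimated are easy to state on empirical samples: 1-Wasserstein is the area between the two CDFs (for equal-size samples, the mean absolute difference of sorted values), and TV can be approximated from histograms on a shared grid. A plain, non-streaming sketch follows; the paper's contribution is achieving this in sublinear space from streams, which this sketch does not attempt.

```python
import numpy as np

def wasserstein_1(xs, ys):
    """W1 between two equal-size empirical distributions:
    mean absolute difference of the sorted samples."""
    return np.abs(np.sort(xs) - np.sort(ys)).mean()

def tv_distance(xs, ys, bins=20, lo=-5.0, hi=5.0):
    """TV approximated from normalized histograms on a shared grid."""
    p, _ = np.histogram(xs, bins=bins, range=(lo, hi))
    q, _ = np.histogram(ys, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()
```

Applied to two models' per-group loss samples, these estimators support exactly the kind of fairness/privacy auditing the abstract describes.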
[LG-52] Graphint: Graph-based Time Series Clustering Visualisation Tool
链接: https://arxiv.org/abs/2503.07698
作者: Paul Boniol,Donato Tiano,Angela Bonifati,Themis Palpanas
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:With the exponential growth of time series data across diverse domains, there is a pressing need for effective analysis tools. Time series clustering is important for identifying patterns in these datasets. However, prevailing methods often encounter obstacles in maintaining data relationships and ensuring interpretability. We present Graphint, an innovative system based on the k-Graph methodology that addresses these challenges. Graphint integrates a robust time series clustering algorithm with an interactive tool for comparison and interpretation. More precisely, our system allows users to compare results against competing approaches, identify discriminative subsequences within specified datasets, and visualize the critical information utilized by k-Graph to generate outputs. Overall, Graphint offers a comprehensive solution for extracting actionable insights from complex temporal datasets.
[LG-53] PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models NAACL2025
链接: https://arxiv.org/abs/2503.07697
作者: Michael-Andrei Panaitescu-Liess,Pankayaraj Pathmanathan,Yigitcan Kaya,Zora Che,Bang An,Sicheng Zhu,Aakriti Agrawal,Furong Huang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 18 pages, 18 figures. Accepted at NAACL 2025
点击查看摘要
Abstract:As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, evaluated in a wide range of experiments, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.
[LG-54] Log Optimization Simplification Method for Predicting Remaining Time
链接: https://arxiv.org/abs/2503.07683
作者: Jianhong Ye,Siyuan Zhang,Yan Lin
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Information systems generate a large volume of event log data during business operations, much of which consists of low-value and redundant information. When performance predictions are made directly from these logs, the accuracy of the predictions can be compromised. Researchers have explored methods to simplify and compress these data while preserving their valuable components. Most existing approaches focus on reducing the dimensionality of the data by eliminating redundant and irrelevant features. However, there has been limited investigation into the efficiency of execution both before and after event log simplification. In this paper, we present a prediction point selection algorithm designed to avoid the simplification of all points that function similarly. We select sequences or self-loop structures to form a simplifiable segment, and we optimize the deviation between the actual simplifiable value and the original data prediction value to prevent over-simplification. Experiments indicate that the simplified event log retains its predictive performance and, in some cases, enhances its predictive accuracy compared to the original event log.
[LG-55] Transforming Traditional Neural Networks into Neuromorphic Quantum-Cognitive Models: A Tutorial with Applications
链接: https://arxiv.org/abs/2503.07681
作者: Milan Maksimovic,Ivan S. Maksymov
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
点击查看摘要
Abstract:Quantum technologies are increasingly pervasive, underpinning the operation of numerous electronic, optical and medical devices. Today, we are also witnessing rapid advancements in quantum computing and communication. However, access to quantum technologies in computation remains largely limited to professionals in research organisations and high-tech industries. This paper demonstrates how traditional neural networks can be transformed into neuromorphic quantum models, enabling anyone with a basic understanding of undergraduate-level machine learning to create quantum-inspired models that mimic the functioning of the human brain – all using a standard laptop. We present several examples of these quantum machine learning transformations and explore their potential applications, aiming to make quantum technology more accessible and practical for broader use. The examples discussed in this paper include quantum-inspired analogues of feedforward neural networks, recurrent neural networks, Echo State Network reservoir computing and Bayesian neural networks, demonstrating that a quantum approach can both optimise the training process and equip the models with certain human-like cognitive characteristics.
[LG-56] WECAR: An End-Edge Collaborative Inference and Training Framework for WiFi-Based Continuous Human Activity Recognition
链接: https://arxiv.org/abs/2503.07669
作者: Rong Li,Tao Deng,Siwei Feng,He Huang,Juncheng Jia,Di Yuan,Keqin Li
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2502.17483
点击查看摘要
Abstract:WiFi-based human activity recognition (HAR) holds significant promise for ubiquitous sensing in smart environments. A critical challenge lies in enabling systems to dynamically adapt to evolving scenarios, learning new activities without catastrophic forgetting of prior knowledge, while adhering to the stringent computational constraints of edge devices. Current approaches struggle to reconcile these requirements due to prohibitive storage demands for retaining historical data and inefficient parameter utilization. We propose WECAR, an end-edge collaborative inference and training framework for WiFi-based continuous HAR, which decouples computational workloads to overcome these limitations. In this framework, edge devices handle model training, lightweight optimization, and updates, while end devices perform efficient inference. WECAR introduces two key innovations, i.e., dynamic continual learning with parameter efficiency and hierarchical distillation for end deployment. For the former, we propose a transformer-based architecture enhanced by task-specific dynamic model expansion and stability-aware selective retraining. For the latter, we propose a dual-phase distillation mechanism that includes multi-head self-attention relation distillation and prefix relation distillation. We implement WECAR based on heterogeneous hardware using Jetson Nano as edge devices and the ESP32 as end devices, respectively. Our experiments across three public WiFi datasets reveal that WECAR not only outperforms several state-of-the-art methods in performance and parameter efficiency, but also achieves a substantial reduction in the model’s parameter count post-optimization without sacrificing accuracy. This validates its practicality for resource-constrained environments.
[LG-57] The Computational Complexity of Positive Non-Clashing Teaching in Graphs ICLR2025
链接: https://arxiv.org/abs/2503.07665
作者: Robert Ganian,Liana Khazaliya,Fionn Mc Inerney,Mathis Rocton
类目: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The short version of this paper will appear in the proceedings of ICLR 2025
点击查看摘要
Abstract:We study the classical and parameterized complexity of computing the positive non-clashing teaching dimension of a set of concepts, that is, the smallest number of examples per concept required to successfully teach an intelligent learner under the considered, previously established model. For any class of concepts, it is known that this problem can be effortlessly transferred to the setting of balls in a graph G. We establish (1) the NP-hardness of the problem even when restricted to instances with positive non-clashing teaching dimension k=2 and where all balls in the graph are present, (2) near-tight running time upper and lower bounds for the problem on general graphs, (3) fixed-parameter tractability when parameterized by the vertex integrity of G, and (4) a lower bound excluding fixed-parameter tractability when parameterized by the feedback vertex number and pathwidth of G, even when combined with k. Our results provide a nearly complete understanding of the complexity landscape of computing the positive non-clashing teaching dimension and answer open questions from the literature.
[LG-58] MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
链接: https://arxiv.org/abs/2503.07654
作者: Jinguang Wang,Jingyu Wang,Haifeng Sun,Tingting Yang,Zirui Zhuang,Wanyi Ning,Yuexi Yin,Qi Qi,Jianxin Liao
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.
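Plain per-channel static quantization, the setting MergeQuant operates in, amounts to one offline scale per output channel. The sketch below shows that generic scheme only; it does not implement MergeQuant's Quantization Step Migration, dimensional reconstruction, or adaptive clipping.

```python
import numpy as np

def perchannel_quantize_w4(W):
    """Symmetric per-channel 4-bit quantization of a weight matrix
    (rows = output channels). Static: scales are computed once offline."""
    qmax = 7  # int4 range is [-8, 7]; use 7 for a symmetric scale
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # guard all-zero channels
    Q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
    return Q, scale

def dequantize(Q, scale):
    return Q.astype(np.float32) * scale
```

Because the scales are fixed, the scale-and-map step can be fused into adjacent linear operations ahead of time, which is the overhead MergeQuant's migration eliminates relative to per-token dynamic quantization.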
[LG-59] he day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model
链接: https://arxiv.org/abs/2503.07648
作者: Changgang Wang,Wei Liu,Yu Cao,Dong Liang,Yang Li,Jingshan Mo
类目: Machine Learning (cs.LG)
*备注: in Chinese language, Accepted by Power System Technology
点击查看摘要
Abstract:In the context of the rising share of new energy generation, accurately generating new energy output scenarios is crucial for day-ahead power system scheduling. Deep learning-based scenario generation methods can address this need, but their black-box nature raises concerns about interpretability. To tackle this issue, this paper introduces a method for day-ahead new energy scenario generation based on an improved conditional generative diffusion model. This method is built on the theoretical framework of Markov chains and variational inference. It first transforms historical data into pure noise through a diffusion process, then uses conditional information to guide the denoising process, ultimately generating scenarios that satisfy the conditional distribution. Additionally, the noise schedule is improved to a cosine form, enhancing the quality of the generated scenarios. When applied to actual wind and solar output data, the results demonstrate that this method effectively generates new energy output scenarios with good adaptability.
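The cosine-form schedule mentioned above is presumably the standard cosine cumulative-noise schedule from the diffusion literature; a sketch under that assumption (the offset `s` and form follow the common convention, not necessarily this paper's variant):

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cosine cumulative noise schedule: the signal rate alpha_bar
    decays smoothly from ~1 at t=0 to ~0 at t=T, avoiding the abrupt
    late-step noise injection of a linear schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)
```

During the forward process, a sample at step t is `sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise`, so the schedule directly controls how quickly historical output data is destroyed into pure noise.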
[LG-60] On the Importance of Clearsky Model in Short-Term Solar Radiation Forecasting
链接: https://arxiv.org/abs/2503.07647
作者: Cyril Voyant,Milan Despotovic,Gilles Notton,Yves-Marie Saint-Drenan,Mohammed Asloune,Luis Garcia-Gutierrez
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 20 pages, 10 Figures and 1 Table
点击查看摘要
Abstract:Clearsky models are widely used in solar energy for many applications such as quality control, resource assessment, satellite-based irradiance estimation and forecasting. However, their use in forecasting and nowcasting is associated with a number of challenges. Synchronization errors, reliance on the Clearsky index (ratio of the global horizontal irradiance to its cloud-free counterpart) and high sensitivity of the clearsky model to errors in aerosol optical depth at low solar elevation limit their added value in real-time applications. This paper explores the feasibility of short-term forecasting without relying on a clearsky model. We propose a Clearsky-Free forecasting approach using Extreme Learning Machine (ELM) models. ELM learns daily periodicity and local variability directly from raw Global Horizontal Irradiance (GHI) data. It eliminates the need for Clearsky normalization, simplifying the forecasting process and improving scalability. Our approach is a non-linear adaptive statistical method that implicitly learns the irradiance in cloud-free conditions, removing the need for a clearsky model and the related operational issues. Deterministic and probabilistic results are compared to traditional benchmarks, including ARMA with McClear-generated Clearsky data and quantile regression for probabilistic forecasts. ELM matches or outperforms these methods, providing accurate predictions and robust uncertainty quantification. This approach offers a simple, efficient solution for real-time solar forecasting. By overcoming the limitations of the usual multiplicative Clearsky-based stationarization scheme, it provides a flexible and reliable framework for modern energy systems.
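An Extreme Learning Machine is a single hidden layer with frozen random weights and a ridge-regression readout, so "training" reduces to one linear solve; a generic sketch (hyperparameters and feature design are ours, not the paper's configuration):

```python
import numpy as np

def train_elm(X, y, n_hidden=50, reg=1e-3, seed=0):
    """Extreme Learning Machine: random fixed hidden layer plus a
    ridge-regression readout (only the output weights are learned)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                     # random nonlinear features
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return (W, b, beta)

def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta
```

For GHI forecasting, X would hold lagged raw irradiance and calendar features; the single-solve fit is what makes the approach cheap enough for real-time use.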
[LG-61] BicliqueEncoder: An Efficient Method for Link Prediction in Bipartite Networks using Formal Concept Analysis and Transformer Encoder
链接: https://arxiv.org/abs/2503.07645
作者: Hongyuan Yang,Siqi Peng,Akihiro Yamamoto
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 33 pages, 8 figures
点击查看摘要
Abstract:We propose a novel and efficient method for link prediction in bipartite networks, using formal concept analysis (FCA) and the Transformer encoder. Link prediction in bipartite networks finds practical applications in various domains such as product recommendation in online sales, and prediction of chemical-disease interaction in medical science. Since for link prediction, the topological structure of a network contains valuable information, many approaches focus on extracting structural features and then utilizing them for link prediction. Bi-cliques, as a type of structural feature of bipartite graphs, can be utilized for link prediction. Although several link prediction methods utilizing bi-cliques have been proposed and perform well in rather small datasets, all of them face challenges with scalability when dealing with large datasets since they demand substantial computational resources. This limits the practical utility of these approaches in real-world applications. To overcome the limitation, we introduce a novel approach employing iceberg concept lattices and the Transformer encoder. Our method requires fewer computational resources, making it suitable for large-scale datasets while maintaining high prediction performance. We conduct experiments on five large real-world datasets that exceed the capacity of previous bi-clique-based approaches to demonstrate the efficacy of our method. Additionally, we perform supplementary experiments on five small datasets to compare with the previous bi-clique-based methods for bipartite link prediction and demonstrate that our method is more efficient than the previous ones.
[LG-62] dnamite: A Python Package for Neural Additive Models
链接: https://arxiv.org/abs/2503.07642
作者: Mike Van Ness,Madeleine Udell
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Additive models offer accurate and interpretable predictions for tabular data, a critical tool for statistical modeling. Recent advances in Neural Additive Models (NAMs) allow these models to handle complex machine learning tasks, including feature selection and survival analysis, on large-scale data. This paper introduces dnamite, a Python package that implements NAMs for these advanced applications. dnamite provides a scikit-learn style interface to train regression, classification, and survival analysis NAMs, with built-in support for feature selection. We describe the methodology underlying dnamite, its design principles, and its implementation. Through an application to the MIMIC III clinical dataset, we demonstrate the utility of dnamite in a real-world setting where feature selection and survival analysis are both important. The package is publicly available via pip and documented at this http URL.
[LG-63] Is Pre-training Applicable to the Decoder for Dense Prediction?
链接: https://arxiv.org/abs/2503.07637
作者: Chao Ning,Wanshui Gan,Weihao Xuan,Naoto Yokoya
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Pre-trained encoders are widely employed in dense prediction tasks for their capability to effectively extract visual features from images. The decoder subsequently processes these features to generate pixel-level predictions. However, due to structural differences and variations in input data, only encoders benefit from pre-learned representations from vision benchmarks such as image classification and self-supervised learning, while decoders are typically trained from scratch. In this paper, we introduce \times Net, which facilitates a “pre-trained encoder \times pre-trained decoder” collaboration through three innovative designs. \times Net enables the direct utilization of pre-trained models within the decoder, integrating pre-learned representations into the decoding process to enhance performance in dense prediction tasks. By simply coupling the pre-trained encoder and pre-trained decoder, \times Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, \times Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state-of-the-art performance, particularly in monocular depth estimation.
[LG-64] A Quantum Neural Network Transfer-Learning Model for Forecasting Problems with Continuous and Discrete Variables
链接: https://arxiv.org/abs/2503.07633
作者: Ismael Abdulrahman
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Quantum Physics (quant-ph)
*备注:
点击查看摘要
Abstract:This study introduces a continuous-variable quantum neural network (CV-QNN) model designed as a transfer-learning approach for forecasting problems. The proposed quantum technique features a simple structure with only eight trainable parameters, a single quantum layer with two wires to create entanglement, and ten quantum gates, hence the name QNNet10, effectively mimicking the functionality of classical neural networks. A notable aspect is that the quantum network achieves high accuracy with random initialization after a single iteration. This pretrained model is innovative as it requires no training or parameter tuning when applied to new datasets, allowing for parameter freezing while enabling the addition of a final layer for fine-tuning. Additionally, an equivalent discrete-variable quantum neural network (DV-QNN) is presented, structured similarly to the CV model. However, analysis shows that the two-wire DV model does not significantly enhance performance. As a result, a four-wire DV model is proposed, achieving comparable results but requiring a larger and more complex structure with additional gates. The pretrained model is applied to five forecasting problems of varying sizes, demonstrating its effectiveness.
[LG-65] Hierarchical autoregressive neural networks in three-dimensional statistical system
链接: https://arxiv.org/abs/2503.08610
作者: Piotr Białas,Vaibhav Chahar,Piotr Korcyl,Tomasz Stebel,Mateusz Winiarski,Dawid Zapolski
类目: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat)
*备注: 11 pages, 7 figures
点击查看摘要
Abstract:Autoregressive Neural Networks (ANN) have been recently proposed as a mechanism to improve the efficiency of Monte Carlo algorithms for several spin systems. The idea relies on the fact that the total probability of a configuration can be factorized into conditional probabilities of each spin, which in turn can be approximated by a neural network. Once trained, the ANNs can be used to sample configurations from the approximated probability distribution and to evaluate explicitly this probability for a given configuration. It has also been observed that such conditional probabilities give access to information-theoretic observables such as mutual information or entanglement entropy. So far, these methods have been applied to two-dimensional statistical systems or one-dimensional quantum systems. In this paper, we describe a generalization of the hierarchical algorithm to three spatial dimensions and study its performance on the example of the Ising model. We discuss the efficiency of the training and also describe the scaling with the system’s dimensionality by comparing results for two- and three-dimensional Ising models with the same number of spins. Finally, we provide estimates of thermodynamical observables for the three-dimensional Ising model, such as the entropy and free energy in a range of temperatures across the phase transition.
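The autoregressive factorization underlying these methods, p(s) = \prod_i p(s_i | s_1, ..., s_{i-1}), can be sketched independently of any particular network. The snippet below samples a spin configuration from a chain of conditionals and returns its exact log-probability, which is the property the paper exploits for Monte Carlo reweighting and for information-theoretic observables. The toy conditional stands in for the trained ANN and is purely illustrative.

```python
import numpy as np

def sample_autoregressive(cond_prob, n_spins, rng):
    """Sample a spin configuration from p(s) = prod_i p(s_i | s_<i)
    and return its exact log-probability. `cond_prob(prefix)` returns
    P(s_i = +1 | s_1..s_{i-1}); here it is a hypothetical stand-in
    for the trained neural conditional."""
    spins, logp = [], 0.0
    for _ in range(n_spins):
        p_up = cond_prob(spins)
        s = 1 if rng.random() < p_up else -1
        spins.append(s)
        logp += np.log(p_up if s == 1 else 1.0 - p_up)
    return np.array(spins), logp

# Toy conditional: each spin prefers to align with the previous one
def toy_cond(prefix, bias=0.8):
    if not prefix:
        return 0.5
    return bias if prefix[-1] == 1 else 1.0 - bias

rng = np.random.default_rng(42)
config, logp = sample_autoregressive(toy_cond, 10, rng)
```

Because the same conditionals give both the sample and its probability, configurations drawn this way can be reweighted exactly, which is what makes ANN-based samplers attractive for spin systems.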
[LG-66] Sparsity-Induced Global Matrix Autoregressive Model with Auxiliary Network Data
链接: https://arxiv.org/abs/2503.08579
作者: Sanyou Wu,Dan Yang,Yan Xu,Long Feng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Jointly modeling and forecasting economic and financial variables across a large set of countries has long been a significant challenge. Two primary approaches have been utilized to address this issue: the vector autoregressive model with exogenous variables (VARX) and the matrix autoregression (MAR). The VARX model captures domestic dependencies, but treats the global factors driven by international trade as exogenous variables. In contrast, the MAR model simultaneously considers variables from multiple countries but ignores the trade network. In this paper, we propose an extension of the MAR model that achieves these two aims at once, i.e., studying both international dependencies and the impact of the trade network on the global economy. Additionally, we introduce a sparse component to the model to differentiate between systematic and idiosyncratic cross-predictability. To estimate the model parameters, we propose both a likelihood estimation method and a bias-corrected alternating minimization version. We provide theoretical and empirical analyses of the model’s properties, alongside presenting intriguing economic insights derived from our findings.
[LG-67] Accelerated Distributed Optimization with Compression and Error Feedback
链接: https://arxiv.org/abs/2503.08427
作者: Yuan Gao,Anton Rodomanov,Jeremy Rack,Sebastian U. Stich
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Modern machine learning tasks often involve massive datasets and models, necessitating distributed optimization algorithms with reduced communication overhead. Communication compression, where clients transmit compressed updates to a central server, has emerged as a key technique to mitigate communication bottlenecks. However, the theoretical understanding of stochastic distributed optimization with contractive compression remains limited, particularly in conjunction with Nesterov acceleration – a cornerstone for achieving faster convergence in optimization. In this paper, we propose a novel algorithm, ADEF (Accelerated Distributed Error Feedback), which integrates Nesterov acceleration, contractive compression, error feedback, and gradient difference compression. We prove that ADEF achieves the first accelerated convergence rate for stochastic distributed optimization with contractive compression in the general convex regime. Numerical experiments validate our theoretical findings and demonstrate the practical efficacy of ADEF in reducing communication costs while maintaining fast convergence.
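The error-feedback mechanism at the heart of methods like ADEF can be illustrated in isolation: the residual left over after compressing an update is stored locally and added back before the next compression, so no gradient information is permanently discarded. The following is a generic sketch of error feedback with top-k compression on a toy quadratic, not the ADEF algorithm itself (which additionally combines Nesterov acceleration and gradient-difference compression); the step size, sparsity level, and objective are illustrative choices.

```python
import numpy as np

def top_k(v, k):
    """Contractive top-k sparsifier: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd(grad_fn, x0, lr=0.1, k=2, steps=500):
    """Gradient descent with error feedback (EF): the compression
    residual is remembered and re-injected at the next step."""
    x = x0.astype(float).copy()
    e = np.zeros_like(x)          # error-feedback memory
    for _ in range(steps):
        g = grad_fn(x)
        c = top_k(g + e, k)       # only the compressed update is "transmitted"
        e = (g + e) - c           # store what was dropped
        x -= lr * c
    return x

# Toy quadratic: minimize ||x - target||^2 while transmitting only 2 of 5 coordinates
target = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
x_star = ef_sgd(lambda x: 2 * (x - target), np.zeros(5), lr=0.1, k=2, steps=500)
```

Despite the aggressive compression, the iterate still converges to the minimizer, because every dropped coordinate is eventually flushed from the error memory.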
[LG-68] Energy Scale Degradation in Sparse Quantum Solvers: A Barrier to Quantum Utility
链接: https://arxiv.org/abs/2503.08303
作者: Thang N. Dinh,Cao P. Cong
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantum computing offers a promising route for tackling hard optimization problems by encoding them as Ising models. However, sparse qubit connectivity requires the use of minor-embedding, mapping logical qubits onto chains of physical qubits, which necessitates stronger intra-chain coupling to maintain consistency. This elevated coupling strength forces a rescaling of the Hamiltonian due to hardware-imposed limits on the allowable ranges of coupling strengths, reducing the energy gaps between competing states, thus, degrading the solver’s performance. Here, we introduce a theoretical model that quantifies this degradation. We show that as the connectivity degree increases, the effective temperature rises as a polynomial function, resulting in a success probability that decays exponentially. Our analysis further establishes worst-case bounds on the energy scale degradation based on the inverse conductance of chain subgraphs, revealing two most important drivers of chain strength, \textitchain volume and \textitchain connectivity. Our findings indicate that achieving quantum advantage is inherently challenging. Experiments on D-Wave quantum annealers validate these findings, highlighting the need for hardware with improved connectivity and optimized scale-aware embedding algorithms.
[LG-69] Massively Parallel Expectation Maximization For Approximate Posteriors
链接: https://arxiv.org/abs/2503.08264
作者: Thomas Heap,Sam Bowyer,Laurence Aitchison
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Bayesian inference for hierarchical models can be very challenging. MCMC methods have difficulty scaling to large models with many observations and latent variables. While variational inference (VI) and reweighted wake-sleep (RWS) can be more scalable, they are gradient-based methods and so often require many iterations to converge. Our key insight was that modern massively parallel importance weighting methods (Bowyer et al., 2024) give fast and accurate posterior moment estimates, and we can use these moment estimates to rapidly learn an approximate posterior. Specifically, we propose using expectation maximization to fit the approximate posterior, which we call QEM. The expectation step involves computing the posterior moments using high-quality massively parallel estimates from Bowyer et al. (2024). The maximization step involves fitting the approximate posterior using these moments, which can be done straightforwardly for simple approximate posteriors such as Gaussian, Gamma, Beta, Dirichlet, Binomial, Multinomial, Categorical, etc. (or combinations thereof). We show that QEM is faster than state-of-the-art, massively parallel variants of RWS and VI, and is invariant to reparameterizations of the model that dramatically slow down gradient based methods.
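The maximization step described above is straightforward for simple exponential-family approximate posteriors: fit the family member whose moments match the (massively parallel) posterior moment estimates. Below is a hedged one-dimensional Gaussian illustration of this moment-matching idea using self-normalized importance weights; the proposal, target, and sample size are arbitrary choices for demonstration, not the paper's actual estimator.

```python
import numpy as np

def gaussian_moment_match(samples, log_weights):
    """QEM-style maximization step for a Gaussian approximate posterior:
    fit the Gaussian whose first two moments match the weighted
    moment estimates from (importance-weighted) posterior samples."""
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()                        # self-normalized weights
    mean = w @ samples
    var = w @ (samples - mean) ** 2     # weighted second central moment
    return mean, var

# Toy check: proposal N(0, 2^2), target posterior N(1, 1)
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 2.0, size=200_000)
# log importance weight = log N(x; 1, 1) - log N(x; 0, 4), constants dropped
logw = -0.5 * (xs - 1.0) ** 2 + 0.125 * xs ** 2
mu, var = gaussian_moment_match(xs, logw)
```

The fitted mean and variance recover the target's moments, so no gradient iterations are needed once good moment estimates are available, which is the speed advantage claimed over VI and RWS.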
[LG-70] How good is PAC-Bayes at explaining generalisation?
链接: https://arxiv.org/abs/2503.08231
作者: Antoine Picard-Weibel,Eugenio Clerico,Roman Moscoviz,Benjamin Guedj
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We discuss necessary conditions for a PAC-Bayes bound to provide a meaningful generalisation guarantee. Our analysis reveals that the optimal generalisation guarantee depends solely on the distribution of the risk induced by the prior distribution. In particular, a target generalisation level is attainable only if the prior places sufficient mass on high-performing predictors. We relate these requirements to the prevalent practice of using data-dependent priors in deep learning PAC-Bayes applications, and discuss the implications for the claim that PAC-Bayes “explains” generalisation.
[LG-71] To Use or Not to Use a Universal Force Field
链接: https://arxiv.org/abs/2503.08207
作者: Denan Li,Jiyuan Yang,Xiangkai Chen,Lintao Yu,Shi Liu
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures
点击查看摘要
Abstract:Artificial intelligence (AI) is revolutionizing scientific research, particularly in computational materials science, by enabling more accurate and efficient simulations. Machine learning force fields (MLFFs) have emerged as powerful tools for molecular dynamics (MD) simulations, potentially offering quantum-mechanical accuracy with the efficiency of classical MD. This Perspective evaluates the viability of universal MLFFs for simulating complex materials systems from the standpoint of a potential practitioner. Using the temperature-driven ferroelectric-paraelectric phase transition of PbTiO_3 as a benchmark, we assess leading universal force fields, including CHGNet, MACE, M3GNet, and GPTFF, alongside specialized models like UniPero. While universal MLFFs trained on PBE-derived datasets perform well in predicting equilibrium properties, they largely fail to capture realistic finite-temperature phase transitions under constant-pressure MD, often exhibiting unphysical instabilities. These shortcomings stem from inherited biases in exchange-correlation functionals and limited generalization to the anharmonic interactions governing dynamic behavior. However, fine-tuning universal models or employing system-specific MLFFs like UniPero successfully restores predictive accuracy. We advocate for hybrid approaches combining universal pretraining with targeted optimization, improved error quantification frameworks, and community-driven benchmarks to advance MLFFs as robust tools for computational materials discovery.
[LG-72] Functional Unit: A New Perspective on Materials Science Research Paradigms
链接: https://arxiv.org/abs/2503.08104
作者: Caichao Ye,Tao Feng,Weishu Liu,Wenqing Zhang
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:New materials have long marked the level of civilization, serving as an impetus for technological progress and societal transformation. Classic structure-property correlations were key to materials science and engineering. However, the knowledge of materials faces significant challenges in adapting to exclusively data-driven approaches for new material discovery. This perspective introduces the concept of functional units (FUs) to fill the gap in understanding of material structure-property correlations and knowledge inheritance as the “composition-microstructure” paradigm transitions to a data-driven AI paradigm. Firstly, we provide a bird’s-eye view of the research paradigm evolution from the early “process-structure-properties-performance” paradigm to the contemporary data-driven AI trend. Next, we highlight recent advancements in the characterization of functional units across diverse material systems, emphasizing their critical role in multiscale material design. Finally, we discuss the integration of functional units into the new AI-driven paradigm of materials science, addressing both opportunities and challenges in computational materials innovation.
[LG-73] Median Consensus Embedding for Dimensionality Reduction
链接: https://arxiv.org/abs/2503.08103
作者: Yui Tomo,Daisuke Yoneoka
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This study proposes median consensus embedding (MCE) to address variability in low-dimensional embeddings caused by random initialization in dimensionality reduction techniques such as t-distributed stochastic neighbor embedding. MCE is defined as the geometric median of multiple embeddings. By assuming multiple embeddings as independent and identically distributed random samples and applying large deviation theory, we prove that MCE achieves consistency at an exponential rate. Furthermore, we develop a practical algorithm to implement MCE by constructing a distance function between embeddings based on the Frobenius norm of the pairwise distance matrix of data points. Application to real-world data demonstrates that MCE converges rapidly and significantly reduces instability. These results confirm that MCE effectively mitigates random initialization issues in embedding methods.
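MCE's core object, the geometric median of several embeddings, can be computed with the classical Weiszfeld iteration. This is a plausible implementation choice rather than the paper's exact algorithm (which works with a distance between embeddings built from the Frobenius norm of pairwise distance matrices); the sketch below operates directly on flattened embedding vectors and shows the median's robustness to one aberrant run, the failure mode random initialization causes.

```python
import numpy as np

def geometric_median(points, tol=1e-9, max_iter=500):
    """Weiszfeld iteration for the geometric median of a point set
    (here: flattened embeddings), i.e. the minimizer of the sum of
    Euclidean distances to all points."""
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)                       # start from the centroid
    for _ in range(max_iter):
        d = np.linalg.norm(pts - y, axis=1)
        if np.any(d < tol):                    # iterate landed on a data point
            return y
        w = 1.0 / d
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y

# Toy: four consistent embedding runs plus one outlier run
runs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])
mce = geometric_median(runs)
```

Unlike the centroid, which the outlier drags to roughly (2, 2), the geometric median stays inside the cluster of consistent runs, which is the stability property MCE builds on.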
[LG-74] Locally Private Nonparametric Contextual Multi-armed Bandits
链接: https://arxiv.org/abs/2503.08098
作者: Yuheng Ma,Feiyu Jiang,Zifeng Zhao,Hanfang Yang,Yi Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Motivated by privacy concerns in sequential decision-making on sensitive data, we address the challenge of nonparametric contextual multi-armed bandits (MAB) under local differential privacy (LDP). We develop a uniform-confidence-bound-type estimator, showing its minimax optimality supported by a matching minimax lower bound. We further consider the case where auxiliary datasets are available, subject also to (possibly heterogeneous) LDP constraints. Under the widely-used covariate shift framework, we propose a jump-start scheme to effectively utilize the auxiliary data, the minimax optimality of which is further established by a matching lower bound. Comprehensive experiments on both synthetic and real-world datasets validate our theoretical results and underscore the effectiveness of the proposed methods.
[LG-75] Computational bottlenecks for denoising diffusions
链接: https://arxiv.org/abs/2503.08028
作者: Andrea Montanari,Viet Vu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 36 pages
点击查看摘要
Abstract:Denoising diffusions provide a general strategy to sample from a probability distribution \mu in \mathbb R^d by constructing a stochastic process (\hat{\boldsymbol x}_t : t \ge 0) in \mathbb R^d such that the distribution of \hat{\boldsymbol x}_T at large times T approximates \mu . The drift \boldsymbol m : \mathbb R^d \times \mathbb R \to \mathbb R^d of this diffusion process is learned from data (samples from \mu ) by minimizing the so-called score-matching objective. In order for the generating process to be efficient, it must be possible to evaluate (an approximation of) \boldsymbol m(\boldsymbol y, t) in polynomial time. Is every probability distribution \mu , for which sampling is tractable, also amenable to sampling via diffusions? We provide evidence to the contrary by constructing a probability distribution \mu for which sampling is easy, but the drift of the diffusion process is intractable – under a popular conjecture on information-computation gaps in statistical estimation. We further show that any polynomial-time computable drift can be modified in a way that changes minimally the score matching objective and yet results in incorrect sampling.
[LG-76] Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach
链接: https://arxiv.org/abs/2503.07976
作者: Qin Fang,Lei Shi,Min Xu,Ding-Xuan Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper investigates approximation capabilities of two-dimensional (2D) deep convolutional neural networks (CNNs), with Korobov functions serving as a benchmark. We focus on 2D CNNs, comprising multi-channel convolutional layers with zero-padding and ReLU activations, followed by a fully connected layer. We propose a fully constructive approach for building 2D CNNs to approximate Korobov functions and provide rigorous analysis of the complexity of the constructed networks. Our results demonstrate that 2D CNNs achieve near-optimal approximation rates under the continuous weight selection model, significantly alleviating the curse of dimensionality. This work provides a solid theoretical foundation for 2D CNNs and illustrates their potential for broader applications in function approximation.
[LG-77] Benign Overfitting and the Geometry of the Ridge Regression Solution in Binary Classification
链接: https://arxiv.org/abs/2503.07966
作者: Alexander Tsigler,Luiz F. O. Chamon,Spencer Frei,Peter L. Bartlett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 115 pages, 2 figures
点击查看摘要
Abstract:In this work, we investigate the behavior of ridge regression in an overparameterized binary classification task. We assume examples are drawn from (anisotropic) class-conditional cluster distributions with opposing means and we allow for the training labels to have a constant level of label-flipping noise. We characterize the classification error achieved by ridge regression under the assumption that the covariance matrix of the cluster distribution has a high effective rank in the tail. We show that ridge regression has qualitatively different behavior depending on the scale of the cluster mean vector and its interaction with the covariance matrix of the cluster distributions. In regimes where the scale is very large, the conditions that allow for benign overfitting turn out to be the same as those for the regression task. We additionally provide insights into how the introduction of label noise affects the behavior of the minimum norm interpolator (MNI). The optimal classifier in this setting is a linear transformation of the cluster mean vector and in the noiseless setting the MNI approximately learns this transformation. On the other hand, the introduction of label noise can significantly change the geometry of the solution while preserving the same qualitative behavior.
[LG-78] Discriminative versus Generative Approaches to Simulation-based Inference
链接: https://arxiv.org/abs/2503.07962
作者: Benjamin Sluijter,Sascha Diefenbacher,Wahid Bhimji,Benjamin Nachman
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 11 pages, 8 figures
点击查看摘要
Abstract:Most of the fundamental, emergent, and phenomenological parameters of particle and nuclear physics are determined through parametric template fits. Simulations are used to populate histograms which are then matched to data. This approach is inherently lossy, since histograms are binned and low-dimensional. Deep learning has enabled unbinned and high-dimensional parameter estimation through neural likelihood(-ratio) estimation. We compare two approaches for neural simulation-based inference (NSBI): one based on discriminative learning (classification) and one based on generative modeling. These two approaches are directly evaluated on the same datasets, with a similar level of hyperparameter optimization in both cases. In addition to a Gaussian dataset, we study NSBI using a Higgs boson dataset from the FAIR Universe Challenge. We find that both the direct likelihood and likelihood ratio estimation are able to effectively extract parameters with reasonable uncertainties. For the numerical examples and within the set of hyperparameters studied, we found that the likelihood ratio method is more accurate and/or precise. Both methods have a significant spread from the network training and would require ensembling or other mitigation strategies in practice.
[LG-79] Cost-Aware Optimal Pairwise Pure Exploration AISTATS2025
链接: https://arxiv.org/abs/2503.07877
作者: Di Wu,Chengshuai Shi,Ruida Zhou,Cong Shen
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: AISTATS 2025
点击查看摘要
Abstract:Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure exploration, with a focus on identifying the pairwise relationships between targeted arm pairs. Moreover, unlike existing works that only optimize the stopping time (i.e., sample complexity), this work considers that arms are associated with potentially different costs and aims to optimize the cumulative cost incurred during learning. Under the general framework of pairwise pure exploration with arm-specific costs, a performance lower bound is derived. Then, a novel algorithm, termed CAET (Cost-Aware Pairwise Exploration Task), is proposed. CAET builds on the track-and-stop principle with a novel design to handle the arm-specific costs, which can potentially be zero and thus represent a very challenging case. Theoretical analyses prove that the performance of CAET approaches the lower bound asymptotically. Special cases are further discussed, including an extension to regret minimization, which is another major focus of MAB. The effectiveness and efficiency of CAET are also verified through experimental results under various settings.
[LG-80] Pure Exploration with Feedback Graphs
链接: https://arxiv.org/abs/2503.07824
作者: Alessio Russo,Yichen Song,Aldo Pacchiano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We study the sample complexity of pure exploration in an online learning problem with a feedback graph. This graph dictates the feedback available to the learner, covering scenarios between full-information, pure bandit feedback, and settings with no feedback on the chosen action. While variants of this problem have been investigated for regret minimization, no prior work has addressed the pure exploration setting, which is the focus of our study. We derive an instance-specific lower bound on the sample complexity of learning the best action with fixed confidence, even when the feedback graph is unknown and stochastic, and present unidentifiability results for Bernoulli rewards. Additionally, our findings reveal how the sample complexity scales with key graph-dependent quantities. Lastly, we introduce TaS-FG (Track and Stop for Feedback Graphs), an asymptotically optimal algorithm, and demonstrate its efficiency across different graph configurations.
[LG-81] Uncertainty quantification and posterior sampling for network reconstruction
链接: https://arxiv.org/abs/2503.07736
作者: Tiago P. Peixoto
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an); Physics and Society (physics.soc-ph)
*备注: 16 pages, 12 figures. Code available in this https URL
点击查看摘要
Abstract:Network reconstruction is the task of inferring the unseen interactions between elements of a system, based only on their behavior or dynamics. This inverse problem is in general ill-posed, and admits many solutions for the same observation. Nevertheless, the vast majority of statistical methods proposed for this task – formulated as the inference of a graphical generative model – can only produce a “point estimate,” i.e. a single network considered the most likely. In general, this can give only a limited characterization of the reconstruction, since uncertainties and competing answers cannot be conveyed, even if their probabilities are comparable, while being structurally different. In this work we present an efficient MCMC algorithm for sampling from posterior distributions of reconstructed networks, which is able to reveal the full population of answers for a given reconstruction problem, weighted according to their plausibilities. Our algorithm is general, since it does not rely on specific properties of particular generative models, and is specially suited for the inference of large and sparse networks, since in this case an iteration can be performed in time O(N\log^2 N) for a network of N nodes, instead of O(N^2) , as would be the case for a more naive approach. We demonstrate the suitability of our method in providing uncertainties and consensus of solutions (which provably increases the reconstruction accuracy) in a variety of synthetic and empirical cases.
[LG-82] Personalized Convolutional Dictionary Learning of Physiological Time Series
链接: https://arxiv.org/abs/2503.07687
作者: Axel Roques,Samuel Gruffaz,Kyurae Kim,Alain Oliviero-Durmus,Laurent Oudre
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:Human physiological signals tend to exhibit both global and local structures: the former are shared across a population, while the latter reflect inter-individual variability. For instance, kinetic measurements of the gait cycle during locomotion present common characteristics, although idiosyncrasies may be observed due to biomechanical disposition or pathology. To better represent datasets with local-global structure, this work extends Convolutional Dictionary Learning (CDL), a popular method for learning interpretable representations, or dictionaries, of time-series data. In particular, we propose Personalized CDL (PerCDL), in which a local dictionary models local information as a personalized spatiotemporal transformation of a global dictionary. The transformation is learnable and can combine operations such as time warping and rotation. Formal computational and statistical guarantees for PerCDL are provided and its effectiveness on synthetic and real human locomotion data is demonstrated.
[LG-83] Antibiotic Resistance Microbiology Dataset (ARMD): A De-identified Resource for Studying Antimicrobial Resistance Using Electronic Health Records
Link: https://arxiv.org/abs/2503.07664
Authors: Fateme Nateghi Haredasht,Fatemeh Amrollahi,Manoj Maddali,Nicholas Marshall,Stephen P. Ma,Lauren N. Cooper,Richard J. Medford,Sanjat Kanjilal,Niaz Banaei,Stanley Deresinski,Mary K. Goldstein,Steven M. Asch,Amy Chang,Jonathan H. Chen
Subjects: Quantitative Methods (q-bio.QM); Information Retrieval (cs.IR); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Click to view abstract
Abstract:The Antibiotic Resistance Microbiology Dataset (ARMD) is a de-identified resource derived from electronic health records (EHR) that facilitates research into antimicrobial resistance (AMR). ARMD encompasses data from adult patients, focusing on microbiological cultures, antibiotic susceptibilities, and associated clinical and demographic features. Key attributes include organism identification, susceptibility patterns for 55 antibiotics, implied susceptibility rules, and de-identified patient information. This dataset supports studies on antimicrobial stewardship, causal inference, and clinical decision-making. ARMD is designed to be reusable and interoperable, promoting collaboration and innovation in combating AMR. This paper describes the dataset’s acquisition, structure, and utility while detailing its de-identification process.
[LG-84] Physics- and data-driven Active Learning of neural network representations for free energy functions of materials from statistical mechanics
Link: https://arxiv.org/abs/2503.07619
Authors: Jamie Holber,Krishna Garikipati
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Click to view abstract
Abstract:Accurate free energy representations are crucial for understanding phase dynamics in materials. We employ a scale-bridging approach to incorporate atomistic information into our free energy model by training a neural network on DFT-informed Monte Carlo data. To optimize sampling in the high-dimensional Monte Carlo space, we present an Active Learning framework that integrates space-filling sampling, uncertainty-based sampling, and physics-informed sampling. Additionally, our approach includes methods such as hyperparameter tuning, dynamic sampling, and novelty enforcement. These strategies can be combined to reduce MSE, either globally or in targeted regions of interest, while minimizing the number of required data points. The framework introduced here is broadly applicable to Monte Carlo sampling of a range of materials systems.
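A hedged sketch of how two of the sampling criteria (uncertainty-based and space-filling; the physics-informed term is omitted) might be combined into one acquisition score; the weights, bootstrap ensemble, and toy free-energy curve are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Invented 1-D stand-in for a free energy surface (double well).
    return x ** 4 - x ** 2

X = rng.uniform(-1, 1, 6)        # compositions sampled so far
y = f(X)
cand = np.linspace(-1, 1, 201)   # candidate points for the next simulation

# Uncertainty-based term: spread of a bootstrap ensemble of cubic fits.
preds = [np.polyval(np.polyfit(X[idx], y[idx], 3), cand)
         for idx in (rng.integers(0, len(X), len(X)) for _ in range(20))]
uncertainty = np.std(preds, axis=0)

# Space-filling term: distance from each candidate to its nearest sampled point.
spread = np.min(np.abs(cand[:, None] - X[None, :]), axis=1)

# Combined acquisition; the 0.7/0.3 weights are illustrative, not from the paper.
score = 0.7 * uncertainty / uncertainty.max() + 0.3 * spread / spread.max()
next_x = cand[np.argmax(score)]
```

Iterating this loop (sample `next_x`, refit, re-score) is the basic active-learning cycle; the paper's framework swaps in a neural network surrogate and richer acquisition terms.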
Information Retrieval
[IR-0] LongEval at CLEF 2025: Longitudinal Evaluation of IR Model Performance ECIR2025
Link: https://arxiv.org/abs/2503.08541
Authors: Matteo Cancellieri,Alaa El-Ebshihy,Tobias Fink,Petra Galuščáková,Gabriela Gonzalez-Saez,Lorraine Goeuriot,David Iommi,Jüri Keller,Petr Knoth,Philippe Mulhem,Florina Piroi,David Pride,Philipp Schaer
Subjects: Information Retrieval (cs.IR)
Comments: Accepted for ECIR 2025. To be published in Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings
Click to view abstract
Abstract:This paper presents the third edition of the LongEval Lab, part of the CLEF 2025 conference, which continues to explore the challenges of temporal persistence in Information Retrieval (IR). The lab features two tasks designed to provide researchers with test data that reflect the evolving nature of user queries and document relevance over time. By evaluating how model performance degrades as test data diverge temporally from training data, LongEval seeks to advance the understanding of temporal dynamics in IR systems. The 2025 edition aims to engage the IR and NLP communities in addressing the development of adaptive models that can maintain retrieval quality over time in the domains of web search and scientific retrieval.
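The temporal degradation LongEval measures can be expressed as the drop in a ranking metric between test data contemporaneous with training and temporally distant test data; a minimal nDCG sketch with invented relevance judgments:

```python
import math

def dcg(rels):
    # Discounted cumulative gain of a ranked list of graded relevance labels.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Invented graded relevance of one system's top-5 on two collection snapshots:
within_time = [3, 2, 3, 0, 1]   # test data contemporaneous with training
lag_time = [1, 0, 3, 2, 0]      # temporally distant test data, same model

drop = ndcg(within_time) - ndcg(lag_time)
```

A positive `drop` is the degradation signal the lab asks participants to minimize across its web-search and scientific-retrieval tasks.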
[IR-1] KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents
Link: https://arxiv.org/abs/2503.08452
Authors: Hsin-Ling Hsu,Ping-Sheng Lin,Jing-Di Lin,Jengnan Tzeng
Subjects: Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:We propose Knowledge-Aware Preprocessing (KAP), a two-stage preprocessing framework tailored for Traditional Chinese non-narrative documents, designed to enhance retrieval accuracy in Hybrid Retrieval systems. Hybrid Retrieval, which integrates Sparse Retrieval (e.g., BM25) and Dense Retrieval (e.g., vector embeddings), has become a widely adopted approach for improving search effectiveness. However, its performance heavily depends on the quality of input text, which is often degraded when dealing with non-narrative documents such as PDFs containing financial statements, contractual clauses, and tables. KAP addresses these challenges by integrating Multimodal Large Language Models (MLLMs) with LLM-driven post-OCR processing, refining extracted text to reduce OCR noise, restore table structures, and optimize text format. By ensuring better compatibility with Hybrid Retrieval, KAP improves the accuracy of both Sparse and Dense Retrieval methods without modifying the retrieval architecture itself.
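A toy illustration of why the OCR-cleanup step matters for sparse retrieval: the lookup table below stands in for KAP's MLLM-driven post-OCR processing, and the term-overlap scorer is a crude stand-in for BM25 (everything here is invented):

```python
def tokenize(text):
    return text.lower().split()

def sparse_score(query, doc):
    # Crude term-overlap scorer standing in for BM25.
    q, d = set(tokenize(query)), tokenize(doc)
    return sum(d.count(t) for t in q)

# Mock correction table; in KAP this step is performed by a multimodal LLM.
OCR_FIXES = {"f1nancial": "financial", "statem3nt": "statement"}

def clean_ocr(text):
    return " ".join(OCR_FIXES.get(t, t) for t in tokenize(text))

noisy_doc = "annual f1nancial statem3nt of the company"
query = "financial statement"

before = sparse_score(query, noisy_doc)         # OCR noise breaks term matching
after = sparse_score(query, clean_ocr(noisy_doc))
```

With the noisy text the query matches nothing; after cleanup both query terms match, which is the compatibility gain KAP targets without touching the retrieval architecture itself.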
[IR-2] Weighted Tensor Decompositions for Context-aware Collaborative Filtering
Link: https://arxiv.org/abs/2503.08393
Authors: Joey De Pauw,Bart Goethals
Subjects: Information Retrieval (cs.IR)
Comments: Workshop on Context-Aware Recommender Systems, September 18, 2023, Singapore
Click to view abstract
Abstract:Over recent years it has become well accepted that user interest is not static or immutable. There are a variety of contextual factors, such as time of day, the weather or the user’s mood, that influence the current interests of the user. Modelling approaches need to take these factors into account if they want to succeed at finding the most relevant content to recommend given the situation. A popular method for context-aware recommendation is to encode context attributes as extra dimensions of the classic user-item interaction matrix, effectively turning it into a tensor, followed by applying the appropriate tensor decomposition methods to learn missing values. However, unlike with matrix factorization, where all decompositions are essentially a product of matrices, there exist many more options for decomposing tensors by combining vector, matrix and tensor products. We study the most successful decomposition methods that use weighted square loss and categorize them based on their tensor structure and regularization strategy. Additionally, we further extend the pool of methods by filling in the missing combinations. In this paper we provide an overview of the properties of the different decomposition methods, such as their complexity, scalability, and modelling capacity. These benefits are then contrasted with the performances achieved in offline experiments to gain more insight into which method to choose depending on a specific situation and constraints.
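The weighted-square-loss setting can be sketched as SGD on a CP (vector-product) decomposition of the user-item-context tensor; the data, hyperparameters, and uniform weight below are invented, and real systems would use one of the richer decompositions surveyed in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, n_ctx, k = 8, 10, 3, 4

# Synthetic observed entries of the user x item x context tensor: (u, i, c, rating).
obs = [(int(rng.integers(n_users)), int(rng.integers(n_items)),
        int(rng.integers(n_ctx)), float(rng.uniform(1, 5))) for _ in range(60)]

U = rng.normal(0, 0.1, (n_users, k))
V = rng.normal(0, 0.1, (n_items, k))
C = rng.normal(0, 0.1, (n_ctx, k))

def predict(u, i, c):
    # CP model: one latent vector per user, item and context, combined elementwise.
    return float(np.sum(U[u] * V[i] * C[c]))

def rmse():
    return float(np.sqrt(np.mean([(predict(u, i, c) - r) ** 2 for u, i, c, r in obs])))

init_rmse = rmse()
lr, reg, w = 0.02, 0.01, 1.0   # w: per-entry weight of the squared loss (uniform here)
for epoch in range(500):
    for u, i, c, r in obs:
        e = w * (predict(u, i, c) - r)
        gu, gv, gc = e * V[i] * C[c], e * U[u] * C[c], e * U[u] * V[i]
        U[u] -= lr * (gu + reg * U[u])
        V[i] -= lr * (gv + reg * V[i])
        C[c] -= lr * (gc + reg * C[c])
train_rmse = rmse()
```

Swapping the CP product for Tucker or pairwise-interaction structures, or making `w` depend on the entry, produces the design space the paper categorizes.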
[IR-3] SoTCKGE:Continual Knowledge Graph Embedding Based on Spatial Offset Transformation
Link: https://arxiv.org/abs/2503.08189
Authors: Xinyan Wang,Jinshuo Liu,Cheng Bi,Kaijian Xie,Meng Wang,Juan Deng,Jeff Pan
Subjects: Information Retrieval (cs.IR)
Comments: 9 pages, 5 figures
Click to view abstract
Abstract:Current Continual Knowledge Graph Embedding (CKGE) methods primarily rely on translation-based embedding methods, leveraging previously acquired knowledge to initialize new facts. To enhance learning efficiency, these methods often integrate fine-tuning or continual learning strategies. However, this compromises the model’s prediction accuracy and the translation-based methods lack support for complex relational structures (multi-hop relations). To tackle this challenge, we propose a novel CKGE framework SoTCKGE grounded in Spatial Offset Transformation. Within this framework, entity positions are defined as being jointly determined by base position vectors and offset vectors. This not only enhances the model’s ability to represent complex relational structures but also allows for the embedding update of both new and old knowledge through simple spatial offset transformations, without the need for continuous learning methods. Furthermore, we introduce a hierarchical update strategy and a balanced embedding method to refine the parameter update process, effectively minimizing training costs and augmenting model accuracy. To comprehensively assess the performance of our model, we have conducted extensive experiments on four publicly accessible datasets and a new dataset constructed by us. Experimental results demonstrate the advantage of our model in enhancing multi-hop relationship learning and further improving prediction accuracy.
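A minimal sketch of the base-plus-offset representation with a TransE-style score; the single-fact offset update below is an illustrative stand-in for the paper's hierarchical update strategy, and all dimensions and step sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
n_ent, n_rel, d = 5, 2, 8

base = rng.normal(0, 0.1, (n_ent, d))   # base position vectors (frozen here)
offset = np.zeros((n_ent, d))           # offset vectors, updated as facts arrive
rel = rng.normal(0, 0.1, (n_rel, d))    # relation translation vectors

def position(e):
    # Entity position jointly determined by base and offset vectors.
    return base[e] + offset[e]

def score(h, r, t):
    # TransE-style plausibility: closer to zero means more plausible.
    return -float(np.linalg.norm(position(h) + rel[r] - position(t)))

def integrate_fact(h, r, t, step=0.5):
    # Absorb a new fact purely through spatial offset updates, no retraining:
    # move the tail toward the translated head, and the head partway back.
    residual = position(h) + rel[r] - position(t)
    offset[t] += step * residual
    offset[h] -= step * residual / 2

s0 = score(0, 1, 2)
for _ in range(20):
    integrate_fact(0, 1, 2)
s1 = score(0, 1, 2)   # the integrated fact now scores as more plausible
```

Because only the offsets move, previously learned base positions remain untouched, which is the mechanism that lets new and old knowledge coexist without continual-learning machinery.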
[IR-4] MultiConIR: Towards multi-condition Information Retrieval
Link: https://arxiv.org/abs/2503.08046
Authors: Xuan Lu,Sifan Liu,Bochao Yin,Yongqi Li,Xinghao Chen,Hui Su,Yaohui Jin,Wenjun Zeng,Xiaoyu Shen
Subjects: Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:In this paper, we introduce MultiConIR, the first benchmark designed to evaluate retrieval models in multi-condition scenarios. Unlike existing datasets that primarily focus on single-condition queries from search engines, MultiConIR captures real-world complexity by incorporating five diverse domains: books, movies, people, medical cases, and legal documents. We propose three tasks to systematically assess retrieval and reranking models on multi-condition robustness, monotonic relevance ranking, and query format sensitivity. Our findings reveal that existing retrieval and reranking models struggle with multi-condition retrieval, with rerankers suffering severe performance degradation as query complexity increases. We further investigate the performance gap between retrieval and reranking models, exploring potential reasons for these discrepancies, and analyze the impact of different pooling strategies on condition placement sensitivity. Finally, we highlight the strengths of GritLM and Nv-Embed, which demonstrate enhanced adaptability to multi-condition queries, offering insights for future retrieval models. The code and datasets are available at this https URL.
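The monotonic-relevance property the benchmark tests can be illustrated with a toy condition-overlap scorer (not a real retriever): a document satisfying a superset of the query's conditions should never rank lower, however many conditions the query accumulates. All query conditions and documents below are invented:

```python
def condition_score(doc_attrs, conditions):
    # Counts how many of the query's conditions the document satisfies.
    return sum(1 for c in conditions if c in doc_attrs)

conditions = ["sci-fi", "1980s", "director:Scott"]   # invented multi-condition query
doc_a = {"sci-fi", "1980s"}                          # satisfies two conditions
doc_b = {"sci-fi", "1980s", "director:Scott"}        # satisfies all three

# Monotonicity check: as conditions accumulate, doc_b must never fall behind doc_a.
scores = [condition_score(doc_b, conditions[:k]) >= condition_score(doc_a, conditions[:k])
          for k in range(1, len(conditions) + 1)]
```

The paper's finding is that learned rerankers frequently violate this ordering once queries carry several conditions, which a check like this makes measurable.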
[IR-5] Uncovering Cross-Domain Recommendation Ability of Large Language Models
Link: https://arxiv.org/abs/2503.07761
Authors: Xinyi Liu,Ruijie Wang,Dachun Sun,Dilek Hakkani-Tur,Tarek Abdelzaher
Subjects: Information Retrieval (cs.IR)
Comments:
Click to view abstract
Abstract:Cross-Domain Recommendation (CDR) seeks to enhance item retrieval in low-resource domains by transferring knowledge from high-resource domains. While recent advancements in Large Language Models (LLMs) have demonstrated their potential in Recommender Systems (RS), their ability to effectively transfer domain knowledge for improved recommendations remains underexplored. To bridge this gap, we propose LLM4CDR, a novel CDR pipeline that constructs context-aware prompts by leveraging users’ purchase history sequences from a source domain along with shared features between source and target domains. Through extensive experiments, we show that LLM4CDR achieves strong performance, particularly when using LLMs with large parameter sizes and when the source and target domains exhibit smaller domain gaps. For instance, incorporating CD and Vinyl purchase history for recommendations in Movies and TV yields a 64.28 percent MAP@1 improvement. We further investigate key factors including source domain data, domain gap, prompt design, and LLM size, which impact LLM4CDR’s effectiveness in CDR tasks. Our results highlight that LLM4CDR excels when leveraging a single, closely related source domain and benefits significantly from larger LLMs. These insights pave the way for future research on LLM-driven cross-domain recommendations.
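A sketch of the kind of context-aware prompt LLM4CDR constructs; the template wording, function name, and fields are hypothetical, not the paper's exact format:

```python
def build_cdr_prompt(source_domain, target_domain, history, shared_features, k=5):
    # Hypothetical template: recent source-domain purchases plus cross-domain features.
    items = "; ".join(history[-k:])
    feats = ", ".join(shared_features)
    return (f"The user recently purchased in {source_domain}: {items}.\n"
            f"Features shared between the two domains: {feats}.\n"
            f"Recommend 5 items from {target_domain} matching these tastes.")

prompt = build_cdr_prompt(
    "CDs and Vinyl", "Movies and TV",
    ["Abbey Road", "Kind of Blue", "The Dark Side of the Moon"],
    ["genre", "era", "mood"],
)
```

Passing such a prompt to an LLM and parsing its ranked list is the whole pipeline; the paper's ablations vary exactly these ingredients (history length, shared features, domain pairing, model size).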
Attachment Download
Click to download today's full paper list